Re: [xml] Parse webpage HTML

From: Liam R E Quin <liam holoweb net>
To: Julius Hamilton <juliushamilton100 gmail com>, xml gnome org
Subject: Re: [xml] Parse webpage HTML
Date: Fri, 17 Sep 2021 10:00:44 -0400

On Fri, 2021-09-17 at 12:15 +0200, Julius Hamilton via xml wrote:

Hey,

I would like to write a script which extracts article text content from
webpage HTML.


You might want to look at xidel for that.
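
If you would rather do it in a script than with a separate tool, here is a
minimal sketch using only Python's standard html.parser module. The names
ParagraphText and extract_paragraphs are made up for illustration, and
treating the contents of <p> elements as "article text" is an assumption
that will not hold for every site.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_p = False       # True while we are inside a <p>
        self.paragraphs = []    # finished paragraphs, in document order
        self._buf = []          # text pieces of the current paragraph

    def flush_paragraph(self):
        # Join the collected pieces and collapse runs of whitespace.
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # </p> is optional in HTML, so a new <p> also ends the old one.
            self.flush_paragraph()
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush_paragraph()
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self._buf.append(data)


def extract_paragraphs(html):
    """Return the text of each <p> in the given HTML string."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    parser.flush_paragraph()  # in case the last <p> was never closed
    return parser.paragraphs
```

Feed it the text of the downloaded page, e.g. extract_paragraphs(page_text).
Text in inline elements such as <em> or <a> inside a paragraph is kept;
anything outside a <p> is ignored.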


I believe I should first inspect the HTML tree, i.e. the raw HTML returned
by wget, to figure out which nodes tend to have the text content I am
seeking. Should I do that in Firefox or Chrome, or is there a good
standalone tool for that?


The browsers will *modify* the HTML. For example, they will insert
tbody elements into tables, and they will change element nesting in
some cases.


But that happens in relatively few cases, so the element inspector in the
browser isn't a bad start.
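
To illustrate the tbody point, here is a small sketch of how a path copied
from the browser's inspector can fail against the markup the server
actually sent. It uses Python's xml.etree.ElementTree only because this
sample snippet happens to be well-formed; that is an assumption of the
example, and real pages need an HTML parser instead.

```python
import xml.etree.ElementTree as ET

# Markup as a server might send it: note there is no <tbody> element.
RAW = "<table><tr><td>one</td></tr><tr><td>two</td></tr></table>"

table = ET.fromstring(RAW)

# Path as the browser inspector would show it, through the inserted <tbody>.
via_tbody = table.findall("./tbody/tr")

# Path that matches the markup as it was actually sent.
direct = table.findall("./tr")

print(len(via_tbody), len(direct))  # prints "0 2"
```

So if a path that works in the inspector finds nothing in your script, check
whether it goes through an element the browser added.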

Liam


-- 
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/
XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Barefoot Web-slave, antique illustrations:  http://www.fromoldbooks.org

