[xml] Parse webpage HTML



Hey,

I would like to write a script that extracts the article text content from webpage HTML. The webpages all have a similar structure because they are documentation pages from the same website, the Microsoft Visual Basic for Applications homepage.

I believe I should first inspect the HTML tree of the raw pages returned by wget, to figure out which nodes tend to hold the text content I am after. Should I do that in Firefox or Chrome, or is there a good standalone tool for this?
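
To make the question concrete, here is roughly what I have in mind for the inspection step. It is only a sketch, assuming Python with lxml (which wraps libxml2); the file name and the length threshold are just placeholders:

    # Parse a page saved with wget and print the path of every element
    # that carries a non-trivial amount of text, to help me spot the
    # nodes holding the article body.
    from lxml import html

    tree = html.parse("vba-doc-page.html")        # hypothetical file name
    for element in tree.iter():
        text = (element.text or "").strip()
        if len(text) > 40:                        # arbitrary threshold
            print(tree.getpath(element), "->", text[:60])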

Then, could I use this XML parsing library, or is there some other standard one, for retrieving the text content at the nodes I have identified?
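
For the extraction step itself, I imagine something along these lines, again assuming the lxml bindings; the XPath expression is a pure guess until I have done the inspection above:

    # Pull the text out of the nodes identified earlier; the XPath is a
    # placeholder for whatever the inspection step turns up.
    from lxml import html

    tree = html.parse("vba-doc-page.html")
    paragraphs = tree.xpath("//main//p//text()")  # hypothetical selector
    article_text = "\n".join(t.strip() for t in paragraphs if t.strip())
    print(article_text)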

Thanks very much,
Julius
