[xml] Parse webpage HTML
- From: Julius Hamilton <juliushamilton100 gmail com>
- To: xml gnome org
- Subject: [xml] Parse webpage HTML
- Date: Fri, 17 Sep 2021 12:15:01 +0200
Hey,
I would like to write a script that extracts the article text from webpage HTML. The pages have a similar structure because they are all documentation pages from the same website, the Microsoft Visual Basic for Applications homepage.
I believe I should first inspect the HTML tree, i.e. the raw HTML returned by wget, to figure out which nodes tend to have the text content I am seeking. Should I do that in Firefox or Chrome, or is there a good standalone tool for that?
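For illustration, the inspection step could also be done from a script rather than a browser. The following is only a minimal sketch, assuming Python's standard-library `html.parser` (not the library this list is about); the names `OutlineParser` and `outline` and the sample markup are hypothetical. It lists each start tag, indented by nesting depth, with any `id` and `class` values, which is often enough to spot the container that holds the article text.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class OutlineParser(HTMLParser):
    """Collects one indented line per start tag, annotated with id/class."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        label = tag
        if attrs.get("id"):
            label += "#" + attrs["id"]
        if attrs.get("class"):
            label += "." + ".".join(attrs["class"].split())
        self.lines.append("  " * self.depth + label)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def outline(html):
    """Return the tag outline of an HTML string as a list of lines."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.lines


# Hypothetical sample; in practice this would be the file saved by wget.
sample = ('<html><body><div id="main" class="content doc">'
          '<p>Hi<br>there</p></div></body></html>')
print("\n".join(outline(sample)))
```

The depth counter is naive: pages that omit optional closing tags (for example `</p>` or `</li>`) will make the indentation drift, so the output is a rough guide rather than an exact tree.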
Then, could I use this XML parsing library, or some other standard one, to retrieve the text content at the nodes I have identified?
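To make the second step concrete, here is a minimal sketch of pulling the text out of one identified node. It assumes Python's standard-library `html.parser` as a stand-in, not this library's API, and it assumes the article sits in an element that can be found by tag name and `id`. The names `TextExtractor` and `extract_text`, and the `main-content` id, are hypothetical.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they are ignored when tracking depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose contents are code or styling, not article text.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects text found inside the first element matching tag and id."""

    def __init__(self, target_tag, target_id):
        super().__init__()
        self.target_tag = target_tag
        self.target_id = target_id
        self.depth = 0        # nesting depth inside the target; 0 = outside
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
            if tag in SKIP_TAGS:
                self.skip_depth += 1
        elif tag == self.target_tag and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and self.skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_text(html, target_tag, target_id):
    """Return the text inside the matching element, one chunk per line."""
    parser = TextExtractor(target_tag, target_id)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical page; the real tag and id come from inspecting the site.
page = ('<html><body><nav>Menu</nav><div id="main-content">'
        '<h1>Title</h1><p>First &amp; second.</p>'
        '<script>var x = 1;</script><br><p>Last</p>'
        '</div><footer>Foot</footer></body></html>')
print(extract_text(page, "div", "main-content"))
```

Like any depth-counting approach, this relies on the markup closing its tags; a tree-building parser with XPath support would be more robust for pages that do not.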
Thanks very much,
Julius