[xml] Parse webpage HTML



Hey,

I would like to write a script that extracts the article text content from webpage HTML. The webpages all have a similar structure because they are documentation pages from the same website, the Microsoft Visual Basic for Applications homepage.

I believe I should first inspect the HTML tree of the raw pages returned by wget, to figure out which nodes tend to hold the text content I am after. Should I do that in Firefox or Chrome, or is there a good standalone tool for this?
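
To make the question concrete, here is roughly what I have in mind for the inspection step. It is only a sketch, assuming Python with lxml (which wraps libxml2); the file name and the length threshold are just placeholders:

    # Parse a page saved with wget and print the path of every element
    # that carries a non-trivial amount of text, to help me spot the
    # nodes holding the article body.
    from lxml import html

    tree = html.parse("vba-doc-page.html")        # hypothetical file name
    for element in tree.iter():
        text = (element.text or "").strip()
        if len(text) > 40:                        # arbitrary threshold
            print(tree.getpath(element), "->", text[:60])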

Then, could I use this XML parsing library, or is there some other standard one, for retrieving the text content at the nodes I have identified?
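
For the extraction step itself, I imagine something along these lines, again assuming the lxml bindings; the XPath expression is a pure guess until I have done the inspection above:

    # Pull the text out of the nodes identified earlier; the XPath is a
    # placeholder for whatever the inspection step turns up.
    from lxml import html

    tree = html.parse("vba-doc-page.html")
    paragraphs = tree.xpath("//main//p//text()")  # hypothetical selector
    article_text = "\n".join(t.strip() for t in paragraphs if t.strip())
    print(article_text)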

Thanks very much,
Julius
