Re: [xml] Approach for parsing HTML file or URL

From: Michael Ludwig <mlu as-guides com>
To: "Xml Gnome Org" <xml gnome org>
Subject: Re: [xml] Approach for parsing HTML file or URL
Date: Wed, 05 Aug 2009 09:23:39 +0200

Brian Kim wrote:


For example, <a href="aaa0", alt="aaa1"><em>test1</em> <em>test2</em>
I am a boy</a>


The comma after "aaa0" in the attribute list is a syntax error:
attributes are separated by whitespace only.

Then, I want to analyze those nodes as follows. The tag of node 1 is
"a". Its attributes are href and alt, which have the values "aaa0" and
"aaa1" respectively. It also has an anchor text, "I am a boy". The other
two tags are "em", which have "test1" and "test2" as their text.

This level of detail is enough for me. Can anybody help me?

In fact, I have written some sample code with an XPath example.


XPath and XSLT are very good high-level tools to achieve the analysis
you want. You could also do this using DOM, but this would be more
cumbersome.
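
As a minimal sketch of the kind of analysis asked for, the following
uses Python's standard xml.etree.ElementTree module as a stand-in (it is
not LibXML2, and it supports only a subset of XPath). It assumes the
sample has been corrected by removing the comma, so that an XML parser
accepts it. The names SAMPLE and describe are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Corrected sample: the comma between the attributes has been removed.
SAMPLE = '<a href="aaa0" alt="aaa1"><em>test1</em> <em>test2</em>\nI am a boy</a>'


def describe(element):
    """Return (tag, attributes, own_text) for one element."""
    # Own text is the text before the first child plus the tail that
    # follows each child; text inside the children is not included.
    parts = [element.text or ""] + [child.tail or "" for child in element]
    return element.tag, dict(element.attrib), "".join(parts).strip()


root = ET.fromstring(SAMPLE)
# ".//em" is an XPath-style expression selecting every descendant <em>.
summary = [describe(root)] + [describe(em) for em in root.findall(".//em")]
for tag, attrs, text in summary:
    print(tag, attrs, text)
```

Note that "I am a boy" is not stored on the "a" element directly but as
the tail of the second "em" child, which is why describe joins the tails.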

For the simple HTML input, my code produced an almost correct parsing
result, but when I tried to parse HTML from a URL, which is, of course,
more complex than the simple sample, I got strange data.


As pointed out, your simple sample input has a syntax error. Random HTML
from the web may well have syntax errors, too.

Can I say that if the HTML is not well-formed, the association between
a tag and its anchor text is sometimes not handled properly?


Well-formedness applies to XML, not to HTML. Note that from this vantage
point, XHTML is XML, not HTML.

HTML may be malformed, too, as in your simple sample above.
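
To illustrate the distinction, here is a small sketch that asks an XML
parser whether a fragment is well-formed. It uses Python's standard
xml.etree.ElementTree module rather than LibXML2, and the helper name
is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # An XML parser must stop at the first well-formedness error.
        return False
    return True


# The original sample (with the comma) and the corrected one.
BROKEN = '<a href="aaa0", alt="aaa1"><em>test1</em></a>'
FIXED = '<a href="aaa0" alt="aaa1"><em>test1</em></a>'

print(is_well_formed(BROKEN), is_well_formed(FIXED))
```

An XML parser rejects the first fragment outright, whereas an HTML
parser will typically try to recover and carry on.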

In other words, is it possible that the parse tree is not perfectly
correct if the HTML is not well-formed?


Definitely yes.
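
A rough sketch of why: a lenient HTML tokenizer reports malformed markup
without complaint, so any tree built from it rests on the parser's guess
about what was meant. The example uses Python's standard html.parser
module as a stand-in, not LibXML2, and the class name EventLog is made
up for illustration.

```python
from html.parser import HTMLParser


class EventLog(HTMLParser):
    """Record tokenizer events in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


log = EventLog()
# The end tags are in the wrong order: </b> closes before </i>.
log.feed("<b><i>x</b></i>")
log.close()
for event in log.events:
    print(event)
```

No error is raised for the misnested end tags. A tree builder has to
decide which element "x" belongs to and where "i" really ends, and
different tools may decide differently.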

In fact, I want to double-check whether my approach is right by seeing
a general way of walking the nodes of a parsed HTML tree, if somebody
can suggest one.


The HTML parser provided by libxml2 is good. Other useful tools include
TagSoup [1] and Tidy [2].

Michael Ludwig

[1] http://home.ccil.org/~cowan/XML/tagsoup/
[2] http://tidy.sourceforge.net/
