Re: [xml] Approach for parsing HTML file or URL

Hi. Thanks.

For example, <a href="aaa0", alt="aaa1"><em>test1</em> <em>test2</em>
I am a boy</a>

Here we have three nodes,
1. <a href="aaa0", alt="aaa1">I am a boy</a>
2. <em>test1</em>
3. <em>test2</em>

Then I want to analyze those nodes as follows.
The tag of node 1 is "a". Its attributes are href and alt, which have
the values "aaa0" and "aaa1" respectively.
It also has an anchor text, "I am a boy".
The other two tags are "em", which have "test1" and "test2" as their text.
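
A minimal sketch of that kind of node-by-node analysis. It is written in
Python with the standard-library html.parser module as a stand-in; it is
not libxml2 and does not use htmlReadFile. The Node class, TreeBuilder class,
walk() and describe() are hypothetical helpers made up for this illustration,
and the input is the example anchor with its attributes separated by
whitespace only.

```python
from html.parser import HTMLParser


class Node:
    """One element of a small hand-built tree (illustrative, not a libxml2 type)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child elements only
        self.text = []      # text chunks found directly inside this element

    def own_text(self):
        # Join the direct text chunks and collapse runs of whitespace.
        return " ".join("".join(self.text).split())


class TreeBuilder(HTMLParser):
    """Build a Node tree from parser events.

    Assumption: void elements such as <br> are not special-cased here.
    """

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text.append(data)


def walk(node, out):
    """Depth-first walk collecting (tag, attributes, own text) per element."""
    for child in node.children:
        out.append((child.tag, child.attrs, child.own_text()))
        walk(child, out)
    return out


def describe(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return walk(builder.root, [])


if __name__ == "__main__":
    sample = '<a href="aaa0" alt="aaa1"><em>test1</em> <em>test2</em>\nI am a boy</a>'
    for tag, attrs, text in describe(sample):
        print(tag, attrs, repr(text))
```

For the sample input this reports "a" with href and alt and the text
"I am a boy", followed by the two "em" elements with "test1" and "test2".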

This level of detail is enough for me.
Can anybody help me?

In fact, I have written some sample code based on an XPath example. For
simple HTML input,
my code produced an almost correct parse, but when I tried to
parse HTML from a URL, which is of course
more complex than the simple HTML, I got strange data.
In the above example, "I am a boy" is obviously the anchor text of the
"a" tag. With the simple HTML,
I get it that way. However, when the same markup is part of complex HTML,
"I am a boy" is interpreted as the text of "em".
Is it fair to say that if HTML is not well-formed, the association between
a tag and its text is sometimes not handled properly?
In other words, is it possible that the parse tree is not
perfectly correct if the HTML is not well-formed?
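
A small sketch of how that can happen, again in Python with the
standard-library html.parser module rather than libxml2. The TextOwner class
and text_owners() are hypothetical helpers, and the recovery rule used here
(an end tag closes everything up to the nearest matching open tag) is an
assumption of this sketch. libxml2's HTML parser applies its own recovery
rules, which this does not reproduce. The only point is that once an end tag
is missing, which element a piece of text belongs to depends on how the
parser recovers.

```python
from html.parser import HTMLParser


class TextOwner(HTMLParser):
    """Record which open element each text chunk lands in (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, innermost last
        self.owners = []  # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Naive recovery: close everything up to the nearest matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self.stack:
            # The text is attributed to the innermost element still open.
            self.owners.append((self.stack[-1], text))


def text_owners(html):
    parser = TextOwner()
    parser.feed(html)
    parser.close()
    return parser.owners


if __name__ == "__main__":
    well_formed = '<a href="aaa0"><em>test1</em> <em>test2</em> I am a boy</a>'
    broken = '<a href="aaa0"><em>test1</em> <em>test2 I am a boy</a>'  # second </em> missing
    print(text_owners(well_formed))
    print(text_owners(broken))
```

With the well-formed input, "I am a boy" is attributed to "a". With the second
closing em tag missing, the same text ends up attributed to "em", which matches
the symptom described above.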

In fact, I want to double-check whether my approach is right by seeing some
general way of looking at the nodes of a parsed HTML tree that somebody may


Date: Tue, 04 Aug 2009 08:51:42 +0200
From: Michael Ludwig <mlu as-guides com>
Subject: Re: [xml] Approach for parsing HTML file or URL
To: xml gnome org
Message-ID: <4A77DA7E 8030406 as-guides com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Brian Kim wrote:

I would like to parse HTML and see the content of the HTML attributes in
each tag.

Using the htmlReadFile function is the obvious choice, but I guess there is
another way to see each node of the parsed tree instead of using XPath.

Could you define what you mean by "seeing each node of the parsed tree"?

Michael Ludwig
