Re: [xml] strange end-tag position (parsing html)



On Wed, Oct 06, 2010 at 02:19:32PM -0700, David Gatwood wrote:
On Oct 6, 2010, at 10:08 AM, rcsaba gmail com wrote:

On Wed, Oct 6, 2010 at 12:18 AM, Steven Falken  wrote:
Hi,
I'm trying to parse bare.txt (attached, yes it is simply cnn.com). For
this purpose I'm using parse.c (also attached).
The output is output.txt (also attached).
If you look at bare.txt, you will see a <script> block from line 826 to
line 886. In output.txt, the <script> start tag is on line 759, but the
end tag (</script>) is on line 784. The problem is that this end tag
lands in the middle of the JavaScript code, which is bad :(

This is because CNN's HTML sucks :). They can't seem to make up their
mind between HTML and XHTML.

Take a look at line 792 of output.txt: the for statement is mangled.
It looks like the '<' operator was interpreted by libxml as the start of
a tag. The </script> sits exactly where a </a> is in bare.txt.
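
One reading that fits the </a> observation, assuming the parser applies
the HTML 4.01 rule that script content ends at the first "</" followed
by a letter (whatever tag name comes after it): a plain '<' comparison
does not end the script, but a "</a>" inside a JavaScript string does.
A minimal sketch of that rule follows; script_content_end is a
hypothetical helper written for illustration, not part of libxml2.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Sketch of the strict HTML 4.01 rule for script content: the content
 * ends at the first "</" that is followed by an ASCII letter, no matter
 * which tag name follows. Returns the offset of that "</" in `content`,
 * or strlen(content) when no such sequence exists.
 * Hypothetical helper for illustration only, not a libxml2 function.
 */
size_t script_content_end(const char *content)
{
    const char *p = content;

    assert(content != NULL);
    while ((p = strstr(p, "</")) != NULL) {
        if (isalpha((unsigned char)p[2]))
            return (size_t)(p - content);
        p += 2;  /* "</" not followed by a letter: keep scanning */
    }
    return strlen(content);
}
```

Called on a script body such as
for (i = 0; i < n; i++) { s += '<a href="#">x</a>'; }
the function skips the "i < n" comparison and returns the offset of the
"</a>" inside the string literal.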

Perhaps libxml2 betrayed its true nature (an XML parser) and parsed
bare.txt as XML (XHTML). In that case the content of <script> is also
parsed as XML and must be valid XML (which it isn't).
See http://javascript.about.com/library/blxhtml.htm

Alternatively, this is yet another reason why inline JavaScript should
be avoided if at all possible.  Use the src, Luke.

  The HTML specification defines where the content of a <script> element
ends, and the libxml2 HTML parser follows that rule.
However, many HTML generators fail to produce output that respects it.
Try to use the HTML_PARSE_RECOVER option to parse such documents.

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/


