[xml] HTML Parser problems with chunk parser if HTML keywords overlap chunk border

From: "Cyrill Osterwalder" <Cyrill Osterwalder visonys com>
To: <xml gnome org>
Cc: Cyrill Osterwalder <Cyrill Osterwalder visonys com>
Subject: [xml] HTML Parser problems with chunk parser if HTML keywords overlap chunk border
Date: Tue, 20 Jun 2006 18:20:03 +0200

Hi Daniel / all

I encountered some problems with the HTML chunk parser (htmlParseChunk())
when certain HTML keywords straddle the end of a chunk. In some cases the
parser does not recognize them and loses its context. To describe the
problem more clearly, I created a simple test that can be reproduced with
"testHTML.c" from libxml2. These findings are based on libxml2-2.6.24,
and I did not find the issue documented anywhere.

Description:
------------
If the function htmlParseChunk() is called with a chunk of bytes where a
closing </script> or </style> tag is overlapping the end of the chunk,
the HTML parser will fail to recognize the closing tag and it will
interpret the second part of the closing tag as CDATA. This gives
unpredictable results with SAX callbacks for the rest of the HTML
content.

Example:
--------
Call the function htmlParseChunk() twice in succession, once with each
of the following buffers (buffer bytes between the quotes):

Buffer1:
"<html><body><script></"

Buffer2:
"script> <a href='test'>LINK</a></body></html>"

The two buffers concatenated form valid HTML with an empty script block,
and there is no extra character between them. An application's SAX
callbacks will be invoked like this:
- startElement("html")
- startElement("body")
- startElement("script")
- cdata("</")                 <==== ouch! we expect endElement("script")
- cdata("script> <a href='test'>...")
- ...

The HTML parser needs to see another closing </script> tag before it
gets back into the game.

Test with testHTML.c:
---------------------
The easiest way for anybody to test this behaviour is the following:
reduce the chunk size variable "size" in testHTML.c from 4096 to 10 on
lines 641 and 671. This makes testHTML use small chunks, so a small test
file is enough.

Save the following HTML content as a test file named chunktest.html
(without the dashed lines):

------------------------------
    <html><body>.......
..........<script></script>
<a href="test">LINK</a>
<script></script>
</body>
</html>
------------------------------

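Note that testHTML first consumes 4 bytes and then 10 at a time (after
the change above), so the chunk borders fall at byte offsets 4, 14, 24,
34, 44 and so on. The first line contains 23 characters plus the newline,
so the first closing </script> tag starts at offset 42 on the second line
and is cut by the border at offset 44: "</" ends one chunk and "script>"
begins the next.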
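
The offset arithmetic can be checked with a small sketch. It is plain C
with no libxml2 dependency. The function name border_split and its
parameters are made up for illustration, and it assumes the chunking
described here: one first chunk of 4 bytes, then fixed chunks of 10
bytes.

```c
#include <stddef.h>

/*
 * Chunk borders lie at byte offsets first, first + step, first + 2*step, ...
 * (step must be greater than 0).
 * For a token occupying bytes [start, start + len), return how many of its
 * bytes land before the first border that cuts through it, or 0 if no
 * border falls strictly inside the token.
 */
size_t border_split(size_t start, size_t len, size_t first, size_t step)
{
    size_t border;

    /* Find the smallest border that is strictly greater than start. */
    if (start < first)
        border = first;
    else
        border = first + ((start - first) / step + 1) * step;

    /* The token is cut only if that border lies before its end. */
    return (border < start + len) ? border - start : 0;
}
```

With chunktest.html as shown (including the four leading spaces, which
the 23-character count implies), the first </script> starts at offset 42
and border_split(42, 9, 4, 10) returns 2. The second </script> starts at
offset 84, exactly on a border, so border_split(84, 9, 4, 10) returns 0
and that tag is not cut.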

Running the following command on this test file produces output that
shows the closing </script> tag being interpreted as CDATA content:

# ./testHTML --push --sax --debug chunktest.html

SAX.setDocumentLocator()
SAX.startDocument()
SAX.startElement(html)
SAX.startElement(body)
SAX.characters(.......
.........., 18)
SAX.startElement(script)
SAX.error: Invalid char in CDATA 0x0
SAX.cdata(&lt;/, 2)
SAX.error: htmlParseEndTag: '</' not found
SAX.cdata(cript&gt;
&lt;a href="test", 26)
SAX.error: Unexpected end tag : a
SAX.cdata(
&lt;script&gt;, 9)
SAX.endElement(script)
SAX.characters(
, 1)
SAX.endElement(body)
SAX.ignorableWhitespace(
, 1)
SAX.endElement(html)
SAX.ignorableWhitespace(
, 1)
SAX.endDocument()



I assume that this is a bug and that the HTML parser should be able to
handle HTML tags that straddle a chunk boundary. If I'm wrong about that,
then of course the caller would have to make sure that no tags overlap a
boundary. That, however, would require parsing the HTML before calling
htmlParseChunk().... erm.. boom ;-)
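
As an illustration of what a caller-side guard might look like for just
the two tags in question, here is a rough sketch. Before a chunk is
handed to htmlParseChunk(), it computes how many trailing bytes could be
the start of "</script>" or "</style>", so that those bytes can be held
back and prepended to the next chunk. The helper names
partial_tag_suffix and bytes_to_hold_back are hypothetical. This is not
part of libxml2 and has not been tested against 2.6.24; it is a stopgap
idea, not a fix for the parser.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Length of the longest suffix of chunk[0..len) that is a non-empty,
 * proper prefix of tag, compared case-insensitively; 0 if there is none.
 */
static size_t partial_tag_suffix(const char *chunk, size_t len,
                                 const char *tag)
{
    size_t tag_len = strlen(tag);
    size_t max = (len < tag_len - 1) ? len : tag_len - 1;

    /* Try the longest candidate first so the first hit is the answer. */
    for (size_t n = max; n > 0; n--) {
        size_t i;
        for (i = 0; i < n; i++) {
            if (tolower((unsigned char)chunk[len - n + i]) !=
                tolower((unsigned char)tag[i]))
                break;
        }
        if (i == n)
            return n;
    }
    return 0;
}

/*
 * Number of trailing bytes of the chunk that might begin a closing
 * script or style tag and should therefore wait for the next chunk.
 */
size_t bytes_to_hold_back(const char *chunk, size_t len)
{
    size_t a = partial_tag_suffix(chunk, len, "</script>");
    size_t b = partial_tag_suffix(chunk, len, "</style>");
    return (a > b) ? a : b;
}
```

For Buffer1 from the example this returns 2 (the trailing "</"), and for
Buffer2 it returns 0. A caller would pass only the first len minus that
many bytes to htmlParseChunk(), carry the held-back bytes over to the
front of the next chunk, and flush whatever remains with the final
(terminating) call.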

Best regards

Cyrill
