Re: [xml] HTML Parser problems with chunk parser if HTML keywords overlap chunk border




Please use an attachment, not the mail body; mailers break
body content.
<...>
Provide the test example as an attachment too; I will plug it
into test/HTML.

The attached tar.gz contains the contextual patch for HTMLparser.c from
libxml2-2.6.24 (now using htmlParseLookupSequence) and the test HTML file
"chunk-boundary-cdata.html". The test file triggers the error in libxml2
because its closing "</script>" tag falls exactly on the 4096 boundary.
To reproduce the failure, neither the number of characters in the test
HTML file nor the number of bytes read by testHTML may be changed: the
character alignment must match exactly to trigger the error.
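
To make the alignment requirement concrete, here is a minimal sketch of
a check for whether a token is split by a chunk boundary. It assumes
fixed-size chunks counted from offset 0 of the file; the helper names
and the CHUNK_SIZE constant are hypothetical, and testHTML's actual read
pattern may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed chunk size, taken from the "4096 boundary" mentioned above. */
#define CHUNK_SIZE 4096

/* Return 1 if the token occupying [offset, offset + len) is split across
 * two chunks of chunk_size bytes, 0 if it lies entirely within one. */
int straddles_chunk_boundary(size_t offset, size_t len, size_t chunk_size)
{
    assert(chunk_size > 0 && len > 0);
    return (offset / chunk_size) != ((offset + len - 1) / chunk_size);
}

/* Locate the first occurrence of token in the NUL-terminated doc and
 * report whether a chunk boundary splits it. Returns -1 if absent. */
int token_is_split(const char *doc, const char *token, size_t chunk_size)
{
    const char *hit = strstr(doc, token);
    if (hit == NULL)
        return -1;
    return straddles_chunk_boundary((size_t)(hit - doc), strlen(token),
                                    chunk_size);
}
```

Under these assumptions, a "</script>" starting at offset 4094 is split
(its first two bytes end one chunk, the rest begin the next), while the
same tag starting at offset 4096 is not. Adding or removing a single
character before the tag shifts the offset, which is why the test file
must stay byte-for-byte unchanged.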

Before the patch, libxml2-2.6.24 fails the following test with the
simple test HTML file:

./testHTML --push --sax --debug chunk-boundary-cdata.html

SAX.setDocumentLocator()
SAX.startDocument()
SAX.startElement(html)
SAX.startElement(body)
SAX.characters(.............................., 1000)
SAX.characters(...........................
.., 1000)
SAX.characters(.............................., 1000)
SAX.characters(...........................
.., 1000)
SAX.characters(.............................., 74)
SAX.startElement(script)
SAX.error: Invalid char in CDATA 0x0
SAX.cdata(</, 2)
SAX.error: htmlParseEndTag: '</' not found
SAX.cdata(cript>
<a href="test", 26)
SAX.error: Unexpected end tag : a
SAX.cdata(
, 1)
SAX.endElement(script)
SAX.endElement(body)
SAX.ignorableWhitespace(
, 1)
SAX.endElement(html)
SAX.ignorableWhitespace(
, 1)
SAX.endDocument()


After the patch, the result is correct.
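
For readers without the attachment, the general idea behind looking up a
complete sequence before parsing can be sketched as follows. This is not
the patch itself and not the signature of htmlParseLookupSequence;
lookup_sequence is a hypothetical, simplified helper that only shows why
a push parser should wait for more input when the terminator is not yet
fully buffered. It compares bytes exactly, so unlike a real HTML parser
it is case-sensitive.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the offset of the first occurrence of seq
 * within the buf_len bytes buffered so far, or -1 if seq is not yet
 * completely present. A caller would keep buffering on -1 instead of
 * emitting partial CDATA. */
long lookup_sequence(const char *buf, size_t buf_len, const char *seq)
{
    size_t seq_len;
    size_t i;

    assert(buf != NULL && seq != NULL);
    seq_len = strlen(seq);
    if (seq_len == 0 || buf_len < seq_len)
        return -1;
    for (i = 0; i + seq_len <= buf_len; i++) {
        if (memcmp(buf + i, seq, seq_len) == 0)
            return (long)i;
    }
    return -1;
}
```

With a first chunk ending in "</", the lookup for "</script" returns -1
and nothing should be parsed yet; once the next chunk supplies
"script>", the lookup succeeds and the script content can be delivered
as one piece.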

Cyrill

Attachment: libxml2-HTMLparser-cdata-fix.tar.gz


