Re: [xml] UTF-8 decoding bug in HTML parser
- From: Daniel Veillard <veillard redhat com>
- To: Michael Day <mikeday yeslogic com>
- Cc: xml gnome org
- Subject: Re: [xml] UTF-8 decoding bug in HTML parser
- Date: Thu, 25 Sep 2008 18:07:11 +0200
On Thu, Sep 11, 2008 at 06:12:30PM +1000, Michael Day wrote:
> Hi,
>
> The attached file illustrates a UTF-8 decoding bug in the HTML parser,
> which can be recreated with:
>
> $ xmllint --html utf8bug.html
>
> The last one or two characters in the document are corrupted, and
> xmllint reports an encoding error. However, the text is in fact
> correctly encoded, as can be demonstrated by pasting it into an XML
> document, or just deleting some unrelated text from earlier in this
> document, which fixes the problem.
>
> As can be seen from the full example (utf8full.html) after the corrupted
> character the parser appears to switch back to a single byte encoding,
> so all subsequent multibyte UTF-8 text is also corrupted.
>
> This appears to be caused by some kind of buffering bug, perhaps a
> multibyte UTF-8 character is overlapping the end of a buffer, and the
> buffer is not being expanded correctly?
Okay, thanks for the detailed information...
The problem comes from htmlParseCharData, which loops reading
UTF-8 characters one at a time, using
    NEXTL(l);
    cur = CUR_CHAR(l);
    if (cur == 0) {
        SHRINK;
        GROW;
        cur = CUR_CHAR(l);
to detect the end of the input buffer and grow it if needed.
The problem is that if this fails, the side effect you detected comes
in: a spurious error and a subsequent encoding problem.
The simplest fix is to preemptively shrink/grow the input buffer
without waiting for the error. You may lose a tiny bit of performance,
but this is actually correct :-\
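
To illustrate the idea outside of libxml2, here is a minimal toy sketch in C.
It is not the HTMLparser.c code: toy_input, toy_grow, toy_cur_char, toy_parse,
TOY_WINDOW and TOY_REFILL are all hypothetical names, and the "buffer" is just
a window over a string. It only models a reader that refills lazily (when the
window looks empty) versus one that also refills every few characters, so that
a multibyte sequence is never cut off at the window edge.

```c
#include <stddef.h>
#include <string.h>

#define TOY_WINDOW 16 /* bytes made visible by one refill (assumption) */
#define TOY_REFILL 4  /* characters between pre-emptive refills; 4 * 4 <= TOY_WINDOW */

/* Hypothetical toy input: data[0..avail) is what the reader can currently see. */
typedef struct {
    const unsigned char *data;
    size_t len;   /* total bytes in the document */
    size_t pos;   /* current read position */
    size_t avail; /* bytes currently visible, pos <= avail <= len */
} toy_input;

/* Make up to TOY_WINDOW bytes past pos visible, if the document has them. */
static void toy_grow(toy_input *in) {
    size_t want = in->pos + TOY_WINDOW;
    if (want > in->len) want = in->len;
    if (want > in->avail) in->avail = want;
}

/* Sequence length implied by a UTF-8 lead byte, 0 if it cannot start a
 * sequence. Continuation bytes are not validated in this sketch. */
static size_t utf8_len(unsigned char c) {
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 0;
}

/* 1: a whole character is visible (*l = its length);
 * 0: the window is exhausted;
 * -1: bad lead byte or sequence cut off by the window edge (*l = 1). */
static int toy_cur_char(const toy_input *in, size_t *l) {
    size_t n;
    if (in->pos >= in->avail) { *l = 0; return 0; }
    n = utf8_len(in->data[in->pos]);
    if (n == 0 || in->pos + n > in->avail) { *l = 1; return -1; }
    *l = n;
    return 1;
}

/* Walk the text one character at a time and count decoding errors.
 * preemptive == 0: refill only when the window looks empty.
 * preemptive != 0: additionally refill every TOY_REFILL characters. */
static int toy_parse(const char *text, int preemptive) {
    toy_input in;
    int errors = 0, chunk = 0, r;
    size_t l;

    in.data = (const unsigned char *)text;
    in.len = strlen(text);
    in.pos = 0;
    in.avail = 0;
    toy_grow(&in);

    for (;;) {
        if (preemptive) {
            if (chunk == 0) toy_grow(&in); /* refill before the edge is reached */
            chunk = (chunk + 1) % TOY_REFILL;
        }
        r = toy_cur_char(&in, &l);
        if (r == 0) {                      /* lazy refill at the window edge */
            toy_grow(&in);
            r = toy_cur_char(&in, &l);
            if (r == 0) break;             /* real end of input */
        }
        if (r < 0) errors++;
        in.pos += l;
    }
    return errors;
}
```

With 15 ASCII bytes followed by a two-byte character and one more ASCII byte,
the two-byte character straddles the 16-byte window: toy_parse(text, 0) counts
two errors (the cut-off lead byte, then the orphaned continuation byte), while
toy_parse(text, 1) counts none. The cost is the extra refill calls, which is
the small performance trade-off mentioned above.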
See the attached patch. I'm committing it to SVN as it fixes the specific
test case; all the errors seen when parsing subsequently look 'normal'
:-) so I added it to the test suite.
Thanks for the detailed test and explanations!
Daniel
--
Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
daniel veillard com | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library http://libvirt.org/
Index: HTMLparser.c
===================================================================
--- HTMLparser.c (revision 3795)
+++ HTMLparser.c (working copy)
@@ -2768,6 +2768,7 @@ htmlParseCharData(htmlParserCtxtPtr ctxt
     xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 5];
     int nbchar = 0;
     int cur, l;
+    int chunk = 0;
 
     SHRINK;
     cur = CUR_CHAR(l);
@@ -2798,6 +2799,12 @@ htmlParseCharData(htmlParserCtxtPtr ctxt
             nbchar = 0;
         }
         NEXTL(l);
+        chunk++;
+        if (chunk > HTML_PARSER_BUFFER_SIZE) {
+            chunk = 0;
+            SHRINK;
+            GROW;
+        }
         cur = CUR_CHAR(l);
         if (cur == 0) {
             SHRINK;