
Re: [xml] UTF-8 decoding bug in HTML parser



On Thu, Sep 11, 2008 at 06:12:30PM +1000, Michael Day wrote:
> Hi,
>
> The attached file illustrates a UTF-8 decoding bug in the HTML parser,  
> which can be recreated with:
>
>     $ xmllint --html utf8bug.html
>
> The last one or two characters in the document are corrupted, and  
> xmllint reports an encoding error. However, the text is in fact  
> correctly encoded, as can be demonstrated by pasting it into an XML  
> document, or just deleting some unrelated text from earlier in this  
> document, which fixes the problem.
>
> As can be seen from the full example (utf8full.html) after the corrupted  
> character the parser appears to switch back to a single byte encoding,  
> so all subsequent multibyte UTF-8 text is also corrupted.
>
> This appears to be caused by some kind of buffering bug, perhaps a  
> multibyte UTF-8 character is overlapping the end of a buffer, and the  
> buffer is not being expanded correctly?

  Okay, thanks for the detailed information...
The problem comes from htmlParseCharData, which loops reading
UTF-8 characters one at a time and uses

        NEXTL(l);
        cur = CUR_CHAR(l);
        if (cur == 0) {
            SHRINK;
            GROW;
            cur = CUR_CHAR(l);

  to detect the end of the input buffer and grow it if needed.
The problem is that if that read fails, we get the side effect you
detected: a spurious error followed by an encoding problem.
  The simplest fix is to preemptively shrink/grow the input buffer
without waiting for the error. You may lose a tiny bit of performance,
but this is actually correct :-\
  See the attached patch. I'm committing it to SVN as it fixes the
specific test case, and all the errors seen when parsing afterwards
look 'normal' :-) so I added it to the test suite.
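
  To illustrate the idea outside of libxml2, here is a toy model of a
reader with a fixed-size window over its input. It is only a sketch under
stated assumptions: Input, refill, utf8_len, count_chars, WINDOW and CHUNK
are all made-up names, refill() merely stands in for SHRINK; GROW, and the
real parser's input handling is more involved. With preemptive == 0 the
window is refilled only once it is exhausted, so a multibyte character
cut by the end of the window is reported as an error. With preemptive
!= 0 the window is refilled every CHUNK characters, before that can
happen.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WINDOW 16  /* hypothetical window size, in bytes */
#define CHUNK   4  /* refill period, in characters; CHUNK * 4 <= WINDOW */

typedef struct {
    const unsigned char *src;   /* the whole document */
    size_t src_len, src_pos;    /* src_pos: next byte not yet copied to buf */
    unsigned char buf[WINDOW];  /* the window the reader decodes from */
    size_t len, cur;            /* valid bytes in buf, current read offset */
} Input;

/* Drop consumed bytes, then top the window up from the source.
 * Plays the role of "SHRINK; GROW;" in this toy model. */
static void refill(Input *in) {
    size_t room, avail, n;

    assert(in->cur <= in->len);
    memmove(in->buf, in->buf + in->cur, in->len - in->cur);
    in->len -= in->cur;
    in->cur = 0;
    room = WINDOW - in->len;
    avail = in->src_len - in->src_pos;
    n = room < avail ? room : avail;
    memcpy(in->buf + in->len, in->src + in->src_pos, n);
    in->len += n;
    in->src_pos += n;
}

/* Sequence length implied by a UTF-8 lead byte, 0 if it is not a lead
 * byte. Continuation bytes are not validated in this sketch. */
static int utf8_len(unsigned char c) {
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Count the characters in text, or return -1 if a sequence is invalid
 * or is cut off by the end of the window. */
static int count_chars(const char *text, int preemptive) {
    Input in;
    int count = 0, chunk = 0;

    memset(&in, 0, sizeof in);
    in.src = (const unsigned char *) text;
    in.src_len = strlen(text);
    refill(&in);

    for (;;) {
        int l;

        if (in.cur == in.len) {      /* window exhausted: refill late */
            refill(&in);
            if (in.cur == in.len)
                return count;        /* end of input */
        }
        l = utf8_len(in.buf[in.cur]);
        if (l == 0 || in.cur + (size_t) l > in.len)
            return -1;               /* sequence straddles the window end */
        in.cur += (size_t) l;
        count++;
        if (preemptive && ++chunk >= CHUNK) {
            chunk = 0;               /* refill early, before running dry */
            refill(&in);
        }
    }
}
```

  For a 19-byte input made of 15 ASCII bytes, a two-byte character and
two more ASCII bytes, the two-byte character straddles the first window:
count_chars(text, 0) returns -1, while count_chars(text, 1) returns 18.
The early refill is safe here because CHUNK characters can never consume
more than WINDOW bytes.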

   Thanks for the detailed test and explanations!

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/
Index: HTMLparser.c
===================================================================
--- HTMLparser.c	(revision 3795)
+++ HTMLparser.c	(working copy)
@@ -2768,6 +2768,7 @@ htmlParseCharData(htmlParserCtxtPtr ctxt
     xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 5];
     int nbchar = 0;
     int cur, l;
+    int chunk = 0;
 
     SHRINK;
     cur = CUR_CHAR(l);
@@ -2798,6 +2799,12 @@ htmlParseCharData(htmlParserCtxtPtr ctxt
 	    nbchar = 0;
 	}
 	NEXTL(l);
+        chunk++;
+        if (chunk > HTML_PARSER_BUFFER_SIZE) {
+            chunk = 0;
+            SHRINK;
+            GROW;
+        }
 	cur = CUR_CHAR(l);
 	if (cur == 0) {
 	    SHRINK;

