Re: [xml] UTF-8 decoding bug in HTML parser



On Thu, Sep 11, 2008 at 06:12:30PM +1000, Michael Day wrote:
Hi,

The attached file illustrates a UTF-8 decoding bug in the HTML parser,  
which can be recreated with:

    $ xmllint --html utf8bug.html

The last one or two characters in the document are corrupted, and  
xmllint reports an encoding error. However, the text is in fact  
correctly encoded, as can be demonstrated by pasting it into an XML  
document, or just deleting some unrelated text from earlier in this  
document, which fixes the problem.

As can be seen from the full example (utf8full.html) after the corrupted  
character the parser appears to switch back to a single byte encoding,  
so all subsequent multibyte UTF-8 text is also corrupted.

This appears to be caused by some kind of buffering bug, perhaps a  
multibyte UTF-8 character is overlapping the end of a buffer, and the  
buffer is not being expanded correctly?

  okay, thanks for the detailed information...
The problem comes from htmlParseCharData which does a loop reading
UTF-8 characters one at a time, and using

    NEXTL(l);
    cur = CUR_CHAR(l);
    if (cur == 0) {
        SHRINK;
        GROW;
        cur = CUR_CHAR(l);
    }

  to detect the end of the input buffer and grow it if needed.
The problem is that if the read fails, the side effect you detected
kicks in: a spurious error followed by an encoding problem.
  The simplest fix is to preemptively shrink/grow the input buffer
without waiting for the error. You may lose a tiny bit of performance
but this is actually correct :-\
  See the attached patch. I'm committing it to SVN as it fixes the specific
test case, and all the errors seen when parsing subsequently look 'normal'
:-) so I added it to the test suite.
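
  To make the difference concrete, here is a rough, self-contained sketch
of the two strategies. It is a toy model only, not the libxml2 code or the
attached patch: toy_input, toy_grow, toy_cur_char, scan_lazy, scan_eager
and the 8-byte CHUNK are all made-up names and values for illustration,
and the toy does not model the switch to a single byte encoding, only the
spurious errors.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 8  /* hypothetical refill size, tiny on purpose */

/* Toy input: a whole document of which only the first `avail` bytes are loaded. */
typedef struct {
    const unsigned char *doc;
    size_t doc_len;
    size_t pos;    /* read cursor */
    size_t avail;  /* bytes loaded so far, pos <= avail <= doc_len */
} toy_input;

/* Load up to CHUNK more bytes of the document. */
static void toy_grow(toy_input *in) {
    size_t left = in->doc_len - in->avail;
    in->avail += left < CHUNK ? left : CHUNK;
}

/* Decode one UTF-8 character from the loaded bytes only.
 * Returns the code point, 0 when no loaded byte is left, or -1 on an
 * encoding error. A sequence cut off by the end of the loaded data is
 * reported as -1, not 0: that is the case the lazy check misses. */
static int toy_cur_char(const toy_input *in, int *len) {
    size_t have = in->avail - in->pos;
    const unsigned char *p = in->doc + in->pos;
    int need, cp, i;

    *len = 0;
    if (have == 0) return 0;
    if (p[0] < 0x80) { *len = 1; return p[0]; }
    if ((p[0] & 0xE0) == 0xC0)      { need = 2; cp = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { need = 3; cp = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { need = 4; cp = p[0] & 0x07; }
    else return -1;                      /* stray continuation byte */
    if (have < (size_t)need) return -1;  /* truncated sequence */
    for (i = 1; i < need; i++) {
        if ((p[i] & 0xC0) != 0x80) return -1;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    *len = need;
    return cp;
}

/* Lazy strategy: refill only after the decoder returned 0.
 * Returns the number of encoding errors, stores decoded characters in *chars. */
static int scan_lazy(const char *text, size_t *chars) {
    toy_input in = { (const unsigned char *)text, strlen(text), 0, 0 };
    int len, cur, errors = 0;

    *chars = 0;
    toy_grow(&in);
    cur = toy_cur_char(&in, &len);
    while (cur != 0) {
        if (cur < 0) { errors++; len = 1; }  /* skip the offending byte */
        else (*chars)++;
        in.pos += len;
        cur = toy_cur_char(&in, &len);
        if (cur == 0) {                      /* too late for a split char */
            toy_grow(&in);
            cur = toy_cur_char(&in, &len);
        }
    }
    return errors;
}

/* Preemptive strategy: refill whenever fewer than 4 bytes (the longest
 * UTF-8 sequence) remain, before trying to decode. */
static int scan_eager(const char *text, size_t *chars) {
    toy_input in = { (const unsigned char *)text, strlen(text), 0, 0 };
    int len, cur, errors = 0;

    *chars = 0;
    toy_grow(&in);
    cur = toy_cur_char(&in, &len);
    while (cur != 0) {
        if (cur < 0) { errors++; len = 1; }
        else (*chars)++;
        in.pos += len;
        if (in.avail - in.pos < 4) toy_grow(&in);
        cur = toy_cur_char(&in, &len);
    }
    return errors;
}
```

  With an input where a two-byte character straddles the 8-byte boundary,
e.g. "abcdefg" followed by the bytes C3 A9 and then "xyz", scan_lazy sees
two bogus encoding errors (one per half of the split character) while
scan_eager decodes the text cleanly, at the cost of one extra length check
per character.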

   thanks for the detailed test and explanations!

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/

Attachment: html_utf8.patch
Description: Text document


