On Thu, Sep 11, 2008 at 06:12:30PM +1000, Michael Day wrote:
Hi,

The attached file illustrates a UTF-8 decoding bug in the HTML parser, which can be recreated with:

    $ xmllint --html utf8bug.html

The last one or two characters in the document are corrupted, and xmllint reports an encoding error. However, the text is in fact correctly encoded, as can be demonstrated by pasting it into an XML document, or just deleting some unrelated text from earlier in this document, which fixes the problem.

As can be seen from the full example (utf8full.html) after the corrupted character the parser appears to switch back to a single byte encoding, so all subsequent multibyte UTF-8 text is also corrupted.

This appears to be caused by some kind of buffering bug, perhaps a multibyte UTF-8 character is overlapping the end of a buffer, and the buffer is not being expanded correctly?
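The buffering hypothesis above can be illustrated with a small, self-contained sketch. It is only a toy model, not libxml2 code: toy_input, toy_refill, utf8_seq_len, count_decode_errors and the 8-byte WINDOW are all hypothetical names and values chosen for the illustration. It assumes the failure happens when a multibyte sequence straddles the end of the buffered window, and compares refilling only once the window is empty with refilling before any decode that might need more bytes.

```c
#include <stddef.h>
#include <string.h>

#define WINDOW 8  /* deliberately tiny so a character can straddle the end */

/* Hypothetical toy input: a small window over a longer document. */
typedef struct {
    const unsigned char *src;  /* whole document */
    size_t src_len;
    size_t src_pos;            /* next document byte not yet buffered */
    unsigned char buf[WINDOW];
    size_t len;                /* valid bytes in buf */
    size_t pos;                /* read cursor inside buf */
} toy_input;

/* Drop consumed bytes, then top the window up from the document. */
void toy_refill(toy_input *in) {
    size_t left = in->len - in->pos;
    memmove(in->buf, in->buf + in->pos, left);
    in->pos = 0;
    in->len = left;
    size_t room = WINDOW - in->len;
    size_t avail = in->src_len - in->src_pos;
    size_t n = room < avail ? room : avail;
    memcpy(in->buf + in->len, in->src + in->src_pos, n);
    in->src_pos += n;
    in->len += n;
}

/* Sequence length announced by a UTF-8 lead byte, 0 if not a lead byte. */
size_t utf8_seq_len(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

/*
 * Walk the document one character at a time and count decoding errors.
 * preemptive == 0: refill only when the window is completely consumed.
 * preemptive != 0: refill whenever fewer than 4 bytes (the longest
 *                  UTF-8 sequence) remain in the window.
 * Continuation bytes are not validated; only truncation is modelled.
 */
size_t count_decode_errors(const char *doc, size_t doc_len, int preemptive) {
    toy_input in;
    size_t errors = 0;

    in.src = (const unsigned char *)doc;
    in.src_len = doc_len;
    in.src_pos = 0;
    in.len = 0;
    in.pos = 0;
    toy_refill(&in);

    for (;;) {
        size_t left = in.len - in.pos;
        if (preemptive ? left < 4 : left == 0) {
            toy_refill(&in);
            left = in.len - in.pos;
        }
        if (left == 0)
            break;  /* end of document */
        size_t need = utf8_seq_len(in.buf[in.pos]);
        if (need == 0 || need > left) {
            errors++;        /* stray or truncated sequence */
            in.pos += 1;     /* fall back to consuming a single byte */
        } else {
            in.pos += need;
        }
    }
    return errors;
}
```

For the 9-byte input "aaaaaaa\xC3\xA9" (seven ASCII letters followed by a two-byte character), the lazy mode reports two errors, one for the truncated lead byte and one for the orphaned continuation byte, while the preemptive mode reports none. Deleting any earlier byte moves the two-byte character fully inside the window, which matches the observation that removing unrelated text makes the problem disappear.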
okay, thanks for the detailed information...

The problem comes from htmlParseCharData, which loops reading UTF-8 characters one at a time and uses

    NEXTL(l);
    cur = CUR_CHAR(l);
    if (cur == 0) {
        SHRINK;
        GROW;
        cur = CUR_CHAR(l);
    }

to detect the end of the input buffer and grow it if needed. The problem is that if the read fails, the side effect you detected comes in, with a spurious error and a subsequent encoding problem.

The simplest way is to preemptively shrink/grow the input buffer without waiting for the error. You may lose a tiny bit of performance but this is actually correct :-\

See patch attached, I'm committing it to SVN as this fixes the specific test case, all the errors subsequently seen when parsing look 'normal' :-) so I added it to the test suite.

Thanks for the detailed test and explanations!

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/
Attachment: html_utf8.patch
Description: Text document