Hello,

I might have stumbled upon a bug in HTMLparser.c. This bug manifests itself when a UTF-8 HTML file is being read and a Unicode character gets split right at the end of the input buffer: the input buffer then gets resized, but an old pointer is still used, so the Unicode character is not recognized. This causes the encoding to be switched back to ISO-8859-1.

I found this bug while using xsltproc with --html, and I don't know under what other circumstances it may arise.

The attached patch solves the problem by updating the cur pointer in the htmlCurrentChar function whenever xmlParserInputGrow is called.

Thanks for Libxml2 and Libxslt. I have used them on several occasions and they've been really helpful.

--
Adiel Mittmann
Attachment:
libxml2-2.6.32-split-utf8.patch
Description: Text document
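
For readers without the patch at hand, here is a minimal standalone sketch of the kind of stale-pointer problem described above. It does not use libxml2 and is not the patch itself: the Input struct and the input_grow and current_char functions are hypothetical stand-ins for the parser input, xmlParserInputGrow and htmlCurrentChar, and only two-byte UTF-8 sequences are handled to keep it short. The point is that after the buffer grows, realloc may have moved it, so cur has to be recomputed before the continuation byte is read.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a parser input buffer (not libxml2's struct). */
typedef struct {
    unsigned char *base; /* start of the heap-allocated buffer */
    size_t len;          /* number of bytes currently stored */
    size_t pos;          /* offset of the current character */
} Input;

/* Append n more bytes. realloc may move the buffer, which invalidates
 * every pointer previously derived from in->base. */
static int input_grow(Input *in, const unsigned char *more, size_t n)
{
    unsigned char *p = realloc(in->base, in->len + n);
    if (p == NULL)
        return -1;
    memcpy(p + in->len, more, n);
    in->base = p;
    in->len += n;
    return 0;
}

/* Decode the character at the current position (ASCII or a 2-byte
 * UTF-8 sequence only). If the sequence is split at the end of the
 * buffer, grow the buffer first. Returns the code point, or -1. */
static int current_char(Input *in, const unsigned char *more, size_t n)
{
    const unsigned char *cur;

    assert(in->pos < in->len);
    cur = in->base + in->pos;

    if (cur[0] < 0x80)
        return cur[0];            /* plain ASCII */
    if ((cur[0] & 0xE0) != 0xC0)
        return -1;                /* not a 2-byte lead byte */

    if (in->len - in->pos < 2) {
        if (input_grow(in, more, n) < 0)
            return -1;
        /* The fix: re-derive cur, since the buffer may have moved. */
        cur = in->base + in->pos;
    }
    if (in->len - in->pos < 2 || (cur[1] & 0xC0) != 0x80)
        return -1;                /* still incomplete or malformed */

    return ((cur[0] & 0x1F) << 6) | (cur[1] & 0x3F);
}
```

Without the reassignment of cur after input_grow, the read of cur[1] would go through a pointer into the old allocation, which is undefined behaviour and, in a parser, would plausibly look like an invalid byte sequence.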