[xml] UTF-8 decoding bug in HTML parser
- From: Michael Day <mikeday yeslogic com>
- To: xml gnome org
- Subject: [xml] UTF-8 decoding bug in HTML parser
- Date: Thu, 11 Sep 2008 18:12:30 +1000
Hi,
The attached file illustrates a UTF-8 decoding bug in the HTML parser,
which can be reproduced with:
$ xmllint --html utf8bug.html
The last one or two characters in the document are corrupted, and
xmllint reports an encoding error. However, the text is in fact
correctly encoded: pasting it into an XML document parses without
error, and simply deleting some unrelated text from earlier in the
HTML document also makes the problem go away.
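One way to check independently of libxml2 that the bytes are well-formed
UTF-8 is a strict decode, for example in Python. This is only a sketch:
the helper name first_utf8_error is made up for illustration and is not
part of any library.

```python
def first_utf8_error(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence,
    or None if the whole input decodes cleanly."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first offending byte
        return exc.start
    return None
```

Reading utf8bug.html in binary mode and passing its contents to this
function should return None if the file really is correctly encoded.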
As can be seen from the full example (utf8full.html), after the corrupted
character the parser appears to switch back to a single-byte encoding,
so all subsequent multibyte UTF-8 text is also corrupted.
This appears to be caused by some kind of buffering bug. Perhaps a
multibyte UTF-8 character is straddling the end of a buffer, and the
buffer is not being expanded correctly?
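The suspected failure mode can be sketched in Python. This is not
libxml2's code, only an illustration of the hypothesis: decoding
fixed-size chunks one at a time damages a character that is split
across a chunk boundary, while a decoder that carries the incomplete
bytes over to the next chunk does not. The function names and the chunk
size of 4 are assumptions chosen for the example.

```python
import codecs

def decode_chunks_naive(data: bytes, size: int) -> str:
    """Decode every chunk on its own; a character split across two
    chunks turns into replacement characters."""
    return "".join(
        data[i:i + size].decode("utf-8", errors="replace")
        for i in range(0, len(data), size)
    )

def decode_chunks_incremental(data: bytes, size: int) -> str:
    """Decode chunk by chunk, keeping incomplete trailing bytes
    buffered until the rest of the character arrives."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    parts = [decoder.decode(data[i:i + size])
             for i in range(0, len(data), size)]
    parts.append(decoder.decode(b"", final=True))  # flush; raises if truncated
    return "".join(parts)

# Three ASCII bytes followed by one two-byte character (U+0633),
# so a chunk size of 4 cuts the last character in half.
text = "abc\u0633"
data = text.encode("utf-8")

naive = decode_chunks_naive(data, 4)
incremental = decode_chunks_incremental(data, 4)
```

With these inputs the naive result differs from the original text,
while the incremental result matches it. That is consistent with the
observation that removing unrelated earlier text, which shifts where
the boundary falls, makes the error disappear.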
Best regards,
Michael
--
Print XML with Prince!
http://www.princexml.com
Title: ØÙÙØ ØÙØÚ
تاريخ درج: چهارشنبه، 29 اسفند 1386
- Wednesday, March 19, 2008
نويسنده:
دفعات مشاهده: 2688
بار كد: 341
ØÚØ ÙØØÙÙ ÙÙØ ÛÚÛ ØØ ÙÙÚØØØÙ ØØØ. ØÙÛ ØÙ ÚÙÛÚ ÚÙØ.