Re: [xml] UTF-8 decoding bug in HTML parser
- From: Michael Day <mikeday yeslogic com>
- To: veillard redhat com
- Cc: xml gnome org
- Subject: Re: [xml] UTF-8 decoding bug in HTML parser
- Date: Fri, 26 Sep 2008 14:44:19 +1000
Hi Daniel,
> See patch attached. I'm committing it to SVN as this fixes the specific
> test case; all the errors seen when parsing subsequently look 'normal'
> :-) so I added it to the test suite.
Excellent!
Would there be any chance that you could look at one more related issue
affecting the HTML parser? Currently, if an HTML file begins with a UTF-8
BOM (the bytes EF BB BF), the HTML parser does not recognise it and parses
it as three Latin-1 characters, which results in garbage at the beginning
of the file and an incorrect encoding for the rest of the file.
Would it be possible to skip over these three bytes, and ideally set the
encoding to UTF-8 if they are present?
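
To illustrate the check being requested, here is a minimal sketch in C. The
function name utf8_bom_length is hypothetical and is not part of the libxml2
API; it only shows how the three BOM bytes could be detected at the start of
a buffer, and is not how the parser necessarily handles it.

```c
#include <stddef.h>

/* Hypothetical helper (not a libxml2 function): returns the number of
 * leading bytes to skip, which is 3 when the buffer starts with the
 * UTF-8 BOM (EF BB BF) and 0 otherwise. */
size_t utf8_bom_length(const unsigned char *buf, size_t len)
{
    if (buf != NULL && len >= 3 &&
        buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;
    return 0;
}
```

A caller could advance its input pointer by the returned count and, when the
result is non-zero, treat the rest of the document as UTF-8.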
Best regards,
Michael
--
Print XML with Prince!
http://www.princexml.com