Re: [xml] UTF-8 decoding bug in HTML parser

From: Michael Day <mikeday yeslogic com>
To: veillard redhat com
Cc: xml gnome org
Subject: Re: [xml] UTF-8 decoding bug in HTML parser
Date: Fri, 26 Sep 2008 14:44:19 +1000

Hi Daniel,

  See patch attached. I'm committing it to SVN as this fixes the specific
test case; all the errors seen when parsing subsequently look 'normal'
:-) so I added it to the test suite


Excellent!

Would there be any chance that you could look at one more related issue affecting the HTML parser? Currently, if an HTML file begins with a UTF-8 BOM, the HTML parser does not recognise it and parses it as three Latin1 characters, which results in garbage at the beginning of the file and an incorrect encoding for the rest of the file.

Would it be possible to skip over these three bytes, and ideally set the encoding to UTF-8 if they are present?
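
A minimal sketch of the check being requested, for illustration only. The helper name `skip_utf8_bom` and its signature are hypothetical and are not part of libxml2's API; the only fact relied on is that the UTF-8 BOM is the byte sequence EF BB BF.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a libxml2 function).
 * Returns the number of leading bytes to skip: 3 if the buffer starts
 * with the UTF-8 BOM (EF BB BF), otherwise 0. If is_utf8 is non-NULL,
 * it is set to 1 when the BOM was found and to 0 otherwise, so the
 * caller can decide whether to switch the input encoding to UTF-8.
 */
static size_t skip_utf8_bom(const unsigned char *buf, size_t len, int *is_utf8)
{
    int found = len >= 3 &&
                buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF;

    if (is_utf8 != NULL)
        *is_utf8 = found;

    return found ? 3 : 0;
}
```

A caller would advance its input pointer by the returned offset before parsing, and treat the flag as a hint that the rest of the document is UTF-8 rather than Latin1.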


Best regards,

Michael

--
Print XML with Prince!
http://www.princexml.com
