Re: [xml] UTF-8 decoding bug in HTML parser

From: Daniel Veillard <veillard redhat com>
To: Michael Day <mikeday yeslogic com>
Cc: xml gnome org
Subject: Re: [xml] UTF-8 decoding bug in HTML parser
Date: Fri, 26 Sep 2008 12:06:22 +0200

On Fri, Sep 26, 2008 at 02:44:19PM +1000, Michael Day wrote:

Hi Daniel,

  See patch attached, I'm committing it to SVN as this fixes the specific
test case; all the errors seen when parsing subsequently look 'normal'
:-) so I added it to the test suite


Excellent!

Would there be any chance that you could look at one more related issue
affecting the HTML parser? Currently if an HTML file begins with a UTF-8
BOM, the HTML parser does not recognise it and parses it as three Latin1
characters, which results in garbage at the beginning of the file and an
incorrect encoding for the rest of the file.

Would it be possible to skip over these three bytes, and ideally set the  
encoding to UTF-8 if they are present?
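
For illustration, the check described above could be sketched in standalone
C roughly as follows. This is not the attached patch and not libxml2 code:
the names utf8_bom_length and skip_utf8_bom are hypothetical, and the sketch
only assumes that the input is available as a byte buffer with a known length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libxml2: returns the number of leading
 * bytes occupied by a UTF-8 BOM (3 if present, otherwise 0). */
size_t utf8_bom_length(const unsigned char *buf, size_t len)
{
    /* The UTF-8 encoding of U+FEFF is the byte sequence EF BB BF. */
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

    assert(buf != NULL || len == 0);
    if (len >= sizeof(bom) && memcmp(buf, bom, sizeof(bom)) == 0)
        return sizeof(bom);
    return 0;
}

/* Hypothetical helper: if the buffer starts with a UTF-8 BOM, advance *buf
 * past it, shrink *len accordingly, and return "UTF-8" as the encoding to
 * use. Otherwise leave both untouched and return NULL, so the caller can
 * fall back to its usual encoding detection. */
const char *skip_utf8_bom(const unsigned char **buf, size_t *len)
{
    size_t n = utf8_bom_length(*buf, *len);

    if (n == 0)
        return NULL;
    *buf += n;
    *len -= n;
    return "UTF-8";
}
```

A caller would run skip_utf8_bom once on the first bytes of the document,
before any other encoding guess, and treat a non-NULL result as the encoding
for the rest of the file.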


  Reusing the XML code for this seems to work fine for me and the
regression test, but you probably have a more extensive HTML test
suite than me ;-) so raise the problem if there is a regression!
Will commit to SVN with the test case,

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/

Attachment: html_utf8_bom.patch
Description: Text document
