Re: [xml] Minor bug in htmlCtxtReset

From: Daniel Veillard <veillard redhat com>
To: Michael Day <mikeday yeslogic com>
Cc: xml gnome org
Subject: Re: [xml] Minor bug in htmlCtxtReset
Date: Wed, 8 Nov 2006 04:22:19 -0500

On Wed, Nov 08, 2006 at 10:03:29AM +1100, Michael Day wrote:

Hi Daniel,

 ctxt->charset is a remnant from libxml1, where strings were stored
in the document encoding (this was a complete and total mess). Now
they are always stored as UTF-8, so whether the value is 0 or
XML_CHAR_ENCODING_UTF8 should not change anything, really.


However, the charset value is used in htmlCurrentChar():

    if (ctxt->charset == XML_CHAR_ENCODING_UTF8) {

I'm trying to parse an HTML file encoded in ISO-8859-1 using 
htmlCtxtReadFile() and I'm getting encoding errors on some of the 
characters because they are not in UTF-8. If I use htmlReadFile() 
everything works fine. If I use htmlCtxtReadFile() and comment out this 
line of htmlCtxtReset():

    ctxt->charset = XML_CHAR_ENCODING_UTF8;

then everything works fine. However, if that line is not commented out, 
then the behaviour of htmlCtxtReadFile() is different from the behaviour 
of htmlReadFile(), and appears to be wrong. So I suggest replacing that 
line with this:

    ctxt->charset = 0;

which will truly reset the parsing context to what it was when it was 
created and give identical behaviour to htmlReadFile() and 
htmlCtxtReadFile().
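
To make the failure mode concrete, here is a toy model in C. It is not
libxml2 code: the model_* names, the enum values and the two-byte-only
UTF-8 check are all invented for illustration, and the "anything that is
not UTF-8 is read as one byte per character" branch is an assumption about
the fallback path, not something stated in this thread. It only shows why
a reset that forces UTF-8 makes ISO-8859-1 input produce encoding errors,
while a reset to 0 does not.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the encoding values discussed in the thread. */
typedef enum { MODEL_ENC_NONE = 0, MODEL_ENC_UTF8 = 1 } model_encoding;

/* Toy parser context: only the field the thread is about, plus a counter. */
typedef struct {
    model_encoding charset;
    int encoding_errors;
} model_ctxt;

/* buggy != 0 models the old reset (force UTF-8);
 * buggy == 0 models the proposed reset (no encoding assumed). */
static void model_reset(model_ctxt *ctxt, int buggy)
{
    ctxt->charset = buggy ? MODEL_ENC_UTF8 : MODEL_ENC_NONE;
    ctxt->encoding_errors = 0;
}

/* Read one character starting at buf[*pos] and advance *pos.
 * ASSUMPTION: when charset is not UTF-8, every byte is one character. */
static int model_current_char(model_ctxt *ctxt, const unsigned char *buf,
                              size_t len, size_t *pos)
{
    unsigned char c = buf[*pos];

    if (c < 0x80 || ctxt->charset != MODEL_ENC_UTF8) {
        *pos += 1;
        return c;
    }
    /* Only two-byte UTF-8 sequences (110xxxxx 10xxxxxx) are modelled. */
    if ((c & 0xE0) == 0xC0 && *pos + 1 < len &&
        (buf[*pos + 1] & 0xC0) == 0x80) {
        int cp = ((c & 0x1F) << 6) | (buf[*pos + 1] & 0x3F);
        *pos += 2;
        return cp;
    }
    /* Anything else counts as an encoding error in this model. */
    ctxt->encoding_errors++;
    *pos += 1;
    return c;
}

/* Walk the whole buffer and return the number of encoding errors seen. */
static int model_parse(model_ctxt *ctxt, const unsigned char *buf, size_t len)
{
    size_t pos = 0;

    while (pos < len)
        (void)model_current_char(ctxt, buf, len, &pos);
    return ctxt->encoding_errors;
}
```

Feeding the ISO-8859-1 bytes for "café" ('c', 'a', 'f', 0xE9) through
model_parse() after model_reset(&ctxt, 1) reports one error, because 0xE9
is not a valid UTF-8 lead byte in this model; after model_reset(&ctxt, 0)
the same bytes are read without error.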


  Okay, what I thought was a general rule is limited to XML parsing. We
actually do
        ctxt->charset = XML_CHAR_ENCODING_8859_1
in the HTML parser when an encoding error is detected, so you're right
and the reset code needs to be fixed. HTML parsing is a really scary mess :-\
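
The fallback described above can be sketched as a toy model in C. This is
not the libxml2 implementation: the sk_* names and enum values are
hypothetical, only two-byte UTF-8 sequences are recognised, and the idea
that the error is reported once and later bytes are then read one byte per
character is an assumption made for the sketch. It illustrates why charset
is parser state that a context reset has to clear.

```c
#include <stddef.h>

/* Hypothetical encoding values; the real ones live in libxml2's headers. */
typedef enum { SK_NONE = 0, SK_UTF8 = 1, SK_8859_1 = 2 } sk_encoding;

typedef struct {
    sk_encoding charset;
    int errors;
} sk_ctxt;

/* Is buf[i], buf[i + 1] a well-formed two-byte UTF-8 sequence? */
static int sk_is_utf8_pair(const unsigned char *buf, size_t len, size_t i)
{
    return (buf[i] & 0xE0) == 0xC0 && i + 1 < len &&
           (buf[i + 1] & 0xC0) == 0x80;
}

/* Scan a buffer. While charset is UTF-8, a malformed sequence records one
 * error and switches the context to ISO-8859-1 (the fallback from the
 * thread). ASSUMPTION: any non-UTF-8 charset means one byte per character. */
static void sk_scan(sk_ctxt *ctxt, const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        if (buf[i] < 0x80 || ctxt->charset != SK_UTF8) {
            i += 1;                      /* ASCII or single-byte charset */
        } else if (sk_is_utf8_pair(buf, len, i)) {
            i += 2;                      /* valid two-byte sequence */
        } else {
            ctxt->errors++;              /* encoding error detected */
            ctxt->charset = SK_8859_1;   /* sticky fallback */
            i += 1;
        }
    }
}

/* Model of the corrected reset: forget any charset chosen while parsing. */
static void sk_reset(sk_ctxt *ctxt)
{
    ctxt->charset = SK_NONE;
    ctxt->errors = 0;
}
```

Starting from charset SK_UTF8, scanning the bytes 'a', 0xE9, 'b', 0xE8
records a single error and leaves charset at SK_8859_1; sk_reset() then
returns the context to SK_NONE so the next document starts with no
encoding assumed.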

  So the best fix is to change htmlCtxtReset() to do
        ctxt->charset = XML_CHAR_ENCODING_NONE;

  Thanks for the report, I committed this change in CVS now!

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/

