Re: [xml] Minor bug in htmlCtxtReset

From: Michael Day <mikeday yeslogic com>
To: veillard redhat com
Cc: xml gnome org
Subject: Re: [xml] Minor bug in htmlCtxtReset
Date: Wed, 08 Nov 2006 10:03:29 +1100

Hi Daniel,

>   ctxt->charset is a remain from libxml1 where strings were stored
> in the document encoding (this was a complete and total mess), now
> they are always stored as UTF-8 so whether the value is 0 or
> XML_CHAR_ENCODING_UTF8 this should not change anything, really.


However, the charset value is used in htmlCurrentChar():

    if (ctxt->charset == XML_CHAR_ENCODING_UTF8) {

I'm trying to parse an HTML file encoded in ISO-8859-1 using
htmlCtxtReadFile() and I'm getting encoding errors on some of the
characters because they are not in UTF-8. If I use htmlReadFile()
everything works fine. If I use htmlCtxtReadFile() and comment out this
line of htmlCtxtReset():


    ctxt->charset = XML_CHAR_ENCODING_UTF8;

then everything works fine. However, if that line is not commented out,
then the behaviour of htmlCtxtReadFile() is different from the behaviour
of htmlReadFile(), and appears to be wrong. So I suggest replacing that
line with this:


    ctxt->charset = 0;

which will truly reset the parsing context to what it was when it was
created and give identical behaviour to htmlReadFile() and
htmlCtxtReadFile().
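
To illustrate why the flag matters, here is a small self-contained sketch.
It is not libxml2's code: toy_current_char(), TOY_ENC_NONE and TOY_ENC_UTF8
are hypothetical names, and the non-UTF-8 branch simply returns the raw
byte, which is an assumption standing in for whatever the real parser does
before it has settled on an encoding. The UTF-8 branch is also simplified
(no overlong or surrogate checks).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the parser's charset values. */
enum toy_charset { TOY_ENC_NONE = 0, TOY_ENC_UTF8 = 1 };

/*
 * Toy model of a "current character" routine.
 * Returns the code point starting at buf[0] and stores the number of
 * bytes consumed in *len. Returns -1 (and sets *len to 0) when charset
 * is TOY_ENC_UTF8 and the bytes are not a well-formed UTF-8 sequence.
 */
int toy_current_char(const unsigned char *buf, size_t avail,
                     enum toy_charset charset, size_t *len)
{
    assert(buf != NULL && len != NULL && avail > 0);

    unsigned char c = buf[0];

    /* ASCII, or no UTF-8 claim: take the byte at face value. */
    if (charset != TOY_ENC_UTF8 || c < 0x80) {
        *len = 1;
        return c;
    }

    /* Charset claims UTF-8: the lead byte determines the length. */
    size_t need;
    int val;
    if ((c & 0xE0) == 0xC0)      { need = 2; val = c & 0x1F; }
    else if ((c & 0xF0) == 0xE0) { need = 3; val = c & 0x0F; }
    else if ((c & 0xF8) == 0xF0) { need = 4; val = c & 0x07; }
    else { *len = 0; return -1; }   /* stray continuation or bad lead byte */

    if (avail < need) { *len = 0; return -1; }

    /* Every following byte must look like 10xxxxxx. */
    for (size_t i = 1; i < need; i++) {
        if ((buf[i] & 0xC0) != 0x80) { *len = 0; return -1; }
        val = (val << 6) | (buf[i] & 0x3F);
    }
    *len = need;
    return val;
}
```

With the bytes 0xE9 't' 'e' (the start of an ISO-8859-1 "été"),
the sketch reports an error when called with TOY_ENC_UTF8, because 0xE9
looks like the lead byte of a three-byte sequence and 't' is not a
continuation byte. Called with TOY_ENC_NONE, it returns 0xE9 as is.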


Best regards,

Michael

--
Print XML with Prince!
http://www.princexml.com
