Re: [xml] Minor bug in htmlCtxtReset
- From: Michael Day <mikeday yeslogic com>
- To: veillard redhat com
- Cc: xml gnome org
- Subject: Re: [xml] Minor bug in htmlCtxtReset
- Date: Wed, 08 Nov 2006 10:03:29 +1100
Hi Daniel,
> ctxt->charset is a remain from libxml1 where strings were stored
> in the document encoding (this was a complete and total mess), now
> they are always stored as UTF-8 so whether the value is 0 or
> XML_CHAR_ENCODING_UTF8 this should not change anything, really.
However, the charset value is used in htmlCurrentChar():
    if (ctxt->charset == XML_CHAR_ENCODING_UTF8) {
I'm trying to parse an HTML file encoded in ISO-8859-1 using
htmlCtxtReadFile() and I'm getting encoding errors on some of the
characters because they are not in UTF-8. If I use htmlReadFile()
everything works fine. If I use htmlCtxtReadFile() and comment out this
line of htmlCtxtReset():
    ctxt->charset = XML_CHAR_ENCODING_UTF8;
then everything works fine. However, if that line is not commented out,
then the behaviour of htmlCtxtReadFile() is different from the behaviour
of htmlReadFile(), and appears to be wrong. So I suggest replacing that
line with this:
    ctxt->charset = 0;
which will truly reset the parsing context to what it was when it was
created and give identical behaviour to htmlReadFile() and
htmlCtxtReadFile().
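
To illustrate why ISO-8859-1 input fails once the parser assumes UTF-8, here is a minimal standalone sketch. It does not use libxml2 and is not the code from htmlCurrentChar(); the helper names utf8_seq_len() and count_invalid_utf8() are hypothetical, and the check is simplified (it does not reject overlong forms or surrogates). It only shows that a Latin-1 byte such as 0xE9 (e-acute) looks like the start of a multi-byte UTF-8 sequence whose continuation bytes are missing.

```c
#include <stddef.h>

/* Hypothetical helper: length (1-4) of the UTF-8 sequence starting at s,
 * or 0 if the bytes there are not a well-formed sequence.
 * Simplified: overlong encodings and surrogates are not rejected. */
static int utf8_seq_len(const unsigned char *s, size_t avail)
{
    int len, i;

    if (avail == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                      /* plain ASCII */
    else if ((s[0] & 0xE0) == 0xC0)
        len = 2;
    else if ((s[0] & 0xF0) == 0xE0)
        len = 3;
    else if ((s[0] & 0xF8) == 0xF0)
        len = 4;
    else
        return 0;                      /* stray continuation or invalid lead */

    if ((size_t)len > avail)
        return 0;                      /* sequence truncated */
    for (i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 0;                  /* not a continuation byte */
    return len;
}

/* Hypothetical helper: number of bytes in buf that cannot be decoded
 * when the whole buffer is treated as UTF-8. */
static size_t count_invalid_utf8(const unsigned char *buf, size_t n)
{
    size_t i = 0, bad = 0;

    while (i < n) {
        int len = utf8_seq_len(buf + i, n - i);
        if (len == 0) {
            bad++;                     /* skip one byte and resynchronise */
            i++;
        } else {
            i += (size_t)len;
        }
    }
    return bad;
}
```

Given the bytes 'c' 'a' 'f' 0xE9 '!' (ISO-8859-1), count_invalid_utf8() reports one undecodable byte; given 'c' 'a' 'f' 0xC3 0xA9 '!' (the same text in UTF-8), it reports none.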
Best regards,
Michael
--
Print XML with Prince!
http://www.princexml.com