Re: [xml] Possible bug with Byte Order Marks

From: Daniel Veillard <veillard redhat com>
To: Mark Itzcovitz <mark itzcovitz ntlworld com>
Cc: xml gnome org
Subject: Re: [xml] Possible bug with Byte Order Marks
Date: Tue, 3 Jun 2003 10:16:48 -0400

On Tue, Jun 03, 2003 at 03:04:13PM +0100, Mark Itzcovitz wrote:

On reflection, I think I'm wrong in what I say above. RFC2781 is about MIME
types - the xml spec seems to say that the encoding declaration should just
say UTF-16 and is used in conjunction with the BOM.


  okay,

  That should work, with the caveat you saw. I'm a bit concerned about
the requirement to add a field to record the encoding; this should already
be stored somewhere on the context or in the inputStream block.


The encoding from the encoding declaration is stored, but I can't see where
the encoding derived from xmlDetectCharEncoding is stored.


  okay, now the problem is that it's not parser information but entity
information, so ideally this should be saved in the input structure
block. But I think the encoding="" value is always finer grained than
the result of xmlDetectCharEncoding, except in the case of UTF-16.
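
To illustrate why UTF-16 is the exception: the declared name "UTF-16" does not
state a byte order, while the byte order mark at the start of the entity does.
The following is a minimal sketch of BOM-based detection, not libxml2's
implementation. The names sketch_encoding and sketch_detect_utf16_bom are
hypothetical and exist only for this example.

```c
#include <stddef.h>

/* Hypothetical names for this sketch; these are not libxml2 identifiers. */
typedef enum {
    SKETCH_ENC_UNKNOWN = 0,
    SKETCH_ENC_UTF16LE,
    SKETCH_ENC_UTF16BE
} sketch_encoding;

/*
 * Inspect the first two bytes of an entity for a UTF-16 byte order mark.
 * Simplification: a real detector must also rule out UTF-32, whose
 * little-endian BOM (FF FE 00 00) starts with the same two bytes.
 */
sketch_encoding sketch_detect_utf16_bom(const unsigned char *buf, size_t len)
{
    if (buf == NULL || len < 2)
        return SKETCH_ENC_UNKNOWN;
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return SKETCH_ENC_UTF16BE;   /* U+FEFF stored big endian */
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return SKETCH_ENC_UTF16LE;   /* U+FEFF stored little endian */
    return SKETCH_ENC_UNKNOWN;
}
```

The detected value carries one bit of information (the byte order) that the
string "UTF-16" alone cannot supply, which is why it would need to be kept
somewhere for the serializer to reproduce the input byte order.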

  That seems one way. The other way would be, when just "UTF-16" is
passed, to actually serialize a BOM on output to keep something similar,
except that we would always dump big endian.
  Either solution should work; the second one is slightly more
conservative.
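
The second option (always emit a BOM and always write big endian) can be
sketched as follows. This is an illustration only, not the libxml2 output
code. The function name sketch_write_utf16be_with_bom is hypothetical, and
the input is assumed to be already available as UTF-16 code units.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: write a BOM followed by the given UTF-16 code units,
 * always in big-endian byte order. Returns the number of bytes written,
 * or 0 if the output buffer is missing or too small.
 */
size_t sketch_write_utf16be_with_bom(const uint16_t *units, size_t n_units,
                                     unsigned char *out, size_t out_size)
{
    size_t i, pos = 0;

    /* Guard against overflow when computing 2 + 2 * n_units. */
    if (out == NULL || n_units > (SIZE_MAX - 2) / 2)
        return 0;
    if (out_size < 2 + 2 * n_units)
        return 0;

    /* BOM U+FEFF, high byte first. */
    out[pos++] = 0xFE;
    out[pos++] = 0xFF;

    for (i = 0; i < n_units; i++) {
        out[pos++] = (unsigned char)(units[i] >> 8);    /* high byte */
        out[pos++] = (unsigned char)(units[i] & 0xFF);  /* low byte */
    }
    return pos;
}
```

With this approach the output is always self-describing, at the cost of not
preserving a little-endian input's byte order.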


Returning to my original query, which was that xmlDocDumpMemory and
xmlDocFormatDump don't work correctly for "UTF-16", and having looked more
closely at the code for those functions, I think that my proposed changes
have too broad a scope. I can see a different solution that can easily be
applied to those two functions but I am confused by what seems to me to be
an inconsistency, as follows:

A call to xmlFindCharEncodingHandler for "UTF-16" fails.
A call to xmlParseCharEncoding for "UTF-16" followed by a call to
xmlGetCharEncodingHandler returns the handler for XML_CHAR_ENCODING_UTF16LE.


  The problem is that you add some state information. If you can keep this in
the local variables of the serialization routine, then that's fine.

The two Dump functions call xmlParseCharEncoding followed by
xmlFindCharEncodingHandler. I propose putting a call to
xmlGetCharEncodingHandler (using the result from the call to
xmlParseCharEncoding), and only calling Find if the Get fails. This is
hopefully a safe change.
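
The proposed lookup order can be modelled with a small self-contained toy.
This is not libxml2 code: every toy_* name below is hypothetical, and the
tables only mimic the behaviour reported in this thread (the by-name lookup
does not know "UTF-16", while the parse-then-get path resolves it to the
UTF-16LE handler). In libxml2 the corresponding calls would be
xmlParseCharEncoding, xmlGetCharEncodingHandler and
xmlFindCharEncodingHandler.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-ins; none of these are libxml2 identifiers. */
typedef enum {
    TOY_ENC_ERROR = -1,
    TOY_ENC_NONE = 0,
    TOY_ENC_UTF8,
    TOY_ENC_UTF16LE,
    TOY_ENC_UTF16BE
} toy_encoding;

typedef struct { const char *name; } toy_handler;

static const toy_handler toy_utf8    = { "UTF-8" };
static const toy_handler toy_utf16le = { "UTF-16LE" };
static const toy_handler toy_utf16be = { "UTF-16BE" };
static const toy_handler toy_koi8r   = { "KOI8-R" };  /* known by name only */

/* Models the parse step: encoding name -> enum value. */
toy_encoding toy_parse_encoding(const char *name)
{
    if (name == NULL) return TOY_ENC_NONE;
    if (strcmp(name, "UTF-8") == 0) return TOY_ENC_UTF8;
    if (strcmp(name, "UTF-16") == 0) return TOY_ENC_UTF16LE; /* as reported */
    return TOY_ENC_ERROR;
}

/* Models the get step: enum value -> handler. */
const toy_handler *toy_get_handler(toy_encoding enc)
{
    switch (enc) {
    case TOY_ENC_UTF8:    return &toy_utf8;
    case TOY_ENC_UTF16LE: return &toy_utf16le;
    case TOY_ENC_UTF16BE: return &toy_utf16be;
    default:              return NULL;
    }
}

/* Models the find step: lookup by registered name; "UTF-16" is absent. */
const toy_handler *toy_find_handler(const char *name)
{
    static const toy_handler *const registered[] = {
        &toy_utf8, &toy_utf16le, &toy_utf16be, &toy_koi8r
    };
    size_t i;

    if (name == NULL) return NULL;
    for (i = 0; i < sizeof registered / sizeof registered[0]; i++)
        if (strcmp(registered[i]->name, name) == 0)
            return registered[i];
    return NULL;
}

/* Proposed order: try get on the parsed value first, find only on failure. */
const toy_handler *toy_lookup_handler(const char *name)
{
    const toy_handler *h = toy_get_handler(toy_parse_encoding(name));
    if (h == NULL)
        h = toy_find_handler(name);
    return h;
}
```

In this model toy_find_handler("UTF-16") returns NULL while
toy_lookup_handler("UTF-16") returns the UTF-16LE handler, and names known
only to the by-name table still resolve through the fallback.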


  Hum, sounds better. Could you provide a patch?

    thanks,

Daniel

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
