RE: [xml] Possible bug with Byte Order Marks
- From: "Mark Itzcovitz" <mark itzcovitz ntlworld com>
- To: <xml gnome org>
- Subject: RE: [xml] Possible bug with Byte Order Marks
- Date: Tue, 3 Jun 2003 15:04:13 +0100
> -----Original Message-----
> From: Daniel Veillard [mailto:veillard redhat com]
> Sent: 02 June 2003 18:22
> To: Mark Itzcovitz
> Cc: xml gnome org
>
> On Mon, Jun 02, 2003 at 06:07:00PM +0100, Mark Itzcovitz wrote:
> > Looking at RFC2781 I think the software producing the document is at fault
> > (XmlTextWriter in the MS .NET framework). It obviously knows it is
> > outputting little-endian since it puts in the appropriate BOM, so it SHOULD
> > specify UTF-16LE, not just UTF-16. However, the RFC does only say SHOULD,
> > not MUST.
>
> okay, thanks for looking into this.
>
On reflection, I think what I said above is wrong. RFC2781 is about MIME
types; the XML spec seems to say that the encoding declaration should just
say UTF-16 and is used in conjunction with the BOM.

> > Having said that, it would still be nice for libxml to cope with it. I'm not
> > sure about defaulting to Windows endianness though - the RFC seems to say
> > that the default should be big endian. Therefore I've developed a patch that
> > uses the endianness found from the BOM to modify the declared encoding held
> > in the context. I realise that this is a potential problem since the
> > original declared encoding is no longer available - does anyone think that
> > this will be a problem in practice? Also, this may break code that is
> > reading the document, not using libxml2, and doesn't know about UTF-16LE/BE.
>
> Hum, right
>
> > The outline of the patch is as follows:
> >
> > Add a field extCharset to the context structure to hold the xmlCharEncoding
> > worked out from the BOM, if any.
> > Set this field in xmlParseDocument to the value returned by
> > xmlDetectCharEncoding.
> > Add LE or BE onto the end of the encoding name before the final return in
> > xmlParseEncName if the name is UTF-16 or UTF16 (any case) and extCharset is
> > set to UTF16LE/BE.
> >
> > If you think this is the way to go, I'll submit details of the patch.
>
> That should work, with the caveat you saw. I'm a bit concerned about
> the requirement to add a field to record the encoding, this should already
> be stored somewhere on the context or in the inputStream block.
The encoding from the encoding declaration is stored, but I can't see where
the encoding derived from xmlDetectCharEncoding is stored.
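
For illustration, the BOM-based renaming outlined in the patch description
might look roughly like the sketch below. It is standalone C and does not use
libxml2: bom_kind, detect_utf16_bom and qualify_encoding_name are hypothetical
names invented for this example, standing in for the proposed extCharset
field, the result of xmlDetectCharEncoding, and the change to xmlParseEncName
respectively. It only looks for the two UTF-16 marks and ignores UTF-32.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the byte order learned from the BOM. */
typedef enum { BOM_NONE, BOM_UTF16LE, BOM_UTF16BE } bom_kind;

/* Inspect the first two bytes of the input for a UTF-16 byte order mark.
 * Assumption: UTF-32 input (whose LE mark also starts FF FE) is not handled. */
bom_kind detect_utf16_bom(const unsigned char *buf, size_t len)
{
    if (buf == NULL || len < 2)
        return BOM_NONE;
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16LE;
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16BE;
    return BOM_NONE;
}

/* Compare two ASCII strings ignoring case; returns 1 when they are equal. */
int ascii_equal_nocase(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* Copy the declared encoding name into out, appending "LE" or "BE" when the
 * name is UTF-16 or UTF16 (any case) and a BOM was seen.
 * Returns 0 on success, -1 if out is too small. */
int qualify_encoding_name(const char *declared, bom_kind bom,
                          char *out, size_t outlen)
{
    const char *suffix = "";
    int n;

    assert(declared != NULL && out != NULL);

    if (ascii_equal_nocase(declared, "UTF-16") ||
        ascii_equal_nocase(declared, "UTF16")) {
        if (bom == BOM_UTF16LE)
            suffix = "LE";
        else if (bom == BOM_UTF16BE)
            suffix = "BE";
    }

    n = snprintf(out, outlen, "%s%s", declared, suffix);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

With a declared name of "UTF-16" and a little-endian mark this yields
"UTF-16LE"; any other declared name is copied through unchanged. Note that the
sketch has the same caveat as the patch: the caller no longer sees the name as
originally declared.
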
> Seems one way, the other way would be in case of just "UTF-16" being passed
> to actually serialize a BOM on output to keep something similar, except
> we would always dump big endian.
> Either solution should work, the second one is slightly more conservative.
>
Returning to my original query, which was that xmlDocDumpMemory and
xmlDocFormatDump don't work correctly for "UTF-16", and having looked more
closely at the code for those functions, I think that my proposed changes
have too broad a scope. I can see a different solution that can easily be
applied to those two functions but I am confused by what seems to me to be
an inconsistency, as follows:

  - A call to xmlFindCharEncodingHandler for "UTF-16" fails.
  - A call to xmlParseCharEncoding for "UTF-16" followed by a call to
    xmlGetCharEncodingHandler returns the handler for
    XML_CHAR_ENCODING_UTF16LE.

The two Dump functions call xmlParseCharEncoding followed by
xmlFindCharEncodingHandler. I propose putting in a call to
xmlGetCharEncodingHandler first (using the result from the call to
xmlParseCharEncoding), and only calling Find if the Get fails. This is
hopefully a safe change.
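
The proposed lookup order can be sketched as below. This is a standalone model
in plain C, not libxml2 code: enc_id, enc_handler, model_parse_encoding,
model_get_handler and model_find_handler are hypothetical stand-ins for
xmlCharEncoding, the handler structure, xmlParseCharEncoding,
xmlGetCharEncodingHandler and xmlFindCharEncodingHandler, and the small tables
only reproduce the behaviour described in this message (Find fails for
"UTF-16", Parse followed by Get gives the little-endian handler). The real
functions know many more encodings and aliases.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the library's encoding enum and handler type. */
typedef enum { ENC_ERROR = -1, ENC_NONE = 0, ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE } enc_id;
typedef struct { const char *name; } enc_handler;

/* Toy registry of handlers; bare "UTF-16" is deliberately absent. */
static const enc_handler registry[] = {
    { "UTF-8" }, { "UTF-16LE" }, { "UTF-16BE" }, { "ISO-8859-1" }
};

/* Models the name-to-enum step. "UTF-16" maps to the little-endian value,
 * matching the behaviour reported in the message. Exact-match only. */
enc_id model_parse_encoding(const char *name)
{
    if (name == NULL) return ENC_NONE;
    if (strcmp(name, "UTF-8") == 0) return ENC_UTF8;
    if (strcmp(name, "UTF-16") == 0) return ENC_UTF16LE;
    if (strcmp(name, "UTF-16LE") == 0) return ENC_UTF16LE;
    if (strcmp(name, "UTF-16BE") == 0) return ENC_UTF16BE;
    return ENC_NONE; /* not representable in the enum */
}

/* Models the enum-to-handler step. */
const enc_handler *model_get_handler(enc_id id)
{
    switch (id) {
    case ENC_UTF8:    return &registry[0];
    case ENC_UTF16LE: return &registry[1];
    case ENC_UTF16BE: return &registry[2];
    default:          return NULL;
    }
}

/* Models the name-to-handler step: an exact lookup in the registry. */
const enc_handler *model_find_handler(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof registry / sizeof registry[0]; i++) {
        if (strcmp(registry[i].name, name) == 0)
            return &registry[i];
    }
    return NULL;
}

/* Proposed order: try Get on the parsed enum first, fall back to Find by
 * name only when Get yields nothing. */
const enc_handler *lookup_handler(const char *name)
{
    const enc_handler *h = NULL;
    enc_id id;

    assert(name != NULL);

    id = model_parse_encoding(name);
    if (id != ENC_ERROR && id != ENC_NONE)
        h = model_get_handler(id);
    if (h == NULL)
        h = model_find_handler(name);
    return h;
}
```

In this model, looking up "UTF-16" by name alone fails, while lookup_handler
resolves it to the UTF-16LE handler; names the enum does not cover, such as
"ISO-8859-1", still go through the by-name fallback as before.
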