RE: [xml] Possible bug with Byte Order Marks




-----Original Message-----
From: Daniel Veillard [mailto:veillard redhat com]
Sent: 02 June 2003 18:22
To: Mark Itzcovitz
Cc: xml gnome org

On Mon, Jun 02, 2003 at 06:07:00PM +0100, Mark Itzcovitz wrote:
Looking at RFC2781 I think the software producing the document is at fault
(XmlTextWriter in the MS .NET framework). It obviously knows it is
outputting little-endian since it puts in the appropriate BOM, so it SHOULD
specify UTF-16LE, not just UTF-16. However, the RFC does only say SHOULD,
not MUST.

  okay, thanks for looking into this.


On reflection, I think I'm wrong in what I say above. RFC2781 is about MIME
types - the xml spec seems to say that the encoding declaration should just
say UTF-16 and is used in conjunction with the BOM.
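(For illustration, the document in question starts with the little-endian BOM
bytes 0xFF 0xFE and then, in UTF-16LE code units, a declaration along the
lines of <?xml version="1.0" encoding="UTF-16"?>, so a reader has to take the
byte order from the BOM rather than from the declared name.)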


Having said that, it would still be nice for libxml to cope with it. I'm
not sure about defaulting to Windows endianness though - the RFC seems to
say that the default should be big endian. Therefore I've developed a patch
that uses the endianness found from the BOM to modify the declared encoding
held in the context. I realise that this is a potential problem since the
original declared encoding is no longer available - does anyone think that
this will be a problem in practice? Also, this may break code that is
reading the document, not using libxml2, and doesn't know about UTF-16LE/BE.

  Hum, right

The outline of the patch is as follows:

Add a field extCharset to the context structure to hold the xmlCharEncoding
worked out from the BOM, if any.
Set this field in xmlParseDocument to the value returned by
xmlDetectCharEncoding.
Add LE or BE onto the end of the encoding name before the final return in
xmlParseEncName if the name is UTF-16 or UTF16 (any case) and extCharset is
set to UTF16LE/BE.
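
As a rough sketch, the last step would look something like this at the end
of xmlParseEncName (extCharset is the proposed new field; the buffer name
and exact placement are illustrative only):

    /* Once the encoding name has been parsed into `buf`: if it is a bare
     * "UTF-16"/"UTF16" and the BOM told us the real byte order, append
     * LE or BE so that later handler lookups resolve unambiguously. */
    if ((xmlStrcasecmp(buf, BAD_CAST "UTF-16") == 0) ||
        (xmlStrcasecmp(buf, BAD_CAST "UTF16") == 0)) {
        if (ctxt->extCharset == XML_CHAR_ENCODING_UTF16LE)
            buf = xmlStrcat(buf, BAD_CAST "LE");
        else if (ctxt->extCharset == XML_CHAR_ENCODING_UTF16BE)
            buf = xmlStrcat(buf, BAD_CAST "BE");
    }
    return(buf);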

If you think this is the way to go, I'll submit details of the patch.


  That should work, with the caveat you saw. I'm a bit concerned about
the requirement to add a field to record the encoding; this should already
be stored somewhere on the context or in the inputStream block.

The encoding from the encoding declaration is stored, but I can't see where
the encoding derived from xmlDetectCharEncoding is stored.

  Seems one way, the other way would be in case of just "UTF-16" being
passed to actually serialize a BOM on output to keep something similar,
except we would always dump big endian.
  Either solution should work, the second one is slightly more conservative.
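
In outline, that second approach would amount to something like the
following in the save path (names such as out_buf are placeholders, and the
exact placement is not worked out):

    /* When the caller asks for plain "UTF-16", pick the big-endian
     * handler and emit a BOM so readers can still work out the byte
     * order of the output. */
    if (xmlStrcasecmp(BAD_CAST encoding, BAD_CAST "UTF-16") == 0) {
        handler = xmlGetCharEncodingHandler(XML_CHAR_ENCODING_UTF16BE);
        /* U+FEFF, written here as UTF-8; the UTF-16BE encoder attached
         * to the buffer turns it into the bytes 0xFE 0xFF at the start. */
        xmlOutputBufferWrite(out_buf, 3, "\xEF\xBB\xBF");
    }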


Returning to my original query, which was that xmlDocDumpMemory and
xmlDocFormatDump don't work correctly for "UTF-16", and having looked more
closely at the code for those functions, I think that my proposed changes
have too broad a scope. I can see a different solution that can easily be
applied to those two functions but I am confused by what seems to me to be
an inconsistency, as follows:

A call to xmlFindCharEncodingHandler for "UTF-16" fails.
A call to xmlParseCharEncoding for "UTF-16" followed by a call to
xmlGetCharEncodingHandler returns the handler for XML_CHAR_ENCODING_UTF16LE.
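
Put as a fragment (illustrating the observation above, not actual test code):

    /* The name-based lookup fails for the bare name... */
    xmlCharEncodingHandlerPtr h1 = xmlFindCharEncodingHandler("UTF-16");
    /* h1 == NULL */

    /* ...while going through the xmlCharEncoding enum succeeds and
     * hands back the little-endian handler. */
    xmlCharEncodingHandlerPtr h2 =
        xmlGetCharEncodingHandler(xmlParseCharEncoding("UTF-16"));
    /* h2 == the XML_CHAR_ENCODING_UTF16LE handler */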

The two Dump functions call xmlParseCharEncoding followed by
xmlFindCharEncodingHandler. I propose adding a call to
xmlGetCharEncodingHandler (using the result from the call to
xmlParseCharEncoding) before that, and only calling
xmlFindCharEncodingHandler if the Get fails. This is hopefully a safe
change.
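
In outline (variable names are illustrative), the changed lookup in those
two functions would be:

    /* Resolve the requested encoding: try the enum-based lookup first,
     * which maps "UTF-16" to the UTF-16LE handler, and only fall back
     * to the name-based lookup if that fails. */
    xmlCharEncoding enc = xmlParseCharEncoding(encoding);
    xmlCharEncodingHandlerPtr handler = xmlGetCharEncodingHandler(enc);
    if (handler == NULL)
        handler = xmlFindCharEncodingHandler(encoding);
    if (handler == NULL) {
        /* unsupported encoding: error out as the current code does */
    }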


