
RE: [xml] Possible bug with Byte Order Marks



> -----Original Message-----
> From: Daniel Veillard [mailto:veillard redhat com]
> Sent: 02 June 2003 18:22
> To: Mark Itzcovitz
> Cc: xml gnome org
> 
> On Mon, Jun 02, 2003 at 06:07:00PM +0100, Mark Itzcovitz wrote:
> > Looking at RFC2781 I think the software producing the document is at fault
> > (XmlTextWriter in the MS .NET framework). It obviously knows it is
> > outputting little-endian since it puts in the appropriate BOM, so it SHOULD
> > specify UTF-16LE, not just UTF-16. However, the RFC does only say SHOULD,
> > not MUST.
> 
>   okay, thanks for looking into this.
> 

On reflection, I think I'm wrong in what I say above. RFC2781 is about MIME
types - the xml spec seems to say that the encoding declaration should just
say UTF-16 and is used in conjunction with the BOM.


> > Having said that, it would still be nice for libxml to cope with it. I'm not
> > sure about defaulting to Windows endianness though - the RFC seems to say
> > that the default should be big endian. Therefore I've developed a patch that
> > uses the endianness found from the BOM to modify the declared encoding held
> > in the context. I realise that this is a potential problem since the
> > original declared encoding is no longer available - does anyone think that
> > this will be a problem in practice? Also, this may break code that is
> > reading the document, not using libxml2, and doesn't know about UTF-16LE/BE.
> 
>   Hum, right
> 
> > The outline of the patch is as follows:
> >
> > Add a field extCharset to the context structure to hold the xmlCharEncoding
> > worked out from the BOM, if any.
> > Set this field in xmlParseDocument to the value returned by
> > xmlDetectCharEncoding.
> > Add LE or BE onto the end of the encoding name before the final return in
> > xmlParseEncName if the name is UTF-16 or UTF16 (any case) and extCharset is
> > set to UTF16LE/BE.
> >
> > If you think this is the way to go, I'll submit details of the patch.
> 

>   That should work, with the caveat you saw. I'm a bit concerned about
> the requirement to add a field to record the encoding, this should already
> be stored somewhere on the context or in the inputStream block.

The encoding from the encoding declaration is stored, but I can't see where
the encoding derived from xmlDetectCharEncoding is stored.
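
To make the outline above concrete, here is a minimal standalone sketch of
the idea, not the patch itself. It does not use libxml2 at all: BomKind,
detect_utf16_bom and qualify_utf16_name are hypothetical names standing in
for the extCharset field, xmlDetectCharEncoding and the change to
xmlParseEncName. It only looks at the two-byte UTF-16 marks and ignores
UTF-32.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the value that would be kept in extCharset. */
typedef enum { BOM_NONE, BOM_UTF16LE, BOM_UTF16BE } BomKind;

/* Inspect the first two bytes of the input for a UTF-16 byte order mark. */
static BomKind detect_utf16_bom(const unsigned char *buf, size_t len)
{
    if (buf != NULL && len >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return BOM_UTF16LE;
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return BOM_UTF16BE;
    }
    return BOM_NONE;
}

/* Case-insensitive comparison of two ASCII strings. */
static int equals_ignore_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/*
 * Copy the declared encoding name into out, appending "LE" or "BE" when the
 * name is UTF-16 or UTF16 (any case) and a BOM was seen. Any other name, or
 * no BOM, is copied unchanged. Returns 0 on success, -1 if out is too small.
 */
static int qualify_utf16_name(const char *declared, BomKind bom,
                              char *out, size_t outlen)
{
    const char *suffix = "";
    int n;

    if (equals_ignore_case(declared, "UTF-16") ||
        equals_ignore_case(declared, "UTF16")) {
        if (bom == BOM_UTF16LE)
            suffix = "LE";
        else if (bom == BOM_UTF16BE)
            suffix = "BE";
    }
    n = snprintf(out, outlen, "%s%s", declared, suffix);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

The declared name is left untouched and the qualified name is written to a
separate buffer, which would avoid losing the original declaration.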

>   Seems one way, the other way would be in case of just "UTF-16" being passed
> to actually serialize a BOM on output to keep something similar, except
> we would always dump big endian.
>   Either solution should work, the second one is slightly more conservative.
> 
> 

Returning to my original query, which was that xmlDocDumpMemory and
xmlDocFormatDump don't work correctly for "UTF-16": having looked more
closely at the code for those functions, I think that my proposed changes
have too broad a scope. I can see a different solution that can easily be
applied to those two functions, but I am confused by what seems to me to be
an inconsistency, as follows:

A call to xmlFindCharEncodingHandler for "UTF-16" fails.
A call to xmlParseCharEncoding for "UTF-16" followed by a call to
xmlGetCharEncodingHandler returns the handler for XML_CHAR_ENCODING_UTF16LE.

The two Dump functions call xmlParseCharEncoding followed by
xmlFindCharEncodingHandler. I propose adding a call to
xmlGetCharEncodingHandler (using the result from the call to
xmlParseCharEncoding), and only calling xmlFindCharEncodingHandler if
xmlGetCharEncodingHandler fails. This is hopefully a safe change.





