RE: [xml] Possible bug with Byte Order Marks

I also have a query about documents that have a BOM and where the encoding
declaration specifies UTF-16 (not UTF-16LE or UTF-16BE). There is no problem
reading in the document, but xmlDocDumpMemory fails, I think because it
can't decide what encoding handler to use. xmlDocFormatDump doesn't fail,
but outputs the document in UTF-8.

Should an encoding declaration of UTF-16 work?

  yes, and default to Windows endianness since they are the main users of
UTF16, I take patches !

Looking at RFC2781 I think the software producing the document is at fault
(XmlTextWriter in the MS .NET framework). It obviously knows it is
outputting little-endian since it puts in the appropriate BOM, so it SHOULD
specify UTF-16LE, not just UTF-16. However, the RFC does only say SHOULD,
not MUST.

Having said that, it would still be nice for libxml to cope with it. I'm not
sure about defaulting to Windows endianness though - the RFC seems to say
that the default should be big endian. Therefore I've developed a patch that
uses the endianness found from the BOM to modify the declared encoding held
in the context. I realise that this is a potential problem since the
original declared encoding is no longer available - does anyone think that
this will be a problem in practice? Also, this may break code that reads
the document without using libxml2 and doesn't know about UTF-16LE/BE.
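
To make the BOM idea concrete, here is a minimal standalone sketch (not
libxml2 code). The names bom_kind, detect_utf16_bom and utf16_is_big_endian
are made up for illustration. It only looks at the two-octet UTF-16 marks,
and the big-endian fallback when no BOM is present is my reading of RFC 2781.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch; not libxml2's xmlCharEncoding. */
typedef enum {
    BOM_NONE = 0,
    BOM_UTF16LE,
    BOM_UTF16BE
} bom_kind;

/* Inspect the first two octets of the input for a UTF-16 byte order mark.
 * Only UTF-16 is considered here; other encodings are out of scope. */
bom_kind detect_utf16_bom(const unsigned char *buf, size_t len)
{
    if (buf == NULL || len < 2)
        return BOM_NONE;
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16LE;   /* FF FE: little-endian */
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16BE;   /* FE FF: big-endian */
    return BOM_NONE;
}

/* Decide the byte order for text labelled plain "UTF-16".
 * Returns 1 for big-endian, 0 for little-endian.
 * Assumption: with no BOM, fall back to big-endian as RFC 2781 suggests. */
int utf16_is_big_endian(const unsigned char *buf, size_t len)
{
    return detect_utf16_bom(buf, len) != BOM_UTF16LE;
}
```

With this rule a document starting FF FE is treated as little-endian
whatever the platform, which is the behaviour I am proposing instead of a
fixed Windows default.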

The outline of the patch is as follows:

1. Add a field extCharset to the context structure to hold the
   xmlCharEncoding worked out from the BOM, if any.
2. Set this field in xmlParseDocument to the value returned by
3. Add LE or BE onto the end of the encoding name before the final return
   in xmlParseEncName if the name is UTF-16 or UTF16 (any case) and
   extCharset is set to UTF16LE/BE.
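
The name-rewriting step could look roughly like the following. This is a
standalone sketch, not the actual patch: ext_charset, ascii_equal_nocase and
qualify_utf16_name are hypothetical names, and the real change in
xmlParseEncName would work on the parser's own xmlChar buffers rather than
plain char strings.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the proposed extCharset field. */
typedef enum {
    EXT_NONE = 0,
    EXT_UTF16LE,
    EXT_UTF16BE
} ext_charset;

/* Case-insensitive comparison of two ASCII strings; returns 1 if equal. */
int ascii_equal_nocase(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (toupper((unsigned char)*a) != toupper((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Copy the declared encoding name into out, appending "LE" or "BE" when
 * the name is UTF-16 or UTF16 (any case) and the BOM gave an endianness.
 * Any other name is copied unchanged.
 * Returns 0 on success, -1 if out is too small. */
int qualify_utf16_name(const char *declared, ext_charset ext,
                       char *out, size_t outsz)
{
    const char *suffix = "";
    int n;

    if (ascii_equal_nocase(declared, "UTF-16") ||
        ascii_equal_nocase(declared, "UTF16")) {
        if (ext == EXT_UTF16LE)
            suffix = "LE";
        else if (ext == EXT_UTF16BE)
            suffix = "BE";
    }

    n = snprintf(out, outsz, "%s%s", declared, suffix);
    return (n >= 0 && (size_t)n < outsz) ? 0 : -1;
}
```

Writing into a separate buffer, rather than over the declared name, is one
way to keep the original declaration available, which is the concern I
raised above.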

If you think this is the way to go, I'll submit details of the patch.

