Re: [xml] Possible bug with Byte Order Marks
- From: Daniel Veillard <veillard redhat com>
- To: Mark Itzcovitz <mark itzcovitz ntlworld com>
- Cc: xml gnome org
- Subject: Re: [xml] Possible bug with Byte Order Marks
- Date: Mon, 2 Jun 2003 13:21:39 -0400
On Mon, Jun 02, 2003 at 06:07:00PM +0100, Mark Itzcovitz wrote:
> Looking at RFC2781 I think the software producing the document is at fault
> (XmlTextWriter in the MS .NET framework). It obviously knows it is
> outputting little-endian since it puts in the appropriate BOM, so it SHOULD
> specify UTF-16LE, not just UTF-16. However, the RFC does only say SHOULD,
> not MUST.
okay, thanks for looking into this.
> Having said that, it would still be nice for libxml to cope with it. I'm not
> sure about defaulting to Windows endianness though - the RFC seems to say
> that the default should be big endian. Therefore I've developed a patch that
> uses the endianness found from the BOM to modify the declared encoding held
> in the context. I realise that this is a potential problem since the
> original declared encoding is no longer available - does anyone think that
> this will be a problem in practice? Also, this may break code that is
> reading the document, not using libxml2, and doesn't know about UTF-16LE/BE.
Hum, right
> The outline of the patch is as follows:
>
> Add a field extCharset to the context structure to hold the xmlCharEncoding
> worked out from the BOM, if any.
> Set this field in xmlParseDocument to the value returned by
> xmlDetectCharEncoding.
> Add LE or BE onto the end of the encoding name before the final return in
> xmlParseEncName if the name is UTF-16 or UTF16 (any case) and extCharset is
> set to UTF16LE/BE.
>
> If you think this is the way to go, I'll submit details of the patch.
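The renaming step in the outline quoted above can be sketched roughly as follows. This is only an illustration, not the submitted patch: `bom_kind`, `name_equals` and `qualify_encoding` are hypothetical names, and a real change would live inside xmlParseEncName and read the detected encoding from the parser context rather than take it as a parameter.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the byte order learned from the BOM. */
typedef enum { BOM_NONE, BOM_UTF16LE, BOM_UTF16BE } bom_kind;

/* Case-insensitive equality for ASCII encoding names. */
static int name_equals(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/*
 * Copy the declared encoding name into out, appending "LE" or "BE" when
 * the name is a bare UTF-16/UTF16 (any case) and the BOM gave a byte order.
 * Returns 0 on success, -1 if out is too small.
 */
static int qualify_encoding(const char *declared, bom_kind bom,
                            char *out, size_t outlen)
{
    const char *suffix = "";
    int n;

    assert(declared != NULL && out != NULL);

    if (name_equals(declared, "UTF-16") || name_equals(declared, "UTF16")) {
        if (bom == BOM_UTF16LE)
            suffix = "LE";
        else if (bom == BOM_UTF16BE)
            suffix = "BE";
    }
    n = snprintf(out, outlen, "%s%s", declared, suffix);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Names other than UTF-16/UTF16 pass through unchanged, and so does a bare UTF-16 when no BOM was seen, so the declared encoding is only rewritten when the BOM gives a byte order.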
That should work, with the caveat you saw. I'm a bit concerned about
the requirement to add a field to record the encoding; this should already
be stored somewhere on the context or in the inputStream block.
That seems like one way. The other way would be, when just "UTF-16" is
passed, to actually serialize a BOM on output to keep something similar,
except that we would always dump big endian.
Either solution should work, the second one is slightly more conservative.
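For the second option, emitting a BOM and always writing big endian might look like the minimal sketch below. The function name `write_utf16be_with_bom` and its buffer-based interface are assumptions made for illustration; libxml2's own output path goes through its encoding handlers and is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Serialize UTF-16 code units as big-endian bytes, preceded by the
 * byte order mark U+FEFF (bytes FE FF).
 * Returns the number of bytes written, or 0 if out is too small.
 */
static size_t write_utf16be_with_bom(const uint16_t *units, size_t count,
                                     unsigned char *out, size_t outlen)
{
    size_t i, pos = 0;

    assert(out != NULL && (units != NULL || count == 0));

    /* Guard against overflow when computing 2 + 2 * count. */
    if (count > (SIZE_MAX - 2) / 2)
        return 0;
    if (outlen < 2 + 2 * count)
        return 0;

    out[pos++] = 0xFE;              /* BOM, high byte first */
    out[pos++] = 0xFF;
    for (i = 0; i < count; i++) {
        out[pos++] = (unsigned char)(units[i] >> 8);    /* high byte */
        out[pos++] = (unsigned char)(units[i] & 0xFF);  /* low byte */
    }
    return pos;
}
```

A reader that honours the BOM can then decode the output even though the declaration says only "UTF-16".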
Daniel
--
Daniel Veillard | Red Hat Network https://rhn.redhat.com/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/