Re: [xml] Push-parsing Unicode with LibXML2
- From: Eric Seidel <eseidel apple com>
- To: veillard redhat com
- Cc: xml gnome org
- Subject: Re: [xml] Push-parsing Unicode with LibXML2
- Date: Mon, 13 Feb 2006 15:40:48 -0800
On Feb 13, 2006, at 3:26 PM, Daniel Veillard wrote:
On Mon, Feb 13, 2006 at 02:07:32PM -0800, Eric Seidel wrote:
I'm reading in data off the network, converting it to UTF-16, and then
passing it off to libxml2. In the example adapted from parser4, I'm
reading ASCII from a local file, expanding it to 16-bit integers
(effectively UTF-16), and then passing it to libxml2:
[...]
const unsigned BOM = 0xFEFF;
const unsigned char BOMHighByte = *(const unsigned char *)&BOM;
xmlSwitchEncoding(ctxt, BOMHighByte == 0xFF ?
XML_CHAR_ENCODING_UTF16LE : XML_CHAR_ENCODING_UTF16BE);
What did you expect to achieve that way?!?
UTF-16 is one of the encodings that an XML parser must autodetect and use:
http://www.w3.org/TR/REC-xml/#sec-guessing
What you are doing may perfectly well break the internal parser
detection. You must not use xmlSwitchEncoding() unless you're an expert
in the way libxml2's internals work. So don't do this, at least at
this stage!
Thanks for the feedback. Those calls are actually unnecessary;
removing those lines does not change anything. I left them in to give
you a full picture of our usage.
Actually, even converting to UTF-16 from the external source is just
plain broken: the XML declaration may state that this is some other
encoding, and then the actual bytes and the declared encoding will
conflict. Really not a good idea. Again, unless you really, really
know what you're doing, you should never attempt to work around the
parser's autodetection code: you're playing with the parser's
conformance to the spec, so this is on the edge of what is acceptable
from client code.
We convert everything to UTF-16, and pass around only UTF-16 strings
internally in WebKit (http://www.webkit.org). If that means we have
to also remove the encoding information from the string before
passing it into libxml (or, better yet, tell libxml to ignore it), we
can do that.
In our case, we don't want the parser to autodetect. We do all of that
already in WebKit; we'd just like to pass an already properly decoded
UTF-16 string off to libxml and let it do its magic.
In my example, it still seems that libxml falls over well before
actually reaching any XML encoding declaration. The first byte
passed seems to put the parser context into an error state. Any
thoughts on what might be causing this? Again, removing my bogus
xmlSwitchEncoding call does not change the behavior.
-eric
Daniel
--
Daniel Veillard | Red Hat http://redhat.com/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/