Re: [xml] Push-parsing Unicode with LibXML2

On Feb 13, 2006, at 3:26 PM, Daniel Veillard wrote:

On Mon, Feb 13, 2006 at 02:07:32PM -0800, Eric Seidel wrote:
I'm reading in data off the network, converting it to UTF-16, and then
passing it off to libxml2.  In the adapted parser4 example, I'm
reading ASCII from a local file, expanding each byte to a 16-bit
integer (effectively UTF-16) and then passing it to libxml2:
[...]
    const unsigned BOM = 0xFEFF;
    const unsigned char BOMHighByte = *(const unsigned char *)&BOM;
    xmlSwitchEncoding(ctxt, BOMHighByte == 0xFF ?
                      XML_CHAR_ENCODING_UTF16LE : XML_CHAR_ENCODING_UTF16BE);
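
For context, here is a minimal sketch of the push-parser flow being
described, assuming the default tree-building SAX callbacks; the chunk
buffers and lengths stand in for whatever arrives off the network:

    #include <libxml/parser.h>

    static xmlDocPtr pushParseUTF16(const char *first, int firstLen,
                                    const char *next, int nextLen)
    {
        /* NULL SAX handler selects the default tree-building callbacks.
           The first chunk should carry at least the first 4 bytes so the
           parser can run its encoding detection on them. */
        xmlParserCtxtPtr ctxt = xmlCreatePushParserCtxt(NULL, NULL,
                                                        first, firstLen,
                                                        NULL);
        if (ctxt == NULL)
            return NULL;

        /* Feed further chunks as they arrive; terminate = 0. */
        xmlParseChunk(ctxt, next, nextLen, 0);

        /* Signal end of input with terminate = 1. */
        xmlParseChunk(ctxt, NULL, 0, 1);

        xmlDocPtr doc = ctxt->myDoc;
        xmlFreeParserCtxt(ctxt);
        return doc;
    }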

  What did you expect to achieve that way?!? UTF-16 is one of the
encodings that an XML parser must autodetect and use:
  http://www.w3.org/TR/REC-xml/#sec-guessing
What you are doing may perfectly well break the internal parser
detection. You must not use xmlSwitchEncoding() unless you're an expert
in the way the libxml2 internals work, so don't do this, at least at
this stage!
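
In other words, a document that simply starts with a UTF-16 byte order
mark should be detected without any help. A minimal sketch, assuming the
default tree-building callbacks (the document bytes here are a made-up
example, and no xmlSwitchEncoding() call is made anywhere):

    #include <libxml/parser.h>

    static xmlDocPtr parseUTF16WithBOM(void)
    {
        static const unsigned char doc[] = {
            0xFF, 0xFE,                     /* UTF-16LE byte order mark */
            '<', 0, 'd', 0, '/', 0, '>', 0  /* "<d/>" as UTF-16LE units */
        };
        xmlParserCtxtPtr ctxt = xmlCreatePushParserCtxt(NULL, NULL,
                                                        (const char *)doc,
                                                        (int)sizeof(doc),
                                                        NULL);
        if (ctxt == NULL)
            return NULL;
        /* The parser detects UTF-16LE from the BOM per sec-guessing. */
        xmlParseChunk(ctxt, NULL, 0, 1);    /* end of input */
        xmlDocPtr result = ctxt->myDoc;
        xmlFreeParserCtxt(ctxt);
        return result;
    }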

Thanks for the feedback. Those calls are actually unnecessary, removing those lines does not change anything. I left them to give you a full picture of our usage.

Actually, even converting to UTF-16 from the external source is just
plain broken: the xml declaration may state that the document is in some
other encoding, and then the actual bytes and the declared encoding will
conflict. Really not a good idea. Again, unless you really, really know
what you're doing, you should never attempt to work around the parser
autodetection code: you're playing with the conformance of the parser to
the spec, so this is on the edge of what is acceptable from client code.
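
To make the conflict concrete: a document re-encoded to UTF-16 whose
prolog still reads

    <?xml version="1.0" encoding="ISO-8859-1"?>

declares one encoding while being delivered in another, and a conforming
parser will see that the declared encoding and the actual bytes disagree.
(The encoding name here is just an illustrative example.)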

We convert everything to UTF-16 and pass around only UTF-16 strings internally in WebKit (http://www.webkit.org). If that means we also have to remove the encoding information from the string before passing it into libxml (or, better yet, tell libxml to ignore it), we can do that.

In our case, we don't want the parser to autodetect; we do all that already in WebKit. We'd just like to pass an already properly decoded UTF-16 string off to libxml and let it do its magic.

In my example it still seems that libxml falls over well before actually reaching any xml encoding declaration. The first byte passed seems to put the parser context into an error state. Any thoughts on what might be causing this? Again, removing my bogus xmlSwitchEncoding() call does not change the behavior.
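
One way to narrow this down (assuming a libxml2 new enough to have the
structured-error API) is to inspect the last error after each chunk
instead of only the xmlParseChunk() return code:

    #include <libxml/xmlerror.h>   /* xmlGetLastError(), xmlErrorPtr */
    #include <stdio.h>

    /* ctxt, chunk, and len as in the push loop sketched earlier. */
    int err = xmlParseChunk(ctxt, chunk, len, 0);
    if (err != 0) {
        xmlErrorPtr e = xmlGetLastError();
        if (e != NULL)
            fprintf(stderr, "libxml2 error %d: %s\n", e->code, e->message);
    }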

-eric

Daniel

--
Daniel Veillard      | Red Hat http://redhat.com/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
