Re: [xml] Push-parsing Unicode with LibXML2




On Feb 14, 2006, at 12:33 AM, Daniel Veillard wrote:

On Mon, Feb 13, 2006 at 03:40:48PM -0800, Eric Seidel wrote:
We convert everything to UTF16, and pass around only UTF16 strings
internally in WebKit (http://www.webkit.org).  If that means we have
to also remove the encoding information from the string before
passing it into libxml (or better yet, tell libxml to ignore it) we
can do that.

In our case, we don't want the parser to autodetect.  We do all that
already in WebKit, we'd just like to pass an already properly decoded
utf16 string off to libxml and let it do its magic.

In my example it still seems that libxml falls over well before
actually reaching any xml encoding declaration.  The first byte
passed seems to put the parser context into an error state.  Any
thoughts on what might be causing this?  Again, removing my bogus
xmlSwitchEncoding call does not change the behavior.

  First thing I notice is that you pass one byte at a time. At best
this is just massively inefficient; at worst you're hitting a bug.
The source from parse4.c does not do this.
Also, if you have converted to a memory string, why do you need to use
progressive parsing? If the conversion is progressive, I still doubt
it delivers data byte by byte; just pass the blocks as they are converted.

So I found the bug in my original code:

        unsigned unicode = chars[0];
        xmlParseChunk(ctxt, (const char *)&unicode, sizeof(unsigned), 0);

Notice I'm converting each character to an "unsigned" (4 bytes) instead of a "short" (2 bytes). That was (understandably) confusing libxml.
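
For the record, the fixed call now passes 2-byte code units, and whole blocks at a time per your suggestion. Roughly this (just a sketch; parseUTF16Chunk, chars, and length are my local names, not libxml API):

    #include <libxml/parser.h>

    /* Sketch: feed an already-decoded UTF-16 buffer to the push parser in
       one block, two bytes per code unit, rather than widening each
       character to four bytes and passing it alone. */
    static void parseUTF16Chunk(xmlParserCtxtPtr ctxt,
                                const unsigned short *chars, int length)
    {
        /* length is in UTF-16 code units; xmlParseChunk wants a byte count. */
        xmlParseChunk(ctxt, (const char *)chars,
                      length * (int)sizeof(unsigned short), 0);
    }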


So now that I have that resolved, all XML is *working* again, except for documents that explicitly specify an encoding in the XML declaration:
<?xml version="1.0" encoding="iso-8859-1"?>

If any encoding other than utf-16 is explicitly specified, libxml falls over: the encoding="iso-8859-1" attribute overrides the utf-16 that it had previously (correctly) detected.


So let me revise my question:

I'm now looking for a way to make libxml ignore the encoding="iso-8859-1" attribute and instead rely on the utf-16 it autodetected (or that I specify manually).


Again, our web engine (WebKit -- http://www.webkit.org/) handles all strings internally as UTF-16 (I believe we do this because JavaScript methods require utf-16 access to string data). We autodetect encodings (in a similar manner to libxml), decode, and then pass utf-16 data off to our tokenizers (in this case, libxml).

I'd like a clean way to force libxml2 to always treat my input data as utf-16, regardless of what encoding="" attribute it finds. (I have to imagine there is already a way to do this, based on, say, http content-encoding headers?) I have not yet found such a method.
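
For concreteness, what I've been experimenting with looks roughly like this (only a sketch; the sax/userData arguments are placeholders, and I'm not sure xmlSwitchEncoding is the intended mechanism here, since the declaration's encoding still seems to win later on):

    #include <libxml/parser.h>
    #include <libxml/parserInternals.h>
    #include <libxml/encoding.h>

    /* Sketch: create a push parser context and try to pin its input
       encoding to UTF-16LE up front, hoping a later encoding="..." in the
       XML declaration won't switch it away.  sax/userData are placeholders. */
    static xmlParserCtxtPtr createUTF16PushContext(xmlSAXHandlerPtr sax,
                                                   void *userData)
    {
        xmlParserCtxtPtr ctxt = xmlCreatePushParserCtxt(sax, userData,
                                                        NULL, 0, NULL);
        if (ctxt != NULL)
            xmlSwitchEncoding(ctxt, XML_CHAR_ENCODING_UTF16LE);
        return ctxt;
    }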


I also saw at
http://xmlsoft.org/encoding.html#extend
that you mention it might be possible to make libxml use utf-16 internally throughout. Do you know if anyone has tried?

Thanks for your help.

-eric


Daniel

--
Daniel Veillard      | Red Hat http://redhat.com/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/



