Re: [xml] Push-parsing Unicode with LibXML2

From: Eric Seidel <eseidel apple com>
To: veillard redhat com
Cc: xml gnome org
Subject: Re: [xml] Push-parsing Unicode with LibXML2
Date: Tue, 14 Feb 2006 00:45:14 -0800


On Feb 14, 2006, at 12:33 AM, Daniel Veillard wrote:

On Mon, Feb 13, 2006 at 03:40:48PM -0800, Eric Seidel wrote:

We convert everything to UTF16, and pass around only UTF16 strings
internally in WebKit (http://www.webkit.org).  If that means we have
to also remove the encoding information from the string before
passing it into libxml (or better yet, tell libxml to ignore it) we
can do that.

In our case, we don't want the parser to autodetect.  We do all that
already in WebKit; we'd just like to pass an already properly decoded
utf16 string off to libxml and let it do its magic.

In my example it still seems that libxml falls over well before
actually reaching any xml encoding declaration.  The first byte
passed seems to put the parser context into an error state.  Any
thoughts on what might be causing this?  Again, removing my bogus
xmlSwitchEncoding call does not change the behavior.


  First thing I notice is that you pass one byte at a time. At best
this is just massively inefficient; at worst you're hitting a bug.
The source from parse4.c does not do this.

Also if you have converted to a memory string, why do you need to use
progressive parsing? If the conversion is progressive, I still doubt
it delivers data byte by byte; just pass the blocks as they are converted.
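
The advice above, passing whole converted blocks rather than single bytes, can be sketched roughly as follows. This is only an illustration in plain C: `feed_in_blocks`, `chunk_sink`, `sink_stats` and `counting_sink` are hypothetical names, and the sink merely mirrors the parameter shape of xmlParseChunk(ctxt, chunk, size, terminate) with the context made opaque, so the sketch does not depend on libxml2 itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback with the same parameter shape as xmlParseChunk:
   (context, bytes, byte count, terminate flag); returns 0 on success. */
typedef int (*chunk_sink)(void *ctx, const char *chunk, int size, int terminate);

/* Feed `count` UTF-16 code units to `sink` in blocks of at most
   `block_units` code units, then send one final empty terminating chunk.
   Every chunk holds whole code units, so no 16-bit unit is ever split. */
int feed_in_blocks(const uint16_t *units, size_t count, size_t block_units,
                   chunk_sink sink, void *ctx)
{
    assert(block_units > 0);
    size_t pos = 0;
    while (pos < count) {
        size_t n = (count - pos < block_units) ? count - pos : block_units;
        int rc = sink(ctx, (const char *)(units + pos),
                      (int)(n * sizeof(uint16_t)), 0);
        if (rc != 0)
            return rc;          /* stop at the first reported error */
        pos += n;
    }
    return sink(ctx, NULL, 0, 1);   /* terminate == 1: end of input */
}

/* Example sink that only records how it was called. */
struct sink_stats { int calls; int bytes; };

int counting_sink(void *ctx, const char *chunk, int size, int terminate)
{
    struct sink_stats *s = ctx;
    (void)chunk;
    (void)terminate;
    s->calls++;
    s->bytes += size;
    return 0;
}
```

With a real parser, the sink would presumably be a thin wrapper that casts the context back to the parser context and forwards its arguments to xmlParseChunk.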


So I found the bug in my original code:

        unsigned unicode = chars[0];
        xmlParseChunk(ctxt, (const char *)&unicode, sizeof(unsigned), 0);

Notice I'm converting to an "unsigned" (4 bytes) instead of a "short". That was (understandably) confusing libxml.
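
To make the size mismatch concrete, here is a minimal sketch in plain C (not the actual WebKit code). The helper names `widen_then_copy`, `copy_code_unit` and `count_zero_bytes` are hypothetical, and `uint32_t` stands in for "unsigned" on the assumption that "unsigned" is 4 bytes on the platform in question. It shows that widening a UTF-16 code unit before handing over its bytes doubles the byte count and adds extra zero bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mimics the buggy call: the code unit is widened to 4 bytes first,
   so 4 bytes end up in the output. Returns the number of bytes written. */
size_t widen_then_copy(uint16_t unit, unsigned char *out)
{
    assert(out != NULL);
    uint32_t widened = unit;
    memcpy(out, &widened, sizeof widened);
    return sizeof widened;
}

/* Mimics the corrected call: exactly one 2-byte UTF-16 code unit. */
size_t copy_code_unit(uint16_t unit, unsigned char *out)
{
    assert(out != NULL);
    memcpy(out, &unit, sizeof unit);
    return sizeof unit;
}

/* Counts the zero bytes in the first `len` bytes of `bytes`. */
size_t count_zero_bytes(const unsigned char *bytes, size_t len)
{
    size_t zeros = 0;
    for (size_t i = 0; i < len; i++)
        if (bytes[i] == 0)
            zeros++;
    return zeros;
}
```

For the code unit 0x003C ('<'), the widened copy is four bytes long with three zero bytes, while the 2-byte copy has a single zero byte, which is what a UTF-16 decoder expects for an ASCII character.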

So now that I have that resolved, all XML is *working* again, except for XML that explicitly specifies an encoding in its declaration:

<?xml version="1.0" encoding="iso-8859-1"?>

If any encoding other than utf-16 is manually specified, libxml falls over, as the encoding="iso-8859-1" attribute overrides the utf-16 which it had previously (correctly) detected.



So let me revise my question:

I'm now looking for a way to make libxml ignore the encoding="iso-8859-1" attribute, and instead rely on the utf-16 it autodetected (or which I can manually specify).

Again, our web engine (WebKit -- http://www.webkit.org/) handles all strings internally as UTF-16 (I believe we do this because JavaScript methods require utf-16 access to string data). We autodetect encodings (in a similar manner to libxml), decode, and then pass utf-16 data off to our tokenizers (in this case, libxml).

I'd like to have a clean way to force libxml2 to always treat my input data as utf-16, regardless of what encoding="" attribute it finds. (I have to imagine there is already a way to do this based off of say http content-encoding headers for example?) I have not yet found such a method.
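
One workaround mentioned earlier in the thread is to remove the encoding information from the string before it reaches libxml. The following is a rough sketch of that idea in plain C, not a libxml2 API and not a definitive implementation. `blank_encoding_decl`, `find_ascii` and `is_space` are hypothetical names, and the sketch assumes the XML declaration sits at the very start of a UTF-16 buffer in host byte order and is complete within that buffer. It overwrites the encoding pseudo-attribute with spaces so the buffer length and all later offsets stay unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index of the first occurrence of the ASCII string `needle` in
   buf[from, to), or `to` if it does not occur. */
static size_t find_ascii(const uint16_t *buf, size_t from, size_t to,
                         const char *needle)
{
    size_t n = 0;
    while (needle[n] != '\0')
        n++;
    for (size_t i = from; i + n <= to; i++) {
        size_t j = 0;
        while (j < n && buf[i + j] == (uint16_t)(unsigned char)needle[j])
            j++;
        if (j == n)
            return i;
    }
    return to;
}

static int is_space(uint16_t c)
{
    return c == 0x20 || c == 0x09 || c == 0x0A || c == 0x0D;
}

/* Overwrites encoding="..." in a leading XML declaration with spaces.
   Returns 1 if something was blanked, 0 if the buffer was left untouched. */
int blank_encoding_decl(uint16_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t start = (len > 0 && buf[0] == 0xFEFF) ? 1 : 0;  /* skip a BOM */

    /* The buffer must begin with "<?xml" followed by whitespace. */
    if (len < start + 6
        || find_ascii(buf, start, start + 5, "<?xml") != start
        || !is_space(buf[start + 5]))
        return 0;

    size_t end = find_ascii(buf, start, len, "?>");
    if (end == len)
        return 0;               /* declaration not complete in this buffer */

    size_t enc = find_ascii(buf, start, end, "encoding");
    if (enc == end)
        return 0;               /* no encoding pseudo-attribute */

    size_t i = enc + 8;         /* just past "encoding" */
    while (i < end && is_space(buf[i]))
        i++;
    if (i >= end || buf[i] != '=')
        return 0;
    i++;
    while (i < end && is_space(buf[i]))
        i++;
    if (i >= end || (buf[i] != '"' && buf[i] != '\''))
        return 0;
    uint16_t quote = buf[i++];
    while (i < end && buf[i] != quote)
        i++;
    if (i >= end)
        return 0;               /* unterminated value */

    for (size_t k = enc; k <= i; k++)
        buf[k] = 0x20;          /* blank through the closing quote */
    return 1;
}
```

Blanking rather than deleting keeps the declaration well-formed, since whitespace is allowed before the closing "?>", and avoids having to shift the rest of the buffer.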



I also saw at:
http://xmlsoft.org/encoding.html#extend

you mention it might be possible to make libxml use all utf-16 internally. Do you know if anyone has tried?


Thanks for your help.

-eric

Daniel

--
Daniel Veillard      | Red Hat http://redhat.com/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
