Re: [xml] Push-parsing Unicode with LibXML2
- From: Eric Seidel <eseidel apple com>
- To: veillard redhat com
- Cc: xml gnome org
- Subject: Re: [xml] Push-parsing Unicode with LibXML2
- Date: Tue, 14 Feb 2006 00:45:14 -0800
On Feb 14, 2006, at 12:33 AM, Daniel Veillard wrote:
> On Mon, Feb 13, 2006 at 03:40:48PM -0800, Eric Seidel wrote:
>> We convert everything to UTF16, and pass around only UTF16 strings
>> internally in WebKit (http://www.webkit.org).  If that means we have
>> to also remove the encoding information from the string before
>> passing it into libxml (or better yet, tell libxml to ignore it), we
>> can do that.
>>
>> In our case, we don't want the parser to autodetect.  We do all that
>> already in WebKit; we'd just like to pass an already properly decoded
>> utf16 string off to libxml and let it do its magic.
>>
>> In my example it still seems that libxml falls over well before
>> actually reaching any xml encoding declaration.  The first byte
>> passed seems to put the parser context into an error state.  Any
>> thoughts on what might be causing this?  Again, removing my bogus
>> xmlSwitchEncoding call does not change the behavior.
>
>   First thing I notice is that you pass one byte at a time. At best
> this is just massively inefficient, at worst you're hitting a bug.
> The source from parse4.c does not do this.
>   Also if you have converted to a memory string, why do you need to
> use progressive parsing? If the conversion is progressive, I still
> doubt it delivers data byte by byte, just pass the blocks as they
> are converted.
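
For reference, here is roughly the block-at-a-time shape I take that to
mean; the function and buffer names below are just placeholders for
illustration, not our actual WebKit code:

    #include <stddef.h>
    #include <libxml/parser.h>

    /* Rough sketch, not our actual WebKit code: feed already-decoded
       UTF-16 to the push parser one whole block at a time, rather than
       byte by byte.  The buffer/length names are placeholders. */
    static void pushParseUTF16(const unsigned short *chars, size_t charCount)
    {
        /* No initial chunk; the filename is only used for error reporting. */
        xmlParserCtxtPtr ctxt =
            xmlCreatePushParserCtxt(NULL, NULL, NULL, 0, "noname.xml");
        if (!ctxt)
            return;

        /* Hand over the whole converted block as raw bytes. */
        xmlParseChunk(ctxt, (const char *)chars,
                      (int)(charCount * sizeof(unsigned short)), 0);

        /* Signal end of input. */
        xmlParseChunk(ctxt, NULL, 0, 1);

        if (ctxt->myDoc)
            xmlFreeDoc(ctxt->myDoc);
        xmlFreeParserCtxt(ctxt);
    }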
So I found the bug in my original code:

        unsigned unicode = chars[0];
        xmlParseChunk(ctxt, (const char *)&unicode, sizeof(unsigned), 0);

Notice I'm converting to an "unsigned" (4 bytes) instead of a "short".
That was (understandably) confusing libxml.
So now that I have that resolved, all xml is *working* again, except
for xml which manually specifies an encoding as part of:

<?xml version="1.0" encoding="iso-8859-1"?>

If any encoding other than utf-16 is manually specified, libxml falls
over, as the encoding="iso-8859-1" attribute overrides the utf-16
which it had previously (correctly) detected.
So let me revise my question:

I'm now looking for a way to make libxml ignore the
encoding="iso-8859-1" attribute, and instead rely on the utf-16 it
autodetected (or which I can manually specify).

Again, our web engine (WebKit -- http://www.webkit.org/) handles all
strings internally as UTF-16 (I believe we do this because JavaScript
methods require utf-16 access to string data).  We autodetect
encodings (in a similar manner to libxml), decode, and then pass
utf-16 data off to our tokenizers (in this case, libxml).

I'd like to have a clean way to force libxml2 to always treat my
input data as utf-16, regardless of what encoding="" attribute it
finds.  (I have to imagine there is already a way to do this based
off of, say, http content-encoding headers?)  I have not yet found
such a method.
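
The closest thing I've tried is forcing the encoding on the context
before pushing any data, along these lines (a sketch only; whether this
actually keeps the later encoding= declaration from overriding it is
exactly what I can't tell):

    #include <libxml/parser.h>
    #include <libxml/parserInternals.h>  /* xmlSwitchEncoding */
    #include <libxml/encoding.h>         /* XML_CHAR_ENCODING_UTF16LE */

    /* Sketch only: declare up front that the pushed bytes are UTF-16LE,
       before any chunks are fed in.  Whether libxml later honours this
       over an encoding="iso-8859-1" declaration is the open question. */
    static xmlParserCtxtPtr createUTF16PushContext(void)
    {
        xmlParserCtxtPtr ctxt =
            xmlCreatePushParserCtxt(NULL, NULL, NULL, 0, "noname.xml");
        if (ctxt)
            xmlSwitchEncoding(ctxt, XML_CHAR_ENCODING_UTF16LE);
        return ctxt;
    }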
I also saw at:
http://xmlsoft.org/encoding.html#extend
you mention it might be possible to make libxml use all utf-16  
internally.  Do you know if anyone has tried?
Thanks for your help.
-eric
> Daniel
> --
> Daniel Veillard      | Red Hat http://redhat.com/
> veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
> http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/