Re: [xml] xmlParseChunk with UTF-16LE fails on special occasion



On Tue, Nov 04, 2003 at 04:34:08PM +0100, Kasimier Buchcik wrote:
Hi,


Kasimier Buchcik wrote:
Hi,

To anyone who cares:

I found a workaround for my problem by doing the following:

Since I work with Delphi, the DOMString is a WideString, encoded in
UTF-16 (little-endian and *no* byte order mark). In order to parse a
DOMString representation of an XML document, regardless of its encoding
declaration, I used xmlCreatePushParserCtxt and xmlParseChunk with an
initial chunk of 0xff 0xfe followed by the first 2 bytes of the
DOMString. With this, libxml2 detects UTF-16LE from the constructed BOM
and switches to that encoding; it ignores the encoding declaration.
I'm happy that it works :-)
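
For readers who want to try the workaround described above, here is a minimal sketch in C of how the 4-byte initial chunk could be assembled. It is not the original Delphi code. The function name build_initial_chunk is hypothetical, and the libxml2 calls appear only in a comment so that the sketch compiles without the library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: builds the initial push-parser chunk from a
 * BOM-less UTF-16LE buffer. The chunk is the UTF-16LE byte order mark
 * (0xFF 0xFE) followed by up to the first 2 bytes of the text.
 * Returns the number of bytes written to out (2, 3 or 4).
 */
size_t build_initial_chunk(const unsigned char *utf16le, size_t len,
                           unsigned char out[4])
{
    size_t n = len < 2 ? len : 2;

    assert(out != NULL);
    out[0] = 0xFF;              /* BOM, low byte first (little-endian) */
    out[1] = 0xFE;
    if (n > 0)
        memcpy(out + 2, utf16le, n);
    return 2 + n;
}

/*
 * Assumed usage with libxml2 (not compiled here):
 *
 *   unsigned char init[4];
 *   size_t n = build_initial_chunk(text, text_len, init);
 *   ctxt = xmlCreatePushParserCtxt(NULL, NULL, (const char *)init,
 *                                  (int)n, NULL);
 *   ... then feed text + 2 onwards with xmlParseChunk ...
 */
```

The remainder of the string, starting at byte offset 2, would then go to xmlParseChunk, since its first 2 bytes were already consumed by the initial chunk.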

Well, don't count your chickens before they're hatched...
The push parser seems not to work in that way, or there might be a bug.

- The following does work:

1. The context is created with the first 4 bytes of the buffer.
2. A call to xmlParseChunk reads *all* the data in the buffer; i.e. the
    chunk size is big enough to read all the buffer.
3. A final call to xmlParseChunk with @terminate of 1.

- The following does *not* work:

1. The context is created with the first 4 bytes of the buffer.
2. *Multiple* calls to xmlParseChunk are made, since the chunk size is
    *not* big enough to read all the buffer at once.
3. A final call to xmlParseChunk with @terminate of 1.
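
The two sequences above differ only in how many non-terminating calls are made. A minimal sketch of the feeding loop in C follows. The names feed_fn, feed_in_chunks and count_bytes are hypothetical. The callback type is shaped like xmlParseChunk(ctxt, chunk, size, terminate), but a stand-in is used so the sketch compiles without libxml2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type, shaped like libxml2's xmlParseChunk:
 * returns 0 on success, non-zero on error. */
typedef int (*feed_fn)(void *ctxt, const char *chunk, int size,
                       int terminate);

/*
 * Feeds buf[offset .. len) in pieces of at most chunk_size bytes with
 * terminate = 0, then makes one final call with terminate = 1.
 * Returns the total number of calls made, or -1 on the first failure.
 */
int feed_in_chunks(void *ctxt, feed_fn feed, const char *buf, size_t len,
                   size_t offset, size_t chunk_size)
{
    int calls = 0;

    assert(feed != NULL);
    if (chunk_size == 0 || offset > len)
        return -1;

    while (offset < len) {
        size_t n = len - offset;
        if (n > chunk_size)
            n = chunk_size;
        if (feed(ctxt, buf + offset, (int)n, 0) != 0)
            return -1;
        offset += n;
        calls++;
    }
    /* Final call: no data, terminate set. */
    if (feed(ctxt, NULL, 0, 1) != 0)
        return -1;
    return calls + 1;
}

/* Stand-in for the parser: adds up the bytes it is given.
 * ctxt must point to a size_t counter. */
int count_bytes(void *ctxt, const char *chunk, int size, int terminate)
{
    (void)chunk;
    (void)terminate;
    *(size_t *)ctxt += (size_t)size;
    return 0;
}
```

With offset 4 (the bytes already passed at context creation), a chunk size at least as large as the remaining data gives one data call plus the terminating call, which is the working case. A smaller chunk size gives several data calls, which is the failing case reported here.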

----------

I can reproduce this with xmllint and a modified version of the file 
"libxml2\test\wap.xml":

Note that 'encoding="iso-8859-1"' was added to the prolog of the file 
"wap.xml" and the file was UTF-16LE encoded (with BOM).

P:\tests\unicodeConsole>xmllint --push wap.xml
wap.xml:12: parser error : AttValue: ' expected
         <postfield name="tp" value="wml/state/variables/parsing/1
                                                                  ^
wap.xml:12: parser error : attributes construct error
         <postfield name="tp" value="wml/state/variables/parsing/1
                                                                  ^
wap.xml:12: parser error : Couldn't find end of Start Tag postfield
         <postfield name="tp" value="wml/state/variables/parsing/1
                                                                  ^
----------

- Everything works fine if I recompile xmllint with a chunk size of 4096 
instead of 1024.

- Everything works fine if the file is *not* encoded in UTF-16 (e.g. UTF-8).

- Everything works fine if the file *is* encoded in UTF-16LE *and* the 
prolog defines an encoding of "UTF-16LE".

It seems that the push parser stumbles if all of the following hold:
1. multiple calls to xmlParseChunk are performed
2. the input is UTF-16 encoded
3. the declaration states a different encoding


I've searched the archives for this issue and found some messages
that seem to point to a comparable problem:

http://mail.gnome.org/archives/xml/2002-March/msg00012.html
http://mail.gnome.org/archives/xml/2002-January/msg00105.html
http://mail.gnome.org/archives/xml/2002-August/msg00040.html


If this is of any help: I have the *feeling* (and nothing more) that
libxml2 takes the declared encoding into account *somewhere* and
stumbles over the UTF-16 encoded buffer (sorry that I could not debug
libxml2; actually I don't even know how to debug libxml2, since the dsp
and bcb5 stuff is gone and I'm really not that much into C).
But maybe the push parser wasn't built for that kind of processing.


Anyway, this behaviour seems inconsistent to me, so I'm hoping for 
clarification.

  Can you file this in Bugzilla? Maybe this needs some global checking;
let's record the issue!

Daniel

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


