[xml] xmlParseChunk with UTF-16LE fails on special occasion
- From: Kasimier Buchcik <kbuchcik 4commerce de>
- To: <xml gnome org>
- Subject: [xml] xmlParseChunk with UTF-16LE fails on special occasion
- Date: Tue, 04 Nov 2003 16:34:08 +0100
Hi,
Kasimier Buchcik wrote:
Hi,
To anyone who cares:
I found a workaround for my problem by doing the following:
Since I work with Delphi, the DOMString is a WideString, encoded in
UTF-16 (little-endian and *no* byte order mark). In order to parse a
DOMString representation of an XML document - regardless of its encoding
declaration - I used xmlCreatePushParserCtxt and xmlParseChunk with an
initial chunk of 0xff 0xfe followed by the first 2 bytes of the
DOMString. With this, libxml2 detects the constructed UTF-16LE encoding
and switches to it. It does not care about the encoding declaration.
I'm happy that it works :-)
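For reference, the initial chunk described above could be built roughly like this. This is only a sketch: make_initial_chunk is a made-up helper name, not part of libxml2, and it assumes the string really is UTF-16LE without a BOM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a libxml2 function): fills `out` with the
 * UTF-16LE byte order mark 0xFF 0xFE followed by the first two bytes of a
 * BOM-less UTF-16LE string. Returns the number of bytes written: 4, or 2
 * if the string holds fewer than 2 bytes. */
size_t make_initial_chunk(const unsigned char *utf16le, size_t len,
                          unsigned char out[4])
{
    assert(out != NULL);

    out[0] = 0xFF;
    out[1] = 0xFE;
    if (utf16le == NULL || len < 2)
        return 2;
    memcpy(out + 2, utf16le, 2);
    return 4;
}
```

The 4 bytes would then be handed to xmlCreatePushParserCtxt as the initial chunk, and the rest of the string (starting at byte offset 2) to xmlParseChunk.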
Well, don't count your chickens before they're hatched...
The push parser does not seem to work that way, or there might be a bug.
- The following does work:
1. The context is created with the first 4 bytes of the buffer.
2. A call to xmlParseChunk reads *all* the data in the buffer; i.e. the
chunk size is big enough to read all the buffer.
3. A final call to xmlParseChunk with @terminate of 1.
- The following does *not* work:
1. The context is created with the first 4 bytes of the buffer.
2. *Multiple* calls to xmlParseChunk are made, since the chunk size is
*not* big enough to read all the buffer at once.
3. A final call to xmlParseChunk with @terminate of 1.
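The two sequences above differ only in how many data calls are made. A minimal sketch of the feeding loop, kept independent of libxml2 so it can be run on its own: chunk_sink, feed_in_chunks and tally_sink are names made up for illustration. In a real program the sink would forward to xmlParseChunk(ctxt, data, size, terminate), and the first 4 bytes would already have been passed to xmlCreatePushParserCtxt.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sink type: stands in for a wrapper around
 * xmlParseChunk(ctxt, data, size, terminate). Returns 0 on success. */
typedef int (*chunk_sink)(void *user, const char *data, int size,
                          int terminate);

/* Feeds buf[start..len) to `sink` in pieces of at most `chunk_size` bytes,
 * then makes one final call with size 0 and terminate set to 1.
 * Returns the total number of sink calls, or -1 if the sink fails. */
int feed_in_chunks(const char *buf, size_t len, size_t start,
                   size_t chunk_size, chunk_sink sink, void *user)
{
    int calls = 0;
    size_t pos = start;

    assert(buf != NULL && sink != NULL && chunk_size > 0 && start <= len);

    while (pos < len) {
        size_t n = len - pos;
        if (n > chunk_size)
            n = chunk_size;
        if (sink(user, buf + pos, (int)n, 0) != 0)
            return -1;
        pos += n;
        calls++;
    }
    if (sink(user, buf + pos, 0, 1) != 0)
        return -1;
    return calls + 1;
}

/* Example sink for dry runs: only tallies what it receives. */
struct tally {
    size_t bytes;
    int data_calls;
    int terminated;
};

int tally_sink(void *user, const char *data, int size, int terminate)
{
    struct tally *t = user;

    (void)data;
    t->bytes += (size_t)size;
    if (terminate)
        t->terminated = 1;
    else
        t->data_calls++;
    return 0;
}
```

With start set to 4, a chunk size at least as large as the rest of the buffer gives the first (working) sequence: one data call plus the terminating call. A smaller chunk size gives the second (failing) one.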
----------
I can reproduce this with xmllint and a modified version of the file
"libxml2\test\wap.xml": 'encoding="iso-8859-1"' was added to the prolog
of "wap.xml", and the file was then encoded as UTF-16LE (with BOM).
P:\tests\unicodeConsole>xmllint --push wap.xml
wap.xml:12: parser error : AttValue: ' expected
<postfield name="tp" value="wml/state/variables/parsing/1
^
wap.xml:12: parser error : attributes construct error
<postfield name="tp" value="wml/state/variables/parsing/1
^
wap.xml:12: parser error : Couldn't find end of Start Tag postfield
<postfield name="tp" value="wml/state/variables/parsing/1
^
----------
- Everything works fine if I recompile xmllint with a chunk size of 4096
instead of 1024.
- Everything works fine if the file is *not* encoded in UTF-16 (e.g. UTF-8).
- Everything works fine if the file *is* encoded in UTF-16LE *and* the
prolog defines an encoding of "UTF-16LE".
It seems that the push parser stumbles if all of the following hold:
1. multiple calls to xmlParseChunk are performed
2. the input is UTF-16 encoded
3. the declaration states a different encoding
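To build a small input that meets conditions 2 and 3, a plain ASCII document whose prolog declares some other encoding can be re-encoded as UTF-16LE with a BOM. The following is a sketch only: ascii_to_utf16le_bom is a made-up helper, and it handles 7-bit ASCII input only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: encodes a 7-bit ASCII string as UTF-16LE, preceded
 * by the byte order mark 0xFF 0xFE. Returns the number of bytes written,
 * or 0 if `out` is too small or the input contains a non-ASCII byte. */
size_t ascii_to_utf16le_bom(const char *ascii, unsigned char *out, size_t cap)
{
    size_t len, i, n = 0;

    assert(ascii != NULL && out != NULL);

    len = strlen(ascii);
    if (cap < 2 + 2 * len)
        return 0;

    out[n++] = 0xFF;
    out[n++] = 0xFE;
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)ascii[i];
        if (c > 0x7F)
            return 0;
        out[n++] = c;    /* low byte first (little-endian) */
        out[n++] = 0x00; /* high byte is zero for ASCII */
    }
    return n;
}
```

Writing the result to a file and running "xmllint --push" on it should exercise the same combination, provided the file is longer than the chunk size xmllint uses.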
I've searched the archives for this issue and found some messages that
seem to point to a comparable problem:
http://mail.gnome.org/archives/xml/2002-March/msg00012.html
http://mail.gnome.org/archives/xml/2002-January/msg00105.html
http://mail.gnome.org/archives/xml/2002-August/msg00040.html
In case it helps: I have the *feeling* (and nothing more) that
libxml2 takes the declared encoding into account *somewhere* and
stumbles over the UTF-16 encoded buffer (sorry that I could not debug
libxml2, actually I don't even know how to debug libxml2; the dsp and
bcb5 stuff is gone and I'm really not that much into "c").
But maybe the push parser wasn't built for that kind of processing.
Anyway, this behaviour seems inconsistent to me, so I'm hoping for
clarification.
Best regards,
Kasimier