"Re: [xml] xmlParseChunk with UTF-16LE fails on special occasion"



Hi,


Kasimier Buchcik wrote:

Hi,


Kasimier Buchcik wrote:
 

Hi,

To anyone who cares:

I found a workaround for my problem by doing the following:

Since I work with Delphi, the DOMString is a WideString, encoded in 
UTF-16 (little-endian and *no* byte order mark). In order to parse a 
DOMString representation of an XML document - regardless of its encoding 
declaration - I used "xmlCreatePushParserCtxt & xmlParseChunk" with an 
initial chunk of 0xff 0xfe followed by the first 2 bytes of the 
DOMString. With this, libxml2 detects the constructed UTF-16LE encoding 
and switches to it. It does not care about the encoding declaration. 
I'm happy that it works :-)
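
For readers who want to see the initial chunk spelled out, here is a
minimal sketch in C of how those 4 bytes could be assembled. The helper
name build_initial_chunk is hypothetical and not part of libxml2; it only
prepends the UTF-16LE BOM to the first code unit of the BOM-less data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* UTF-16LE byte order mark in the order the bytes appear in the stream. */
static const unsigned char UTF16LE_BOM[2] = { 0xFF, 0xFE };

/*
 * Hypothetical helper: fill `out` with the BOM followed by the first
 * UTF-16 code unit (2 bytes) of the BOM-less buffer `data`.
 * Returns the number of bytes of `data` that were consumed (2),
 * or 0 if `data` is too short to contain one code unit.
 */
static size_t build_initial_chunk(const unsigned char *data, size_t len,
                                  unsigned char out[4])
{
    assert(data != NULL && out != NULL);
    if (len < 2)
        return 0;
    memcpy(out, UTF16LE_BOM, sizeof UTF16LE_BOM);
    memcpy(out + 2, data, 2);
    return 2;
}
```

The resulting 4 bytes would be the chunk handed to xmlCreatePushParserCtxt,
and the data starting at the returned offset would go to xmlParseChunk.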
   


Well, don't count your chickens before they're hatched...
The push parser does not seem to work that way, or there might be a bug.

- The following does work:

1. The context is created with the first 4 bytes of the buffer.
2. A call to xmlParseChunk reads *all* the data in the buffer; i.e. the
   chunk size is big enough to read all the buffer.
3. A final call to xmlParseChunk with @terminate of 1.

- The following does *not* work:

1. The context is created with the first 4 bytes of the buffer.
2. *Multiple* calls to xmlParseChunk are made, since the chunk size is
   *not* big enough to read all the buffer at once.
3. A final call to xmlParseChunk with @terminate of 1.
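
The two call sequences above differ only in how many data-carrying calls
are made. As a rough sketch (not xmllint's actual code), the feeding loop
can be written against a callback that stands in for xmlParseChunk; the
names feed_fn, feed_in_chunks and count_bytes are made up for this
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for xmlParseChunk(ctxt, chunk, size, terminate). */
typedef int (*feed_fn)(void *user, const char *chunk, int size, int terminate);

/*
 * Feed buf[offset..len) in pieces of at most chunk_size bytes, then make
 * one final empty call with terminate = 1.
 * Returns the number of data-carrying calls, or -1 if a call fails.
 */
static int feed_in_chunks(const char *buf, size_t len, size_t offset,
                          size_t chunk_size, feed_fn feed, void *user)
{
    int calls = 0;

    assert(buf != NULL && feed != NULL && chunk_size > 0 && offset <= len);
    while (offset < len) {
        size_t n = len - offset;
        if (n > chunk_size)
            n = chunk_size;
        if (feed(user, buf + offset, (int)n, 0) != 0)
            return -1;
        offset += n;
        calls++;
    }
    /* Final call: no data, only the terminate flag. */
    if (feed(user, NULL, 0, 1) != 0)
        return -1;
    return calls;
}

/* Example callback: only adds up the bytes it is given. */
static int count_bytes(void *user, const char *chunk, int size, int terminate)
{
    (void)chunk;
    (void)terminate;
    *(size_t *)user += (size_t)size;
    return 0;
}
```

With an offset of 4 (the bytes already given to the context), a buffer
that fits into one chunk yields a single data call (the working case);
a smaller chunk size yields several (the failing case).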

----------

I can reproduce this with xmllint and a modified version of the file 
"libxml2\test\wap.xml":

Note that 'encoding="iso-8859-1"' was added to the prolog of the file 
"wap.xml" and the file was UTF-16LE encoded (with BOM).

P:\tests\unicodeConsole>xmllint --push wap.xml
wap.xml:12: parser error : AttValue: ' expected
        <postfield name="tp" value="wml/state/variables/parsing/1
                                                                 ^
wap.xml:12: parser error : attributes construct error
        <postfield name="tp" value="wml/state/variables/parsing/1
                                                                 ^
wap.xml:12: parser error : Couldn't find end of Start Tag postfield
        <postfield name="tp" value="wml/state/variables/parsing/1
                                                                 ^
----------

- Everything works fine if I recompile xmllint with a chunk size of 4096 
instead of 1024.

- Everything works fine if the file is *not* encoded in UTF-16 (e.g. UTF-8).

- Everything works fine if the file *is* encoded in UTF-16LE *and* the 
prolog defines an encoding of "UTF-16LE".

It seems that the push parser stumbles over something if all of the
following hold:
1. multiple calls to xmlParseChunk are performed
2. the input is UTF-16 encoded
3. the declaration states another encoding
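
To check condition 3 before feeding the parser, one could read the declared
encoding straight out of the UTF-16LE buffer. The following is only a sketch
under the assumption that the XML declaration consists of ASCII characters;
declared_encoding_utf16le is a hypothetical helper, not a libxml2 function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Skip XML white space in a narrowed (ASCII) declaration. */
static const char *skip_ws(const char *p)
{
    while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n')
        p++;
    return p;
}

/*
 * Hypothetical helper: copy the encoding name from the XML declaration of
 * a UTF-16LE buffer (with or without BOM) into `out`.
 * Returns 1 if a name was found, 0 otherwise.
 */
static int declared_encoding_utf16le(const unsigned char *buf, size_t len,
                                     char *out, size_t out_size)
{
    char ascii[256];
    size_t i = 0, n = 0;
    const char *p, *end;
    char quote;

    assert(buf != NULL && out != NULL && out_size > 0);

    /* Skip the BOM if there is one. */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        i = 2;

    /* Narrow code units to ASCII up to the '>' closing the declaration. */
    for (; i + 1 < len && n < sizeof ascii - 1; i += 2) {
        if (buf[i + 1] != 0 || buf[i] > 0x7F)
            break;
        ascii[n++] = (char)buf[i];
        if (buf[i] == '>')
            break;
    }
    ascii[n] = '\0';

    if (strncmp(ascii, "<?xml", 5) != 0)
        return 0;
    p = strstr(ascii, "encoding");
    if (p == NULL)
        return 0;
    p = skip_ws(p + strlen("encoding"));
    if (*p != '=')
        return 0;
    p = skip_ws(p + 1);
    quote = *p;
    if (quote != '"' && quote != '\'')
        return 0;
    p++;
    end = strchr(p, quote);
    if (end == NULL || (size_t)(end - p) >= out_size)
        return 0;
    memcpy(out, p, (size_t)(end - p));
    out[end - p] = '\0';
    return 1;
}
```

If the returned name is anything other than a UTF-16 variant, the buffer
falls into the failing case described above.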


I've tried to look up the archives concerning this issue and found some 
mailings that seem to point to a comparable problem:

http://mail.gnome.org/archives/xml/2002-March/msg00012.html
http://mail.gnome.org/archives/xml/2002-January/msg00105.html
http://mail.gnome.org/archives/xml/2002-August/msg00040.html


If this may be of concern: I have the *feeling* (and nothing more) that 
libxml2 takes the declared encoding into account *somewhere* and 
stumbles over the UTF-16 encoded buffer (sorry that I could not debug 
libxml2, actually I don't even know how to debug libxml2; the dsp and 
bcb5 stuff is gone and I'm really not that much into "c").
But maybe the push parser wasn't built for that kind of processing.


Anyway, this behaviour seems inconsistent to me, so I'm hoping for 
clarification.

 

Problem solved: I tried to let libxml2 eat UTF-16LE with an encoding declaration 
stating another encoding. This is *not possible*, as stated by William M. Brack 
in "http://bugzilla.gnome.org/show_bug.cgi?id=126197". The parser seems to switch 
back to the declared encoding after the initial chunk given to the push parser.

This still seems somewhat inconsistent to me, since the parser does not apply the 
declared encoding when all the data is passed in a single xmlParseChunk call.

Does anybody have a hint on how to let the push parser eat UTF-16LE regardless of
the declared encoding - if this is possible at all?


Regards,

Kasimier



