[xml] Problems to parse UTF-16 encoded xml with libxml implementation of xmlReader



Hi!
Here is some new information about UTF-16 parsing with the xmlTextReader
API (see message msg00210).

Below are the results of my tests of the xmlTextReader API with libxml2
version 2.5.8 when parsing UTF-16 encoded XML, followed by my conclusions.

Cases 5 and 6 are the ones that work.

1) 
creation of the reader :
    reader = xmlNewTextReaderFilename(filename);
hexadecimal display of the beginning of the file :
    3C 00 3F 00 78 00 6D 00 6C 00 20 00 76 00 65 00 72 00 73 00 69 00 6F 00
    6E 00 3D 00 22 00 31 00 2E 00 30 00 22 00 3F 00 3E 00
'ascii' display of the beginning of the file :
    <?xml version="1.0"?>
result :
--------->  FAILED TO PARSE after the first xmlTextReaderRead

2)      
creation of the reader :
    reader = xmlNewTextReaderFilename(filename);
hexadecimal display of the beginning of the file :
    FF FE 3C 00 3F 00 78 00 6D 00 6C 00 20 00 76 00 65 00 72 00 73 00 69 00
    6F 00 6E 00 3D 00 22 00 31 00 2E 00 30 00 22 00 3F 00 3E 00
'ascii' display of the beginning of the file :
    ÿþ<?xml version="1.0"?>
result :
--------->  FAILED TO PARSE after the first xmlTextReaderRead
      
3)
creation of the reader :
    input = xmlParserInputBufferCreateFilename(filename,
                                               XML_CHAR_ENCODING_UTF16LE);
    reader = xmlNewTextReader(input,filename);
hexadecimal display of the beginning of the file :
    3C 00 3F 00 78 00 6D 00 6C 00 20 00 76 00 65 00 72 00 73 00 69 00 6F 00
    6E 00 3D 00 22 00 31 00 2E 00 30 00 22 00 20 00 65 00 6E 00 63 00 6F 00 64
    00 69 00 6E 00 67 00 3D 00 22 00 55 00 54 00 46 00 2D 00 31 00 36 00 22 00
    3F 00 3E 00
'ascii' display of the beginning of the file :
    <?xml version="1.0" encoding="UTF-16"?>
result :
--------->  FAILED TO PARSE after the first xmlTextReaderRead

4)
creation of the reader :
    input = xmlParserInputBufferCreateFilename(filename,
                                               XML_CHAR_ENCODING_UTF16LE);
    reader = xmlNewTextReader(input,filename);
hexadecimal display of the beginning of the file :
    FF FE 3C 00 3F 00 78 00 6D 00 6C 00 20 00 76 00 65 00 72 00 73 00 69 00
    6F 00 6E 00 3D 00 22 00 31 00 2E 00 30 00 22 00 20 00 65 00 6E 00 63 00 6F
    00 64 00 69 00 6E 00 67 00 3D 00 22 00 55 00 54 00 46 00 2D 00 31 00 36 00
    22 00 3F 00 3E 00
'ascii' display of the beginning of the file :
    ÿþ<?xml version="1.0" encoding="UTF-16"?>
result :
--------->  FAILED TO PARSE after the first xmlTextReaderRead

5)
creation of the reader :
    input = xmlParserInputBufferCreateFilename(filename,
                                               XML_CHAR_ENCODING_UTF16LE);
    reader = xmlNewTextReader(input,filename);
hexadecimal display of the beginning of the file :
    3C 00 3F 00 78 00 6D 00 6C 00 20 00 76 00 65 00 72 00 73 00 69 00 6F 00
    6E 00 3D 00 22 00 31 00 2E 00 30 00 22 00 3F 00 3E 00
'ascii' display of the beginning of the file :
    <?xml version="1.0"?>
result :
--------->  OK

6)
creation of the reader :
    input = xmlParserInputBufferCreateFilename(filename,
                                               XML_CHAR_ENCODING_UTF16LE);
    reader = xmlNewTextReader(input,filename);
hexadecimal display of the beginning of the file :
    FF FE 3C 00 3F 00 78 00 6D 00 6C 00 20 00 76 00 65 00 72 00 73 00 69 00
    6F 00 6E 00 3D 00 22 00 31 00 2E 00 30 00 22 00 3F 00 3E 00
'ascii' display of the beginning of the file :
    ÿþ<?xml version="1.0"?>
result :
--------->  OK

**********
CONCLUSION
**********
1) In order to parse UTF-16 encoded XML, the encoding must be specified
explicitly via "xmlParserInputBufferCreateFilename" or an equivalent.
So if the file encoding is unknown, the encoding signature must be detected
before creating the reader.
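
A minimal sketch of such a signature search, assuming the first bytes of the
file have already been read into a buffer. It does not call libxml2 at all;
the names "signature_t" and "detect_utf16_signature" are made up for this
illustration. It looks for the FF FE / FE FF byte order marks seen in the
tests above, and falls back on the "<?" pattern for files without one:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch (not a libxml2 type). */
typedef enum { SIG_UNKNOWN = 0, SIG_UTF16LE, SIG_UTF16BE } signature_t;

/* Inspect the first bytes of a document and guess whether it is UTF-16. */
signature_t detect_utf16_signature(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    if (len >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE) {
            /* FF FE 00 00 is the UTF-32LE byte order mark, not UTF-16LE. */
            if (len >= 4 && buf[2] == 0x00 && buf[3] == 0x00)
                return SIG_UNKNOWN;
            return SIG_UTF16LE;
        }
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return SIG_UTF16BE;
    }
    if (len >= 4) {
        /* No byte order mark: look for "<?" encoded as 16-bit units. */
        if (buf[0] == 0x3C && buf[1] == 0x00 &&
            buf[2] == 0x3F && buf[3] == 0x00)
            return SIG_UTF16LE;
        if (buf[0] == 0x00 && buf[1] == 0x3C &&
            buf[2] == 0x00 && buf[3] == 0x3F)
            return SIG_UTF16BE;
    }
    return SIG_UNKNOWN;
}
```

The caller would then map SIG_UTF16LE to XML_CHAR_ENCODING_UTF16LE (or
SIG_UTF16BE to XML_CHAR_ENCODING_UTF16BE) when calling
"xmlParserInputBufferCreateFilename", and use the plain
"xmlNewTextReaderFilename" path for everything else.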

2) If the encoding is specified via "xmlParserInputBufferCreateFilename", the
encoding attribute must not be present in the XML declaration.
Here is the bug :
"ctxt->input->cur" becomes wrong after a call to "xmlSwitchEncoding" (why is
there an encoding switch here?)
CALL STACK :
    xmlSwitchEncoding
    xmlParseEncodingDecl
    xmlParseXMLDecl
    xmlParseTryOrFinish
    xmlParseChunk
    xmlTextReaderPushData
    xmlTextReaderRead

Pierre.


