[xml] Problems parsing UTF-16 encoded XML with the libxml implementation of xmlReader



Hello,

Following is the description of the problem I encountered.

*Context :
  I work on Windows XP with version 2.5.6 of libxml, taken from the
"libxml2-2.5.6.win32.zip" archive provided by Igor Zlatkovic.

*Problem :
  When I use xmlReader to parse an XML document encoded in UTF-16,
the parser fails to read nodes.
  It seems that the document is not recognized as a UTF-16 encoded document.

  The document is in UTF-16 little endian.

  I first used the xmlNewTextReaderFilename function to create the parser.
  The error messages are the following:
    - if the file begins with the \xFF \xFE bytes: "Start tag expected, '<'
not found"
    - if the file begins with the \x3C \x00 bytes: "xmlParseStartTag:
invalid element name"
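
  For reference, the two leading byte patterns mentioned above can be told
apart with a small check before the file is handed to the parser. This is
only a minimal sketch in plain C: sniff_utf16 and the sniff_result enum are
hypothetical names, not part of libxml2, and the check looks only at the
first few bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, NOT a libxml2 function. */
typedef enum {
    SNIFF_UNKNOWN,        /* no UTF-16 signature recognised */
    SNIFF_UTF16LE_BOM,    /* FF FE: byte order mark, little endian */
    SNIFF_UTF16BE_BOM,    /* FE FF: byte order mark, big endian */
    SNIFF_UTF16LE_NOBOM,  /* 3C 00: '<' in little endian, no BOM */
    SNIFF_UTF16BE_NOBOM   /* 00 3C: '<' in big endian, no BOM */
} sniff_result;

sniff_result sniff_utf16(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len < 2)
        return SNIFF_UNKNOWN;

    /* FF FE 00 00 and 3C 00 00 00 look like UTF-32LE, so they are
       deliberately not reported as UTF-16. */
    if (len >= 4 && buf[2] == 0x00 && buf[3] == 0x00 &&
        ((buf[0] == 0xFF && buf[1] == 0xFE) ||
         (buf[0] == 0x3C && buf[1] == 0x00)))
        return SNIFF_UNKNOWN;

    if (buf[0] == 0xFF && buf[1] == 0xFE) return SNIFF_UTF16LE_BOM;
    if (buf[0] == 0xFE && buf[1] == 0xFF) return SNIFF_UTF16BE_BOM;
    if (buf[0] == 0x3C && buf[1] == 0x00) return SNIFF_UTF16LE_NOBOM;
    if (buf[0] == 0x00 && buf[1] == 0x3C) return SNIFF_UTF16BE_NOBOM;
    return SNIFF_UNKNOWN;
}
```

  The first case above corresponds to SNIFF_UTF16LE_BOM and the second to
SNIFF_UTF16LE_NOBOM.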

  I then used xmlAllocParserInputBuffer(XML_CHAR_ENCODING_UTF16LE) to
get the xmlParserInputBufferPtr that I passed to the xmlNewTextReader
function.
  In that case the error message is the following: "Extra content at the
end of the document"

*Resolution?
  I worked around the problem by converting the document from UTF-16 to
UTF-8 myself before parsing it.
  Is this the only solution? Is it a bad solution in terms of
performance? Is xmlReader supposed to parse only UTF-8 encoded XML?
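
  To make the workaround concrete, here is a minimal sketch of such a
conversion in plain C. It is not my exact code and not a libxml2 function:
utf16le_to_utf8 is a hypothetical name, the whole document is assumed to be
in memory already, and the encoding name in the XML declaration is not
rewritten.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, NOT a libxml2 function.
   Converts UTF-16LE bytes to UTF-8, skipping a leading FF FE BOM.
   Returns the number of bytes written, or (size_t)-1 if the input is
   malformed or the output buffer is too small. */
size_t utf16le_to_utf8(const unsigned char *in, size_t in_len,
                       unsigned char *out, size_t out_cap)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);
    if (in_len % 2 != 0)
        return (size_t)-1;              /* odd byte count: truncated unit */
    if (in_len >= 2 && in[0] == 0xFF && in[1] == 0xFE)
        i = 2;                          /* skip the byte order mark */

    while (i < in_len) {
        unsigned long cp = in[i] | ((unsigned long)in[i + 1] << 8);
        size_t need;
        i += 2;

        if (cp >= 0xD800 && cp <= 0xDBFF) {          /* high surrogate */
            unsigned long lo;
            if (i + 2 > in_len)
                return (size_t)-1;                   /* missing low half */
            lo = in[i] | ((unsigned long)in[i + 1] << 8);
            if (lo < 0xDC00 || lo > 0xDFFF)
                return (size_t)-1;
            i += 2;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;                       /* stray low surrogate */
        }

        need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (o + need > out_cap)
            return (size_t)-1;

        if (need == 1) {
            out[o++] = (unsigned char)cp;
        } else if (need == 2) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

  An output buffer of 3 bytes per 2 input bytes is always large enough,
since a 16-bit unit yields at most 3 UTF-8 bytes and a surrogate pair
(4 input bytes) yields 4.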

Thank you,

Pierre.


