Re: [xml] Push-parsing Unicode with LibXML2



On Mon, Feb 13, 2006 at 02:07:32PM -0800, Eric Seidel wrote:
I'm reading in data off the network, converting it to utf16, and then  
passing it off to libxml2.  In the parser4 adapted example, I'm  
reading ascii from a local file, expanding it to integers  
(effectively utf16) and then passing it to libxml2:
[...]
    const unsigned BOM = 0xFEFF;
    const unsigned char BOMHighByte = *(const unsigned char *)&BOM;
    xmlSwitchEncoding(ctxt, BOMHighByte == 0xFF ?
        XML_CHAR_ENCODING_UTF16LE : XML_CHAR_ENCODING_UTF16BE);

  What did you expect to achieve that way?
UTF-16 is one of the encodings that an XML parser must autodetect and
handle, see
  http://www.w3.org/TR/REC-xml/#sec-guessing
What you are doing may well break the parser's internal detection.
You must not use xmlSwitchEncoding() unless you are an expert in how
the libxml2 internals work. So don't do this, at least at this stage!
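
For illustration only, here is a rough sketch of the kind of detection the
spec section above describes: the parser looks at the first bytes of the
raw input (a byte order mark, or the byte layout of "<?") to pick an
encoding family. This is not libxml2's actual code; guess_xml_encoding()
and the strings it returns are hypothetical names made up for this example,
and the UCS-4-without-BOM and EBCDIC rows of the spec table are left out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of libxml2): guess the encoding family
 * from the first bytes of an XML entity, loosely following the table in
 * the XML spec's section on autodetection of character encodings. */
static const char *guess_xml_encoding(const unsigned char *b, size_t n)
{
    if (n >= 4) {
        /* UCS-4 BOMs are tested first: FF FE 00 00 would otherwise be
         * mistaken for a UTF-16LE byte order mark. */
        if (b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
            return "UCS-4BE";
        if (b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
            return "UCS-4LE";
    }
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return "UTF-8";
    if (n >= 2) {
        if (b[0] == 0xFE && b[1] == 0xFF)
            return "UTF-16BE";
        if (b[0] == 0xFF && b[1] == 0xFE)
            return "UTF-16LE";
    }
    if (n >= 4) {
        /* No byte order mark: look at how "<?" is laid out in memory. */
        if (memcmp(b, "\x00\x3C\x00\x3F", 4) == 0)
            return "UTF-16BE";
        if (memcmp(b, "\x3C\x00\x3F\x00", 4) == 0)
            return "UTF-16LE";
        /* "<?xm" in an ASCII-compatible encoding: the encoding
         * declaration has to be read to find out which one. */
        if (memcmp(b, "\x3C\x3F\x78\x6D", 4) == 0)
            return "ASCII-compatible";
    }
    /* Nothing matched: the spec's fallback is UTF-8 without a declaration. */
    return "UTF-8";
}
```

Calling guess_xml_encoding() on the first bytes read off the network shows
why the parser needs the original, untouched bytes: both the byte order
mark and the "<?xml" layout are lost or changed once the client transcodes
the data or forces an encoding.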

Actually, even converting to UTF-16 from the external source is just plain
broken: the XML declaration may state some other encoding, and then the
actual bytes and the declared encoding will conflict. Really not a good
idea. Again, unless you really know what you're doing, you should never
attempt to work around the parser autodetection code: you're playing with
the conformance of the parser to the spec, so this is on the edge of what
is acceptable from client code.

Daniel

-- 
Daniel Veillard      | Red Hat http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


