Re: [xml] Push-parsing Unicode with LibXML2




On Feb 14, 2006, at 2:32 AM, Daniel Veillard wrote:

On Tue, Feb 14, 2006 at 01:38:45AM -0800, Eric Seidel wrote:
As I see it, my only options are:

1.  Find (with your help) some way to hack around libxml's encoding-
overrides-everything behavior.  (This might mean detecting and
stripping <?xml... lines or encoding="" attributes from the input
stream.)
2.  Ask you nicely to add an API for disabling this behavior (or
otherwise manually overriding the encoding.)
3.  Hack some such manual-encoding-override behavior into the Mac OS
X system version of libxml2 for our next release.  (My least favorite
option.)

Any suggestions are most welcome...

  To me the most logical would be to do surgery on your input stream:
since you are modifying it by changing its encoding, you should then also
change or remove the encoding declaration in the xmlDecl if present.
  However, to follow Appendix F.2 the user-provided encoding should
override the detected one, so that could be considered a libxml2 bug;
I'm just really worried about breaking existing code by changing this.
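
A minimal sketch of that "surgery", assuming the document is already held as a decoded QString (as in the code further down) and assuming Qt-style QString/QRegExp APIs; stripEncodingDeclaration is an illustrative helper name, not part of libxml2 or of the original patch:

    #include <QRegExp>
    #include <QString>

    // Illustrative helper: remove the encoding="..." pseudo-attribute from the
    // XML declaration so the declared encoding cannot override the encoding
    // the caller has already decoded the stream with.
    static void stripEncodingDeclaration(QString &document)
    {
        if (!document.startsWith("<?xml"))
            return;
        int declEnd = document.indexOf("?>");
        if (declEnd < 0)
            return;
        QString decl = document.left(declEnd);
        decl.replace(QRegExp("\\s+encoding\\s*=\\s*(\"[^\"]*\"|'[^']*')"), "");
        document = decl + document.mid(declEnd);
    }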

I've found a (hackish) solution to the problem: by calling xmlSwitchEncoding before every chunk (and passing the proper UTF-16 variant), I'm able to make my existing code work:

    // Hack around libxml2's lack of encoding override support by manually
    // resetting the encoding to UTF-16 before every chunk. Otherwise libxml
    // will detect <?xml version="1.0" encoding="<encoding name>"?> blocks
    // and switch encodings, causing the parse to fail.
    const QChar BOM(0xFEFF);
    const unsigned char BOMHighByte = *reinterpret_cast<const unsigned char *>(&BOM);
    xmlSwitchEncoding(m_context, BOMHighByte == 0xFF ? XML_CHAR_ENCODING_UTF16LE : XML_CHAR_ENCODING_UTF16BE);

    xmlParseChunk(m_context, reinterpret_cast<const char *>(parseString.unicode()), sizeof(QChar) * parseString.length(), 0);
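
For readers without the Qt context, the same workaround against the bare libxml2 push API looks roughly like the sketch below; feedUTF16Chunk is an illustrative name, and the SAX handler setup and error handling are omitted:

    #include <libxml/parser.h>
    #include <libxml/parserInternals.h>

    // Illustrative helper: push one UTF-16 chunk through an existing push
    // parser context, forcing the encoding back to the host's UTF-16 variant
    // first so that a previously seen <?xml ... encoding="..."?> declaration
    // cannot switch the parser away from it.
    static void feedUTF16Chunk(xmlParserCtxtPtr ctxt, const unsigned short *chunk,
                               int charCount, int isLast)
    {
        const unsigned short bom = 0xFEFF;
        const bool hostIsLittleEndian =
            *reinterpret_cast<const unsigned char *>(&bom) == 0xFF;
        xmlSwitchEncoding(ctxt, hostIsLittleEndian ? XML_CHAR_ENCODING_UTF16LE
                                                   : XML_CHAR_ENCODING_UTF16BE);
        xmlParseChunk(ctxt, reinterpret_cast<const char *>(chunk),
                      charCount * static_cast<int>(sizeof(unsigned short)), isLast);
    }

The context itself would come from the usual xmlCreatePushParserCtxt(NULL, NULL, NULL, 0, NULL) setup (or a variant with a SAX handler), exactly as in any other push-parsing arrangement.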


  Other suggestion: don't mess with the LE- or BE-specific names for
UTF-16, just use "UTF-16"; the parser automatically adjusts anyway.

That's good to know. Under the covers, however, libxml uses XML_CHAR_ENCODING_UTF16LE or XML_CHAR_ENCODING_UTF16BE, so for now I'm just detecting which one myself and passing it directly to xmlSwitchEncoding.
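
If one did want to follow the generic-name suggestion instead, the usual route would be xmlFindCharEncodingHandler plus xmlSwitchToEncoding, roughly as sketched below; whether the handler chosen for plain "UTF-16" then defers byte-order selection to the BOM, as described above, is an assumption worth verifying against your libxml2 version:

    #include <libxml/encoding.h>
    #include <libxml/parserInternals.h>

    // Sketch: ask libxml2 for an encoding handler by the generic name and
    // install it on the parser context, instead of choosing between the
    // UTF16LE and UTF16BE constants by hand.
    static bool switchToUTF16(xmlParserCtxtPtr ctxt)
    {
        xmlCharEncodingHandlerPtr handler = xmlFindCharEncodingHandler("UTF-16");
        if (!handler)
            return false;
        return xmlSwitchToEncoding(ctxt, handler) == 0;
    }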

Thanks again for all your help.

-eric

Daniel

--
Daniel Veillard      | Red Hat http://redhat.com/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
