Re: [xml] Push-parsing Unicode with LibXML2


On Feb 14, 2006, at 12:59 AM, Daniel Veillard wrote:

On Tue, Feb 14, 2006 at 12:45:14AM -0800, Eric Seidel wrote:
I'm now looking for a way to make libxml ignore the
encoding="iso-8859-1" attribute, and instead rely on the utf-16 it
autodetected (or which I can manually specify).

  xmlCreatePushParserCtxt() doesn't have an encoding option, but
calling xmlCtxtResetPush() after its creation with the parameters
might help.

    xmlParserCtxtPtr parser = xmlCreatePushParserCtxt(handlers, 0, 0, 0, 0);
    xmlCtxtResetPush(parser, 0, 0, 0, "UTF-16BE");

This has no effect. Looking at the code for xmlParseChunk (and more specifically xmlParseEncodingDecl), I can see why: the code seems to be written so that an encoding="<name here>" attribute always trumps any previously detected encoding, at least while the parser is in XML_PARSER_START mode (see parser.c, line 8786, in xmlParseEncodingDecl).

Just for grins, I tried forcing the parser to start in XML_PARSER_MISC after manually specifying the encoding, but that only resulted in an "XML declaration allowed only at the start of the document" error.

As I see it, my only options are:

1. Find (with your help) some way to hack around libxml's encoding-overrides-everything behavior. (This might mean detecting and stripping <?xml... lines or encoding="" attributes from the input stream.)
2. Ask you nicely to add an API for disabling this behavior (or otherwise manually overriding the encoding.)
3. Hack some such manual-encoding-override behavior into the Mac OS X system version of libxml2 for our next release. (My least favorite option.)
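To make option 1 concrete, here is a rough sketch of what stripping the encoding="" attribute could look like. strip_encoding_decl() is a hypothetical helper, not a libxml2 function. It assumes the start of the stream has already been transcoded to an ASCII-compatible encoding such as UTF-8 and that the whole XML declaration sits in one NUL-terminated buffer; raw UTF-16 input would need a variant that works on 2-byte units.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Whitespace as allowed inside an XML declaration. */
static int is_xml_space(char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/* Hypothetical helper (not part of libxml2): remove the encoding="..."
 * pseudo-attribute from a leading XML declaration, in place.
 * Assumes buf is NUL-terminated, ASCII-compatible, and contains the
 * complete declaration. Returns 1 if something was removed, else 0. */
static int strip_encoding_decl(char *buf)
{
    char *end, *start, *p, *close;
    char quote;

    assert(buf != NULL);

    /* Only a real declaration counts, not e.g. <?xml-stylesheet ...?>. */
    if (strncmp(buf, "<?xml", 5) != 0 || !is_xml_space(buf[5]))
        return 0;
    end = strstr(buf, "?>");
    if (end == NULL)
        return 0;

    start = strstr(buf + 5, "encoding");
    if (start == NULL || start >= end)
        return 0;

    /* Expect: encoding, optional space, '=', optional space, quoted value. */
    p = start + 8;                      /* strlen("encoding") */
    while (is_xml_space(*p))
        p++;
    if (*p != '=')
        return 0;
    p++;
    while (is_xml_space(*p))
        p++;
    quote = *p;
    if (quote != '"' && quote != '\'')
        return 0;
    close = strchr(p + 1, quote);
    if (close == NULL || close >= end)
        return 0;

    /* Also drop the whitespace that preceded the attribute. */
    while (start > buf + 5 && is_xml_space(start[-1]))
        start--;

    /* Close the gap, copying the trailing NUL as well. */
    memmove(start, close + 1, strlen(close + 1) + 1);
    return 1;
}
```

With a filter like this run over the first buffered block, the parser would never see the iso-8859-1 declaration and would presumably stay with whatever encoding it detected or was told to use. I have not verified that against libxml2 itself.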

Any suggestions are most welcome...

Note that you really should try to pass all parameters
and not NULLs/0; things like the filename, which sets the base URI, are
important for further processing of URI references.

I will certainly consider adding the URI.

  And please don't push one byte at a time, after that people may
claim that libxml2 is a poor performer!

:) Of course not. This is just for testing. I'm pushing one byte at a time to make this easier to debug.
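For the non-debugging case, pushing in larger blocks might look roughly like the sketch below. The push_fn callback has the same shape as xmlParseChunk(ctxt, chunk, size, terminate), but with an opaque context so the sketch compiles without libxml2; a thin wrapper around xmlParseChunk would be needed in real code. push_in_chunks(), push_stats and count_push() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Same shape as xmlParseChunk(ctxt, chunk, size, terminate), with the
 * parser context made opaque (an assumption made for this sketch). */
typedef int (*push_fn)(void *ctx, const char *chunk, int size, int terminate);

/* Hypothetical helper: feed data to push in blocks of at most chunk_size
 * bytes, then signal end of input. Returns 0 on success, or the first
 * non-zero code returned by push. */
static int push_in_chunks(push_fn push, void *ctx,
                          const char *data, size_t len, size_t chunk_size)
{
    size_t off = 0;

    assert(push != NULL);
    assert(chunk_size > 0 && chunk_size <= INT_MAX);

    while (off < len) {
        size_t n = len - off;
        int rc;

        if (n > chunk_size)
            n = chunk_size;
        rc = push(ctx, data + off, (int)n, 0);
        if (rc != 0)
            return rc;
        off += n;
    }
    /* An empty final chunk with terminate set marks the end of input. */
    return push(ctx, NULL, 0, 1);
}

/* Stand-in sink for trying the helper out: it only counts what it sees. */
struct push_stats {
    int calls;
    size_t bytes;
    int terminated;
};

static int count_push(void *ctx, const char *chunk, int size, int terminate)
{
    struct push_stats *st = ctx;

    (void)chunk;
    st->calls++;
    st->bytes += (size_t)size;
    if (terminate)
        st->terminated = 1;
    return 0;
}
```

Setting chunk_size to 1 reproduces the byte-at-a-time behavior used for debugging here, while something like 4096 keeps the number of parser calls small.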

Thanks again for all your help.



Daniel Veillard      | Red Hat
veillard redhat com | libxml GNOME XML XSLT toolkit http:// | Rpmfind RPM search engine
