Re: [xml] Push-parsing Unicode with LibXML2

From: Kasimier Buchcik <K Buchcik 4commerce de>
To: Daniel Veillard <veillard redhat com>
Cc: ML-libxml2 <xml gnome org>
Subject: Re: [xml] Push-parsing Unicode with LibXML2
Date: Tue, 14 Feb 2006 13:14:21 +0100

Hi,

On Tue, 2006-02-14 at 05:32 -0500, Daniel Veillard wrote:

On Tue, Feb 14, 2006 at 01:38:45AM -0800, Eric Seidel wrote:

As I see it, my only options are:

1.  Find (with your help) some way to hack around libxml's
encoding-overrides-everything behavior.  (This might mean detecting and
stripping <?xml... lines or encoding="" attributes from the input
stream.)
2.  Ask you nicely to add an API for disabling this behavior (or
otherwise manually overriding the encoding.)
3.  Hack some such manual-encoding-override behavior into the Mac OS
X system version of libxml2 for our next release.  (My least favorite
option.)

Any suggestions are most welcome...


  To me the most logical would be to do surgery on your input stream:
you are modifying it by changing its encoding, so you should then also
change or remove the encoding declaration of the xmlDecl if present.


We are doing this in our Delphi DOM-wrapper and lxml does it as well.
I guess PHP does something similar.

Since in Delphi we defined the DOMString to be little-endian with
no BOM, we currently do the following if parsing a DOMString: 

1) Extract the version and standalone values from the XML declaration
   on our side.
2) Create a new string with an initial UTF-16LE BOM (2 bytes: FF FE),
   followed by a reconstructed XML declaration with the version and
   standalone values obtained from 1). Omit the encoding declaration.
3) Feed the parser with the string from 2)
4) Feed the parser with the rest of the input starting after
   the XML declaration.
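
The four steps above can be sketched roughly as follows. This is only an
illustration in Python, not our Delphi code: the function name
split_for_push_parser and the loose regular expressions are assumptions
made for the example, and a real implementation would want a stricter
XML declaration parser.

```python
import codecs
import re

# Loose patterns for illustration only; this is not a full XML parser.
_XML_DECL = re.compile(r'^<\?xml\s+(.*?)\?>', re.DOTALL)
_PSEUDO_ATTR = re.compile(r'(\w+)\s*=\s*(["\'])(.*?)\2')


def split_for_push_parser(doc):
    """Return (head, tail) byte chunks for a document held as a str.

    head: UTF-16LE BOM, plus a rebuilt XML declaration that keeps
          version/standalone but drops the encoding declaration.
    tail: everything after the original declaration, as UTF-16LE.
    """
    head = codecs.BOM_UTF16_LE  # the 2 bytes FF FE
    rest = doc
    m = _XML_DECL.match(doc)
    if m:
        # Step 1: extract version and standalone on our side.
        attrs = {name: value
                 for name, _quote, value in _PSEUDO_ATTR.findall(m.group(1))}
        # Step 2: rebuild the declaration without the encoding.
        decl = '<?xml version="%s"' % attrs.get("version", "1.0")
        if "standalone" in attrs:
            decl += ' standalone="%s"' % attrs["standalone"]
        decl += "?>"
        head += decl.encode("utf-16-le")
        rest = doc[m.end():]
    # Steps 3 and 4: the caller feeds head, then tail, to the parser.
    return head, rest.encode("utf-16-le")
```

The two chunks would then be handed to the push parser in order, for
example with two xmlParseChunk() calls.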

The parser will switch in xmlParseDocument() to the desired
encoding with the use of xmlDetectCharEncoding() due to the
BOM. If there's no encoding declaration then xmlParseXMLDecl()
won't switch encoding.
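
To illustrate the kind of detection meant here, a simplified sketch of
sniffing the encoding from the first bytes, in the spirit of appendix F
of the XML specification. This is not the libxml2 source; the function
name sniff_encoding and the reduced set of signatures are assumptions
made for the example.

```python
def sniff_encoding(first_bytes):
    """Guess an encoding name from the first (up to 4) bytes, or None."""
    b = bytes(first_bytes[:4])
    # Check the 4-byte BOMs first: FF FE 00 00 also starts with FF FE.
    if b.startswith(b"\xff\xfe\x00\x00"):
        return "UTF-32LE"
    if b.startswith(b"\x00\x00\xfe\xff"):
        return "UTF-32BE"
    if b.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if b.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    if b.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    # No BOM: "<?" encoded as UTF-16 still gives the byte order away.
    if b == b"<\x00?\x00":
        return "UTF-16LE"
    if b == b"\x00<\x00?":
        return "UTF-16BE"
    return None
```

With the BOM from step 2 in place, such a check yields UTF-16LE before
any XML declaration has been read.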

  However, to follow appendix F.2 of the XML specification, the
user-provided encoding should override the detected one, so that could
be considered a libxml2 bug. I'm just really worried about breaking
existing code by changing this.


Fooling the parser in order to eat the user's encoding works, but
it's not nice.
I wonder if we could have an additional xmlParserOption,
e.g. XML_PARSE_OVERRIDEENCDECL, to explicitly instruct the parser to
parse the encoding declaration, but not to use it; this wouldn't break
existing code.

  Other suggestion: don't mess with the LE or BE specific names for
UTF-16, just use "UTF-16"; the parser automatically adjusts anyway.

Daniel


Regards,

Kasimier

