Re: [xml] Push-parsing Unicode with LibXML2


On Wed, 2006-02-15 at 08:50 -0500, Rob Richards wrote:
After reading this thread and the comments in the bug report I have a 
few questions/comments.

Kasimier Buchcik wrote:
  To me the most logical would be to do surgery on your input stream:
if you are modifying it by changing its encoding, you should then also 
change or remove the encoding declaration of the xmlDecl if present.
We are doing this in our Delphi DOM-wrapper and lxml does it as well.
I guess PHP does something similar.
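
The "surgery" described above might look roughly like the following sketch.
It is not the code used by the Delphi wrapper or by lxml; the function names
are hypothetical, and the regular expression only handles an XML declaration
that starts with a version (not a bare text declaration). It transcodes the
input to little-endian UTF-16 without a BOM and drops the now-stale encoding
declaration.

```python
import re

# Matches the start of an XML declaration and, optionally, its encoding
# pseudo-attribute. Assumption: the declaration begins with "version";
# text declarations of external entities are not covered.
_XML_DECL = re.compile(
    r"""^(<\?xml\s+version\s*=\s*(?:"[^"]*"|'[^']*'))"""
    r"""(?:\s+encoding\s*=\s*(?:"[^"]*"|'[^']*'))?"""
)


def strip_encoding_decl(text: str) -> str:
    """Remove encoding="..." from a leading XML declaration, if present."""
    return _XML_DECL.sub(r"\1", text, count=1)


def to_utf16le_no_bom(raw: bytes, source_encoding: str) -> bytes:
    """Transcode raw XML bytes to UTF-16LE (no BOM) and drop the stale
    encoding declaration so it cannot contradict the new encoding."""
    text = raw.decode(source_encoding)
    return strip_encoding_decl(text).encode("utf-16-le")
```

Removing the declaration, rather than rewriting it, keeps the sketch short;
a parser that is told the encoding out of band does not need it.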

Since in Delphi we defined the DOMString to be little-endian with
no BOM, we currently do the following if parsing a DOMString: 
PHP doesn't play around with encoding or even implement a DOMString in 
the DOM extension. If any special encoding needs to be handled using a 
string it's up to the user to encode it as needed. The specified 
document encoding or BOM is what is used to determine encoding as I 

This is not restricted to parsing of a DOMString.

With the DOM Load & Save module you can override the encoding
declaration of the XML entity via the LSInput.encoding property:

"For other sources of input [other than DOMString], an encoding
specified by means of this attribute will override any encoding
specified in the XML declaration or the Text declaration, or an encoding
obtained from a higher level protocol, such as HTTP [IETF RFC 2616]."
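
The same kind of override can be illustrated outside the DOM LS module.
The sketch below uses Python's standard xml.etree.ElementTree, whose
XMLParser accepts an encoding argument that takes precedence over the
encoding named in the XML declaration. It is only an analogy for
LSInput.encoding, not an implementation of it, and parse_with_override is
a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def parse_with_override(data: bytes, encoding=None):
    """Parse XML bytes. If `encoding` is given, it overrides whatever the
    XML declaration claims; otherwise the declaration is honoured."""
    parser = ET.XMLParser(encoding=encoding)
    return ET.fromstring(data, parser=parser)
```

This is useful when the bytes are known to be in one encoding (for example
after transcoding) while the declaration still names another.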

really don't agree with overriding encoding and haven't heard any 
complaints yet.

Then PHP doesn't use (hasn't implemented) the LS module.

For LSInput.stringData (which is of type DOMString) it reads:
"String data to parse. If provided, this will always be treated as a
sequence of 16-bit units (UTF-16 encoded characters)."

I do have a question on Kasimier's latest comment in the bug report 
about keeping any specified encoding of the document. If this value is 
not kept, then what encoding is used when the document is serialized and 
not explicitly passed to the save functions? Would it use the overriding 
value rather than the original one specified in the XMLDecl?

If one is using the Load & Save module [1], then this is defined in:

So the sequence of obtaining the encoding for serialization is:
1) LSOutput.encoding
2) Document.inputEncoding
3) Document.xmlEncoding

... with a fallback to UTF-8 if none of the above is specified.
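
The lookup order listed above can be written down as a small sketch. The
function and parameter names are hypothetical; they only mirror the three
properties named in the list and say nothing about how libxml2 itself
chooses an encoding when saving.

```python
from typing import Optional


def serialization_encoding(
    ls_output_encoding: Optional[str],
    input_encoding: Optional[str],
    xml_encoding: Optional[str],
) -> str:
    """Return the first encoding that is set, in the order
    LSOutput.encoding, Document.inputEncoding, Document.xmlEncoding,
    falling back to UTF-8 when none is specified."""
    for candidate in (ls_output_encoding, input_encoding, xml_encoding):
        if candidate:
            return candidate
    return "UTF-8"
```

With this order, an encoding that overrode the XML declaration at load time
(and so ended up as the input encoding) wins over the declared one at save
time, unless the caller sets an output encoding explicitly.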

In any event whatever change is made to this I doubt it will have any 
impact on my side in terms of breakage since I don't muck around with 
encoding while parsing and use different I/O routines in the event any 
changes are made here for some sort of encoding detection (e.g. HTTP 
headers, etc.).



