Re: [xml] Push-parsing Unicode with LibXML2
- From: Kasimier Buchcik <K Buchcik 4commerce de>
- To: Rob Richards <rrichards ctindustries net>
- Cc: ML-libxml2 <xml gnome org>
- Subject: Re: [xml] Push-parsing Unicode with LibXML2
- Date: Wed, 15 Feb 2006 15:58:26 +0100
Hi,
On Wed, 2006-02-15 at 08:50 -0500, Rob Richards wrote:
After reading this thread and the comments in the bug report I have a
few questions/comments.
Kasimier Buchcik wrote:
To me the most logical would be to do surgery on your input stream: if
you are modifying it by changing its encoding, you should then also
change or remove the encoding declaration of the xmlDecl if present.
We are doing this in our Delphi DOM-wrapper and lxml does it as well.
I guess PHP does something similar.
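As an illustration of the "surgery" described above, here is a minimal sketch in plain C that removes the encoding pseudo-attribute from a leading XML declaration after the caller has transcoded the input. It is not the code used by the Delphi wrapper, lxml or PHP. The helper names strip_encoding_decl and is_xml_space are hypothetical, and the sketch assumes a NUL-terminated, ASCII-compatible buffer (e.g. UTF-8) whose declaration carries a version pseudo-attribute.

```c
#include <assert.h>
#include <string.h>

/* Whitespace characters allowed inside an XML declaration. */
static int is_xml_space(char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/*
 * Hypothetical helper: removes encoding="..." (or encoding='...') from a
 * leading <?xml ...?> declaration, editing the buffer in place.
 * Assumes buf is NUL-terminated and ASCII-compatible (e.g. UTF-8).
 * Returns 1 if a declaration was removed, 0 if the buffer was left alone.
 */
static int strip_encoding_decl(char *buf)
{
    char *decl_end, *enc, *p, *close_quote;

    assert(buf != NULL);

    /* The declaration must be the very first thing in the buffer. */
    if (strncmp(buf, "<?xml", 5) != 0 || !is_xml_space(buf[5]))
        return 0;
    decl_end = strstr(buf, "?>");
    if (decl_end == NULL)
        return 0;

    /* Find the pseudo-attribute name inside the declaration. */
    enc = strstr(buf + 5, "encoding");
    if (enc == NULL || enc > decl_end || !is_xml_space(enc[-1]))
        return 0;

    /* Expect: encoding, optional space, '=', optional space, quoted value. */
    p = enc + 8;
    while (is_xml_space(*p))
        p++;
    if (*p != '=')
        return 0;
    p++;
    while (is_xml_space(*p))
        p++;
    if (*p != '"' && *p != '\'')
        return 0;
    close_quote = strchr(p + 1, *p);
    if (close_quote == NULL || close_quote > decl_end)
        return 0;

    /* Also drop the whitespace that preceded the pseudo-attribute. */
    while (is_xml_space(enc[-1]))
        enc--;

    /* Close the gap, copying the terminating NUL as well. */
    memmove(enc, close_quote + 1, strlen(close_quote + 1) + 1);
    return 1;
}
```

Removing the declaration rather than rewriting it keeps the edit a pure shrink of the buffer, so no reallocation is needed; the parser then has to be told the real encoding by other means.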
Since in Delphi we defined the DOMString to be little-endian with
no BOM, we currently do the following if parsing a DOMString:
PHP doesn't play around with encoding or even implement a DOMString in
the DOM extension. If any special encoding needs to be handled using a
string it's up to the user to encode it as needed. The specified
document encoding or BOM is what is used to determine encoding as I
This is not restricted to parsing of a DOMString.
With the DOM Load & Save module you can override the encoding
declaration of the XML entity via the LSInput.encoding property:
"For other sources of input [other than DOMString], an encoding
specified by means of this attribute will override any encoding
specified in the XML declaration or the Text declaration, or an encoding
obtained from a higher level protocol, such as HTTP [IETF RFC 2616]."
http://www.w3.org/TR/2003/CR-DOM-Level-3-LS-20031107/load-save.html#LS-LSInput-encoding
really don't agree with overriding encoding and haven't heard any
complaints yet.
Then PHP doesn't use (hasn't implemented) the LS module.
For LSInput.stringData (which is of type DOMString) it reads:
"String data to parse. If provided, this will always be treated as a
sequence of 16-bit units (UTF-16 encoded characters)."
http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/load-save.html#LS-LSInput-stringData
I do have a question on Kasimier's latest comment in the bug report
about keeping any specified encoding of the document. If this value is
not kept, then what encoding is used when the document is serialized and
not explicitly passed to the save functions? Would it use the overriding
value rather than the original one specified in the XMLDecl?
If one is using the Load & Save module [1], then this is defined in:
http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/load-save.html#LS-LSSerializer-write
So the sequence of obtaining the encoding for serialization is:
1) LSOutput.encoding
2) Document.inputEncoding
3) Document.xmlEncoding
... with a fallback to UTF-8 if none of the above is specified.
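The lookup order above can be sketched as a small C function. This is only an illustration of the precedence, not libxml2 or DOM API: the struct doc_encodings and the function pick_output_encoding are hypothetical names, and treating an empty string the same as "not specified" is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two Document attributes named above. */
struct doc_encodings {
    const char *input_encoding; /* Document.inputEncoding, may be NULL */
    const char *xml_encoding;   /* Document.xmlEncoding, may be NULL */
};

/* Treats NULL and "" alike as "not specified" (an assumption). */
static int is_specified(const char *s)
{
    return s != NULL && s[0] != '\0';
}

/*
 * Returns the encoding to serialize with, following the order:
 * LSOutput.encoding, Document.inputEncoding, Document.xmlEncoding,
 * and finally UTF-8.
 */
static const char *pick_output_encoding(const char *ls_output_encoding,
                                        const struct doc_encodings *doc)
{
    assert(doc != NULL);

    if (is_specified(ls_output_encoding))
        return ls_output_encoding;
    if (is_specified(doc->input_encoding))
        return doc->input_encoding;
    if (is_specified(doc->xml_encoding))
        return doc->xml_encoding;
    return "UTF-8";
}
```

With this order, an encoding passed explicitly to the save call always wins, and the value from the XMLDecl is consulted only when neither the output nor the parse-time encoding is known.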
In any event whatever change is made to this I doubt it will have any
impact on my side in terms of breakage since I don't muck around with
encoding while parsing and use different I/O routines in the event any
changes are made here for some sort of encoding detection (e.g. HTTP
headers, etc.).
Rob
Regards,
Kasimier