Re: [xml] Push-parsing Unicode with LibXML2
- From: Kasimier Buchcik <K Buchcik 4commerce de>
- To: Rob Richards <rrichards ctindustries net>
- Cc: ML-libxml2 <xml gnome org>
- Subject: Re: [xml] Push-parsing Unicode with LibXML2
- Date: Wed, 15 Feb 2006 15:58:26 +0100
Hi,
On Wed, 2006-02-15 at 08:50 -0500, Rob Richards wrote:
> After reading this thread and the comments in the bug report I have a
> few questions/comments.
>
> Kasimier Buchcik wrote:
> >> To me the most logical would be to do surgery on your input stream
> >> you are modifying it by changing its encoding, you should then also
> >> change or remove the encoding declaration of the xmlDecl if present.
> >>
> > We are doing this in our Delphi DOM-wrapper and lxml does it as well.
> > I guess PHP does something similar.
> >
> > Since in Delphi we defined the DOMString to be little-endian with
> > no BOM, we currently do the following if parsing a DOMString:
> >
> PHP doesn't play around with encoding or even implement a DOMString in
> the DOM extension. If any special encoding needs to be handled using a
> string it's up to the user to encode it as needed. The specified
> document encoding or BOM is what is used to determine encoding as I
This is not restricted to parsing of a DOMString.
With the DOM Load & Save module you can override the encoding
declaration of the XML entity via the LSInput.encoding property:
"For other sources of input [other than DOMString], an encoding
specified by means of this attribute will override any encoding
specified in the XML declaration or the Text declaration, or an encoding
obtained from a higher level protocol, such as HTTP [IETF RFC 2616]."
http://www.w3.org/TR/2003/CR-DOM-Level-3-LS-20031107/load-save.html#LS-LSInput-encoding
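To illustrate the override, together with the earlier suggestion to change or remove the encoding declaration when the input stream is re-encoded, here is a minimal Python sketch. It is not how libxml2, lxml, or any DOM implementation does it; the function name decode_with_override and the regular expression are hypothetical and only cover the common case of a well-formed XML declaration at the very start of the input.

```python
import re

# Matches the version part of an XML declaration (group 1) followed by an
# encoding pseudo-attribute. Hypothetical helper pattern for illustration only.
_XML_DECL_ENCODING = re.compile(
    r'^(<\?xml\s+version\s*=\s*(?:"[^"]*"|\'[^\']*\'))'
    r'\s+encoding\s*=\s*(?:"[^"]*"|\'[^\']*\')'
)


def decode_with_override(raw: bytes, override_encoding: str) -> str:
    """Decode raw bytes with a caller-supplied encoding and drop the now
    possibly misleading encoding declaration from the XML declaration."""
    text = raw.decode(override_encoding)
    # Drop a leading BOM character if the decoder left one in place.
    if text.startswith("\ufeff"):
        text = text[1:]
    # Keep '<?xml version="..."' and remove only the encoding pseudo-attribute.
    return _XML_DECL_ENCODING.sub(r"\1", text, count=1)
```

The result is a character string without an encoding declaration, so whatever parses it next cannot be misled by a stale one. Input without an XML declaration passes through unchanged.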
> really don't agree with overriding encoding and haven't heard any
> complaints yet.
Then PHP doesn't use (hasn't implemented) the LS module.
For LSInput.stringData (which is of type DOMString) it reads:
"String data to parse. If provided, this will always be treated as a
sequence of 16-bit units (UTF-16 encoded characters)."
http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/load-save.html#LS-LSInput-stringData
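To make "a sequence of 16-bit units" concrete, here is a small Python sketch that splits a string into UTF-16 code units. It assumes the little-endian, no-BOM layout mentioned above for our Delphi DOMString; the function name to_utf16_units is hypothetical.

```python
from typing import List


def to_utf16_units(text: str) -> List[int]:
    """Return the UTF-16 code units of text (little-endian, no BOM)."""
    data = text.encode("utf-16-le")  # this codec never emits a BOM
    # Every code unit occupies exactly two bytes.
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]
```

A character outside the Basic Multilingual Plane occupies two units (a surrogate pair), so the number of 16-bit units in a DOMString is not necessarily the number of characters.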
> I do have a question on Kasimier's latest comment in the bug report
> about keeping any specified encoding of the document. If this value is
> not kept, then what encoding is used when the document is serialized and
> not explicitly passed to the save functions? Would it use the overriding
> value rather than the original one specified in the XMLDecl?
If one is using the Load & Save module [1], then this is defined in:
http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/load-save.html#LS-LSSerializer-write
So the sequence of obtaining the encoding for serialization is:
1) LSOutput.encoding
2) Document.inputEncoding
3) Document.xmlEncoding
... with a fallback to UTF-8 if none of the above is specified.
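The lookup order above can be sketched in a few lines of Python. The function and parameter names are hypothetical and simply stand in for LSOutput.encoding, Document.inputEncoding and Document.xmlEncoding; an unset value is modelled as None or an empty string.

```python
from typing import Optional


def pick_serialization_encoding(output_encoding: Optional[str],
                                input_encoding: Optional[str],
                                xml_encoding: Optional[str]) -> str:
    """Return the first encoding that is set, in the order
    LSOutput.encoding, Document.inputEncoding, Document.xmlEncoding,
    falling back to UTF-8 if none of them is specified."""
    for candidate in (output_encoding, input_encoding, xml_encoding):
        if candidate:
            return candidate
    return "UTF-8"
```

So in this sketch, if the parser was told to override the declared encoding and the overriding value ends up in Document.inputEncoding, it takes precedence over the one from the XMLDecl unless an output encoding is passed explicitly.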
> In any event whatever change is made to this I doubt it will have any
> impact on my side in terms of breakage since I don't muck around with
> encoding while parsing and use different I/O routines in the event any
> changes are made here for some sort of encoding detection (e.g. HTTP
> headers, etc.).
>
> Rob
Regards,
Kasimier