Re: [xml] Control over encoding declaration (prolog and meta)


on 1/15/2004 11:52 AM Daniel Veillard wrote:

On Thu, Jan 15, 2004 at 10:34:05AM +0100, Kasimier Buchcik wrote:

Would this approach be thread-safe? I would expect this procedure to 
temporarily change the encoding handler for e.g. ISO-8859-1 to UTF-16LE, 
but if I'm serializing another document to ISO-8859-1 at the same time, 
I would get wrong results. Or is the registration of encoding handlers 
somehow implemented per thread?

  No it's global.

Ok, so I can't use it.

Lying about encodings is bad. I don't know why you want to do this,
but I don't want to start making specialized APIs for this reason.

:-) Lying about encodings: "Hey dude, you tricked me with that encoding. 
   It says ISO-stuff but it's UTF-stuff. Gimme back my bucks!"

Ok, this issue is DOM 3 related. As you might remember I'm still 
struggling with "to DOMString serialization" and "from DOMString 
parsing". A DOMString always has to be UTF-16 encoded, regardless of the 
content; so if I have e.g. an ISO-8859-1 document I still need it to be 
serialized to UTF-16, but it still *has to* contain an encoding 
declaration of ISO-8859-1. It sounds like no big deal, but if I don't 
have control over both the target encoding and the declared encoding, I 
can't fulfill the requirements of the DOM 3 spec.

Encodings are registered globally, I think it's a sound decision, it's
a framework capacity and an API that I expect to be used once at startup.

Yes, I agree.
I think it's not an encoding issue, but rather: "let *me* decide how the 
declaration goes, I'm big enough to decide if it's wrong or not".

If you have a completely broken requirement, fork, do the unclean stuff
in the forked process and be done with it. If there is a speed penalty,
then that will give people an incentive to fix the receiving side. Sorry
this is not a valuable reason to add even more confusing APIs, increase
libxml2 code and overall complexity.

I know that libxml2 has not much to do with DOM requirements. But I 
would not call the implementation of a DOM 3 requirement "unclean stuff".

XHTML is XML, the tools MUST parse it following the XML rules which are
crystal clear, if your instance says "ISO-8859-1" and is encoded in

As stated above, XML spec on the one side, DOM spec on the other.

"UTF-16LE" then it's a well formedness error, unless you get something
like an HTTP header telling what the real encoding is (and I personally
consider this a terribly bad kludge, but that's how it is).

So the sum of use cases has risen to 2 :-)

Daniel, you wrote in some of your mails on the list that there are too many 
entry points to the library already; I understand your concern, and 
things like the xmlReadxxx API with all the nice options are really 
compact and concise. So I wonder if it would be good to have an 
xmlSerializexxx API; a serialization context sounds a bit heavy, but 
it is more flexible, allowing extensible options for the future. And I would 
be happy about a field "declaredEncoding" taking a custom encoding to be 
declared. I really think serialization will become far more complex, 
and should be more customizable, if (hopefully) libxml2 tries to help 
out more with DOM stuff in the future.
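
To make the "declaredEncoding" idea a bit more concrete, here is a minimal sketch in plain C. Nothing in it exists in libxml2: the struct SerializeCtxtSketch, its fields and the helper write_xml_decl() are hypothetical names made up for illustration only. It just shows the declared encoding being chosen independently of the target encoding, falling back to the target encoding when nothing custom is set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical serialization context; NOT part of libxml2. */
typedef struct {
    const char *targetEncoding;   /* encoding the bytes are written in */
    const char *declaredEncoding; /* encoding named in the XML declaration,
                                     NULL means "same as targetEncoding" */
    int options;                  /* room for future option flags */
} SerializeCtxtSketch;

/*
 * Write the XML declaration into buf (at most size bytes, NUL-terminated).
 * The declared encoding wins over the target encoding when it is set.
 * Returns what snprintf() returns: the length the full declaration needs.
 */
int write_xml_decl(const SerializeCtxtSketch *ctxt, char *buf, size_t size)
{
    const char *enc;

    assert(ctxt != NULL && buf != NULL);
    enc = ctxt->declaredEncoding != NULL ? ctxt->declaredEncoding
                                         : ctxt->targetEncoding;
    if (enc == NULL)
        return snprintf(buf, size, "<?xml version=\"1.0\"?>");
    return snprintf(buf, size,
                    "<?xml version=\"1.0\" encoding=\"%s\"?>", enc);
}
```

With targetEncoding set to "UTF-16LE" and declaredEncoding set to "ISO-8859-1", the declaration names ISO-8859-1 while the output bytes would be produced by whatever encoder handles the target encoding, which is exactly the DOMString case described above.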

Finally I must admit that there would be a workaround for me: I could 
serialize with the existing API, then encode to UTF-16LE. But since we 
are using quite huge documents, I guess it will not be acceptable in 
terms of performance and seems rather stupid.
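
For illustration, the second pass of that workaround could look roughly like the sketch below. It assumes the first pass already produced the document as ISO-8859-1 bytes (for example via xmlDocDumpMemoryEnc(), which is not shown here); the function name latin1_to_utf16le() is my own and not a libxml2 API. Since every ISO-8859-1 byte has the same value as its Unicode code point, each byte simply becomes one 16-bit code unit, low byte first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Re-encode len bytes of ISO-8859-1 text as UTF-16LE (no BOM is added).
 * Returns a malloc()ed buffer the caller must free(), or NULL on
 * allocation failure or size overflow. *out_len receives the number of
 * bytes written, which is always 2 * len.
 */
unsigned char *latin1_to_utf16le(const unsigned char *in, size_t len,
                                 size_t *out_len)
{
    unsigned char *out;
    size_t i;

    assert(in != NULL && out_len != NULL);
    if (len > SIZE_MAX / 2)
        return NULL;                    /* 2 * len would overflow */

    out = malloc(len > 0 ? 2 * len : 1); /* avoid a zero-size malloc */
    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        out[2 * i]     = in[i];         /* low byte: the Latin-1 value */
        out[2 * i + 1] = 0;             /* high byte: always zero */
    }
    *out_len = 2 * len;
    return out;
}
```

The encoding declaration inside the text is copied through untouched, so it still says ISO-8859-1. The price is a second full pass over the document plus a second buffer twice the size of the first, which is the cost I am worried about for huge documents.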


