Re: [xml] Control over encoding declaration (prolog and meta)


on 1/15/2004 2:54 PM Daniel Veillard wrote:
On Thu, Jan 15, 2004 at 01:19:13PM +0100, Kasimier Buchcik wrote:

Ok, this issue is DOM 3 related. As you might remember, I'm still 
struggling with "to DOMString serialization" and "from DOMString 
parsing", which always has to be UTF-16 encoded, regardless of the 
content; so if I have e.g. an ISO-8859-1 document, I still need it to be 
serialized to UTF-16, but it still *has to* contain an encoding 
declaration of ISO-8859-1.

 No, I'm not sure I understand. 
DOM decided to use UTF-16 for internal representation and interface,
libxml2 decided to use UTF-8. I don't see the relationship w.r.t. 

There is no relationship.

If the DOM3 APIs allow serialization but don't allow control over
the effective encoding, they are buggy, and you should send
a comment to the working group for clarification.

Hmm, ok, I guess I did not explain it clearly enough; so here the specs:


The DOMString type is used to store [Unicode] characters as a code unit 
string as defined in section 3.4 of [CharModel]. Applications must 
encode the characters using UTF-16 as defined in [Unicode] and Amendment 
1 of [ISO/IEC 10646].
DOM 3 LS - LSSerializer.writeToString

The output is written to a DOMString that is returned to the caller 
(this method completely ignores all the encoding information available).

XHTML is XML, the tools MUST parse it following the XML rules which are
crystal clear, if your instance says "ISO-8859-1" and is encoded in

As stated above, XML spec on the one side, DOM spec on the other.

Sorry, I'm having a hard time with this.

You are not alone here.

Maybe DOM3 is really broken. There is a workaround : save with libxml2
and then convert back to UTF16 with a string conversion API.


Daniel, you wrote in some of your mails on the list that there are too many 
entrypoints to the library already; I understand your concern, and 
things like the xmlReadxxx API with all the nice options are really 
compact and concise. So I wonder if it would be good to have an 
xmlSerializexxx API; a serialization context sounds a bit heavy, but 
more flexible, allowing extensible options in the future. And I would 
be happy about a field "declaredEncoding" taking a custom encoding to be 
declared. I really think the serialization will become far more complex, 
and should be more customizable, if (hopefully) libxml2 will try to help 
out more with DOM stuff in the future.

  If DOM is broken w.r.t. XML, well DOM must be fixed, not XML or
the zillion libraries and tools using it.

IMHO, the DOM people had a good reason to do it this way. Think 
of an XML editor that is not able to handle all the zillion encodings 
out there; with a specification that serializes any node to a string in 
one fixed encoding, all components just have to understand Unicode to 
work with the data *without* changing the encoding information.

Finally I must admit that there would be a workaround for me: I could 
serialize with the existing API, then encode to UTF-16LE. But since we 
are using quite huge documents, I guess it will not be acceptable 
performance-wise, and it seems rather stupid.

  Where is the stupidity coming from ? I think forcing the encoding

With "stupid" I meant that the effort needed to get the desired encoding 
declaration seems a bit out of proportion.

of a string containing a serialized document to be different from the
real encoding of the document because of a braindead interface decision is
where the stupidity lies. That's what must be fixed.

Hmm, Daniel, I guess you wouldn't like people calling your 
implementation "braindead", and I guess the DOM people don't like it either.

  If DOM3 is stupid, get it fixed or don't use it, what else can I say ?

:-) you know it's not *that* easy...


