Re: [xml] libxml and encoding's



Hello Ricardo,

I want to give you some background on the character encoding issues and
the use of UTF-8 in libxml2. I hope these remarks are somewhat helpful.

> I'm confused about how to write C code to use with libxml2. Do I need to
> convert "próp" to UTF-8 before calling xmlGetProp, and then convert the
> value of the property to ISO-8859-1? ... If the answer is "yes", why does
> this function not do the encoding conversion from UTF-8 to the declared
> encoding internally?

If you mean "why" in the sense of "Why the heck doesn't it work the way I
want it to (and can it be changed?)", I have to disappoint you.

It is this way for good reasons, and I would be very confused if it ever
changed.

If you are asking "why" to learn the reason, please consider the positive
side of this.

XML data is always Unicode! Independent of the actual encoding chosen for
the text file representing the abstract data, pretty much every Unicode
character can occur in the data. If a character is not available in the
file's encoding, it is written as a numeric character reference (for
example, &#8364; for the euro sign in an ISO-8859-1 file).
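
To make that last point concrete, here is a minimal sketch in plain C of
what a serializer has to do when the target encoding is ISO-8859-1. The
function names are hypothetical and this is not libxml2's own code; it
also assumes well-formed input and leaves out the escaping of markup
characters such as "&" and "<" that a real serializer would need.

```c
#include <stddef.h>
#include <stdio.h>

/* Decode one UTF-8 sequence at s and store the code point in *cp.
 * Returns the number of bytes consumed, or 0 on malformed input.
 * Simplified: overlong forms and surrogates are not rejected. */
static size_t utf8_decode(const unsigned char *s, unsigned long *cp)
{
    size_t n, i;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; n = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; n = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; n = 4; }
    else return 0;

    for (i = 1; i < n; i++) {
        /* A continuation byte must look like 10xxxxxx; this test also
         * stops at a premature terminating NUL. */
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return n;
}

/* Hypothetical helper: write the UTF-8 text `in` to `out` as ISO-8859-1
 * bytes, replacing every character above U+00FF with a decimal numeric
 * character reference. Returns 0 on success, -1 on malformed input or
 * if `out` is too small. */
int utf8_to_latin1_with_refs(const char *in, char *out, size_t outsize)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t used = 0;

    while (*s) {
        unsigned long cp;
        size_t n = utf8_decode(s, &cp);

        if (n == 0) return -1;
        if (cp <= 0xFF) {
            /* Representable in ISO-8859-1: one byte, same value. */
            if (used + 1 >= outsize) return -1;
            out[used++] = (char)(unsigned char)cp;
        } else {
            /* Not representable: fall back to "&#NNNN;". */
            int w = snprintf(out + used, outsize - used, "&#%lu;", cp);
            if (w < 0 || (size_t)w >= outsize - used) return -1;
            used += (size_t)w;
        }
        s += n;
    }
    if (used >= outsize) return -1;
    out[used] = '\0';
    return 0;
}
```

With this sketch, the UTF-8 text for "próp" followed by a space and a euro
sign comes out as the Latin-1 bytes of "próp", a space, and the reference
&#8364;. An application that insisted on a non-Unicode API would have to
do this, and the reverse, by itself.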

So, for this reason alone, the API for accessing the XML data had better
be Unicode (and if religious debates are going to be started, the debate
should be over UTF-8 versus UTF-16 versus UTF-32). Otherwise, all the
low-level work of encoding and decoding these character references would
have to be done by your application.

For apps that access rather generic XML data, this reasoning should be
clear (and it will save you from saying "but I will only ever see
ISO-8859-1" now and waking up the whole character encoding mess later).

A somewhat different view can be taken when you use XML only to store and
retrieve specific data of your application. But then, in any reasonable
design, there will be some layer where your data is XMLized (or the other
way around), and in this layer it is easy to plug in the character set
conversion.
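
As a rough illustration of such a layer, the two conversions for
ISO-8859-1 fit in a few lines of plain C. The function names below are
hypothetical and the code is only a sketch. libxml2 ships its own
converters, isolat1ToUTF8 and UTF8Toisolat1 in <libxml/encoding.h>, which
you would normally prefer; the hand-written version is here only to show
how little is involved.

```c
#include <stddef.h>

/* Hypothetical helper: ISO-8859-1 -> UTF-8. A Latin-1 byte value is the
 * Unicode code point itself, so bytes >= 0x80 become two-byte sequences.
 * Returns 0 on success, -1 if `out` is too small. */
int latin1_to_utf8(const char *in, char *out, size_t outsize)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t used = 0;

    for (; *s; s++) {
        if (*s < 0x80) {
            if (used + 1 >= outsize) return -1;
            out[used++] = (char)*s;
        } else {
            if (used + 2 >= outsize) return -1;
            out[used++] = (char)(0xC0 | (*s >> 6));    /* lead byte */
            out[used++] = (char)(0x80 | (*s & 0x3F));  /* continuation */
        }
    }
    if (used >= outsize) return -1;
    out[used] = '\0';
    return 0;
}

/* Hypothetical helper: UTF-8 -> ISO-8859-1. Returns 0 on success, -1 if
 * the input is malformed, contains a character above U+00FF, or `out`
 * is too small. */
int utf8_to_latin1(const char *in, char *out, size_t outsize)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t used = 0;

    while (*s) {
        unsigned char c;

        if (*s < 0x80) {
            c = *s++;
        } else if ((s[0] == 0xC2 || s[0] == 0xC3) &&
                   (s[1] & 0xC0) == 0x80) {
            /* Only lead bytes C2 and C3 encode U+0080..U+00FF. */
            c = (unsigned char)(((s[0] & 0x03) << 6) | (s[1] & 0x3F));
            s += 2;
        } else {
            return -1;
        }
        if (used + 1 >= outsize) return -1;
        out[used++] = (char)c;
    }
    if (used >= outsize) return -1;
    out[used] = '\0';
    return 0;
}
```

In your case the layer would convert "próp" with the first function, pass
the result (cast to const xmlChar *) to xmlGetProp, convert the returned
value with the second function, and release libxml2's string with xmlFree.
The -1 return from the second function is exactly the case where the data
contains a character that ISO-8859-1 cannot hold, and your application
has to decide what to do about it.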

Regards,
Peter Jacobi



