Re: [xml] encoding tutorial draft



Feedback encouraged. Among other things, I've not explained (because I
don't understand) the role of iconv in all this magic, and I've not
given a definitive list of the encodings supported.

  Without iconv support, libxml2 has native support only for the UTF-8,
UTF-16 and ISO-8859-1 encodings. If it has been compiled with iconv,
it will use it to support the full set of encodings available from the
iconv library (which itself depends on the iconv implementation).

I dare say John needs more explanation, since he stated he is not a programmer. :-)

Formally:

By 'program' I mean a process with its own virtual address space. This is, for example, an application based on libxml, together with libxml itself and all other required libraries.

* Data *internal* to the program is everything stored in the computer's main memory and accessible to the program. Mostly, if not always, this data is generated by the program itself.
* Data *external* to the program is all other data. External data comes and goes through various channels, such as the local disc and the network.

Iconv is only used by libxml to convert external data on its way into or out of the program. Internal data is always kept in UTF-8 and assumed to be in UTF-8. This gives:

1a. All data the program generates in memory must be in UTF-8 by the time it reaches libxml code.
1b. All data libxml generates in memory will be in UTF-8 by the time the rest of the program sees it.
2a. All data which libxml must gather from external sources must be in a format which can be converted to UTF-8, using conversion facilities available to libxml.
2b. All data which libxml must store at external destinations must be stored in a format UTF-8 can be converted to, using conversion facilities available to libxml.
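
Rules 2a and 2b can be illustrated with a small sketch. It is written in Python rather than C, and Python's built-in codecs stand in for iconv, so it shows the idea only and is not what libxml does internally. The names read_external and write_external are hypothetical.

```python
# Sketch only: Python's codec machinery stands in for iconv here.
# read_external / write_external are hypothetical names, not libxml API.

def read_external(raw: bytes, encoding: str) -> bytes:
    """Convert external bytes in `encoding` to internal UTF-8 (rule 2a)."""
    return raw.decode(encoding).encode("utf-8")


def write_external(internal_utf8: bytes, encoding: str) -> bytes:
    """Convert internal UTF-8 bytes to external `encoding` (rule 2b)."""
    return internal_utf8.decode("utf-8").encode(encoding)


# The word "cafe" with an e-acute, as stored on disc in ISO-8859-1:
# the accented letter is the single byte 0xE9.
on_disc = b"caf\xe9"

# Inside the program the same letter becomes the two UTF-8 bytes 0xC3 0xA9.
in_memory = read_external(on_disc, "iso-8859-1")
print(in_memory)

# On the way out, the data can go back to any format the converter knows.
print(write_external(in_memory, "iso-8859-1"))
```

The external format is a parameter because it is a property of the file or network stream, while the internal format is fixed.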

Without iconv, only UTF-8, UTF-16 and ISO-8859-1 can be used as external formats. With iconv, any format can be used provided iconv is able to convert it to and from UTF-8. Currently iconv supports about 150 different character formats, with the ability to convert from any to any. While the actual number of supported formats varies between implementations, practically every iconv implementation supports the formats you are likely to meet.

Every system with iconv has an 'iconv' executable which can be called as 'iconv -l' to retrieve a full list of all format identifiers supported by that iconv implementation. Note that the length of this list does not reflect the actual number of formats, because one and the same format often has more than one identifier. For example, 'ISO-8859-15' is known as 'Latin0' as well. 'UCS-4', 'UCS4', 'UCS-4-INTERNAL' and 'UCS-4LE' all refer to the same format on my platform.
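
The point about several identifiers naming one format can be checked with a short sketch. It uses Python's codec registry, not iconv, so the set of aliases it knows is Python's own and will differ from what 'iconv -l' prints on a given system. The helper same_format is a hypothetical name.

```python
import codecs


def same_format(name_a: str, name_b: str) -> bool:
    """Return True if both identifiers resolve to the same codec.

    codecs.lookup() raises LookupError for an identifier it does not know.
    """
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


# Two spellings of one format.
print(same_format("ISO-8859-15", "latin9"))
print(same_format("ISO-8859-1", "latin1"))

# Two genuinely different formats.
print(same_format("ISO-8859-1", "ISO-8859-15"))
```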

Knowing that, it is worth pointing out that this internal/external view is not the main problem for most users. The main problem is that their programs use different formats for internal data in different parts of the code. The most common case is that the application code they wrote themselves assumes ISO-8859-1 to be the internal data format. They then combine this with libxml, which assumes UTF-8 to be the internal data format.

The result is an application (to which libxml belongs according to the definition above) which treats internal data differently, depending on which section of code is executing. One part of the code or the other will then, naturally, misinterpret the data.
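
What "misinterpret" means in practice can be shown with one accented character. The sketch is in Python and uses its built-in codecs; the variable names are illustrative only.

```python
# One character, e-acute (U+00E9), in the two formats discussed above.
as_utf8 = "\u00e9".encode("utf-8")          # two bytes: 0xC3 0xA9
as_latin1 = "\u00e9".encode("iso-8859-1")   # one byte:  0xE9

# Code that assumes ISO-8859-1 reads the UTF-8 bytes as two separate
# characters (A-tilde followed by a copyright sign) instead of one.
garbled = as_utf8.decode("iso-8859-1")
print(len(garbled))

# Code that assumes UTF-8 cannot read the ISO-8859-1 byte at all:
# 0xE9 announces a multi-byte sequence that never arrives.
try:
    as_latin1.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(rejected)
```

The first failure is silent and only shows up as odd characters on screen; the second is reported as an error.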

These users must divide the internal data further, into libxml-internal and user-internal, draw a line between the two, and convert all data that crosses the line in either direction. How to convert must remain the user's worry and choice; using iconv is the best bet. The point to stress is that libxml will not do this for them. They must do it themselves, because only they know where the line actually is.
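
A minimal sketch of such a line, assuming the user's own code works in ISO-8859-1. It is in Python with its built-in codecs in place of iconv, and the helper names user_to_libxml and libxml_to_user are hypothetical; libxml is not actually called.

```python
# Sketch only: hypothetical helpers sitting on the line between
# user-internal data and libxml-internal data.
USER_ENCODING = "iso-8859-1"   # assumption: what the user's own code expects


def user_to_libxml(data: bytes) -> bytes:
    """Apply to every string handed to libxml: user format -> UTF-8."""
    return data.decode(USER_ENCODING).encode("utf-8")


def libxml_to_user(data: bytes) -> bytes:
    """Apply to every string received from libxml: UTF-8 -> user format.

    Raises UnicodeEncodeError if the text contains a character that the
    user format cannot represent.
    """
    return data.decode("utf-8").encode(USER_ENCODING)


name = b"J\xfcrgen"                      # u-umlaut as one ISO-8859-1 byte
round_trip = libxml_to_user(user_to_libxml(name))
print(round_trip == name)

# The euro sign is valid UTF-8 but does not exist in ISO-8859-1.
euro = "\u20ac".encode("utf-8")
try:
    libxml_to_user(euro)
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False
print(euro_fits)
```

The second case shows why the choice of how to convert stays with the user: UTF-8 can hold characters the user format cannot, and only the application can decide whether to reject, replace or escape them.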

Ciao
Igor



