Re: [xml] encoding tutorial draft
- From: Igor Zlatkovic <igor stud fh-frankfurt de>
- To: xml gnome org
- Subject: Re: [xml] encoding tutorial draft
- Date: Mon, 04 Nov 2002 14:57:08 +0100
Feedback encouraged. Among other things, I've not explained (because I
don't understand) the role of iconv in all this magic, and I've not
given a definitive list of the encodings supported.
Without iconv support, libxml2 has native support only for the UTF-8,
UTF-16 and ISO-8859-1 encodings. If it has been compiled with iconv,
it will use it to support the whole set available from the iconv library
(which itself depends on the iconv implementation).
I dare to believe that John needs more explanation, for he stated he is not
a programmer. :-)
Formally:
By 'program', I mean a process with its own virtual address space. This
is, for example, an application based on libxml, together with libxml and
all other required libraries.
* Data *internal* to the program is everything stored in the computer's main
memory and accessible to the program. Mostly, if not always, this data is
generated by the program itself.
* Data *external* to the program is all other data. External data comes and
goes through various channels, such as the local disc and the network.
Libxml uses iconv only to convert external data on its way into or out of
the program. Internal data is always kept in UTF-8 and assumed to be in
UTF-8. This gives:
1a. All data the program generates in memory must be in UTF-8 by the time
it reaches libxml code.
1b. All data libxml generates in memory will be in UTF-8 by the time the
rest of the program sees it.
2a. All data which libxml must gather from external sources must be in a
format which can be converted to UTF-8, using conversion facilities
available to libxml.
2b. All data which libxml must store at external destinations must be
stored in a format UTF-8 can be converted to, using conversion facilities
available to libxml.
Without iconv, only UTF-8, UTF-16 and ISO-8859-1 can be used as external
formats. With iconv, any format can be used provided iconv is able to
convert it to and from UTF-8. Currently iconv supports about 150 different
character formats with the ability to convert from any to any. While the actual
number of supported formats varies between implementations, every iconv
implementation is almost guaranteed to support every format anyone has ever
heard of.
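
As a rough illustration of rules 2a and 2b, here is a minimal sketch in
Python. Python's built-in codecs stand in for iconv, libxml itself is not
involved, and the names read_external and write_external are made up for
this example.

```python
def read_external(raw: bytes, encoding: str) -> bytes:
    """Rule 2a: convert external bytes to the internal UTF-8 format."""
    return raw.decode(encoding).encode("utf-8")


def write_external(internal: bytes, encoding: str) -> bytes:
    """Rule 2b: convert internal UTF-8 bytes to an external format."""
    return internal.decode("utf-8").encode(encoding)


# The byte 0xA4 is the euro sign in ISO-8859-15.
internal = read_external(b"\xa4", "iso-8859-15")

# The same data can leave the program again as ISO-8859-15 ...
round_trip = write_external(internal, "iso-8859-15")

# ... but not as ISO-8859-1, which has no euro sign to convert to.
try:
    write_external(internal, "iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```

The failing case is what rule 2b is about: the external format must be one
that the UTF-8 data at hand can actually be converted to.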
Every system with iconv has an 'iconv' executable which can be called as
'iconv -l' to retrieve a full list of all format identifiers supported by
the iconv implementation. Note that this does not reflect the actual number
of formats, because one and the same format often has more than one
identifier. For example, 'ISO-8859-15' is known as 'Latin0' as well.
'UCS-4', 'UCS4', 'UCS-4-INTERNAL' and 'UCS-4LE' all refer to the same format
on my platform.
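
The same "many identifiers, one format" effect can be seen in other
conversion libraries. The sketch below uses Python's codec registry as a
stand-in, so the alias names are Python's and not necessarily the ones
'iconv -l' prints. The helper same_format is made up for this example.

```python
import codecs


def same_format(name_a: str, name_b: str) -> bool:
    """True if both identifiers resolve to the same registered codec."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


# Two identifiers for one format (Python's aliases, not iconv's).
aliases_match = same_format("ISO-8859-15", "latin9")

# Two genuinely different formats.
different = same_format("ISO-8859-15", "ISO-8859-1")
```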
Knowing that, it is worth pointing out that this internal/external view is
not the main problem of most users. The main problem is that their programs
use different formats for the internal data in different parts of code. The
most common case is that the application code they wrote themselves
assumes ISO-8859-1 to be the internal data format. This they combine with
libxml, which assumes UTF-8 to be the internal data format.
The result is an application (to which libxml belongs according to the
definition above) which treats internal data differently, depending on which
code section is executing. One part of the code or the other will then,
naturally, misinterpret the data.
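
To make the misinterpretation concrete, here is a small Python sketch (no
libxml involved) of what happens to a single 'é' when the two halves of a
program disagree about the internal format:

```python
text = "\u00e9"  # the single character 'é'

# Code that assumes UTF-8 stores it as two bytes.
utf8_bytes = text.encode("utf-8")

# Code that assumes ISO-8859-1 reads those two bytes as two characters.
misread = utf8_bytes.decode("iso-8859-1")

# The other direction: ISO-8859-1 stores the character as one byte ...
latin1_bytes = text.encode("iso-8859-1")

# ... which is not a valid UTF-8 sequence on its own.
try:
    latin1_bytes.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

In one direction the data silently turns into garbage, in the other it is
rejected as malformed.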
These users must divide the internal data further, into libxml-internal and
user-internal, draw a line between the two and convert all data traffic that
crosses the line in either direction. How to convert must remain the user's
worry and choice. Using iconv is the best bet. The point to stress is that
libxml will not do this for them. They must do it themselves, because only
they know where the line actually is.
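
A minimal sketch of such a line, assuming the user's own code works in
ISO-8859-1. It is written in Python with its built-in codecs standing in for
iconv. The names to_libxml, from_libxml and USER_ENCODING are hypothetical,
and nothing here calls libxml.

```python
USER_ENCODING = "iso-8859-1"  # assumption: what the user's own code expects


def to_libxml(user_bytes: bytes) -> bytes:
    """Crossing the line towards libxml: user format -> UTF-8."""
    return user_bytes.decode(USER_ENCODING).encode("utf-8")


def from_libxml(utf8_bytes: bytes, errors: str = "strict") -> bytes:
    """Crossing the line away from libxml: UTF-8 -> user format."""
    return utf8_bytes.decode("utf-8").encode(USER_ENCODING, errors)


name = b"Z\xfcrich"                    # 'Zürich' in ISO-8859-1
crossing_in = to_libxml(name)          # what libxml should be handed
crossing_back = from_libxml(crossing_in)

# UTF-8 can hold characters that ISO-8859-1 cannot. What to do with them
# is the user's choice; here they are replaced with '?'.
lossy = from_libxml("\u20ac5".encode("utf-8"), errors="replace")
```

The way back is the direction that needs a policy, because data coming from
libxml may contain characters the user's format has no place for.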
Ciao
Igor