RE: [xml] encoding tutorial draft


Igor, I am not sure whether you expect too little. Of course there are
programmers who have never given a thought to the general problem of data
representation in computing, but from time to time every programmer faces
a problem where (s)he needs to know the actual representation of data.
(This applies not only to character data but also to numbers and, of
course, to the more complex data types.) Not knowing what "encoding" is
produces so many traps. I think you can expect that most libxml users DO
know what an encoding is and that there are many different ways of
representing character data, since everybody has walked into one of these
traps or another at some point. And don't forget that the people we are
talking about are C programmers! I emphasize again: C!! They only need a
small reminder. They will be more grateful if supplied with a simple way
of doing the conversion to UTF-8. Maybe the smaller code snippet I made for
the FAQ should at least be mentioned in the tutorial, since 80% of all
uses of conversion will be from ISO Latin-1 (or some local Windows
codepage that is almost the same) and about 10% the other way round (just
a guess, and not counting East Asian programmers, who know the business of
encoding quite well :-):

in = "some null terminated iso latin-1 string";
temp = size = (int)strlen(in)+1; /*terminating null included*/
out_size = size*2-1; /*terminating null is just one byte*/
out = malloc((size_t)out_size); 
if (out) { /*convert only if the allocation succeeded*/
        if ((ret=isolat1ToUTF8(out, &out_size, in, &temp)) || temp-size) {
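
The snippet above is cut off in this copy. As a rough illustration of the
same sizing logic only (this is not the FAQ code and it does not call
libxml at all), here is a self-contained sketch. The helper name
latin1_to_utf8() is hypothetical. It relies on the fact that every ISO
Latin-1 byte becomes at most two UTF-8 bytes, which is where
out_size = size*2-1 comes from.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of libxml: converts a null-terminated
 * ISO Latin-1 string into a freshly allocated UTF-8 string.
 * Returns NULL if the allocation fails; the caller frees the result. */
unsigned char *latin1_to_utf8(const char *in)
{
    size_t len = strlen(in);
    /* Worst case: every input byte becomes two output bytes, plus one
     * byte for the terminating null (the same as size*2-1 above). */
    unsigned char *out = malloc(len * 2 + 1);
    unsigned char *p = out;
    size_t i;

    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)in[i];
        if (c < 0x80) {
            *p++ = c;                                   /* ASCII stays as it is */
        } else {
            *p++ = (unsigned char)(0xC0 | (c >> 6));    /* lead byte: 0xC2 or 0xC3 */
            *p++ = (unsigned char)(0x80 | (c & 0x3F));  /* continuation byte */
        }
    }
    *p = '\0';
    return out;
}
```

Called with the single Latin-1 byte 0xE4 (the umlaut ä), this should give
back the two bytes 0xC3 0xA4 followed by the terminating null. In a real
program the libxml routine or iconv is the better choice; the sketch is only
meant to show why the output buffer is sized the way it is.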


-----Original Message-----
From: Igor Zlatkovic [mailto:igor stud fh-frankfurt de]
Sent: Monday, November 04, 2002 8:21 PM
To: xml gnome org
Subject: Re: [xml] encoding tutorial draft

Exactly, Igor! I'm a known danger when actually writing code, and
slightly less dangerous but still a high risk when trying to explain

Hehe, no you are not. I believe you are very good at explaining things you
do understand. It must be the job of us developers to give you insight into
the code.

This is an excellent explanation, which I will incorporate. I'm assuming
you saw nothing in the code itself that will lead libxml immigrants

Immigrants it won't confuse, but the beginners will have a tough time. The
beginners will not be irritated by the code itself, but by the fact that
the conversion is needed. Experience says that realising this fact is the
most difficult part for beginners; understanding the code is not an issue.

I would say that the code in a libxml tutorial is not there to teach people
how to program. It is there to explain how libxml ticks, by example. In
this particular case it is there to emphasise the fact that a conversion is
needed. For that, better use complete code fragments which tell a whole
story at first sight. Do not use separate lines without a very clear common
context.

Also, this fact needs emphasising so badly that using the libxml encoding
handler facility is too much in one place. Explain the semantics of libxml
encoding handlers once the users have understood the need for

For example, consider this tutorial fragment:


Let's define a fictitious conversion facility. We give it a buffer with
text in our own encoding, whatever it may be, and receive a different
buffer back that contains the same information, encoded in UTF-8,
libxml's internal text

   xmlChar* CONVERT_TO_UTF_8(char* input);

Let's define another fictitious conversion facility that does it the other
way around:

   char* CONVERT_FROM_UTF_8(xmlChar* input);

Both facilities take a buffer of text and return another buffer with the
converted text, memory for which they allocate internally. Note again that
these CONVERT_something facilities do not really exist in libxml, but are
just a guideline on how you should use some really existing ones.

Every bit of textual data given to libxml must be in UTF-8 format.
Unfortunately, little of our data really is in that format; not even the
constant C strings are. Our data's format is usually influenced by our
current locale, and if you did nothing to affect that, then it is so in
your case as well. This means that the conversion must be done each time
our data crosses over to libxml, or whenever libxml gives us some text.

Here is an example of how we create a simple XML document with one node and
some text, then save it in a classic format:

   xmlDocPtr doc;
   xmlNodePtr rootNode;

   char* nodeName = "RootNode";
   char* nodeContent = "This is the text, the text this is.";

   xmlChar* nodeNameU8 = CONVERT_TO_UTF_8(nodeName);
   xmlChar* nodeContentU8 = CONVERT_TO_UTF_8(nodeContent);

   doc = xmlNewDoc(BAD_CAST "1.0");
   rootNode = xmlNewDocNode(doc, NULL, nodeNameU8, nodeContentU8);
   xmlDocSetRootElement(doc, rootNode);
   xmlSaveFormatFileEnc("doc.xml", doc, "ISO-8859-1", 0);

Note that our own text must be converted to UTF-8 before it enters the
libxml realm. Of course, these simple constant strings do not undergo any
change when converted to UTF-8, with any regular C compiler. However, our
text will seldom originate in a constant string. Usually this text is
obtained from different external sources, and as soon as it contains
anything so simple as a French accent or a German umlaut, libxml will
misinterpret it without conversion. The conversion is absolutely necessary,
unless your text is already in UTF-8 format by default, or unless you are
absolutely sure that the conversion will produce nothing different from the
original.
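
One way to be "absolutely sure" is to check that the text contains nothing
but 7-bit ASCII, assuming your own encoding is an ASCII superset such as
ISO Latin-1 or a Windows codepage. A minimal sketch of such a check follows;
the helper name is_pure_ascii() is hypothetical and not part of libxml:

```c
/* Hypothetical helper, not part of libxml: returns 1 if every byte of the
 * null-terminated string is 7-bit ASCII, 0 otherwise. For ASCII-compatible
 * encodings, such text is byte-for-byte identical in UTF-8. */
int is_pure_ascii(const char *s)
{
    for (; *s != '\0'; s++) {
        if ((unsigned char)*s >= 0x80)
            return 0;   /* a byte outside ASCII: conversion is needed */
    }
    return 1;
}
```

"RootNode" passes this check, while a string holding the Latin-1 byte 0xE4
(the umlaut ä) does not and must be converted first.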

The reverse procedure must be applied when we retrieve text from libxml.
Here is an example of how we read the document we created above back into

   xmlDocPtr doc;
   xmlNodePtr rootNode;

   char* nodeName;
   char* nodeContent;

   xmlChar* nodeNameU8;
   xmlChar* nodeContentU8;

   doc = xmlParseFile("doc.xml");
   rootNode = xmlDocGetRootElement(doc);
   nodeNameU8 = xmlNodeGetName(rootNode);
   nodeContentU8 = xmlNodeGetContent(rootNode);

   nodeName = CONVERT_FROM_UTF_8(nodeNameU8);
   nodeContent = CONVERT_FROM_UTF_8(nodeContentU8);

Note that some checks for error conditions should be applied in a real
program, as well as freeing the memory libxml allocated for you. Using some
real encoding conversion facility, such as iconv, is also advisable.
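
As a rough idea of what a real facility behind the CONVERT_something
placeholders might look like, here is a sketch built on POSIX iconv. The
helper name convert_with_iconv() is hypothetical. The output buffer size is
a generous guess that is enough for single-byte or UTF-8 input converted to
UTF-8 or to a single-byte encoding; targets such as UTF-16 and stateful
encodings would need more care.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of libxml: converts a null-terminated
 * string from encoding `from` to encoding `to` using iconv. Returns a
 * malloc'ed, null-terminated buffer, or NULL on any error. */
char *convert_with_iconv(const char *in, const char *from, const char *to)
{
    iconv_t cd = iconv_open(to, from);
    size_t in_left = strlen(in);
    size_t out_size = in_left * 4 + 1;  /* assumption: 4 bytes per input byte suffice */
    size_t out_left = out_size - 1;     /* keep one byte for the terminating null */
    char *in_p = (char *)in;            /* iconv does not modify the input */
    char *out, *out_p;

    if (cd == (iconv_t)-1)
        return NULL;                    /* unknown encoding name */

    out = malloc(out_size);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }
    out_p = out;

    if (iconv(cd, &in_p, &in_left, &out_p, &out_left) == (size_t)-1) {
        free(out);                      /* invalid input or buffer too small */
        iconv_close(cd);
        return NULL;
    }
    *out_p = '\0';
    iconv_close(cd);
    return out;
}
```

With this, CONVERT_TO_UTF_8(text) could stand for something like
convert_with_iconv(text, "ISO-8859-1", "UTF-8"), and the reverse direction
just swaps the two encoding names. The source encoding is hard-coded here
only for illustration; a real program would take it from the locale or from
wherever the text came from.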

Also note that you, the user, are responsible for all conversions you might
need when talking to libxml. Libxml will automatically invoke its internal
conversion facilities when it talks to external data sources or
destinations. When you wrote the file, you only needed to specify the
desired output encoding; libxml did the conversion automatically. When you
loaded the file, you didn't need to specify anything; libxml detected and
handled the encoding on its own. However, when you want to use text libxml
gave you, or want libxml to use the text you give to it, you must handle
the format differences yourself.


Please ignore the fact that xmlNodeGetName is a non-existent function. We
should make one.

Well? I believe that all, really all users must be aware of the fact that
conversion is necessary after reading this. Precisely that fact is the most
common source of trouble. Most users find it hard to believe that each and
every string they send into libxml must be converted. They find it equally
hard to accept that they must convert everything that comes from libxml.
The sad fact is that most users really need the conversion. Those who work
with UTF-8 directly are rare and not affected by all this anyway.

I realise that writing a whole tutorial such as this one is a full-time job
that takes weeks to complete. Beyond that, it would take more time to
correct and adapt everything once the feedback is there about what is still
missing and what is still being misunderstood. Whoever manages to put such
a whole tutorial together has every right to charge money for all inquiries
such as "how can I pass an umlaut ä to libxml". :-)

