Re: [xml] encoding tutorial draft
- From: Igor Zlatkovic <igor stud fh-frankfurt de>
- To: xml gnome org
- Subject: Re: [xml] encoding tutorial draft
- Date: Mon, 04 Nov 2002 20:21:12 +0100
> Exactly, Igor! I'm a known danger when actually writing code, and
> slightly less dangerous but still a high risk when trying to explain
> code.
Hehe, no you are not. I believe you are very good at explaining things you
do understand. It must be the job of us developers to give you insight into
the code.
> This is excellent explanation, which I will incorporate. I'm assuming
> you saw nothing in the code itself that will lead libxml immigrants
> astray?
It won't confuse immigrants, but beginners will have a tough time.
Beginners will not be irritated by the code itself, but by the fact that
a conversion is needed. Experience says that realising this fact is the
hardest part for beginners; understanding the code is not an issue.
I would say that the code in a libxml tutorial is not there to teach people
how to program. It is there to explain how libxml ticks, by example. In this
particular case it is there to emphasise the fact that a conversion is
needed. For that, better use complete code fragments which tell a whole
story at first sight. Do not use separate lines without a very clear
common context.
Also, this fact needs emphasising so badly that introducing the libxml
encoding handler facility in the same place is too much. Explain the
semantics of libxml encoding handlers once the users have understood the
need for conversion.
For example, consider this tutorial fragment:
---------
Let's define a fictitious conversion facility. We give it a buffer with text
in our own encoding, whatever it may be, and receive a different buffer back
that contains the same information, encoded in UTF-8, libxml's internal text
format:
xmlChar* CONVERT_TO_UTF8(char* input);
Let's define another fictitious conversion facility that does it the other
way around:
char* CONVERT_FROM_UTF8(xmlChar* input);
Both facilities take a buffer of text and return another buffer with the
converted text, memory for which they allocate internally. Note again that
these CONVERT_something facilities do not really exist in libxml, but are
just a guideline on how you should use the ones that really do exist.
Every bit of textual data given to libxml must be in UTF-8 format.
Unfortunately, little of our data really is in that format, not even the
constant C strings are. Our data's format is usually influenced by our
current locale and if you did nothing to affect that, then it is so in your
case as well. This means that the conversion must be done each time our data
crosses over to libxml, or whenever libxml gives us some text.
Here is an example of how we create a simple XML document with one node and
some text, then save it in a classic format:
xmlDocPtr doc;
xmlNodePtr rootNode;
char* nodeName = "RootNode";
char* nodeContent = "This is the text, the text this is.";
/* Convert our own text to UTF-8 before it enters libxml. */
xmlChar* nodeNameU8 = CONVERT_TO_UTF8(nodeName);
xmlChar* nodeContentU8 = CONVERT_TO_UTF8(nodeContent);
doc = xmlNewDoc(BAD_CAST "1.0");
rootNode = xmlNewDocNode(doc, NULL, nodeNameU8, nodeContentU8);
xmlDocSetRootElement(doc, rootNode);
/* libxml converts to the requested output encoding when saving. */
xmlSaveFormatFileEnc("doc.xml", doc, "ISO-8859-1", 0);
Note that our own text must be converted to UTF-8 before it enters the
libxml realm. Of course, these simple constant strings do not undergo any
change when converted to UTF-8, with any regular C compiler. However, our
text will seldom originate in a constant string. Usually this text is
obtained from different external sources, and as soon as it contains
anything so simple as a French accent or a German umlaut, libxml will
misinterpret it without conversion. The conversion is absolutely necessary,
unless your text is already in UTF-8 format by default, or unless you are
absolutely sure that the conversion will produce nothing different from the
original.
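To make the effect of the conversion concrete, here is a minimal sketch of
what a CONVERT_TO_UTF8 could look like under one loud assumption: that our
own text is ISO-8859-1. The name latin1_to_utf8 is made up for this example
and is not part of libxml. It returns unsigned char*, which is what xmlChar
is in libxml.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, NOT a libxml function. Assumes the input is a
 * NUL-terminated ISO-8859-1 string. Returns a newly allocated UTF-8
 * string which the caller must free(), or NULL if out of memory. */
unsigned char *latin1_to_utf8(const char *input)
{
    const unsigned char *in = (const unsigned char *)input;
    /* Worst case: every input byte becomes two output bytes. */
    unsigned char *out = malloc(2 * strlen(input) + 1);
    unsigned char *p = out;

    if (out == NULL)
        return NULL;
    for (; *in != '\0'; in++) {
        if (*in < 0x80) {
            /* Plain ASCII is identical in both encodings. */
            *p++ = *in;
        } else {
            /* 0x80..0xFF becomes a two-byte UTF-8 sequence. */
            *p++ = (unsigned char)(0xC0 | (*in >> 6));
            *p++ = (unsigned char)(0x80 | (*in & 0x3F));
        }
    }
    *p = '\0';
    return out;
}
```

An ASCII-only string such as "RootNode" comes out byte for byte the same,
which is why the constant strings above survive without conversion. The
single ISO-8859-1 byte 0xE4 (the umlaut ä) comes out as the two bytes 0xC3
0xA4, and that is the form libxml expects.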
The reverse procedure must be applied when we retrieve text from libxml.
Here is an example of how we read the document we created above back into
memory:
xmlDocPtr doc;
xmlNodePtr rootNode;
char* nodeName;
char* nodeContent;
xmlChar* nodeNameU8;
xmlChar* nodeContentU8;
/* libxml detects the file's encoding and converts to UTF-8 on load. */
doc = xmlParseFile("doc.xml");
rootNode = xmlDocGetRootElement(doc);
nodeNameU8 = xmlNodeGetName(rootNode);
nodeContentU8 = xmlNodeGetContent(rootNode);
/* Convert the UTF-8 text libxml gave us back to our own encoding. */
nodeName = CONVERT_FROM_UTF8(nodeNameU8);
nodeContent = CONVERT_FROM_UTF8(nodeContentU8);
Note that a real program should check for error conditions and free the
memory libxml allocated for you. Using some real encoding conversion
facility, such as iconv, is also advisable.
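As a rough sketch of how iconv could stand behind the CONVERT_something
facilities, consider the helper below. The name convert_encoding is made up
for this example. It assumes both encodings are stateless, byte-oriented
ones such as ISO-8859-1 and UTF-8, so that the result can be NUL-terminated
and four output bytes per input byte are always enough. It also assumes an
iconv whose input parameter is declared char**, as in glibc.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, NOT a libxml function. Converts a NUL-terminated
 * string from encoding `from` to encoding `to`, both named the way iconv
 * understands them. Returns a malloc'ed, NUL-terminated buffer which the
 * caller must free(), or NULL on any failure. */
char *convert_encoding(const char *input, const char *from, const char *to)
{
    size_t in_left = strlen(input);
    /* Assumption: at most four output bytes per input byte. */
    size_t out_size = 4 * in_left + 1;
    size_t out_left = out_size - 1;
    char *in_ptr = (char *)input;   /* iconv does not modify the input */
    char *out = malloc(out_size);
    char *out_ptr = out;
    iconv_t cd;

    if (out == NULL)
        return NULL;
    cd = iconv_open(to, from);      /* note the order: target first */
    if (cd == (iconv_t)-1) {
        free(out);
        return NULL;
    }
    if (iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left) == (size_t)-1) {
        /* Invalid input or a character the target cannot represent. */
        iconv_close(cd);
        free(out);
        return NULL;
    }
    iconv_close(cd);
    *out_ptr = '\0';
    return out;
}
```

With this, CONVERT_TO_UTF8(text) would amount to something like
convert_encoding(text, "ISO-8859-1", "UTF-8") plus a cast to xmlChar*, and
CONVERT_FROM_UTF8 to the same call with the two encoding names swapped. A
real program would take the source encoding from the current locale rather
than hard-code it.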
Also note that you, the user, are responsible for all conversions you might
need when talking to libxml. Libxml will automatically invoke its internal
conversion facilities when it talks to external data sources or
destinations. When you wrote the file, you only needed to specify the
desired output encoding; libxml did the conversion automatically. When you
loaded the file, you didn't need to specify anything; libxml detected and
handled the encoding alone. However, when you want to use text libxml gave
you, or want libxml to use the text you give to it, you must handle the
format differences yourself.
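For text that libxml gave you, a minimal sketch of a CONVERT_FROM_UTF8
might look as follows, again assuming that our own encoding is ISO-8859-1.
The name utf8_to_latin1 is made up for this example and is not part of
libxml. Unlike the other direction, this one can fail, because UTF-8 can
hold characters that ISO-8859-1 cannot.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, NOT a libxml function. Converts NUL-terminated
 * UTF-8 to ISO-8859-1. Returns a malloc'ed string which the caller must
 * free(), or NULL if out of memory, if the input is malformed, or if it
 * contains a character above U+00FF. */
char *utf8_to_latin1(const unsigned char *input)
{
    /* The output is never longer than the input. */
    char *out = malloc(strlen((const char *)input) + 1);
    char *p = out;

    if (out == NULL)
        return NULL;
    while (*input != '\0') {
        if (*input < 0x80) {
            /* Plain ASCII is identical in both encodings. */
            *p++ = (char)*input++;
        } else if ((*input == 0xC2 || *input == 0xC3)
                   && (input[1] & 0xC0) == 0x80) {
            /* Two-byte sequence encoding U+0080..U+00FF. */
            *p++ = (char)(((*input & 0x03) << 6) | (input[1] & 0x3F));
            input += 2;
        } else {
            /* Not representable in ISO-8859-1, or not valid UTF-8. */
            free(out);
            return NULL;
        }
    }
    *p = '\0';
    return out;
}
```

The failure case is the design point here: a program must decide what to do
with text that does not fit its own encoding, be it refusing it, as above,
or substituting a replacement character.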
---------
Please ignore the fact that xmlNodeGetName is a non-existent function. We
should make one.
Well? I believe that all, really all users must be aware of the fact that
conversion is necessary after reading this. Precisely that fact is the most
common source of trouble. Most users find it hard to believe that each and
every string they send into libxml must be converted. They find it equally
hard to believe that they must convert everything that comes from libxml.
The sad fact is that most users really need the conversion. Those who work
with UTF-8 directly are rare and not affected by all this anyway.
I realise that writing a whole tutorial such as this one is a full-time job
that takes weeks to complete. Beyond that, it would take even more time to
correct and adapt everything once the feedback is there about what is still
missing and what is still being misunderstood. Whoever manages to put such a
tutorial together has every right to charge money for all inquiries such as
"how can I pass an umlaut ä to libxml". :-)
Ciao
Igor