RE: [xml] encoding tutorial draft
- From: "Labib Iskander, Marcus" <ml cm4all com>
- To: "'xml gnome org'" <xml gnome org>
- Subject: RE: [xml] encoding tutorial draft
- Date: Tue, 5 Nov 2002 14:03:18 +0100
Hi,
Igor, I am not sure you are not expecting too little. Of course there are
programmers who have never given a thought to the general problem of data
representation in computing, but from time to time every programmer faces
a problem where (s)he needs to know the actual representation of the data.
(This applies not only to character data but also to numbers and, of
course, to the more complex data types.) Not knowing what "encoding" is
sets many traps. I think you can expect that most libxml users DO know
what encoding is and that there are many different ways of representing
character data, since everybody has walked into one of these traps or
another at some point. And don't forget that the people we are talking
about are C programmers! I emphasize again: C!! They may just need a small
reminder.
They will be more grateful if supplied with a simple way of converting to
UTF-8. Maybe the smaller code snippet I made for the FAQ should at least be
mentioned in the tutorial, since 80% of all uses of conversion will be from
ISO Latin-1 (or some local Windows codepage that is almost the same) and
about 10% the other way round (just a guess, and not counting East Asian
programmers, who know the business of encoding quite well :-):
    in = "some null terminated iso latin-1 string";
    temp = size = (int)strlen(in)+1; /*terminating null included*/
    out_size = size*2-1; /*each byte may become two; the null stays one byte*/
    out = malloc((size_t)out_size);
    if (out) {
        /* a negative return value signals an error;
           temp != size means that not all input was consumed */
        if ((ret=isolat1ToUTF8(out, &out_size, in, &temp)) < 0 || temp != size) {
            free(out);
            out=NULL;
        }
    }
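The buffer size in the snippet (size*2-1) follows from how Latin-1 maps to
UTF-8: bytes below 0x80 stay one byte, every other byte becomes exactly two.
A minimal hand-written sketch makes that visible. Note that latin1_to_utf8
is a hypothetical helper for illustration only; it is not a libxml function
and is not meant to show how isolat1ToUTF8 is implemented.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of libxml: converts a NUL-terminated
 * ISO Latin-1 string into a freshly allocated NUL-terminated UTF-8 string.
 * Returns NULL if the allocation fails; the caller frees the result. */
unsigned char *latin1_to_utf8(const unsigned char *in)
{
    assert(in != NULL);
    size_t len = strlen((const char *)in);
    /* Worst case: every input byte becomes two output bytes, plus one NUL. */
    unsigned char *out = malloc(len * 2 + 1);
    unsigned char *p = out;

    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            *p++ = c;                                   /* ASCII: unchanged */
        } else {
            *p++ = (unsigned char)(0xC0 | (c >> 6));    /* lead byte 110000xx */
            *p++ = (unsigned char)(0x80 | (c & 0x3F));  /* continuation 10xxxxxx */
        }
    }
    *p = '\0';
    return out;
}
```

Called with the single Latin-1 byte 0xE4 (a-umlaut), the helper produces the
two UTF-8 bytes 0xC3 0xA4, which is why the output buffer must allow for
twice the input length.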
Cheers,
Marcus
-----Original Message-----
From: Igor Zlatkovic [mailto:igor stud fh-frankfurt de]
Sent: Monday, November 04, 2002 8:21 PM
To: xml gnome org
Subject: Re: [xml] encoding tutorial draft
Exactly, Igor! I'm a known danger when actually writing code, and
slightly less dangerous but still a high risk when trying to explain
code.
Hehe, no you are not. I believe you are very good at explaining things you
do understand. It must be the job of us developers to give you insight into
the code.
This is an excellent explanation, which I will incorporate. I'm assuming
you saw nothing in the code itself that will lead libxml immigrants
astray?
Immigrants it won't confuse, but the beginners will have a tough time.
The beginners will not be irritated by the code itself, but by the fact
that the conversion is needed. Experience says that realising this fact is
the most difficult part for beginners; understanding the code is not an
issue.
I would say that the code in a libxml tutorial is not there to teach people
how to program. It is there to explain how libxml ticks, by example. In
this particular case it is there to emphasise the fact that a conversion is
needed. For that, better use complete code fragments which tell a whole
story at first sight. Do not use separate lines without a very clear common
context.
Also, this fact needs emphasising so badly that using the libxml encoding
handler facility is too much at one place. Explain the semantics of libxml
encoding handlers once the users have understood the need for conversion.
For example, consider this tutorial fragment:
---------
Let's define a fictitious conversion facility. We give it a buffer with
text in our own encoding, whatever it may be, and receive a different
buffer back that contains the same information, encoded in UTF-8, libxml's
internal text format:
    xmlChar* CONVERT_TO_UTF8(char* input);
Let's define another fictitious conversion facility that does it the other
way around:
    char* CONVERT_FROM_UTF8(xmlChar* input);
Both facilities take a buffer of text and return another buffer with the
converted text, memory for which they allocate internally. Note again that
these CONVERT_something facilities do not really exist in libxml, but are
just a guideline on how you should use some really existing ones.
Every bit of textual data given to libxml must be in UTF-8 format.
Unfortunately, little of our data really is in that format; not even
constant C strings are. Our data's format is usually influenced by our
current locale, and if you did nothing to affect that, then it is so in
your case as well. This means that the conversion must be done each time
our data crosses over to libxml, or whenever libxml gives us some text.
Here is an example of how we create a simple XML document with one node and
some text, then save it in a classic format:
    xmlDocPtr doc;
    xmlNodePtr rootNode;
    char* nodeName = "RootNode";
    char* nodeContent = "This is the text, the text this is.";
    /* convert our own text to UTF-8 before it enters libxml */
    xmlChar* nodeNameU8 = CONVERT_TO_UTF8(nodeName);
    xmlChar* nodeContentU8 = CONVERT_TO_UTF8(nodeContent);
    doc = xmlNewDoc(BAD_CAST "1.0");
    rootNode = xmlNewDocNode(doc, NULL, nodeNameU8, nodeContentU8);
    xmlDocSetRootElement(doc, rootNode);
    /* libxml converts from UTF-8 to the requested encoding on output */
    xmlSaveFormatFileEnc("doc.xml", doc, "ISO-8859-1", 0);
Note that our own text must be converted to UTF-8 before it enters the
libxml realm. Of course, these simple constant strings do not undergo any
change when converted to UTF-8, with any regular C compiler. However, our
text will seldom originate in a constant string. Usually this text is
obtained from different external sources, and as soon as it contains
anything as simple as a French accent or a German umlaut, libxml will
misinterpret it without conversion. The conversion is absolutely necessary,
unless your text is already in UTF-8 format by default, or unless you are
absolutely sure that the conversion will produce nothing different from the
original.
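To see why an unconverted umlaut gets misread, it helps to look at the byte
structure UTF-8 requires. The following is only a rough sketch:
looks_like_utf8 is a hypothetical name, not a libxml function, and the
check is simplified (it does not reject overlong forms or surrogates).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the NUL-terminated buffer has the byte
 * structure of UTF-8 (each lead byte followed by the right number of
 * continuation bytes), 0 otherwise. Simplified: overlong forms and
 * surrogates are not rejected. */
int looks_like_utf8(const unsigned char *s)
{
    assert(s != NULL);
    while (*s != '\0') {
        int extra;
        if (*s < 0x80)                extra = 0;  /* ASCII */
        else if ((*s & 0xE0) == 0xC0) extra = 1;  /* 110xxxxx */
        else if ((*s & 0xF0) == 0xE0) extra = 2;  /* 1110xxxx */
        else if ((*s & 0xF8) == 0xF0) extra = 3;  /* 11110xxx */
        else return 0;          /* stray continuation or invalid lead byte */
        s++;
        while (extra-- > 0) {
            if ((*s & 0xC0) != 0x80)
                return 0;       /* missing continuation byte (or early NUL) */
            s++;
        }
    }
    return 1;
}
```

A Latin-1 umlaut such as the byte 0xE4 looks like the start of a three-byte
UTF-8 sequence, so the ordinary letters that follow it fail the
continuation test. Plain ASCII text passes either way, which is why
constant strings like "RootNode" hide the problem.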
The reverse procedure must be applied when we retrieve text from libxml.
Here is an example of how we read the document we created above back into
memory:
    xmlDocPtr doc;
    xmlNodePtr rootNode;
    char* nodeName;
    char* nodeContent;
    xmlChar* nodeNameU8;
    xmlChar* nodeContentU8;
    /* parse the file that was saved in the previous example */
    doc = xmlParseFile("doc.xml");
    rootNode = xmlDocGetRootElement(doc);
    nodeNameU8 = xmlNodeGetName(rootNode);
    nodeContentU8 = xmlNodeGetContent(rootNode);
    /* text coming out of libxml is UTF-8; convert it back to our encoding */
    nodeName = CONVERT_FROM_UTF8(nodeNameU8);
    nodeContent = CONVERT_FROM_UTF8(nodeContentU8);
Note that some checks for error conditions should be applied in a real
program, as well as freeing the memory libxml allocated for you. Using some
real encoding conversion facility, such as iconv, is also advisable.
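As a rough idea of what a CONVERT_something facility could look like on top
of iconv, here is a minimal sketch. The name convert_string and the "four
output bytes per input byte" sizing are assumptions made for this
illustration; error handling is reduced to returning NULL.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts a NUL-terminated string between two
 * encodings, named the way iconv_open() expects them. Returns a malloc'ed,
 * NUL-terminated result, or NULL on any failure. */
char *convert_string(const char *to, const char *from, const char *input)
{
    assert(to != NULL && from != NULL && input != NULL);
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t in_left = strlen(input);
    /* Assumption: 4 output bytes per input byte is enough for the
     * single-byte and UTF-8 encodings discussed here. */
    size_t out_size = in_left * 4 + 1;
    size_t out_left = out_size - 1;
    char *out = malloc(out_size);
    char *in_ptr = (char *)input;   /* iconv() takes a non-const pointer */
    char *out_ptr = out;

    if (out == NULL ||
        iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left) == (size_t)-1) {
        free(out);
        out = NULL;
    } else {
        *out_ptr = '\0';            /* iconv() does not terminate the output */
    }
    iconv_close(cd);
    return out;
}
```

With such a helper, CONVERT_TO_UTF8(s) would correspond to
convert_string("UTF-8", "ISO-8859-1", s) and CONVERT_FROM_UTF8(s) to
convert_string("ISO-8859-1", "UTF-8", s), assuming the local encoding is
ISO-8859-1; the result would still need a cast to or from xmlChar*.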
Also note that you, the user, are responsible for all conversions you might
need when talking to libxml. Libxml will automatically invoke its internal
conversion facilities when it talks to external data sources or
destinations. When you wrote the file, you needed only to specify the
desired output encoding; libxml did the conversion automatically. When you
loaded the file, you didn't need to specify anything; libxml detected and
handled the encoding alone. However, when you want to use text libxml gave
you, or want libxml to use the text you give to it, you must handle the
format differences yourself.
---------
Please ignore the fact that xmlNodeGetName is a non-existent function. We
should make one.
Well? I believe that all, really all, users must be aware of the fact that
conversion is necessary after reading this. Precisely that fact is the most
common source of trouble. Most users find it hard to believe that each and
every string they send into libxml must be converted. Equally hard they
find the fact that they must convert everything that comes from libxml. The
sad fact is that most users really need the conversion. Those who work with
UTF-8 directly are rare and not affected by all this anyway.
I realise that writing a whole tutorial such as this one is a full-time job
that takes weeks to complete. Other than that, it would take more to
correct and adapt everything once the feedback is there about what is still
missing and what is still being misunderstood. The one who manages to put
such a whole tutorial together has the full right to charge money for all
inquiries such as "how can I pass an umlaut ä to libxml". :-)
Ciao
Igor
_______________________________________________
xml mailing list, project page http://xmlsoft.org/
xml gnome org
http://mail.gnome.org/mailman/listinfo/xml