Re: [xml] encoding tutorial draft

From: Igor Zlatkovic <igor stud fh-frankfurt de>
To: xml gnome org
Subject: Re: [xml] encoding tutorial draft
Date: Mon, 04 Nov 2002 20:21:12 +0100

> Exactly, Igor! I'm a known danger when actually writing code, and
> slightly less dangerous but still a high risk when trying to explain
> code.

Hehe, no you are not. I believe you are very good at explaining things you do understand. It must be the job of us developers to give you insight into the code.

> This is excellent explanation, which I will incorporate. I'm assuming
> you saw nothing in the code itself that will lead libxml immigrants
> astray?

It won't confuse immigrants, but beginners will have a tough time. Beginners will not be irritated by the code itself, but by the fact that the conversion is needed. Experience says that realising this fact is the most difficult part for beginners; understanding the code is not an issue.

I would say that the code in a libxml tutorial is not there to teach people how to program. It is there to explain how libxml ticks, by example. In this particular case it is there to emphasise the fact that a conversion is needed. For that, better use complete code fragments which tell a whole story at first sight. Do not use separate lines without a very clear common context.

Also, this fact needs emphasising so badly that introducing the libxml encoding handler facility in the same place is too much at once. Explain the semantics of libxml encoding handlers once the users have understood the need for conversion.


For example, consider this tutorial fragment:

---------

Let's define a fictitious conversion facility. We give it a buffer with text in our own encoding, whatever it may be, and receive a different buffer back that contains the same information, encoded in UTF-8, libxml's internal text format:


  xmlChar* CONVERT_TO_UTF_8(char* input);

Let's define another fictitious conversion facility that does it the other way around:


  char* CONVERT_FROM_UTF_8(xmlChar* input);

Both facilities take a buffer of text and return another buffer with the converted text, memory for which they allocate internally. Note again that these CONVERT_something facilities do not really exist in libxml, but are just a guideline on how you should use some really existing ones.
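
To make the idea concrete, here is a minimal sketch of what such a facility might look like under one narrow assumption: the input is ISO-8859-1. The name latin1_to_utf8 is hypothetical and not part of libxml, and the sketch returns unsigned char* (the type libxml's xmlChar is defined as) so that it compiles without the libxml headers:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for CONVERT_TO_UTF_8, assuming ISO-8859-1 input.
 * Returns a malloc'ed, NUL-terminated UTF-8 string that the caller must
 * free, or NULL if memory runs out. */
unsigned char *latin1_to_utf8(const char *input)
{
    assert(input != NULL);

    size_t len = strlen(input);
    /* Each ISO-8859-1 byte becomes at most two UTF-8 bytes. */
    unsigned char *out = malloc(2 * len + 1);
    if (out == NULL)
        return NULL;

    unsigned char *p = out;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)input[i];
        if (c < 0x80) {
            *p++ = c;                 /* ASCII bytes are unchanged */
        } else {
            *p++ = 0xC0 | (c >> 6);   /* lead byte: 110000xx */
            *p++ = 0x80 | (c & 0x3F); /* continuation byte: 10xxxxxx */
        }
    }
    *p = '\0';
    return out;
}
```

With this sketch, the three-byte ISO-8859-1 string "B\xE4r" comes back as the four bytes 'B', 0xC3, 0xA4, 'r'. Any other source encoding needs a different mapping, which is why a general conversion library is the better choice in real code.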

Every bit of textual data given to libxml must be in UTF-8 format. Unfortunately, little of our data really is in that format; not even constant C strings are. Our data's format is usually influenced by our current locale, and if you did nothing to affect that, then it is so in your case as well. This means that the conversion must be done each time our data crosses over to libxml, or whenever libxml gives us some text.

Here is an example of how we create a simple XML document with one node and some text, then save it in a classic format:


  xmlDocPtr doc;
  xmlNodePtr rootNode;

  char* nodeName = "RootNode";
  char* nodeContent = "This is the text, the text this is.";

  /* Convert our own text to UTF-8 before it enters libxml. */
  xmlChar* nodeNameU8 = CONVERT_TO_UTF_8(nodeName);
  xmlChar* nodeContentU8 = CONVERT_TO_UTF_8(nodeContent);

  doc = xmlNewDoc(BAD_CAST "1.0");
  rootNode = xmlNewDocNode(doc, NULL, nodeNameU8, nodeContentU8);
  xmlDocSetRootElement(doc, rootNode);
  /* libxml converts from UTF-8 to ISO-8859-1 on output. */
  xmlSaveFormatFileEnc("doc.xml", doc, "ISO-8859-1", 0);

Note that our own text must be converted to UTF-8 before it enters the libxml realm. Of course, these simple constant strings do not undergo any change when converted to UTF-8, with any regular C compiler. However, our text will seldom originate in a constant string. Usually this text is obtained from different external sources, and as soon as it contains anything so simple as a French accent or a German umlaut, libxml will misinterpret it without conversion. The conversion is absolutely necessary, unless your text is already in UTF-8 format by default, or unless you are absolutely sure that the conversion will produce nothing different from the original.
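
One way to be "absolutely sure" is to check that the text is plain 7-bit ASCII, since such bytes are identical in UTF-8. The following is only a sketch: the helper name is_plain_ascii is hypothetical, and the check is meaningful only for ASCII-compatible source encodings such as the ISO-8859 family:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if every byte of text is below 0x80,
 * in which case converting it to UTF-8 would change nothing.
 * Returns 0 as soon as a byte with the high bit set is found. */
int is_plain_ascii(const char *text)
{
    assert(text != NULL);

    for (const unsigned char *p = (const unsigned char *)text; *p != '\0'; p++) {
        if (*p >= 0x80)
            return 0;
    }
    return 1;
}
```

For "RootNode" the check succeeds, while for an ISO-8859-1 string holding an umlaut, such as "B\xE4r", it fails and the text must go through a real conversion.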

The reverse procedure must be applied when we retrieve text from libxml. Here is an example of how we read the document we created above back into memory:


  xmlDocPtr doc;
  xmlNodePtr rootNode;

  char* nodeName;
  char* nodeContent;

  xmlChar* nodeNameU8;
  xmlChar* nodeContentU8;

  /* libxml detects the file's encoding and converts it to UTF-8. */
  doc = xmlParseFile("doc.xml");
  rootNode = xmlDocGetRootElement(doc);
  nodeNameU8 = xmlNodeGetName(rootNode);
  nodeContentU8 = xmlNodeGetContent(rootNode);

  /* Convert the UTF-8 text libxml gave us back to our own encoding. */
  nodeName = CONVERT_FROM_UTF_8(nodeNameU8);
  nodeContent = CONVERT_FROM_UTF_8(nodeContentU8);

Note that a real program should check for error conditions and free the memory libxml allocated for you. Using some real encoding conversion facility, such as iconv, is also advisable.
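
As a rough illustration of that last point, the sketch below wraps the standard iconv calls (iconv_open, iconv, iconv_close) in a single helper. The name convert_encoding and the fixed four-to-one output buffer estimate are assumptions of this sketch, and it is meant only for byte-oriented, NUL-terminated encodings such as ISO-8859-1 and UTF-8:

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts a NUL-terminated string from encoding
 * `from` to encoding `to`. Returns a malloc'ed string the caller must
 * free, or NULL if the encodings are unknown, memory runs out, or the
 * input cannot be converted. */
char *convert_encoding(const char *input, const char *from, const char *to)
{
    assert(input != NULL && from != NULL && to != NULL);

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t in_left = strlen(input);
    size_t out_size = 4 * in_left + 1;  /* assumed generous for 8-bit <-> UTF-8 */
    char *out = malloc(out_size);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in_ptr = (char *)input;       /* iconv wants char **, but only reads the data */
    char *out_ptr = out;
    size_t out_left = out_size - 1;     /* keep one byte for the terminator */

    if (iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left) == (size_t)-1) {
        free(out);
        iconv_close(cd);
        return NULL;
    }

    *out_ptr = '\0';
    iconv_close(cd);
    return out;
}
```

Under these assumptions, CONVERT_TO_UTF_8(text) could be approximated by convert_encoding(text, "ISO-8859-1", "UTF-8") and CONVERT_FROM_UTF_8 by swapping the last two arguments, with the source encoding replaced by whatever your locale really uses.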

Also note that you, the user, are responsible for all conversions you might need when talking to libxml. Libxml will automatically invoke its internal conversion facilities when it talks to external data sources or destinations. When you wrote the file, you only needed to specify the desired output encoding; libxml did the conversion automatically. When you loaded the file, you didn't need to specify anything; libxml detected and handled the encoding on its own. However, when you want to use text libxml gave you, or want libxml to use the text you give to it, you must handle the format differences yourself.


---------

Please ignore the fact that xmlNodeGetName is a non-existent function. We should make one.

Well? I believe that all, really all users must be aware of the fact that conversion is necessary after reading this. Precisely that fact is the most common source of trouble. Most users find it hard to believe that each and every string they send into libxml must be converted. They find it equally hard to accept that they must convert everything that comes from libxml. The sad fact is that most users really need the conversion. Those who work with UTF-8 directly are rare and not affected by all this anyway.

I realise that writing a whole tutorial such as this one is a full-time job that takes weeks to complete. Beyond that, it would take even more time to correct and adapt everything once the feedback is there about what is still missing and what is still being misunderstood. Whoever manages to put such a whole tutorial together has every right to charge money for all inquiries such as "how can I pass an umlaut ä to libxml". :-)


Ciao
Igor
