Re: [xml] encoding tutorial draft

Exactly, Igor! I'm a known danger when actually writing code, and
slightly less dangerous but still a high risk when trying to explain

Hehe, no you are not. I believe you are very good at explaining things you do understand. It must be the job of us developers to give you insight into the code.

This is excellent explanation, which I will incorporate. I'm assuming
you saw nothing in the code itself that will lead libxml immigrants

It won't confuse immigrants, but beginners will have a tough time. Beginners will not be irritated by the code itself, but by the fact that the conversion is needed. Experience says that realising this fact is the hardest part for beginners; understanding the code is not an issue.

I would say that the code in a libxml tutorial is not there to teach people how to program. It is there to explain how libxml ticks, by example. In this particular case it is there to emphasise the fact that a conversion is needed. For that, better use complete code fragments that tell a whole story at first sight. Do not use separate lines without a very clear common context.

Also, this fact needs emphasising so badly that introducing the libxml encoding handler facility in the same place is too much. Explain the semantics of libxml encoding handlers once the users have understood the need for conversion.

For example, consider this tutorial fragment:


Let's define a fictitious conversion facility. We give it a buffer with text in our own encoding, whatever it may be, and receive a different buffer back that contains the same information, encoded in UTF-8, libxml's internal text format:

  xmlChar* CONVERT_TO_UTF8(char* input);

Let's define another fictitious conversion facility that does it the other way around:

  char* CONVERT_FROM_UTF8(xmlChar* input);

Both facilities take a buffer of text and return another buffer with the converted text, memory for which they allocate internally. Note again that these CONVERT_something facilities do not really exist in libxml, but are just a guideline on how you should use some really existing ones.

Every bit of textual data given to libxml must be in UTF-8 format. Unfortunately, little of our data really is in that format, not even the constant C strings are. Our data's format is usually influenced by our current locale and if you did nothing to affect that, then it is so in your case as well. This means that the conversion must be done each time our data crosses over to libxml, or whenever libxml gives us some text.

Here is an example of how we create a simple XML document with one node and some text, then save it in a classic format:

  xmlDocPtr doc;
  xmlNodePtr rootNode;

  char* nodeName = "RootNode";
  char* nodeContent = "This is the text, the text this is.";

  xmlChar* nodeNameU8 = CONVERT_TO_UTF8(nodeName);
  xmlChar* nodeContentU8 = CONVERT_TO_UTF8(nodeContent);

  doc = xmlNewDoc((const xmlChar*)"1.0");
  rootNode = xmlNewDocNode(doc, NULL, nodeNameU8, nodeContentU8);
  xmlDocSetRootElement(doc, rootNode);
  xmlSaveFormatFileEnc("doc.xml", doc, "ISO-8859-1", 0);

Note that our own text must be converted to UTF-8 before it enters the libxml realm. Of course, these simple constant strings do not undergo any change when converted to UTF-8, with any regular C compiler. However, our text will seldom originate in a constant string. Usually this text is obtained from different external sources, and as soon as it contains anything as simple as a French accent or a German umlaut, libxml will misinterpret it without conversion. The conversion is absolutely necessary, unless your text is already in UTF-8 format by default, or unless you are absolutely sure that the conversion will produce nothing different from the original.
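
To make the umlaut case concrete, here is a minimal sketch of what the conversion does to the bytes. It assumes, purely for illustration, that our own encoding is ISO-8859-1 (Latin-1). The function latin1_to_utf8 is a hypothetical helper written for this example, not something libxml provides:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of libxml: converts a NUL-terminated
 * ISO-8859-1 string to UTF-8. Returns a malloc'd buffer that the
 * caller must free, or NULL if the allocation fails. */
unsigned char *latin1_to_utf8(const char *input)
{
    size_t len = strlen(input);
    /* Worst case: every input byte becomes two output bytes. */
    unsigned char *out = malloc(2 * len + 1);
    unsigned char *p = out;
    size_t i;

    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)input[i];
        if (c < 0x80) {
            /* Plain ASCII is identical in both encodings. */
            *p++ = c;
        } else {
            /* Everything else needs a two-byte UTF-8 sequence. */
            *p++ = (unsigned char)(0xC0 | (c >> 6));   /* lead byte */
            *p++ = (unsigned char)(0x80 | (c & 0x3F)); /* continuation byte */
        }
    }
    *p = '\0';
    return out;
}
```

A pure ASCII string such as "RootNode" comes out unchanged, which is why the constant strings above survive without conversion. The Latin-1 string "M\xE4rz" (four bytes) comes out as five bytes, because the single byte 0xE4 for the umlaut becomes the pair 0xC3 0xA4. A lone 0xE4 is not a valid UTF-8 sequence, and that is what libxml stumbles over when the conversion is skipped.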

The reverse procedure must be applied when we retrieve text from libxml. Here is an example of how we read the document we created above back into memory:

  xmlDocPtr doc;
  xmlNodePtr rootNode;

  char* nodeName;
  char* nodeContent;

  xmlChar* nodeNameU8;
  xmlChar* nodeContentU8;

  doc = xmlParseFile("doc.xml");
  rootNode = xmlDocGetRootElement(doc);
  nodeNameU8 = xmlNodeGetName(rootNode);
  nodeContentU8 = xmlNodeGetContent(rootNode);

  nodeName = CONVERT_FROM_UTF8(nodeNameU8);
  nodeContent = CONVERT_FROM_UTF8(nodeContentU8);

Note that some check for error conditions should be applied in a real program, as well as freeing the memory libxml allocated for you. Using some real encoding conversion facility, such as iconv, is also advisable.
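
As a rough idea of what a real facility could look like, here is a minimal sketch built on POSIX iconv. The name convert_encoding and the choice of ISO-8859-1 in the usage note are assumptions made for this example; substitute whatever encoding your data really uses:

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts a NUL-terminated string from one
 * encoding to another, using encoding names as iconv understands
 * them. Returns a malloc'd, NUL-terminated buffer that the caller
 * must free, or NULL on any error. */
char *convert_encoding(const char *to, const char *from, const char *input)
{
    iconv_t cd = iconv_open(to, from);
    size_t in_left = strlen(input);
    /* Assumption: four output bytes per input byte is enough for the
     * single-byte and UTF-8 encodings this sketch is meant for. */
    size_t out_size = 4 * in_left + 1;
    size_t out_left = out_size - 1;
    char *in_ptr = (char *)input; /* iconv wants char **, input is not modified */
    char *output, *out_ptr;

    if (cd == (iconv_t)-1)
        return NULL; /* unknown or unsupported encoding pair */

    output = malloc(out_size);
    if (output == NULL) {
        iconv_close(cd);
        return NULL;
    }
    out_ptr = output;

    /* One call converts the whole buffer; a failure means the input
     * was invalid or not representable in the target encoding. */
    if (iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left) == (size_t)-1) {
        free(output);
        iconv_close(cd);
        return NULL;
    }

    *out_ptr = '\0';
    iconv_close(cd);
    return output;
}
```

With this, CONVERT_TO_UTF8(text) could stand for convert_encoding("UTF-8", "ISO-8859-1", text), and CONVERT_FROM_UTF8(text) for the same call with the two encoding names swapped. The result is a plain char buffer, so it has to be cast to xmlChar* (an unsigned char in libxml) before it is handed over. Stateful encodings would need an extra flushing call to iconv, which this sketch leaves out.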

Also note that you, the user, are responsible for all conversions you might need when talking to libxml. Libxml invokes its internal conversion facilities automatically only when it talks to external data sources or destinations. When you wrote the file, you only needed to specify the desired output encoding; libxml did the conversion automatically. When you loaded the file, you didn't need to specify anything; libxml detected and handled the encoding on its own. However, when you want to use text libxml gave you, or want libxml to use the text you give to it, you must handle the format differences yourself.


Please ignore the fact that xmlNodeGetName is a non-existent function. We should make one.

Well? I believe that all, really all users must be aware of the fact that conversion is necessary after reading this. Precisely that fact is the most common source of trouble. Most users find it hard to believe that each and every string they send into libxml must be converted. They find it equally hard to accept that they must convert everything that comes from libxml. The sad fact is that most users really need the conversion. Those who work with UTF-8 directly are rare and not affected by all this anyway.

I realise that writing a whole tutorial such as this one is a full-time job that takes weeks to complete. Beyond that, it would take even more time to correct and adapt everything once feedback arrives about what is still missing and what is still being misunderstood. Whoever manages to put such a whole tutorial together has every right to charge money for all inquiries such as "how can I pass an umlaut ä to libxml". :-)

