Re: [xml] how to dump a node in utf8

From: Petr Pajas <pajas ufal ms mff cuni cz>
To: veillard redhat com
Cc: libxml2 <xml gnome org>
Subject: Re: [xml] how to dump a node in utf8
Date: Fri, 26 Nov 2004 11:31:26 +0100

Daniel Veillard <veillard redhat com> writes:

On Thu, Nov 25, 2004 at 07:54:32PM +0100, Petr Pajas wrote:

Hi Daniel, All,

In libxml2-2.4.x, calling xmlNodeDump produced a UTF-8 encoded
string. This seems to have changed in libxml2-2.6.x (or possibly
during 2.5.x): instead of UTF-8 it now uses character references to
encode all non-ASCII characters.


  It probably depends on whether your node has a doc, whether that doc has an encoding,


Yes, I'm calling it on the root element of this document:

$ cat para0.xml
<?xml version='1.0' encoding='utf-8'?>
<para>ěščřžřýáíůúťňď</para>

(don't know what mail would do with these UTF8 characters, but it is a
valid UTF8 string).

I must say that this change is quite annoying. Why isn't the new
xmlNodeDump called xmlNodeDumpASCII or something at least for the sake
of backward compatibility?


  "It doesn't work for me, it's broken" ... Well, there are a number
of parameters which can change the behaviour.

This didn't break any of the python regression tests. So you will have
to provide more information and at least open a bug report.  If
something changed between 2.4.x and 2.6.x, nobody has complained
about it in the two years since.


Oops, the version information I gave last night was rather
misleading. Now that I have tried again in a clean environment, it
seems that the escaping started somewhere between 2.6.8 and 2.6.15
(./parseprint.c is attached):

$ LD_PRELOAD=/net/su/h/local2-rh8/lib/libxml2.so.2.6.8 ./parseprint para0.xml
<para>ěščřžřýáíůúťňď</para>

$ LD_PRELOAD=/net/su/h/local2-rh8/lib/libxml2.so.2.6.15 ./parseprint para0.xml
<para>&#x11B;&#x161;&#x10D;&#x159;&#x17E;&#x159;&#xFD;&#xE1;&#xED;&#x16F;&#xFA;&#x165;&#x148;&#x10F;</para>

Is my test case, parseprint.c, correct? Is this a sign of a bug?
Should I open a bug report for this?
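
To make the difference between the two outputs concrete: each &#x...;
in the 2.6.15 output is the hexadecimal Unicode code point of one
character that 2.6.8 emitted as raw UTF-8 bytes. The following is a
minimal, libxml2-independent sketch of that transformation. The helper
names utf8_decode and escape_non_ascii are hypothetical and used only
for illustration; this is not how libxml2 implements its escaping.

```c
#include <stdio.h>
#include <stddef.h>

/* Decode one UTF-8 sequence starting at s. Stores the code point in *cp
 * and returns the number of bytes consumed, or 0 on malformed input.
 * Illustration only: overlong forms and surrogates are not rejected. */
static size_t utf8_decode(const unsigned char *s, unsigned long *cp)
{
    size_t n, i;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; n = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; n = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; n = 4; }
    else return 0;

    for (i = 1; i < n; i++) {
        /* A continuation byte must look like 10xxxxxx; this also stops
         * at the terminating '\0' of a truncated sequence. */
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (unsigned long) (s[i] & 0x3F);
    }
    return n;
}

/* Copy the UTF-8 string `in` to `out`, replacing every non-ASCII
 * character by a hexadecimal character reference (&#xHEX;).
 * Returns 0 on success, -1 on malformed UTF-8 or if `out` is too small. */
static int escape_non_ascii(const char *in, char *out, size_t outsz)
{
    const unsigned char *s = (const unsigned char *) in;
    size_t used = 0;

    if (outsz == 0) return -1;
    out[0] = '\0';

    while (*s != '\0') {
        unsigned long cp;
        size_t n = utf8_decode(s, &cp);
        int w;

        if (n == 0) return -1;
        if (cp < 0x80)
            w = snprintf(out + used, outsz - used, "%c", (int) cp);
        else
            w = snprintf(out + used, outsz - used, "&#x%lX;", cp);
        if (w < 0 || (size_t) w >= outsz - used) return -1;
        used += (size_t) w;
        s += n;
    }
    return 0;
}
```

For instance, the bytes C4 9B and C5 A1 (the first two characters of
the test string) decode to U+011B and U+0161, so they come out as
&#x11B;&#x161;, matching the start of the 2.6.15 output above.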

So I have a bit of a hard time accepting your rant at this point, in
all honesty.


Please accept my apologies. I didn't mean to "rant" and really did not
intend to cause offense. My wording should have been more pragmatic (I
was really tired last night).

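xmlBufferPtr buffer;
xmlOutputBufferPtr outbuf;

buffer = xmlBufferCreate();

/* Hand-build an output buffer around `buffer`, with no encoder (conv). */
outbuf = (xmlOutputBufferPtr) xmlMalloc(sizeof(xmlOutputBuffer));

if (outbuf != NULL) {
   memset(outbuf, 0, sizeof(xmlOutputBuffer));
   outbuf->buffer = buffer;
   xmlNodeDumpOutput(outbuf, doc, root_element, 0, 0, "UTF-8");
   xmlFree(outbuf);   /* frees only the wrapper, not `buffer` */
   if ( xmlBufferLength(buffer) > 0 ) {
      /* print the serialized node collected in `buffer` */
      printf("%s\n", (const char *) xmlBufferContent(buffer));
   }
}
xmlBufferFree(buffer);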

I wonder, is there any shortcut for that?


  xmlsave.c is a first step. minimal APIs to provide backward compatibility.
I'm not sure this is fine.
  Moreover you're doing this because you don't like the default escaping.
Well that default escaping *can* be provided now in the new API with
xmlSaveSetEscape()


Yes, I didn't like the default escaping, since it appears that
escaping only became the default somewhere between 2.6.8 and 2.6.15.
Now, taking it pragmatically, it's either a bug or not (which should
be your decision as the designer of the API). In the first case I can
file a bug report if needed; in the latter case, I'd like to find a
correct way to avoid escaping.

check struct _xmlOutputBuffer {
    ....
    xmlBufferPtr buffer;    /* Local buffer encoded in UTF-8 or ISOLatin */
    xmlBufferPtr conv;      /* if encoder != NULL buffer for output */

  there are *two* buffers.

  Precisely this may not work. And if you try with a larger buffer
you may only get a chunk of the UTF-8 as everything else would possibly be
converted. You can hack as you want but since you want to get control
over the way things are escaped, it sounds to me you should rather use the
new APIs precisely to control the output.


Ok, so in the new API, if I want UTF-8 encoded node dumps (regardless
of what encoding was declared in the XML file), should I use

  outbuf->buffer
or
  outbuf->conv

provided I use

xmlNodeDumpOutput(outbuf, doc, root_element, 0, 0, "UTF-8")?

  So in a nutshell, I rewrote the API for speed, and more flexibility,


I do understand that.

they may not be finalized yet, but ranting about the fact it changed
really isn't the right way to get me to invest more time finishing
it :-(


Again, I'm sorry. I'm not ranting. I'm just seeking advice on how to
retain the old behavior with the new API, so that the Perl bindings
behave consistently.

Thanks,

-- Petr

Attachment: parseprint.c
Description: Binary data


