Re: [xml] how to dump a node in utf8



On Fri, Nov 26, 2004 at 11:31:26AM +0100, Petr Pajas wrote:
Daniel Veillard <veillard redhat com> writes:
  probably depends on whether your node has a doc, and whether that doc has an encoding,

Yes, I'm calling that on the root element of 

$ cat para0.xml
<?xml version='1.0' encoding='utf-8'?>
<para>ěščřžřýáíůúťňď</para>

(don't know what mail would do with these UTF8 characters, but it is a
valid UTF8 string).

  hum, okay...

Oops, the information about versions I gave last night was rather
misleading. Now that I have tried again in a clean environment, it seems
that the escaping started somewhere between 2.6.8 and 2.6.15
(./parseprint.c is attached):

  okay, I'm gonna look at this.

$ LD_PRELOAD=/net/su/h/local2-rh8/lib/libxml2.so.2.6.8 ./parseprint para0.xml
<para>ěščřžřýáíůúťňď</para>

$ LD_PRELOAD=/net/su/h/local2-rh8/lib/libxml2.so.2.6.15 ./parseprint para0.xml
<para>&#x11B;&#x161;&#x10D;&#x159;&#x17E;&#x159;&#xFD;&#xE1;&#xED;&#x16F;&#xFA;&#x165;&#x148;&#x10F;</para>
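
For reference, the two outputs above carry the same text: each &#x...; reference in the 2.6.15 output names one of the code points that the 2.6.8 output writes directly as UTF-8 bytes. The following is a minimal sketch in plain C that makes the mapping explicit. It is not part of libxml2 or of the attached parseprint.c, and the function names utf8_encode and decode_hex_charrefs are hypothetical, chosen only for this illustration. It handles hexadecimal references only and copies anything else through unchanged.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Encode one Unicode code point as UTF-8 into out (at least 4 bytes).
 * Returns the number of bytes written, or 0 for an invalid code point. */
static size_t utf8_encode(unsigned long cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Copy in to out, replacing each well-formed "&#xHEX;" reference with the
 * UTF-8 bytes of that code point. Malformed references are copied verbatim.
 * Returns the length written (excluding the terminating NUL), or (size_t)-1
 * if out is too small. */
static size_t decode_hex_charrefs(const char *in, char *out, size_t outsz)
{
    size_t n = 0;

    if (outsz == 0)
        return (size_t)-1;
    while (*in != '\0') {
        if (strncmp(in, "&#x", 3) == 0) {
            const char *p = in + 3;
            unsigned long cp = 0;
            int digits = 0;
            char tmp[4];
            size_t len = 0;

            /* At most 6 hex digits are needed for U+10FFFF. */
            while (digits < 6 && isxdigit((unsigned char)*p)) {
                int c = tolower((unsigned char)*p);
                cp = cp * 16 + (unsigned long)(c <= '9' ? c - '0' : c - 'a' + 10);
                p++;
                digits++;
            }
            if (digits > 0 && *p == ';')
                len = utf8_encode(cp, tmp);
            if (len > 0) {
                if (n + len >= outsz)
                    return (size_t)-1;
                memcpy(out + n, tmp, len);
                n += len;
                in = p + 1;                 /* skip past the ';' */
                continue;
            }
        }
        if (n + 1 >= outsz)
            return (size_t)-1;
        out[n++] = *in++;
    }
    out[n] = '\0';
    return n;
}
```

Passing the 2.6.15 output line to decode_hex_charrefs should give back the raw UTF-8 form of the same <para> element, which is one way to check that the change is in how the characters are serialized and not in the document content.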

Is my test case, parseprint.c correct? Is this a sign of a bug?
Should I open a bug report for this?

  opening a bug is the best way to be sure things are not left as-is, yes

Please accept my apologies; I didn't mean to "rant" and really didn't
intend to cause offense. My wording should have been more pragmatic (I was
really tired last night).

  okay, let's try to fix the problem. We have a chance with the new API
to fix things, so let's do this.

Yes, I didn't like the default escaping, since it appears that
escaping only became the default somewhere between 2.6.8 and 2.6.15. Now,
taking it pragmatically, it's either a bug or not (which should be
your decision as the designer of the API). In the first case I can file
a bug report if needed; in the latter case, I'd like to find a correct
way to avoid escaping.

check struct _xmlOutputBuffer {
    ....
    xmlBufferPtr buffer;    /* Local buffer encoded in UTF-8 or ISOLatin */
    xmlBufferPtr conv;      /* if encoder != NULL buffer for output */

  there are *two* buffers.

  This is precisely what may not work. And if you try with a larger buffer
you may get only a chunk of the UTF-8, as everything else may already have
been converted. You can hack as you want, but since you want control
over the way things are escaped, it sounds to me like you should rather use
the new APIs, precisely to control the output.

Ok, so in the new API, if I want UTF8 encoded node-dumps (regardless
of what encoding was declared in the xml file), should I use

  output->buffer
or 
  output->conv 

provided I use

xmlNodeDumpOutput(outbuf, output->doc, root_element, 0, 0, "UTF-8")?

I'm not comfortable with you attacking the API at that level.
In general the buffer would have to be flushed to force any pending
conversions; then, depending on whether conv is present, you would use
either conv or buffer.

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


