RE: [xml] Encoding problems building document from scratch




-----Original Message-----
From: Daniel Veillard [mailto:veillard redhat com]
Sent: Monday, January 14, 2002 6:05 PM
To: Henke, Markus
Cc: 'xml gnome org'
Subject: Re: [xml] Encoding problems building document from scratch


On Mon, Jan 14, 2002 at 04:39:15PM +0100, Henke, Markus wrote:
And what's the correct way to set an encoding for a new document?

  set doc->encoding

Well, that's what i've done...

  You need to convert the string to UTF8 first. There is a function
exported from encoding.h to do IsoLatin1 to UTF8 encoding.
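
For illustration, the conversion step described above can be sketched by hand. This is not the libxml2 routine from encoding.h (which, as far as I know, is called isolat1ToUTF8); `latin1_to_utf8` and its parameters are hypothetical names made up for this sketch. It relies only on the fact that every ISO-8859-1 byte equals the Unicode code point of the same value, so bytes below 0x80 are copied unchanged and all others become a two-byte UTF-8 sequence.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hand-written ISO-8859-1 -> UTF-8 conversion (illustrative only, not the
 * libxml2 function). Writes a NUL-terminated UTF-8 string into `out`.
 * Returns the number of bytes written, not counting the NUL, or -1 if
 * `out` (capacity `outcap`) is too small.
 */
int latin1_to_utf8(const unsigned char *in, size_t inlen,
                   unsigned char *out, size_t outcap)
{
    size_t o = 0;

    assert(in != NULL && out != NULL);
    if (outcap == 0)
        return -1;

    for (size_t i = 0; i < inlen; i++) {
        unsigned char c = in[i];

        if (c < 0x80) {
            /* ASCII range: one byte, keep room for the final NUL. */
            if (o + 1 >= outcap)
                return -1;
            out[o++] = c;
        } else {
            /* 0x80..0xFF: code point U+0080..U+00FF, two UTF-8 bytes. */
            if (o + 2 >= outcap)
                return -1;
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    out[o] = '\0';
    return (int)o;
}
```

The converted buffer could then be passed (cast to xmlChar *) to the usual node-creation calls, which expect UTF-8.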

Hum, that's of course an option.

  No it's the only option. libxml2 API strings are UTF8 encoded, there
is no way around this.

Oh, that's crystal clear and nobody wants to change this (at least
i won't).

If you have a requirement for another encoding,
encapsulate libxml2 calls with them.

No need for another (internal) encoding,
it's just about benefiting from the existing libxml encoding support.

Having libxml2 expose incoherent API is not needed
(there is enough of them), 
[...]

That's certainly not what i'm asking for.
With all due respect, that is no reasonable answer to my
question. Looking through this thread, i can't find any place
where i invite you to "expose incoherent API".
I've a hunch that we are talking at cross purposes, so maybe
i have to illustrate my concern:

When i'm parsing a document that declares <?xml ... encoding="iso-8859-1"?>
(or any other registered encoding),
libxml will handle the charset conversion and build an
internal representation that is encoded in UTF-8
(and that's pretty nice and preventing... 8)
For that it uses the default encoding support or an
(application defined) encoding handler. The raw data are an
(application defined) character buffer and the encoding
information ("iso-8859-1") that is held in the xmlDoc node.
there is also a performance
issue when delegating charset conversions to libxml2.
(Performance seems OK, at least i haven't read any complaints 8)
Or have i got things completely wrong?


Now, if i'm building a document from scratch, using

docPtr = xmlNewDoc(BAD_CAST XML_DEFAULT_VERSION);
docPtr->encoding = xmlStrdup(BAD_CAST "iso-8859-1");
(or something like xmlNewDocEnc(XML_DEFAULT_VERSION, encoding))

and create a new node that is attached to that doc, let's
say using a function like

xmlNewDocNodeEnc(xmlDocPtr doc, xmlNsPtr ns,
const char *name, const char *content);

i've the same raw data, an application buffer (content),
the encoding information (docPtr->encoding) and i want
to build an internal, UTF-8 encoded representation of
that data.
Additionally we have the existing libxml encoding support
(and possibly extension handlers), a mechanism that is
demonstrably well suited for that job... 8)

So, is it far-fetched to ask whether there's
already a way (or whether it would be useful to have one)
to handle the above mentioned scenario in an efficient way
that benefits from the existing encoding support?
Where do you see incoherence?
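
To make the scenario concrete, here is a rough, self-contained sketch of the idea: look up a converter by the document's encoding name and use it to turn the application buffer into UTF-8 before a node would be created. Everything in it (`enc_handler`, `content_to_utf8`, the two converters) is a hypothetical stand-in written for this mail, not libxml2 code; in libxml2 itself the lookup step would presumably go through xmlFindCharEncodingHandler.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* A converter returns a malloc'd, NUL-terminated UTF-8 copy of `in`. */
typedef char *(*to_utf8_fn)(const char *in);

/* ISO-8859-1: each byte is the code point of the same value. */
char *from_latin1(const char *in)
{
    size_t n = strlen(in);
    char *out = malloc(2 * n + 1);          /* worst case: 2 bytes per char */
    unsigned char *p = (unsigned char *)out;

    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = (unsigned char)in[i];
        if (c < 0x80) {
            *p++ = c;
        } else {
            *p++ = (unsigned char)(0xC0 | (c >> 6));
            *p++ = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    *p = '\0';
    return out;
}

/* UTF-8 input needs no conversion, only a copy (not validated here). */
char *from_utf8(const char *in)
{
    size_t n = strlen(in) + 1;
    char *out = malloc(n);

    if (out != NULL)
        memcpy(out, in, n);
    return out;
}

/* Hypothetical handler table, keyed by encoding name. */
struct enc_handler {
    const char *name;
    to_utf8_fn to_utf8;
};

static const struct enc_handler handlers[] = {
    { "iso-8859-1", from_latin1 },
    { "utf-8",      from_utf8   },
};

/* Encoding names are compared case-insensitively. */
static int name_eq(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/*
 * Convert `content` from `doc_encoding` to UTF-8. A NULL encoding is
 * treated as UTF-8. Returns NULL for an unknown encoding or on
 * allocation failure; the caller frees the result.
 */
char *content_to_utf8(const char *doc_encoding, const char *content)
{
    assert(content != NULL);
    if (doc_encoding == NULL)
        return from_utf8(content);
    for (size_t i = 0; i < sizeof handlers / sizeof handlers[0]; i++) {
        if (name_eq(handlers[i].name, doc_encoding))
            return handlers[i].to_utf8(content);
    }
    return NULL;
}
```

A wrapper like the xmlNewDocNodeEnc() proposed above would call something like content_to_utf8() with docPtr->encoding and then hand the result to the regular UTF-8 node constructor.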


Daniel


Ciao, Markus


