[xml] Encoding problems building document from scratch
- From: "Henke, Markus" <Markus_Henke ordat com>
- To: "'xml gnome org'" <xml gnome org>
- Subject: [xml] Encoding problems building document from scratch
- Date: Fri, 11 Jan 2002 13:15:44 +0100
Sorry for posting this again. I posted it before as a RE: to
my original posting, which may have been misleading since it is not an
answer but rather a follow-up question that includes some debugging
information and a possible solution that seems to work, although I don't
know if it's the right way to handle these things...
From: Henke, Markus
Sent: Wednesday, January 09, 2002 11:33 AM
To: 'xml gnome org'
Subject: encoding problems using xmlSaveFormatFileTo()
me again... 8)
While testing an application that writes XML docs from scratch
using libxml2-2.4.12, I've run into some trouble with the character
encoding support of libxml.
I've read the documentation on that subject and the reference
for the functions that I use and thought I had understood it so far,
since I'm currently working with iso-8859-1 encoding
(which is supported by default).
But now I'm unsure whether I got something wrong (it wouldn't be
the first time... 8).
This is what I'm doing:
/* Build a new document */
docPtr = xmlNewDoc(BAD_CAST XML_DEFAULT_VERSION);
/* encode entities for new node content (ISO-8859-1 bytes in the source) */
const char* contentBuff = "Some content < & > aou äöü AOU ÄÖÜ ß üöä ";
tmpBuff = xmlEncodeEntitiesReentrant(docPtr, BAD_CAST contentBuff);
/* Add node (xmlNewChild() takes its parent as an xmlNodePtr) */
xmlNewChild((xmlNodePtr)docPtr, NULL, BAD_CAST "aNode", tmpBuff);
/* the encoded copy belongs to the caller */
xmlFree(tmpBuff);
/*** create output buffer ***/
outputBufferPtr = xmlOutputBufferCreateIO(writeCallback,
    closeCallback, (void*)&fileDesc, NULL);
/* save doc to disk */
xmlSaveFormatFileTo(outputBufferPtr, docPtr, "iso-8859-1", 0);
This is what I get:
<?xml version="1.0" encoding="iso-8859-1"?>
<aNode>Some content < & > aou ä¶¼ AOU Ä-Ü ß Ã¼öä </aNode>
So some of the German umlauts are encoded correctly, some
not (but differently) and some as well as (!).
Debugging shows that xmlEncodeEntitiesReentrant()
correctly replaces the German umlauts with their character references.
That seems not to be the whole truth. A closer look
shows that the encoding of
"Some content < & > aou äöü AOU ÄÖÜ ß üöä "
results in
"Some content &lt; &amp; &gt; aou &#x4DBC; AOU
and that does not look OK to me.
I've debugged xmlEncodeEntitiesReentrant() and found that
it evaluates doc->encoding, which is NULL at the time I
append my child node since the document has only just been created.
So UTF-8 input is assumed, and the byte for 'ä' (#xe4) is taken as
the start of a 3-byte character and encoded as &#x4DBC;, which breaks
the processing of the input buffer (two characters are skipped, the
next one is the BLANK, etc.).
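To make the arithmetic explicit, here is a minimal sketch of what I think
happens. It is not libxml code; the helper names decode_utf8_3byte() and
format_char_ref() are made up for illustration. It decodes the three
ISO-8859-1 bytes of "äöü" (0xE4 0xF6 0xFC) as if they were one 3-byte
UTF-8 sequence, without checking that the trailing bytes are valid
continuation bytes:

```c
#include <stdio.h>

/* Hypothetical helper, not part of libxml: decode three bytes as one
 * UTF-8 character the way a lenient decoder would, i.e. without
 * verifying that b1 and b2 have the 10xxxxxx continuation form. */
unsigned int decode_utf8_3byte(unsigned char b0, unsigned char b1,
                               unsigned char b2)
{
    return ((unsigned int)(b0 & 0x0F) << 12) |  /* 4 payload bits of the lead byte */
           ((unsigned int)(b1 & 0x3F) << 6)  |  /* 6 payload bits */
            (unsigned int)(b2 & 0x3F);          /* 6 payload bits */
}

/* Hypothetical helper: format a code point as a hexadecimal
 * character reference. Returns the snprintf() result. */
int format_char_ref(char *out, size_t size, unsigned int cp)
{
    return snprintf(out, size, "&#x%X;", cp);
}
```

Calling decode_utf8_3byte(0xE4, 0xF6, 0xFC) gives 0x4DBC, which is exactly
the code point (the character 䶼) that shows up in place of "äöü".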
I've tried manually setting the encoding of the doc immediately after
creation, before appending child nodes, via
docPtr->encoding = xmlStrdup("iso-8859-1");
and that seems to work.
Is this the correct (or the only) way to get a correct
encoding for a document that is built from scratch?
And if so, wouldn't it be useful to provide an xmlNewDoc()
function that takes an encoding as a parameter?
Or is there already something similar that I've missed...?
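As far as I understand the encoding documentation, libxml keeps all
strings in UTF-8 internally, so another option might be to convert the
ISO-8859-1 content to UTF-8 before handing it to the tree functions.
libxml ships isolat1ToUTF8() in encoding.h for this; the standalone sketch
below only illustrates the conversion itself, and the helper name
latin1_to_utf8() is my own invention, not a library function:

```c
#include <stddef.h>

/* Hypothetical helper: convert a NUL-terminated ISO-8859-1 string to
 * UTF-8. Writes at most out_size bytes including the terminating NUL.
 * Returns the number of bytes written (excluding the NUL), or -1 if
 * the output buffer is too small. */
int latin1_to_utf8(const unsigned char *in, unsigned char *out,
                   size_t out_size)
{
    size_t n = 0;

    for (; *in != '\0'; in++) {
        if (*in < 0x80) {
            /* ASCII is identical in both encodings */
            if (n + 1 >= out_size)
                return -1;
            out[n++] = *in;
        } else {
            /* 0x80..0xFF become a two-byte sequence 110xxxxx 10xxxxxx */
            if (n + 2 >= out_size)
                return -1;
            out[n++] = (unsigned char)(0xC0 | (*in >> 6));
            out[n++] = (unsigned char)(0x80 | (*in & 0x3F));
        }
    }
    if (n >= out_size)
        return -1;
    out[n] = '\0';
    return (int)n;
}
```

With this, 'ä' (0xE4) becomes the two bytes 0xC3 0xA4, and a worst-case
output buffer needs twice the input length plus one byte for the NUL.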
So I thought about the
xmlCharEncodingHandlerPtr parameter of xmlOutputBufferCreateIO(),
but the library reference manual is silent about it...
I had hoped that passing NULL to xmlOutputBufferCreateIO()
would invoke the
libxml default encoding handler!? Is this the mistake, or
is something else going wrong?
Thanx for your effort & Ciao, Markus
Mit freundlichen Gruessen - Kind regards
ORDAT GmbH & Co. KG - Serversystems / eCom
Dipl.-Inf. (FH) Markus Henke Fon: +49 (641) 7941-0
Rathenaustr. 1 Fax: +49 (641) 7941-132
35394 Gießen mailto:markus henke ordat com
...this behavior is by desig...