[xml] RE: encoding problems using xmlSaveFormatFileTo()




 -----Original Message-----
From:         Henke, Markus  
Sent: Wednesday, January 09, 2002 11:33 AM
To:   'xml gnome org'
Subject:      encoding problems using xmlSaveFormatFileTo()

Hello,

me again... 8)
While testing an application that writes XML docs from scratch
using libxml2-2.4.12,
I've run into some trouble with the character encoding support of libxml.
I've read the documentation on that subject and the reference
for the functions
that I use, and I thought I had understood it so far, since I'm currently
working with iso-8859-1
encoding (which is supported by default).
But now I'm unsure whether I got something wrong (it wouldn't be
the first time... 8).

That's what i'm doing:

/* Build a new document */
docPtr = xmlNewDoc(BAD_CAST XML_DEFAULT_VERSION);
...

/* encode entities for new node content */
char* contentBuff = "Some content < & > aou äöü AOU ÄÖÜ ß üöä ";
tmpBuff = xmlEncodeEntitiesReentrant(docPtr, BAD_CAST contentBuff);
...

/* Add node */
xmlNewChild((xmlNodePtr) docPtr, NULL, BAD_CAST "aNode", tmpBuff);
xmlFree(tmpBuff);
...

/*** create output buffer ***/
outputBufferPtr = xmlOutputBufferCreateIO(writeCallback,
closeCallback, (void*)&fileDesc, NULL);

/* save doc to disk */
xmlSaveFormatFileTo(outputBufferPtr, docPtr, "iso-8859-1", 0);
...


That's what i get:

<?xml version="1.0" encoding="iso-8859-1"?>
<aNode>Some content &lt; &amp; &gt; aou 䶼 AOU Ä-Ü ß Ã¼öä </aNode>


So, some of the German umlauts are encoded correctly, some
not (but differently)
and some as well as (!).
Debugging shows that xmlEncodeEntitiesReentrant() correctly
replaces the
German umlauts with their character references,

That seems not to be the whole truth. A closer look shows that the
encoding of

"Some content < & > aou äöü AOU ÄÖÜ ß üöä "

is

"Some content &lt; &amp; &gt; aou &#x4DBC; AOU &#x116;&#x720;&#x7E0;&#252;öä "

and that does not look OK to me.
I've debugged into xmlEncodeEntitiesReentrant() and found that it evaluates
doc->encoding,
which is NULL at the time I append my child node, since the document has
only just been created.
So UTF-8 encoding is assumed, and the character 'ä' (#xe4) is taken as the
start of a 3-byte character and encoded as
&#x4DBC;, which breaks the input buffer (two characters are skipped, the
next one is a blank, etc.).
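
The character references quoted above can be reproduced by hand, which may
help to confirm the diagnosis. The following is only a sketch of the
arithmetic, assuming the bytes are read as UTF-8 and that the bytes after a
lead byte are not checked for being valid continuation bytes. The function
name lenient_utf8_decode is made up for this illustration; it is not
libxml2 code.

```c
#include <stddef.h>

/*
 * Illustration only (hypothetical helper, not part of libxml2):
 * decode one UTF-8 sequence by looking at the lead byte alone, without
 * verifying that the following bytes have the form 10xxxxxx.
 * Returns the number of bytes consumed, or 0 if the first byte is not a
 * usable lead byte or the input is too short.
 */
static size_t lenient_utf8_decode(const unsigned char *in, size_t len,
                                  unsigned long *cp)
{
    size_t n, i;
    unsigned long c;

    if (len == 0)
        return 0;

    if (in[0] < 0x80)                { c = in[0];        n = 1; }
    else if ((in[0] & 0xE0) == 0xC0) { c = in[0] & 0x1F; n = 2; }
    else if ((in[0] & 0xF0) == 0xE0) { c = in[0] & 0x0F; n = 3; }
    else if ((in[0] & 0xF8) == 0xF0) { c = in[0] & 0x07; n = 4; }
    else
        return 0;                    /* e.g. 0xFC ('ü' in iso-8859-1) */

    if (n > len)
        return 0;

    /* Take the low 6 bits of each following byte, valid or not. */
    for (i = 1; i < n; i++)
        c = (c << 6) | (in[i] & 0x3F);

    *cp = c;
    return n;
}
```

Fed with the iso-8859-1 bytes of "äöü" (E4 F6 FC), this yields #x4DBC;
"ÄÖ" (C4 D6) yields #x116; and 'Ü' or 'ß' followed by a blank (DC 20,
DF 20) yield #x720 and #x7E0. The byte FC is no usable lead byte here,
which would fit the 'ü' that shows up on its own as &#252;.
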
I've tried setting the encoding of the doc manually, immediately after
creation and before
appending child nodes, via

docPtr->encoding = xmlStrdup(BAD_CAST "iso-8859-1");

and that seems to work.
Is this the correct (or the only) way to get a correct encoding for a
document that is
built from scratch?
And if so, wouldn't it be useful to provide an xmlNewDoc() function that
takes an
encoding as a parameter? Or is there already something similar that I've
missed...?
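
One possible alternative, given here as an assumption and not as a
confirmed answer: since the library treats node content as UTF-8 when no
document encoding is set, the iso-8859-1 input could be converted to UTF-8
before it is handed to the tree functions. libxml2 ships isolat1ToUTF8() in
its encoding module for this purpose; the sketch below does the same
conversion in plain C so that it stands alone. The name latin1_to_utf8 is
made up for this example.

```c
#include <stddef.h>

/*
 * Sketch (hypothetical helper): convert iso-8859-1 bytes to UTF-8.
 * Every iso-8859-1 byte equals its Unicode code point, so bytes below 0x80
 * are copied and bytes from 0x80 up become a two-byte UTF-8 sequence.
 * 'out' must have room for 2 * inlen + 1 bytes.
 * Returns the number of bytes written, not counting the trailing NUL.
 */
static size_t latin1_to_utf8(const unsigned char *in, size_t inlen,
                             unsigned char *out)
{
    size_t i, o = 0;

    for (i = 0; i < inlen; i++) {
        if (in[i] < 0x80) {
            out[o++] = in[i];
        } else {
            out[o++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[o++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    out[o] = '\0';
    return o;
}
```

With this, 'ä' (E4) becomes the two bytes C3 A4. The converted buffer could
then be passed on as the node content, and the "iso-8859-1" argument of
xmlSaveFormatFileTo() would only describe the output file.
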


 so I thought about the
xmlCharEncodingHandlerPtr parameter of xmlOutputBufferCreateIO(),
but the library reference manual is silent about it...
I had hoped that passing NULL to xmlOutputBufferCreateIO()
would invoke the
libxml default encoding handler!? Is this the mistake, or is
something else going wrong?


Thanks for your effort & ciao, Markus



Kind regards
Markus Henke



________________________Addressed by:________________________
 ORDAT GmbH & Co. KG  -  Serversystems / eCom 
 Dipl.-Inf. (FH) Markus Henke  Fon: +49 (641) 7941-0
 Rathenaustr. 1                Fax: +49 (641) 7941-132
 35394 Gießen                  mailto:markus henke ordat com
 See:                          http://www.ordat.com
_____________________________________________________________
              ...this behavior is by desig...


