[xml] Encoding problems building document from scratch



Hello,

sorry for posting this again. I previously sent it as a RE: to
my original posting, which may have been misleading, since it is not an
answer but a follow-up question with some debugging information
and a possible solution that seems to work, although I don't know
whether it's the right way to handle these things...

 -----Original Message-----
From:         Henke, Markus  
Sent: Wednesday, January 09, 2002 11:33 AM
To:   'xml gnome org'
Subject:      encoding problems using xmlSaveFormatFileTo()

Hello,

me again... 8)
While testing an application that writes XML docs from scratch
using libxml2-2.4.12, I've run into some trouble with the character
encoding support of libxml.
I've read the documentation on that subject and the reference
for the functions I use, and thought I had understood it so far,
since I'm currently working with iso-8859-1 encoding
(which is supported by default).
But now I'm unsure whether I got something wrong (it wouldn't be
the first time... 8).

This is what I'm doing:

/* Build a new document */
docPtr = xmlNewDoc(BAD_CAST XML_DEFAULT_VERSION);
...

/* encode entities for new node content */
const char* contentBuff = "Some content < & > aou äöü AOU ÄÖÜ ß üöä ";
tmpBuff = xmlEncodeEntitiesReentrant(docPtr, BAD_CAST contentBuff);
...

/* Add node */
xmlNewChild((xmlNodePtr) docPtr, NULL, BAD_CAST "aNode", tmpBuff);
xmlFree(tmpBuff);
...

/*** create output buffer ***/
outputBufferPtr = xmlOutputBufferCreateIO(writeCallback,
closeCallback, (void*)&fileDesc, NULL);

/* save doc to disk */
xmlSaveFormatFileTo(outputBufferPtr, docPtr, "iso-8859-1", 0);
...


This is what I get:

<?xml version="1.0" encoding="iso-8859-1"?>
<aNode>Some content &lt; &amp; &gt; aou 䶼 AOU Ä-Ü ß Ã¼öä </aNode>


So some of the German umlauts are encoded correctly, some are
not (and in different ways) and some as well as (!).
Debugging shows that xmlEncodeEntitiesReentrant()
correctly replaces the German umlauts with their character references, 

 That seems not to be the whole truth. A closer look
 shows that the encoding of
 
"Some content < & > aou äöü AOU ÄÖÜ ß üöä "

is

"Some content &lt; &amp; &gt; aou &#x4DBC; AOU 
&#x116;&#x720;&#x7E0;&#252;öä "

and that does not look OK to me.
I've debugged xmlEncodeEntitiesReentrant() and found that
it evaluates doc->encoding, which is NULL at the time I
append my child node, since the document has only just been created.
So UTF-8 encoding is assumed, and the character 'ä' (#xe4)
is taken as the start of a 3-byte character and encoded as &#x4DBC;,
which throws off the reading of the input buffer
(two characters are skipped, the next one is a blank, etc.).
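
The arithmetic behind that misreading can be checked by hand. The
following is only a sketch: naive_utf8_decode is a hypothetical helper,
not part of libxml2, and it deliberately skips validation of the
continuation bytes to mimic what seems to happen when ISO-8859-1 bytes
are read as UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper (not a libxml2 function): decode one UTF-8
 * sequence of up to 3 bytes WITHOUT checking that the following bytes
 * are valid continuation bytes. Returns the code point and stores the
 * number of bytes consumed in *used. Returns -1 (and consumes 1 byte)
 * for lead bytes this sketch does not handle. */
static long naive_utf8_decode(const unsigned char *s, size_t *used)
{
    if (s[0] < 0x80) {                    /* plain ASCII */
        *used = 1;
        return s[0];
    }
    if ((s[0] & 0xE0) == 0xC0) {          /* 110xxxxx: 2-byte lead */
        *used = 2;
        return ((long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    }
    if ((s[0] & 0xF0) == 0xE0) {          /* 1110xxxx: 3-byte lead */
        *used = 3;
        return ((long)(s[0] & 0x0F) << 12)
             | ((long)(s[1] & 0x3F) << 6)
             | (s[2] & 0x3F);
    }
    *used = 1;                            /* anything else: give up */
    return -1;
}
```

Fed the ISO-8859-1 bytes E4 F6 FC ("äöü"), this yields 0x4DBC; C4 D6
("ÄÖ") gives 0x116, DC 20 ("Ü" plus blank) gives 0x720, and DF 20
("ß" plus blank) gives 0x7E0, which matches the character references
shown above.
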
I've tried setting the encoding of the doc manually, immediately after
creation and before appending child nodes, via

docPtr->encoding = xmlStrdup("iso-8859-1");

and that seems to work.
Is this the correct (or the only) way to get the right
encoding for a document that is built from scratch?
And if so, wouldn't it be useful to provide an xmlNewDoc()
function that takes an encoding as a parameter?
Or is there already something similar that I've missed...?
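
An alternative to setting doc->encoding by hand might be to convert the
content to UTF-8 before passing it to the tree functions, assuming they
expect UTF-8 input. The sketch below shows such a conversion;
latin1_to_utf8 is a hypothetical helper written for illustration
(libxml2's encoding.h declares isolat1ToUTF8(), which may serve the
same purpose).

```c
#include <stddef.h>

/* Hypothetical helper (not a libxml2 function): convert a
 * NUL-terminated ISO-8859-1 string to UTF-8. The caller must provide
 * an output buffer of at least 2 * strlen(in) + 1 bytes. Returns the
 * number of bytes written, not counting the terminating NUL. */
static size_t latin1_to_utf8(const unsigned char *in, unsigned char *out)
{
    size_t n = 0;

    for (; *in != 0; ++in) {
        if (*in < 0x80) {
            /* ASCII range is identical in both encodings */
            out[n++] = *in;
        } else {
            /* 0x80..0xFF map to U+0080..U+00FF: two UTF-8 bytes */
            out[n++] = (unsigned char)(0xC0 | (*in >> 6));
            out[n++] = (unsigned char)(0x80 | (*in & 0x3F));
        }
    }
    out[n] = 0;
    return n;
}
```

With this, the single byte E4 ("ä") becomes the two bytes C3 A4, so a
UTF-8 reader would see one complete character rather than the start of
a 3-byte sequence.
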


 so I thought about the
xmlCharEncodingHandlerPtr parameter of xmlOutputBufferCreateIO(),
but the library reference manual is silent about it...
I had hoped that passing NULL to xmlOutputBufferCreateIO()
would invoke the
libxml default encoding handler!? Is this the mistake, or
is something else going wrong?

 

Thanx for your effort & Ciao, Markus



Kind regards
Markus Henke



________________________Addressed by:________________________
  ORDAT GmbH & Co. KG  -  Serversystems / eCom 
  Dipl.-Inf. (FH) Markus Henke  Fon: +49 (641) 7941-0
  Rathenaustr. 1                Fax: +49 (641) 7941-132
  35394 Gießen                  mailto:markus henke ordat com
  See:                          http://www.ordat.com
 _____________________________________________________________
               ...this behavior is by desig...


