[xml] Substituting Umlauts by Unicode (&#x...) entities

From: Holger Rauch <holger rauch heitec de>
To: xml gnome org
Subject: [xml] Substituting Umlauts by Unicode (&#x...) entities
Date: Thu, 23 Jan 2003 11:45:12 +0100
Hi!

I'm using libxml2 version 2.5.1 on Solaris 8 and I'm
parsing the following ISO 8859-1 encoded XML file:

<?xml version="1.0" encoding="iso-8859-1"?>
<SELECT id="s-dokumente.retrieval">
<SQL table-name="dokumente">
  <WHERE>(Selektor10 like '%müssen%') and Seiten&gt;0 and (dokformat is not
null or dokformat=1)</WHERE>
</SQL>
<XML target="/arc:Document" order-by="sortall (@docId desc)">
  <XSLT type="resultset" id="arc.doc.retr.full.xsl">
  <!--
    <PARAM key="result-small-form" value="1"/>
    -->
  </XSLT>
</XML>
</SELECT>

After the document is parsed, I need to extract the contents of
"/SELECT/SQL/WHERE" in such a way that any umlauts in there are substituted
by the corresponding Unicode entity (numeric character reference):

ä ==> &#xE4;
ß ==> &#xDF;

etc., in order to avoid displaying unreadable UTF-8 byte sequences in an
ISO-8859-1 terminal window. (Substituting them during the parsing process
would also be OK, if that's possible.)
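
A minimal sketch of the mapping described above, in plain C and independent
of libxml2. It assumes the input is a NUL-terminated UTF-8 string (the form in
which libxml2 hands back node content). The name escape_non_ascii() is
hypothetical and not a libxml2 function, and the sketch does not reject
overlong UTF-8 encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, NOT part of libxml2.
 * Copies the UTF-8 string `in` to `out`, replacing every non-ASCII
 * character with a hexadecimal character reference such as &#xE4;.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 if `in` is malformed UTF-8 or `out` is too small. */
int escape_non_ascii(const unsigned char *in, char *out, size_t outsize)
{
    size_t used = 0;

    assert(in != NULL && out != NULL);

    while (*in != '\0') {
        unsigned long cp;
        int extra, i, n;

        if (*in < 0x80) {                  /* plain ASCII: copy as is */
            if (used + 1 >= outsize)
                return -1;
            out[used++] = (char)*in++;
            continue;
        }

        /* The lead byte tells how many continuation bytes follow. */
        if ((*in & 0xE0) == 0xC0)      { cp = *in & 0x1F; extra = 1; }
        else if ((*in & 0xF0) == 0xE0) { cp = *in & 0x0F; extra = 2; }
        else if ((*in & 0xF8) == 0xF0) { cp = *in & 0x07; extra = 3; }
        else
            return -1;                     /* stray continuation byte */
        in++;

        for (i = 0; i < extra; i++, in++) {
            if ((*in & 0xC0) != 0x80)
                return -1;                 /* truncated sequence */
            cp = (cp << 6) | (unsigned long)(*in & 0x3F);
        }

        /* The longest reference emitted here is 10 bytes plus the NUL. */
        if (outsize - used < 11)
            return -1;
        n = snprintf(out + used, outsize - used, "&#x%lX;", cp);
        used += (size_t)n;
    }

    if (used >= outsize)
        return -1;
    out[used] = '\0';
    return (int)used;
}
```

Applied to the UTF-8 bytes of "müssen", this yields m&#xFC;ssen, which is
plain ASCII and therefore displays the same on any terminal.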

What I have tried so far is the following (after parsing the above document):

1. Retrieve the content of "/SELECT/SQL/WHERE" using the XPath module in
conjunction with xmlNodeGetContent() and feed the result into one of

xmlEncodeEntitiesReentrant()
xmlEncodeSpecialChars()

Unfortunately, neither of these two functions produced the desired result.
(By "not produce the desired result" I mean that the result was some
unreadable character. This probably comes from the fact that I was trying to
display UTF-8 in an ISO 8859-1 terminal window.)

2. Retrieve the content of "/SELECT/SQL/WHERE" using the XPath module in
conjunction with xmlNodeGetContent() and feed the result into the encoding
function given in John Fleck's tutorial at

http://www.inkstain.net/fleck/tutorial/apf.html

==> Same result as above (unreadable character)

3. Obtain just the "/SELECT/SQL/WHERE" node, again using the XPath module, but
this time, instead of calling xmlNodeGetContent() directly, reparse the node's
content using xmlParseBalancedChunkMemory() and only then call
xmlNodeGetContent() on the result.

==> Same result as with 1. and 2. (unreadable character)

4. Looked up the code of xmlDocDump() in tree.c. From that I discovered that
- an xmlOutputBuffer is created
- the encoding of the current document is determined and fed into
xmlFindCharEncodingHandler
- at some later point an internal (static) function named
xmlNodeDumpOutputInternal() is invoked

I tried to adapt this to my code in the following manner (trying to dump to
memory instead of a file):

(seldoc is a valid (non-NULL) xmlDocPtr to the parsed XML document.
Furthermore, I verified that numbytes is > 0, so there obviously were no
errors until xmlOutputBufferClose().)

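if ( seldoc->encoding ) {
    /* Look up the converter for the document's declared encoding. */
    encHandler = xmlFindCharEncodingHandler( seldoc->encoding );
    if ( ! encHandler ) {
      /* error */
    }
    /* Allocate an output buffer that converts to that encoding. */
    outbuf = xmlAllocOutputBuffer( encHandler );
    if ( ! outbuf ) {
      /* error */
    }
    /* Serialize the whole document into the output buffer. */
    numbytes = xmlSaveFileTo( outbuf,
                              seldoc,
                              NULL );
    if ( numbytes == 0 ) {
      /* error */
    }
    numbytes = xmlOutputBufferClose( outbuf );
    if ( numbytes == 0 ) {
      /* error */
    }
}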

The problem is that I don't know how to fetch that output buffer's contents.
I tried both

xmlBufferContent( outbuf->buffer ) and
xmlBufferContent( outbuf->conv )

but even though numbytes was > 0 in all of the above calls,
the results of both invocations of xmlBufferContent() led to segfaults
when passed to printf().

5. Used the search facility provided at http://xmlsoft.org/search.php to
search the mailing list archives. Unfortunately, I couldn't find a solution
there.
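
The unreadable characters in attempts 1 to 3 are consistent with UTF-8 bytes
being printed on an ISO-8859-1 terminal: "ü" is the two bytes 0xC3 0xBC in
UTF-8 but the single byte 0xFC in ISO-8859-1. As a rough sketch of the
alternative to entity substitution, the UTF-8 string can be converted back to
ISO-8859-1 before printing. The helper name utf8_to_latin1() is hypothetical
and not part of libxml2 (libxml2's encoding module declares UTF8Toisolat1()
for this purpose); it assumes a NUL-terminated input and fails on any
character above U+00FF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, NOT part of libxml2.
 * Converts the UTF-8 string `in` to ISO-8859-1 in `out`.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 if the input is malformed, contains a character above U+00FF,
 * or `out` is too small. */
int utf8_to_latin1(const unsigned char *in, unsigned char *out,
                   size_t outsize)
{
    size_t used = 0;

    assert(in != NULL && out != NULL);

    while (*in != '\0') {
        unsigned char c;

        if (*in < 0x80) {
            /* ASCII is identical in both encodings. */
            c = *in++;
        } else if ((*in == 0xC2 || *in == 0xC3) &&
                   (in[1] & 0xC0) == 0x80) {
            /* U+0080..U+00FF: two bytes, lead byte 0xC2 or 0xC3. */
            c = (unsigned char)(((in[0] & 0x03) << 6) | (in[1] & 0x3F));
            in += 2;
        } else {
            return -1;  /* not representable in ISO-8859-1, or malformed */
        }

        if (used + 1 >= outsize)
            return -1;  /* keep one byte free for the terminating NUL */
        out[used++] = c;
    }

    if (used >= outsize)
        return -1;
    out[used] = '\0';
    return (int)used;
}
```

With this approach the WHERE clause would display as typed on the terminal,
whereas the entity substitution asked about here keeps the output pure ASCII.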

As is obvious from what I mentioned above, I did quite a lot of work on my own
and am now sort of lost. My question is thus: How can I substitute umlauts
by Unicode entities? (I noticed that xmlDocDump(stdout, ...) does output
these entities when I add new elements containing text nodes with umlauts
to an already parsed XML document using xmlAddChild(), so there probably is a
way to output these entity references.)

Thanks in advance for any help and sorry for this rather long mail, but I
had the impression it was necessary to mention what I tried.

Greetings,

        Holger
Follow-Ups:
- Re: [xml] Substituting Umlauts by Unicode (&#x...) entities
  - From: Daniel Veillard