Re: [xml] Substituting Umlauts by Unicode (&#x...) entities

From: Daniel Veillard <veillard redhat com>
To: xml gnome org
Subject: Re: [xml] Substituting Umlauts by Unicode (&#x...) entities
Date: Thu, 23 Jan 2003 16:55:02 -0500

On Thu, Jan 23, 2003 at 09:29:00PM +0100, Holger Rauch wrote:

It seems like you misunderstood me in my first mail. I didn't want to save


 Right,

the output, I wanted to simply store the output in memory (but with the UTF-8
characters substituted by their corresponding character references).
And actually not the entire document, but just a specific text node. Maybe
xmlNodeDumpOutput() would have been more appropriate for dumping a single
node.


  Yes, that's the right API to use, unless you want to dump a full document,
in which case

void            xmlDocDumpMemoryEnc     (xmlDocPtr out_doc,
                                         xmlChar **doc_txt_ptr,
                                         int * doc_txt_len,
                                         const char *txt_encoding);

  does what you need.

Besides, the try with the xmlOutputBufferPtr was just in order to emulate the
behavior of xmlDocDump() except for the part where it comes to saving the
output to a file. Is it possible to view the contents of an
xmlOutputBufferPtr using xmlBufferContent() or not? (That's one of the
important questions in my first mail.)


  xmlOutputBufferPtr is used for streaming output:

struct _xmlOutputBuffer {
    void*                   context;
    xmlOutputWriteCallback  writecallback;
    xmlOutputCloseCallback  closecallback;

    xmlCharEncodingHandlerPtr encoder; /* I18N conversions to UTF-8 */

    xmlBufferPtr buffer;    /* Local buffer encoded in UTF-8 or ISOLatin */
    xmlBufferPtr conv;      /* if encoder != NULL buffer for output */
    int written;            /* total number of byte written */
};

  and xmlOutputBufferPtr is not an xmlBufferPtr, clearly you cannot use:

const xmlChar * xmlBufferContent(const xmlBufferPtr buf)

 on it; those are not the same type. You can look at the buffer field
or the conv field instead.

Sorry, but as a libxml2 *user* who has never dealt with the issue described
in my first mail, I just cannot know whether the xmlOutputBuffer stuff is
the right tool for what I'm trying to achieve.


  You're trying to achieve something which is NOT part of the normal
processing model for XML, that's the core of your problem!
Normal users do not tweak the serialization, so the fact that it's not
easy to do is, in my opinion, perfectly normal. The API is designed
for normal use. There is no reason in hell that a serialization
of XML should use character references instead of the inline
serialization of the Unicode characters in the selected encoding.
Moreover, saving an XML node out of context also makes very little sense
from an XML point of view: what about entities, namespaces, base contexts?
This is not XML processing that you're trying to do, right?

In case using xmlOutputBuffers is wrong, does xmlDocDumpMemoryEnc() convert
UTF-8 characters by the corresponding character entities?


  Look at your sentence, it makes very little sense. Terminology:
     - a character in an XML context is a Unicode code point.
     - UTF-8 is an encoding able to represent the full range of Unicode.
     - there are no UTF-8 specific characters, they are all covered.
     - &#20; is not a character entity, it is a character reference.

 To rephrase your question correctly, assuming I understood it:
 "does xmlDocDumpMemoryEnc() replace non-ASCII characters with the
  corresponding character references?"

And the answer is:
   it depends on the encoding used. If the target encoding does not
   support the code point of a character needing to be serialized, then
   yes, it will try to use a character reference to emit that character.
   If the target encoding supports the full Unicode range, then this will
   never happen.
   In the specific case of the "ASCII" encoding, any character outside the
   ASCII range will be converted that way!
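
To make the rule for the "ASCII" case concrete, here is a standalone illustration in plain C. It is not libxml2's implementation, and the function names are invented for this sketch: each code point below 0x80 is copied as-is, anything else is written as a decimal character reference (hexadecimal would be equally valid). No escaping of markup characters such as < or & is attempted.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Decode one UTF-8 sequence at s. Stores the code point in *cp and
 * returns the number of bytes consumed, or 0 on malformed input. */
static size_t utf8_decode(const unsigned char *s, unsigned long *cp)
{
    size_t n, i;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; n = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; n = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; n = 4; }
    else return 0;

    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)   /* also stops at a premature NUL */
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    if (*cp < 0x80)                  /* reject overlong ASCII forms */
        return 0;
    return n;
}

/* Return a malloc()ed ASCII copy of a UTF-8 string in which every
 * non-ASCII code point is replaced by a decimal character reference.
 * Returns NULL on malformed input or allocation failure. */
static char *ascii_with_charrefs(const char *utf8)
{
    const unsigned char *p = (const unsigned char *) utf8;
    size_t cap, len = 0;
    char *out;

    assert(utf8 != NULL);
    /* a reference never needs more than 4 output bytes per input byte */
    cap = 4 * strlen(utf8) + 1;
    out = malloc(cap);
    if (out == NULL)
        return NULL;

    while (*p != '\0') {
        unsigned long cp;
        size_t n = utf8_decode(p, &cp);

        if (n == 0) { free(out); return NULL; }
        if (cp < 0x80)
            out[len++] = (char) cp;
        else
            len += (size_t) snprintf(out + len, cap - len, "&#%lu;", cp);
        p += n;
    }
    out[len] = '\0';
    return out;
}
```

For example, ascii_with_charrefs("M\xC3\xBCller") yields "M&#252;ller", since U+00FC is 252 in decimal.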

I also note that trying it out with a 10 line C program would have taken
less time than for me to write this answer... oh well ...

Furthermore, two more things should have become obvious from my first mail:

- I *did* spend quite some time on the issue (more precisely, two days.
Otherwise, I couldn't have come up with such a list of items in my first
mail).
- I reached a point where I got stuck and really didn't know how to continue


  I got annoyed first by the 2 mails sent privately, and then by you bouncing
them as-is, without even making the effort to format them correctly, back to
the mailing list. I may jump the gun too easily, but I'm also
sensitive to people being nice to the whole list.

   2/ I cannot answer questions all the time


Right. But you should be able and willing to distinguish between

- basic questions that *really* can be found in the already existing docs
and those that cannot and
- between library users and library developers. Not everybody who wants to
*use* libxml2 in his own apps should need/have to know all the ins and outs of
libxml2 in order to be able to use it.


  Well, as soon as you try to do things that are not the XML way, it
will take work.

In case you feel that you tend to answer a question many times, the FAQ on
the web site might be the appropriate place for putting the answer there.


  I made the search engine because a FAQ is far too limited.

Concerning the "ascii" stuff: I'm aware that I can pass a specific encoding
to xmlSaveFileTo(), but it's not obvious to me why I should pass an encoding
if I already used the xmlFindCharEncodingHandler() mechanism. Sorry in case
I'm mixing things up here.


Either you use high-level interfaces where an encoding is basically designated
by a string, like xmlSaveFileTo(), or you directly use the encoding conversion
routines operating on memory strings. It makes no sense to you because you are
trying to mix the two levels.

Daniel

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

