Re: [xml] French character encoding problem

On Thu, Sep 15, 2005 at 05:24:29PM -0400, Fred Fung wrote:
The byte sequence for "Ç" that would appear in an xml or html page
is "Ç" as I stated in my first email.

  Yes, and that's why I said in my previous mail that the internal
representation would not depend on the encoding information. I *did* read
and remember your initial message when answering.

I understand that all strings are internally encoded as UTF-8. But what
I want to achieve is that, once I retrieve the UTF-8 encoded string into
a C variable, how can I convert the UTF-8 encoded sequence "#C3#87" back
to the corresponding "Ç" character so that other parts of my application
can use this character instead of the UTF-8 sequence?

  It is not the question you asked in the first mail.
You can use UTF8Toisolat1() which is defined in <libxml/encoding.h>
or the iconv library which is part of the POSIX subsystem.
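
  As a rough illustration of the iconv route, here is a minimal sketch.
The helper name utf8_to_latin1() is made up for this example and is not
part of libxml2 or iconv; only iconv_open(), iconv() and iconv_close() are
real POSIX calls. It assumes the platform's iconv knows the names "UTF-8"
and "ISO-8859-1", and it treats any conversion error as a plain failure.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a libxml2 or iconv function): convert inlen
 * bytes of UTF-8 to ISO-8859-1 using POSIX iconv().
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 on error, e.g. invalid input or an output buffer that is too small.
 */
static long utf8_to_latin1(const char *in, size_t inlen,
                           char *out, size_t outsize)
{
    assert(in != NULL && out != NULL);
    if (outsize == 0)
        return -1;

    /* iconv_open() takes the target encoding first, then the source. */
    iconv_t cd = iconv_open("ISO-8859-1", "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)in;        /* iconv() wants char **, input is only read */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outsize - 1;  /* keep one byte for the terminating NUL */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *outp = '\0';
    return (long)(outp - out);
}
```

Called on the two bytes 0xC3 0x87, this should leave the single byte 0xC7
in the output buffer, which is "Ç" in ISO-8859-1.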

As I said in my original email, I ran xmllint on the xml file and
it was able to output "Ç" properly on my screen, NOT the UTF-8
encoded string.

  When an encoding is provided as part of the document, the serialization
routines try to convert back to that encoding. They will use
UTF8Toisolat1() internally unless the iconv() system or converters
provided by the application override the default provided by libxml2.
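
  To make concrete what a UTF-8 to ISO-8859-1 converter has to do with the
bytes, here is a small hand-written sketch. It is NOT the libxml2
implementation and the function name is invented for this example; it only
shows the bit manipulation, and it simply rejects anything that does not fit
in Latin-1 (code points above U+00FF) or that is malformed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative sketch only, not libxml2 code.
 * Converts inlen bytes of UTF-8 to ISO-8859-1.
 * Returns the number of bytes written, or -1 on error.
 */
static int sketch_utf8_to_latin1(const unsigned char *in, size_t inlen,
                                 unsigned char *out, size_t outsize)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);

    while (i < inlen) {
        unsigned char c = in[i];

        if (o >= outsize)
            return -1;                 /* output buffer full */

        if (c < 0x80) {
            out[o++] = c;              /* ASCII: same byte in both encodings */
            i += 1;
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < inlen &&
                   (in[i + 1] & 0xC0) == 0x80) {
            /* 2-byte sequence 110000xx 10yyyyyy encodes U+0080..U+00FF */
            out[o++] = (unsigned char)(((c & 0x03) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        } else {
            return -1;                 /* outside Latin-1, or malformed input */
        }
    }
    return (int)o;
}
```

For the input bytes 0xC3 0x87 the second branch computes
((0xC3 & 0x03) << 6) | (0x87 & 0x3F) = 0xC0 | 0x07 = 0xC7.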

So there must be something that I should be calling to do the conversion.

  The page I pointed to states:

  "libxml2 has a set of default converters for the following encodings (located in encoding.c):

     1. UTF-8 is supported by default (null handlers)
     2. UTF-16, both little and big endian
     3. ISO-Latin-1 (ISO-8859-1) covering most western languages
     4. ASCII, useful mostly for saving
     5. HTML, a specific handler for the conversion of UTF-8 to ASCII
        with HTML predefined entities like &copy; for the Copyright sign.

   Moreover, when compiled on a Unix platform with iconv support, the full
   set of encodings supported by iconv can instantly be used by libxml"

  I can't really list all encodings one by one and point to the associated
converter. But I do point to the module holding them and to the iconv
system routine which can be used too.

Please, if you are not able to help, just say so, or just don't bother to reply.

  I can help if I get a coherent question. To me your initial set of
questions was not coherent, so I pointed to the documentation. Your
second mail was no clearer about what you wanted to do: sometimes
you took examples from the input, sometimes from the values
returned by the API, sometimes from the form reserialized after parsing.
Reread your mails, they mixed all 3, plus they mixed encoding, code point,
and representation issues:

 "it was able to output "Ç" properly on my screen"

  really means to me that you still don't understand the problem.
The output on your screen is a *representation* of the sequence of bytes
emitted. If it had been serialized in UTF-8, that character would have been
represented by 2 bytes, yet you would still see a single character glyph on
screen if your locale was fr_FR.UTF-8 or any other locale using a UTF-8
encoding.
  There are 3 layers: the byte sequence, the character sequence based
on Unicode code points, and the representation as a sequence of glyphs.
Mixing the 3 is a common problem even for a "fellow competent programmer",
and Joel Spolsky is right on when he says that it's a very serious problem.
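
  The first two layers can be told apart in a few lines of C. This is only
a sketch with made-up function names, and it assumes the input is already
valid UTF-8. The third layer, glyphs, is the job of the terminal or the
rendering library, not of this code, and the number of glyphs can differ
again from the number of code points (combining characters, for instance).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Layer 1 is the byte count (len). This function returns layer 2: the
 * number of Unicode code points, counted as every byte that is not a
 * UTF-8 continuation byte (10xxxxxx). Assumes valid UTF-8 input.
 */
static size_t utf8_count_code_points(const unsigned char *s, size_t len)
{
    size_t n = 0;

    assert(s != NULL);
    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Code point carried by a 2-byte UTF-8 sequence 110xxxxx 10yyyyyy. */
static unsigned long utf8_decode2(unsigned char b0, unsigned char b1)
{
    return ((unsigned long)(b0 & 0x1F) << 6) | (unsigned long)(b1 & 0x3F);
}
```

For "Ç" stored as the bytes 0xC3 0x87, the byte length is 2, the code point
count is 1, and utf8_decode2(0xC3, 0x87) gives 0xC7, i.e. U+00C7.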


Daniel Veillard      | Red Hat Desktop team
veillard redhat com  | libxml GNOME XML XSLT toolkit | Rpmfind RPM search engine
