RE: [xml] French character encoding problem


As a fellow competent programmer, I exhausted every combination I could think of before coming 
here for help. I have also read and re-read the documentation page several times without getting 
anywhere. I came to this list for help and suggestions, not to be 
called stupid.

Also, obviously my problem was not properly read and understood before it was answered.

The byte sequence for "Ç" that would appear in an xml or html page is "&#199;" as I stated in my first email.

I understand that all strings are internally encoded as UTF-8. But what I want to achieve is this: once I 
retrieve the UTF-8 encoded string into a C variable, how can I convert the UTF-8 encoded sequence "#C3#87" 
back to the corresponding "Ç" character so that other parts of my application can use this character instead 
of the UTF-8 sequence?
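
For illustration, here is a minimal sketch of the conversion being asked about, assuming the target encoding is ISO-8859-1 (Latin-1) and the input is a NUL-terminated UTF-8 string such as the one xmlNodeGetContent() returns. The function name utf8_to_latin1 is hypothetical and not part of libxml2; as far as I know, libxml2 itself offers UTF8Toisolat1() in <libxml/encoding.h> for the same job.

```c
#include <stddef.h>

/*
 * Convert a NUL-terminated UTF-8 string to ISO-8859-1 (Latin-1).
 * Hypothetical helper for illustration only; not a libxml2 function.
 * Returns the number of bytes written (excluding the NUL), or -1 if the
 * input is malformed, contains a code point above U+00FF, or does not fit.
 */
int utf8_to_latin1(const unsigned char *in, unsigned char *out, size_t outsize)
{
    size_t n = 0;

    if (outsize == 0)
        return -1;

    while (*in != 0) {
        unsigned int cp;

        if (*in < 0x80) {
            /* ASCII: one byte, same value in both encodings. */
            cp = *in++;
        } else if ((*in & 0xFE) == 0xC2 && (in[1] & 0xC0) == 0x80) {
            /* Lead byte 0xC2 or 0xC3 plus one continuation byte:
             * code points U+0080..U+00FF, the upper half of Latin-1. */
            cp = ((unsigned int)(*in & 0x1F) << 6) | (in[1] & 0x3F);
            in += 2;
        } else {
            /* Malformed input, or a character Latin-1 cannot represent. */
            return -1;
        }

        if (n + 1 >= outsize)
            return -1;              /* keep room for the terminating NUL */
        out[n++] = (unsigned char)cp;
    }

    out[n] = 0;
    return (int)n;
}
```

Called on the two bytes 0xC3 0x87, this stores the single byte 0xC7 in the output buffer, which is "Ç" in ISO-8859-1. Characters outside Latin-1 are rejected rather than replaced; that is a design choice of this sketch, not a requirement.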

As I said in my original email, I ran xmllint on the xml file and it was able to output "Ç" properly on my 
screen, NOT the UTF-8 encoded string. So there must be something that I should be calling to do the 

Please, if you are not able to help, just say so, or just don't bother to reply.



-----Original Message-----
From: Daniel Veillard [mailto:veillard redhat com] 
Sent: Thursday, September 15, 2005 4:37 PM
To: Fred Fung
Cc: xml gnome org
Subject: Re: [xml] French character encoding problem

On Thu, Sep 15, 2005 at 12:51:40PM -0400, Fred Fung wrote:

Thanks for the prompt reply.

I already tried "ISO-8859-1" (and just tried again after reading your reply) and I still get the same 

  yes that's normal. You could use any encoding you will get the same.

Already read the encoding.html page a few times. According to this 
page, does that mean that by specifying encoding to be ISO-8859-1, one 
can put "Ç" in the xml file?

  What is "Ã" ? What byte sequence ? Corresponding to what unicode code point(s) ?

What if they choose to generate &#199; instead of the character?
I actually just tried putting "Ç" in the xml file with encoding ISO-8859-1.
xmlNodeGetContent() still returns "Ãî" instead.

  It returned the 2 bytes corresponding to that code point in the UTF-8 encoding. The fact that all strings 
are encoded in UTF-8 internally is written on that page. 
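
To make the "2 bytes" concrete, the following is a small sketch of UTF-8 encoding for a single code point. The function name utf8_encode is hypothetical and not part of libxml2; it only illustrates the bit layout that turns U+00C7 into the bytes 0xC3 0x87.

```c
#include <stddef.h>

/*
 * Encode one Unicode code point as UTF-8.
 * Hypothetical helper for illustration only; not a libxml2 function.
 * Writes up to 4 bytes into out and returns the byte count, or 0 if the
 * code point is a surrogate or lies above U+10FFFF.
 */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                           /* surrogates are not characters */
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx and three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For the code point 0xC7 the second branch applies: 0xC0 | (0xC7 >> 6) gives 0xC3 and 0x80 | (0xC7 & 0x3F) gives 0x87, the same two bytes seen in the returned string.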

Also, if xmllint is able to return the proper character, what am I 
missing that's causing xmlNodeGetContent() not ?

  That all internal representations are kept in UTF-8.
It is clear you did not understand that page. Make sure you understand it.

  "One of the core decisions was to force all documents to be converted to
   a default internal encoding, and that encoding to be UTF-8"

 There are a few pointers at the beginning of that page explaining more about encodings, code points and 
Unicode and how they relate. As long as you are not familiar with those, you will continue to have trouble, 
I'm afraid.


Daniel Veillard      | Red Hat Desktop team
veillard redhat com  | libxml GNOME XML XSLT toolkit | Rpmfind RPM search engine
