Re: [gtkmm] Problem with std::string -> Glib::ustring conversion



On Sun, 14 Sep 2003, Murray Cumming wrote:

> Brian Gartner <cadrach-gtkmm wcug wwu edu> wrote: 
> > I'm writing an application that, during initialization, must read strings
> > from a binary data file which includes non-ascii characters such as the
> > Unicode character U+00C6 (the combined "AE" character). 
> 
> Unicode has several possible encodings. I'm not an expert, so hopefully
> someone will correct me if I'm wrong:
> 
> I believe that the UTF-8 encoding does not have null bytes in its middle, so
> that it can be compatible with ASCII string handling routines. So, I guess
> that this is some other encoding of Unicode such as UCS-2. You might use the
> iconv library to convert between encodings. 
> 
> 
> Murray Cumming

Hmm... I'm afraid I must have miscommunicated. I'm reading these
strings in a raw fashion from a binary file (I know how many
characters/bytes long each one will be) and dumping them into std::strings.
Each character in the string is, of course, one byte, and bytes with
values <= 127 correspond to the ASCII characters. The file, however,
contains strings with byte values > 127, for example the combined "AE"
character (Æ) that, were it represented in Unicode (which it isn't),
would have the code point U+00C6. In the file's encoding, that character
is simply one unsigned byte with the value 198 (C6 in hex). I believe this
encoding is called ISO 8859-1 (Latin-1); its 256 code points coincide with
the first 256 code points of Unicode, and it is the default character set
on many Linux systems. If I'm not mistaken, it is also very close to what
English-language versions of Windows default to for 1-byte-wide characters
(Windows-1252, which differs from it only in the 0x80-0x9F range); DOS, on
the other hand, used its own code pages such as 437.

My questions, then, are:

1) Is it possible to simply specify that a string must be converted from
e.g. ISO 8859-1, including the characters with values > 127, into UTF-8? I
thought that this was what locale_to_utf8 would do, but if so I'm doing
something wrong, as my program crashed when I called it on strings
containing characters with values > 127.
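(If I understand the docs, locale_to_utf8() converts from whatever the *current locale's* encoding happens to be, so it would only do what I want when the locale is ISO 8859-1, and the glibmm convert functions throw Glib::ConvertError on failure, which would look like a crash if the exception goes uncaught. Following Murray's iconv suggestion, an explicit conversion can be sketched like this, assuming POSIX iconv and with error handling kept minimal:)

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>

// Convert an ISO 8859-1 encoded string to UTF-8 with POSIX iconv.
// A Latin-1 byte expands to at most two UTF-8 bytes, so an output
// buffer of 2 * in.size() is always large enough.
std::string iso8859_1_to_utf8(const std::string& in)
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == reinterpret_cast<iconv_t>(-1))
        throw std::runtime_error("iconv_open failed");

    std::string out(in.size() * 2 + 1, '\0');
    char* inbuf = const_cast<char*>(in.data());
    std::size_t inleft = in.size();
    char* outbuf = &out[0];
    std::size_t outleft = out.size();

    const std::size_t rc = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1))
        throw std::runtime_error(std::strerror(errno));

    out.resize(out.size() - outleft); // trim the unused tail
    return out;
}
```

The result can then be handed to a Glib::ustring, since it is valid UTF-8.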

2) If (1) is not possible, is there any way to accomplish this other than
inserting, for each ISO 8859-1 character I read, the corresponding UTF-8
character into a UTF-8 string by hand? E.g., whenever I find a byte with
the value 198 in that file, I insert a UTF-8 character foo_char that I
hard-coded into the program to correspond to the value 198. I'd really
like to avoid this "solution" if possible.
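(For the record, since the ISO 8859-1 code points are identical to the first 256 Unicode code points, the manual route turns out not to need a hard-coded table at all: bytes below 0x80 are already valid UTF-8, and every other byte expands to exactly two bytes by a fixed bit pattern. A minimal sketch, with a name of my own choosing:)

```cpp
#include <string>

// ISO 8859-1 code points equal Unicode code points U+0000..U+00FF, so no
// lookup table is needed: bytes < 0x80 pass through unchanged, and every
// byte >= 0x80 expands to a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size() * 2);
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        if (b < 0x80) {
            out += static_cast<char>(b);
        } else {
            out += static_cast<char>(0xC0 | (b >> 6));   // lead byte (0xC2 or 0xC3)
            out += static_cast<char>(0x80 | (b & 0x3F)); // continuation byte, 10xxxxxx
        }
    }
    return out;
}
```

So byte 198 (0xC6) comes out as the two bytes 0xC3 0x86, which is U+00C6 in UTF-8.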

Again, I'm sorry to be asking these questions if the answers are obvious
to others, but I've been throwing myself against this and don't really
know which way to point myself. If this is a limitation of
locale_to_utf8/etc and it's something that people would like fixed, I'll
work on it if I'm shown the right direction to look. If it's just my
foolishness, well, then I'm sorry to have wasted people's time, but I'd
still very much appreciate being told the simple way of solving this
issue.

Thank you,

Brian Gartner
