Re: UCS-2 in gunicode.h
- From: Derek Simkowiak <dereks kd-dev com>
- To: Havoc Pennington <hp redhat com>
- Cc: gtk-devel-list gnome org
- Subject: Re: UCS-2 in gunicode.h
- Date: Fri, 7 Jul 2000 17:44:23 -0700 (PDT)
> I think we're just going to add a function like this to glib:
>
> gchar*
> g_convert (const gchar *str,
>            gint         len,
>            const gchar *to_codeset,
>            const gchar *from_codeset,
>            gint        *bytes_converted)
So a g_ wrapper around iconv, then. Will all the g_utf8_*()
functions currently in gunicode.h disappear, then?
To answer my original question: libiconv stores everything
internally as a wide character (wchar_t). Then, when returning converted
strings, it writes each character out in the width the target encoding
calls for (8-bit, 16-bit, or 32-bit).
For UCS-2, here is the function that converts the internal wchar_t
UCS-2 string into the 16-bit output string:
[ From ucs2.h in libiconv: ]
static int
ucs2_wctomb (conv_t conv, unsigned char *r, wchar_t wc, int n)
{
  if (wc < 0x10000 && wc != 0xfffe) {
    if (n >= 2) {
      r[0] = (unsigned char) (wc >> 8);
      r[1] = (unsigned char) wc;
      return 2;
    } else
      return RET_TOOSMALL;
  } else
    return RET_ILSEQ;
}
This looks to me like any 32-bit Unicode character (that is, one
at 0x10000 or above, which does not exist in the UCS-2 space) will result
in a "RET_ILSEQ" return value. The value 0xFFFE is rejected the same way.
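A minimal standalone sketch of those checks, for anyone who wants to poke at the behaviour without building libiconv. The names sketch_ucs2_encode, SKETCH_ILSEQ and SKETCH_TOOSMALL are made up for this example, and the numeric return codes are placeholders, not libiconv's real values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for libiconv's private return codes; the real
   values are internal to libiconv and are not reproduced here. */
enum { SKETCH_ILSEQ = -1, SKETCH_TOOSMALL = -2 };

/* Encode one code point as big-endian UCS-2 into out[0..1].
   Same order of checks as the quoted ucs2_wctomb: first reject values
   at or above 0x10000 and the value 0xFFFE, then check the space left. */
static int sketch_ucs2_encode(uint32_t cp, unsigned char *out, int room)
{
    if (cp >= 0x10000 || cp == 0xFFFE)
        return SKETCH_ILSEQ;
    if (room < 2)
        return SKETCH_TOOSMALL;
    out[0] = (unsigned char)(cp >> 8);    /* high byte first */
    out[1] = (unsigned char)(cp & 0xFF);  /* then low byte */
    return 2;
}

/* Small self-check; call it from main() to exercise each branch. */
static void sketch_ucs2_selfcheck(void)
{
    unsigned char buf[2] = {0, 0};

    assert(sketch_ucs2_encode(0x20AC, buf, 2) == 2);   /* EURO SIGN */
    assert(buf[0] == 0x20 && buf[1] == 0xAC);
    assert(sketch_ucs2_encode(0x10000, buf, 2) == SKETCH_ILSEQ);
    assert(sketch_ucs2_encode(0xFFFE, buf, 2) == SKETCH_ILSEQ);
    assert(sketch_ucs2_encode(0x0041, buf, 1) == SKETCH_TOOSMALL);
}
```

The sketch takes a uint32_t rather than wchar_t so the 0x10000 comparison means the same thing on platforms where wchar_t is only 16 bits wide.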
The function iconv() uses this return value to note that the
conversion has failed. It will then try several fallbacks for the
conversion of the character: First, a U+303E-prefixed variant, then
transliteration, and finally it gives up and converts the entire character
into "Undefined", Unicode char FFFD.
So, in summary, if you tried to convert a UTF-8 string into a
UCS-2 string, and that UTF-8 string had the multi-byte encoding of a
32-bit Unicode character, the conversion would succeed but the 32-bit
character would be replaced with the UCS-2 encoding of the "Undefined"
character. All in all, a very graceful solution if you ask me.
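To make the end result concrete, here is a rough sketch of just the last step of that fallback chain: anything UCS-2 cannot hold comes out as FFFD. It is not libiconv's code and skips the intermediate fallbacks entirely; the function name sketch_to_ucs2_lossy and its signature are invented for illustration, and the input is assumed to be already decoded into 32-bit code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert n code points to big-endian UCS-2, substituting U+FFFD for any
   value UCS-2 cannot represent. Returns the number of bytes written, or 0
   if the output buffer is too small (also 0 when n is 0). */
static size_t sketch_to_ucs2_lossy(const uint32_t *in, size_t n,
                                   unsigned char *out, size_t out_size)
{
    size_t i;

    if (out_size / 2 < n)
        return 0;

    for (i = 0; i < n; i++) {
        uint32_t cp = in[i];

        if (cp >= 0x10000 || cp == 0xFFFE)
            cp = 0xFFFD;                       /* REPLACEMENT CHARACTER */

        out[2 * i]     = (unsigned char)(cp >> 8);
        out[2 * i + 1] = (unsigned char)(cp & 0xFF);
    }
    return 2 * n;
}

/* Small self-check; call it from main(). */
static void sketch_lossy_selfcheck(void)
{
    const uint32_t in[3] = { 0x0041, 0x10000, 0x20AC };
    unsigned char out[6];

    assert(sketch_to_ucs2_lossy(in, 3, out, sizeof out) == 6);
    assert(out[2] == 0xFF && out[3] == 0xFD);  /* 0x10000 was replaced */
}
```

The conversion never fails on an unrepresentable character here; the only failure case is an output buffer that is too small.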
(It would be cool if Pango could draw a cute little "Don't Panic"
icon for FFFD :) )
--Derek
P.S.> I found the iconv code somewhat hard to follow, with lots of tall
nested blocks, multiple gotos, #defines of return values that then go
unused in the error-checking switch statements (i.e. magic numbers
instead), and variable names like "ap", "bp", and "cp". Not at all
like the Glib code.