Re: Just a few UTF8 questions...



On Tue, 8 Jul 2003 martyn 2 russell bt com wrote:

> If I use str = g_strdup_printf("mystring"); does this mean str is a valid
> UTF8 string?

In this particular case, yes, because "mystring" happens to be valid
US-ASCII (7-bit ASCII), ISO 8859-1 (ISO Latin 1), and UTF8 all at once.

If you had written:

  str = g_strdup_printf("Rødgrød med fløde"); /* phrase used by Danes
                                                 to tease foreigners */

it wouldn't be.

g_strdup_printf() doesn't translate the characters; it only scans for %'s
inside the first parameter (the format string) and inserts the other
parameters (converted to strings if necessary) into the result string.

> If that is the case, how is it that when I have two different strings (one
> in German and the other in Russian), one is valid UTF8 and the other is not?
> (as it happens, g_utf8_validate returns FALSE for the de_string).
>
>       ru_string = "???? ???????";
>       de_string = "Schöne Gzrüße";

de_string is not valid UTF8.  It is, however, perfectly valid ISO 8859-1
(ISO Latin 1) and ISO 8859-15 (ISO Latin 9).  The latter is a small
modification of 8859-1 to incorporate the Euro sign.

US ASCII is a map between the numbers 32..126 and certain characters.  It
also defines 0..31 + 127 as specific control characters.  It could be
represented perfectly fine in 7 bits, but the typical encoding uses
8 bits (with the MSB = 0).

ISO 8859-n are also maps, where the first 128 numbers map to exactly the
same characters as in US ASCII.  The remaining 128 numbers are used to map
to other characters, including those funny ones needed to write languages
that don't happen to be Swahili, Hawaiian -- or English.  Because more
special characters are needed than just 128 (and 32 of those are reserved
for a "mirror image" of the 32 control characters in US ASCII), more than
just one character map is necessary.  Each of these encodings maps one
character to each byte.

Western European languages use Latin characters, sometimes with funny
diacritics (like é, è, ä, ø, å, etc...), sometimes with some completely
foreign characters (the Icelandic thorn, for example) or ligatures, such
as the French ligature for oe (as in coeur = heart) or the Danish ligature
æ which is so old that it has become an ordinary letter today instead of a
ligature.

Other parts of Europe use different scripts entirely (Greek has its own
alphabet, Russian uses Cyrillic), so they need their own character maps.

The Slavic languages written in Latin script need an exceptional number
of diacritics, so they can't use Latin-1/9 like the Germanic and Romance
languages.  I think Latin-2 covers most of those.

Anyway, this is a mess.  And it isn't even complete: it doesn't cover
Arabic, Hebrew, Japanese, Indian scripts, etc.

The solution for that is a new character mapping (Unicode) with tens of
thousands of characters (*).  The first 256 numbers are defined to have
exactly the same mapping as ISO 8859-1 (I wonder what they are going to
do now that the EU has switched to 8859-15?).

*) actually about a million but very few programs support that correctly.

This can be encoded in many ways.  One of them would be to use 16 bits per
character and ignore the "upper" characters.  I mean, who cares, if you
only ever use English anyway?  That's the path Java and Windows have taken.
Another one would be to always use 32 bits.  That's kind of wasteful.

And then there are the variable length encodings.  A typical one would be
to use 16-bits most of the time but sometimes use two 16-bit words after
each other to get to the upper characters.  Another one would be to use
8-bit bytes most of the time (for US-ASCII) and more bytes for characters
with higher numbers.  That's what UTF8 does.

So if you stick to plain English, your ASCII strings will automatically be
valid UTF8, too.

Another good thing about UTF8 is that you can treat UTF8 strings as normal
C strings when you are copying them or counting their lengths (in bytes):
a 0 byte will always mean the end of the string.  It will not occur as
part of a funny character.

(I'm glossing over a lot of details here, mostly because I can't remember
them)

So you should get your de_string fixed...  There is a conversion function
(which I can't remember the name of) which will convert it to UTF8 for
you.

> If I use g_utf8_strlen(ru_string, -1), the length returned is 35 (strlen
> returns the same value).  According to the documentation this is supposed to
> return the length in characters.  Shouldn't it therefore return 11?

dunno.

> Also, if I read in from a socket to a gchar buffer[1024] and I then proceed
> to print that information in the form
>
>       g_message("socket input: %*s", bytes, buffer);
>
> Does the * represent how many characters or bytes are printed from the
> buffer?

dunno.

-Peter

Give a man a fish, and you'll feed him for a day;
Give him a religion, and he'll starve to death while praying for a fish
