Re: UTF-8 with GTK



Raymond Wan wrote:
> 
> Hi all,
> 
>         I was playing around with testtext, which comes with GTK+ 1.3.X
> and noticed that it can support non-Latin languages (e.g., Japanese) as
> long as it is encoded in UTF-8.  I also looked at the manpages of UTF-8
> and it said that it is "the way to go for using the Unicode character set
> under Unix-style operating systems."
> 
>         [Apologies in advance, but my knowledge of Unicode is somewhat
> limited...].  I was wondering why is this so?  Sure, I read the rest of
> the man pages and it mentioned some of the benefits...however, from the
> Japanese point of view (or Chinese, Korean, whatever), where every text
> file is most likely in an Asian language, isn't it a waste of space that
> some characters will take up 2, 3, or even more bytes (where 1 byte = 8
> bits).  If I used an encoding such as Shift-JIS for Japanese or ummmm,
> BIG5, I think, for Chinese, won't most characters be two bytes in size?
> 
>         I guess what I'm asking is that isn't UTF-8 more for accommodating
> Latin-based OS' to read Asian (and Middle Eastern) languages and not for
> Asian OS' to read Asian languages?  Presuming I'm right so far, is there a
> way to make GTK support alternative encodings like UTF-16 or S-JIS, BIG5,
> etc.?
> 
>         [Sorry, I realize there is a lot that I've said which indicates
> I'm a newbie...which I am.  I guess what I was really getting to is this
> last question about GTK support for other encodings.]
> 
>         Thank you!

In the days of Unicode 2.X I would have largely agreed, but things have
changed. Big5, GB2312 and other common Chinese encodings are quite
limiting. For example, many of the place names here in Hong Kong cannot
be displayed in these character sets. CCCII is a very large,
comprehensive Chinese character set at 3 or 4 bytes per character, but it
isn't widely supported. Unicode 3.1 basically dumps the whole of CCCII
(or its Library of Congress equivalent in the US) into the Unicode
table. This makes Unicode essentially all things to all Chinese readers.
Unicode 3.1 isn't yet widely supported, but I expect that to change in
the next year or two. UTF-8 averages 3 bytes per character for
Chinese. Owen said there is usually some ASCII in a Chinese file. "Some"
actually means 1-2% for most files, so that doesn't bring the average
down much. On the other hand, the huge number of characters added in
Unicode 3.1, which take 4 bytes each in UTF-8, doesn't really push the
average up: they are rare (but often important) characters.
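
For anyone who wants to check these per-character sizes, here is a
minimal Python sketch. It is only an illustration: the sample strings
and the bytes_per_char helper are my own choices, nothing to do with
GTK+, and it assumes the big5 and gb2312 codecs that ship with Python.

```python
def bytes_per_char(text, codec):
    """Average encoded bytes per code point, or None if codec cannot encode text."""
    try:
        return len(text.encode(codec)) / len(text)
    except UnicodeEncodeError:
        return None

# Assumed samples: two everyday ideographs, and one CJK Extension B
# ideograph (U+20000) of the kind added in Unicode 3.1.
common = "中文"
rare = "\U00020000"

for codec in ("big5", "gb2312", "utf-8", "utf-16-le"):
    print(codec, bytes_per_char(common, codec), bytes_per_char(rare, codec))
```

The common ideographs come out at 2 bytes each in Big5, GB2312 and
UTF-16, and 3 bytes each in UTF-8. The Extension B character cannot be
encoded in Big5 or GB2312 at all, and takes 4 bytes in both UTF-8 and
UTF-16.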

What is the bottom line of all this for a Chinese user?

Big5 and GB2312 are fairly compact, but limiting.

CCCII is comprehensive, but takes 3 or 4 bytes per character and is
poorly supported.

UTF-8 for Unicode 3.1 is comprehensive, and support will surely come. It
takes about 3 bytes per character on average, which is a reasonable
compromise for comprehensive multi-lingual support.

Most of the world's recent documents are stored in Word files, which use
UTF-16 for English. If English readers aren't bitching about their text
doubling in size, why should East Asian readers be too upset about a 50%
increase?
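
The two growth figures can be reproduced with a rough Python sketch.
The growth helper and both sample strings are assumptions of mine, and
the baselines (ASCII for English, Big5 for Chinese) are chosen only to
match the comparison above.

```python
def growth(text, old_codec, new_codec):
    """Fractional size increase when text moves from old_codec to new_codec."""
    old_size = len(text.encode(old_codec))
    new_size = len(text.encode(new_codec))
    return new_size / old_size - 1

# Assumed sample strings, picked purely for illustration.
english = "Hello, world"
chinese = "中文"

print(growth(english, "ascii", "utf-16-le"))  # English: ASCII to UTF-16
print(growth(chinese, "big5", "utf-8"))       # Chinese: Big5 to UTF-8
```

For these samples the English text grows by 100% and the Chinese text
by 50%.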

Regards,
Steve



