Re: UTF-8 with GTK
- From: Raymond Wan <rwan cs mu oz au>
- To: Steve Underwood <steveu coppice org>
- Cc: gtk-i18n-list gnome org
- Subject: Re: UTF-8 with GTK
- Date: Fri, 22 Jun 2001 00:37:27 +1000 (EST)
Hi,
On Thu, 21 Jun 2001, Steve Underwood wrote:
...
> Chinese. Owen said there is usually some ASCII in a Chinese file. Some
> actually means 1-2% for most files, so that doesn't get the average
> down. On the other hand, the huge increase in characters in Unicode 3.1,
> which take more than 3 bytes, doesn't really push the average up - they
> are rare (but often important) characters.
I can't read Chinese myself, but I can imagine a Chinese document
with some ASCII...perhaps some HTML tags and some numbers...but I would
think most characters would be Chinese characters, all of which will
increase in size.
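To put rough numbers on that, here is a minimal Python sketch that encodes one short, made-up HTML-like line in Big5, UTF-8, and UTF-16. The sample string and the helper name `encoded_sizes` are my own assumptions for illustration, not anything from the thread, and a single line says nothing about averages over real documents.

```python
# Sketch: size of the same mixed Chinese/ASCII text in three encodings.
# SAMPLE and encoded_sizes are hypothetical, chosen only for illustration.

# 12 ASCII characters (tags, a space, digits) plus 4 Chinese characters.
SAMPLE = "<p>\u4e2d\u6587\u6587\u4ef6 2001</p>"

def encoded_sizes(text):
    """Return the encoded length in bytes of text for each encoding."""
    # utf-16-le is used so that no byte-order mark is counted.
    return {name: len(text.encode(name))
            for name in ("big5", "utf-8", "utf-16-le")}

if __name__ == "__main__":
    for name, size in encoded_sizes(SAMPLE).items():
        print(f"{name}: {size} bytes")
```

On this sample the ASCII part costs one byte per character in both Big5 and UTF-8, while each Chinese character goes from two bytes to three; UTF-16 instead doubles the ASCII part.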
...
> UTF8 for Unicode 3.1 is comprehensive, and support will surely come. It
> takes about 3 bytes per character on average, which is a reasonable
> compromise for comprehensive multi-lingual support.
However, someone else on the mailing list also managed to convince
me, via private e-mail, that the extra disk space is worth it, given
cheaper disks and the ability to unify the world's languages in a
single system.
I guess my original concern is how one would "sell" the idea of
using UTF-8 as a standard. I've always thought that UTF-16 would be the
standard and that UTF-8 was just a bridge from current systems to a
unified UTF-16. And if so [yes, this is related to GTK+, still :) ],
why doesn't GTK support UTF-16? I guess the answer is that no one expects
East Asian users to give up on Shift-JIS and Big5 "overnight"...maybe in
the next year or two, as you've put it...maybe even longer.
> Most of the world's recent documents are stored in Word files, which use
> UTF16 for English. If English readers aren't bitching about their text
> doubling in size, why should East Asian readers be too upset about a 50%
> increase?
Actually, I thought most documents nowadays were HTML/XHTML
documents on the Web. But yes, I do get your main point. I guess that to
unify languages, someone has to lose out: either Latin-based languages
double in size so that everyone uses UTF-16, or East Asian languages (and
others that use two-byte encodings) lose out by having their characters
grow from 2 bytes to 3 in UTF-8.
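The per-character costs behind that trade-off can be checked with a short Python sketch. The three sample characters and the helper name `byte_cost` are assumptions picked for illustration; U+20000 stands in for the characters added in Unicode 3.1 that lie outside the Basic Multilingual Plane.

```python
# Sketch: bytes needed per character in UTF-8 and UTF-16.
# SAMPLES and byte_cost are hypothetical, chosen only for illustration.

SAMPLES = {
    "ASCII letter": "A",
    "common Chinese character": "\u4e2d",   # inside the Basic Multilingual Plane
    "Unicode 3.1 addition": "\U00020000",   # CJK Extension B, outside the BMP
}

def byte_cost(ch):
    """Return (utf8_bytes, utf16_bytes) for a single character."""
    # utf-16-le avoids counting a byte-order mark.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        utf8, utf16 = byte_cost(ch)
        print(f"{label}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

So relative to a one-byte legacy encoding, ASCII text doubles under UTF-16, while a two-byte Chinese character grows by half under UTF-8; the characters outside the BMP cost four bytes either way.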
Thank you to all (especially Owen for his original reply) for the
replies; they have been quite helpful!
Ray