Re: UTF-8 with GTK


On Thu, 21 Jun 2001, Steve Underwood wrote:
> Chinese. Owen said there is usually some ASCII in a Chinese file. Some
> actually means 1-2% for most files, so that doesn't get the average
> down. On the other hand, the huge increase in characters in Unicode 3.1,
> which take more than 3 bytes, doesn't really push the average up - they
> are rare (but often important) characters.

	I can't read Chinese myself, but I can imagine a Chinese document
with some ASCII...perhaps some HTML tags and some numbers...but I would
think most characters would be Chinese characters, all of which will
increase in size.
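
	To put rough numbers on that, here is a minimal Python sketch.  The
sample string and the helper name encoded_sizes are my own illustration
(nothing here comes from GTK+), and one short string obviously says nothing
about the average over real files:

```python
# Hypothetical sample: a few Chinese characters plus some ASCII markup
# and digits, roughly the kind of mixed document described above.
sample = "<p>你好世界 2001</p>"


def encoded_sizes(text):
    """Return the encoded length in bytes of `text` for each encoding."""
    encodings = ("big5", "utf-8", "utf-16-le")
    return {name: len(text.encode(name)) for name in encodings}


for name, size in encoded_sizes(sample).items():
    print(f"{name}: {size} bytes")
```

	For this sample the result is 20 bytes in Big5, 24 in UTF-8 and 32
in UTF-16 (without a byte order mark): the four Chinese characters grow
from 2 bytes to 3 each in UTF-8, while the twelve ASCII characters stay at
1 byte each.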

> UTF8 for Unicode 3.1 is comprehensive, and support will surely come. It
> takes about 3 bytes per character on average, which is a reasonable
> compromise for comprehensive multi-lingual support.

	However, someone else from the mailing list also managed to
convince me, via private e-mail, that the extra disk space used is worth
it, given cheaper disks and the ability to unify languages around the
world into a single system.

	I guess my original concern is how one would "sell" the idea of
using UTF-8 as a standard.  I've always thought that UTF-16 would be the
standard and that UTF-8 was just a way to bridge from current systems to
a unified UTF-16.  And if so [yes, this is related to GTK+, still :) ],
why doesn't GTK+ support it?  I guess the answer is that no one expects
East Asian users to give up on Shift-JIS and Big5 "overnight"...maybe in
the next year or two, as you've put it...maybe even longer.

> Most of the world's recent documents are stored in Word files, which use
> UTF16 for English. If English readers aren't bitching about their text
> doubling in size, why should East Asian readers be too upset about a 50%
> increase?

	Actually, I thought most documents nowadays were HTML/XHTML (i.e.
Web documents).  But yes, I do get your main point.  I guess that to unify
languages, someone has to lose out: either Latin-based languages double in
size so that everyone uses UTF-16, or East Asian languages (and other
languages that use two-byte encodings) lose out in UTF-8 by having
characters that take 2-3 bytes each.
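
	The trade-off can be seen per character with another small Python
sketch.  The helper name bytes_per_char and the four sample characters are
my own choices for illustration; U+20000 stands in for the supplementary
characters added in Unicode 3.1:

```python
def bytes_per_char(ch):
    """Return (UTF-8 length, UTF-16 length) in bytes for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# ASCII letter, accented Latin letter, common CJK ideograph, and a
# supplementary-plane ideograph (outside the 16-bit range).
for ch in ("A", "\u00e9", "\u4e2d", "\U00020000"):
    u8, u16 = bytes_per_char(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

	ASCII takes 1 byte in UTF-8 against 2 in UTF-16, a common CJK
ideograph takes 3 against 2, and a supplementary character takes 4 bytes
in both.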

	Thank you to all (especially Owen for his original reply) for the
replies; it has been quite helpful!

