Re: UTF-8 with GTK

Raymond Wan <rwan cs mu oz au> writes:

> Hi,
> On Thu, 21 Jun 2001, Steve Underwood wrote:
> ...
> > Chinese. Owen said there is usually some ASCII in a Chinese file. Some
> > actually means 1-2% for most files, so that doesn't get the average
> > down. On the other hand, the huge increase in characters in Unicode 3.1,
> > which take more than 3 bytes, doesn't really push the average up - they
> > are rare (but often important) characters.
> 	I can't read Chinese myself, but I can imagine a Chinese document
> with some ASCII...perhaps some HTML tags and some numbers...but I would
> think most characters would be Chinese characters, all of which will
> increase in size.

Well, sure, a Chinese-language document is mostly in Chinese. But I
think if you looked at the files on a typical Chinese user's hard disk
(for some definition of typical), many of them would be all ASCII, or
contain significant amounts of ASCII.

But basically, except for text storage applications, the space used
for text isn't a huge deal these days. And if text storage size is a
concern, then you probably want to use some sort of compression;
you can do a lot better than either UTF-8 or UTF-16, even with
simple algorithms.
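
As a rough illustration of the compression point, here is a minimal
sketch. It assumes zlib's DEFLATE as a stand-in for a "simple algorithm"
and uses a made-up sample string (mostly Chinese with a little ASCII
markup); neither comes from the discussion above. The sample is repeated
many times, which exaggerates the gain compared with real text.

```python
import zlib

# Hypothetical sample: one line of mostly Chinese text with a little ASCII
# markup, repeated so the compressor has some redundancy to exploit.
sample = "<p>统一码是一种字符编码标准，版本 3.1 增加了许多汉字。</p>\n" * 200

utf8 = sample.encode("utf-8")
utf16 = sample.encode("utf-16-le")  # little-endian, no byte order mark

# Raw and zlib-compressed sizes, in bytes, for both encodings.
sizes = {
    "utf-8": len(utf8),
    "utf-16": len(utf16),
    "utf-8 + zlib": len(zlib.compress(utf8)),
    "utf-16 + zlib": len(zlib.compress(utf16)),
}

for name, size in sizes.items():
    print(f"{name:14} {size:8} bytes")
```

On input like this, either compressed form should come out far smaller
than either raw encoding, so the choice between UTF-8 and UTF-16 matters
much less once compression is in the picture.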

> > UTF8 for Unicode 3.1 is comprehensive, and support will surely come. It
> > takes about 3 bytes per character on average, which is a reasonable
> > compromise for comprehensive multi-lingual support.
> 	However, someone else on the mailing list via private e-mail also
> managed to convince me that the extra diskspace used is worth it given
> cheaper disks and the ability to unify languages around the world into a
> single system.
> 	I guess my original concern is how one would "sell" the idea of
> using UTF-8 as a standard.  I've always thought that UTF-16 would be the
> standard and UTF-8 was just some way to bridge between current systems to
> a unified UTF-16.  And if so [yes, this is related to GTK+, still :) ],
> why doesn't GTK support it.  I guess the answer is that no one expects
> East Asian users to give up on Shift-JIS and Big5 "overnight"...maybe in
> the next year or two, as you've put it...maybe even more.

With Unicode 3.1, which adds characters beyond the Basic Multilingual
Plane, people now really need to care about surrogate pairs,
so pretending UTF-16 is fixed width works less well than before.
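
To make the surrogate-pair point concrete, here is a small sketch. The
helper name utf16_units is made up for this example. It compares a
character inside the Basic Multilingual Plane with U+20000, one of the
CJK Extension B ideographs added in Unicode 3.1.

```python
def utf16_units(text: str) -> int:
    """Count the 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


bmp = "\u4e2d"         # U+4E2D, inside the Basic Multilingual Plane
beyond = "\U00020000"  # U+20000, outside it (CJK Extension B)

for ch in (bmp, beyond):
    encoded = ch.encode("utf-16-be")  # big-endian so the hex reads naturally
    print(f"U+{ord(ch):04X}: {utf16_units(ch)} code unit(s), bytes {encoded.hex()}")
```

The second character needs two code units (the surrogate pair D840 DC00),
so code that treats one 16-bit unit as one character would count it as
two characters.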

Of course, UTF-16 does have wide support because a lot of people 
went with Unicode-is-16-bit (Java, Windows, in particular), but
I don't think it is a particularly appealing encoding.

But there is also pretty widespread acceptance for using UTF-8 as an
interchange and storage format. Using it as the primary format for
strings in memory is a little more controversial, but I think it works
quite well for GTK+. The exception is inside Pango, where I now regret
not using UTF-32; if you are doing a lot of manipulation, then a real
fixed-width encoding does simplify things.
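
One way to see why a fixed-width encoding simplifies manipulation is to
fetch the n-th code point from each form. This is only a sketch:
utf8_char_at and utf32_char_at are hypothetical helpers written for this
example, not Pango or GLib functions, and both assume valid input with
no byte order mark.

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return code point number `index` from valid UTF-8 by scanning from the start."""
    if index < 0:
        raise IndexError(index)
    start = None
    count = 0  # number of code points whose lead byte we have passed
    for pos, byte in enumerate(data):
        if byte & 0xC0 == 0x80:
            continue  # continuation byte, belongs to the previous lead byte
        if count == index:
            start = pos  # the wanted code point begins here
        elif count == index + 1:
            return data[start:pos].decode("utf-8")  # the next one begins here
        count += 1
    if start is None:
        raise IndexError(index)
    return data[start:].decode("utf-8")  # wanted code point was the last one


def utf32_char_at(data: bytes, index: int) -> str:
    """Return code point number `index` from little-endian UTF-32 by direct offset."""
    if not 0 <= index < len(data) // 4:
        raise IndexError(index)
    return data[4 * index:4 * index + 4].decode("utf-32-le")


text = "a\u00e9\u4e2d\U00020000z"  # 1-, 2-, 3-, 4- and 1-byte sequences in UTF-8
u8 = text.encode("utf-8")
u32 = text.encode("utf-32-le")

print(utf8_char_at(u8, 2) == utf32_char_at(u32, 2))
```

The UTF-32 lookup is plain offset arithmetic, while the UTF-8 lookup has
to walk the bytes, which adds up when code indexes into strings a lot.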

