Re: Unicode and C++

Hello!

On Sun, Jul 09, 2000 at 01:05:01AM +0800, Steve Underwood wrote:

> fact, you have little choice. Don't you think that really sucks? Unicode
> is supposed to be the clean modern way of doing things, not a way that
> causes more trouble than older ways. Older codes are not completely free
> of duplicate characters. For example, in CNS/Big-5 the "hang" in Hang
> Seng Bank occurs twice. It occurs once in the proper sequence, and again
> at the very end of the code table. In the first case the reference
> document uses a representation very different from the one in the Hang
> Seng Bank logo, and the other matches the Bank's logo. This is purely a
> font difference, and should never have resulted in 2 code points.
> However, in older codes such oddities are rare. Unicode has quite a few.
> Dumb, huh? There really should be some kind of crud removal activity for
> Unicode, but I see no sign of such a thing.

The reason that hanzi is duplicated in Unicode is that it is duplicated
in Big5. The goal is to allow *lossless conversion* when doing
anything -> Unicode -> the same anything.

That round-trip fidelity is an essential quality Unicode must have to
be an acceptable unifying replacement for current encodings.
It would not be acceptable for an existing document to be modified just
by converting it to Unicode; that would have disastrous consequences.
So if there are oddities in existing encodings, they are reflected in
Unicode; however, only one of the "duplicate" chars is the one to use
when creating Unicode strings; the others are labelled as compatibility
characters only.
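
To make "lossless" concrete, here is a rough C++ sketch of that round
trip, using the POSIX iconv(3) API; it assumes a system where the
"BIG5" and "UTF-8" converters are installed, and the sample bytes are
just arbitrary Big5 text:

#include <iconv.h>
#include <stdexcept>
#include <string>

// Convert `in` from charset `from` to charset `to` with iconv(3).
static std::string convert(const std::string& in,
                           const char* from, const char* to) {
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string out(in.size() * 4 + 16, '\0');  // generous output buffer
    char* inp      = const_cast<char*>(in.data());
    size_t inleft  = in.size();
    char* outp     = &out[0];
    size_t outleft = out.size();

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("conversion failed");
    out.resize(out.size() - outleft);
    return out;
}

int main() {
    // Any valid Big5 byte string: converting to UTF-8 and back must give
    // the very same bytes, which is exactly why Big5's duplicate hanzi
    // need two separate Unicode code points.
    std::string big5 = "\xB4\xFA\xB8\xD5";  // arbitrary Big5 sample bytes
    std::string utf8 = convert(big5, "BIG5", "UTF-8");
    std::string back = convert(utf8, "UTF-8", "BIG5");
    return big5 == back ? 0 : 1;            // exit 0 on a lossless round trip
}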

> For any language with multiple meaningful sorting orders (e.g. Chinese),
> you must have some form of order translation table to make sorting work
> properly, whatever character code is used.

For *all* languages that is the case.
Or do you think that sorting English in ASCII and in EBCDIC gives the
same result?

That is why the definition of LC_COLLATE (the locale category defining
the sorting order) is done using symbolic names for each character, and
not hardcoded values only valid for a given charset encoding.

Of course, the compiled result of LC_COLLATE depends on the charset
encoding chosen for the locale.
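
For example, a minimal C++ sketch using the standard C locale functions
(the "en_US.UTF-8" locale name is only illustrative; any installed
locale with its own collation rules will do):

#include <clocale>
#include <cstdio>
#include <cstring>

int main() {
    // The symbolic LC_COLLATE rules are compiled against one charset;
    // strcoll() sorts by those rules, strcmp() by raw byte values.
    if (!std::setlocale(LC_COLLATE, "en_US.UTF-8"))
        return 1;                            // locale not installed

    const char* a = "apple";
    const char* b = "Banana";

    std::printf("strcmp : %d\n", std::strcmp(a, b));  // > 0: byte 'a' > byte 'B'
    std::printf("strcoll: %d\n", std::strcoll(a, b)); // < 0: apple sorts first
    return 0;
}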

> So, at least for East Asian
> languages, that is a non-issue in comparing character codes.

You compare character codes anyway, simply because a computer knows
nothing other than that.

The problem with Han ideographs is that there is not a single
recognized way to sort them, but several: phonetic, radical-based
(which is again subdivided in two: classical radicals or modern ones),
stroke count, shape of the first stroke, SKIP, or orders based on
pre-existing prestigious dictionaries. So while for an alphabetic
language there is only one table produced for a given charset encoding,
for Han ideographs there may potentially be several tables.
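
A hypothetical sketch of what that means in code: the same hanzi sorted
with two different key tables. The tables below are invented for the
example; real ones would be compiled from dictionary data, one table
per sort order per encoding:

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Three hanzi as UTF-8 strings (assumes a UTF-8 source and terminal).
    std::vector<std::string> hanzi = {"字", "中", "文"};

    // Hypothetical per-character key tables, invented for this example.
    std::map<std::string, int> strokes =
        {{"中", 4}, {"文", 4}, {"字", 6}};
    std::map<std::string, std::string> pinyin =
        {{"中", "zhong"}, {"文", "wen"}, {"字", "zi"}};

    // Stroke-count order.
    std::sort(hanzi.begin(), hanzi.end(),
              [&](const std::string& a, const std::string& b) {
                  return strokes.at(a) < strokes.at(b);
              });
    for (const auto& h : hanzi) std::cout << h << ' ';
    std::cout << '\n';

    // Phonetic (pinyin) order: same data, different table.
    std::sort(hanzi.begin(), hanzi.end(),
              [&](const std::string& a, const std::string& b) {
                  return pinyin.at(a) < pinyin.at(b);
              });
    for (const auto& h : hanzi) std::cout << h << ' ';
    std::cout << '\n';
    return 0;
}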

On the other hand, computer treatment of data allows much easier
management of that situation than paper-based sorted lists (aren't you
sometimes frustrated to find an otherwise good dictionary that uses
none of the sort methods you are familiar with?).

> Because all
> East Asian characters are a single code point in all character codes I
> know of, no code is stronger or weaker than any other in this respect. I
> understand multiple possible representations of the same text makes life
> *real* interesting for some other languages, though. I was told that
> some Indic languages can have the code points for a syllable stored in a
> variety of orders, so some serious reordering work is needed to
> harmonise strings before any direct comparison can be made between them.

That is the case for Thai at least.

And two schools of thought disagree on the solution to that problem:
for one, the solution is to define the right sequence and force its
usage, through education and by making input systems refuse illegal
sequences. For the other, the solution is to define the right sequence
but make the input system accept everything and then normalize it to
the right sequence.

I prefer the first, because it makes people conscious of the problem,
which is the real prerequisite for any problem to be solved, while the
second just hides it, which means it will stay forever, and any failure
of the normalization in the input system will have harmful consequences
(especially as people won't understand why it doesn't work).
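
To make the two schools concrete, here is a hypothetical C++ sketch;
the sequencing "rule" is invented (a mark 'M' may only follow a base
letter), real rules for Thai are of course much more involved:

#include <optional>
#include <string>

enum class Kind { Base, Mark };

// Hypothetical classifier for a toy alphabet: 'M' is a combining mark,
// anything else is a base letter.
static Kind kind(char c) { return c == 'M' ? Kind::Mark : Kind::Base; }

static bool legal_at(const std::string& in, size_t i) {
    return kind(in[i]) != Kind::Mark ||
           (i > 0 && kind(in[i - 1]) == Kind::Base);
}

// School 1: refuse illegal input outright, so users notice the problem.
static std::optional<std::string> input_strict(const std::string& in) {
    for (size_t i = 0; i < in.size(); ++i)
        if (!legal_at(in, i))
            return std::nullopt;             // illegal sequence: rejected
    return in;
}

// School 2: accept anything and silently normalize (here: drop the
// stray marks), hiding the problem from the user.
static std::string input_normalizing(const std::string& in) {
    std::string out;
    for (size_t i = 0; i < in.size(); ++i)
        if (legal_at(in, i))
            out += in[i];
    return out;
}

int main() {
    // "MB" starts with a stray mark: strict input rejects it, while the
    // normalizing input silently turns it into "B".
    bool rejected   = !input_strict("MB").has_value();
    bool normalized = input_normalizing("MB") == "B";
    return (rejected && normalized) ? 0 : 1;
}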

Note that for Latin-based alphabets, for example, it is the first
method that is applied: you must type dead_circumflex + A to produce an
Acircumflex; it won't work the other way around (I know that the X11
Compose files allow both ^ + A and A + ^, but people don't use that
when there is a dead_circumflex key on their keyboard), and that goes
back to the age of the first mechanical typewriters.
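
In code, a dead key is just a tiny two-state machine; a minimal sketch
(the key names and compose table are illustrative, not taken from any
real keymap):

#include <map>
#include <string>
#include <utility>

// A dead key emits nothing by itself and combines with the *next* key.
struct DeadKeyComposer {
    char pending = 0;                        // dead key waiting for a base

    std::string key(char k) {                // returns UTF-8 output, maybe empty
        static const std::map<std::pair<char, char>, std::string> compose = {
            {{'^', 'a'}, "\xC3\xA2"},        // a-circumflex
            {{'^', 'A'}, "\xC3\x82"},        // A-circumflex
        };
        if (k == '^') {                      // dead_circumflex pressed
            pending = '^';
            return "";
        }
        if (pending) {
            auto it = compose.find({pending, k});
            pending = 0;
            if (it != compose.end())
                return it->second;
        }
        return std::string(1, k);            // ordinary key, passed through
    }
};

int main() {
    DeadKeyComposer c;
    // dead_circumflex then 'A' composes one character; 'A' then '^'
    // would just give the two characters back, in that order.
    return (c.key('^') + c.key('A')) == "\xC3\x82" ? 0 : 1;
}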

-- 
May it go well for you,
Pablo Saratxaga

http://www.srtxg.easynet.be/		PGP Key available, key ID: 0x8F0E4975
