Re: Unicode and C++

Pablo Saratxaga wrote:

> Hello!
>
> On Sun, Jul 09, 2000 at 01:05:01AM +0800, Steve Underwood wrote:
>
> > fact, you have little choice. Don't you think that really sucks? Unicode
> > is supposed to be the clean modern way of doing things, not a way that
> > causes more trouble than older ways. Older codes are not completely free
> > of duplicate characters. For example, in CNS/Big-5 the "hang" in Hang
> > Seng Bank occurs twice. It occurs once in the proper sequence, and again
> > at the very end of the code table. In the first case the reference
> > document shows a glyph very different from the one in the Hang Seng
> > Bank logo, while the second matches the Bank's logo. This is purely a
> > font difference, and should never have resulted in two code points.
> > However, in older codes such oddities are rare. Unicode has quite a few.
> > Dumb, huh? There really should be some kind of crud removal activity for
> > Unicode, but I see no sign of such a thing.
>
> The reason that hanzi is duplicated in Unicode is that it is duplicated
> in Big5. The goal is to allow *lossless conversion* when doing
> anything -> Unicode -> the same anything.

That much is obvious, but what are we talking about: loss of reversibility,
or loss of usefulness? In achieving the former, Unicode often sacrifices the
latter.
Anyway, the problem with Unicode isn't the above. I only quoted that as an
example of Unicode not being the only villain for making messy character codes.

For a Unicode screwup, try U+7CA4 and U+7CB5. The difference is merely one
of font. The fully extended CNS, which almost nobody uses, seems to have
two code points, so it looks like Unicode copied both. However, Unicode
does not include the full CNS character set. It generally has only the
basic CNS, which is identical to Big-5. Hell, if you look at
charts.unicode.org, the Big-5 representation shown for U+7CB5 is identical
to the representation of U+7CA4. I once wasted a few hours due to that. I
couldn't puzzle out why a search for U+7CB5 U+8A9E produced so few hits in
a mass of stuff about Cantonese. It turned out a large percentage actually
said U+7CA4 U+8A9E. To date, not too many problems of this kind have shown
up, because Unicode has generally been used as a working set, with the
input and output ends still working in the old codes. As usage becomes
more homogenised around Unicode, this is going to be a real pain. This
example is far from unique.
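
For what it's worth, the workaround amounts to folding the duplicates onto
one canonical code point before searching. A minimal C++ sketch, with a
hypothetical one-entry fold table (a real one would need an entry for every
such duplicate pair):

#include <iostream>
#include <map>
#include <string>

// Hypothetical fold table mapping "compatibility" duplicates onto the
// code point treated as canonical. U+7CB5 -> U+7CA4 is the pair from
// the example above; a real table would need many more entries.
static const std::map<char32_t, char32_t> fold_table = {
    { 0x7CB5, 0x7CA4 },
};

// Replace every duplicate code point with its canonical twin.
static std::u32string fold_duplicates(std::u32string s)
{
    for (char32_t &c : s) {
        auto it = fold_table.find(c);
        if (it != fold_table.end())
            c = it->second;
    }
    return s;
}

int main()
{
    std::u32string text   = U"\u7CB5\u8A9E";  // what the files contained
    std::u32string needle = U"\u7CA4\u8A9E";  // what the search asked for

    // A raw search misses; a folded search hits.
    std::cout << "raw:    "
              << (text.find(needle) != std::u32string::npos)
              << "\nfolded: "
              << (fold_duplicates(text).find(fold_duplicates(needle))
                  != std::u32string::npos) << '\n';
    return 0;
}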

> That quality of Unicode is an essential one it must have to be an
> acceptable unifying replacement for current encodings.
> It won't be acceptable for an existing document to be modified by
> converting to Unicode; that would have disastrous consequences.
> So if there are oddities in existing encodings, they are reflected in
> Unicode; however, only one of the "duplicate" chars is the one to use
> when creating Unicode strings; the others are labelled compatibility only.

For the reasons I described above, and others, I buy none of this paragraph!

> > For any language with multiple meaningful sorting orders (e.g. Chinese),
> > you must have some form of order translation table to make sorting work
> > properly, whatever character code is used.
>
> For *all* languages that is the case.
> Or do you think that sorting English in ASCII and in EBCDIC is the same?

It's been so long since I used a mainframe that I'd forgotten about EBCDIC.
It isn't the case that plain English ASCII requires such performance-reducing
fiddles, though. EBCDIC is a more screwed-up code than Unicode. It looks like
it was designed to make the processor work harder. In practice I suspect its
screwy character order cut a few cents off the cost of an early card reader.
That would be typical of IBM, like having the interrupts the wrong way up on
the ISA bus.

> That is why the definition of LC_COLLATE (the locale category defining
> the sorting order) is done using symbolic names for each char, and not
> hardcoded values only valid for a given charset encoding.

And just what is LC_COLLATE supposed to do with my example above? Both code
points are the same damned character. Of course, what you have to do is
merge the two so they are treated as the same thing for sorting purposes.
Dumb, huh?
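
Here is roughly what that merging has to look like, as a C++ sketch rather
than LC_COLLATE syntax; the weight table is hypothetical, and a real one
would have to cover the whole repertoire, not one pair:

#include <algorithm>
#include <string>

// Hypothetical weight table: the compatibility duplicate gets exactly
// the weight of its twin, so sorting cannot tell them apart.
static char32_t sort_weight(char32_t c)
{
    return c == 0x7CB5 ? char32_t(0x7CA4) : c;
}

// Compare two strings by weight rather than by raw code point value.
static bool collate_less(const std::u32string &a, const std::u32string &b)
{
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [](char32_t x, char32_t y) { return sort_weight(x) < sort_weight(y); });
}

int main()
{
    // Neither spelling of the character sorts before the other.
    bool distinct = collate_less(U"\u7CA4", U"\u7CB5")
                 || collate_less(U"\u7CB5", U"\u7CA4");
    return distinct ? 1 : 0;  // exits 0: the duplicates collate as equal
}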

> Of course the compiled result of LC_COLLATE depends on the charset
> encoding chosen to be used.
>
> > So, at least for East Asian
> > languages, that is a non-issue in comparing character codes.
>
> You compare character codes anyway, simply because a computer knows
> nothing other than that.

The whole point is that you can't simply compare character codes. You have
to do tortuous manipulations to make a comparison really work. This thread
started on the efficiency, long-term merit, etc. of various representations
of Unicode. The screwed-up sorting, comparisons, etc. would cause a far
greater performance hit than anything discussed here so far. No Unicode
implementation I have seen properly addresses this mess.
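
To make the cost concrete, this is the shape even a "correct" equality test
has to take - a table lookup on every single character, where a plain
comparison is a straight memory compare. (The one-entry fold here is
hypothetical, and real normalisation does far more work per character than
this best case.)

#include <cassert>
#include <string>

// Hypothetical one-entry fold, as in the earlier sketch.
static char32_t fold(char32_t c)
{
    return c == 0x7CB5 ? char32_t(0x7CA4) : c;
}

// Equality that survives the U+7CA4/U+7CB5 split: one lookup per
// character, instead of operator=='s straight memory compare.
static bool equal_folded(const std::u32string &a, const std::u32string &b)
{
    if (a.size() != b.size())
        return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (fold(a[i]) != fold(b[i]))
            return false;
    return true;
}

int main()
{
    std::u32string a = U"\u7CA4\u8A9E";
    std::u32string b = U"\u7CB5\u8A9E";
    assert(a != b);              // raw compare: different
    assert(equal_folded(a, b));  // folded compare: the same text
    return 0;
}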

> The problem with Han ideographs is that there is not a single recognized
> way to sort them, but several: phonetic, radical based (which is again
> subdivided in two: classical radicals or modern ones), stroke count,
> the shape of the first stroke, SKIP, or orders based on pre-existing
> prestigious dictionaries. So while for an alphabetic language there is
> only one table produced for a given charset encoding, for Han ideographs
> there may potentially be several tables.

There are several more sort orders you haven't mentioned (the one I
generally use isn't among them, for instance), but that's really just
repeating what I said before.
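
The mechanics of supporting several orders at once are at least simple -
one table per order. A sketch; the weights below are invented purely for
illustration, where real tables would be generated from radical, stroke or
phonetic data:

#include <algorithm>
#include <map>
#include <string>
#include <vector>

// One table per sort order, keyed by code point.
using OrderTable = std::map<char32_t, int>;

static bool less_under(const OrderTable &t,
                       const std::u32string &a, const std::u32string &b)
{
    auto weight = [&t](char32_t c) {
        auto it = t.find(c);
        return it != t.end() ? it->second : 0;
    };
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [&](char32_t x, char32_t y) { return weight(x) < weight(y); });
}

int main()
{
    std::vector<std::u32string> words = { U"\u7CA4", U"\u8A9E", U"\u6F22" };

    OrderTable order_one = { {0x6F22, 1}, {0x7CA4, 2}, {0x8A9E, 3} };  // invented
    OrderTable order_two = { {0x6F22, 3}, {0x7CA4, 1}, {0x8A9E, 2} };  // invented

    std::vector<std::u32string> a = words, b = words;
    std::sort(a.begin(), a.end(),
              [&](const std::u32string &x, const std::u32string &y)
              { return less_under(order_one, x, y); });
    std::sort(b.begin(), b.end(),
              [&](const std::u32string &x, const std::u32string &y)
              { return less_under(order_two, x, y); });
    // a and b now hold the same words in two different "correct" orders.
    return 0;
}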

> On the other hand, computer treatment of data allows much easier
> management of that situation than paper-based sorted lists (aren't you
> sometimes frustrated to find an otherwise good dictionary that uses none
> of the sort methods you are familiar with?).

Yes. There is no currently available dictionary with a sane character-based
order for Chinese. The only sane Chinese dictionaries currently available
are phonetic. Most Chinese people who also know English use an English to
Chinese dictionary to look up Chinese words. A good stroke-based dictionary
is possible, though. This is getting rather off the point.

> > Because all
> > East Asian characters are a single code point in all character codes I
> > know of, no code is stronger or weaker than any other in this respect. I
> > understand that multiple possible representations of the same text make life
> > *real* interesting for some other languages, though. I was told that
> > some Indic languages can have the code points for a syllable stored in a
> > variety of orders, so some serious reordering work is needed to
> > harmonise strings before any direct comparison can be made between them.
>
> That is the case for Thai at least.
>
> And two schools of thought oppose each other on the solution to that
> problem; for one the solution is to define the right sequence and force
> its usage, through education and by making input systems refuse illegal
> sequences.
> For the other, the solution is to define the right sequence and make the
> input system accept everything and then normalize it to the right
> sequence.
>
> I prefer the first, because it makes people conscious of the problem,
> which is the real prerequisite for any problem to be solved, while the
> second just hides it, which means it will stay forever, and any failure
> of the normalization in the input system will have harmful consequences
> (especially as people won't understand why it won't work).

There is little Indic material in Unicode at present (I think there must be more
Thai Unicode than any other Indic script - all the stuff I saw in India was in
ISCII or something proprietary), so I guess a standardisation effort might not be
_that_ late. Standardisation is pretty important for efficient string comparison,
and many other things. The fact that order wasn't included in the Unicode
standard means systems will have to cope with varying orders for years, because
some oddly ordered files exist - efficiency has already been lost!
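
If normalisation is done at all, the core of it is small: give each code
point a reordering class and stable-sort each run of non-base characters,
much as Unicode's canonical ordering does for combining marks. A sketch
with invented class values attached to two real Thai marks:

#include <algorithm>
#include <string>

// Invented reorder classes: real values would come from the script's
// rules. Class 0 means a base character that stays put; nonzero marks
// are sorted into a fixed relative order within each run.
static int reorder_class(char32_t c)
{
    switch (c) {
    case 0x0E34: return 1;  // THAI SARA I (above vowel) - example class
    case 0x0E48: return 2;  // THAI MAI EK (tone mark)   - example class
    default:     return 0;
    }
}

// Stable-sort every run of nonzero-class characters, so any typing
// order of the same marks normalises to one canonical sequence.
static std::u32string normalise_order(std::u32string s)
{
    auto i = s.begin();
    while (i != s.end()) {
        if (reorder_class(*i) == 0) { ++i; continue; }
        auto j = i;
        while (j != s.end() && reorder_class(*j) != 0)
            ++j;
        std::stable_sort(i, j, [](char32_t a, char32_t b) {
            return reorder_class(a) < reorder_class(b);
        });
        i = j;
    }
    return s;
}

int main()
{
    // Two typing orders of the same syllable collapse to one string.
    std::u32string a = normalise_order(U"\u0E01\u0E34\u0E48");  // base, vowel, tone
    std::u32string b = normalise_order(U"\u0E01\u0E48\u0E34");  // base, tone, vowel
    return a == b ? 0 : 1;  // exits 0: identical after normalisation
}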

This type of standardisation seems counter to your first point - the
reversibility of translation. The Indic language handling in Unicode is
more or less a direct mapping from ISCII (as far as I can tell). If you
standardise on a reordered code point sequence for Unicode, how do you
achieve reversibility? When I read the ISCII document a couple of years
ago it didn't appear to enforce a single practice, but I have only a weak
understanding of the issues involved.

> Note that for Latin-based alphabets, for example, it is the first method
> that is applied: you must type dead_circumflex + A to produce an
> Acircumflex; it won't work the other way around (I know that the X11
> Compose files allow ^ + A and A + ^, but people don't use them when
> there is a dead_circumflex key on their keyboard), and that goes back
> to the age of the first mechanical typewriters.

Steve