Re: Unicode and C++



Robert Brady wrote:

> On Sat, 8 Jul 2000, Steve Underwood wrote:
>
> > characters). Without this you cannot reliably compare strings - a
> > significant problem with Unicode.
>
> Yes, you can. You just need a Big Table. And you need a Big Table for
> sorting stuff anyway, so it is a non-issue.

OK, you can pre-process strings to take out all the problems caused by
the crudeness of the Unicode character set and then look for a match. In
fact, you have little choice. Don't you think that really sucks? Unicode
is supposed to be the clean modern way of doing things, not a way that
causes more trouble than older ways. Admittedly, older codes are not
completely free of duplicate characters. For example, in CNS/Big-5 the
"hang" in Hang Seng Bank occurs twice: once in the proper sequence, and
again at the very end of the code table. At the first position the
reference document shows a glyph very different from the one in the
Hang Seng Bank logo; the glyph at the second position matches the logo.
This is purely a font difference, and should never have resulted in two
code points. However, in older codes such oddities are rare. Unicode has
quite a few.
Dumb, huh? There really should be some kind of crud removal activity for
Unicode, but I see no sign of such a thing.
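
To make the "pre-process, then look for a match" step concrete, here is a
minimal sketch. It assumes Python's standard unicodedata module stands in
for the Big Table; the helper name same_text is made up for illustration,
and C++ code would need a comparable table from a library of its own.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings after folding canonically equivalent forms.

    NFC normalization maps each canonically equivalent spelling to one
    preferred sequence, so a plain equality test works afterwards.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Two spellings of the same accented letter.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# A duplicate ideograph: U+FA0C is canonically equivalent to U+5140.
compat_ideograph = "\ufa0c"
unified_ideograph = "\u5140"

if __name__ == "__main__":
    print(precomposed == decomposed)                       # raw comparison
    print(same_text(precomposed, decomposed))              # after folding
    print(same_text(compat_ideograph, unified_ideograph))  # after folding
```

The raw comparison fails while the folded one succeeds, which is the extra
step being complained about here.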

For any language with multiple meaningful sorting orders (e.g. Chinese),
you must have some form of order translation table to make sorting work
properly, whatever character code is used. So, at least for East Asian
languages, that is a non-issue in comparing character codes. Because
each East Asian character is a single code point in every character code
I know of, no code is stronger or weaker than any other in this respect. I
understand multiple possible representations of the same text makes life
*real* interesting for some other languages, though. I was told that
some Indic languages can have the code points for a syllable stored in a
variety of orders, so some serious reordering work is needed to
harmonise strings before any direct comparison can be made between them.
I can only read English and Chinese, so those are the only languages I
can directly comment on.
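
The reordering problem can be sketched in a small way. I cannot vouch for
an Indic example, so this one uses a Latin letter with two combining marks
stored in either order; it assumes Python's unicodedata module, and the
helper names canonical_order and combining_classes are hypothetical.

```python
import unicodedata


def canonical_order(text: str) -> str:
    """Return text in NFD, which sorts adjacent combining marks by class."""
    return unicodedata.normalize("NFD", text)


def combining_classes(text: str) -> list[int]:
    """List the canonical combining class of each code point in text."""
    return [unicodedata.combining(ch) for ch in text]


# The same visible text, with the two marks stored in opposite orders.
dot_above_first = "q\u0307\u0323"   # q, COMBINING DOT ABOVE, COMBINING DOT BELOW
dot_below_first = "q\u0323\u0307"   # q, COMBINING DOT BELOW, COMBINING DOT ABOVE

if __name__ == "__main__":
    print(combining_classes(dot_above_first))
    print(combining_classes(dot_below_first))
    print(canonical_order(dot_above_first) == canonical_order(dot_below_first))
```

The two stored orders differ code point by code point, and only agree once
both have been put through the same reordering pass.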

Steve





