Re: UTF-8: Case mapping

From: Pablo Saratxaga <pablo mandrakesoft com>
To: gtk-i18n-list gnome org
Subject: Re: UTF-8: Case mapping
Date: Thu, 28 Jun 2001 14:04:24 +0200

Kaixo!

On Thu, Jun 28, 2001 at 08:56:25AM +0800, Steve Underwood wrote:

> I agree with most of what you said, but I can see a practical reason why
> your last point is a poor solution. Most people only know one or two
> languages. We lack the skills needed to build any generally meaningful

...

I wonder... the GNU libc has very complex and comprehensive per-language
(even per-locale) sorting rules; and, from the source files at least, it seems
possible to find the base letter of a given char when it has
accents, or to know whether it is upper or lower case, or which script it is
from. So I don't really understand the reason for this thread; what exactly
is the problem?
- relying on GNU libc is not possible for portability reasons?
- that detailed info is not kept in compiled files with localedef?
- there is no way to retrieve that info with current versions of libc?
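
To make the kind of per-character information mentioned above concrete, here
is a minimal Python sketch. It uses Python's unicodedata module, not the GNU
libc locale source files, so it is only an analogy for what those files
contain. The names base_letter and describe are hypothetical helpers, and
taking the first word of the Unicode character name as "the script" is a
rough approximation.

```python
import unicodedata


def base_letter(ch):
    """Return ch without combining marks, e.g. e-acute gives plain e.

    Assumption: canonical decomposition (NFD) is a good enough notion of
    "base letter"; letters with no decomposition are returned unchanged.
    """
    decomposed = unicodedata.normalize("NFD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped or ch


def describe(ch):
    """Collect base letter, case and a rough script label for one char."""
    name = unicodedata.name(ch, "")
    return {
        "base": base_letter(ch),
        "upper": ch.isupper(),
        "lower": ch.islower(),
        # Rough heuristic: Unicode names start with the script name,
        # e.g. "LATIN SMALL LETTER E WITH ACUTE", "ARABIC LETTER BEH".
        "script": name.split(" ")[0] if name else "UNKNOWN",
    }
```

Something like describe("é") would report the base letter "e", lower case,
and the label "LATIN".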

I understand strcoll() may not be enough (as it imposes a given view of the
sorting order: the most natural one for users of a given language, but anyway),
and that the ability to discriminate upper and lower case, or diacritics,
may be a useful option (not possible with strcoll()). But from some messages
in this thread it seems as if some people wonder how to provide the
per-language data; that data already exists (at least in GNU libc), so
there is no need to know a given (human) language: that work is to be done by
native speakers of that language (and is already done for most of them).
The problem is only how to use that info.
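
As an illustration of using the collation data a locale already provides,
here is a minimal Python sketch. Python's locale module wraps the C library's
strcoll()/strxfrm(), so the ordering comes from whatever locales are installed
on the system. The function name sort_for_locale is hypothetical, and locale
names such as "sv_SE.UTF-8" are an assumption about the local installation.

```python
import locale


def sort_for_locale(words, locale_name):
    """Sort words using the LC_COLLATE rules of locale_name.

    Raises locale.Error if the locale is not installed. Note that
    setlocale() changes process-wide state, so this is not thread-safe.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strxfrm() turns each string into a key whose plain comparison
        # matches what strcoll() would give for the original strings.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)
```

A call such as sort_for_locale(words, "sv_SE.UTF-8") would then use the
Swedish rules if that locale exists. As noted above, this yields a single
fixed ordering; it offers no switch for ignoring case or diacritics.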

Note that the sorting data for each locale is based on a default sorting
algorithm for the whole Unicode range; each locale/language only reorders
(if needed) a small subset corresponding to the script it uses; that makes
sense imho. If someone actually wants to sort Latin chars in Swedish
order and Arabic in Persian order (which is slightly different from Arabic),
then it is up to the user with such specific needs to break strings into small
Latin and Arabic pieces and sort each piece with the appropriate algorithm,
or to create his own locale with his own sorting rules. But the per-language
default values are good enough for almost anybody (unless they are
buggy/wrong, but that's another problem).
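
The "break strings into small Latin and Arabic pieces" step could be sketched
as follows. This is only a rough Python illustration under stated assumptions:
script_of and script_runs are hypothetical names, the script is guessed from
the first word of the Unicode character name, and non-letters (spaces, digits,
punctuation) are simply attached to the neighbouring run.

```python
import unicodedata


def script_of(ch):
    """Rough script label for a letter, or None for non-letters."""
    if not ch.isalpha():
        return None
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]


def script_runs(text):
    """Split text into consecutive (script, substring) runs."""
    runs = []  # each entry is [script, chars]
    for ch in text:
        script = script_of(ch)
        if runs and (script is None or script == runs[-1][0]):
            # Same script, or a non-letter: extend the current run.
            runs[-1][1] += ch
        elif runs and runs[-1][0] is None:
            # The run so far held only non-letters: adopt this script.
            runs[-1][0] = script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, chars) for script, chars in runs]
```

Each run could then be compared with the collation rules chosen for its
script; how to combine the per-run results into one ordering is left open
here, as it depends on the specific need.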

I think the intended goal should be explained again (and maybe discussed);
there seems to be some confusion here.

Thanks

-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://www.srtxg.easynet.be/		PGP Key available, key ID: 0x8F0E4975

Follow-Ups:
- Re: UTF-8: Case mapping
  - From: Raymond Wan

References:
- UTF-8: Normalization
  - From: Owen Taylor
- UTF-8: Case mapping
  - From: Owen Taylor
- Re: UTF-8: Case mapping
  - From: Steve Underwood
