Re: UTF-8: Collation
- From: Raymond Wan <rwan cs mu oz au>
- To: Owen Taylor <otaylor redhat com>
- Cc: gtk-i18n-list gnome org
- Subject: Re: UTF-8: Collation
- Date: Thu, 28 Jun 2001 08:41:43 +1000 (EST)
Hi Owen,
On 27 Jun 2001, Owen Taylor wrote:
> Collation: [ http://bugzilla.gnome.org/show_bug.cgi?id=55836 ]
> Ordering strings in a linguistically meaningful fashion requires more
> work than simply ordering by codepoints. In POSIX, the distinction
> here is between strcmp() and strcoll()
...
I took a look at the bug report and I feel like I'm missing
something... What would be nice is a C-like qsort that can sort strings,
given a comparison function that returns -1, 0, or 1. As far as I can
tell, qsort can't do that character by character at the moment because
some characters are represented by 1 byte in UTF-8 and others by 3 bytes.
I feel like that's the only problem.
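A minimal sketch of the qsort idea, assuming the array holds pointers to
the strings rather than the characters themselves, so every element has
the same size no matter how many bytes each character takes. The names
cmp_utf8_bytes and sort_utf8_strings are made up for illustration. The
comparison is plain strcmp, which for valid UTF-8 gives code point order,
not a linguistic one.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical comparison function for qsort. qsort passes pointers to
   the array elements, so each argument is really a pointer to a char *. */
static int cmp_utf8_bytes(const void *a, const void *b)
{
    const char *sa = *(const char *const *)a;
    const char *sb = *(const char *const *)b;
    int r = strcmp(sa, sb);

    /* strcmp only promises a sign, so normalize to -1, 0, or 1. */
    return (r > 0) - (r < 0);
}

/* Hypothetical wrapper: sort an array of n string pointers in place.
   Only the pointers are moved; the string bytes stay where they are. */
static void sort_utf8_strings(const char **strings, size_t n)
{
    qsort(strings, n, sizeof strings[0], cmp_utf8_bytes);
}
```

Swapping cmp_utf8_bytes for a different comparison function is then the
only change needed to get a different order.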
As far as being locale dependent, I don't know what would suit
people's needs...but I have a feeling that if I were an English speaker
with text in English, German, and French to sort, I might arbitrarily want
"e" to come before "e"'s with accents on them. A French speaker, on the
other hand, might order them the opposite way, based on their phonetic
knowledge of the language. And from looking at some Japanese dictionaries,
I think there isn't a fixed way to order the same string in katakana or
hiragana. And let's not get into how kanji characters are ordered in a
dictionary...a nightmare for even native speakers, or so I've been told...
Also, the current C method of sorting strings is kind of
arbitrary, too. Using strcmp, upper case is ordered before lower case
based on ASCII code. A strcmp with a comparison function would be nice,
though, since I can give priority to lower case characters or to
punctuation -- but that's a C issue. :)
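To make the lower-case-first idea concrete, here is a rough sketch of a
strcmp-like function with its own ordering. The names ascii_rank and
cmp_lower_first are invented for this example, and it assumes an ASCII
execution character set. Punctuation and other ASCII non-letters come
first, then letters interleaved as a, A, b, B, ..., then any non-ASCII
bytes in byte order.

```c
/* Hypothetical ranking of a single byte (assumes ASCII):
   ASCII non-letters keep their code, letters are interleaved with
   lower case first, and non-ASCII bytes go last in byte order. */
static int ascii_rank(unsigned char c)
{
    if (c >= 'a' && c <= 'z') return 1000 + 2 * (c - 'a');
    if (c >= 'A' && c <= 'Z') return 1000 + 2 * (c - 'A') + 1;
    if (c >= 0x80)            return 2000 + c;
    return c;
}

/* Hypothetical strcmp replacement: returns -1, 0, or 1 using the
   ranking above instead of raw ASCII codes. */
static int cmp_lower_first(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;
    int ra, rb;

    /* Skip the common prefix. */
    while (*pa && *pa == *pb) {
        pa++;
        pb++;
    }

    /* End of string ranks below everything, so a prefix sorts first. */
    ra = *pa ? ascii_rank(*pa) : -1;
    rb = *pb ? ascii_rank(*pb) : -1;
    return (ra > rb) - (ra < rb);
}
```

With this ordering "apple" comes before "Apple", and "Zebra" comes after
"apple", which is the reverse of what strcmp gives on an ASCII system.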
Anyway, I think it's good that string sorting is being looked at
now. However, I think having each individual programmer decide the order
of strings is best for now...until UTF-8 catches on. Perhaps have a base
method that sorts on Unicode order, but have some type of comparison
function that each user has to write and some good examples of how to
write it...
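Something like the following is what I have in mind, as a sketch only.
All of the names here (weight_fn, utf8_next, utf8_compare,
accents_after_plain_e) are hypothetical, and the decoder assumes
well-formed UTF-8. With no weight function the comparison falls back to
Unicode code point order. With one, the programmer decides where each
character goes. The example function places e-acute (U+00E9) right after
plain "e".

```c
#include <stddef.h>

/* Hypothetical user-written hook: maps a code point to a sort weight. */
typedef unsigned long (*weight_fn)(unsigned long codepoint);

/* Decode one code point and advance *s. Assumes well-formed UTF-8;
   a truncated sequence stops at the first non-continuation byte. */
static unsigned long utf8_next(const unsigned char **s)
{
    const unsigned char *p = *s;
    unsigned long cp;
    int extra;

    if (*p < 0x80)      { cp = *p;        extra = 0; }
    else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }
    else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }
    else                { cp = *p & 0x07; extra = 3; }
    p++;
    while (extra-- > 0 && (*p & 0xC0) == 0x80)
        cp = (cp << 6) | (*p++ & 0x3F);
    *s = p;
    return cp;
}

/* Base comparison: code point order when weight is NULL, otherwise
   the order defined by the user's weight function. Returns -1, 0, 1. */
static int utf8_compare(const char *a, const char *b, weight_fn weight)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    while (*pa && *pb) {
        unsigned long wa = utf8_next(&pa);
        unsigned long wb = utf8_next(&pb);
        if (weight) {
            wa = weight(wa);
            wb = weight(wb);
        }
        if (wa != wb)
            return wa < wb ? -1 : 1;
    }
    if (*pa) return 1;   /* a is longer */
    if (*pb) return -1;  /* b is longer */
    return 0;
}

/* Example user function: doubling every weight leaves a free slot
   after each character; e-acute takes the slot right after 'e'. */
static unsigned long accents_after_plain_e(unsigned long cp)
{
    if (cp == 0x00E9)
        return 2UL * 'e' + 1;
    return 2 * cp;
}
```

A real collation scheme needs far more than one weight per code point,
so treat this only as an illustration of the plug-in idea.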
Just a thought... UTF-8 sorting would be handy right now
actually. :-) And the fact that strcmp arbitrarily puts spaces before
letters has always bugged me a bit...
Ray