Re: UTF-8: Collation



Hi Owen,

On 27 Jun 2001, Owen Taylor wrote:
> Collation: [ http://bugzilla.gnome.org/show_bug.cgi?id=55836 ]
> Ordering strings in a linguistically meaningful fashion requires more
> work than simply ordering by codepoints.  In POSIX, the distinction
> here is between strcmp() and strcoll()
...

	I took a look at the bug report and I feel like I'm missing
something...  What would be nice is a C-like qsort that can sort strings,
given a comparison function that returns -1, 0, or 1.  qsort at the
moment can't do it because some characters are represented by 1 byte in
UTF-8 and others by 3 bytes.  I feel like that's the only problem.
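
For what it's worth, here is a minimal sketch of one way this might look
with plain qsort, assuming the strings are kept as an array of pointers.
Each array element is then a fixed-size pointer, however many bytes the
string it points to uses per character.  The names compare_utf8_bytes and
sort_utf8_strings are made up for illustration.  Note that strcmp on
well-formed UTF-8 gives code point order, not a linguistic order.

```c
#include <stdlib.h>
#include <string.h>

/* qsort passes pointers to the array elements.  Each element here is a
 * `const char *`, so dereference once to reach the string itself. */
static int compare_utf8_bytes(const void *a, const void *b)
{
    const char *sa = *(const char *const *)a;
    const char *sb = *(const char *const *)b;
    int r = strcmp(sa, sb);        /* byte order == code point order in UTF-8 */
    return (r > 0) - (r < 0);      /* normalise to -1, 0, or 1 */
}

/* Hypothetical helper: sorts n NUL-terminated UTF-8 strings in place. */
static void sort_utf8_strings(const char **strings, size_t n)
{
    qsort(strings, n, sizeof *strings, compare_utf8_bytes);
}
```

Only the pointers are moved around, so 1-byte and 3-byte characters can
sit in the same array without trouble.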

	As far as being locale dependent, I don't know what would suit
people's needs...but I have a feeling that if I were an English speaker
and had text in English, German, and French to sort, I might arbitrarily
want "e" to come before "e"'s with accents on them.  A French speaker, on
the other hand, might order them the opposite way, based on their phonetic
knowledge of the language.  And from looking at some Japanese
dictionaries, I think there isn't a fixed way to order the same string
written in katakana or hiragana.  And let's not get into how kanji
characters are ordered in a dictionary...a nightmare even for native
speakers, or so I've been told...
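
As a rough illustration of the locale-dependent route mentioned in the
quoted text, the sketch below sorts with strcoll instead of strcmp.  The
names compare_collate and sort_by_locale are hypothetical.  It assumes
the caller has already selected a collation locale, for example with
setlocale(LC_COLLATE, "") from <locale.h>; without that call the program
stays in the "C" locale and strcoll orders strings the same way strcmp
does.

```c
#include <stdlib.h>
#include <string.h>

/* Compare two strings using the current LC_COLLATE rules.
 * Assumption: the caller has chosen the locale beforehand. */
static int compare_collate(const void *a, const void *b)
{
    const char *sa = *(const char *const *)a;
    const char *sb = *(const char *const *)b;
    int r = strcoll(sa, sb);
    return (r > 0) - (r < 0);      /* normalise to -1, 0, or 1 */
}

/* Hypothetical helper: sorts n strings in place by locale order. */
static void sort_by_locale(const char **strings, size_t n)
{
    qsort(strings, n, sizeof *strings, compare_collate);
}
```

Which order comes out for accented letters depends entirely on the locale
data installed on the system, so the same program can sort differently
for an English and a French user.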

	Also, the current C method of sorting strings is kind of
arbitrary, too.  With strcmp, upper case is ordered before lower case
simply because of the ASCII codes.  A strcmp that takes a comparison
function would be nice, though, since I could give priority to lower case
characters or to punctuation -- but that's a C issue.  :)
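
A strcmp variant along those lines might be sketched as follows.  Nothing
like this exists in the C library; rank_fn, lower_first_rank, and
strcmp_ranked are invented names, and the ranking assumes ASCII letters.
The caller supplies a function that maps each byte to a sort rank, and
the sample ranking puts "a" before "A" before "b" before "B".

```c
/* Hypothetical callback type: maps one byte to its sort rank.
 * Distinct bytes must get distinct ranks. */
typedef int (*rank_fn)(unsigned char c);

/* Sample ranking (assumes ASCII): non-letters keep their byte order and
 * come first, then letters as a A b B ..., then bytes >= 0x80. */
static int lower_first_rank(unsigned char c)
{
    if (c >= 'a' && c <= 'z') return 256 + 2 * (c - 'a');
    if (c >= 'A' && c <= 'Z') return 256 + 2 * (c - 'A') + 1;
    if (c >= 0x80)            return 512 + c;
    return c;
}

/* Like strcmp, but orders bytes by rank(byte); returns -1, 0, or 1. */
static int strcmp_ranked(const char *a, const char *b, rank_fn rank)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    while (*pa != '\0' && *pb != '\0') {
        int ra = rank(*pa), rb = rank(*pb);
        if (ra != rb)
            return (ra > rb) - (ra < rb);
        pa++;
        pb++;
    }
    /* One string is a prefix of the other: the shorter one sorts first. */
    return (*pa != '\0') - (*pb != '\0');
}
```

With this ranking, strcmp_ranked("apple", "Apple", lower_first_rank)
returns -1, whereas plain strcmp puts "Apple" first.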

	Anyway, I think it's good that string sorting is being looked at
now.  However, I think letting each individual programmer decide the order
of strings is best for now...until UTF-8 catches on.  Perhaps provide a
base method that sorts in Unicode code point order, plus a hook for a
comparison function that each user writes, along with some good examples
of how to write one...
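
To make that idea a bit more concrete, here is one possible shape for it,
purely as a sketch.  All of the names (codepoint_cmp, utf8_next,
utf8_compare, e_acute_after_e) are hypothetical, and the decoder assumes
well-formed UTF-8 with no validation.  Passing NULL gives the base
behaviour, plain code point order; passing a callback lets the caller
override it.  The sample callback places U+00E9 (e with acute accent)
directly after "e".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: compares two code points. */
typedef int (*codepoint_cmp)(uint32_t a, uint32_t b);

/* Decode one code point and advance *s.
 * Assumption: input is well-formed UTF-8; nothing is validated. */
static uint32_t utf8_next(const unsigned char **s)
{
    const unsigned char *p = *s;
    uint32_t cp;
    int extra;

    if (p[0] < 0x80)      { cp = p[0];        extra = 0; }
    else if (p[0] < 0xE0) { cp = p[0] & 0x1F; extra = 1; }
    else if (p[0] < 0xF0) { cp = p[0] & 0x0F; extra = 2; }
    else                  { cp = p[0] & 0x07; extra = 3; }
    p++;
    while (extra-- > 0 && *p != '\0')   /* stop early on truncated input */
        cp = (cp << 6) | (*p++ & 0x3F);
    *s = p;
    return cp;
}

/* Base ordering: numeric code point order. */
static int default_cmp(uint32_t a, uint32_t b)
{
    return (a > b) - (a < b);
}

/* Sample user ordering: U+00E9 sorts right after 'e', before 'f'. */
static int e_acute_after_e(uint32_t a, uint32_t b)
{
    uint32_t ka = (a == 0xE9) ? 2u * 'e' + 1 : 2u * a;
    uint32_t kb = (b == 0xE9) ? 2u * 'e' + 1 : 2u * b;
    return (ka > kb) - (ka < kb);
}

/* Compare two UTF-8 strings code point by code point; returns -1, 0, 1.
 * cmp == NULL selects the base code point order. */
static int utf8_compare(const char *a, const char *b, codepoint_cmp cmp)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    if (cmp == NULL)
        cmp = default_cmp;
    while (*pa != '\0' && *pb != '\0') {
        int r = cmp(utf8_next(&pa), utf8_next(&pb));
        if (r != 0)
            return (r > 0) - (r < 0);
    }
    return (*pa != '\0') - (*pb != '\0');   /* shorter string first */
}
```

A per-code-point callback like this is much cruder than real collation
(it can't handle multi-character rules, for instance), but it might be
enough for the "each programmer decides" approach.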

	Just a thought...  UTF-8 sorting would be handy right now,
actually.  :-)  And the fact that strcmp arbitrarily puts spaces before
letters has always bugged me a bit...

Ray






