Re: sorting strings with non-ASCII characters



On Sun, Mar 13, 2005 at 05:52:09PM -0500, Morten Welinder wrote:
We simply use what we have from glib; we don't do our own collating.

All right. Pass my comment on to them then (or do you want me to do it
myself)?

The situation is much more complicated than you might realize (even
though, if you have read Knuth, you have seen it all).

This is not a reason to do nothing! 

(rant: That's a very annoying thing with many University-bred computer
scientists: they will point out that, in theory, such a problem is
NP-complete (even though it is easy in all practical cases), or they
find some extreme exception. Since they cannot be 100% right (but only,
maybe, (100-epsilon)% right), they do nothing. They seem not to
understand real life and the fact that "worse is better".)

For example, \oe -> oe would be wrong in Danish.

Right. I don't speak Danish and there are many languages out there with
problems I cannot even imagine.

Apparently, Russians have no rule for sorting a mix of Latin and
Cyrillic words: all words have to be written in the same alphabet.
There is probably no country or language in the world whose sorting
rules apply to several alphabets at the same time. Anyway, this is an
issue specific to each country or language academy.

I can't think of a language where "é" should not be treated as "e"
*as a first approximation*. Same for ô, è, ê, ... Start with the simple
cases... Latin1 and its simple letters cover most of the uses.

The Spanish have the two words "que" and "qué". I don't know which is
supposed to come first (and probably hardly any Spanish person would
know either; I wouldn't know in French without opening a dictionary).
But even if you get it wrong, it is much better to have them next to
each other than to push the "é" to the end of the alphabet.
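To make the "first approximation" concrete, here is a minimal Python
sketch (not what glib or Gnumeric actually does). It assumes that
dropping the combining marks left after Unicode NFD decomposition is an
acceptable way to map "é" to "e"; the function names and the word list
are made up for illustration.

```python
import unicodedata

def fold_accents(word):
    """Drop combining marks after NFD decomposition (é -> e, ô -> o).

    Letters without a decomposition, such as œ or ø, pass through unchanged.
    """
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def approx_sort_key(word):
    # Primary key: accent-free, case-folded form.
    # Secondary key: the original word, so "que" and "qué" stay adjacent
    # in a stable (if linguistically arbitrary) order.
    return (fold_accents(word).casefold(), word)

# Illustrative word list, not taken from any real spreadsheet.
words = ["zorro", "qué", "que", "queso", "élite", "eje"]

naive = sorted(words)                        # plain code point order
approx = sorted(words, key=approx_sort_key)  # accents ignored at first

print(naive)   # ['eje', 'que', 'queso', 'qué', 'zorro', 'élite']
print(approx)  # ['eje', 'élite', 'que', 'qué', 'queso', 'zorro']
```

The plain sort pushes "élite" past "zorro" and separates "qué" from
"que"; the folded key keeps them next to each other, whichever of the
two is officially supposed to come first.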

The Spanish used to have the special letters "ll" and "ch". Now those
just behave like they would in English with respect to sorting and
alphabet.

For the complex cases, I guess you may rely on the locale to know which
language the stylesheet is written in, and then contact competent native
speakers to learn the rules of their language. If a stylesheet is
written in several languages at the same time then 1/ this is a special
case, and a very rare one, 2/ the user should not expect sorting to
work, and 3/ maybe Gnumeric can use a "language" attribute for a given
field.
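As a rough sketch of that idea, the Python below picks a sort key from
a per-language alphabet and falls back to accent folding when the
language has no special rule. Everything here is hypothetical: the
`ALPHABETS` table, the `sort_key` function and the language codes are
invented for the example, and only the Danish rule (æ, ø, å come after
z) is filled in. A real implementation would more likely rely on the
platform's locale-aware collation than on a hand-written table.

```python
import unicodedata

# Hypothetical tailoring table: letters in the order the language sorts them.
ALPHABETS = {
    "da": "abcdefghijklmnopqrstuvwxyzæøå",  # Danish: æ, ø, å after z
}
DEFAULT_ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def strip_marks(ch):
    """Drop combining marks from one character (å -> a); ø stays ø."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def sort_key(word, language):
    """Build a sort key for word under the given language's alphabet."""
    alphabet = ALPHABETS.get(language, DEFAULT_ALPHABET)
    key = []
    for ch in word.casefold():
        # A letter the language treats as its own keeps its own rank;
        # anything else is reduced to its base letter first.
        pieces = ch if ch in alphabet else strip_marks(ch)
        for c in pieces:
            pos = alphabet.find(c)
            # Unknown characters sort after the alphabet, by code point.
            key.append(pos if pos >= 0 else len(alphabet) + ord(c))
    return (key, word)

words = ["ål", "zebra", "abe", "øl"]

danish = sorted(words, key=lambda w: sort_key(w, "da"))
fallback = sorted(words, key=lambda w: sort_key(w, "fr"))

print(danish)    # ['abe', 'zebra', 'øl', 'ål']
print(fallback)  # ['abe', 'ål', 'zebra', 'øl']
```

The same four words come out in a different order depending on the
language attribute, which is the point: the simple folding is only the
default, and a language with its own rules overrides it.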

I doubt this issue is missing from the Unicode sites and FAQs. Maybe
that would be the right place to take the debate. If/when Gnumeric does
Chinese, Arabic, and the like, how are you (or the glib people) going to
deal with this?


