sorting strings with non-ASCII characters



Sorting a column of strings puts e acute (é) after y.

Sorting properly is quite difficult (Knuth wrote at length on the
subject in one of his books). Taking the whole of Unicode into account
is difficult too (where should Cyrillic go relative to Latin?).

Nevertheless, Latin1/Latin9 are widely used (whether or not they are
encoded in UTF-8), and I use the following Perl subroutine when I need
to produce sorting keys for them. It is easy to adapt to another
programming language:

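# Fold each Latin1 character to a rough 7-bit ASCII look-alike.
# The two tr lists are positional: the Nth character of the first
# list is replaced by the Nth character of the second.
# The literal characters below must be in the same encoding as the
# strings passed in.
sub seven_bits {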
  my $tmp = shift;
  return "" unless defined $tmp;
  $tmp =~ tr{\x80¡¢£¤¥¦§\¨©ª«¬\­®¯°±²³\´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ}
               { !cL*Y|S\"ca<n\-r_o+23\'uP.o1o>423?AAAAAAACEEEEIIIIDNOOOOO};
  $tmp =~ tr{×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö\÷øùúûüýþÿ}
            {x0UUUUYPBaaaaaaaceeeeiiiidnooooo\%0uuuuypy};
  return $tmp;
}
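
As an illustration of adapting it to another language, here is a rough Python sketch of the same table. It assumes the input is an already-decoded Unicode string, and it relies on the fact that the characters in the two tr lists (after the first entry) run in code-point order from U+00A1 to U+00FF. The names _TARGETS and _TABLE are mine, not part of the original.

```python
# Replacement characters for U+00A1..U+00FF, in code-point order,
# copied from the two tr lists of the Perl subroutine.
_TARGETS = (
    "!cL*Y|S\"ca<n-r_"          # U+00A1..U+00AF
    "o+23'uP.o1o>423?"          # U+00B0..U+00BF
    "AAAAAAACEEEEIIIIDNOOOOO"   # U+00C0..U+00D6
    "x0UUUUYPB"                 # U+00D7..U+00DF
    "aaaaaaaceeeeiiiidnooooo"   # U+00E0..U+00F6
    "%0uuuuypy"                 # U+00F7..U+00FF
)
# Sanity check: exactly one replacement per code point in the range.
assert len(_TARGETS) == 0xFF - 0xA1 + 1

# str.translate wants a mapping from code point to replacement string.
_TABLE = {0xA1 + i: ch for i, ch in enumerate(_TARGETS)}
_TABLE[0x80] = " "  # first entry of the Perl list (\x80 -> space)


def seven_bits(text):
    """Return a rough 7-bit version of text ("" for None, as in Perl)."""
    if text is None:
        return ""
    return text.translate(_TABLE)
```

Characters outside the table (plain ASCII, or anything beyond U+00FF) pass through unchanged, as with tr.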

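You could improve it, for example by mapping ö to oe.
You could extend it to Latin9, for example by mapping œ (\oe) to oe.
Since tr maps one character to exactly one character, such
two-letter replacements need a substitution (s/ö/oe/g) applied
before the tr.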
...
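
A possible way to do such multi-letter replacements, sketched in Python: str.translate accepts replacement strings of any length, so the expansions can live in a small table that is applied before the one-to-one folding. Only the two mappings mentioned above are included; the names EXPANSIONS and expand are hypothetical.

```python
# One-to-many replacements, applied before the single-character folding.
EXPANSIONS = {
    ord("ö"): "oe",   # o with diaeresis
    ord("œ"): "oe",   # oe ligature (present in Latin9, not in Latin1)
}


def expand(text):
    """Replace each listed character by its multi-letter expansion."""
    return text.translate(EXPANSIONS)


print(expand("schön"))   # schoen
print(expand("cœur"))    # coeur
```

Further pairs can be added to the table in the same way.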

Then one could compare two strings, for sorting, as follows:

(seven_bits($string1) cmp seven_bits($string2)) || ($string1 cmp $string2)

(The second part breaks ties between strings that have the same seven_bits version.)
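
In a language that sorts by key rather than by comparison function, the same two-level comparison can be expressed as a pair: the folded string first, the original string second. Below is a minimal Python sketch, with a condensed copy of the table (U+00A1 to U+00FF only) so that it runs on its own; sort_key and the sample words are made up for the example.

```python
# Condensed copy of the Perl table: replacements for U+00A1..U+00FF.
TARGETS = ("!cL*Y|S\"ca<n-r_o+23'uP.o1o>423?"
           "AAAAAAACEEEEIIIIDNOOOOOx0UUUUYPB"
           "aaaaaaaceeeeiiiidnooooo%0uuuuypy")
TABLE = {0xA1 + i: ch for i, ch in enumerate(TARGETS)}


def seven_bits(text):
    return "" if text is None else text.translate(TABLE)


def sort_key(text):
    # Primary key: folded form. Secondary key: the original string,
    # used only to break ties between equal folded forms.
    return (seven_bits(text), text)


words = ["zebra", "été", "yacht", "ete", "eta"]
print(sorted(words))                # ['eta', 'ete', 'yacht', 'zebra', 'été']
print(sorted(words, key=sort_key))  # ['eta', 'ete', 'été', 'yacht', 'zebra']
```

Tuples compare element by element, which matches the (a cmp b) || (c cmp d) pattern above.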

This would make a quick and easy first approximation.


