UTF-8: Case mapping



Case conversion: [ http://bugzilla.gnome.org/show_bug.cgi?id=55852 ]

Issues in case mapping and case folding are described
in Unicode Technical Report #21: Case mapping.
 
  http://www.unicode.org/unicode/reports/tr21/

Some of the less obvious attributes of case mapping:

 * Case mapping is locale sensitive, though the total number of
   locale-sensitive rules is quite small. (Most important one - 
   in Turkish I is paired with dotless i, and i is paired with
   a capital I with a dot.)

 * Case mapping is context-sensitive; for instance, the proper
   lowercase equivalent of the Greek sigma depends on whether
   the letter occurs at the end of a word or elsewhere in it.

 * Case mapping can't be done character by character - for
   instance, German ß maps to SS in uppercase.

 * Converting to a fixed case is a poor way to do caseless
   comparison; properly, it should be done using the
   Unicode collation algorithm, ignoring case variants,
   but as an approximation, it is possible to use a set of "case
   folding" rules.
 
   Except for dotted i, doing it this way removes all locale 
   sensitivity - to get around the problem of dotted 
   i, there are two techniques:

    - skip case mapping on i and dotted i altogether
    - map all i and dotted i together
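
To make the third point above concrete, here is a minimal C sketch of
why uppercasing has to work on whole strings rather than single
characters: one input character (ß) becomes two output characters.
The name sketch_toupper is hypothetical and not part of GLib; the
function only knows about ASCII letters and ß, and it ignores locale
and every other Unicode rule.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, NOT a GLib function.
 * Uppercases ASCII letters and expands U+00DF (ß, UTF-8 bytes C3 9F)
 * to "SS". All other bytes are copied unchanged.
 * Returns a newly allocated string (release with free()), or NULL if
 * allocation fails. */
char *sketch_toupper(const char *s)
{
    size_t len;
    char *out, *p;

    assert(s != NULL);
    len = strlen(s);
    /* ß is 2 bytes in UTF-8 and "SS" is also 2 bytes, so the output
     * never needs more bytes than the input here, even though the
     * number of characters grows. */
    out = malloc(len + 1);
    if (out == NULL)
        return NULL;

    p = out;
    while (*s != '\0') {
        unsigned char c = (unsigned char) s[0];
        unsigned char d = (unsigned char) s[1]; /* at worst the NUL */

        if (c == 0xC3 && d == 0x9F) {
            /* one character in, two characters out */
            *p++ = 'S';
            *p++ = 'S';
            s += 2;
        } else if (c >= 'a' && c <= 'z') {
            *p++ = (char) (c - 'a' + 'A');
            s += 1;
        } else {
            *p++ = (char) c;
            s += 1;
        }
    }
    *p = '\0';
    return out;
}
```

Calling sketch_toupper ("gro\xC3\x9F"), i.e. the UTF-8 bytes for
"groß", gives "GROSS": four characters in, five characters out. An
interface that maps one gunichar to one gunichar cannot express that.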

So, the abstract operations are:

 TOUPPER (string, locale)
 TOLOWER (string, locale)
 TOTITLE (string, locale)
 FOLD (string, dotted-i-method)

Since we don't have a method of representing locale in GLib
right now, I think we should start out with:

 g_utf8_toupper (string);  [ priority A ]
 g_utf8_tolower (string);  [ priority A ]

These would be defined to use the "current" locale, at a minimum.
We can add g_utf8_to_upper_with_locale (string, locale) later.

It's not much work to add:

 g_utf8_totitle (string); [ priority C ]

Though I don't know any APIs that actually do this currently,
and title case only actually matters for some "compatibility"
characters in Unicode.

A case folding routine is probably also useful. I don't see
offering the choice of dotted-i-method as a good thing - 
I see no way a programmer would know what to pick. IMO,
we should simply pick one - probably the "merge all I's
together" method, and have:

 g_utf8_casefold (string); [ priority B ]
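
As a rough illustration of what the "merge all I's together" choice
could mean, here is a minimal C sketch that folds only a tiny subset:
ASCII letters, ß, and the four I variants (I, i, U+0130 capital I
with dot, U+0131 dotless i). The name sketch_casefold is hypothetical
and the rule set is an assumption made for the example; it is not a
statement of how g_utf8_casefold would be implemented.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, NOT a GLib function.
 * Folds a UTF-8 string for caseless matching using a deliberately
 * tiny rule set (an assumption for this sketch):
 *   - ASCII A-Z            -> a-z
 *   - U+00DF ß (C3 9F)     -> "ss"
 *   - U+0130 İ (C4 B0) and
 *     U+0131 ı (C4 B1)     -> "i"   (all I variants merged)
 * Returns a newly allocated string (release with free()), or NULL if
 * allocation fails. */
char *sketch_casefold(const char *s)
{
    size_t len;
    char *out, *p;

    assert(s != NULL);
    len = strlen(s);
    /* No rule above produces more bytes than it consumes. */
    out = malloc(len + 1);
    if (out == NULL)
        return NULL;

    p = out;
    while (*s != '\0') {
        unsigned char c = (unsigned char) s[0];
        unsigned char d = (unsigned char) s[1]; /* at worst the NUL */

        if (c == 0xC3 && d == 0x9F) {
            *p++ = 's';
            *p++ = 's';
            s += 2;
        } else if (c == 0xC4 && (d == 0xB0 || d == 0xB1)) {
            *p++ = 'i';
            s += 2;
        } else if (c >= 'A' && c <= 'Z') {
            *p++ = (char) (c - 'A' + 'a');
            s += 1;
        } else {
            *p++ = (char) c;
            s += 1;
        }
    }
    *p = '\0';
    return out;
}
```

With these rules the result does not depend on the locale: I, i,
capital I with dot and dotless i all fold to the same "i", so the
caller never has to choose a dotted-i method.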

There is also the question of "fuzzy" comparison routines -
the equivalent of strcasecmp - we actually have three axes
on which we can ignore differences:

 * Normalization (none, canonical, compat)
 * Case (unfolded, folded)
 * dotted-i-folding method

I _don't_ think we should offer all these possibilities; not
having a sense yet of what the right choices are, I'm inclined
to leave out such fuzzy comparison routines and let people
build what they need out of the primitives.
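
For what "building it out of the primitives" might look like, here is
a minimal C sketch of a caseless comparison layered on top of a
folding routine. Everything in it is hypothetical: the sketch_fold_fn
signature is an assumed shape for a folding primitive, and
sketch_ascii_fold is only an ASCII stand-in so that the example is
self-contained. Normalization is not handled at all.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumed shape of a folding primitive: takes a NUL-terminated string
 * and returns a newly allocated folded copy, or NULL on failure. */
typedef char *(*sketch_fold_fn)(const char *);

/* Stand-in folding primitive for the demo: ASCII letters only. */
char *sketch_ascii_fold(const char *s)
{
    size_t i, len;
    char *out;

    assert(s != NULL);
    len = strlen(s);
    out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    for (i = 0; i < len; i++) {
        char c = s[i];
        out[i] = (c >= 'A' && c <= 'Z') ? (char) (c - 'A' + 'a') : c;
    }
    out[len] = '\0';
    return out;
}

/* Hypothetical caseless comparison: fold both sides, then compare
 * the folded bytes. Returns <0, 0 or >0 like strcmp(). If folding
 * fails, it falls back to a plain case-sensitive strcmp(). */
int sketch_caseless_cmp(const char *a, const char *b, sketch_fold_fn fold)
{
    char *fa = fold(a);
    char *fb = fold(b);
    int result;

    if (fa == NULL || fb == NULL)
        result = strcmp(a, b);
    else
        result = strcmp(fa, fb);

    free(fa); /* free(NULL) is a no-op */
    free(fb);
    return result;
}
```

A caller who also wants to ignore normalization differences would
normalize both strings before folding; that choice stays with the
caller rather than being baked into a library routine.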

Regards,
                                        Owen



