UTF-8: Case mapping

From: Owen Taylor <otaylor redhat com>
To: gtk-devel-list gnome org
Cc: gtk-i18n-list gnome org, trow ximian com, darin bentspoon com
Subject: UTF-8: Case mapping
Date: 27 Jun 2001 13:33:03 -0400

Case conversion: [ http://bugzilla.gnome.org/show_bug.cgi?id=55852 ]

Issues in case mapping and case folding are described
in Unicode Technical Report #21: Case Mappings.
 
  http://www.unicode.org/unicode/reports/tr21/

Some of the less obvious attributes of case mapping:

 * Case mapping is locale-sensitive, though the total number of
   locale-sensitive rules is quite small. (The most important one:
   in Turkish, I is paired with dotless i, and i is paired with
   a capital I with a dot.)

 * Case mapping is context-sensitive; for instance, the proper
   lowercase equivalent of the Greek sigma depends on whether
   the letter occurs at the end of a word or elsewhere in it.

 * Case mapping can't be done character by character - for
   instance, German ß maps to SS in uppercase.

 * Converting to a fixed case is a poor way to do caseless
   comparison; properly, it should be done using the Unicode
   collation algorithm, ignoring case variants, but as an
   approximation, it is possible to use a set of "case
   folding" rules.
 
   Except for dotted i, doing it this way removes all locale 
   sensitivity - to get around the problem of dotted 
   i, there are two techniques:

    - skip case mapping on i and dotted i altogether
    - map all i and dotted i together
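
To make the points above concrete, here is a small illustration. It
uses Python's built-in string methods rather than anything in GLib, so
it only shows the Unicode behavior, not a proposed implementation. The
variable names are made up for the example.

```python
# Illustration only: Python's str methods, not GLib.

# Not character by character: one character becomes two.
upper_eszett = "stra\u00dfe".upper()          # "straße"
print(upper_eszett)                           # STRASSE

# Context-sensitive: capital sigma lowercases to a different
# letter at the end of a word than elsewhere in it.
lower_sigma = "\u03a3\u039f\u03a6\u039f\u03a3".lower()   # "ΣΟΦΟΣ"
print(lower_sigma)                            # σοφος

# Locale-sensitive: Python's upper() takes no locale, so it always
# gives the non-Turkish answer for i.
plain_upper_i = "i".upper()
print(plain_upper_i)                          # I
```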

So, the abstract operations are:

 TOUPPER (string, locale)
 TOLOWER (string, locale)
 TOTITLE (string, locale)
 FOLD (string, dotted-i-method)
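
As a rough sketch of what the locale argument buys you, TOUPPER and
TOLOWER might look like the following. This is Python, not GLib; the
function names, the TURKIC set and the locale-string parsing are all
assumptions made for the example, and only the Turkish/Azeri i rules
are handled (an I followed by a combining dot above is ignored).

```python
# Hypothetical sketch: locale-aware case mapping for the dotted-i rules only.

TURKIC = {"tr", "az"}   # languages that pair I/dotless i and dotted I/i


def _language(locale_name):
    """Extract the language code from a name like 'tr_TR.UTF-8'."""
    return locale_name.split("_")[0].split(".")[0].lower()


def to_upper(text, locale_name):
    if _language(locale_name) in TURKIC:
        # i -> capital I with dot above (U+0130)
        text = text.replace("i", "\u0130")
    return text.upper()


def to_lower(text, locale_name):
    if _language(locale_name) in TURKIC:
        # capital I with dot above -> i, then plain I -> dotless i (U+0131)
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()


print(to_upper("istanbul", "en_US"))   # ISTANBUL
print(to_upper("istanbul", "tr_TR"))   # İSTANBUL
print(to_lower("ISPARTA", "tr_TR"))    # ısparta
```

Everything outside the special cases falls through to the default
mapping, which matches the observation that the number of
locale-sensitive rules is small.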

Since we don't have a method of representing locale in GLib
right now, I think we should start out with:

 g_utf8_toupper (string);  [ priority A ]
 g_utf8_tolower (string);  [ priority A ]

These would be defined to use the "current" locale, at a minimum.
We can add g_utf8_to_upper_with_locale (string, locale) later.

It's not much work to add:

 g_utf8_totitle (string); [ priority C ]

Though I don't know any APIs that actually do this currently,
and title case only actually matters for some "compatibility"
characters in Unicode.

A case folding routine is probably also useful. I don't see
offering the choice of dotted-i-method as a good thing - 
I see no way a programmer would know what to pick. IMO,
we should simply pick one - probably the "merge all I's
together" method, and have:

 g_utf8_casefold (string); [ priority B ]
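
A minimal sketch of the "merge all I's together" method, again in
Python rather than GLib. The function name fold_merge_i is invented
for the example, and it leans on Python's str.casefold() for the
ordinary folding rules; it is one possible reading of the method, not
a specification.

```python
# Hypothetical sketch: locale-independent folding that merges all four i's.

def fold_merge_i(text):
    folded = text.casefold()
    # casefold() turns capital I with dot (U+0130) into "i" followed by
    # a combining dot above (U+0307), and leaves dotless i (U+0131)
    # alone. Map both onto plain "i".
    return folded.replace("i\u0307", "i").replace("\u0131", "i")


# I, i, capital I with dot, and dotless i all fold to the same key.
keys = {fold_merge_i(c) for c in ["I", "i", "\u0130", "\u0131"]}
print(keys)                        # {'i'}

print(fold_merge_i("Stra\u00dfe"))  # strasse
```

The cost of this choice is that Turkish text loses the distinction
between the two letters after folding, which is acceptable for
matching but not for display.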

There is also the question of "fuzzy" comparison routines -
the equivalent of strcasecmp - we actually have three axes
on which we can ignore differences:

 * Normalization (none, canonical, compat)
 * Case (unfolded, folded)
 * dotted-i-folding method

I _don't_ think we should offer all these possibilities; not
having a sense yet of what the right choices are, I'm inclined
to leave out such fuzzy comparison routines and let people
build what they need out of the primitives.
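
To show how such a routine could be built from the primitives, here is
a sketch covering the first two axes. It is Python, using the standard
unicodedata module; the names comparison_key and fuzzy_equal and the
"none"/"canonical"/"compat" strings are assumptions for the example.
The dotted-i axis is left out.

```python
# Hypothetical sketch: caseless comparison built from normalize + fold.
import unicodedata

FORMS = {"none": None, "canonical": "NFD", "compat": "NFKD"}


def comparison_key(text, normalization="none", fold_case=False):
    form = FORMS[normalization]
    if form is not None:
        text = unicodedata.normalize(form, text)
    if fold_case:
        text = text.casefold()
        # Folding can produce text that is no longer normalized.
        if form is not None:
            text = unicodedata.normalize(form, text)
    return text


def fuzzy_equal(a, b, normalization="none", fold_case=False):
    return (comparison_key(a, normalization, fold_case)
            == comparison_key(b, normalization, fold_case))


# Precomposed vs. decomposed e-acute differ only under no normalization.
print(fuzzy_equal("caf\u00e9", "cafe\u0301"))                             # False
print(fuzzy_equal("caf\u00e9", "cafe\u0301", normalization="canonical"))  # True

# Case folding alone handles the eszett expansion.
print(fuzzy_equal("Stra\u00dfe", "STRASSE", fold_case=True))              # True
```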

Regards,
                                        Owen

Follow-Ups:
- Re: UTF-8: Case mapping
  - From: Mark Leisher
- Re: UTF-8: Case mapping
  - From: Steve Underwood

References:
- UTF-8: Normalization
  - From: Owen Taylor
