UTF-8: Case mapping
- From: Owen Taylor <otaylor redhat com>
- To: gtk-devel-list gnome org
- Cc: gtk-i18n-list gnome org, trow ximian com, darin bentspoon com
- Subject: UTF-8: Case mapping
- Date: 27 Jun 2001 13:33:03 -0400
Case conversion: [ http://bugzilla.gnome.org/show_bug.cgi?id=55852 ]
Issues in case mapping and case folding are described
in Unicode Technical Report #21: Case mapping.
http://www.unicode.org/unicode/reports/tr21/
Some of the less obvious attributes of case mapping:
* Case mapping is locale sensitive, though the total number of
locale-sensitive rules is quite small. (The most important one:
in Turkish, I is paired with dotless i, and i is paired with
a capital I with a dot.)
* Case mapping is context-sensitive; for instance, the proper
lowercase equivalent of the Greek sigma depends on whether
the letter occurs at the end of a word or elsewhere in it.
* Case mapping can't be done character by character - for
instance, German ß maps to SS in uppercase.
* Converting to a fixed case is a poor way to do caseless
comparison; properly, such comparison should be done using
the Unicode collation algorithm, ignoring case variants,
but as an approximation, it is possible to use a set of "case
folding" rules.
Except for dotted i, doing it this way removes all locale
sensitivity - to get around the problem of dotted
i, there are two techniques:
- skip case mapping on i and dotted i altogether
- map all i and dotted i together
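
To make the points above concrete, here is a small sketch using
Python's built-in string methods. This is only an illustration of the
Unicode behavior, not of any GLib API; Python's methods are
locale-independent, so the Turkish rule is deliberately not applied.

```python
# Illustration only: Python's built-in str methods, not the GLib API.

# One character can map to several: German sharp s uppercases to "SS".
sharp_s_upper = "\u00df".upper()

# Context sensitivity: capital sigma (U+03A3) lowercases to final sigma
# (U+03C2) at the end of a word and to U+03C3 elsewhere.
# The word below is the Greek capitals omicron, delta, omicron, sigma.
word_lower = "\u039f\u0394\u039f\u03a3".lower()
final_letter = word_lower[-1]

# These methods ignore the locale, so the Turkish pairing is not used:
# plain "i" uppercases to plain "I".
i_upper = "i".upper()

# Case folding gives a locale-independent key for caseless comparison.
same_when_folded = "Stra\u00dfe".casefold() == "STRASSE".casefold()

print(sharp_s_upper, word_lower, i_upper, same_when_folded)
```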
So, the abstract operations are:
TOUPPER (string, locale)
TOLOWER (string, locale)
TOTITLE (string, locale)
FOLD (string, dotted-i-method)
Since we don't have a method of representing locale in GLib
right now, I think we should start out with:
g_utf8_toupper (string); [ priority A ]
g_utf8_tolower (string); [ priority A ]
Defined to use the "current" locale as the minimum.
We can add g_utf8_to_upper_with_locale (string, locale) later.
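
As a rough sketch of what a locale-taking variant might do, here is
the Turkish rule layered on top of a locale-independent mapping, in
Python. The function names and the language-tag argument are
assumptions made for illustration, not a proposed GLib signature, and
the sketch ignores the rarer rules (such as I followed by a combining
dot above).

```python
# Hypothetical helpers; the names and the language-tag argument are
# assumptions for illustration, not GLib functions.
TURKIC = {"tr", "az"}

def to_upper(text, lang=""):
    if lang in TURKIC:
        # Turkish pairs i with capital dotted I (U+0130).
        text = text.replace("i", "\u0130")
    return text.upper()

def to_lower(text, lang=""):
    if lang in TURKIC:
        # Turkish pairs I with dotless i (U+0131), and U+0130 with i.
        text = text.replace("I", "\u0131").replace("\u0130", "i")
    return text.lower()

print(to_upper("istanbul"), to_upper("istanbul", "tr"))
print(to_lower("ISTANBUL"), to_lower("ISTANBUL", "tr"))
```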
It's not much work to add:
g_utf8_totitle (string); [ priority C ]
Though I don't know any APIs that actually do this currently,
and title case only actually matters for some "compatibility"
characters in Unicode.
A case folding routine is probably also useful. I don't see
offering the choice of dotted-i-method as a good thing -
I see no way a programmer would know what to pick. IMO,
we should simply pick one - probably the "merge all I's
together" method, and have:
g_utf8_casefold (string); [ priority B ]
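
For comparison, the two dotted-i techniques could look roughly like
this in Python. Both function names are made up for the example, and
Python's str.casefold() stands in for the Unicode case folding rules;
this is a sketch of the idea, not of how g_utf8_casefold would be
implemented.

```python
# Hypothetical names; str.casefold() stands in for the folding rules.

def casefold_merge_i(text):
    # "Merge" method: map U+0130 and U+0131 to plain "i" first;
    # casefold() then maps "I" to "i" as well.
    merged = text.translate({0x0130: "i", 0x0131: "i"})
    return merged.casefold()

def casefold_skip_i(text):
    # "Skip" method: leave all four i variants untouched and fold
    # every other character.
    keep = {"I", "i", "\u0130", "\u0131"}
    return "".join(ch if ch in keep else ch.casefold() for ch in text)

print(casefold_merge_i("\u0130STANBUL"), casefold_skip_i("TITLE"))
```

With the merge method, all four spellings of the initial letter fold
to the same key; with the skip method, "TITLE" and "title" no longer
compare equal, which may surprise callers outside Turkish locales.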
There is also the question of "fuzzy" comparison routines -
the equivalent of strcasecmp - we actually have three axes
on which we can ignore differences:
* Normalization (none, canonical, compat)
* Case (unfolded, folded)
* dotted-i-folding method
I _don't_ think we should offer all these possibilities; not
having a sense yet of what the right choices are, I'm inclined
to leave out such fuzzy comparison routines and let people
build what they need out of the primitives.
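
As an example of building such a comparison out of primitives, here
is a sketch in Python covering the first two axes (normalization and
case). The function and parameter names are assumptions for the
example; the dotted-i axis is left out.

```python
import unicodedata

def compare_key(text, normalization="NFD", fold_case=True):
    # normalization: None, "NFD" (canonical) or "NFKD" (compatibility)
    if normalization:
        text = unicodedata.normalize(normalization, text)
    if fold_case:
        text = text.casefold()
        if normalization:
            # Folding can denormalize the string, so normalize again.
            text = unicodedata.normalize(normalization, text)
    return text

def fuzzy_equal(a, b, **options):
    return compare_key(a, **options) == compare_key(b, **options)

# Precomposed e-acute versus "E" followed by a combining acute accent.
print(fuzzy_equal("Caf\u00e9", "CAFE\u0301"))
print(fuzzy_equal("Caf\u00e9", "CAFE\u0301", normalization=None))
```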
Regards,
Owen