UTF-8: Normalization
- From: Owen Taylor <otaylor redhat com>
- To: gtk-devel-list gnome org, gtk-i18n-list gnome org
- Cc: trow ximian com, darin bentspoon com
- Subject: UTF-8: Normalization
- Date: 27 Jun 2001 00:27:27 -0400
Some of the last remaining bugs we have open before the GLib-2.0
API freeze have to do with manipulation of Unicode strings.
The most crucial operations are generally:
- Normalization
- Collation
- Case mapping
The intent of these mails is to quickly lay out the problems with
some references, get some idea of where GLib should be going
in these areas, and then figure out what can be done in the
very short term to get minimal solutions in place.
Normalization: [ http://bugzilla.gnome.org/show_bug.cgi?id=55852 ]
Multiple Unicode strings can correspond to the "same" string. In order
to deal with such differences while comparing strings or searching, it
is often convenient to convert to a normalization form which
standardizes these differences.
Unicode defines two sorts of "sameness", basically:
- Canonical decomposition - transformations between precomposed
and decomposed forms that should be rendered identically.
- Compatibility decomposition - transformations between code
points that are the same under the rules for Unicode assignment,
but are separate for compatibility reasons (half and full-width
kana, fi ligatures, superscripts, etc.)
Four standard Unicode normalization forms are described in Unicode
Technical Report #15: Unicode Normalization Forms.
http://www.unicode.org/unicode/reports/tr15/
Roughly speaking these are:
Normalization form D: Canonical decomposition
Normalization form C: Canonical decomposition followed
by canonical composition
Normalization form KD: Compatibility decomposition
Normalization form KC: Compatibility decomposition followed
by canonical composition
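To make the difference between the two kinds of decomposition concrete, here is a minimal sketch in C. Everything in it is hypothetical: `DemoDecomp`, `demo_table` and `demo_decompose` are illustration names, and the table is a hand-picked four-entry subset rather than the real Unicode data. It only performs the decomposition step; it does no canonical reordering and no recursive decomposition, so it is not a full NFD or NFKD implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, hand-picked subset of the Unicode decomposition data. */
typedef struct {
    uint32_t ch;       /* code point to decompose */
    int      compat;   /* nonzero: compatibility mapping only */
    uint32_t to[2];    /* decomposition, padded with 0 if shorter */
} DemoDecomp;

static const DemoDecomp demo_table[] = {
    { 0x00E9, 0, { 0x0065, 0x0301 } },  /* e-acute -> e + COMBINING ACUTE ACCENT */
    { 0x00C5, 0, { 0x0041, 0x030A } },  /* A-ring  -> A + COMBINING RING ABOVE */
    { 0xFB01, 1, { 0x0066, 0x0069 } },  /* fi ligature -> f + i */
    { 0x00B2, 1, { 0x0032, 0 } },       /* superscript two -> 2 */
};

/* Decompose IN (N code points) into OUT, which must have room for
   2 * N code points.  With COMPAT == 0 only canonical mappings are
   applied; with COMPAT != 0 compatibility mappings are applied too.
   Returns the number of code points written. */
size_t demo_decompose(const uint32_t *in, size_t n, int compat, uint32_t *out)
{
    const size_t table_len = sizeof demo_table / sizeof demo_table[0];
    size_t i, j, k, len = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        const DemoDecomp *hit = NULL;

        for (j = 0; j < table_len; j++) {
            if (demo_table[j].ch == in[i] &&
                (compat || !demo_table[j].compat)) {
                hit = &demo_table[j];
                break;
            }
        }
        if (hit == NULL) {
            out[len++] = in[i];          /* no mapping: copy through */
            continue;
        }
        for (k = 0; k < 2 && hit->to[k] != 0; k++)
            out[len++] = hit->to[k];
    }
    return len;
}
```

With this table, canonical decomposition leaves the fi ligature (U+FB01) alone, while compatibility decomposition splits it into "f" and "i"; both modes split U+00E9 into "e" plus a combining acute accent.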
Some example use cases:
D: General text processing
C: Rendering using technology that doesn't handle combining accents
KD: Searching where you want to ignore compatibility distinctions
KC: ???
Currently, GLib has only:
/* Compute canonical ordering of a string in-place. This rearranges
decomposed characters in the string according to their combining
classes. See the Unicode manual for more information. */
void g_unicode_canonical_ordering (gunichar *string,
gsize len);
/* Compute canonical decomposition of a character. Returns g_malloc()d
string of Unicode characters. RESULT_LEN is set to the resulting
length of the string. */
gunichar *g_unicode_canonical_decomposition (gunichar ch,
gsize *result_len);
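For readers unfamiliar with canonical ordering, the following is a rough sketch of what that step amounts to. It is not the GLib implementation: `demo_combining_class` and `demo_canonical_ordering` are hypothetical names, and the combining-class lookup covers only four combining marks, with every other code point treated as class 0 (a starter). Adjacent marks are swapped whenever the earlier one has a higher non-zero class than the later one, and marks never move across a starter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a real combining-class lookup.  Only a few
   combining marks are listed; everything else is treated as class 0. */
int demo_combining_class(uint32_t ch)
{
    switch (ch) {
    case 0x0327: return 202;  /* COMBINING CEDILLA */
    case 0x0323: return 220;  /* COMBINING DOT BELOW */
    case 0x0301: return 230;  /* COMBINING ACUTE ACCENT */
    case 0x0308: return 230;  /* COMBINING DIAERESIS */
    default:     return 0;
    }
}

/* Reorder STRING in place so that, within each run of combining marks,
   combining classes are in non-decreasing order.  Marks with equal
   classes keep their relative order.  A simple bubble sort is enough
   for a sketch, since runs of combining marks are normally short. */
void demo_canonical_ordering(uint32_t *string, size_t len)
{
    int swapped;

    assert(string != NULL || len == 0);
    do {
        size_t i;

        swapped = 0;
        for (i = 1; i < len; i++) {
            int prev = demo_combining_class(string[i - 1]);
            int cur = demo_combining_class(string[i]);

            /* Never move a mark across a starter (class 0). */
            if (cur != 0 && prev > cur) {
                uint32_t tmp = string[i - 1];
                string[i - 1] = string[i];
                string[i] = tmp;
                swapped = 1;
            }
        }
    } while (swapped);
}
```

For example, "a" followed by COMBINING ACUTE ACCENT (class 230) and COMBINING DOT BELOW (class 220) is reordered so that the dot below comes first, which makes both input orders compare equal afterwards.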
These two functions need quite a bit of code on top even to get to NFD. The
obvious API would be something like:
====
typedef enum
{
G_NORMALIZE_NONE,
G_NORMALIZE_D,
G_NORMALIZE_C,
G_NORMALIZE_KC,
G_NORMALIZE_KD
} GNormalizeMode;
gchar *g_utf8_normalize (gchar *str,
gssize len,
GNormalizeMode type);
====
The main disadvantage here is the naming of the constants. A call like:
normalized = g_utf8_normalize (str, -1, G_NORMALIZE_D);
could be considered quite obscure. ICU has:
UCOL_NO_NORMALIZATION = 1,
UCOL_DECOMP_CAN = 2,
UCOL_DECOMP_COMPAT = 3,
UCOL_DEFAULT_NORMALIZATION = UCOL_DECOMP_COMPAT,
UCOL_DECOMP_CAN_COMP_COMPAT = 4,
UCOL_DECOMP_COMPAT_COMP_CAN = 5,
UNORM_NONE = 1,
UNORM_NFD = 2,
UNORM_NFKD = 3,
UNORM_NFC = 4,
UNORM_DEFAULT = UNORM_NFC,
UNORM_NFKC = 5,
I don't know if:
D => G_NORMALIZE_DECOMP_CAN
KD => G_NORMALIZE_DECOMP_COMPAT
C => G_NORMALIZE_DECOMP_CAN_COMP_COMPAT
KC => G_NORMALIZE_DECOMP_COMPAT_COMP_CAN
Is really clearer ... ;-) I think it's just longer. Another possibility
would be something like:
G_NORMALIZE_DECOMPOSE /* Unicode NFD */
G_NORMALIZE_COMPOSE /* Unicode NFC */
G_NORMALIZE_DECOMPOSE_FUZZY /* Unicode NFKD */
G_NORMALIZE_COMPOSE_FUZZY /* Unicode NFKC */
To try and describe things according to usage type. My guess is
that something like this is probably most user friendly.
Implementation shouldn't be that hard - probably an afternoon or so.
The algorithms are pretty straightforward, and the ICU code is
available for reference if necessary.
I'd rate adding something like this a "B" priority for GLib-2.0 --
it's quite important in some circumstances, but most people won't know
enough to know they need it. It should be noted that Java, the
.NET standard library, Qt, etc, don't offer routines for normalization.
ICU also has:
U_CAPI UNormalizationCheckResult U_EXPORT2
unorm_quickCheck(const UChar* source,
int32_t sourcelength,
UNormalizationMode mode,
UErrorCode* status);
To quickly check if a string is in a particular normalization form,
much more efficiently than converting it to that normalization.
While I could eventually see adding something like this, this
is even more specialized, and I don't consider it a candidate
for a GLib-2.0 API addition.
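To illustrate why a quick check can be so much cheaper than normalizing, here is a deliberately conservative sketch. It relies on one fact only: pure ASCII text is left unchanged by all four normalization forms. The names `DemoQuickCheck` and `demo_utf8_quick_check` are hypothetical and are neither ICU nor GLib API; a real quick check would consult per-character property data and could also answer "no".

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    DEMO_NORM_NO,     /* definitely not normalized (never returned here) */
    DEMO_NORM_YES,    /* definitely normalized */
    DEMO_NORM_MAYBE   /* unknown: caller has to normalize and compare */
} DemoQuickCheck;

/* Conservative quick check for a UTF-8 string of LEN bytes.  Pure ASCII
   input is already in NFD, NFC, NFKD and NFKC, so the answer is YES.
   Any byte >= 0x80 means a non-ASCII character is present, and this
   sketch gives up and answers MAYBE. */
DemoQuickCheck demo_utf8_quick_check(const char *str, size_t len)
{
    size_t i;

    assert(str != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if ((unsigned char) str[i] >= 0x80)
            return DEMO_NORM_MAYBE;
    }
    return DEMO_NORM_YES;
}
```

A caller would run the full normalization only on a MAYBE answer, so mostly-ASCII workloads skip nearly all of the expensive work.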
Regards,
Owen