UTF-8: Normalization

From: Owen Taylor <otaylor redhat com>
To: gtk-devel-list gnome org, gtk-i18n-list gnome org
Cc: trow ximian com, darin bentspoon com
Subject: UTF-8: Normalization
Date: 27 Jun 2001 00:27:27 -0400

Some of the last remaining bugs we have open before the GLib-2.0
API freeze have to do with manipulation of Unicode strings.

The most crucial operations are generally:

 - Normalization
 - Collation
 - Case mapping

The intent of these mails is to quickly lay out the problems with
some references, get some idea of where GLib should be going
in these areas, and then figure out what can be done in the
very short term to get minimal solutions in place.


Normalization: [ http://bugzilla.gnome.org/show_bug.cgi?id=55852 ]

Multiple Unicode strings can correspond to the "same" string. In order
to deal with such differences while comparing strings or searching, it
is often convenient to convert to a normalization form which
standardizes these differences.
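
As a concrete illustration (not part of the bug report), "é" can be
encoded either as the single code point U+00E9 or as U+0065 followed by
U+0301 COMBINING ACUTE ACCENT. A minimal sketch in plain C, with a
hypothetical helper same_bytes(), showing that a bytewise comparison
treats the two encodings as different strings:

```c
#include <string.h>

/* "é" as one precomposed code point: U+00E9 (UTF-8 bytes C3 A9). */
static const char e_acute_composed[] = "\xC3\xA9";

/* "é" as U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT
   (UTF-8 bytes 65 CC 81). */
static const char e_acute_decomposed[] = "e\xCC\x81";

/* Hypothetical helper: returns 1 if the two strings are bytewise
   identical, 0 otherwise.  No normalization is performed, so the
   two spellings of "é" above compare as different. */
static int
same_bytes (const char *a, const char *b)
{
  return strcmp (a, b) == 0;
}
```

Comparing the two after converting both to the same normalization form
is what makes them compare equal.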

Unicode defines two sorts of "sameness", basically: 

  - Canonical decomposition - transformations between precomposed
    and decomposed forms that should be identically rendered.

  - Compatibility decomposition - transformations between code
    points that are the same under the rules for Unicode assignment,
    but are separate for compatibility reasons (half and full-width
    kana, fi ligatures, superscripts, etc.)

Four standard Unicode normalization forms are described in Unicode
Technical Report #15: Unicode Normalization Forms.

 http://www.unicode.org/unicode/reports/tr15/

Roughly speaking these are:

 Normalization form D:  Canonical decomposition
 Normalization form C:  Canonical decomposition followed
                        by canonical composition
 Normalization form KD: Compatibility decomposition
 Normalization form KC: Compatibility decomposition followed
                        by canonical composition

Some example use cases:

 D: General text processing 
 C: Rendering using technology that doesn't handle combining accents
 KD: Searching where you want to ignore compatibility distinctions 
 KC: ??? 


Currently, GLib has only:

/* Compute canonical ordering of a string in-place.  This rearranges
   decomposed characters in the string according to their combining
   classes.  See the Unicode manual for more information.  */
void g_unicode_canonical_ordering (gunichar *string,
				   gsize     len);

/* Compute canonical decomposition of a character.  Returns g_malloc()d
   string of Unicode characters.  RESULT_LEN is set to the resulting
   length of the string.  */
gunichar *g_unicode_canonical_decomposition (gunichar  ch,
					     gsize    *result_len);

These need quite a bit of code on top even to get to NFD. The
obvious API would be something like:

====
typedef enum
{
  G_NORMALIZE_NONE,
  G_NORMALIZE_D,
  G_NORMALIZE_C,  
  G_NORMALIZE_KC,  
  G_NORMALIZE_KD  
} GNormalizeMode;

gchar *g_utf8_normalize (gchar          *str,
                         gssize          len,
                         GNormalizeMode  type);
====

The main disadvantage here is the naming of the constants -

 normalized = g_utf8_normalize (str, -1, G_NORMALIZE_D);

Could be considered to be quite obscure. ICU has:

 UCOL_NO_NORMALIZATION = 1,
 UCOL_DECOMP_CAN = 2,
 UCOL_DECOMP_COMPAT = 3,
 UCOL_DEFAULT_NORMALIZATION = UCOL_DECOMP_COMPAT, 
 UCOL_DECOMP_CAN_COMP_COMPAT = 4,
 UCOL_DECOMP_COMPAT_COMP_CAN =5,
 UNORM_NONE = 1, 
 UNORM_NFD = 2,
 UNORM_NFKD = 3,
 UNORM_NFC = 4,
 UNORM_DEFAULT = UNORM_NFC, 
 UNORM_NFKC =5,

I don't know if:

 D   => G_NORMALIZE_DECOMP_CAN
 KD  => G_NORMALIZE_DECOMP_COMPAT
 C   => G_NORMALIZE_DECOMP_CAN_COMP_COMPAT 
 KC  => G_NORMALIZE_DECOMP_COMPAT_COMP_CAN

Is really clearer ... ;-)  I think it's just longer. Another possibility
would be something like:

 G_NORMALIZE_DECOMPOSE /* Unicode NFD */
 G_NORMALIZE_COMPOSE   /* Unicode NFC */
 G_NORMALIZE_DECOMPOSE_FUZZY /* Unicode NFKD */
 G_NORMALIZE_COMPOSE_FUZZY /* Unicode NFKC */

To try and describe things according to usage type. My guess is
that something like this is probably most user friendly. 

Implementation shouldn't be that hard - probably an afternoon or so.
The algorithms are pretty straightforward, and the ICU code is
available for reference if necessary.
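
To give a feel for the size of the job, here is a rough sketch of the
NFD half: recursive canonical decomposition followed by canonical
ordering. It is self-contained and does not use GLib. All the toy_*
names are hypothetical, and the two small tables are stand-ins for the
full Unicode data (the entries are real, but the tables are
deliberately incomplete, and Hangul is ignored entirely):

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the canonical decomposition table. */
typedef struct {
  uint32_t ch;         /* precomposed code point */
  uint32_t decomp[2];  /* its canonical decomposition */
} ToyDecomp;

static const ToyDecomp toy_decomps[] = {
  { 0x00E9, { 0x0065, 0x0301 } },  /* e-acute -> e + acute */
  { 0x1E69, { 0x1E63, 0x0307 } },  /* s-dot-below-dot-above -> U+1E63 + dot above */
  { 0x1E63, { 0x0073, 0x0323 } },  /* s-dot-below -> s + dot below */
};

/* Toy stand-in for the combining class table. */
static int
toy_combining_class (uint32_t ch)
{
  switch (ch)
    {
    case 0x0301: return 230;  /* COMBINING ACUTE ACCENT */
    case 0x0307: return 230;  /* COMBINING DOT ABOVE */
    case 0x0323: return 220;  /* COMBINING DOT BELOW */
    default:     return 0;    /* everything else is treated as a starter */
    }
}

/* Append the full (recursive) canonical decomposition of CH to OUT,
   which already holds N entries.  Returns the new length. */
static size_t
toy_decompose (uint32_t ch, uint32_t *out, size_t n)
{
  size_t i, j;

  for (i = 0; i < sizeof toy_decomps / sizeof toy_decomps[0]; i++)
    if (toy_decomps[i].ch == ch)
      {
        for (j = 0; j < 2; j++)
          n = toy_decompose (toy_decomps[i].decomp[j], out, n);
        return n;
      }

  out[n++] = ch;
  return n;
}

/* Canonical ordering: a stable insertion sort that moves a combining
   mark left past marks of strictly higher class, never past a starter. */
static void
toy_canonical_ordering (uint32_t *s, size_t len)
{
  size_t i, j;

  for (i = 1; i < len; i++)
    for (j = i; j > 0; j--)
      {
        int prev = toy_combining_class (s[j - 1]);
        int cur = toy_combining_class (s[j]);
        uint32_t tmp;

        if (cur == 0 || prev <= cur)
          break;
        tmp = s[j - 1];
        s[j - 1] = s[j];
        s[j] = tmp;
      }
}

/* Write the NFD of IN (LEN code points) to OUT and return its length.
   OUT must have room for 3 * LEN entries, the worst case for the toy
   table above. */
static size_t
toy_nfd (const uint32_t *in, size_t len, uint32_t *out)
{
  size_t i, n = 0;

  for (i = 0; i < len; i++)
    n = toy_decompose (in[i], out, n);
  toy_canonical_ordering (out, n);
  return n;
}
```

With these tables, both U+1E69 and the sequence U+0073 U+0307 U+0323
come out as U+0073 U+0323 U+0307, which is the point of the exercise.
The compatibility forms would need a second decomposition table, and
the C forms a composition pass on top.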

I'd rate adding something like this a "B" priority for GLib-2.0 --
it's quite important in some circumstances, but most people won't know
enough to know they need it. It should be noted that Java, the
.NET standard library, Qt, etc., don't offer routines for normalization.

ICU also has:

U_CAPI UNormalizationCheckResult U_EXPORT2
unorm_quickCheck(const UChar*       source,
                 int32_t            sourcelength, 
                 UNormalizationMode mode, 
                 UErrorCode*        status);

To quickly check if a string is in a particular normalization form,
much more efficiently than converting it to that normalization.
While I could eventually see adding something like this, this
is even more specialized, and I don't consider it a candidate
for a GLib-2.0 API addition.
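
For reference, the idea behind such a check can be sketched for the NFD
case: a string is already in NFD if no character has a canonical
decomposition and the combining marks are already in canonical order,
so one pass with no allocation is enough. This is only an illustration
and not ICU's implementation; the toy_* names are hypothetical and the
tables are tiny stand-ins for the real Unicode data. (For the composed
forms ICU's check can also answer "maybe", which this sketch does not
attempt.)

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { TOY_CHECK_NO, TOY_CHECK_YES } ToyCheckResult;

/* Toy stand-in: does CH have a canonical decomposition? */
static int
toy_has_decomposition (uint32_t ch)
{
  switch (ch)
    {
    case 0x00E9:  /* e-acute */
    case 0x00F1:  /* n-tilde */
      return 1;
    default:
      return 0;
    }
}

/* Toy stand-in for the combining class table. */
static int
toy_combining_class (uint32_t ch)
{
  switch (ch)
    {
    case 0x0301: return 230;  /* COMBINING ACUTE ACCENT */
    case 0x0307: return 230;  /* COMBINING DOT ABOVE */
    case 0x0323: return 220;  /* COMBINING DOT BELOW */
    default:     return 0;    /* treated as a starter */
    }
}

/* Single pass over S: answer NO as soon as a decomposable character
   or an out-of-order combining mark is seen. */
static ToyCheckResult
toy_quick_check_nfd (const uint32_t *s, size_t len)
{
  int last_class = 0;
  size_t i;

  for (i = 0; i < len; i++)
    {
      int cc = toy_combining_class (s[i]);

      if (toy_has_decomposition (s[i]))
        return TOY_CHECK_NO;
      if (cc != 0 && cc < last_class)
        return TOY_CHECK_NO;
      last_class = cc;
    }

  return TOY_CHECK_YES;
}
```

The saving comes from the common case: text that is already normalized
is scanned once and never copied.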

Regards,
                                        Owen

Follow-Ups:
- UTF-8: Collation
  - From: Owen Taylor
- UTF-8: Case mapping
  - From: Owen Taylor