Re: UTF-8 Functions

From: Owen Taylor <otaylor redhat com>
To: Darin Adler <darin bentspoon com>
Cc: gtk-devel-list gnome org, gtk-i18n-list gnome org, trow ximian com
Subject: Re: UTF-8 Functions
Date: 02 Jul 2001 19:18:43 -0400

Darin Adler <darin bentspoon com> writes:

> On Sunday, July 1, 2001, at 06:17  PM, Owen Taylor wrote:
> 
> > But if people have other easy-to-implement improvements, I'd be
> > happy to consider them as well.
> 
> My first thought is that if we plan to some day implement a more
> sophisticated collation algorithm, it might be good to have a single
> function that combines g_utf8_casefold and g_utf8_collate_key. It
> might be nice if the fact that these are two separate operations today
> doesn't prevent us from doing the casefolded collation efficiently
> later.

Well, remember, the most general interface is along the lines of
what Java provides, something like:

  typedef enum {
    G_COLLATE_PRIMARY,    /* Ignores accent and case differences */
    G_COLLATE_SECONDARY,  /* Ignores case differences */
    G_COLLATE_TERTIARY,
    G_COLLATE_IDENTICAL
  } GCollateStrength;

  g_utf8_collate_key_extended (string, strength, normalization_mode);

But we'd only be able to meaningfully implement a very small
subset of that currently.

g_utf8_collate_key () corresponds roughly to
g_utf8_collate_key_extended (string, 
                             G_COLLATE_TERTIARY,
                             G_NORMALIZE_ALL_COMPOSE);

And may be reimplemented as something like that in the future. The
question, I guess, is whether it is worth adding:

g_utf8_collate_key_casefold (), which is currently

 g_utf8_collate_key (g_utf8_casefold (string));

But might eventually be implemented as:

 g_utf8_collate_key_extended (string, 
                              G_COLLATE_SECONDARY,
                              G_NORMALIZE_ALL_COMPOSE);

[ There are issues of correctness here as well as efficiency ]
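
To make the relationship concrete, here is a toy sketch in plain C,
restricted to ASCII, of how one strength-parameterized key function
could subsume the case-folded variant. The names ToyStrength,
toy_collate_key_extended and toy_collate_key_casefold are made up for
illustration and are not GLib API; the key layout is also an
assumption, and real collation keys are locale-dependent and far more
involved.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical strength levels for this toy; not the GLib enum. */
typedef enum {
  TOY_COLLATE_SECONDARY,  /* case differences are ignored */
  TOY_COLLATE_TERTIARY    /* case differences break ties */
} ToyStrength;

/* Write a key for the ASCII string `str` into `out` (capacity `out_len`).
 * Keys are meant to be compared with strcmp().
 * Layout: the lowercased characters, then, for TERTIARY only, a 0x01
 * separator followed by one 'l'/'u' case marker per character.
 * Returns 0 on success, -1 if `out` is too small. */
int toy_collate_key_extended(const char *str, ToyStrength strength,
                             char *out, size_t out_len)
{
  size_t n, i, need;

  assert(str != NULL && out != NULL);
  n = strlen(str);
  need = (strength == TOY_COLLATE_TERTIARY) ? 2 * n + 2 : n + 1;
  if (out_len < need)
    return -1;

  /* Primary part of the key: case-insensitive letters. */
  for (i = 0; i < n; i++)
    out[i] = (char) tolower((unsigned char) str[i]);

  if (strength == TOY_COLLATE_TERTIARY) {
    /* Case markers come after all letters, so they only matter
     * when the lowercased strings are equal. */
    out[n] = '\x01';
    for (i = 0; i < n; i++)
      out[n + 1 + i] = isupper((unsigned char) str[i]) ? 'u' : 'l';
    out[2 * n + 1] = '\0';
  } else {
    out[n] = '\0';
  }
  return 0;
}

/* The convenience name is then just one point in the parameter space. */
int toy_collate_key_casefold(const char *str, char *out, size_t out_len)
{
  return toy_collate_key_extended(str, TOY_COLLATE_SECONDARY, out, out_len);
}
```

With this layout "Apple" and "aPPLE" get the same key from
toy_collate_key_casefold (), while the TERTIARY keys for "Apple" and
"apple" differ only in the trailing case markers.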

It's certainly easy enough to do ... just a few lines of code.  My
main hesitation is whether we know yet whether that is the right part
of the parameter space to give a special name.

Enough thinking out loud... I'll give it some consideration.

> Also, just out of curiosity, I'd like to understand if
> g_utf8_collate_key provides any guarantee about how it will work with
> strings and various normalizations of the same string. Will a
> normalized string collate == the same string before it was normalized?
> For which flavors of normalization?

The two collation functions both perform normalization with 
G_NORMALIZE_ALL_COMPOSE as the first step. G_NORMALIZE_ALL_COMPOSE
is Unicode NFKC - compatibility decomposition followed by
canonical composition. Since:

 NFKC(NF<X>(c)) == NFKC(c)

For all normalization forms NF<X>, this means that normalization
before collation has no effect on collation order.
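
As a rough illustration of that identity, here is a toy model in plain
C that knows only two mappings: U+00E9 decomposes canonically to
U+0065 U+0301, and U+FB01 (the "fi" ligature) has the compatibility
decomposition U+0066 U+0069. All toy_* names are hypothetical, the
model ignores reordering of combining marks, and it is not how GLib
implements normalization.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX 32  /* capacity of every code-point buffer in this toy */

typedef enum { TOY_NFD, TOY_NFC, TOY_NFKD, TOY_NFKC } ToyForm;

/* Decompose `in` (n code points) into `out`; returns the new length.
 * U+00E9 is always decomposed; U+FB01 only when `compat` is set. */
static size_t toy_decompose(const unsigned *in, size_t n, int compat,
                            unsigned *out)
{
  size_t i, m = 0;

  for (i = 0; i < n; i++) {
    assert(m + 2 <= TOY_MAX);
    if (in[i] == 0x00E9u) {
      out[m++] = 0x0065u; out[m++] = 0x0301u;
    } else if (compat && in[i] == 0xFB01u) {
      out[m++] = 0x0066u; out[m++] = 0x0069u;
    } else {
      out[m++] = in[i];
    }
  }
  return m;
}

/* Canonical composition in place: U+0065 U+0301 becomes U+00E9. */
static size_t toy_compose(unsigned *buf, size_t n)
{
  size_t i, m = 0;

  for (i = 0; i < n; i++) {
    if (buf[i] == 0x0065u && i + 1 < n && buf[i + 1] == 0x0301u) {
      buf[m++] = 0x00E9u;
      i++;  /* skip the combining accent that was just consumed */
    } else {
      buf[m++] = buf[i];
    }
  }
  return m;
}

/* Normalize `in` to `form`, writing to `out`; returns the new length. */
size_t toy_normalize(const unsigned *in, size_t n, ToyForm form,
                     unsigned *out)
{
  int compat = (form == TOY_NFKD || form == TOY_NFKC);
  size_t m = toy_decompose(in, n, compat, out);

  if (form == TOY_NFC || form == TOY_NFKC)
    m = toy_compose(out, m);
  return m;
}

/* Returns 1 if NFKC(form(s)) equals NFKC(s) in this toy model, else 0. */
int toy_nfkc_absorbs(const unsigned *s, size_t n, ToyForm form)
{
  unsigned pre[TOY_MAX], a[TOY_MAX], b[TOY_MAX];
  size_t np, na, nb, i;

  np = toy_normalize(s, n, form, pre);
  na = toy_normalize(pre, np, TOY_NFKC, a);
  nb = toy_normalize(s, n, TOY_NFKC, b);
  if (na != nb)
    return 0;
  for (i = 0; i < na; i++)
    if (a[i] != b[i])
      return 0;
  return 1;
}
```

For an input such as U+FB01 U+0065 U+0301 U+00E9, toy_nfkc_absorbs ()
holds for all four forms, which is why pre-normalizing a string cannot
change where it collates.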

Regards,
                                        Owen

