Re: Character normalization ?

From: Owen Taylor <otaylor redhat com>
To: veillard redhat com
Cc: gnome-hackers gnome org
Subject: Re: Character normalization ?
Date: Mon, 25 Mar 2002 16:16:55 -0500 (EST)

Daniel Veillard <veillard redhat com> writes:

> On Mon, Mar 25, 2002 at 03:50:39PM -0500, Gnome CVS User wrote:
> > Log message:
> > Mon Mar 25 15:46:54 2002  Owen Taylor  <otaylor redhat com>
> > 
> > * modules/basic/basic-*.c: Convert U+00A0 (NON BREAK SPACE)
> > to U+0020 (SPACE)
> 
>   Hum, by the way, now that we have a decent internationalized
> framework, one of the annoyances of Unicode is character normalization,
> i.e. remapping sometimes sequences of Unicode chars to a single one.
> The I18N working group at W3C is pushing hard for "early" normalization
> [1] i.e. make sure that most of the APIs see only Normalized Content. 
>   Can you tell me/us a bit on this issue ? Is there anything in place,
> should we make any decision about this ? This can affect a number of things
> like string searches and compare which otherwise are real pain.

This *particular* change only deals with shaping, that is, the level
where we find the right glyph to render a particular character, so
it's not really related to the question of normalization.

GLib contains the necessary function to do normalization to any of the
four standard Unicode forms:

typedef enum {
  G_NORMALIZE_DEFAULT,
  G_NORMALIZE_NFD = G_NORMALIZE_DEFAULT,
  G_NORMALIZE_DEFAULT_COMPOSE,
  G_NORMALIZE_NFC = G_NORMALIZE_DEFAULT_COMPOSE,
  G_NORMALIZE_ALL,
  G_NORMALIZE_NFKD = G_NORMALIZE_ALL,
  G_NORMALIZE_ALL_COMPOSE,
  G_NORMALIZE_NFKC = G_NORMALIZE_ALL_COMPOSE
} GNormalizeMode;

gchar *g_utf8_normalize (const gchar   *str,
			 gssize         len,
			 GNormalizeMode mode);

It's basically a two-dimensional grid:

                        Compose maximally             Decompose maximally

Handle compatibility    G_NORMALIZE_ALL_COMPOSE       G_NORMALIZE_ALL
equivalents             (NFKC)                        (NFKD)

Don't handle            G_NORMALIZE_DEFAULT_COMPOSE   G_NORMALIZE_DEFAULT
compatibility           (NFC)                         (NFD)
equivalents

I believe that NO-BREAK SPACE and SPACE normalize together at the level
of G_NORMALIZE_ALL, which is perhaps a very good example of why you
don't want to do that level of normalization on input text... Pango
does distinguish these two characters... it won't break a line at
a NO-BREAK SPACE, so if you normalize these two characters together,
you lose formatting information.
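
To make the loss concrete, here is a toy sketch in plain C. It is not
Pango's line-breaking algorithm; it simply assumes that a line may break at
every U+0020 and nowhere else. Both function names are invented for this
example.

```c
#include <stddef.h>

/* Toy model only: count the places where a line could break, assuming
 * every U+0020 SPACE is a break opportunity and nothing else is. */
size_t
count_break_opportunities (const char *utf8)
{
  size_t n = 0;

  for (const char *p = utf8; *p != '\0'; p++)
    if (*p == ' ')
      n++;
  return n;
}

/* Rewrite every U+00A0 NO-BREAK SPACE (UTF-8 bytes C2 A0) as U+0020,
 * in place. Returns the number of characters replaced. */
size_t
fold_nbsp_to_space (char *utf8)
{
  unsigned char *src = (unsigned char *) utf8;
  unsigned char *dst = src;
  size_t replaced = 0;

  while (*src != '\0')
    {
      if (src[0] == 0xC2 && src[1] == 0xA0)
        {
          *dst++ = ' ';   /* two bytes shrink to one */
          src += 2;
          replaced++;
        }
      else
        *dst++ = *src++;
    }
  *dst = '\0';
  return replaced;
}
```

For the text "10\xC2\xA0km away" the toy model sees one break opportunity
before folding and two afterwards: the author's request to keep "10" and
"km" on the same line is gone.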

Though you do probably want to normalize at this level when doing
interactive searching.

I'm not really sure how to answer your question in general. It's certainly
an issue that we should be considering in various contexts. For example,
when the user enters a filename for a new file, we should normalize it to
one of the standard forms. But I'm not sure there are any easy overriding
guidelines.

Regards,
                                        Owen

