Re: Character normalization ?

From: Daniel Veillard <veillard redhat com>
To: Owen Taylor <otaylor redhat com>
Cc: gnome-hackers gnome org
Subject: Re: Character normalization ?
Date: Mon, 25 Mar 2002 16:40:38 -0500

On Mon, Mar 25, 2002 at 04:16:55PM -0500, Owen Taylor wrote:
> 
> Daniel Veillard <veillard redhat com> writes:
> 
> > On Mon, Mar 25, 2002 at 03:50:39PM -0500, Gnome CVS User wrote:
> > > Log message:
> > > Mon Mar 25 15:46:54 2002  Owen Taylor  <otaylor redhat com>
> > > 
> > > * modules/basic/basic-*.c: Convert U+00A0 (NON BREAK SPACE)
> > > to U+0020 (SPACE)
> > 
> >   Hum, by the way, now that we have a decent internationalized
> > framework, one of the annoyances of Unicode is character normalization,
> > i.e. sometimes remapping sequences of Unicode chars to a single one.
> > The I18N working group at W3C is pushing hard for "early" normalization
> > [1], i.e. making sure that most of the APIs see only Normalized Content.
> >   Can you tell me/us a bit about this issue ? Is there anything in place,
> > should we make any decision about this ? This can affect a number of things
> > like string searches and comparisons, which are otherwise a real pain.
> 
> This *particular* change is just dealing with shaping... when it gets

  Yeah, it's just that it made me think about the normalization problem.

> to the level of finding the right glyph to render a particular character,
> so it's not really related to the question of normalization.
> 
> GLib contains the necessary function to do normalization to any of the
> four standard Unicode forms:

  Excellent. I will need to look at it ! BTW is Pango generally based on
a given version of Unicode ?

> I believe that NO-BREAK SPACE and SPACE normalize together at the level
> of G_NORMALIZE_ALL, which is perhaps a very good example of why you
> don't want to do that level of normalization on input text... Pango
> does distinguish these two characters... it won't break a line at
> a NO-BREAK SPACE, so if you normalize these two characters together,
> you lose formatting information.

  Hum, I need to dig a bit more. I haven't yet digested my last meeting
with people from the I18N group :-)
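
  To make the NO-BREAK SPACE point concrete, here is a minimal sketch.
It uses Python's standard unicodedata module as a stand-in for the GLib
function, on the assumption that G_NORMALIZE_ALL behaves like a Unicode
compatibility form (NFKD here). The sample string is made up for
illustration.

```python
import unicodedata

NBSP = "\u00a0"   # NO-BREAK SPACE
SPACE = "\u0020"  # SPACE

# Hypothetical sample: a quantity that must not be split across lines.
text = "10" + NBSP + "km"

# Canonical decomposition (NFD) leaves NO-BREAK SPACE untouched.
canonical = unicodedata.normalize("NFD", text)

# Compatibility decomposition (NFKD, assumed to match G_NORMALIZE_ALL)
# folds NO-BREAK SPACE into a plain SPACE.
compat = unicodedata.normalize("NFKD", text)

print(NBSP in canonical)  # the no-break hint survives canonical forms
print(NBSP in compat)     # the no-break hint is gone in compatibility forms
```

  After the compatibility step a line breaker can no longer tell the
two characters apart, which is the formatting loss described above.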

> Though you do probably want to normalize at this level when doing
> interactive searching.

  Well the text you would compare to would have to be normalized too,
and this may be expensive if done for each search query on a document...
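
  One way to limit that cost, sketched here under assumptions: normalize
the document once, keep the result, and normalize only the (short) query
on each search. The class name and the choice of NFKC are hypothetical,
and Python's unicodedata again stands in for GLib. Note that the offsets
returned refer to the normalized copy, not to the original text.

```python
import unicodedata

SEARCH_FORM = "NFKC"  # assumption: a compatibility form for searching


class NormalizedDocument:
    """Hypothetical helper that caches a normalized copy of a document."""

    def __init__(self, text: str) -> None:
        self.original = text
        # Paid once per document rather than once per query.
        self.normalized = unicodedata.normalize(SEARCH_FORM, text)

    def find(self, query: str) -> int:
        """Return the offset of query in the normalized copy, or -1."""
        key = unicodedata.normalize(SEARCH_FORM, query)
        return self.normalized.find(key)


# The document spells "é" as "e" + COMBINING ACUTE ACCENT and uses NBSP.
raw = "Cafe\u0301 10\u00a0km"
doc = NormalizedDocument(raw)

print(raw.find("Caf\u00e9"))  # plain search on the raw text
print(doc.find("Caf\u00e9"))  # search against the normalized copy
print(doc.find("10 km"))      # plain SPACE in the query matches NBSP
```

  The original text stays untouched for rendering, so the line-breaking
information is kept while searches see only normalized strings.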

> I'm not really sure how to answer your question in general. It's certainly
> an issue that we should be considering in various contexts ... e.g., when
> the user enters a filename for a new file, we should normalize it to
> one of the standard forms; I'm not sure if there are any easy overriding
> guidelines.

  Well clearly, for XML the people in charge of I18N would like early
normalization, i.e. parsers would check on input. This means that text
saved as XML data would have to be normalized before serialization.
I think this would certainly affect the example you give. Still, since XML
is used for data exchange, this makes sense IMHO.
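
  As a rough sketch of what "check on input, normalize before
serialization" could look like: the helper names below are hypothetical,
NFC is assumed as the target form, and Python's unicodedata stands in
for whatever a real parser would use (is_normalized needs Python 3.8 or
later).

```python
import unicodedata

FORM = "NFC"  # assumption: the form required for exchanged data


def check_input(text: str) -> str:
    """Reject text that is not already normalized (parser side)."""
    if not unicodedata.is_normalized(FORM, text):
        raise ValueError("input is not in " + FORM)
    return text


def prepare_for_save(text: str) -> str:
    """Normalize text before it is serialized (writer side)."""
    return unicodedata.normalize(FORM, text)


# A user-entered name spelled with "e" + COMBINING ACUTE ACCENT.
decomposed = "re\u0301sume\u0301.xml"
saved = prepare_for_save(decomposed)

check_input(saved)  # passes: the saved form is already NFC
print(saved == "r\u00e9sum\u00e9.xml")
```

  The writer pays the normalization cost once, and the reader only has
to verify, which is cheaper than normalizing everything again.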

  Well, it is still in the future for me; there are more urgent concerns,
but this might be one point to check in the next year(s).

Daniel

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
