Re: Hyphenation status



Damon Chaplin <damon kendo fsnet co uk> writes:

> My code is almost ready for Unicode as well. The main remaining issue is
> normalization. I need to:
> 
>  a) Normalize the words and the hyphenation patterns so that
>     matching works correctly (i.e. different forms still match), and
>  b) Convert the resulting hyphenation pattern back to the positions
>     of the original characters, so we insert hyphens in the right
>     place.
> 
> g_utf8_normalize() is a problem because it is very slow and I have no
> way to do (b).
> 
> So I'm thinking of writing an optimized normalization function just for
> the code ranges that use hyphenation. (We can just ignore other
> characters as they won't make any difference.)
> 
> I think hyphenation is used for Latin, Greek and Cyrillic characters.
> Are there any others?

Hebrew is hyphenated at least sometimes (InDesign apparently can
do it.)

I don't see how an "optimized normalization function" is going to 
get you anything significantly faster... maybe you save a few percent
from smaller tables, but you aren't going to get 10x as fast or anything.

And quite a chunk of the Unicode normalization stuff _is_ for 
Latin/Greek. (Few languages are going to give you more normalization
opportunities than Greek.)

IMO, all you are going to end up with is "my function that sort
of does Unicode normalization, but not quite".

If you have ideas about how to write a fast normalization function,
they should be applied to g_utf8_normalize().

(You probably can avoid the double pass through the string and the
full-size intermediate wide character buffer if you are willing to 
reallocate/copy the output buffer in some cases; the search function
in find_decomposition probably can be sped up a bit.)
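
To make the single-pass idea concrete, here is a rough sketch in C. It
is not the glib code: decompose_single_pass() and toy_decompose() are
hypothetical names, the two-entry lookup stands in for the real Unicode
tables, and it works on code points rather than UTF-8 to stay short. The
output buffer starts at the size of the input and is only reallocated
when a decomposition actually overflows it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a real decomposition lookup: writes the
 * decomposition of ch into seq (at most 2 code points), returns its length. */
static size_t toy_decompose(uint32_t ch, uint32_t seq[2])
{
    switch (ch) {
    case 0x00E9: seq[0] = 0x0065; seq[1] = 0x0301; return 2; /* e + acute */
    case 0x00FC: seq[0] = 0x0075; seq[1] = 0x0308; return 2; /* u + diaeresis */
    default:     seq[0] = ch; return 1;
    }
}

/* One pass over the input. The output starts at the input's size, which is
 * enough whenever nothing decomposes, and grows only when an expansion
 * overflows it. Returns a malloc'd buffer (caller frees), NULL on failure. */
static uint32_t *decompose_single_pass(const uint32_t *in, size_t n_in,
                                       size_t *n_out)
{
    assert(in != NULL && n_out != NULL);

    size_t cap = n_in ? n_in : 1;
    size_t len = 0;
    uint32_t *out = malloc(cap * sizeof *out);
    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < n_in; i++) {
        uint32_t seq[2];
        size_t n = toy_decompose(in[i], seq);

        if (len + n > cap) {
            /* Rare path: reallocate/copy the output buffer. */
            size_t new_cap = cap * 2;
            while (new_cap < len + n)
                new_cap *= 2;
            uint32_t *grown = realloc(out, new_cap * sizeof *grown);
            if (grown == NULL) {
                free(out);
                return NULL;
            }
            out = grown;
            cap = new_cap;
        }
        for (size_t j = 0; j < n; j++)
            out[len++] = seq[j];
    }

    *n_out = len;
    return out;
}
```

This leaves out canonical reordering and recomposition, which a real
normalizer needs; it only shows the buffer strategy.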

If you need extended interfaces, we should plan on getting them into
glib eventually. (The same need for reverse mappings comes up in
adding normalization into the Pango shapers.)
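
As a sketch of what such an extended interface might look like (this is
an assumption for illustration, not an existing glib function), the
decomposition step can record, for each output character, the index of
the input character it came from. decompose_with_map(), DecompEntry and
the three-entry table below are all hypothetical; a real version would
use the full Unicode data and handle reordering.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, deliberately tiny decomposition table, sorted by code
 * point so that it can be binary-searched. */
typedef struct {
    uint32_t ch;        /* precomposed code point */
    uint32_t decomp[2]; /* its canonical decomposition */
} DecompEntry;

static const DecompEntry decomp_table[] = {
    { 0x00E9, { 0x0065, 0x0301 } }, /* e with acute */
    { 0x00FC, { 0x0075, 0x0308 } }, /* u with diaeresis */
    { 0x03AC, { 0x03B1, 0x0301 } }, /* Greek alpha with tonos */
};

/* Binary search over the sorted table; NULL if ch does not decompose. */
static const DecompEntry *find_entry(uint32_t ch)
{
    size_t lo = 0, hi = sizeof decomp_table / sizeof decomp_table[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (decomp_table[mid].ch == ch)
            return &decomp_table[mid];
        if (decomp_table[mid].ch < ch)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}

/* Decompose in[0..n_in) into out[], and set src_index[k] to the index in
 * in[] that produced out[k]. Returns the number of code points written,
 * or (size_t)-1 if out_cap is too small. */
static size_t decompose_with_map(const uint32_t *in, size_t n_in,
                                 uint32_t *out, size_t *src_index,
                                 size_t out_cap)
{
    assert(in != NULL && out != NULL && src_index != NULL);

    size_t n_out = 0;
    for (size_t i = 0; i < n_in; i++) {
        const DecompEntry *e = find_entry(in[i]);
        size_t n = e ? 2 : 1;
        for (size_t j = 0; j < n; j++) {
            if (n_out == out_cap)
                return (size_t)-1;
            out[n_out] = e ? e->decomp[j] : in[i];
            src_index[n_out] = i;   /* the reverse mapping */
            n_out++;
        }
    }
    return n_out;
}
```

With src_index in hand, a hyphenation point found between normalized
positions k-1 and k can be translated back to the original word as a
break before character src_index[k], provided src_index[k] differs from
src_index[k-1] (otherwise the point would fall inside a single original
character).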

Regards,
                                      Owen


