Hyphenation status



Hi,

I've been working on code to do hyphenation, hopefully to add to Pango.
My new code is faster than libhnj and groff and uses less memory.

Here's a rough comparison, using the US hyphenation patterns on an
850MHz P3:

                Speed (words/sec)        Memory use
  ---------------------------------------------------------
  groff             310000                    140K
  libhnj            360000                    200K
  my code           630000                     43K

The TeX code may be a wee bit more efficient, but it is complicated and
I'm not sure about the license. (We may also have problems with the
various licenses in the hyphenation patterns files at some point.)


My code is almost ready for Unicode as well. The main remaining issue is
normalization. I need to:

 a) Normalize the words and the hyphenation patterns so that
    matching works correctly (i.e. different forms of the same
    character, such as precomposed and decomposed, still match), and
 b) Map the resulting hyphenation points back to the positions
    of the original characters, so we insert hyphens in the right
    place.

g_utf8_normalize() is a problem because it is very slow and, since it
only returns a new normalized string with no record of where each
character came from, I have no way to do (b).
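
To make (a) and (b) concrete, here is a rough sketch in Python (not the
Pango code, just an illustration). It normalizes one base character plus
its trailing combining marks at a time and records which original index
each output character came from. The names normalize_with_map and
map_break are made up for this example, and composing cluster by cluster
is an assumption that should hold for Latin, Greek and Cyrillic but not
for every script.

```python
import unicodedata

def normalize_with_map(word):
    """Return (normalized, origin); origin[i] is the index in `word` of the
    base character whose cluster produced normalized[i]."""
    normalized = []
    origin = []
    i = 0
    n = len(word)
    while i < n:
        # Group a base character with the combining marks that follow it.
        j = i + 1
        while j < n and unicodedata.combining(word[j]) != 0:
            j += 1
        # Compose just this cluster (assumption: no composition across clusters).
        for ch in unicodedata.normalize("NFC", word[i:j]):
            normalized.append(ch)
            origin.append(i)
        i = j
    return "".join(normalized), origin

def map_break(origin, k):
    """Map a break before normalized[k] to an index in the original word.
    Returns None if the break is at an edge or inside a single cluster."""
    if k <= 0 or k >= len(origin) or origin[k] == origin[k - 1]:
        return None
    return origin[k]
```

For example, with the decomposed input "cafe\u0301teria", a break found
before the "t" in the normalized word maps back to index 5 in the
original, which is just after the combining accent.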

So I'm thinking of writing an optimized normalization function just for
the code ranges that use hyphenation. (We can just ignore other
characters as they won't make any difference.)
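
A rough sketch of that idea in Python, again only as an illustration:
characters outside a small table of code ranges are copied through
untouched, and only runs inside the table are normalized. The table
below is my guess at the relevant Unicode blocks and is an assumption,
as are the names HYPHENATING_RANGES and normalize_for_hyphenation.

```python
import unicodedata

# Assumed code ranges for scripts that use hyphenation (inclusive).
HYPHENATING_RANGES = (
    (0x0041, 0x024F),  # 'A' through Latin Extended-B (includes some non-letters)
    (0x0300, 0x036F),  # Combining Diacritical Marks
    (0x0370, 0x03FF),  # Greek and Coptic
    (0x0400, 0x052F),  # Cyrillic and Cyrillic Supplement
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0x1F00, 0x1FFF),  # Greek Extended
)

def in_hyphenating_range(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HYPHENATING_RANGES)

def normalize_for_hyphenation(word):
    # Fast path: pure ASCII text is already in normalized form.
    if word.isascii():
        return word
    out = []
    run = []  # pending characters that fall inside the table
    for ch in word:
        if in_hyphenating_range(ch):
            run.append(ch)
            continue
        if run:
            out.append(unicodedata.normalize("NFC", "".join(run)))
            run = []
        out.append(ch)  # outside the table: copy through unchanged
    if run:
        out.append(unicodedata.normalize("NFC", "".join(run)))
    return "".join(out)
```

The ASCII check is there because most words in the US patterns case
never need any normalization work at all.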

I think hyphenation is used for Latin, Greek and Cyrillic characters.
Are there any others?

Does anyone have better ideas for handling normalization?

Damon



