Re: [Tracker] libicu & libunistring based parsers (was:Re: libunistring-based parser in libtracker-fts)



On Tue, 2010-05-04 at 23:08 +0200, Aleksander Morgado wrote:

Anyway, I agree that the fastest and ideal solution would be one
doing everything needed in a single iteration: NFC normalization,
word-break detection, proper case-folding (not
character-per-character!)... even accent stripping and stemming could be
done if we were to develop such a function (and that would actually
be a great performance improvement, btw), but that is probably a huge
amount of work only useful for the Tracker case, and very difficult to
maintain.

True, but as it's likely to be the most CPU-intensive part of Tracker, a
small gain will have a significant effect.
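
For illustration only, the steps listed above (NFC normalization,
case-folding, accent stripping, word-break detection) can be sketched in
Python. This is not Tracker's code: the function name tokenize() is
hypothetical, the word-break rule is a crude "runs of alphanumerics"
stand-in rather than the real Unicode word-break algorithm, stemming is left
out, and the steps run as separate passes instead of the single iteration
discussed above.

```python
import unicodedata


def tokenize(text):
    """Return normalized, case-folded, accent-stripped words (hypothetical helper)."""
    # NFC first, so canonically equivalent inputs yield the same words.
    text = unicodedata.normalize("NFC", text)
    # Full case folding works on the whole string, not per character,
    # so a single "ß" can expand to "ss".
    text = text.casefold()
    # Accent stripping: decompose, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Crude word-break detection: each run of alphanumerics is one word.
    words, current = [], []
    for ch in stripped:
        if ch.isalnum():
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

Calling tokenize("Straße Café, naïve!") shows why whole-string case folding
matters: "ß" folds to "ss", which a character-per-character lowercase would
miss.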

[snip]

I really wouldn't split between non-CJK and CJK, if the performance of
ASCII is comparable using libunistring/libicu (which it seems it is).

We can't be sure of that until you add the extra word discrimination to
your Unicode versions so that the output of all of them is equal (barring
bugs with normalization!). Also try benchmarking with the encoding precheck
removed from Tracker, as it's very likely we will ditch Pango, and by
doing so we could be much more dynamic in how we deal with words. I
would be very surprised if those Unicode libs could match Tracker on
straight ASCII without the precheck!
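
To make the precheck idea concrete, here is a minimal Python sketch of an
encoding precheck that sends pure-ASCII input down a cheap path and
everything else down a Unicode-aware path. The names split_words() and
parse() are hypothetical and this is not how Tracker implements it; it only
shows the kind of dispatch whose cost the benchmark above would measure.

```python
import unicodedata


def split_words(text):
    """Split text into runs of alphanumeric characters."""
    words, current = [], []
    for ch in text:
        if ch.isalnum():
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def parse(text):
    """Dispatch on an ASCII precheck (hypothetical illustration)."""
    if text.isascii():
        # Fast path: plain lowercasing is enough for ASCII input.
        return split_words(text.lower())
    # Slow path: normalize and apply full case folding first.
    return split_words(unicodedata.normalize("NFC", text).casefold())
```

Removing the precheck in this sketch would mean always taking the second
branch, which is the comparison suggested above.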


The best thing about libunistring/libicu based parsers is really that there
is a single algorithm for any string, whatever characters they have, and
maintaining such algorithms should be trivial compared to the glib/pango
case.

Also, the split algorithm for non-CJK and CJK would again be faulty for
documents with strings in both English and Chinese, for example. Probably
not the case on my computer or yours, but very likely on a
Japanese or Chinese user's computer.

Anyway, tomorrow I will spend some time doing additional tests for the
ASCII-only case, and will try to compare the three parsers in this
specific situation.

Great, I look forward to it!

thanks

jamie



