Re: [Tracker] libicu & libunistring based parsers (was: Re: libunistring-based parser in libtracker-fts)

I really wouldn't split between non-CJK and CJK, if the performance on
ASCII is comparable using libunistring/libicu (which it seems it is).

We can't be sure of that until you add the extra word discrimination to
your Unicode versions so that the output of all of them is equal
(barring bugs with normalization!). Also try benchmarking with the
encoding pre-check removed from tracker, as it's very likely we will
ditch pango, and by doing so we could be much more dynamic in how we
deal with words. I would be very surprised if those Unicode libs could
match tracker on straight ASCII without the pre-check!


Oh, wait, but the current glib/pango parser doesn't do the split between
ASCII and non-ASCII; it does it between CJK and non-CJK. I agree that
some ASCII-only improvements could be really useful in our case. But
ASCII-only, not non-CJK.

The initial NFC normalization fix for the glib/pango parser is really
not trivial if we need to keep the byte offsets into the original
string, but I will try to think about it. And then there remains the
issue of case-folding, which shouldn't be done unichar by unichar for
non-ASCII (including the Latin encodings). Thus, a real comparison of
all cases between the three parsers would need time. But I just did an
ASCII-only comparison, as all three parsers return the same output in
this case.
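
To illustrate the case-folding point (just a sketch using plain glib,
not tracker code): full Unicode case folding can change the number of
characters in the string, which is exactly why a per-unichar
g_unichar_tolower() loop can't replace it, and why keeping the original
byte offsets gets hairy:

#include <glib.h>

int
main (void)
{
  /* "Straße": U+00DF (sharp s) case-folds to the two characters "ss" */
  const gchar *word = "Stra\xC3\x9F" "e";
  gchar *folded = g_utf8_casefold (word, -1);   /* -> "strasse" */

  /* 6 characters in, 7 out: no single g_unichar_tolower() call can
   * produce "ss", and byte offsets into the original string no longer
   * line up with the folded one. */
  g_print ("%s: %ld chars -> %s: %ld chars\n",
           word, g_utf8_strlen (word, -1),
           folded, g_utf8_strlen (folded, -1));

  g_free (folded);
  return 0;
}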


The best thing about the libunistring/libicu based parsers is really
that there is a single algorithm for any string, whatever characters it
has, and maintaining such an algorithm should be trivial compared to the
glib/pango case.
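
For what it's worth, the single-pass loop looks roughly like this with
libunistring (a sketch based on the u8_wordbreaks() API from uniwbrk.h;
the real parser would also filter out the separator/punctuation
segments):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <uniwbrk.h>

int
main (void)
{
  /* UTF-8 input: ASCII, Latin and CJK all handled by the same loop
   * ("hello wörld 你好" in escaped form) */
  const char *text = "hello w\xC3\xB6rld \xE4\xBD\xA0\xE5\xA5\xBD";
  size_t n = strlen (text);
  char *breaks = malloc (n);
  size_t start = 0;

  /* breaks[i] == 1 when a UAX #29 word boundary precedes byte i */
  u8_wordbreaks ((const uint8_t *) text, n, breaks);

  for (size_t i = 1; i <= n; i++)
    {
      if (i == n || breaks[i])
        {
          /* this prints every segment, spaces included; a real
           * tokenizer would skip the non-word segments here */
          printf ("segment: '%.*s'\n", (int) (i - start), text + start);
          start = i;
        }
    }

  free (breaks);
  return 0;
}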

Also, a split algorithm for non-CJK and CJK would again be faulty for
documents that mix, for example, English and Chinese strings. Probably
not the case on my computer or yours, but quite likely on a Japanese or
Chinese user's computer.

Anyway, tomorrow I will spend some time doing additional tests for the
ASCII-only case, and will try to compare the three parsers in this
specific situation.

Great, I look forward to it!


Using a 50k lorem-ipsum file, plain ASCII with whitespace separators and
other punctuation marks, and with the g_prints I had before removed, I
got the following results (averages of several runs):
 * libicu --> 0.140 seconds
 * libunistring --> 0.136 seconds
 * glib (custom) --> 0.135 seconds

With a 200k lorem-ipsum file:
 * libicu --> 0.384 seconds
 * libunistring --> 0.358 seconds
 * glib (custom) --> 0.345 seconds

So for the ASCII-7-only case, the custom algorithm performs a little
better.

I will modify the libunistring and libicu based algorithms tomorrow so
that, if the string is ASCII-7 only, normalization and case-folding are
skipped and each character just goes through tolower(). That should
bring the numbers closer to those of the glib/custom parser.
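
Something along these lines (a sketch, not the actual patch; the helper
names are made up, not real tracker functions):

#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if every byte is plain 7-bit ASCII */
static bool
is_ascii7 (const char *str, size_t len)
{
  for (size_t i = 0; i < len; i++)
    if (((unsigned char) str[i]) & 0x80)
      return false;
  return true;
}

/* For ASCII-7 input, NFC normalization and case-folding are no-ops,
 * so lowering each byte with tolower() is enough */
static void
ascii7_down (char *str, size_t len)
{
  for (size_t i = 0; i < len; i++)
    str[i] = (char) tolower ((unsigned char) str[i]);
}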

But again, this would be an improvement for the "ASCII-only" case
(equivalent to not doing UNAC stripping for ASCII), not for "non-CJK",
as any other Latin encoding needs proper normalization and case-folding.
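
A quick example of why (a sketch with glib; the same applies with
libunistring/libicu): the same visible word "café" can arrive
precomposed (NFC) or decomposed (NFD), and without normalization the two
forms would be indexed as different words:

#include <glib.h>
#include <string.h>

int
main (void)
{
  const gchar *nfc = "caf\xC3\xA9";    /* 'é' as U+00E9, precomposed   */
  const gchar *nfd = "cafe\xCC\x81";   /* 'e' + U+0301 combining acute */

  g_print ("raw bytes equal: %d\n", strcmp (nfc, nfd) == 0);   /* 0 */

  gchar *norm = g_utf8_normalize (nfd, -1, G_NORMALIZE_NFC);
  g_print ("after NFC:       %d\n", strcmp (nfc, norm) == 0);  /* 1 */

  g_free (norm);
  return 0;
}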

More tomorrow :-)

Cheers!



