Re: [Tracker] libicu & libunistring based parsers (was: Re: libunistring-based parser in libtracker-fts)



Hi Jamie,

A few comments and questions (I have only looked at your unicode parsers
in the libtracker-fts directory in your branch, so apologies if my
assumptions are wrong):

1) I assume the glib parser in your benchmarks is the tracker parser
unmodified?


Yes, didn't touch it.

2) The Tracker parser ignores words that start with numbers or odd
characters (only a..z/A..Z or underscore is allowed as the first character
- the latter so that C function names get indexed). This keeps a lot of
useless junk out of the FTS index and will almost certainly account for
the discrepancies in word counts (including when using pango) in your
benchmarks?

(I see from your comments that you allow words beginning with numbers in
your unicode implementations)

Yes and no, I would say. I enabled number-only words with this in mind:
https://bugzilla.gnome.org/show_bug.cgi?id=503366

Of course, that could be something configurable if needed.

Some of the discrepancies in the word counts will probably come from
allowing words that start with numbers, and some from allowing any
symbol as a word-starter.
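
For illustration, here is a minimal sketch (a hypothetical helper, not the
actual Tracker code) of what a configurable word-start check could look
like: the strict rule accepts only a..z, A..Z and underscore, while the
relaxed variant also lets number-only words through.

#include <glib.h>

/* Hypothetical helper: decide whether a character may start a word */
static gboolean
word_start_is_valid (gunichar c,
                     gboolean allow_numbers)
{
  /* Strict rule: only a..z, A..Z and '_' may start a word */
  if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_')
    return TRUE;

  /* Relaxed rule: also accept digits, so number-only terms get indexed */
  if (allow_numbers && g_unichar_isdigit (c))
    return TRUE;

  return FALSE;
}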

But there are still some issues with wrong word breaks when the input text
comes decomposed in NFD form. The glib-based parser would need to be
modified so that NFC normalization is done as soon as the string is set in
the parser, but that is quite difficult given that the start/end offsets of
the original words need to be preserved for the offsets() and snippet() FTS
methods. I didn't find any normalization method which keeps track of the
original offsets.
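
As a small standalone illustration (not Tracker code) of why keeping the
offsets is hard: normalizing an NFD string to NFC with GLib changes its
byte length, so start/end offsets computed on the normalized copy no
longer point into the original buffer.

#include <glib.h>
#include <string.h>
#include <stdio.h>

int
main (void)
{
  /* "école" with the first character decomposed: 'e' + U+0301 (NFD) */
  const gchar *nfd = "e\xcc\x81" "cole";
  gchar *nfc = g_utf8_normalize (nfd, -1, G_NORMALIZE_NFC);

  /* NFD form is 7 bytes, NFC form is 6: every byte offset after the
   * accent shifts between the two forms. */
  printf ("NFD: %u bytes, NFC: %u bytes\n",
          (unsigned) strlen (nfd), (unsigned) strlen (nfc));

  g_free (nfc);
  return 0;
}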


3) UNAC benchmarking would also make sense, as it converts to UTF-16 to
perform accent stripping. Of course, if word breaking is faster in UTF-16
then it might give your unicode libs some advantage in the benchmarks?


Well, the libunistring-based parser uses exactly the same unaccent method
as the glib-parser, as both have UTF-8 as input and output. UNAC processing
will probably be faster with libicu, as in that case the UChars passed as
input to the unaccent method are already in UTF-16, so only a conversion to
UTF-16BE (to ensure the big-endian form of UTF-16 for libunac) is needed.
So basically, UNAC is not really an issue for the benchmarking, I would
say.
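
Just to illustrate that conversion step (a minimal standalone sketch, and
an assumption about how it can be done with ICU rather than the actual
branch code; the unac call itself is left out), ICU's converter API can
produce the UTF-16BE bytes directly from the host-endian UChars:

#include <unicode/ucnv.h>
#include <unicode/ustring.h>
#include <stdio.h>

int
main (void)
{
  UErrorCode status = U_ZERO_ERROR;
  UChar word[16];
  int32_t word_len;
  UConverter *conv;
  char be_bytes[32];
  int32_t be_len;

  /* Word already held as host-endian UTF-16 UChars, as libicu produces */
  u_uastrcpy (word, "cafe");
  word_len = u_strlen (word);

  /* Re-encode as UTF-16BE bytes before handing the buffer over to a
   * big-endian-only consumer */
  conv = ucnv_open ("UTF-16BE", &status);
  if (U_FAILURE (status))
    return 1;

  be_len = ucnv_fromUChars (conv, be_bytes, sizeof (be_bytes),
                            word, word_len, &status);
  ucnv_close (conv);

  if (U_FAILURE (status))
    return 1;

  printf ("UTF-16BE buffer: %d bytes\n", (int) be_len);
  return 0;
}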

4) I personally feel that whatever parser we use, it should perform
optimally for ASCII, as it's more prevalent in source code and indexing
source code is really CPU intensive. We could of course use a unicode lib
for the non-ASCII stuff. I note you include some ASCII checking in your
unicode stuff, but it's not used for word breaking, only for UNAC
eligibility, and it causes an additional iteration over the characters in
the word (the tracker one tests for ASCII whilst doing the word-breaking
iteration).


Yes, you are right: the extra ASCII check is not needed in the original
version, but it was essential to improve the performance of the
unicode-based parsers in the cases where UNAC stripping is not needed.
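
For reference, this is roughly the kind of check meant (a hypothetical
helper, not the actual code from the branch): a word made only of ASCII
bytes has no accents, so UNAC stripping can be skipped for it.

#include <glib.h>

/* Hypothetical helper: does this word contain anything UNAC could strip? */
static gboolean
word_needs_unac (const gchar *word,
                 gsize        len)
{
  gsize i;

  for (i = 0; i < len; i++)
    {
      if ((guchar) word[i] & 0x80)   /* non-ASCII byte found */
        return TRUE;
    }

  return FALSE;                      /* pure ASCII, nothing to strip */
}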

But apart from that, the performance figures for the glib-parser and the
unicode-based parsers are really not comparable: if they all processed the
same number of words, it really seems that both the libunistring-based and
the libicu-based ones would behave better even for ASCII, while also
solving the normalization and case-folding issues and allowing a single
implementation for any kind of input string (even with mixed CJK and
non-CJK).

Cheers!





