Re: [Tracker] libicu & libunistring based parsers (was:Re: libunistring-based parser in libtracker-fts)



On Tue, 2010-05-04 at 22:13 +0200, Aleksander Morgado wrote:


But apart from that, the performance difference between the glib-parser
tests and the unicode-based-parsers are really not comparable: If all
processed the same number of words, it really seems that both
libunistring-based one and libicu-based one would behave better even for
ASCII, and all the normalization and case-folding issues would be
solved, and using a single implementation for any kind of input string
(even with mixed CJK and non-CJK).


It's more likely that Tracker pre-checking the encoding, to decide
whether to use Pango or not, is causing too much overhead, especially
if the input string is small.

The ideal solution IMO would still be for Tracker to remove the
pre-check, iterate, and use either the current ASCII code or
libunistring/libicu, depending on the encoding of the current word. It
should be easy to remove Pango and pass non-ASCII text down a separate
path.

For ASCII, we just do what it currently does (iterate, break, convert to
lowercase and validate without any further iterations). There's no need
for normalization or any other treatment, so it should be as optimal as
can be. It would indeed be interesting to see how that benchmarks
against your Unicode stuff.

For non-ASCII, we can easily add your libunistring-based stuff. The
parser, upon hitting a non-ASCII character, simply rolls back and passes
the start of the word to your Unicode libs, which then do the additional
normalization and UNAC steps.
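
A rough sketch of what I mean, just to illustrate the rollback idea. The
names next_word() and unicode_word() are made up for this mail and are
not existing Tracker functions, and the Unicode side is only a stub
standing in for the libunistring/libicu code (it copies bytes and does
no normalization, case-folding or unaccenting at all):

```c
#include <stddef.h>

static int
is_ascii_alnum (unsigned char c)
{
    return (c >= '0' && c <= '9') ||
           (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z');
}

/* Hypothetical stand-in for the libunistring/libicu path. It copies the
 * word verbatim up to the next ASCII separator and returns the number of
 * input bytes consumed. A real version would also normalize, case-fold
 * and unaccent. out_size must be at least 1. */
static size_t
unicode_word (const char *start, char *out, size_t out_size)
{
    size_t i = 0, n = 0;

    for (;;) {
        unsigned char c = (unsigned char) start[i];

        /* Stop at NUL or any other ASCII byte that is not alphanumeric. */
        if (c < 0x80 && !is_ascii_alnum (c))
            break;
        if (n + 1 < out_size)
            out[n++] = (char) c;
        i++;
    }
    out[n] = '\0';
    return i;
}

/* Writes the next word at *cursor into out and advances *cursor.
 * Returns 1 if a word was found, 0 at end of input. *used_unicode is
 * set to 1 when the word had to be handed to the Unicode path.
 * out_size must be at least 1. */
static int
next_word (const char **cursor, char *out, size_t out_size, int *used_unicode)
{
    const char *p = *cursor;
    const char *word_start;
    size_t n = 0;

    /* Skip ASCII separators in front of the word. */
    while (*p != '\0' && (unsigned char) *p < 0x80 &&
           !is_ascii_alnum ((unsigned char) *p))
        p++;

    if (*p == '\0') {
        *cursor = p;
        return 0;
    }

    word_start = p;
    *used_unicode = 0;

    /* ASCII fast path: a single pass, lowercasing as we go. */
    while (*p != '\0') {
        unsigned char c = (unsigned char) *p;

        if (c >= 0x80) {
            /* Non-ASCII byte: roll back to the start of the word and
             * let the Unicode path redo the whole word. */
            *cursor = word_start + unicode_word (word_start, out, out_size);
            *used_unicode = 1;
            return 1;
        }
        if (!is_ascii_alnum (c))
            break;
        if (n + 1 < out_size)
            out[n++] = (char) ((c >= 'A' && c <= 'Z') ? c + 32 : c);
        p++;
    }

    out[n] = '\0';
    *cursor = p;
    return 1;
}
```

Calling next_word() in a loop until it returns 0 walks the whole string.
Pure ASCII words never touch the Unicode code, and there is no pre-check
over the whole input. The cost is that the ASCII prefix of a mixed word
gets scanned twice.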

That's probably the easiest way to get the best of both worlds (assuming
there's a significant difference between Tracker and the Unicode libs
for ASCII).

Obviously, if everything could be done in a single iteration then it
would rock, but as you say that might be a lot of work.

jamie 





