Re: [Tracker] libicu & libunistring based parsers (was:Re: libunistring-based parser in libtracker-fts)



Hi Jamie,

> Just to make it clear, I believe the tracker way of iterating over the
> characters is the optimal solution, as it allows us to detect encoding,
> keep track of length, perform word validation (the aforementioned parse
> rules concerning the starting character) and perform word breaking without
> any further character iterations (unlike the unicode implementations).


I wouldn't worry about the extra character iteration looking for
ASCII characters, as that operation is just one comparison per byte until
a non-ASCII byte is found in the word. On the other hand, the current
glib-based parser, for example, needs to convert each UTF-8 character
found in the per-character iteration to a 32-bit gunichar
(g_utf8_get_char() calls) and back to UTF-8 (g_ucs4_to_utf8() call),
which is not needed when using libunistring. Also, if NFC normalization
needs to be applied to the whole string before using the glib/pango
parser (as it seems), while storing the offsets of all the characters in
the original non-normalized string (for the offsets() and snippet() FTS
features), that's an extra iteration needed by the glib/pango parser but
not by libunistring or libicu, as their word-break algorithms work fine
with NFD strings.
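
Just for reference, this is more or less what I mean by the ASCII check
being cheap; a minimal sketch, not the actual Tracker code, and the
helper name is made up:

#include <glib.h>

/* Hypothetical helper: returns TRUE if the word contains only ASCII
 * bytes. In UTF-8 any byte >= 0x80 belongs to a multi-byte sequence,
 * so one comparison per byte is enough, and we can stop at the first
 * non-ASCII byte found. */
static gboolean
word_is_ascii (const gchar *word,
               gsize        length)
{
  gsize i;

  for (i = 0; i < length; i++)
    {
      if ((guchar) word[i] >= 0x80)
        return FALSE;
    }

  return TRUE;
}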

All three algorithms have good points and bad points, which is why I
would say they should be compared as a whole.

Anyway, I agree that the fastest and ideal solution would be one doing
all the needed things in a single iteration: NFC normalization,
word-break detection, proper case-folding (not
character-per-character!)... Even accent stripping and stemming could be
done if we were to develop such a function (and that would actually be a
great performance improvement, btw), but that is probably a huge amount
of work, only useful for the Tracker case, and very difficult to
maintain.
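
Just as a hint of what part of that combined pass could look like:
libunistring already lets you case-fold a whole word and normalize it in
the same call, instead of going character by character. A rough sketch
only (error handling omitted, NULL language so language-independent
folding is used), not what any of the current parsers do:

#include <stdint.h>
#include <stddef.h>
#include <unicase.h>   /* u8_casefold() */
#include <uninorm.h>   /* UNINORM_NFC */

/* Case-fold a UTF-8 word and apply NFC normalization in one operation.
 * Returns a newly allocated buffer (or NULL on error); *folded_len
 * receives its length in bytes. */
static uint8_t *
fold_and_normalize (const uint8_t *word,
                    size_t         length,
                    size_t        *folded_len)
{
  return u8_casefold (word, length, NULL, UNINORM_NFC,
                      NULL, folded_len);
}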

> Therefore, one question I would ask is whether the unicode libs provide
> the ability to check if a single character is a word break?


Actually, that's the wrong way of looking at it :-) Unicode specifies
"where not to break a word" rules, and if none of them apply to the
given case, the default is "break everywhere". It's explained in UAX#29:
http://unicode.org/reports/tr29/#Default_Word_Boundaries

There are something like 14 rules to check before deciding that there is
a word break between two unicode characters (so it is not a matter of
certain unicode characters themselves being word-breakers).
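
That is basically what libunistring's u8_wordbreaks() gives you: it
applies the default UAX#29 rules over the whole UTF-8 string in one pass
and just flags the positions where a break is allowed. A rough sketch of
how it can be used (my own wrapper, assuming the input is valid UTF-8):

#include <stdlib.h>
#include <stdint.h>
#include <uniwbrk.h>   /* u8_wordbreaks() */

/* Returns a malloc()ed array of n flags: p[i] is set to 1 when a word
 * break is allowed before byte s[i], according to the UAX#29 default
 * word boundary rules. The caller can then cut the string at those
 * positions without ever inspecting the characters itself. */
static char *
compute_word_breaks (const uint8_t *s,
                     size_t         n)
{
  char *p = malloc (n);

  if (p != NULL)
    u8_wordbreaks (s, n, p);

  return p;
}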

> If so, it could be added to the existing tracker parser with little
> modification and overhead (and only used if the word is non-ASCII, of
> course!)


That would mean re-implementing the word-break algorithm from
libunistring or libicu instead of re-using the originals :-)
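
With libicu the equivalent comes through UBreakIterator; something along
these lines (heavily simplified: fixed-size buffer, hard-coded "en"
locale, most error handling skipped):

#include <stdio.h>
#include <unicode/ubrk.h>
#include <unicode/ustring.h>

/* Print the word boundary offsets (in UTF-16 code units) that ICU's
 * UAX#29 implementation finds in a UTF-8 string. */
static void
print_word_boundaries (const char *utf8)
{
  UChar buffer[256];
  int32_t ulen = 0;
  UErrorCode status = U_ZERO_ERROR;
  UBreakIterator *bi;
  int32_t pos;

  u_strFromUTF8 (buffer, 256, &ulen, utf8, -1, &status);
  if (U_FAILURE (status))
    return;

  bi = ubrk_open (UBRK_WORD, "en", buffer, ulen, &status);
  if (U_FAILURE (status))
    return;

  for (pos = ubrk_first (bi); pos != UBRK_DONE; pos = ubrk_next (bi))
    printf ("boundary at %d\n", pos);

  ubrk_close (bi);
}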

BTW, it seems Chinese strings are currently not properly word-broken by
the glib/pango parser; Pango is not actually being used, I mean.
Probably the ranges of CJK characters need to be reviewed.

> Also the pango way of allocating structs to store attributes for each
> character is indeed crazy and insanely slow, but fortunately we only use
> that for CJK text; in any case we could indeed optimize that area
> with some of the work you have done.


I really wouldn't split between non-CJK and CJK if the performance for
ASCII is comparable using libunistring/libicu (and it seems it is). The
best thing about the libunistring/libicu based parsers is really that
there is a single algorithm for any string, whatever characters it
contains, and maintaining such an algorithm should be trivial compared
to the glib/pango case.

Also, an algorithm that splits between non-CJK and CJK would again be
faulty for documents with strings in both English and Chinese, for
example. Probably not the case on my computer or yours, but quite likely
on a Japanese or Chinese user's computer.
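
As a quick illustration, the same u8_wordbreaks() call from the sketch
above handles a mixed string without any script detection at all
(hypothetical example, untested):

#include <string.h>
#include <stdlib.h>
#include <stdint.h>
#include <uniwbrk.h>

/* The UAX#29 rules find the breaks in the English part and in the
 * Chinese part alike; the parser never needs to know which script it
 * is looking at. */
static void
break_mixed_example (void)
{
  const uint8_t *mixed = (const uint8_t *) "Tracker 索引 documents";
  size_t n = strlen ((const char *) mixed);
  char *p = malloc (n);

  if (p != NULL)
    {
      u8_wordbreaks (mixed, n, p);
      /* p[i] == 1 wherever a word break is allowed before byte i. */
      free (p);
    }
}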

Anyway, tomorrow I will spend some time doing additional tests for the
ASCII-only case, and will try to compare the three parsers in this
specific situation.

Cheers!

-- 
Aleksander




