[Tracker] libunistring-based parser in libtracker-fts

Hi all,

I've been playing with replacing the two word-break algorithms in
libtracker-fts (a custom one for non-CJK and a pango-based one for CJK)
with a single one based on GNU libunistring (LGPLv3). Note that libicu
(ICU license) would probably also be a good choice instead of
libunistring.
http://www.gnu.org/software/libunistring
http://site.icu-project.org

First impressions have been good so far. I ran some functionality tests
and some performance tests, and in both cases the libunistring-based
parser behaved better. I would be glad if someone other than me could
run some performance tests to compare, anyway.

You can get the "parser-libunistring-review" branch from git.gnome.org.
The libunistring-based parser is enabled by configuring with
--enable-libunistring, or automatically if the libunistring-dev package
is installed; otherwise, the old parser is used. I'm not sure it's
desirable to keep this as a configure option if the libunistring work
gets accepted, though.


Performance (single-cpu Celeron)
----------------------------------
1) Recursively indexing /usr/src/linux-headers-2.6.32-21
   (73 Mbytes, 23410 processed & unprocessed files)
With the current parser, it took 21m (0.0538s/file)
With the libunistring-based parser, it took 19m26s (0.0498s/file)

2) Recursively indexing libunistring's sources (1425 processed files)
With the current parser, it took 340s (0.239s/file)
With the libunistring-based parser, it took 185s (0.130s/file)

3) Memory usage tests
Total memory usage reported by valgrind for tracker-store was quite
similar in my tests, maybe a little bigger with the libunistring-based
parser, but probably because more words were processed (see Pending
Issues below). I tried to use stack buffers as much as possible, so for
the same set of processed words the libunistring-based parser should be
more heap-friendly. Philip, could you check how memory fragmentation
looks, anyway?


Functionality issues
----------------------------------
1) https://bugzilla.gnome.org/show_bug.cgi?id=579756
The current version fails to properly detect word boundaries in
NFD-normalized text, as it treats combining marks as word-breakers.
Thus, when searching for "école" (NFC) or "ecole", tracker will only
report files containing the NFC version of "école", and not those with
the NFD version of the word. With the libunistring-based parser, all
searches return the proper results, no matter if the text is NFC or
NFD.
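
For illustration, here is a minimal standalone test along the lines of
what I checked; this is not the actual code in the branch, just a
sketch (compile with -lunistring). It normalizes the same word to NFC
and NFD with u8_normalize() and checks that u8_wordbreaks() reports no
boundary inside either form:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <uninorm.h>
#include <uniwbrk.h>

/* Returns nonzero if u8_wordbreaks() reports a boundary inside s. */
static int
breaks_inside (const uint8_t *s, size_t len)
{
  char *p = malloc (len);
  size_t i;
  int found = 0;

  u8_wordbreaks (s, len, p);
  for (i = 1; i < len; i++)   /* p[i] != 0: boundary before s[i] */
    found |= p[i];
  free (p);
  return found;
}

int
main (void)
{
  const char *word = "école";   /* NFC as typed here */
  uint8_t *nfc, *nfd;
  size_t nfc_len, nfd_len;

  nfc = u8_normalize (UNINORM_NFC, (const uint8_t *) word,
                      strlen (word), NULL, &nfc_len);
  nfd = u8_normalize (UNINORM_NFD, (const uint8_t *) word,
                      strlen (word), NULL, &nfd_len);

  /* Both should print 0: the combining acute in the NFD form must
   * not be treated as a word-breaker. */
  printf ("NFC: %d, NFD: %d\n",
          breaks_inside (nfc, nfc_len),
          breaks_inside (nfd, nfd_len));

  free (nfc);
  free (nfd);
  return 0;
}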

2) Case folding for string comparison is also broken in the current
version. I did a simple test with two text files, one containing the
German word "groß" and the other its case-folded version "gross". A
proper search should return both results when searching for either
word, but the current version fails to return the "gross" result when
looking for "groß". The issue is that the non-CJK parser lowercases
strings character by character, while the proper thing is to case-fold
the entire word at once. With the libunistring-based parser, all
searches return the proper results, and even better, the method doing
the case-folding can also normalize in the same run if desired.
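
As a minimal sketch of why this works (again not the branch code):
u8_casefold() folds the whole word in a single call and can hand back
the result already NFC-normalized in the same run:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <unicase.h>
#include <uninorm.h>

int
main (void)
{
  const char *word = "groß";
  uint8_t *folded;
  size_t folded_len;

  /* NULL: no language-specific folding rules; UNINORM_NFC:
   * normalize the folded result in the same pass, so there is no
   * need for a separate g_utf8_normalize() call. */
  folded = u8_casefold ((const uint8_t *) word, strlen (word),
                        NULL, UNINORM_NFC, NULL, &folded_len);

  printf ("%.*s\n", (int) folded_len, (const char *) folded); /* gross */

  free (folded);
  return 0;
}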

3) Word-breaking.
With the current parser, mixed CJK and non-CJK strings are parsed
either entirely with the CJK parser or entirely with the non-CJK parser
(the choice between the two is made by looking at the first characters
in the string), which is probably not a good approach. With the
libunistring-based parser, word breaking is always done with the same
algorithm for CJK, non-CJK and mixed CJK/non-CJK strings.
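
A minimal example of the single-algorithm approach (the mixed string
here is made up for illustration):

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <uniwbrk.h>

int
main (void)
{
  /* Mixed Latin/Japanese input, one algorithm for everything. */
  const char *text = "tracker索引エンジン test";
  size_t len = strlen (text);
  char p[64];
  size_t i, start = 0;

  u8_wordbreaks ((const uint8_t *) text, len, p);

  /* p[i] != 0 marks a boundary before byte i; separator runs come
   * out as segments too, and a real parser would filter them. */
  for (i = 1; i <= len; i++)
    if (i == len || p[i])
      {
        printf ("segment: '%.*s'\n", (int) (i - start), text + start);
        start = i;
      }
  return 0;
}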

4) NFC Normalization & UNAC
UNAC does not work with NFD-normalized strings, so NFC normalization
must always happen before UNAC. The current parser doesn't do this, so
even if the word-break algorithm were corrected to fix the "école"
issue, UNAC would still not strip the accent from a word that arrives
NFD-normalized. This is already fixed in the libunistring-based parser
(as said before, the case-folding method in libunistring can give NFC
normalized output directly, so there is no need for g_utf8_normalize()).
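
To make the ordering explicit, a rough sketch, assuming unac's
unac_string() entry point as declared in unac.h and an iconv-style
"UTF-8" charset name (error handling mostly omitted):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <uninorm.h>
#include <unac.h>

int
main (void)
{
  const char *word = "école";   /* may just as well arrive as NFD */
  uint8_t *nfc;
  size_t nfc_len, out_len = 0;
  char *out = NULL;

  /* 1. NFC first: UNAC cannot cope with NFD input.  (In the branch
   *    this step comes for free, as u8_casefold() outputs NFC.) */
  nfc = u8_normalize (UNINORM_NFC, (const uint8_t *) word,
                      strlen (word), NULL, &nfc_len);

  /* 2. Only then can UNAC strip the accent. */
  if (unac_string ("UTF-8", (const char *) nfc, nfc_len,
                   &out, &out_len) == 0)
    printf ("%.*s\n", (int) out_len, out);   /* ecole */

  free (out);
  free (nfc);
  return 0;
}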


Pending issues
----------------------------------
1) The current non-CJK word-break algorithm assumes that a word starts
with a letter, a number or an underscore (correct me if I'm wrong,
please). I'm not sure why the underscore, but anyway in the
libunistring-based parser I also included any symbol as a valid
word-starter character. This means that lots of new words are being
considered, especially when parsing source code (things like '+', '-'
and such). Symbols should probably be removed from the list of valid
word-starter characters, so suggestions are welcome.
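
To make the discussion concrete, a hypothetical helper (not the branch
code) showing the kind of check I mean; the allow_symbols flag marks
the part that should probably go:

#include <stdbool.h>
#include <unictype.h>

/* Hypothetical word-starter check: letters, numbers and underscore
 * as in the old parser, with symbols behind a flag so they can be
 * dropped easily. */
static bool
is_word_starter (ucs4_t uc, bool allow_symbols)
{
  if (uc_is_general_category (uc, UC_LETTER)
      || uc_is_general_category (uc, UC_NUMBER)
      || uc == '_')
    return true;

  /* Symbols like '+' start lots of extra "words" in source code. */
  return allow_symbols && uc_is_general_category (uc, UC_SYMBOL);
}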

2) UNAC needs NFC input, but its output is not NFC: it's the unaccented
string in NFKD normalization. I avoided an extra normalization back to
NFC, but I'm not sure how this should go. This applies to both the
non-libunistring and libunistring versions of the parser.
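
Closing the loop would be a single extra u8_normalize() pass per word,
i.e. one more allocation, which is exactly the cost I was trying to
avoid (hypothetical helper, not in the branch):

#include <stdint.h>
#include <uninorm.h>

/* Hypothetical: renormalize UNAC's NFKD output back to NFC. */
static uint8_t *
renormalize_after_unac (const char *unac_out, size_t unac_out_len,
                        size_t *nfc_len)
{
  return u8_normalize (UNINORM_NFC, (const uint8_t *) unac_out,
                       unac_out_len, NULL, nfc_len);
}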

3) libunistring currently finds all word breaks in the whole input
string in a single function call. This could be improved so that words
are found one by one, which would allow stopping the word-break
operation at any time. I already asked about this on the libunistring
mailing list, and the author added it to his TODO list.


Comments welcome!

-- 
Aleksander




