[Tracker] libicu & libunistring based parsers (was:Re: libunistring-based parser in libtracker-fts)



Hi all again,


I've been playing with replacing the two word-break algorithms in
libtracker-fts (custom for non-CJK and pango-based for CJK) with a
single one using GNU libunistring (LGPLv3). Note that libicu (ICU
license) is probably also a good choice instead of libunistring.
http://www.gnu.org/software/libunistring
http://site.icu-project.org


I developed the libicu-based parser using its Unicode algorithms for
word-breaking, normalization and such, as I did for GNU libunistring
last week, and ran some tests to compare all three implementations
(libunistring-based, libicu-based, glib/pango-based).

You can get the changes from the 'parser-unicode-libs-review' branch in
gnome git.

I added a new configure option to select the desired Unicode support
library:
--with-unicode-support=[glib|libunistring|libicu]
Currently it defaults to 'glib' if not specified.

I also developed a tester which uses the parser in libtracker-fts,
available in tests/libtracker-fts/tracker-parser-test.c
Once compiled, you can use --file to specify the file to parse.

I ran several tests using the new tester, which seem to be more accurate
than the first tests I did last week, since the new results depend only
on the parser implementation and not, for example, on the miner-fs.

Attached is a short spreadsheet with some numbers I got using my set of
test files. I measured three different things:
 * The time it takes for each parser to parse each file.
 * The number of words obtained with each parser in each file.
 * The contents of the output words.

All the result files are available at:
http://www.lanedo.com/~aleksander/gnome-tracker/tracker-parser-unicode-libraries/

Some conclusions from the tests

1) Both libunistring and libicu based parsers have exactly the same
output in all tests: same number of words, same word contents.

2) The number of words detected by the glib(custom/pango) parser and
their contents are usually very different from those of the other two
parsers:
 * In a Chinese-only file, for example, while libunistring/libicu both
detect 1202 words, the glib(custom/pango) parser detects only 188.
 * In a file with mixed languages, glib(custom/pango) detects 22105
words while the others detect 33472 words.

3) GNU libunistring seems to be around 9%-10% faster than libicu,
probably because of the conversions to/from UChars (UTF-16 encoded
strings) needed with libicu, while libunistring's API can work
directly with UTF-8. The comparison is fair, since both parsers
produce exactly the same output.

4) Almost all glib(custom/pango) timings are better than the
libunistring/libicu ones. This is not surprising, as the glib parser
detects far fewer words, so these timing values cannot really be
compared.

5) Pango-based word break is really slow. In a 180k mixed-language file:
 * libunistring needed 1.01 seconds
 * libicu needed 1.10 seconds
 * glib(pango) needed 22 seconds!

6) More situations where the glib(custom/pango) parser doesn't work
properly:
 * When the input string is decomposed (NFD) (as with the "École issue"
in the testcaseNFD.txt file in the tests)
 * Special case-folding cases (as with the "groß/gross issue" in the
gross-1.txt file in the tests)
Both libunistring and libicu handle the previous cases correctly.
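
To make the two cases in 6) concrete, here is a minimal sketch using
only Python's standard library. It is an illustration of the Unicode
behaviour involved, not the code of any of the three parsers; the
helper name same_word is made up for this example.

```python
import unicodedata


def same_word(a: str, b: str) -> bool:
    """Compare two words after NFC normalization and full case folding.

    Hypothetical helper for illustration only; the parsers discussed in
    this mail use libunistring, libicu or glib, not this code.
    """
    def key(s: str) -> str:
        folded = unicodedata.normalize("NFC", s).casefold()
        # Case folding can in principle undo normalization, so normalize again.
        return unicodedata.normalize("NFC", folded)

    return key(a) == key(b)


# "École" written with a precomposed É (NFC) and as E + combining acute (NFD).
nfc_word = "\u00c9cole"
nfd_word = "E\u0301cole"

print(nfc_word == nfd_word)          # raw code point comparison
print(same_word(nfc_word, nfd_word)) # comparison after normalization

# Full case folding maps "ß" to "ss"; plain lowercasing keeps "ß".
print("gro\u00df".lower() == "gross")
print(same_word("gro\u00df", "GROSS"))
```

A parser that compares raw code points, or that only lowercases, would
treat these pairs as different words, which matches the kind of failure
described above.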

Finally, I re-paste the pending issues, as they are still the same:


Pending issues
----------------------------------
1) The current non-CJK word-break algorithm assumes that a word starts
with either a letter, a number or an underscore (correct me if wrong,
please). Not sure why the underscore, but anyway in the
libunistring-based parser I also included any symbol as a valid word
starter character. This actually means that lots of new words are being
considered (like '+', '-' and such), especially when parsing source
code. Probably symbols should be removed from the list of valid word
starter characters, so suggestions welcome.


This now applies to both the libunistring- and libicu-based parsers.
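
As a rough sketch of the rule under discussion, the check below
classifies a candidate first character by its Unicode general category.
The function name is_word_start and the allow_symbols switch are
hypothetical, and the real parsers may classify characters differently;
this only shows how toggling symbols changes what counts as a word
start.

```python
import unicodedata


def is_word_start(ch: str, allow_symbols: bool = False) -> bool:
    """Return True if ch may start a word under the rule sketched here.

    Letters (L*), numbers (N*) and the underscore are always accepted.
    Symbols (S*) are accepted only when allow_symbols is True.
    Hypothetical helper, not taken from libtracker-fts.
    """
    major_class = unicodedata.category(ch)[0]
    if major_class in ("L", "N") or ch == "_":
        return True
    return allow_symbols and major_class == "S"


for ch in ("a", "7", "_", "+", "$", " "):
    print(repr(ch), is_word_start(ch), is_word_start(ch, allow_symbols=True))
```

With allow_symbols enabled, characters such as '+' and '$' start a word,
which is why source code files produce many extra words.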

2) UNAC needs NFC input, but the output of UNAC is not NFC, it's the
unaccented string in NFKD normalization. I avoided an extra
normalization back to NFC, but not sure how it should go. This applies
to both non-libunistring and libunistring versions of the parser.

Applies to all 3 parsers.


3) libunistring currently finds all word breaks in the whole input
string in a single function call. This could be improved so that words
are found one by one, which would allow stopping the word-break
operation at any time. I already asked about this on the libunistring
mailing list, and the author added it to his TODO list.


This still applies to libunistring. libicu can already do a one-by-one
word search (with UChars).
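
The difference between the two styles in 3) can be sketched as follows.
This is only an illustration of "all at once" versus "one by one": the
regular expression is a simplistic stand-in for real Unicode word
breaking, and the function names are made up for this example.

```python
import re
from itertools import islice
from typing import Iterator, List

# Simplistic stand-in for a word-break algorithm (NOT UAX #29 segmentation).
_WORD = re.compile(r"\w+")


def all_words(text: str) -> List[str]:
    """Find every word in one call; the whole input is always scanned."""
    return _WORD.findall(text)


def iter_words(text: str) -> Iterator[str]:
    """Yield words one by one; the caller can stop at any time."""
    for match in _WORD.finditer(text):
        yield match.group(0)


text = "one two three four five"

# Stop after the first two words without scanning the rest of the input.
first_two = list(islice(iter_words(text), 2))
print(first_two)
print(all_words(text))
```

With the one-by-one style, a caller that only needs a limited number of
words per document does not pay for breaking the whole input.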


Comments welcome,

-- 
Aleksander

Attachment: unicode-libraries-report.ods
Description: application/vnd.oasis.opendocument.spreadsheet


