Re: [Tracker] libicu & libunistring based parsers (was: Re: libunistring-based parser in libtracker-fts)



Hi Martyn,


I added a new configure option to select the desired Unicode support
library:
--with-unicode-support=[glib|libunistring|libicu]
Currently it defaults to 'glib' if not specified.

Makes sense.

Allowing build-time configuration of libglib/libicu/libunistring seems
fine given the circumstances with the licensing. I would make it
optional, using libunistring, then libicu, then glib as a fallback
(with automatic detection in configure).

Done. The new order is libunistring/libicu/glib.
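
For reference, a minimal sketch of how the selection could look on the
C side. HAVE_LIBUNISTRING and HAVE_LIBICU are hypothetical symbols that
configure would define from --with-unicode-support (or the automatic
detection); they are not necessarily the names used in the actual
patch:

#include <stdio.h>

/* Sketch only: HAVE_LIBUNISTRING and HAVE_LIBICU are hypothetical
 * defines exported by configure, falling back to the glib code path
 * when neither library is available. */
static const char *
parser_unicode_backend (void)
{
#if defined HAVE_LIBUNISTRING
  return "libunistring";
#elif defined HAVE_LIBICU
  return "libicu";
#else
  return "glib";
#endif
}

int
main (void)
{
  printf ("Unicode support: %s\n", parser_unicode_backend ());
  return 0;
}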


I also developed a tester which uses the parser in libtracker-fts,
available in tests/libtracker-fts/tracker-parser-test.c.
Once compiled, you can use --file to specify the file to parse.

Perfect. Testing is quite important here. If we have a test case that is
automated, you get extra marks :) since we run that before each release
to sanity check our code base.


Yeah, I'll add some automated unit tests, at least checking the output
of the parsing.
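
Something along these lines, maybe (just a sketch using GLib's GTest
framework; parse_first_word() below is a stand-in for the real
libtracker-fts parser call, not its actual API):

#include <glib.h>

/* Stand-in for the real parser call in libtracker-fts; here it simply
 * lower-cases the input with g_utf8_strdown() so the sketch stays
 * self-contained and runnable. */
static gchar *
parse_first_word (const gchar *input)
{
  return g_utf8_strdown (input, -1);
}

static void
test_simple_word (void)
{
  gchar *word = parse_first_word ("Hello");

  g_assert_cmpstr (word, ==, "hello");
  g_free (word);
}

int
main (int argc, char **argv)
{
  g_test_init (&argc, &argv, NULL);
  g_test_add_func ("/libtracker-fts/parser/simple-word", test_simple_word);
  return g_test_run ();
}

The real tests would of course call the actual parser and compare the
full word list produced for each input file.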

Attached is a short spreadsheet with some numbers I got using my set of
test files. I measured three different things:
 * The time it takes for each parser to parse each file.
 * The number of words obtained with each parser in each file.
 * The contents of the output words.

All the result files are available at:
http://www.lanedo.com/~aleksander/gnome-tracker/tracker-parser-unicode-libraries/

I think we should put the results in docs/ so people can see why we have
decided to use these libraries in that order and what tests have been
done.


You mean in the wiki, right? Will prepare some text.

6) More situations where the glib (custom/pango) parser doesn't work
properly:
 * When the input string is decomposed (NFD) (as with the "École issue"
in the testcaseNFD.txt file in the tests)
 * Special case-folding situations (as with the "groß/gross issue" in
the gross-1.txt file in the tests)
Both libunistring and libicu behave perfectly in the previous cases.

These cases are really what we need to fix.


Those are already fixed by the libunistring/libicu implementations,
with no further changes needed.
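
In case it helps to see why, here is a minimal standalone sketch of the
kind of calls involved, using libunistring directly (not the actual
parser code): it recomposes a decomposed "École" to NFC and case-folds
"groß" to "gross".

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <uninorm.h>   /* u8_normalize, UNINORM_NFC */
#include <unicase.h>   /* u8_casefold */

int
main (void)
{
  /* "École" with a decomposed 'É' (E + U+0301 COMBINING ACUTE ACCENT) */
  const uint8_t nfd[] = "E\xcc\x81" "cole";
  /* "groß", whose full case folding is "gross" */
  const uint8_t gross[] = "gro\xc3\x9f";
  size_t len = 0;
  uint8_t *nfc, *folded;

  /* Recompose to NFC so 'É' becomes a single code point again. */
  nfc = u8_normalize (UNINORM_NFC, nfd, strlen ((const char *) nfd),
                      NULL, &len);
  printf ("NFC:      %.*s\n", (int) len, (const char *) nfc);
  free (nfc);

  /* Case-fold (and normalize) so "groß" matches searches for "gross". */
  folded = u8_casefold (gross, strlen ((const char *) gross),
                        NULL, UNINORM_NFC, NULL, &len);
  printf ("casefold: %.*s\n", (int) len, (const char *) folded);
  free (folded);

  return 0;
}

Compiled with something like 'gcc test.c -lunistring', it prints the
recomposed word and "gross"; ICU provides equivalent C calls
(unorm2_normalize(), u_strFoldCase()).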


Pending issues
----------------------------------
1) The current non-CJK word-break algorithm assumes that a word starts
either with a letter, a number or an underscore (correct me if I'm
wrong, please). Not sure why the underscore, but anyway in the
libunistring-based parser I also included any symbol as a valid
word-starter character. This actually means that lots of new words are
being considered, especially when parsing source code (like '+', '-'
and such). Probably symbols should be removed from the list of valid
word-starter characters, so suggestions are welcome.

Jamie mentions this is for functions in source code. Personally, I
wouldn't mind ignoring those. They are usually private functions and
less interesting. As for numbers, I am sitting on the fence with that
one. It is quite hard to predict useful numbers without context. Mikael
will have an opinion here I would think.


That's completely right. Without the proper context, it's very
difficult to tell whether numbers are really useful information or not.
Phone numbers are a clear case of useful info we shouldn't be filtering
out, I guess.
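
Just to make the options concrete, here is a small sketch (again with
libunistring's unictype.h, not the actual parser code) of how a
word-starter check could treat letters, the underscore, numbers and
symbols as separately switchable classes:

#include <stdbool.h>
#include <stdio.h>
#include <unictype.h>  /* ucs4_t, uc_is_general_category, UC_CATEGORY_* */

/* Hypothetical policy: letters and '_' always start a word; numbers
 * and symbols only when explicitly enabled. */
static bool
is_word_starter (ucs4_t uc, bool accept_numbers, bool accept_symbols)
{
  if (uc == '_' || uc_is_general_category (uc, UC_CATEGORY_L))
    return true;
  if (accept_numbers && uc_is_general_category (uc, UC_CATEGORY_N))
    return true;
  if (accept_symbols && uc_is_general_category (uc, UC_CATEGORY_S))
    return true;
  return false;
}

int
main (void)
{
  /* '+' is a math symbol (Sm), so it only starts a word when symbols
   * are accepted; '7' is a decimal number (Nd). */
  printf ("'a': %d\n", is_word_starter ('a', false, false));   /* 1 */
  printf ("'7': %d\n", is_word_starter ('7', false, false));   /* 0 */
  printf ("'7': %d\n", is_word_starter ('7', true,  false));   /* 1 */
  printf ("'+': %d\n", is_word_starter ('+', true,  false));   /* 0 */
  printf ("'+': %d\n", is_word_starter ('+', true,  true));    /* 1 */
  return 0;
}

Dropping UC_CATEGORY_S from the accepted set would then be a one-line
change if we decide symbols are not worth indexing.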


Cheers!

-- 
Aleksander



