Re: [Tracker] libicu & libunistring based parsers (was:Re: libunistring-based parser in libtracker-fts)




So, with this improvement, which treats ASCII-only words as a special
case, libunistring really beats them all.


Yeah, libunistring looks like good stuff. I must check the source!

I still note you need to apply word-filtering rules to words beginning
with numbers or symbols. I'm sure that's easy to do?


Words starting with symbols other than underscore can probably be
skipped. BTW, why not underscore?

We only allowed underscore because some function names in source files
start with an underscore.
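
As a rough illustration of the start-of-word rule described here, the
following is a minimal sketch in Python (not the parser's own C, and not
Tracker's actual code). The helper name `is_indexable_start` is
hypothetical, and the rule it encodes is only what this thread states:
letters and underscore are accepted as a first character, digits and
other symbols are not.

```python
def is_indexable_start(word: str) -> bool:
    """Return True if the word may be indexed, judging by its first character.

    Assumed rule (taken from the discussion, not from Tracker's source):
    a leading letter or underscore is accepted; a leading digit or any
    other symbol is rejected.
    """
    if not word:
        return False
    first = word[0]
    return first == "_" or first.isalpha()


# Identifiers such as "_init" survive, while "42abc" and "#define" do not.
words = ["_init", "main", "42abc", "#define", "-flag", "été"]
kept = [w for w in words if is_indexable_start(w)]
```

Checking only the first character keeps the test cheap, which matters
when it runs once per word during indexing.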


And regarding filtering numbers, is this something we want to do?
There's a bug report about this:
https://bugzilla.gnome.org/show_bug.cgi?id=503366

Most numbers are junk, especially in source files, and would bloat the
index.

We used to have an option where a number longer than x characters would
be accepted (on the grounds that it was a telephone number and therefore
actually useful). I'm not sure whether this preference is still
available or used.
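
The length-based option could look something like the sketch below. It
is written in Python for brevity and is not the old Tracker preference
itself. The function name `keep_number` and the default of 6 are
assumptions; the thread only says "x characters".

```python
def keep_number(token: str, min_length: int = 6) -> bool:
    """Keep an all-digit token only if it is long enough to be useful.

    min_length is a hypothetical stand-in for the "x characters"
    preference mentioned in the thread; 6 is an arbitrary choice.
    """
    return token.isascii() and token.isdigit() and len(token) >= min_length


# Short numbers (years, counters, line numbers) are dropped; long ones stay.
tokens = ["2010", "0123456789", "7", "555"]
kept = [t for t in tokens if keep_number(t)]
```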

An interesting limitation of that is the convention of writing numbers
like this: (012) 345 6789.
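
To make the limitation concrete, here is a small Python sketch. It
assumes that the parser splits on spaces and punctuation, and it uses a
hypothetical threshold `MIN_LENGTH` of 6. Under those assumptions the
number above breaks into three short tokens, each of which fails the
length test even though the full number has ten digits.

```python
import re

MIN_LENGTH = 6  # hypothetical threshold; the thread only says "x characters"


def number_tokens(text: str) -> list[str]:
    """Split text into runs of ASCII digits, as a word breaker might."""
    return re.findall(r"[0-9]+", text)


tokens = number_tokens("(012) 345 6789")
kept = [t for t in tokens if len(t) >= MIN_LENGTH]
total_digits = sum(len(t) for t in tokens)
```

Here `tokens` holds three pieces and `kept` is empty, so a per-token
length check alone would never index this phone number.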


Yes, you are absolutely right. The best option is then probably to make
it configurable, disabled by default. There are probably lots of use
cases we are not aware of that need full numbers to be parsed, and
making that configurable is not much work...



