Re: [Tracker] Syntax of searchings
- From: Laurent Aguerreche <laurent aguerreche free fr>
- To: Ulrik Mikaelsson <ulrik mikaelsson gmail com>
- Cc: tracker-list gnome org
- Subject: Re: [Tracker] Syntax of searchings
- Date: Wed, 15 Nov 2006 22:02:19 +0100
On Wednesday 15 November 2006 at 21:32 +0100, Ulrik Mikaelsson wrote:
Currently it is not possible, and that is not normal... I do not know if
QDBM (which stores file names associated with keywords) can be set to
split a string like "ziegler-nichols" into "ziegler" and "nichols"
automatically for searching, or if we need to split strings ourselves.
This question was raised before, regarding filenames with dashes and
underscores in them. That time, the reply was that in C code, dashes
and underscores often have a meaning. I think we'll need to be context
sensitive in this case: regular documents and filenames usually
require word-splitting, while source code usually doesn't. (However, in
C, the string "difference=alpha-beta" actually has three interesting
lexemes, and there too the dash should not create the lexeme
"alpha-beta".)
And if you code in Lisp, you can write "foo-bar" as a variable: the whole
name!
IMHO we should split on these characters:
. , ; - * / \ ! ? ' < > & ~ " | `
and I think that is all... I consider only these characters because they
are used in shells. So we do not split words around "_".
But perhaps we should add a parameter to choose whether a search splits
words or not.
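
To make the proposal concrete, here is a minimal sketch in Python (not
Tracker code). The names tokenize and split_words are hypothetical, and
splitting on whitespace as well is my assumption; only the delimiter
list comes from the message above.

```python
import re

# Delimiters proposed above; "_" is deliberately absent.
DELIMITERS = ".,;-*/\\!?'<>&~\"|`"

# Assumption: whitespace also separates words (the list above only
# names punctuation characters).
_SPLIT_RE = re.compile("[" + re.escape(DELIMITERS) + r"\s]+")


def tokenize(text, split_words=True):
    """Return the search tokens for text.

    With split_words=False only whitespace separates tokens, so
    "foo-bar" stays whole (useful for Lisp-style identifiers).
    """
    if not split_words:
        return text.split()
    # Drop the empty strings produced by leading/trailing delimiters.
    return [tok for tok in _SPLIT_RE.split(text) if tok]
```

With this sketch, tokenize("ziegler-nichols") gives the two words
"ziegler" and "nichols", while tokenize("foo_bar") keeps "foo_bar" in
one piece.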
What I also dislike about libstemmer (which aims to "reduce" strings to
their stems, to ignore plurals for instance) is that it does not ignore
accented characters, so if I have a file which contains "éléphant",
then "élephant" or "elephant" will not be found. "éléphant" is the
correct spelling, but it happens very often that French people miss
some accents or add superfluous ones... and it is the same problem in
other languages.
Unfortunately, this is not always applicable. For instance in Swedish,
there's a big difference between the words "öst" and "ost", which
mean "east" and "cheese", respectively. However, "café" is
often spelled "cafe", with the same meaning.
I'm not sure at all how to handle this.
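
One possible direction, sketched in Python under my own assumptions
(this is not what libstemmer or Tracker does): fold accents by Unicode
decomposition, but make the folding optional so that a language such as
Swedish can keep "öst" and "ost" apart. The names fold_accents,
search_keys and the fold flag are hypothetical.

```python
import unicodedata


def fold_accents(word):
    """Remove combining marks, e.g. turn accented vowels into plain ones."""
    # NFD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search_keys(word, fold=True):
    """Return the set of index keys for word.

    The exact lowercase form is always kept; the accent-folded form is
    added only when fold is True (an assumed per-language setting).
    """
    keys = {word.lower()}
    if fold:
        keys.add(fold_accents(word.lower()))
    return keys
```

With folding on, "éléphant", "élephant" and "elephant" all share the key
"elephant". With folding off, "öst" is indexed only as "öst", so a
search for "ost" does not match it. Deciding when to turn folding on is
still the open question.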
I have begun to look for algorithms and I found:
* N-grams
http://en.wikipedia.org/wiki/N-gram
* levenshtein
http://www.php.net/manual/en/function.levenshtein.php
* similar text
http://www.php.net/manual/en/function.similar-text.php
* soundex
http://www.php.net/manual/en/function.soundex.php
* metaphone
http://www.php.net/manual/en/function.metaphone.php
I really do not know which algorithm is the best, or what the pros and
cons of each of them are.
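
As a starting point for comparing them, here is a small Python sketch of
the Levenshtein distance from the list above. It is the textbook dynamic
programming version, written for illustration only; the function name is
mine.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    # previous[j] holds the distance between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca by cb (or keep)
            ))
        previous = current
    return previous[-1]
```

One point to weigh: each comparison costs time proportional to
len(a) * len(b), so comparing a query against every indexed word could
become expensive on a large index.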
I found this library which implements N-grams:
http://hyperestraier.sourceforge.net/
It seems that it (or its predecessor Estraier) is used by Strigi...
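
To show what N-gram matching means in general, here is a short Python
sketch. It does not use or imitate the Hyper Estraier API; the function
names, the space padding and the choice of Jaccard similarity are my own
assumptions for illustration.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of word.

    The word is lowercased and padded with n-1 spaces on each side so
    that its beginning and end also produce n-grams.
    """
    padded = " " * (n - 1) + word.lower() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity of the two n-gram sets, from 0.0 to 1.0."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty sets are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

Two spellings of the same word share most of their n-grams, so a
misspelled query still scores well above an unrelated word, and the
n-grams themselves can serve as index keys.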
Laurent.