Re: [Tracker] Syntax of searchings



On Wednesday, 15 November 2006 at 21:32 +0100, Ulrik Mikaelsson wrote:
        Currently it is not possible, and that is not normal... I do not
        know whether QDBM (which stores file names associated with
        keywords) can be set to split a string like "ziegler-nichols"
        into "ziegler" and "nichols" automatically for searching, or
        whether we need to split strings ourselves.

This question was raised before, regarding filenames with dashes and
underscores in them. That time, the reply was that in C code, dashes
and underscores often have a meaning. I think we'll need to be context
sensitive in this case, where regular documents and filenames usually
require word-splitting, while source code usually doesn't. (However, in
C, the string "difference=alpha-beta" actually has three interesting
lexemes, and the dash should not create the lexeme "alpha-beta" here
either.)

And if you code in Lisp, you can write "foo-bar" as a variable: the
whole thing is one name!

IMHO we should split on these characters:
  . , ; - * / \ ! ? ' < > & ~ " | `
and I think that is all... I consider only these characters because they
are used in shells. So we would not split words around "_".
But perhaps we should add a parameter to choose whether a search splits
words or not.
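
To make the proposal concrete, here is a minimal sketch of such a
splitter. It is written in Python purely for illustration (Tracker
itself is C code), and the names split_words and split are hypothetical,
not anything that exists in Tracker or QDBM. Whitespace is treated as a
separator too, which is an assumption on top of the list above.

```python
import re

# Separator characters from the list above. Whitespace is added as an
# assumption; "_" and "=" are deliberately absent, as in the proposal.
SEPARATORS = ".,;-*/\\!?'<>&~\"|`"
_SPLIT_RE = re.compile("[" + re.escape(SEPARATORS) + r"\s]+")


def split_words(text, split=True):
    """Return the search words found in text.

    split=False is the hypothetical "do not split" parameter: only
    whitespace then separates words, so "foo-bar" stays whole.
    """
    if not split:
        return text.split()
    return [word for word in _SPLIT_RE.split(text) if word]


print(split_words("ziegler-nichols"))               # split on "-"
print(split_words("foo_bar"))                       # "_" is kept
print(split_words("ziegler-nichols", split=False))  # kept whole
```

Note that with exactly this character list, "difference=alpha-beta"
comes out as "difference=alpha" and "beta", because "=" is not in the
list; source code would clearly need its own rules.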


        What I also dislike about libstemmer (which aims to "reduce"
        strings to their stems, to ignore plurals for instance) is that
        it does not ignore accented characters, so if I have a file
        which contains "éléphant", then "élephant" or "elephant" will
        not be found. "éléphant" is the correct spelling, but French
        people very often miss some accents or add superfluous ones...
        and it is the same problem in other languages.

Unfortunately, this is not always applicable. For instance, in Swedish
there is a big difference between the words "öst" and "ost", which mean
"east" and "cheese", respectively. However, "café" is often spelled
"cafe", with the same meaning.

I'm not sure at all how to handle this.
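
To illustrate the trade-off, here is a small sketch of accent folding
using Unicode decomposition. It is Python for illustration only, and
fold_accents is a hypothetical helper, not a libstemmer or Tracker
function. It shows both sides: the French and "café" cases collapse as
wanted, but the Swedish pair collapses as well.

```python
import unicodedata


def fold_accents(word):
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Desired: misspelled accents still find the file.
print(fold_accents("éléphant") == fold_accents("élephant"))
print(fold_accents("café") == fold_accents("cafe"))

# Undesired: two different Swedish words become identical.
print(fold_accents("öst") == fold_accents("ost"))
```

All three comparisons are true, so unconditional folding cannot tell
"öst" from "ost". Whether folding should depend on the language, or only
be used as a fallback after an exact match fails, is left open here.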

I have begun to look for algorithms, and I found:

* N-grams
  http://en.wikipedia.org/wiki/N-gram
* levenshtein
  http://www.php.net/manual/en/function.levenshtein.php
* similar text
  http://www.php.net/manual/en/function.similar-text.php
* soundex
  http://www.php.net/manual/en/function.soundex.php
* metaphone
  http://www.php.net/manual/en/function.metaphone.php

I really do not know which algorithm is best, or what the pros and cons
of each of them are.
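
As a starting point for comparing them, here is a rough sketch of two of
the candidates: Levenshtein edit distance and a character n-gram
similarity. It is Python for illustration only, the function names are
hypothetical, and the Dice coefficient used to score n-gram overlap is
just one possible choice, not something taken from the pages above.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn a into b (row-by-row dynamic
    programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def ngrams(word, n=2):
    """Set of all substrings of length n in word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Dice coefficient over the n-gram sets: 1.0 for identical sets,
    0.0 when nothing is shared."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0 if a == b else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


print(levenshtein("elephant", "elephent"))
print(ngram_similarity("elephant", "elephent"))
```

Levenshtein compares one pair of strings at a time, so a search would
have to test the query against every candidate word, whereas n-grams can
be stored in the index and looked up directly. Soundex and metaphone
work on pronunciation and were designed around English, so they may fit
French or Swedish words less well.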

I found this library which implements N-grams:
http://hyperestraier.sourceforge.net/
It seems that it (or its predecessor Estraier) is used by Strigi...


Laurent.


