Re: [Tracker] Syntax of searchings



Mikkel Kamstrup Erlandsen wrote:
2006/11/15, Eyal Oren <eyal oren deri org>:

    On 11/15/06/11/06 22:02 +0100, Laurent Aguerreche wrote:

     >I have begun looking into algorithms and found:
     >
     >* N-grams
     >  http://en.wikipedia.org/wiki/N-gram
     >* levenshtein
     >  http://www.php.net/manual/en/function.levenshtein.php
     >* similar text
     >   http://www.php.net/manual/en/function.similar-text.php
     >* soundex
     >  http://www.php.net/manual/en/function.soundex.php
    soundex allows you to find terms that *sound* similar to an indexed
    term, so that might actually solve the French/Swedish/Danish
    transliteration problem.

    I'll ask a computational linguist colleague tomorrow, maybe he has some
    ideas.

    I do see one problem: in one context (programming code) people seem
    to prefer exact matches, without stemming or similarity matching,
    while in other contexts (words in text, file names) people do want
    stemming and some form of similarity search on the orthography
    (spelling). There is probably no single solution that fits both
    uses, but a similarity-based search would probably also be fine
    for source code.
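
To make the soundex idea quoted above concrete, here is a minimal sketch of the classic American Soundex code in Python. It is only an illustration, not what Tracker does; the function name and the decision to ignore everything outside A-Z are assumptions of this sketch. Because accented letters are simply dropped, text would need to be transliterated first before this could help with the French/Swedish/Danish case.

```python
# Illustrative sketch of American Soundex (not Tracker code).
# Consonants are grouped into six digit classes; vowels, Y, H and W get no digit.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex code of `word` ("" if it has no A-Z letter)."""
    # Assumption of this sketch: anything outside A-Z is ignored.
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (result + "000")[:4]  # pad with zeros, then cut to four characters


if __name__ == "__main__":
    for name in ("Robert", "Rupert", "Ashcraft", "Tymczak"):
        print(name, soundex(name))
```

Words that share a code would be treated as matches, so "Robert" and "Rupert" end up in the same bucket even though their spelling differs.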


I see there has been a lot of focus on how word breaking would work for various programming languages. I must say that I find that the least important use case. People who write programs are usually quite able to search everything and nothing and find what they want. It is of course still a case we should consider, but I do consider natural languages more important than their programmatic cousins.

The Swedish example brought up about "öst" vs "ost" is a good one. It demonstrates the need for language-specific transliteration - and I expect the same to apply to word breaking. But don't we already have language-sensitive stemming - maybe only French and English, but others could be added, no?

We already have this - we have both stemmers and stopword lists for:

French, German, Danish, Spanish, Finnish, Norwegian, Italian, Dutch, Portuguese, Russian and Swedish

for stopwords see http://cvs.gnome.org/viewcvs/tracker/data/languages/

and for stemmers see http://cvs.gnome.org/viewcvs/tracker/src/libstemmer/src_c/
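
As a rough illustration of the "öst" vs "ost" point, here is a small Python sketch of language-sensitive folding. It is not Tracker code: the PROTECTED table, the language codes and the function names are all hypothetical, and the per-language letter sets are only examples. The idea is that diacritics are stripped by default, except for letters that are distinct letters in the given language.

```python
import unicodedata

# Hypothetical table: letters that must NOT be folded for a given language.
# In Swedish "öst" (east) and "ost" (cheese) are different words.
PROTECTED = {
    "sv": "åäöÅÄÖ",
    "da": "æøåÆØÅ",
}


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def fold(text, lang):
    """Lower-case `text`, stripping diacritics except on letters protected for `lang`."""
    keep = PROTECTED.get(lang, "")
    out = []
    # NFC first, so each accented letter is a single character we can look up.
    for ch in unicodedata.normalize("NFC", text):
        out.append(ch if ch in keep else strip_diacritics(ch))
    return "".join(out).lower()


if __name__ == "__main__":
    print(fold("öst", "sv"))    # kept apart from "ost"
    print(fold("öst", "fr"))    # folded, no protected letters for "fr" here
    print(fold("élève", "fr"))
```

With a table like this, a Swedish index would keep "öst" and "ost" as separate terms, while a language without an entry would fold both to the same term.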


--
Mr Jamie McCracken
http://jamiemcc.livejournal.com/



