Re: [Tracker] Syntax of searchings

From: Jamie McCracken <jamiemcc blueyonder co uk>
To: Mikkel Kamstrup Erlandsen <mikkel kamstrup gmail com>
Cc: tracker-list gnome org
Subject: Re: [Tracker] Syntax of searchings
Date: Thu, 16 Nov 2006 12:16:17 +0000

Mikkel Kamstrup Erlandsen wrote:

2006/11/15, Eyal Oren <eyal oren deri org>:

    On 11/15/06 22:02 +0100, Laurent Aguerreche wrote:

     >I have begun to search algorithms and I found:
     >
     >* N-grams
     >  http://en.wikipedia.org/wiki/N-gram
     >* levenshtein
     >  http://www.php.net/manual/en/function.levenshtein.php
     >* similar text
     >   http://www.php.net/manual/en/function.similar-text.php
     >* soundex
     >  http://www.php.net/manual/en/function.soundex.php
    soundex allows you to find terms that *sound* similar to an indexed
    term, so that might actually solve the french/swedish/danish
    transliteration problem.

    I'll ask a computational linguist colleague tomorrow, maybe he has some
    ideas.

    I do see one problem, namely that in one context (programming code)
    people seem to prefer exact matches, without stemming or
    similarity-matching, while in other contexts (words in text, file names)
    people do want stemming and some form of similarity search regarding the
    orthography (spelling). There is probably not one solution that fits
    these two uses, but probably a search based on similarity would be fine
    also for source code.
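
For readers unfamiliar with two of the algorithms named above, here is a minimal Python sketch of classic American Soundex and of Levenshtein edit distance. It is only an illustration, not Tracker's code or the PHP implementations linked above; the function names are made up for this example.

```python
# Letter -> digit table for classic (American) Soundex.
SOUNDEX_CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in group:
        SOUNDEX_CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex key, e.g. 'R163' for 'Robert'."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")  # vowels and unknown letters map to ""
        if code and code != prev:
            key += code
        prev = code
    return (key + "000")[:4]  # pad or truncate to one letter plus three digits


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


assert soundex("Robert") == soundex("Rupert") == "R163"
assert levenshtein("kitten", "sitting") == 3
```

Note that Soundex was designed around English pronunciation, so whether it helps with French, Swedish or Danish terms would need checking against real data.
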
I see there has been a lot of focus on how word breaking would work for various programming languages. I must say that I find that the least important use case. People writing programs are very often quite able to search everything and nothing and find what they want. It is of course still a case we should consider, but I do consider natural languages more important than their programmatic cousins.

The Swedish example brought up about "öst" vs "ost" is a good one. It demonstrates the need for language-specific transliteration - and I expect the same to apply to word breaking. But don't we already have language-sensitive stemming - maybe only French and English, but others could be added, no?
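
To make the "öst" vs "ost" point concrete, here is a rough Python sketch of language-specific diacritic folding. In Swedish "öst" (east) and "ost" (cheese) are different words, so ö must survive, while for another language the same accent could be stripped. The `PRESERVE` table and the `fold` function are hypothetical names and the per-language letter sets are assumptions for illustration only, not anything Tracker ships.

```python
import unicodedata

# Assumed per-language sets of letters that must NOT be folded (illustrative).
PRESERVE = {
    "sv": set(unicodedata.normalize("NFC", "åäöÅÄÖ")),
    "da": set(unicodedata.normalize("NFC", "æøåÆØÅ")),
    "fr": set(),  # assumption: accents may be dropped when matching French
}


def fold(text, lang):
    """Strip combining accents except on letters the language treats as distinct."""
    keep = PRESERVE.get(lang, set())
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
            continue
        # Decompose the character and drop the combining marks (the accents).
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


assert fold("öst", "sv") == "öst"   # stays distinct from "ost" in Swedish
assert fold("öst", "fr") == "ost"   # folded under the French rules assumed here
assert fold("élève", "fr") == "eleve"
```

The same per-language table idea could in principle drive word breaking too, but that is not shown here.
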


We already have this - we have both stemmers and stopword lists for:

French, German, Danish, Spanish, Finnish, Norwegian, Italian, Dutch, Portuguese, Russian and Swedish


for stopwords see http://cvs.gnome.org/viewcvs/tracker/data/languages/

and for stemmers see http://cvs.gnome.org/viewcvs/tracker/src/libstemmer/src_c/
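
As a rough picture of how a stopword list and a stemmer combine at indexing time, here is a toy Python sketch. It does not use libstemmer or the real stopword files: the tiny `STOPWORDS` sample, the `SUFFIXES` table and the `index_terms` function are all made up for illustration, and the suffix stripping is a crude stand-in for a real stemming algorithm.

```python
# Tiny illustrative samples, NOT the contents of Tracker's language files.
STOPWORDS = {
    "fr": {"le", "la", "les", "de", "et"},
    "sv": {"och", "det", "att", "en"},
}

# Crude suffix lists standing in for a real stemmer (longest suffix first).
SUFFIXES = {
    "fr": ("ements", "ement", "es", "s"),
    "sv": ("arna", "erna", "ar", "er", "en"),
}


def index_terms(text, lang):
    """Lowercase, drop stopwords, then strip one known suffix per word."""
    stop = STOPWORDS.get(lang, set())
    suffixes = SUFFIXES.get(lang, ())
    terms = []
    for word in text.lower().split():
        if word in stop:
            continue  # stopwords are never indexed
        for suffix in suffixes:
            # Keep at least three characters of stem to avoid over-stripping.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[:-len(suffix)]
                break
        terms.append(word)
    return terms


assert index_terms("les maisons et la maison", "fr") == ["maison", "maison"]
```

Both "maisons" and "maison" end up as the same index term, which is the effect stemming is meant to have on searches.
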



--
Mr Jamie McCracken
http://jamiemcc.livejournal.com/
