Re: [Tracker] Syntax of searchings



2006/11/15, Eyal Oren <eyal oren deri org>:
On 11/15/06/11/06 22:02 +0100, Laurent Aguerreche wrote:

>I have begun to search for algorithms and I found:
>
>* N-grams
>  http://en.wikipedia.org/wiki/N-gram
>* levenshtein
>  http://www.php.net/manual/en/function.levenshtein.php
>* similar text
>  http://www.php.net/manual/en/function.similar-text.php
>* soundex
>  http://www.php.net/manual/en/function.soundex.php
soundex allows you to find terms that *sound* similar to an indexed term, so
that might actually solve the French/Swedish/Danish transliteration
problem.
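
To make the soundex idea concrete, here is a minimal sketch of the classic
American Soundex coding in Python. It is an illustration only: it is not
taken from Tracker, it may differ from PHP's soundex() in edge cases, and it
simply drops any non-ASCII letter, which is an assumption made for brevity.

```python
def soundex(word: str) -> str:
    """Return the 4-character American Soundex code for a word.

    Assumption: characters outside A-Z are ignored, so accented
    letters would need transliteration before calling this.
    """
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters_only = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters_only:
        return ""

    result = letters_only[0]                 # first letter is kept as-is
    prev = codes.get(letters_only[0], "")
    for ch in letters_only[1:]:
        if ch in "HW":
            continue                         # H and W do not separate equal codes
        digit = codes.get(ch, "")            # vowels (and Y) have no code
        if digit and digit != prev:
            result += digit
        prev = digit
    return (result + "000")[:4]              # pad or truncate to 4 characters


if __name__ == "__main__":
    for name in ("Robert", "Rupert", "Tymczak"):
        print(name, soundex(name))
```

Two terms would then count as "sounding similar" when their codes are equal,
so the index could store the code next to each term.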

I'll ask a computational linguist colleague tomorrow, maybe he has some
ideas.

I do see one problem, namely that in one context (programming code) people
seem to prefer exact matches, without stemming or similarity matching,
while in other contexts (words in text, file names) people do want stemming
and some form of similarity search on the orthography (spelling).
There is probably no single solution that fits both uses, but a search
based on similarity would probably also be fine for source code.
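
One way to picture both uses sharing a single mechanism is an edit-distance
match with an adjustable threshold. The sketch below is only an assumption
about how that could look: the function names and the max_distance parameter
are hypothetical and not part of Tracker.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


def matches(query: str, term: str, max_distance: int) -> bool:
    """Hypothetical matcher: max_distance=0 means exact match only."""
    return levenshtein(query, term) <= max_distance


if __name__ == "__main__":
    print(matches("foo_bar", "foo_baz", 0))   # exact matching, as for code
    print(matches("colour", "color", 1))      # tolerant matching, as for text
```

With a threshold of 0 this behaves like exact matching, so the source code
case would just be a stricter setting of the same search.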

I see there has been a lot of focus on how word breaking would work for various programming languages. I must say that I find that the least important use case. People who write programs are usually quite capable of searching around until they find what they want. It is of course still a case we should consider, but I consider natural languages more important than their programmatic cousins.

The Swedish example brought up about "öst" vs "ost" is a good one. It demonstrates the need for language-specific transliteration, and I expect the same to apply to word breaking. But don't we already have language-sensitive stemming? Maybe only for French and English, but others could be added, no?
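
As a rough sketch of what language-specific transliteration could mean for
"öst" vs "ost": strip accents in general, but keep the letters a language
treats as distinct. The PROTECTED table and the fold() function below are
hypothetical, and the per-language choices (for instance, folding all French
accents) are assumptions for illustration rather than linguistic advice.

```python
import unicodedata

# Hypothetical table: letters that must NOT be folded for a given language.
PROTECTED = {
    "sv": set("åäöÅÄÖ"),   # Swedish: "öst" and "ost" are different words
    "da": set("æøåÆØÅ"),   # Danish
    "fr": set(),           # assumption: French accents may be folded for search
}


def fold(text: str, lang: str) -> str:
    """Lower-case text and strip accents, except on letters protected for lang."""
    keep = PROTECTED.get(lang, set())
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
        else:
            # Decompose, then drop the combining marks (the accents).
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out).lower()


if __name__ == "__main__":
    print(fold("öst", "sv"), fold("ost", "sv"))   # kept distinct for Swedish
    print(fold("élève", "fr"))                    # accents folded for French
```

The same kind of per-language table could presumably drive word breaking as
well.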

Cheers,
Mikkel


