Re: Spelling suggestions

From: Debajyoti Bera <dbera web gmail com>
To: Lukas Lipka <lukaslipka gmail com>
Cc: Joe Shaw <joe joeshaw org>, dashboard-hackers <dashboard-hackers gnome org>
Subject: Re: Spelling suggestions
Date: Sun, 16 Dec 2007 08:45:39 -0500

> * Lucene only stores stemmed forms of the words (beagle becomes beagl)
>
>     We have to figure out a way to unstem the word:
> 	1.) Hack the analyzer to get the unstemmed word
> 	2.) Traverse through our TextCache and find a word which
> 	    contains the stem part.
>     This is what I'll be looking into today/tomorrow.

You might want to check the Highlighter.net package (in Lucene.Net/contrib
on their website). It highlights matched words. The example uses
StandardAnalyzer, but I wrapped a PorterStemmer around it and asked it to
highlight words with the same stem, and it was able to do so.
One way I had in mind was to create a tokenstream and check whether each
token's text is the same as the suggested stem; if it is, use
token.startoffset and token.endoffset to extract the actual text from the
source. Of course, it's easier said than done ;-)
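
A rough sketch of the offset idea, in Python rather than against the real
Lucene.Net API. Everything here is an assumption made for illustration:
toy_stem is a crude suffix stripper standing in for a real Porter stemmer,
and tokens_with_offsets stands in for an analyzer's token stream.

```python
import re
from collections import Counter


def toy_stem(word):
    """Crude suffix stripper; a hypothetical stand-in for a Porter stemmer."""
    word = word.lower()
    for suffix in ("ing", "es", "ed", "s", "e"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokens_with_offsets(text):
    """Yield (token_text, start_offset, end_offset) for each word in text."""
    for match in re.finditer(r"[A-Za-z]+", text):
        yield match.group(), match.start(), match.end()


def unstem(text, stem):
    """Count the original surface forms in text whose stem equals `stem`."""
    surface_forms = Counter()
    for token, start, end in tokens_with_offsets(text):
        if toy_stem(token) == stem:
            # Use the offsets to pull the unstemmed word out of the source.
            surface_forms[text[start:end].lower()] += 1
    return surface_forms


sample = "Beagle indexes files. Beagles index quickly; beagle rocks."
forms = unstem(sample, "beagl")
```

With the sample text, forms counts "beagle" twice and "beagles" once, so the
most common surface form could be offered as the unstemmed suggestion.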

>     We need to only return the highest relevant suggestions, based on:
> 	1.) Term frequency in index
> 	2.) Levenshtein distance score

On top of that, there could be multiple indexes, so the results from each
index need to be merged intelligently.

> Sorry for the exhausting email, and let's make Beagle rock! :-)

Yayyyyyyyyyyyy !!!

- dBera

-- 
-----------------------------------------------------
Debajyoti Bera @ http://dtecht.blogspot.com
beagle / KDE fan
Mandriva / Inspiron-1100 user

References:
- Spelling suggestions
  - From: Lukas Lipka
