Spelling suggestions



Hi,

Yesterday, as a result of my afternoon hacking, I commited a first-pass
of spelling suggestions into trunk. This is still far from finished and
needs some work. I have received a lot of emails about this, so I will
try to reply and sum up what I have done so far.

Spelling suggestions was an idea first worked on by Fredrik[1] in the
past. His implementation had to build and use a separate index for
storing all the stemmed words. This wasn't very efficient.

My first implementation was very similar to his, but it used a
FuzzyTermEnum to generate suggestions on the fly. FuzzyTermEnum uses
Levenshtein distance to find similar words. This eliminated the need for
maintaining a separate spelling index for each queryable.

In this implementation suggestions were always generated for each query.
Everything worked fine but I wanted to make sure there is no overhead
and that queries won't take much longer than usual.

Profiling showed that generating suggestions for relatively small
indexes doesn't make much of a difference (2%-8% of the whole query
time). However, to my dismay larger indexes proved to be a problem (45%
of the whole query time on FSQ! Whoa!).

That is why I chose another implementation, similar to how snippets work
- spelling suggestions only on-demand. Now you can request suggestions
only when you need them, by sending a SuggestionsReques which is
followed by SuggestionsResponse reply. In my personal point of view this
is much more efficient and doesn't affect the query responsiveness.

There are still some things that need to be done, before we can all
celebrate the "Did you mean to search for..." stuff. :-)

* Lucene only stores stemmed forms of the words (beagle becomes beagl)

    We have to figure out a way to unstem the word:
	1.) Hack the analyzer to get the unstemmed word
	2.) Traverse through our TextCache and find a word which
	    which contains the stem part.
    This is what I'll be looking into today/tomorrow.

* Rank the suggestions

    We need to only return the highest relevant suggestions, based on:
	1.) Term frequency in index
	2.) Levenshtein distance score

* Suggestions limiter

    I've currently limited the suggestions to those that have
    the same first character. This is based on the assumption that
    most likely the first letter the user types will be correct.
    I'm not sure if this is correct, but it does generate better
    results.

* Multi-term queries

    We also need to look into handling multi-term queries,
    where suggestions should contain the whole phrase and not
    just the individual words.

Sorry, for the exhausting email and lets make Beagle rock! :-)

Best,
Lukas




[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]