Re: searching for parts of words


On Tue, 2005-11-15 at 10:09 +0100, Bernhard Kleine wrote:
> the search algorithm is either biased for the english language or
> something else is wrong:

Beagle assumes English at present.

This is actually kind of a hard problem to fix generally.  Maybe we
could assume a language based on the LANG environment variable, but that
means that a great many web pages and emails which are in English would
be incorrectly indexed.

Perhaps we could do some sort of analysis on the document up front to
try to guess a language, but that wouldn't bode well for mixed-language
documents.
So really, there's no ideal solution here and we've just punted the
issue. :)
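To make the "analysis up front" idea concrete, here is a minimal sketch
of one common heuristic: count how many stop words from each candidate
language appear in the document and pick the best scorer.  The word
lists and the guess_language() helper are my own illustration, not
anything in Beagle:

```python
# Tiny, illustrative stop-word lists -- a real detector would use much
# larger lists (or n-gram statistics).
STOP_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
}

def guess_language(text):
    """Guess a document's language by counting stop-word hits per language."""
    words = text.lower().split()
    scores = {lang: sum(w in stops for w in words)
              for lang, stops in STOP_WORDS.items()}
    best = max(scores, key=scores.get)
    # Fall back to English when nothing matches at all.
    return best if scores[best] > 0 else "en"
```

Note that this is exactly where mixed-language documents break down:
a half-English, half-German page scores on both lists and you still
have to commit to a single analyzer.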

> We are, therefore, not concerned any more with exact word matching,
> since this does obviously not take place otherwise  "genes" were not
> found, but with an intelligent input of word derivatives. 

Well, we are concerned.

> -- Is there any possibility to configure this input? 

It would require some hacking of the code and basic Lucene knowledge.
The place to make the change is LuceneCommon.cs.  I know that Lucene has
some support for German, but I don't know how extensive it is.
Additional development work might be necessary here.
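The shape of such a change would roughly be dispatching to a different
stemmer per language.  A toy sketch of that dispatch, in Python for
brevity -- the stemmers here are deliberately naive stand-ins I made up,
nothing like Lucene's real analyzers:

```python
def stem_english(word):
    # Naive: strip a trailing plural "s", so "genes" -> "gene".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def stem_german(word):
    # Naive: strip common German inflection endings, so "Genen" -> "gen".
    for suffix in ("en", "er", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

STEMMERS = {"en": stem_english, "de": stem_german}

def index_terms(words, lang="en"):
    """Reduce words to index terms using the stemmer chosen for `lang`."""
    stem = STEMMERS.get(lang, stem_english)
    return [stem(w.lower()) for w in words]
```

With something like this, a query for "gene" would match a German
document containing "Genen" -- which is the behavior the original
poster was missing.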

> -- Would it be helpful to make it configurable?

We need a strategy for how to determine the language of a document.  I
don't think that just going with LANG is the right thing, because the
number of English documents a user will encounter on the Internet is
significant, and we'd really like those to be indexed correctly.

> Are there any international lists of stop words that will not be indexed
> which can be used with beagle?
> How can I add stop words?

Right now, it's just English.  You can pass a list of stop words into
the StandardAnalyzer constructor, however.
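The effect of passing your own stop list is simply that those tokens
never reach the index.  A rough sketch of the idea in Python -- the
actual change would go through the StandardAnalyzer constructor in
LuceneCommon.cs, and these word lists are just examples:

```python
ENGLISH_STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in"}
GERMAN_STOP_WORDS = {"der", "die", "das", "und", "ein", "eine", "zu"}

def analyze(text, stop_words=ENGLISH_STOP_WORDS):
    """Lowercase, tokenize, and drop stop words -- roughly what an
    analyzer does with a custom stop list before indexing."""
    return [w for w in text.lower().split() if w not in stop_words]

# Merging the lists gives a crude bilingual stop set:
combined_stop_words = ENGLISH_STOP_WORDS | GERMAN_STOP_WORDS
```

So adding German stop words is mostly a matter of finding a good list
and feeding it in at analyzer-construction time.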

