Re: searching for parts of words

Joe Shaw wrote:
> On Tue, 2005-11-15 at 10:09 +0100, Bernhard Kleine wrote:
> > the search algorithm is either biased for the English language or
> > something else is wrong:
> Beagle assumes English at present.
> This is actually kind of a hard problem to fix generally.  Maybe we
> could assume a language based on the LANG environment variable, but that
> means that a great many web pages and emails which are in English would
> be incorrectly indexed.
> Perhaps we could do some sort of analysis on the document up front to
> try to guess a language, but that wouldn't bode well for 
> documents.

Language-guessing code like this will be needed sooner or later anyway.
For example, it would be great to have it in the word processor so you
don't have to select the spelling language of individual paragraphs.
The IM application could likewise pick the spelling language
automatically when you talk to different people in different languages.
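
To make the idea concrete, here is a rough sketch of guessing a language
by counting common stopwords. It is in Python rather than Beagle's C#,
and everything in it is an assumption for illustration: the tiny
stopword lists, the guess_language name and the min_hits threshold are
made up and are not part of Beagle or Lucene. A real detector would need
much more data per language.

```python
import re

# Tiny, illustrative stopword lists (assumption: a real detector would
# need far larger lists or a statistical model per language).
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "for"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "mit", "für"},
    "es": {"el", "la", "los", "y", "es", "que", "para", "con"},
}


def guess_language(text, min_hits=2):
    """Return the best-matching language code, or None if inconclusive."""
    # Split into runs of letters (Unicode-aware), ignoring digits and "_".
    words = re.findall(r"[^\W\d_]+", text.lower())

    # Count how many words of the text are stopwords of each language.
    scores = {
        lang: sum(1 for word in words if word in stopwords)
        for lang, stopwords in STOPWORDS.items()
    }

    ranked = sorted(scores.values(), reverse=True)
    # Too few hits, or a tie between the top two languages: do not guess.
    if ranked[0] < min_hits or ranked[0] == ranked[1]:
        return None
    return max(scores, key=scores.get)
```

Short inputs such as a single search term give too little evidence, so
the function returns None for them instead of guessing.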

I think Google assigns a language to every search item.
Is that also possible with the Beagle engine (Lucene)?

Beagle has to work OK when there's content in languages other than
English, or Beagle will be of little use to people whose native language
is not English.

> So really, there's no ideal solution here and we've just punted the
> issue. :)
> > We are, therefore, not concerned any more with exact word matching,
> > since this does obviously not take place otherwise  "genes" were not
> > found, but with an intelligent input of word derivatives. 
> Well, we are concerned.
> > -- Is there any possibility to configure this input? 
> It would require some hacking of the code and basic Lucene knowledge.
> The place to make the change is LuceneCommon.cs.  I know that Lucene has
> some support for German, but I don't know how extensive it is.
> Additional development work might be necessary here.

Do you know if search engines such as Google have different code to
support different languages?

> > -- Would it be helpful to make it configurable?
> We need a strategy for how to determine the language of a document.  I
> don't think that just going with LANG is the right thing, because the
> number of English documents a user will encounter on the Internet is
> significant, and we'd really like those to be indexed correctly.

Sure, LANG is _not_ the right thing.
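
One possible strategy, sketched below in Python, is to use a per-document
guess whenever there is one and to look at LANG only as a last resort
when the guess is inconclusive. The function names and the "en" default
are hypothetical, chosen for this example, and nothing here is existing
Beagle code.

```python
import os


def language_from_lang_env(environ=None, default="en"):
    """Extract a language code such as 'de' from LANG ('de_DE.UTF-8')."""
    env = os.environ if environ is None else environ
    value = env.get("LANG", "")
    # Strip the encoding (".UTF-8"), modifier ("@euro") and territory ("_DE").
    code = value.split(".")[0].split("@")[0].split("_")[0].lower()
    # Unset LANG or the "C"/"POSIX" locales carry no language information.
    if not code or code in ("c", "posix"):
        return default
    return code


def choose_language(detected, environ=None):
    """Prefer the per-document guess; fall back to LANG only if it is None."""
    if detected is not None:
        return detected
    return language_from_lang_env(environ)
```

With this ordering, an English web page is still treated as English on a
machine where LANG is de_DE.UTF-8, as long as the per-document guess
succeeds.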
