Re: searching for parts of words

From: Kevin Kubasik <kevin kubasik net>
To: Richard Boulton <richard tartarus org>
Cc: Beagle-Mailing-List <dashboard-hackers gnome org>
Subject: Re: searching for parts of words
Date: Tue, 15 Nov 2005 15:41:54 -0500

The other issue seems to be that we keep focusing on what Google can
do with multi-language support. However, they only offer comprehensive
desktop searching in one language at a time (as near as I can tell). We
forget that Google has millions of servers worldwide to handle the
processing. Beagle has to do all this every time it reads a document,
and it has to do so with whatever memory and processing power the
home user has to offer. Granted, we are allowed to assume a certain
minimum, but there is a limit to what can be easily accomplished.

While a LANG variable would certainly not be out of our reach, the
lack of a resident linguist might make a language guesser thingy a
challenging project. (Unless of course we find some way to implement a
plugin or language definition architecture.)

While the ability to better handle multiple languages is definitely a
feature to shoot for in a 1.0 release, I think we might want to visit
the drawing board (i.e. IRC and perhaps a virtual whiteboard) and map
out features. The ability to handle mixed-language documents would be
extremely challenging, so that would definitely be lower on the list
of TODOs. However, adding a language field to the index, and
encouraging the addition of language detection in filters, would not be
outlandish. (As we have already said, HTML docs can easily be handled,
and I am willing to bet the OpenDocument format has some field for
language.)
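
To make the HTML case concrete, here is a rough sketch of what a filter
could do to pick up a declared language. It is written in Python purely
for illustration and is not Beagle code; the names LangSniffer and
detect_html_language are made up for this example. It only reads the
"lang" (or "xml:lang") attribute on the root <html> element and returns
None when the document declares nothing.

```python
from html.parser import HTMLParser


class LangSniffer(HTMLParser):
    """Hypothetical helper: remembers the language declared on <html>."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "html" or self.lang is not None:
            return
        for name, value in attrs:
            if name in ("lang", "xml:lang") and value:
                self.lang = value.strip().lower()
                return


def detect_html_language(markup):
    """Return the declared language tag (e.g. 'es'), or None if absent."""
    sniffer = LangSniffer()
    sniffer.feed(markup)
    return sniffer.lang


print(detect_html_language('<html lang="es"><body>hola</body></html>'))
print(detect_html_language("<html><body>no declaration</body></html>"))
```

Whatever comes back would simply be stored in the proposed language
field; documents that declare nothing would leave it empty.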

On 11/15/05, Richard Boulton <richard tartarus org> wrote:
> On Tue, Nov 15, 2005 at 06:50:33PM +0100, DANIELLLANO terra es wrote:
> > Do you know if search engines such as Google have different code to
> > support different languages?
>
> I don't know about Google specifically, but certainly some engines process
> text in different languages differently.  Specifically, they use different
> stemming algorithms, term splitting algorithms (particularly for languages
> such as German which really need word-splitting, and languages which need
> multibyte characters).  Also, accents in English tend to be best ignored
> (cafe has the same meaning with or without an accent), while in other
> languages they tend to be quite important.
>
> Automatic language guessing is not actually that tricky to implement -
> there tend to be characteristic words which occur frequently only in a
> certain language, or infrequently in a certain language.  You can find
> these words by taking reasonably large sets of documents in each language
> you wish to be able to detect, and then analysing the frequencies of words
> in each language.  It is usually then possible to extract a fairly small
> set of words and associated probabilities for each language, and compare
> these probabilities with the frequencies of words in a sample document, and
> get a reasonably reliable guess as to the language of the sample document.
> This technique can even be used with some success on reasonably sized
> paragraphs of text, so mixed language documents could be processed
> appropriately.
>
> --
> Richard
>
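
The word-frequency approach Richard describes above could be sketched
roughly like this. It is a toy in Python, not a proposal for actual
Beagle code. The two one-sentence training samples are placeholders I
made up (real profiles would need reasonably large document sets per
language), and the function names, the top_n cutoff and the floor
probability for unseen words are all assumptions of the sketch.

```python
import math
import re
from collections import Counter

# Runs of letters only; digits, underscores and punctuation split words.
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    return [w.lower() for w in WORD_RE.findall(text)]


def train_profile(sample_text, top_n=50):
    """Map the top_n most frequent words to their relative frequency."""
    counts = Counter(tokenize(sample_text))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.most_common(top_n)}


def guess_language(text, profiles, floor=1e-4):
    """Return the language whose profile gives the text the highest
    log-probability. Words missing from a profile get the small 'floor'
    probability. Returns None for text with no words; if no word matches
    any profile, the scores tie and the first language listed wins."""
    words = tokenize(text)
    if not words:
        return None
    best, best_score = None, -math.inf
    for lang, profile in profiles.items():
        score = sum(math.log(profile.get(w, floor)) for w in words)
        if score > best_score:
            best, best_score = lang, score
    return best


# Placeholder training text, far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat is "
          "in the house of the man who was here with it",
    "es": "el perro y el gato están en la casa de la mujer que no es "
          "de los que se van con el hombre",
}
PROFILES = {lang: train_profile(text) for lang, text in SAMPLES.items()}

print(guess_language("the dog is in the house", PROFILES))
print(guess_language("el gato no es de la casa", PROFILES))
```

Running the same function over each paragraph instead of the whole
document would be the obvious way to try the mixed-language case,
although short paragraphs give it much less to go on.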

--
Kevin Kubasik
240-838-6616
