Re: searching for parts of words

From: Richard Boulton <richard tartarus org>
To: Beagle-Mailing-List <dashboard-hackers gnome org>
Subject: Re: searching for parts of words
Date: Tue, 15 Nov 2005 19:28:58 +0000

On Tue, Nov 15, 2005 at 06:50:33PM +0100, DANIELLLANO terra es wrote:
> Do you know if search engines such as google has different code to
> support different languages?

I don't know about Google specifically, but some engines certainly process
text in different languages differently.  In particular, they use different
stemming algorithms and different term-splitting algorithms (the latter
matter particularly for languages such as German, which really need
word-splitting, and for languages which need multibyte characters).  Also,
accents in English tend to be best ignored ("café" has the same meaning
with or without its accent), while in other languages they tend to be
quite important.
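
As a rough illustration of the accent point, here is a minimal Python
sketch of accent folding.  The function name strip_accents is just made up
for this example, and it assumes that decomposing the text with Unicode
NFD normalisation and dropping the combining marks is good enough; a real
engine may well do something more careful.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining accent marks removed (hypothetical helper)."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("café"))   # cafe
print(strip_accents("schön"))  # schon
```

The second call shows why this folding should probably be applied per
language rather than globally: in German, "schön" and "schon" are
different words, so folding them together loses information.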

Automatic language guessing is not actually that tricky to implement:
there tend to be characteristic words which occur frequently only in a
certain language, or which are unusually infrequent in a certain language.
You can find these words by taking a reasonably large set of documents in
each language you wish to be able to detect, and then analysing the
frequencies of words in each set.  It is usually then possible to extract
a fairly small set of words, with associated probabilities, for each
language.  Comparing these probabilities with the frequencies of words in
a sample document gives a reasonably reliable guess as to the language of
that document.  This technique can even be used with some success on
reasonably sized paragraphs of text, so mixed-language documents could be
processed appropriately.
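
To make that a bit more concrete, here is a small Python sketch of the
idea.  Everything in it is an assumption made for illustration: the
function names, the top_n cut-off, the floor probability used for words
missing from a profile, and the choice of summing log-probabilities as the
comparison step.  The tiny training texts are only there so the example
runs; real profiles would need reasonably large document sets.

```python
import math
import re
from collections import Counter

# Runs of Unicode letters; digits, underscores and punctuation split words.
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    return WORD_RE.findall(text.lower())


def train_profile(documents, top_n=50):
    """Map the top_n most frequent words to their relative frequencies."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.most_common(top_n)}


def guess_language(text, profiles, floor=1e-6):
    """Return the language whose profile best explains the words in text."""
    words = tokenize(text)
    if not words:
        return None
    scores = {}
    for lang, profile in profiles.items():
        # Words missing from a profile get a small floor probability
        # (an assumed smoothing choice) so the logarithm stays defined.
        scores[lang] = sum(math.log(profile.get(w, floor)) for w in words)
    return max(scores, key=scores.get)


def guess_per_paragraph(text, profiles):
    """Guess a language for each blank-line-separated paragraph."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [guess_language(p, profiles) for p in paragraphs]


# Toy training data, far too small for real use.
corpora = {
    "en": ["the cat sat on the mat and the dog was in the house",
           "it is the best of times and it was the worst of times"],
    "es": ["el gato se sentó en la alfombra y el perro estaba en la casa",
           "es el mejor de los tiempos y era el peor de los tiempos"],
}
profiles = {lang: train_profile(docs) for lang, docs in corpora.items()}

print(guess_language("the dog and the cat", profiles))  # en
print(guess_per_paragraph("the dog and the cat\n\nel perro y el gato",
                          profiles))  # ['en', 'es']
```

Scoring each paragraph separately is what would let a mixed-language
document be split up and handed to the right stemmer for each part.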

-- 
Richard

Follow-Ups:
- Re: searching for parts of words
  - From: Kevin Kubasik
- Re: searching for parts of words
  - From: Colin Marquardt

References:
- Re: searching for parts of words
  - From: DANIELLLANO terra es
