Re: Opera backend for Beagle

> > Hmmm.. well, we do our best with encoding detection, but since Opera
> > kinda mangles the content in its storage, our encoding detection is
> > pretty unreliable... In general we don't handle other languages very
> > well; we try, but mixed languages are a known issue.
> I understand, it's hard to handle all the encodings for all the
> languages. For Russian alone we have CP1251, ISO8859-5, KOI8-R, CP866,
> and the common Unicode encoding UTF-8, which I hope is not a problem.
> Take browsers: they can do automatic encoding detection very well,
> especially when the target language is defined. The user tells the
> browser to autodetect Cyrillic, for example, and it does all the work
> automatically. Firefox has always done this fine, Opera has been good
> at least since 9.x, and the Konqueror sources were recently (3.5.7, I
> think) changed to use heuristic Cyrillic encoding detection, which
> also works very well now. I'm not a developer, but maybe you could use
> some idea or a ready-made encoding autodetection implementation from
> Konqueror/Firefox? I'm sure that you have other important things to do
> for the next Beagle release, but if the problem has some importance in
> your opinion, it would be very cool to see "enhanced encoding
> autodetection" in the Beagle roadmap.
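
To make the idea of heuristic Cyrillic detection concrete, here is a rough
sketch in Python. It is not what Konqueror, Firefox or Beagle actually do:
the function names, the candidate list and the letter-frequency scoring
are all assumptions made for illustration. It accepts the data as UTF-8 if
it decodes cleanly, and otherwise picks the legacy encoding whose decoding
contains the most common lowercase Russian letters.

```python
# Illustrative sketch only; names and scoring rule are assumptions,
# not taken from any browser or from Beagle.

# Legacy single-byte Cyrillic encodings mentioned in this thread.
LEGACY = ["cp1251", "koi8_r", "cp866", "iso8859_5"]

# Ten of the most frequent Russian letters (lowercase).
COMMON = set("оеаинтсрвл")


def score(text):
    """Return the fraction of characters that are common Russian letters."""
    if not text:
        return 0.0
    return sum(ch in COMMON for ch in text) / len(text)


def guess_cyrillic_encoding(data):
    """Guess which encoding a chunk of (assumed Russian) bytes uses."""
    # Strict UTF-8 decoding rarely succeeds by accident on legacy data,
    # so a clean decode is treated as sufficient evidence.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Otherwise decode with every candidate and keep the best-scoring one.
    return max(LEGACY, key=lambda name: score(data.decode(name, errors="replace")))


sample = "привет, это проверка определения кодировки текста"
print(guess_cyrillic_encoding(sample.encode("koi8_r")))
```

A scheme like this needs a reasonable amount of text to be reliable and
only makes sense once the target language is known, which matches the
"tell the browser to autodetect Cyrillic" case described above.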

Doing language-specific indexing has always been one of our long-term
goals. In general, language detection is a problem, and then there is
the step of using the language information correctly during indexing.
Thankfully, a lot of the data sources Beagle indexes have explicit
information about the language, e.g. the web cache pages. We recently
switched to the Snowball stemmer, which has stemmers for a lot of
languages. Currently the English stemmer is hardcoded. It will take
some more work to plug in the right stemmer and complete the loop. It
won't be ready for the current release, but hopefully it will be
implemented sometime soon after that.
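
As a rough illustration of "plugging in the right stemmer", the selection
step could look something like the Python sketch below. This is not Beagle
code: the mapping table, the function name and the fallback rule are
assumptions. The only thing taken from Snowball is that its stemmers are
named after languages ("english", "russian", and so on).

```python
# Illustrative sketch only; the table and function are hypothetical.

# Map ISO 639-1 language codes to Snowball stemmer names.
STEMMER_BY_LANG = {
    "en": "english",
    "ru": "russian",
    "de": "german",
    "fr": "french",
    "es": "spanish",
}

# Mirrors the current behaviour described above: English when in doubt.
DEFAULT_STEMMER = "english"


def pick_stemmer(language_tag):
    """Choose a stemmer name from a tag such as 'ru', 'ru-RU' or 'en_US'."""
    if not language_tag:
        return DEFAULT_STEMMER
    # Keep only the primary subtag, e.g. 'ru-RU' -> 'ru'.
    primary = language_tag.replace("_", "-").split("-")[0].strip().lower()
    return STEMMER_BY_LANG.get(primary, DEFAULT_STEMMER)


print(pick_stemmer("ru-RU"))
```

Falling back to English keeps the behaviour unchanged for documents that
carry no language information at all.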

- dBera

Debajyoti Bera
beagle / KDE fan
Mandriva / Inspiron-1100 user
