Re: Opera backend for Beagle



I'm very glad to hear that better encoding detection is on your to-do
list. I've looked through the Snowball site, and it looks like it rocks :)

Looking forward to hearing news about its support in Beagle. It would be
really nice to see an option somewhere in the preferences to select which
language Beagle should use to autodetect the encoding.
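
To make the idea of a language hint concrete, here is a rough sketch of
how the hint narrows the problem. It is not Beagle, Konqueror, or Firefox
code; the function name, the candidate list, and the letter-frequency
scoring are all assumptions chosen for illustration. Once the language is
known to be Russian, only a handful of encodings need to be tried, and the
decoding that yields the most common Russian letters wins.

```python
# Hypothetical sketch: choose the most plausible encoding for bytes that
# are assumed to contain Russian text. Not taken from any real project.

# Encodings mentioned in this thread, using Python's codec names.
CYRILLIC_CANDIDATES = ["utf-8", "cp1251", "koi8-r", "cp866", "iso8859-5"]

# Some of the most frequent lowercase Russian letters. A correct decoding
# of ordinary Russian text consists mostly of these.
COMMON_RUSSIAN = set("оеаинтсрвлкмдпу")


def guess_cyrillic_encoding(data: bytes) -> str:
    """Return the candidate whose decoding looks most like Russian."""
    best_name, best_score = "utf-8", -1.0
    for name in CYRILLIC_CANDIDATES:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        letters = [ch for ch in text if ch.isalpha()]
        if not letters:
            continue  # decoded to symbols only, so not plausible text
        # Share of letters that are common Russian letters.
        score = sum(ch in COMMON_RUSSIAN for ch in letters) / len(letters)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


if __name__ == "__main__":
    sample = "привет, как дела? это проверка кодировки"
    print(guess_cyrillic_encoding(sample.encode("koi8-r")))
```

A wrong decoding usually turns lowercase text into uppercase letters or
box-drawing symbols, so its score drops sharply. Very short or
mixed-language input can still fool a heuristic this simple.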

Please let the users know somehow once the changes are done.
I think that you'll find at least one tester.


On 11/7/07, D Bera <dbera web gmail com> wrote:
> > > Hmmm.. well, we do our best with encoding detection, but since Opera
> > > kinda mangles the content in its storage, our Encoding detection is
> > > pretty unreliable... In general we don't handle other languages very
> > > well, we try, but mixed languages is a known issue.
> >
> > I understand, it's hard to handle all the encodings for all the
> > languages. For Russian alone we have CP1251, ISO8859-5, KOI8-R, CP866,
> > and the common Unicode UTF-8, which I hope is not a problem. Let's take
> > browsers - they can do automatic encoding detection very well,
> > especially when the target language is defined. The user tells the
> > browser to autodetect Cyrillic, for example, and it does all the work
> > automatically. Firefox has always done this fine, Opera has been good at
> > least since 9.x, and the Konqueror sources were recently (3.5.7, I think)
> > changed to use heuristic Cyrillic encoding detection, which also works
> > very well now. I'm not a developer, but maybe you could use some idea
> > or a ready encoding autodetection implementation from Konqueror/Firefox?
> > I'm sure that you have other important things to do for the next
> > Beagle release, but if the problem has some importance in your
> > opinion, it would be very cool to see "enhanced encoding
> > autodetection" in the Beagle roadmap.
>
> Doing language-specific indexing has always been one of our long-term
> goals. In general, language detection is a problem, and then there is
> the step of using the language information correctly during indexing.
> Thankfully, a lot of the data sources Beagle indexes have explicit
> information about the language, e.g. the web cache pages. We recently
> switched to the Snowball stemmer, which has stemmers for a lot of
> languages. Currently the English stemmer is hardcoded. It will take
> some more work to plug in the right stemmer and complete the loop. It
> won't be ready for the current release, but hopefully it will be
> implemented sometime soon after that.
>
> - dBera
>
> --
> -----------------------------------------------------
> Debajyoti Bera @ http://dtecht.blogspot.com
> beagle / KDE fan
> Mandriva / Inspiron-1100 user
>


-- 
-wbr,
Andrey Melentyev
andrey melentyev gmail com

