Re: Updated LaTeX filter - support indexing of tex files inside a compressed archive



> From what I know it is terribly hard to detect encodings i.e.
> differentiate between an iso-* encoding and utf8 encoding. Any
> document with any iso* encoding is also a valid utf8 encoded document.
>

I have found the program chardet at http://chardet.feedparser.org/, which is
based on statistical methods for detecting the encoding of files and is an
adaptation of the method used in Netscape browsers, written in Python. This
would be very useful for Beagle, so I wonder whether Beagle will implement
this algorithm (for a description of it, check
http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html)
or should I propose this to the Mono guys? I can start working on it, though
you shouldn't expect much, as I'm not a CS guy.
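(Just to make the "hard to detect" point above concrete: the only cheap check
I know of in C#/Mono is a strict UTF-8 decode, which separates valid UTF-8
from everything else but cannot tell you which iso-* charset the rest is.
A rough, untested sketch, not what chardet does:

    using System;
    using System.IO;
    using System.Text;

    class Utf8Check {
        // Returns true if the bytes decode cleanly as UTF-8 (plain ASCII
        // passes too). A statistical detector like chardet instead scores
        // byte frequencies to guess which legacy charset a non-UTF-8 file uses.
        static bool LooksLikeUtf8 (byte[] data)
        {
            // new UTF8Encoding (false, true): no BOM, throw on bad sequences
            Encoding strict = new UTF8Encoding (false, true);
            try {
                strict.GetString (data);
                return true;
            } catch (DecoderFallbackException) {
                return false;
            }
        }

        static void Main (string[] args)
        {
            byte[] data = File.ReadAllBytes (args [0]);
            Console.WriteLine (LooksLikeUtf8 (data)
                ? "utf-8 (or plain ascii)"
                : "not utf-8, needs a real detector");
        }
    }

A statistical detector is what you need for the second half of the problem,
i.e. deciding which legacy charset it actually is.)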

There is ongoing work on detecting the "language" of a piece of text:
http://bugzilla.gnome.org/show_bug.cgi?id=354742 Detecting the charset
sounds similar but is not quite the same. Still, if we can have language
detection, there is no reason we cannot have charset detection :-)

Several of the filetypes specify the charset themselves, like html, pdf
and other binary formats. The charset detector would be useful for the
others, e.g. text files and latex files, which do not specify the encoding.
Of course, all of this slows down the indexing process quite a lot, so we
need options to turn it on or off, but those can come later (a sketch of
how that hookup might look follows below).
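If someone does port a detector, I would expect the filter-side hookup to
stay small. Something like this sketch, where every name is made up (none
of it is real Beagle or Mono API), just to show how a text/latex filter
could ask for a guess only when the option is turned on:

    using System.Text;

    // Hypothetical interface a ported detector could expose.
    public interface ICharsetDetector {
        // Best-guess charset name (e.g. "ISO-8859-1"), or null if unsure.
        string Detect (byte[] data);
    }

    public class TextFilterSketch {
        // Imagined switch so the (slow) detection can be turned off.
        public static bool EnableCharsetDetection = false;

        public static Encoding PickEncoding (byte[] data, ICharsetDetector detector)
        {
            if (EnableCharsetDetection && detector != null) {
                string guess = detector.Detect (data);
                if (guess != null)
                    return Encoding.GetEncoding (guess);
            }
            // When detection is off or unsure, fall back to UTF-8.
            return Encoding.UTF8;
        }
    }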

AFAIK no one is working on this yet. If you are interested, you can start
by porting one of the libraries to C#, finding an existing port, or
convincing someone else to do the porting ;-). Keep us posted.

Thanks,
- dBera

--
-----------------------------------------------------
Debajyoti Bera @ http://dtecht.blogspot.com
beagle / KDE fan
Mandriva / Inspiron-1100 user
