Re: How to deal with different encodings?



> >> I have no idea how it determines if data is in a non-UTF-8 encoding.
> I found this with Google:
> http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
>
> and the Linux file command can be used (though it sometimes reports
> English text even when it is Danish, and its MPEG detection fails)

Thanks for the pointers. If someone can implement them in C#, or at
worst find a C library, then we can give it a shot. There is a partial
implementation of a Textcat patch in Bugzilla, which does language
detection (we need to determine the correct language to figure out
which stemmer to use). If Textcat can determine the encoding as well,
that would be easier still.
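For context, Textcat-style language identification ranks the most
frequent character n-grams of the input and compares that ranking
against precomputed per-language profiles using the Cavnar-Trenkle
"out-of-place" distance. A minimal Python sketch of the idea (the tiny
inline sample texts below stand in for real training profiles, which
Textcat ships separately; function names are mine, not Textcat's):

```python
from collections import Counter

def ngram_profile(text: str, n: int = 3, top: int = 50) -> list:
    """Rank the most frequent character n-grams of a text, Textcat-style."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile: list, reference: list) -> int:
    """Sum of rank differences; lower means more similar (Cavnar-Trenkle)."""
    penalty = len(reference)  # cost for a gram missing from the reference
    index = {g: i for i, g in enumerate(reference)}
    return sum(abs(i - index.get(g, penalty)) for i, g in enumerate(profile))

def guess_language(text: str, references: dict) -> str:
    """Pick the reference profile with the smallest out-of-place distance."""
    profile = ngram_profile(text)
    return min(references, key=lambda lang: out_of_place(profile, references[lang]))

# Toy per-language profiles built from short sample sentences:
refs = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog"),
    "da": ngram_profile("den hurtige brune raev hopper over den dovne hund"),
}
```

A real deployment would build the reference profiles from large
corpora; with only a few dozen characters of input (or training text),
the rankings are too noisy to be reliable, which is the same
small-sample problem the metadata caveat below runs into.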

There are some caveats, though:
- Most of these tools require a chunk of data to determine the
encoding. That is actually not so bad, since we already read the first
1K of each file for mimetype determination; we could reuse that chunk
for language and encoding detection as well.
- This works reasonably for file data, but for metadata, which may be
strings of only 20 characters or so, there is no option other than
assuming a default encoding.
- Each of these steps slows down indexing. We currently take about
0.05-0.1 sec per file, so too much extra slowdown would not be nice.
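On the first point, here is a minimal sketch of what reusing that 1K
chunk could look like, in Python for illustration (beagle itself is
C#; `guess_encoding` and the iso-8859-1 fallback are my assumptions,
not beagle's actual behavior). The only subtlety is that a chunk cut
at an arbitrary byte boundary may end mid-way through a multi-byte
UTF-8 sequence, so an incremental decoder with `final=False` is used
to avoid flagging a merely truncated tail as invalid:

```python
import codecs

def guess_encoding(chunk: bytes, default: str = "iso-8859-1") -> str:
    """Return "utf-8" if the chunk is valid UTF-8, else a default encoding.

    default is a stand-in here; a real implementation would derive it
    from LANG or a user setting.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False: a truncated trailing multi-byte sequence is not an error
        decoder.decode(chunk, final=False)
        return "utf-8"
    except UnicodeDecodeError:
        return default

# The 1K already read for mimetype sniffing can be fed in directly:
# with open(path, "rb") as f:
#     encoding = guess_encoding(f.read(1024))
```

This is only a validity check, not real detection: valid-UTF-8 text is
almost certainly UTF-8, but anything else still needs statistical
guessing along the lines of the Mozilla algorithm linked above.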

Encodings are not handled very well in beagle even when LANG is
properly set, mostly for historical reasons. Someone should look into
it.

- dBera

-- 
-----------------------------------------------------
Debajyoti Bera @ http://dtecht.blogspot.com
beagle / KDE fan
Mandriva / Inspiron-1100 user

