How to deal with different encodings?



Hey folks,
  We are having a bit of trouble deciding (*) how to deal with files in an 
encoding different from the system encoding. By default, we use UTF8 
everywhere and assume everything is in UTF8. Some file formats or data 
sources specify their encoding (emails, HTML files, office documents, etc.), 
so those are not a problem.
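
 For formats that do declare a charset, the decoding step boils down to 
something like the hypothetical helper below (only a sketch, not the actual 
filter code): use the declared charset if it resolves, otherwise fall back to 
strict UTF8 so mis-labelled data fails loudly instead of being indexed as 
garbage.

// Hypothetical helper, not Beagle's actual filter code: decode a buffer
// using the charset declared by the format (MIME header, HTML meta tag),
// falling back to strict UTF-8 when nothing usable is declared.
using System;
using System.Text;

static class DeclaredEncodingDecoder {
    public static string Decode (byte[] data, string declaredCharset)
    {
        Encoding enc = new UTF8Encoding (false, true); // strict: throw on invalid bytes
        if (declaredCharset != null) {
            try {
                enc = Encoding.GetEncoding (declaredCharset);
            } catch (ArgumentException) {
                // Unknown charset name; stay with strict UTF-8.
            }
        }
        return enc.GetString (data);
    }
}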

 If non-UTF8 is used for filenames and such, a lot of non-beagle things also 
break; we are trying to use MONO_EXTERNAL_ENCODINGS to deal with this 
case (**).
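
 As far as I understand the mono(1) docs, MONO_EXTERNAL_ENCODINGS holds a 
colon-separated list of encodings the runtime tries when converting external 
strings (filenames, command-line arguments) to Unicode. Roughly this kind of 
fallback loop (an illustration, not Mono's actual code):

// Illustration only, not Mono's implementation: try each encoding named
// in MONO_EXTERNAL_ENCODINGS, e.g. "iso-8859-1:utf-8", until one decodes
// the raw filename bytes without error.
using System;
using System.Text;

static class FilenameDecoder {
    public static string Decode (byte[] rawName)
    {
        string list = Environment.GetEnvironmentVariable ("MONO_EXTERNAL_ENCODINGS");
        if (list == null || list.Length == 0)
            list = "utf-8";
        foreach (string name in list.Split (':')) {
            try {
                Encoding enc = Encoding.GetEncoding (name,
                    new EncoderExceptionFallback (), new DecoderExceptionFallback ());
                return enc.GetString (rawName);
            } catch (Exception) {
                // Unknown encoding name or undecodable bytes: try the next one.
            }
        }
        // Nothing matched; last resort is lossy UTF-8.
        return Encoding.UTF8.GetString (rawName);
    }
}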

 For other files, depending on the file format, either UTF8 or the platform 
encoding is used. It's really a clumsy affair. Apparently Windows XP has a 
system setting "how should I handle non-Unicode programs" where it is possible 
to assign an ISO8859-1 codepage. I have no idea how it determines whether data 
is in a non-UTF8 encoding. So, even though someone could have one system 
encoding, a completely different encoding could be used for file data and 
metadata. It's a perfect encoding mess :-/.
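
 To make the mismatch concrete, here is a tiny stand-alone example (nothing 
Beagle-specific): the same byte is a perfectly good character under an 
ISO8859-1-style codepage, Encoding.Default gives whatever the system (or that 
XP setting) says, and strict UTF8 rejects it outright.

// Stand-alone demo of the mismatch described above.
using System;
using System.Text;

class EncodingMess {
    static void Main ()
    {
        byte[] data = new byte[] { 0xE9 }; // 'é' in ISO-8859-1 / windows-1252
        Console.WriteLine (Encoding.GetEncoding ("iso-8859-1").GetString (data)); // prints "é"
        Console.WriteLine (Encoding.Default.GetString (data)); // depends on the system codepage
        try {
            new UTF8Encoding (false, true).GetString (data);
        } catch (DecoderFallbackException) {
            Console.WriteLine ("not valid UTF-8"); // a lone 0xE9 is an incomplete UTF-8 sequence
        }
    }
}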

 I know it's not possible to always determine the right encoding. We could have 
a BEAGLE_LANG variable which, if set, would specify the encoding to use while 
extracting data, regardless of the system encoding. Most apps would probably 
fail to display that data, but being an indexer, how far should beagle push 
its indexing ability?
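
 Purely as a sketch, the (hypothetical, not implemented) override could look 
like this inside a filter:

// Sketch of the proposed BEAGLE_LANG override; the variable does not exist
// today, this is just what the lookup could look like.
using System;
using System.Text;

static class FilterTextEncoding {
    public static Encoding Pick ()
    {
        string name = Environment.GetEnvironmentVariable ("BEAGLE_LANG");
        if (name != null && name.Length > 0) {
            try {
                return Encoding.GetEncoding (name);
            } catch (ArgumentException) {
                // Bad value: ignore it and use the default below.
            }
        }
        // Current behaviour: assume strict UTF-8.
        return new UTF8Encoding (false, true);
    }
}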

 Any suggestions on what could be done to get the encoding right as often as 
possible?

- dBera

(*) http://bugzilla.gnome.org/show_bug.cgi?id=524077
(**) "non UTF8 folders are not indexed" - in progress - 
http://bugzilla.gnome.org/show_bug.cgi?id=440458

-- 
-----------------------------------------------------
Debajyoti Bera @ http://dtecht.blogspot.com
beagle / KDE / Mandriva / Inspiron-1100

