Re: Detecting file encoding

From: Jonathon Jongsma <jonathon quotidian org>
To: Adrián Ortega <elfus0 1 gmail com>
Cc: gtkmm-list gnome org
Subject: Re: Detecting file encoding
Date: Wed, 11 Aug 2010 16:04:54 -0500

On Wed, 2010-08-11 at 15:21 -0500, Adrián Ortega wrote:
> Hello, 
> 
> 
> I'm making a small text editor to learn as much as I can from gtkmm,
> and I've come across a problem that I haven't been able to
> solve. 
> 
> 
> The main issue is I don't know how to detect the file encoding of a
> given file. I've been reading a lot about this, and found that glibmm
> has some functions that could help me, i.e.
> 
> 
>     bool get_charset ()
> 
>     bool get_charset (std::string& charset)
> 
>     std::string convert (const std::string& str,
>                          const std::string& to_codeset,
>                          const std::string& from_codeset)
> 
>     Glib::ustring locale_to_utf8 (const std::string& opsys_string)
> 
>     std::string locale_from_utf8 (const Glib::ustring& utf8_string)
> 
> However, I haven't been able to detect the encoding of a file. I know
> these functions help me to convert from one encoding to another,
> but for that I need to know the file's current encoding.
> Do you have any idea, suggestion or reference that could help me?
> Sorry if this is not strictly related to gtkmm, but I think it's
> somewhat related to glibmm.
> Thanks in advance!

GLib doesn't really provide any way to do this reliably.  It's not a
simple problem to solve.  ICU can do this
(http://userguide.icu-project.org/conversion/detection), and Mozilla
also has its own character set detection algorithms
(http://www.mozilla.org/projects/intl/chardet.html).  But most people
are not very excited about adding a dependency on either of those, so
some applications (e.g. gedit, I believe) just do a poor-man's charset
detection: they try a few common encodings in order and use the first
one that succeeds (which is often good enough for 99% of common cases).
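
As a rough illustration of that "try a few and take the first that
succeeds" idea, here is a minimal C++ sketch.  It is not gedit's actual
code; the function names (is_ascii, is_valid_utf8, accepts_anything,
guess_charset) and the candidate order are hypothetical, chosen only for
this example.  It uses small hand-written validators so that it has no
dependencies; with glibmm you could probably get the same effect by
calling Glib::convert() once per candidate and treating a thrown
Glib::ConvertError as "try the next one".  Note that ISO-8859-1 assigns a
meaning to every byte value, so it can never fail and has to go last as
the catch-all.

```cpp
#include <cstddef>
#include <string>

// True if every byte is 7-bit ASCII.
bool is_ascii(const std::string& bytes)
{
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (static_cast<unsigned char>(bytes[i]) >= 0x80)
            return false;
    }
    return true;
}

// True if the bytes form well-formed UTF-8 (rejects truncated sequences,
// overlong forms, surrogates and code points above U+10FFFF).
bool is_valid_utf8(const std::string& bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;     // continuation bytes expected
        unsigned long cp = 0;      // decoded code point
        unsigned long min_cp = 0;  // smallest value allowed for this length

        if (lead < 0x80)                { extra = 0; cp = lead;        min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i - 1 < extra)
            return false;   // sequence is cut off at the end of the data

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
            if ((cont & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (cont & 0x3F);
        }

        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;

        i += extra + 1;
    }
    return true;
}

// Catch-all for single-byte encodings where every byte value is defined.
bool accepts_anything(const std::string&)
{
    return true;
}

// Hypothetical candidate entry: an encoding name plus a check for it.
struct Candidate {
    const char* name;
    bool (*accepts)(const std::string&);
};

// Returns the name of the first candidate that accepts the data,
// or an empty string if none does.
std::string guess_charset(const std::string& bytes)
{
    static const Candidate candidates[] = {
        { "US-ASCII",   is_ascii },
        { "UTF-8",      is_valid_utf8 },
        { "ISO-8859-1", accepts_anything },  // must stay last
    };
    for (const Candidate& c : candidates) {
        if (c.accepts(bytes))
            return c.name;
    }
    return std::string();
}
```

The result is only a guess: data in some other single-byte encoding will
be reported as ISO-8859-1.  Once you have the name, something like
Glib::convert(contents, "UTF-8", guess_charset(contents)) should give you
UTF-8 text for the buffer.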

References:
- Detecting file encoding
  - From: Adrián Ortega
