Re: How to get character encoding...

From: "David Necas (Yeti)" <yeti physics muni cz>
To: Micah Carrick <email micahcarrick com>
Cc: gtk-list <gtk-list gnome org>
Subject: Re: How to get character encoding...
Date: Sun, 29 May 2005 22:08:12 +0200

On Sun, May 29, 2005 at 01:56:26PM -0400, Micah Carrick wrote:
> Is there a routine I can use to determine the character encoding of a 
> text file so I can then convert it to UTF-8 for display in a gtkTextView?

Generally, no.  For a short text in an arbitrary language
and an arbitrary encoding, even humans may not be able to
determine it.

It's quite easy to tell legacy 8-bit encodings apart from
the Unicode variants UTF-8, UTF-16 and UCS-4.  Quite a few
programs can do it (e.g., file), although there's no such
routine in GLib AFAIK.
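
A minimal sketch of that "easy" part, in plain C with no GLib:
look for a byte-order mark, and failing that check whether the
bytes form valid UTF-8.  The names guess_unicode() and
is_valid_utf8() are made up for this example, and the returned
strings are just the usual iconv-style names (UTF-32 standing in
for UCS-4).  UTF-16 or UCS-4 text without a BOM is not handled.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if the n bytes at s are well-formed UTF-8, 0 otherwise. */
static int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = s[i];
        unsigned long cp;
        size_t len, k;

        if (c < 0x80) { i++; continue; }              /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return 0;            /* stray continuation or bad lead byte */

        if (i + len > n) return 0;                    /* truncated sequence */
        for (k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;  /* not a continuation */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        /* Reject overlong forms, surrogates and values past U+10FFFF. */
        if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) ||
            (len == 4 && cp < 0x10000))
            return 0;
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) return 0;
        i += len;
    }
    return 1;
}

/* Hypothetical helper: name the Unicode variant of a buffer, or return
 * NULL when it looks like some legacy 8-bit encoding instead.
 * Pure ASCII is reported as UTF-8, since it is valid UTF-8. */
static const char *guess_unicode(const unsigned char *s, size_t n)
{
    /* UTF-32 BOMs first: the UTF-32LE one starts with the UTF-16LE one. */
    if (n >= 4 && memcmp(s, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32LE";
    if (n >= 4 && memcmp(s, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32BE";
    if (n >= 3 && memcmp(s, "\xEF\xBB\xBF", 3) == 0) return "UTF-8";
    if (n >= 2 && memcmp(s, "\xFF\xFE", 2) == 0) return "UTF-16LE";
    if (n >= 2 && memcmp(s, "\xFE\xFF", 2) == 0) return "UTF-16BE";
    if (is_valid_utf8(s, n)) return "UTF-8";
    return NULL;
}
```

A NULL result only says "not Unicode"; it does not say which
8-bit encoding the file uses.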

But if you need to recognize legacy 8-bit encodings, you are
in trouble.  I've written a program, Enca, that does it for
some East-European languages, but that's probably of little
help here.  Various detection routines for Asian languages
can be found on the web, and methods that determine both
language and encoding exist too, but they need fairly
long/typical text.  If it's reasonable to assume the text
is somehow related to the current locale, you can simply try
nl_langinfo(CODESET) from the non-Unicode version of that
locale.  Or something like that, depending on the situation.
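
One way the nl_langinfo(CODESET) idea might look, as a rough
sketch only.  The helper name legacy_codeset() is made up, and
stripping everything from the first '.' in the locale name (so
"cs_CZ.UTF-8" becomes "cs_CZ") is my assumption about what "the
non-Unicode version" means in practice; it works only if that
locale is actually installed.  The program is assumed to have
called setlocale(LC_ALL, "") earlier.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the codeset of the current LC_CTYPE locale
 * with its ".CODESET" suffix removed into buf.
 * Returns 0 on success, -1 if the stripped locale cannot be selected.
 * Note: setlocale() changes process-wide state, so this is not
 * safe to call while other threads depend on the locale. */
static int legacy_codeset(char *buf, size_t buflen)
{
    char saved[128], name[128];
    const char *cur = setlocale(LC_CTYPE, NULL);   /* query only */
    char *dot;

    if (cur == NULL || strlen(cur) >= sizeof saved)
        return -1;
    strcpy(saved, cur);         /* copy: later calls may overwrite it */
    strcpy(name, cur);

    dot = strchr(name, '.');
    if (dot != NULL)
        *dot = '\0';            /* "cs_CZ.UTF-8" -> "cs_CZ"; drops @modifier too */

    if (setlocale(LC_CTYPE, name) == NULL)
        return -1;              /* locale not installed; nothing was changed */

    snprintf(buf, buflen, "%s", nl_langinfo(CODESET));
    setlocale(LC_CTYPE, saved); /* restore the original locale */
    return 0;
}
```

On a typical glibc system with the cs_CZ locale installed this
would be expected to give something like ISO-8859-2, but the exact
string depends on the platform, so treat it as a default to offer
the user rather than as a detection result.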

In all cases, if the file is user-supplied, allow the user to
choose the encoding.

Yeti

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

