Re: [Vala] Get file characterset



On Wed, Dec 02, 2009 at 21:53:48 -0700, Mark Dewey wrote:
> How do I determine the encoding of a .txt file without knowing what it
> is beforehand? That would make GLib.convert quite useful.

You can try a few wild guesses, but that's really all. Basically:

 - If it begins with one of the byte-order marks, then it is that Unicode
   encoding, i.e. if the file begins with the bytes:

    - \xff\xfe, then it is UTF-16LE
    - \xfe\xff, then it is UTF-16BE
    - \xef\xbb\xbf, then it is UTF-8

  (The bytes are the zero-width non-breaking space U+FEFF encoded in the
  corresponding encoding; it is used because U+FFFE is guaranteed never to
  be a character, so the byte order is unambiguous.)

 - Basically nobody uses UTF-16 without a byte-order mark, because it is
   hard to guess the byte order without it, so there is no point trying
   UTF-16 if there is no byte-order mark.

 - Then, if the content is well-formed UTF-8, it probably is UTF-8. UTF-8
   has sufficient redundancy that almost no meaningful non-ASCII string in
   a legacy encoding happens to be well-formed UTF-8.

 - Otherwise you'll have to assume it is in the current locale's legacy
   encoding. (A sketch of this whole guessing order follows below.)
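
Here is a minimal Vala sketch of that guessing order (this being the Vala
list). The helper name detect_encoding is mine, not an existing API;
FileUtils.get_data, string.validate (which binds g_utf8_validate) and
GLib.get_charset are the usual GLib bindings:

    string detect_encoding (uint8[] data) {
        // 1. Byte-order marks.
        if (data.length >= 2 && data[0] == 0xff && data[1] == 0xfe)
            return "UTF-16LE";
        if (data.length >= 2 && data[0] == 0xfe && data[1] == 0xff)
            return "UTF-16BE";
        if (data.length >= 3 && data[0] == 0xef && data[1] == 0xbb
            && data[2] == 0xbf)
            return "UTF-8";

        // 2. No BOM: nobody uses UTF-16 without one, so only try UTF-8.
        unowned string text = (string) data;
        if (text.validate (data.length))
            return "UTF-8";

        // 3. Fall back to the current locale's encoding (g_get_charset).
        unowned string charset;
        GLib.get_charset (out charset);
        return charset;
    }

    void main (string[] args) {
        if (args.length < 2) {
            stderr.printf ("usage: %s FILE\n", args[0]);
            return;
        }
        uint8[] data;
        try {
            FileUtils.get_data (args[1], out data);
        } catch (FileError e) {
            stderr.printf ("%s\n", e.message);
            return;
        }
        stdout.printf ("%s\n", detect_encoding (data));
    }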

This is the algorithm used, for example, by vim, and it is probably the
most complete one available. Most programs simply assume all text from
outside is in the current locale's encoding unless told otherwise by the
user (of course vim can also be told manually).
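
Once you have a guess, GLib.convert (the function from the original
question) turns the bytes into UTF-8. A sketch under the same assumptions,
reusing the hypothetical detect_encoding helper from above:

    string read_as_utf8 (string filename) throws Error {
        uint8[] data;
        FileUtils.get_data (filename, out data);
        var encoding = detect_encoding (data);
        // GLib.convert works on raw bytes, so embedded nuls in UTF-16
        // input are fine; a leading BOM comes through as U+FEFF, strip
        // it afterwards if unwanted.
        size_t read, written;
        return GLib.convert ((string) data, data.length, "UTF-8", encoding,
                             out read, out written);
    }

The UTF-8 case simply round-trips through iconv, which keeps the function
short at the cost of one extra copy.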

Since all the iso-8859-*, cp* and similar, euc-jp and similar encodings
use the same ranges of bytes with different meanings, there is not even a
moderately reliable method for telling them apart.

I suppose something like picking the encoding that leads to the fewest
typos reported by a spellchecker, or to the character distribution closest
to normal for some language, would be the only way if you really needed to
guess the encodings of a large set of documents in mixed encodings.
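
A crude sketch of that last resort: try each candidate encoding and keep
the one that converts cleanly and yields the highest share of letters and
whitespace. The helper name and the scoring rule are made up for
illustration; a real scorer would compare against per-language letter
frequencies or a dictionary:

    string guess_legacy_encoding (uint8[] data, string[] candidates) {
        string best = candidates[0];
        double best_score = -1.0;
        foreach (unowned string enc in candidates) {
            string text;
            try {
                size_t read, written;
                text = GLib.convert ((string) data, data.length,
                                     "UTF-8", enc, out read, out written);
            } catch (ConvertError e) {
                continue;   // these bytes are not valid in this encoding
            }
            // Score: fraction of characters that are letters or spaces.
            int good = 0, total = 0;
            int i = 0;
            unichar c;
            while (text.get_next_char (ref i, out c)) {
                total++;
                if (c.isalpha () || c.isspace ())
                    good++;
            }
            double score = total > 0 ? (double) good / total : 0.0;
            if (score > best_score) {
                best_score = score;
                best = enc;
            }
        }
        return best;
    }

The candidate names are whatever iconv on the system accepts, e.g.
"ISO-8859-1", "ISO-8859-2" or "CP1250".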

-- 
                                                 Jan 'Bulb' Hudec <bulb ucw cz>


