Charset guessing & translation for the viewer

From: "bulia byak" <bulia dr com>
To: mc gnome org
Subject: Charset guessing & translation for the viewer
Date: Wed, 23 Oct 2002 16:42:09 -0500
It is convenient to be able to view files in arbitrary charsets in
your native display charset, with the source charset guessed and the
file reencoded accordingly on the fly. Here's how I implemented this
for Russian Cyrillic encodings. Perhaps it will be useful to
someone.

First, install enca: http://trific.ath.cx/software/enca/. This is a
wonderful piece of software that can correctly guess charsets of
text files in many languages and convert files to a specified
encoding (or the encoding of your locale).
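
As a quick orientation, here is a minimal sketch of the two enca
operations used in the rest of this message: printing the name of
the detected charset (-r) and converting to a target charset (-x).
The function names guess_charset and to_koi8r are made up for this
example and are not part of enca. Depending on your locale you may
also have to tell enca the language, e.g. with -L ru.

```shell
#!/bin/sh
# Hypothetical helper names; only the enca options come from this message.

# Print the charset name enca detects for file $1 (e.g. IBM866 or CP1251).
guess_charset() {
    enca -r "$1" 2> /dev/null
}

# Write the contents of file $1 to stdout, converted to koi8-r.
# Same command line as in the bindings rules in this message.
to_koi8r() {
    enca -c -x koi8-r < "$1"
}
```

For example, "guess_charset letter.txt" prints the guessed charset
and "to_koi8r letter.txt | less" shows the converted text.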

Then, edit your ~/.mc/bindings file. I assume that your display
charset is koi8-r. The most obvious solution is to add enca to the
default rule:

# Default target for anything not described above
default/*
    View=%view{ascii} enca -c -x koi8-r < %f

This works, but the problem is that it attempts to reencode _all_
files (except those handled by other rules in the bindings file) and
thus always pipes the result to the viewer. This may be slow for big
files or files on a network. The default view behavior is much faster
because it works with a file on disk, not a pipe, and can therefore
e.g. skip quickly to the end of the file when requested. Also, a
piped view cannot jump to a given line, e.g. when you view a file
from a "Find file" results panel.

The solution is to reencode only files that are known to use
Cyrillic charsets other than KOI8-R. It looks like this:

# recode everything to koi
type/^IBM866
    View=%view{ascii} enca -c -x koi8-r < %f

type/^CP1251
    View=%view{ascii} enca -c -x koi8-r < %f

# And leave the default rule alone:

# Default target for anything not described above
default/*
        Open=
        View=
        Drop=
        Title=%p

However, this won't work as is, because the type of a file is
determined by calling the "file" command, which lacks enca's ability
to guess charsets. Therefore we need to replace the standard file
command with our own version, which calls the standard "file" for
all files except text files, for which enca is used. Create the
following shell script:

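#!/bin/sh
# $1 is the "-L" option passed by mc, $2 is the file name.
STANDARDFILE=`/usr/bin/file "$1" "$2"`
if echo "$STANDARDFILE" | grep text > /dev/null
then
    # Text file: report the charset guessed by enca as the file type.
    CHARSET=`enca -r "$2" 2> /dev/null`
    echo "$2: $CHARSET text"
else
    echo "$STANDARDFILE"
fi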

Call it e.g. "myfile", make it executable, and put it in a directory
on your PATH. This script assumes that you have

#define FILE_L 1

in the config.h of your mc source, so the file command is
called from mc with the first argument of "-L" and the second 
argument being the file name. 
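
Before rebuilding mc, it may help to check by hand that the
wrapper's output would actually trigger the new rules. For type/
rules, mc removes the leading "filename:" part of the file output
and matches the regular expression against the rest. The sketch
below imitates that with sed and grep. The function name
matches_type is hypothetical, and the sketch assumes that the file
name itself contains no colon.

```shell
#!/bin/sh
# matches_type LINE REGEX
#   LINE  - one line of output as printed by "myfile -L somefile"
#   REGEX - the expression from a type/ rule, e.g. '^IBM866'
# Succeeds if the description (LINE without "name:") matches REGEX.
matches_type() {
    printf '%s\n' "$1" | sed 's/^[^:]*:[[:space:]]*//' | grep -q -- "$2"
}
```

For example:

    matches_type "`myfile -L letter.txt`" '^CP1251' && echo "would be recoded"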

Now, open src/ext.c from mc source, find and replace "file -L" with
"myfile -L" and recompile/reinstall mc. 

That's all. Now, only actually reencoded files are piped; all other
files are opened by the viewer directly. Also, you can now make
other uses of enca's capabilities through the standard syntax of
the bindings file.
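
The replacement in src/ext.c described above can also be scripted.
This is a minimal sketch, assuming the string "file -L" occurs in
that file exactly as described; the function name patch_file_cmd is
made up for this example. It keeps a .orig backup and does nothing
if the file already mentions "myfile -L", so running it twice is
harmless.

```shell
#!/bin/sh
# patch_file_cmd PATH_TO_EXT_C
# Replace every "file -L" with "myfile -L", keeping a backup in PATH.orig.
patch_file_cmd() {
    src=$1
    # Already patched: a second run would produce "mymyfile -L".
    if grep -q 'myfile -L' "$src"; then
        return 0
    fi
    cp "$src" "$src.orig" || return 1
    sed 's/file -L/myfile -L/g' "$src.orig" > "$src"
}
```

Run it from the top of the mc source tree as "patch_file_cmd
src/ext.c", then recompile and reinstall as usual.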

Of course it would be easier to do the above if the file command
name were stored as an editable option in ~/.mc/ini, not hardwired
into the source. In that case no recompilation of mc would be
necessary.

Also, what is still missing is the ability to use the built-in
editor to edit files in arbitrary charsets, so that they are
reencoded to your display charset on opening and back to the
original charset on saving. Perhaps someone will take the time to
implement this.
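
Until something like that exists inside mc, a rough approximation
is possible outside the editor. The sketch below is only an
illustration, not an mc feature: it asks enca for the iconv name of
the detected charset (-i), converts a copy to koi8-r with iconv,
runs $EDITOR on the copy, and converts the result back. The function
name edit_recoded is hypothetical. It assumes that iconv is
installed and that every character you type exists in the original
charset.

```shell
#!/bin/sh
# edit_recoded FILE
# Edit FILE in koi8-r, then store it again in its original charset.
edit_recoded() {
    file=$1
    # iconv-style name of the detected charset, e.g. CP1251.
    orig=`enca -i "$file" 2> /dev/null` || return 1
    [ -n "$orig" ] || return 1
    tmp=`mktemp` || return 1
    # Decode to the display charset; give up if iconv rejects the name.
    if iconv -f "$orig" -t KOI8-R "$file" > "$tmp"; then
        # Edit the copy, encode it back, and only then replace FILE.
        ${EDITOR:-vi} "$tmp" &&
        iconv -f KOI8-R -t "$orig" "$tmp" > "$file.new" &&
        mv "$file.new" "$file"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return $status
}
```

The original file is replaced only if the editor and both
conversions succeed; otherwise it is left as it was.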
