Re: mc and utf8

From: Egmont Koblinger <egmont uhulinux hu>
To: Rostislav Beneš <xbenes5 fi muni cz>
Cc: mc-devel gnome org
Subject: Re: mc and utf8
Date: Wed, 11 Apr 2007 18:22:38 +0200

Hi,

I'm very glad to hear that there's going to be some work on it!


> My work leader is author of the original utf-8 patch for mc.

Oh my God... Who is he? (Just out of curiosity.) There were lots of names:
Jakub Jelinek, Vladimir Nadvornik, Jindrich Novy - AFAIK they're all
involved in the current UTF-8 patches.


> I have read some old posts about this theme and source codes of mc, too.

I recommend reading the following two threads. Especially because I also
wrote my detailed opinion in both threads and I don't want to re-type them
:-))) so imagine they're #included here.

"Proposal for simplification" (2005 Sep-Oct) is (amongst others) about
possibly dropping support for one of ncurses and slang. If, after some
investigation, you think that dropping one of them would make your work
much easier, then this is probably the way to go.

"Request for discussion - how to make MC unicode capable" (2007 Feb-Mar)
contains lots of useful ideas.

Please also see my UTF-8 related patches at
https://svn.uhulinux.hu/packages/dev/mc/patches/


The most important goal I think is to get work accepted in mainstream mc.
This means we need clean code that is well-designed, modularized, easy to
understand, easy to verify, easy to modify/improve/fix. And of course that
works correctly :)


I think the first step should be to decide which scripts to support (or plan
future support for). This should probably include testing what commonly used
terminal emulators do in certain circumstances. Here's what I mean:

- Handling single-width characters is trivial.

- Handling double-width (CJK) shouldn't be hard, but there are some tricky
  questions arising. E.g. what to do if only the left half of a double-width
  character is visible in the rightmost column while editing a file? What
  if wordwrap mode is on and it should continue in the next row? I don't
  think terminal emulators support wrapping CJK characters (and it would
  make no sense actually) so probably some special symbol (e.g. "�") should
  be displayed at the end of the line if word wrapping is off, and if word
  wrapping is enabled then probably the whole new character should be
  wrapped to the next line. What if some CJK text (maybe a filename) needs
  to be printed in a smaller box, probably with a ~ in the middle, probably
  by cutting at its end...? I guess you'll need several helper functions
  similar to (but more complex than) my "00-70-utf8-common.patch".

- What to do with zero-width characters, including combining characters?
  Very few terminal emulators support them correctly (plain old xterm is one
  that does).
  Should we address supporting them on these terminals? (Or at least design
  mc now so that it can be added easily later, without a complete rewrite?)

- What to do with BIDI issues (Right-To-Left writing)? I don't know if there
  are terminal emulators out there at all that support RTL. But maybe mc
  could reverse these strings on its own and send them out without sending
  LTR or RTL marks so that eventually the user sees them correctly. Needless
  to say, this would make editing a line or a file much trickier. Maybe you
  should check whether emacs/vim support BIDI...

- How much support does ncurses or slang give to make these complicated
  things easier?
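
To make the double-width questions above a bit more concrete, here is a rough sketch (in Python rather than mc's C, just to keep it short) of the kind of helper I have in mind: it measures a string in terminal columns and shortens it with a ~ in the middle without ever cutting a double-width character in half. The names char_width, text_width, take_columns and fit_middle are made up for this example, and the width rules are a simplification based on the Unicode database, not what any particular terminal really does.

```python
import unicodedata

def char_width(ch):
    """Columns one character is assumed to take: 0, 1 or 2 (simplified rule)."""
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0                      # combining marks, format controls
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                      # CJK wide / fullwidth
    return 1

def text_width(s):
    """Total number of columns needed to print s."""
    return sum(char_width(ch) for ch in s)

def take_columns(chars, limit):
    """Take leading characters while they still fit into `limit` columns."""
    taken, used = [], 0
    for ch in chars:
        w = char_width(ch)
        if used + w > limit:
            break
        taken.append(ch)
        used += w
    return taken, used

def fit_middle(s, cols):
    """Return s shortened/padded to exactly `cols` columns, '~' in the middle."""
    if cols <= 0:
        return ""
    total = text_width(s)
    if total <= cols:
        return s + " " * (cols - total)
    budget = cols - 1                 # one column is reserved for '~'
    head, used_l = take_columns(s, (budget + 1) // 2)
    tail, used_r = take_columns(reversed(s), budget // 2)
    tail.reverse()
    # Drop combining marks that lost their base character at the cut.
    while tail and char_width(tail[0]) == 0:
        del tail[0]
    # A wide character that did not fit leaves a spare column: pad it.
    pad = " " * (budget - used_l - used_r)
    return "".join(head) + "~" + "".join(tail) + pad

print(repr(fit_middle("日本語テキスト.txt", 10)))
```

The point is that every cut is made in columns, not in bytes or characters, and a column left over by a wide character that didn't fit is filled with a space so the box stays aligned.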

The current version of mc with utf8 patches works well with single-width
characters, but behaves quite badly with CJK. In my experience so far, most
terminal emulators and applications handle double-width characters
correctly, but the other issues (zero-width, combining, bidi) still suffer
from plenty of bugs. So to me it seems a wise decision to address single-
and double-width characters, but not yet support the other tricks. (Of
course by "not supporting" them I mean that mc still does something
reasonable in these cases, e.g. prints the Unicode value within <> signs or
similar. It's not acceptable if the screen gets completely damaged or
something out of mc's control happens.)
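
A minimal sketch of that "do something reasonable" fallback, again in Python for brevity. The function names and the choice of which character classes count as unsupported are my assumptions for the example; a real implementation would have to agree with whatever width logic mc ends up using.

```python
import unicodedata

# Categories this sketch refuses to send to the terminal as-is:
# combining marks, format/control characters, surrogates, private use,
# unassigned code points.
UNSUPPORTED_CATEGORIES = {"Mn", "Me", "Cf", "Cc", "Cs", "Co", "Cn"}
# Bidi classes of right-to-left letters (Hebrew, Arabic, ...).
RTL_CLASSES = {"R", "AL"}

def is_plain(ch):
    """True if ch is an ordinary left-to-right single- or double-width char."""
    if unicodedata.category(ch) in UNSUPPORTED_CATEGORIES:
        return False
    if unicodedata.bidirectional(ch) in RTL_CLASSES:
        return False
    return True

def safe_render(s):
    """Replace every unsupported character by its code point in <> signs."""
    return "".join(ch if is_plain(ch) else "<U+%04X>" % ord(ch) for ch in s)

print(safe_render("a\u0301b"))   # combining acute accent after 'a'
```

This way the output consists only of characters whose on-screen width is predictable, so the screen can't get out of sync even if the terminal itself is buggy.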


Some more random pieces of ideas you might find useful:

There's something called "gnulib". I have absolutely no info on it, except
that I once sent a bug report to the findutils folks that case-insensitive
UTF-8 matching didn't work, and later they reported they were able to fix it
thanks to an upgraded gnulib. MC with utf8 patches also suffers from such a
problem: case-insensitive search in the viewer only works for non-accented
letters. Probably gnulib provides a nice function that could solve it.
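
To illustrate the symptom (this is not gnulib, only a Python stand-in for the idea): if case is folded byte by byte for ASCII only, accented letters never match, while folding the decoded characters works. The helper names are invented for this example.

```python
def ascii_lower(data: bytes) -> bytes:
    """Lowercase only A-Z, leaving every other byte alone."""
    return bytes(b + 32 if 65 <= b <= 90 else b for b in data)

def contains_ci_ascii(haystack: bytes, needle: bytes) -> bool:
    """Byte-level case-insensitive search: blind to non-ASCII letters."""
    return ascii_lower(needle) in ascii_lower(haystack)

def contains_ci(haystack: str, needle: str) -> bool:
    """Character-level case-insensitive search using Unicode case folding."""
    return needle.casefold() in haystack.casefold()

text = "ÁRVÍZTŰRŐ tükörfúrógép"
print(contains_ci_ascii(text.encode("utf-8"), "árvíztűrő".encode("utf-8")))
print(contains_ci(text, "árvíztűrő"))
```

The first call fails to find the word because the bytes of "Á" and "á" differ in a way that ASCII lowercasing doesn't touch; the second one finds it.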

In order to be able to view or edit half-text half-binary files and fully
work on them, you'll need string searching and regexp matching functions
that perfectly tolerate invalid byte sequences, but still find matches
within the valid parts. Maybe you should check whether there's an
already existing solution that you can use. (Maybe glibc's regex stuff,
maybe pcre... I don't know whether these work correctly on mixed text/binary
strings.)
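
One possible approach, sketched in Python under my own assumptions (I don't know whether glibc's regex or pcre behave like this): split the buffer into maximal runs of valid UTF-8, search each run separately, and report the hits as byte offsets into the original buffer. valid_runs and search_mixed are hypothetical names.

```python
import re

def valid_runs(data: bytes):
    """Yield (byte_offset, text) for each maximal run of valid UTF-8."""
    pos = 0
    while pos < len(data):
        try:
            yield pos, data[pos:].decode("utf-8")
            return
        except UnicodeDecodeError as err:
            # err.start/err.end are relative to the slice we tried to decode.
            if err.start > 0:
                yield pos, data[pos:pos + err.start].decode("utf-8")
            pos += err.end            # skip the invalid byte(s)

def search_mixed(data: bytes, pattern: str, flags=0):
    """Return (start, end) byte offsets of all matches inside valid text."""
    hits = []
    for offset, text in valid_runs(data):
        for m in re.finditer(pattern, text, flags):
            start = offset + len(text[:m.start()].encode("utf-8"))
            end = start + len(m.group().encode("utf-8"))
            hits.append((start, end))
    return hits

data = (b"\xff\xfe" + "árvíz".encode("utf-8") + b"\x80\x00"
        + "ÁRVÍZ tükör".encode("utf-8"))
print(search_mixed(data, "árvíz", re.IGNORECASE))
```

The obvious limitation is that a match can never span an invalid byte, and a pattern like "." doesn't match the garbage either. For a viewer that is probably what you want, but it's a design decision to make consciously.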

Currently mc with utf8 patches has a nasty bug that if a filename is invalid
utf8 and you copy it with F5, the newly created filename will have literal
question marks. My guess is that the shell pattern matching (a "*" by
default for the "source mask") might work incorrectly if invalid UTF-8 is
seen. One possible way to solve it is to use the encoding called UTF-8b, see
http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html , option "D".
However, as some conversion is needed, this encoding is only suitable for
relatively short strings such as filenames, not for file contents. And if
you have the functions I've outlined in the previous paragraphs, then they
may handle this case correctly.
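
For illustration, Python's "surrogateescape" error handler follows, as far as I know, the same idea as UTF-8b: every byte of a malformed sequence is mapped to a lone surrogate, so the original bytes can be restored exactly. The sketch below contrasts that with a lossy conversion; to_internal, to_disk and lossy are names I made up, and the sample filename is Latin-1/Latin-2 encoded on purpose.

```python
import fnmatch

def to_internal(raw: bytes) -> str:
    """Decode a filename, keeping invalid bytes as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")

def to_disk(name: str) -> bytes:
    """Encode back to bytes, restoring the original invalid bytes."""
    return name.encode("utf-8", errors="surrogateescape")

def lossy(raw: bytes) -> bytes:
    """What goes wrong without it: invalid bytes are replaced for good."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")

raw = b"\xe1rv\xedz.txt"              # "árvíz.txt" in Latin-1/Latin-2, not UTF-8
name = to_internal(raw)

print(fnmatch.fnmatchcase(name, "*.txt"))   # the mask still matches
print(to_disk(name) == raw)                 # round trip is exact
print(lossy(raw) == raw)                    # lossy conversion is not
```

With such a representation the "*" source mask can operate on ordinary character strings, and the copy gets exactly the same bytes in its name as the original.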



Good luck!


-- 
Egmont
Follow-Ups:
- Re: mc and utf8
  - From: Jindrich Novy
References:
- mc and utf8
  - From: Rostislav Beneš