byte vs. character approach [was: Terminology concerning strings]



Hi all,

> According to
> http://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html
> wchar_t on GNU systems is 4 bytes by default. Internal representation of
> multibyte strings always uses fixed widths or something like x[3] wouldn't
> work (without scanning the string). So in case x in the above example is a
> wchar_t you overflow the buffer nicely ;) .

As I see it, this is a completely different approach to the whole situation
from the one the current UTF-8 hack patchset takes.

The current UTF-8 patchset still _thinks_ in _bytes_, but tries to correctly
display them using UTF-8 or whatever the current locale is.

Using wchar_t all over the source gives me the feeling that this approach
wants mc to _think_ in _characters_.

I'm not sure at all that this is the right way to go for a file manager and
text editor.

Unix philosophy says filenames are sequences of bytes (as opposed to Windows,
which says filenames are sequences of characters). Whenever you use a
multibyte locale, you might face filenames that are not valid according to
that locale. They are still perfectly valid filenames on the system; they
just cannot be displayed with your current locale, though they might be fine
with another one. I expect a file manager to handle these kinds of files
without a problem. Hence filenames should be handled as byte sequences, and
mc should do its best to display them as well as possible; but even if it
cannot display a name correctly and has to show some question marks, it
should still be perfectly able to remove, rename, or edit the file, invoke an
external command on it, etc. Typing a command and using Esc+Enter to put the
filename onto the command line should also work. So the name should be
converted from the original byte stream to anything else only for display
purposes, while the original byte stream is what mc keeps in memory.
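
Just to illustrate what I mean (a rough sketch of my own, not actual mc code;
the function name display_name and all the details are made up): keep the
filename bytes untouched, and build a separate string only for the screen,
replacing whatever is invalid in the current locale with '?':

/* Sketch: keep the filename as the original byte string and build a
 * separate string only for display.  Bytes that are invalid in the
 * current locale are shown as '?', but the original bytes are what
 * gets passed to rename(), unlink(), etc. */
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Returns a newly allocated display copy of 'name'; the caller keeps
 * using 'name' itself for all file system operations. */
static char *display_name(const char *name)
{
    size_t len = strlen(name);
    char *out = malloc(len + 1);        /* one '?' per invalid byte never grows it */
    size_t i = 0, o = 0;
    mbstate_t st;

    if (out == NULL)
        return NULL;
    memset(&st, 0, sizeof st);
    while (i < len) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, name + i, len - i, &st);
        if (n == (size_t)-1 || n == (size_t)-2) {
            out[o++] = '?';             /* invalid or incomplete sequence */
            memset(&st, 0, sizeof st);  /* resynchronize and skip one byte */
            i++;
        } else if (n == 0) {
            break;                      /* embedded NUL, cannot happen here */
        } else {
            memcpy(out + o, name + i, n);   /* valid sequence: copy as-is */
            o += n;
            i += n;
        }
    }
    out[o] = '\0';
    return out;
}

int main(void)
{
    setlocale(LC_ALL, "");              /* honor the user's locale */
    const char *raw = "caf\xE9.txt";    /* Latin-1 bytes, invalid in UTF-8 */
    char *shown = display_name(raw);
    printf("display: %s\n", shown);     /* e.g. "caf?.txt" in a UTF-8 locale */
    /* rename(raw, ...) would still receive the original bytes */
    free(shown);
    return 0;
}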

Similar things happen e.g. with file editing. Suppose I receive a large
English text file, find a typo and want to fix it. I do it in mcedit and then
save the file. I didn't even realize that the file also contained some French
words encoded in Latin-1, while my whole system is set to UTF-8. mcedit must
save the file leaving the original Latin-1 accents untouched, no matter that
they are not valid UTF-8. It is definitely a bug if these characters
disappear from the file or if mc otherwise fails to handle them.

Actually, will mcedit be able to edit UTF-8 encoded files inside a Latin-1
terminal? Or edit Latin-1 files inside a UTF-8 terminal? Will mc be able to
assume UTF-8 filenames while the terminal is Latin-1? ...


I recommend that everyone take a look at the 'joe' text editor, version 3.1
or 3.2, to see how it handles charsets. I don't mean looking at the
implementation, only at the user-visible behavior of the software. IMHO this
is the way things have to work.

'joe' treats the file being edited as a byte stream, always. It learns the
behavior of the terminal from the locale settings; this is not overridable in
joe, which is a perfect decision (as opposed to vim), since this is exactly
what the locale environment variables are for. The encoding assumed for a
file defaults to the current locale, but you can easily change it at any time
by pressing ^T E. Changing this assumed character set does not change
anything in the file; it only changes the way the file is displayed on the
screen, which bytes a keypress inserts, how many bytes a backspace, delete or
overtype removes, etc. Obviously, byte sequences that are invalid in the
selected charset are displayed with some special symbol, maybe in a special
color. This whole approach guarantees that joe can edit files of arbitrary
encodings over arbitrary terminals, and at the same time it is still binary
safe and keeps the byte sequence unchanged even if it is not valid according
to the assumed character set.
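
To illustrate the kind of display-only conversion I'm talking about (again a
sketch of my own, not joe's code; render_line and every detail are invented):
the edit buffer stays an untouched byte array, and only the copy sent to the
screen is converted, with iconv, from the assumed charset to the terminal
charset taken from the locale. Bytes that do not convert are rendered as a
placeholder instead of being altered or dropped:

#include <errno.h>
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Convert 'buf' for display only; 'assumed' is whatever charset the
 * user selected for this file, the target is the terminal's charset. */
static void render_line(const char *buf, size_t len, const char *assumed)
{
    iconv_t cd = iconv_open(nl_langinfo(CODESET), assumed);
    if (cd == (iconv_t)-1)
        return;

    char *in = (char *)buf;
    size_t inleft = len;
    while (inleft > 0) {
        char outbuf[256];
        char *out = outbuf;
        size_t outleft = sizeof outbuf;
        size_t r = iconv(cd, &in, &inleft, &out, &outleft);
        fwrite(outbuf, 1, sizeof outbuf - outleft, stdout);
        if (r == (size_t)-1) {
            if (errno == EILSEQ || errno == EINVAL) {
                fputc('?', stdout);   /* placeholder on screen only */
                in++;                 /* skip one offending byte ...      */
                inleft--;             /* ... the buffer itself is untouched */
            }
            /* E2BIG: just loop again with a fresh output chunk */
        }
    }
    fputc('\n', stdout);
    iconv_close(cd);
}

int main(void)
{
    setlocale(LC_ALL, "");            /* terminal charset comes from the locale */
    /* Example: a Latin-1 line displayed on whatever terminal we have. */
    const char line[] = "na\xEFve caf\xE9";
    render_line(line, sizeof line - 1, "ISO-8859-1");
    return 0;
}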

As opposed to joe, take a look at Gnome and KDE, especially KDE and its
bugzilla, to see how many bug reports they have about accented filenames. The
whole KDE system thinks of filenames as sequences of human-readable
characters and hence it usually fails to handle out-of-locale filenames.

Just think how many complaints and bug reports you would receive when someone
uses a modern Linux system with its default UTF-8 locale, recursively
downloads some stuff from an ftp server, and then blames mc-4.7.0 because it
cannot cope with these filenames (whoops, they're in Latin-1): cannot access,
delete, rename them, etc. These users would have to quit to the shell to
rename them properly, which means that mc fails to perform one of its most
basic jobs. I hope this won't happen.


So while the "thinking in characters" approach is the better one for most
desktop applications, I'm pretty sure that for file managers like mc and text
editors like mcedit, "thinking in bytes" is the right way to go, converting
the byte stream solely for display purposes.



-- 
Egmont


