Re: Terminology concerning strings


On Mon, Apr 04, 2005 at 11:35:44AM +0200, Roland Illig wrote:

> * the _size_ of a string (as well as for other objects) is the number of
>   bytes that is allocated for it. For arrays, it is the number of
>   entries of the array. For strings it is at least _length_ + 1.
> * the _length_ of a string is the number of characters in it, excluding
>   the terminating '\0'.
> * the _width_ and _height_ of a string are the size of a box on the
>   screen that would be needed to display the string.

It seems to me that this terminology is not yet multibyte-aware. Since UTF-8
is becoming an everyday issue and, AFAIR, is planned for mainstream mc 4.7.0,
IMHO it is very important to establish clear terminology for it, even though
it is not yet officially implemented.


Byte and character are two completely different notions. What a byte means
is clear. A character is a human-visible entity, e.g. an accented letter,
and may be represented by one or more bytes. It should be clarified whether
a combining symbol (e.g. one that puts an accent on top of the previous
letter) counts as a character on its own or not. Pressing a letter on the
keyboard usually inserts one character, and a backspace/delete is supposed
to remove one character, not one byte.
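
To make the backspace point concrete, here is a minimal sketch in C. It is
not mc code: the name utf8_backspace is made up for illustration, and it
assumes the buffer holds valid, NUL-terminated UTF-8. It drops the last code
point instead of the last byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: remove the last code point from a NUL-terminated
   buffer that is assumed to contain valid UTF-8. Returns the new length
   in bytes. */
static size_t utf8_backspace(char *buf)
{
    size_t len;

    assert(buf != NULL);
    len = strlen(buf);
    if (len == 0)
        return 0;

    /* Step back over continuation bytes (10xxxxxx) until the byte that
       starts the last code point is reached. */
    len--;
    while (len > 0 && ((unsigned char) buf[len] & 0xC0) == 0x80)
        len--;

    buf[len] = '\0';
    return len;
}
```

Note that for a decomposed accent such as "e" followed by U+0301, this
removes only the combining mark and leaves the "e". Whether that is the
desired behaviour is exactly the question of whether a combining symbol is
a character on its own.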

Is the _length_ of a string the number of bytes in it or the number of
characters in it? If it is the number of bytes, then the second definition
(in the quoted part) should be corrected. If it is the number of characters,
then the last sentence of the first definition is not really meaningful:
size and length then have nothing to do with each other, and hence the
size >= length + 1 constraint is misleading (even though it isn't false,
supposing that every character takes at least one byte to represent).

Actually, what does "string" mean? Is it an arbitrary sequence of bytes,
terminated by the first zero byte in it, that we sometimes try to display
somehow, or is it a technical representation of human-readable text? These
two approaches might lead to completely different programming philosophies.
I recommend the latter, since it thinks in the terms that matter most for
the user interface: the meaning of the byte sequence rather than the raw
byte sequence itself. Another consequence is that under the second
definition the byte sequence must always be valid in one well-defined
character set (e.g. valid UTF-8), while the first definition also allows
invalid byte sequences that still have to be displayed somehow.

Furthermore, it should be emphasized that the width of a character is not
necessarily 1, so the number of bytes, number of characters and the width of
a string may be three completely different values.
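
The three values can be computed side by side, as in the sketch below. All
names here (str_bytes, str_chars, str_width, next_cp, cp_width) are
invented for illustration, the input is assumed to be valid UTF-8, and
cp_width is a deliberately simplified stand-in for a real width table: it
only knows a few combining and East Asian wide ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of bytes before the terminating '\0'. */
static size_t str_bytes(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Number of code points: every byte that is not a continuation byte
   (10xxxxxx) starts one. Assumes valid UTF-8. */
static size_t str_chars(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char) *s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Decode one code point and advance *sp. Assumes valid UTF-8. */
static uint32_t next_cp(const char **sp)
{
    const unsigned char *p = (const unsigned char *) *sp;
    uint32_t cp;
    int extra;

    if (p[0] < 0x80)      { cp = p[0];        extra = 0; }
    else if (p[0] < 0xE0) { cp = p[0] & 0x1F; extra = 1; }
    else if (p[0] < 0xF0) { cp = p[0] & 0x0F; extra = 2; }
    else                  { cp = p[0] & 0x07; extra = 3; }
    p++;
    while (extra-- > 0)
        cp = (cp << 6) | (uint32_t) (*p++ & 0x3F);
    *sp = (const char *) p;
    return cp;
}

/* Simplified width: 0 for combining diacritics, 2 for a few wide
   East Asian ranges, 1 otherwise. Not a complete table. */
static int cp_width(uint32_t cp)
{
    if (cp >= 0x0300 && cp <= 0x036F)
        return 0;
    if ((cp >= 0x1100 && cp <= 0x115F) || (cp >= 0x2E80 && cp <= 0xA4CF) ||
        (cp >= 0xAC00 && cp <= 0xD7A3) || (cp >= 0xFF00 && cp <= 0xFF60))
        return 2;
    return 1;
}

/* Number of screen columns needed to display s. */
static size_t str_width(const char *s)
{
    size_t w = 0;

    assert(s != NULL);
    while (*s != '\0')
        w += (size_t) cp_width(next_cp(&s));
    return w;
}
```

For "e" followed by the combining acute accent U+0301 this gives 3 bytes,
2 characters and width 1; for the two ideographs U+65E5 U+672C it gives
6 bytes, 2 characters and width 4.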

