Re: Terminology concerning strings



On Wed, Apr 06, 2005 at 12:36:26PM +0200, Leonard den Ottolander wrote:

> > > * the _size_ of a string (as well as for other objects) is the number of
> > >   bytes that is allocated for it. For arrays, it is the number of
> > >   entries of the array. For strings it is at least _length_ + 1.
> > > 
> > > * the _length_ of a string is the number of characters in it, excluding
> > >   the terminating '\0'.
> 
> > It seems to me that this terminology is not yet multibyte-aware. Since UTF-8
> > becomes an everyday issue and AFAIR is planned for mainstream mc 4.7.0, IMHO
> > it is very important to create a clear terminology for this even if it's not
> > yet officially implemented now.
> 
> It seems you haven't read Roland's post very well. He clearly
> differentiates between size (raw number of bytes) and length (number of
> characters represented on the screen). From discussions with him I know
> he writes this post explicitly with multibyte charsets in mind. "ecs" in
> ecssup.{c,h} stands for "extended charset".
> 
> Or am I missing your point?

No, it seems that I missed Roland's point.

Roland says that size >= length + 1. Just to clarify things: I guess there
are two completely different reasons why size can be strictly greater than
length + 1.

a) One can allocate a larger buffer than strlen + 1. For example, after
x = malloc(10); strcpy(x, "asdf"); the length is 4 and the size is 10.
Or is size == 5 in this case?

b) Each multibyte character (e.g. any accented letter in UTF-8) counts as 1
for length, but as at least 2 for size.


Am I right?



-- 
Egmont


