Re: [Vala] how can I get the number of unicode points in a string?




On 2011/4/3 21:30, Adam Dingle wrote:

 On Sun, Apr 03, 2011 at 03:59:23PM +0800, 琉璃井 wrote:
 At 2011-04-03 16:06:32,"Luca Bruno"<lethalman88 gmail com>  wrote:

 On Sun, Apr 03, 2011 at 03:59:23PM +0800, 琉璃井 wrote:
 I see that since 0.11.0 vala string.length returns number of bytes
 rather than that of unicode characters, and string[i] returns only
 one byte. I wonder how to deal with east Asian character strings.
 There are other methods in string that deal with utf8. For example
 char_count() and next_char().
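
To make the byte/character distinction concrete, here is a minimal sketch. It assumes a Vala release in which string.length counts bytes (0.11.0 or later, as described above); the sample string is just an illustration.

```vala
// Byte length versus code point count for a UTF-8 string.
void main () {
    // Three CJK characters; each one takes 3 bytes in UTF-8.
    string s = "琉璃井";

    print ("bytes: %d\n", s.length);              // 9
    print ("characters: %d\n", s.char_count ());  // 3
}
```

So char_count() is the direct answer to "how many Unicode code points are in this string", while length is what you need for buffer sizes and byte offsets.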

 thank you.
 I found char_count(), get_char() and next_char() in the gtk+ documentation.
 It looks like these methods are not covered in the Vala tutorial and
 documentation. Is there something like string[i] for indexed access to
 UTF-8 characters? I couldn't find it in the docs.

 To get the i-th character, you could do this:

 str.get_char(str.index_of_nth_char(i));

 But the current string methods are designed for iteration by offsets,
 not characters. So you should *not* do this, which will be inefficient:

 for (int i = 0; i < str.char_count(); ++i) // don't do this
     str.get_char(str.index_of_nth_char(i));

 Instead, you want to iterate over the string using get_char() and
 next_char(). This is slightly inconvenient since these functions use
 pointers rather than integer offsets. In Vala trunk, Jürg has just
 committed a new method string.get_next_char() which will make it
 easier to iterate over strings:

 // in class string
 public bool get_next_char (ref int index, out unichar c);

 That isn't in any Vala release yet, though. (In the meantime, you
 might be able to copy and paste his implementation from glib-2.0.vapi
 in Vala trunk.)
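
Until get_next_char() is available, iteration with get_char() and next_char() might look like the sketch below. It relies on next_char() returning an unowned string that starts at the following character, and on get_char() returning 0 at the terminating NUL; count_chars is a hypothetical helper name used only for this illustration.

```vala
// Walk a UTF-8 string one character at a time using only
// get_char() and next_char().
int count_chars (string str) {
    int n = 0;
    // s is an unowned view into str that advances by one character
    // per step; the loop stops at the terminating NUL.
    for (unowned string s = str; s.get_char () != 0; s = s.next_char ()) {
        n++;
    }
    return n;
}

void main () {
    string str = "a琉璃井";
    for (unowned string s = str; s.get_char () != 0; s = s.next_char ()) {
        unichar c = s.get_char ();
        print ("U+%04X\n", (uint) c);
    }
    print ("%d characters\n", count_chars (str));
}
```

Each step costs only as much as decoding one character, so the whole loop is linear in the length of the string.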

 adam
I know get_char and next_char are used to reduce iteration overhead,
but there may be another convenient way to access a UTF-8 string
efficiently. After all, getting a byte from a string by offset is not
very reasonable, because people seldom need a single byte out of a whole
character.


The idea behind the API isn't to fetch the byte at an offset - instead,
you'll typically fetch the *character* at an offset, and then advance the
offset by the number of bytes in the character.  In other words, you could
iterate over the characters in a string using the new get_next_char() method
like this:

int i = 0;
unichar c;
while (str.get_next_char(ref i, out c))
  handle_character(c);

On each loop iteration, the offset (i) will increment by the size of the
character c.
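
As a fuller sketch of the loop above, the following program prints the byte offset and byte size of each character. It assumes a Vala version that ships string.get_next_char() with the signature quoted earlier; count_code_points is a hypothetical helper name.

```vala
// Count code points by letting get_next_char() advance a byte offset.
int count_code_points (string str) {
    int i = 0;      // byte offset, moved forward by get_next_char()
    unichar c;
    int n = 0;
    while (str.get_next_char (ref i, out c)) {
        n++;
    }
    return n;
}

void main () {
    string str = "a琉b";
    int i = 0;
    unichar c;
    while (true) {
        int start = i;  // offset of the character about to be read
        if (!str.get_next_char (ref i, out c)) {
            break;
        }
        print ("offset %d: %s (%d bytes)\n", start, c.to_string (), i - start);
    }
    print ("%d code points\n", count_code_points (str));
}
```

The offset i is always a byte offset, so it can be passed to other offset-based string methods such as get_char().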


Is it possible to design the string like this:
class string
{
    private unichar* buffer;
    private int* offset_array;
    ... ...
    public unichar operator [](const int i)
    {
        int offset=offset_array[i];
        return buffer[offset];
    }
}
offset_array stores the offset of each UTF-8 character by index. It is
initialized in the constructor or somewhere similar.
Then we can use string[index] with no iteration overhead.


That would add lots of overhead (both in time and space) for every string,
and would have limited benefit.  Iterating over characters in a string is a
common operation, and is both easy and efficient with the current API.
Fetching the n-th character in a string is less often necessary, so it's OK
for it to be less efficient.  In the rare case where you really do need
random access to characters by index, you could always iterate over all
characters in a string and store them in a unichar[] array for that purpose,
or you could construct a data structure similar to the one you've outlined
above.
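
For the rare random-access case, copying the characters into a unichar[] could be sketched as follows. It assumes string.get_next_char() is available; to_code_points is a hypothetical helper name, not part of the Vala string API.

```vala
// Decode a UTF-8 string once into an array of code points so that
// individual characters can be fetched by index afterwards.
unichar[] to_code_points (string str) {
    unichar[] result = {};
    int i = 0;      // byte offset into str
    unichar c;
    while (str.get_next_char (ref i, out c)) {
        result += c;
    }
    return result;
}

void main () {
    unichar[] chars = to_code_points ("琉璃井");
    // Indexing the array is constant time, unlike index_of_nth_char().
    print ("%d characters, third is %s\n",
           chars.length, chars[2].to_string ());
}
```

The array costs four bytes per character regardless of how compact the UTF-8 encoding was, which is the kind of space overhead that makes this a poor default for every string.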

adam

