Re: [Vala] how can I get the number of unicode points in a string?



Adam Dingle wrote:
> [...] Fetching the n-th character in a string is less often
> necessary, so it's OK for it to be less efficient.  In the rare case
> where you really do need random access to characters by index, you
> could always iterate over all characters in a string and store them
> in a unichar[] array for that purpose, or you could construct a data
> structure similar to the one you've outlined above.

Yes, converting the whole UTF8 string to a unichar[] is definitely a
better solution than using the offset array -- at least each UTF8
character would be decoded only once that way.  Where the user needs a
lot of random access, it may be worth the memory allocation and copying.
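
To make the decode-once idea concrete, here is a minimal sketch in
Python rather than Vala, with a plain list of integers standing in for
a unichar[].  The name to_code_points is made up for illustration, and
the input is assumed to be valid UTF8.

```python
def to_code_points(utf8_bytes):
    """Decode a UTF-8 byte string once into a list of code points.

    The list plays the role of a unichar[] array: one O(N) decoding
    pass up front, then O(1) access by character index afterwards.
    """
    return [ord(ch) for ch in utf8_bytes.decode("utf-8")]


text = "Uazú in Perú".encode("utf-8")
chars = to_code_points(text)

# Byte length and character count differ once non-ASCII text appears.
byte_length = len(text)
char_count = len(chars)
fourth_char = chars[3]   # direct index, no rescanning of the bytes
```

The cost is a second copy of the string at one integer per character,
which is the allocation and copying trade-off mentioned above.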

Where they are scanning from the start to the end, it is likely to be
more efficient to work from the UTF8 directly.

Basic operations on UTF8 strings which are quick (where N is the
length of the string):

- Get unichar at pointer, and advance pointer: O(1)

- Compare a fixed prefix-string at pointer (without decoding UTF8),
  and advance pointer if it matches: O(1) with respect to N (the cost
  depends only on the prefix length)

- Search for a fixed string within the string (without decoding UTF8):
  O(N)

You can do a lot with these basic operations.  Matching raw UTF8 bytes
is generally quicker than decoding characters, so testing for a given
character at the pointer location is likely to be quicker with a
prefix-string match than with a decode followed by an integer compare.
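
The three quick operations can be sketched as follows.  This is Python
working on raw bytes rather than Vala, a byte offset stands in for the
pointer, and the helper names decode_at and match_prefix are invented
for the example.  It assumes valid UTF8 and does no validation.

```python
def decode_at(buf, pos):
    """Decode one code point starting at byte offset pos.

    Returns (code_point, next_pos).  Assumes buf is valid UTF-8 and
    pos sits on a character boundary; nothing is validated.
    """
    lead = buf[pos]
    if lead < 0x80:            # 1-byte sequence (ASCII)
        return lead, pos + 1
    if lead < 0xE0:            # 2-byte sequence
        length, cp = 2, lead & 0x1F
    elif lead < 0xF0:          # 3-byte sequence
        length, cp = 3, lead & 0x0F
    else:                      # 4-byte sequence
        length, cp = 4, lead & 0x07
    for b in buf[pos + 1:pos + length]:
        cp = (cp << 6) | (b & 0x3F)   # 6 payload bits per continuation byte
    return cp, pos + length


def match_prefix(buf, pos, prefix):
    """Compare raw bytes at pos against prefix, without decoding.

    Returns the advanced position on a match, otherwise None.
    """
    if buf.startswith(prefix, pos):
        return pos + len(prefix)
    return None


buf = "señor ñandú".encode("utf-8")
enye = "ñ".encode("utf-8")

cp, next_pos = decode_at(buf, 2)       # decode the character at offset 2
matched = match_prefix(buf, 2, enye)   # same test, done as a byte compare
missed = match_prefix(buf, 0, enye)    # 's' is not 'ñ'
found = buf.find(enye, 4)              # byte search for the next 'ñ'
```

The byte search is safe because a valid UTF8 sequence can never match
starting in the middle of another character.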

Slow operations on UTF8 strings, to avoid:

- Fetch unichar at index measured in unichars: O(N) for each character
  fetch, so O(N*N) when done for every index in a loop
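
A rough illustration of why indexed fetches are slow, again in Python
on raw bytes rather than Vala.  The helper names are hypothetical and
the input is assumed to be valid UTF8.  Finding the i-th character
means walking from the start, whereas counting characters, or
collecting every character offset, needs only one pass.

```python
def is_lead_byte(b):
    """True for any byte that starts a UTF-8 character (not 10xxxxxx)."""
    return (b & 0xC0) != 0x80


def count_code_points(buf):
    """Count characters in one O(N) pass, without decoding them."""
    return sum(1 for b in buf if is_lead_byte(b))


def offset_of_index(buf, index):
    """Byte offset of the index-th character: an O(N) walk from the start."""
    seen = 0
    for offset, b in enumerate(buf):
        if is_lead_byte(b):
            if seen == index:
                return offset
            seen += 1
    raise IndexError(index)


buf = "Perú, Uazú".encode("utf-8")
n = count_code_points(buf)

# Slow pattern: every lookup rescans from byte 0, so the loop is O(N*N).
offsets_slow = [offset_of_index(buf, i) for i in range(n)]

# Same result from a single O(N) pass over the bytes.
offsets_fast = [off for off, b in enumerate(buf) if is_lead_byte(b)]
```

Both lists come out the same; only the amount of rescanning differs.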

Jim

-- 
 Jim Peters                  (_)/=\~/_(_)                 jim uazu net
                          (_)  /=\  ~/_  (_)
 Uazú                  (_)    /=\    ~/_    (_)                http://
 in Peru            (_) ____ /=\ ____ ~/_ ____ (_)            uazu.net


