RE: Glib::ustring tradeoffs?

From: "Foster, Gareth" <gareth foster siemens com>
To: Chris Vine <chris cvine freeserve co uk>, gtkmm-list gnome org
Cc: Matthias Kaeppler <matthias finitestate org>
Subject: RE: Glib::ustring tradeoffs?
Date: Mon, 31 Oct 2005 09:17:52 +0000

> UTF-8 represents Unicode characters by a series of bytes, of between 1
> and 6 bytes in length - true ASCII characters (of value less than 128)
> are also valid UTF-8 and represented by 1 byte, and all other characters
> are represented by more than one byte.  You can put any char value you
> want (including null characters and UTF-8 byte sequences) into a
> std::string object.  UTF-8 is just another series of bytes as far as a
> std::string object is concerned, as is any other byte-based encoding
> such as ISO8859-1.
> 
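
To make the byte-level view above concrete, here is a minimal sketch that
uses only the standard library. The names utf8_sequence_length() and
make_sample() are hypothetical helpers made up for illustration, not part
of glibmm or GLib. The sketch assumes the RFC 3629 limit of four bytes per
character, whereas older definitions of UTF-8 allow up to six.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes in the UTF-8 sequence introduced by
// `lead`, judged from its bit pattern only, or 0 if `lead` is a continuation
// byte or otherwise cannot start a sequence. Assumes the RFC 3629 limit of
// four bytes; it does not reject overlong encodings.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx or 11111xxx
}

// std::string stores raw bytes: an embedded null character and a two-byte
// UTF-8 sequence are both just chars as far as size() is concerned.
std::string make_sample() {
    // 'a', a null character, then U+00E9 encoded as the bytes 0xC3 0xA9.
    return std::string("a\0" "\xC3\xA9", 4);
}
```

With these definitions, make_sample().size() is 4 even though the string
holds only three characters, which is the sense in which std::string sees
"just another series of bytes".
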
> A Glib::ustring object stores its UTF-8 contents as a series of bytes in
> the same way that a std::string object does (in fact, it contains a
> std::string object for that purpose).  The main difference between a
> std::string object and a Glib::ustring object is that the Glib::ustring
> object counts its size, iterates and indexes itself with operator[]() by
> reference to whole Unicode characters rather than bytes - operator[]()
> will return an entire Unicode (gunichar) character for the index rather
> than a byte, as will dereferencing a Glib::ustring iterator.  It can
> also search by reference to a Unicode (gunichar) character, and a
> Unicode (gunichar) character can be inserted into it (for that purpose
> the character will be converted into the equivalent UTF-8 byte
> representation and then inserted in the underlying std::string object).
> 
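
The difference between counting bytes and counting characters, and the
conversion of a single character into UTF-8 bytes before insertion, can be
sketched with plain std::string. This is not how Glib::ustring is
implemented. count_code_points() and append_utf8() are hypothetical names,
char32_t stands in for gunichar, and the encoder assumes the RFC 3629 range
(up to U+10FFFF, no surrogates).

```cpp
#include <cstddef>
#include <string>

// Count characters rather than bytes by skipping continuation bytes
// (10xxxxxx). Assumes `s` already holds valid UTF-8.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s)
        if ((b & 0xC0) != 0x80) ++n;
    return n;
}

// Append one Unicode character to `out` as UTF-8 bytes. Returns false and
// leaves `out` untouched for surrogates and values above U+10FFFF.
bool append_utf8(std::string& out, char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}
```

Appending U+00E9 to "caf" this way gives a string whose size() is 5 but
whose character count is 4, which is the gap that Glib::ustring hides
behind its character-based interface.
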
> In many applications this extra functionality is irrelevant and using a
> std::string object for storing and manipulating UTF-8 byte sequences
> will be fine and have less overhead.  In addition, if you try to
> manipulate a Glib::ustring object after putting an invalid UTF-8 byte
> sequence into it the Glib::ustring object will be in an undefined state,
> so you need to know that what you are putting into it is valid.  (You
> can check this before manipulating it with Glib::ustring::validate().)
> 
> You can check whether a std::string object contains valid UTF-8 with
> g_utf8_validate(), and extract a Unicode character from the byte stream
> it contains with Glib::get_unichar_from_std_iterator(), so you can take
> your choice between using std::string or Glib::ustring depending on your
> needs.
> 
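
For anyone curious what "validate first, then extract characters" involves
at the byte level, here is a rough standard-library sketch. It is not a
substitute for g_utf8_validate() or Glib::get_unichar_from_std_iterator(),
and it may differ from them in edge cases. decode_utf8() and
is_valid_utf8() are hypothetical names, and the checks assume RFC 3629
rules (at most four bytes, no overlong forms, no surrogates).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode the character that starts at s[pos]. On
// success, store it in `cp`, advance `pos` past it and return true. On
// malformed input return false, leave `pos` unchanged, and leave `cp`
// unspecified.
bool decode_utf8(const std::string& s, std::size_t& pos, char32_t& cp) {
    if (pos >= s.size()) return false;
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t len;
    char32_t min;  // smallest value that legitimately needs `len` bytes
    if (lead < 0x80)                { len = 1; cp = lead;        min = 0; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
    else return false;              // stray continuation byte or bad lead
    if (s.size() - pos < len) return false;  // sequence is truncated
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char c = static_cast<unsigned char>(s[pos + k]);
        if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
    }
    // Reject overlong encodings, out-of-range values and surrogates.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;
    pos += len;
    return true;
}

// A string is accepted only if every byte belongs to a well-formed sequence.
bool is_valid_utf8(const std::string& s) {
    std::size_t pos = 0;
    char32_t cp = 0;
    while (pos < s.size())
        if (!decode_utf8(s, pos, cp)) return false;
    return true;
}
```

The intended pattern is to call is_valid_utf8() once on incoming data and
only then walk the string with decode_utf8(), which mirrors checking with
g_utf8_validate() before extracting characters.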

That was very informative, Chris, thanks. In fact, it would make a nice
introduction to Glib::ustring in the gtkmm book, methinks (assuming there
isn't a better one already).

Gaz
