Re: Glib::ustring tradeoffs?

From: Chris Vine <chris cvine freeserve co uk>
To: gtkmm-list gnome org
Cc: Matthias Kaeppler <matthias finitestate org>
Subject: Re: Glib::ustring tradeoffs?
Date: Sat, 29 Oct 2005 00:59:55 +0100

On Friday 28 October 2005 13:00, Matthias Kaeppler wrote:

> Let's say I have a filename named "übung1.txt" (Note the umlaut--if your
> newsreader can display it hehe).
> Will this filename make trouble with std::string, or be lost/replaced
> when converting to Unicode?

UTF-8 represents each Unicode character as a sequence of between 1 and 4 
bytes (the original definition allowed up to 6, but RFC 3629 restricts it 
to 4).  True ASCII characters (of value less than 128) are also valid UTF-8 
and are represented by 1 byte; all other characters are represented by more 
than one byte.  You can put any char value you want (including null 
characters and UTF-8 byte sequences) into a std::string object.  UTF-8 is 
just another series of bytes as far as a std::string object is concerned, 
as is any other byte-based encoding such as ISO 8859-1.  A std::string will 
therefore hold the bytes of your filename unchanged, whatever encoding they 
are in.
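
To make the byte view concrete, here is a small sketch using only the standard library. The function names are made up for illustration, and the filename bytes are written out as hex escapes so the result does not depend on how the source file itself is encoded.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// The filename from the question, spelled out as UTF-8 bytes.
// U+00FC (u with umlaut) is the two bytes 0xC3 0xBC in UTF-8. The literal is
// split so that the 'b' of "bung" is not read as part of the hex escape.
inline std::string utf8_filename() {
    return std::string("\xc3\xbc" "bung1.txt");
}

// The same name in ISO 8859-1, where the umlaut is the single byte 0xFC.
inline std::string latin1_filename() {
    return std::string("\xfc" "bung1.txt");
}

// std::string counts and indexes bytes, whatever the encoding.
inline void byte_view_demo() {
    const std::string u = utf8_filename();
    const std::string l = latin1_filename();
    assert(u.size() == 11);  // 2 bytes for the umlaut + 9 ASCII bytes
    assert(l.size() == 10);  // 1 byte for the umlaut + 9 ASCII bytes
    assert(static_cast<unsigned char>(u[0]) == 0xC3);  // first byte only
    assert(u.substr(2) == "bung1.txt");  // a byte offset, not a character offset

    // Embedded null characters are stored like any other byte.
    const std::string with_nul("a\0b", 3);
    assert(with_nul.size() == 3);
}
```

Nothing is lost or replaced in either string; the only thing std::string does not know is where one character ends and the next begins.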

A Glib::ustring object stores its UTF-8 contents as a series of bytes in the 
same way that a std::string object does (in fact, it contains a std::string 
object for that purpose).  The main difference between the two is that a 
Glib::ustring object counts its size, iterates, and indexes itself with 
operator[]() by reference to whole Unicode characters rather than bytes: 
operator[]() returns an entire Unicode (gunichar) character for the index 
rather than a byte, as does dereferencing a Glib::ustring iterator.  It can 
also search for a Unicode (gunichar) character, and a Unicode (gunichar) 
character can be inserted into it (for that purpose the character is 
converted into the equivalent UTF-8 byte representation and then inserted 
into the underlying std::string object).
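
As a rough illustration of the character-level bookkeeping described above, the following standard-library sketch counts whole characters in a UTF-8 std::string and appends a character by converting it to its UTF-8 bytes. It is not how Glib::ustring is implemented; count_code_points and append_code_point are hypothetical helpers, char32_t stands in for gunichar, and the input is assumed to be valid.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of Unicode characters in a *valid* UTF-8 string.
// Every character contributes exactly one byte that is not a continuation
// byte (continuation bytes have the bit pattern 10xxxxxx).
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Hypothetical helper: append one character as UTF-8 bytes.
// Assumes cp is a valid Unicode scalar value (at most 0x10FFFF, not a surrogate).
inline void append_code_point(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Builds the filename from the question one step at a time.
inline void character_view_demo() {
    std::string name;
    append_code_point(name, 0xFC);        // the umlaut, U+00FC
    name += "bung1.txt";
    assert(name.size() == 11);            // bytes, as std::string sees it
    assert(count_code_points(name) == 10);  // characters, as a ustring would count
}
```

The two counts differ by one because the umlaut takes two bytes; that gap between byte positions and character positions is what the extra bookkeeping in Glib::ustring accounts for.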

In many applications this extra functionality is irrelevant, and using a 
std::string object for storing and manipulating UTF-8 byte sequences will be 
fine and have less overhead.  In addition, if you try to manipulate a 
Glib::ustring object after putting an invalid UTF-8 byte sequence into it, 
the object will be in an undefined state, so you need to know that what you 
are putting into it is valid.  (You can check this before manipulating it 
with Glib::ustring::validate().)

You can check whether a std::string object contains valid UTF-8 with 
g_utf8_validate(), and extract a Unicode character from the byte stream it 
contains with Glib::get_unichar_from_std_iterator(), so you can take your 
choice between using std::string or Glib::ustring depending on your needs.
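
For a sense of what validating and extracting a character involves when you stay with std::string, here is a standard-library sketch. It is a stand-in for illustration only, not the implementation of g_utf8_validate() or Glib::get_unichar_from_std_iterator(); decode_one and is_valid_utf8 are hypothetical names. It follows the 4-byte limit of RFC 3629 and treats an embedded null byte as valid, which may differ from what the GLib function does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode the character that starts at byte offset pos.
// On success, stores it in cp, moves pos past the sequence and returns true.
// Returns false for a bad lead byte, a truncated or overlong sequence, a
// surrogate, or a value above 0x10FFFF; cp is unspecified in that case.
inline bool decode_one(const std::string& s, std::size_t& pos, char32_t& cp) {
    if (pos >= s.size()) return false;
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t len = 0;
    char32_t min = 0;  // smallest value that legitimately needs len bytes
    if (lead < 0x80)                { len = 1; min = 0;       cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; min = 0x80;    cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; min = 0x800;   cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; min = 0x10000; cp = lead & 0x07; }
    else return false;  // stray continuation byte or invalid lead byte

    if (s.size() - pos < len) return false;  // sequence runs off the end
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[pos + i]);
        if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
    }
    if (cp < min) return false;                       // overlong encoding
    if (cp > 0x10FFFF) return false;                  // beyond Unicode
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
    pos += len;
    return true;
}

// Hypothetical helper: true if the whole string decodes cleanly.
inline bool is_valid_utf8(const std::string& s) {
    std::size_t pos = 0;
    char32_t cp = 0;
    while (pos < s.size()) {
        if (!decode_one(s, pos, cp)) return false;
    }
    return true;
}

inline void validation_demo() {
    const std::string utf8("\xc3\xbc" "bung1.txt");  // UTF-8 umlaut
    const std::string latin1("\xfc" "bung1.txt");    // ISO 8859-1 umlaut
    assert(is_valid_utf8(utf8));
    assert(!is_valid_utf8(latin1));  // 0xFC cannot start a UTF-8 sequence

    std::size_t pos = 0;
    char32_t cp = 0;
    assert(decode_one(utf8, pos, cp) && cp == 0xFC && pos == 2);
}
```

A check of this kind is what tells you whether a filename handed to you as bytes is safe to pass on to code that assumes UTF-8.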

Chris
