Re: Glib::ustring and C++11 utf-8 literals



On Thu, 16 Feb 2012 11:04:19 +0100
Dénes Almási <denes rudanium org> wrote:
> Hi! 
> 
> As one can see on Wikipedia, C++11 offers the ability to create
> utf-8 string literals
> (http://en.wikipedia.org/wiki/C%2B%2B11#New_string_literals).
> Is it possible to pass these safely to Glib::ustring when
> constructing them?
> 
> 
> It is suspicious that the
> Glib::ustring::ustring(const char* src, size_type n)
> constructor will do the job. Is this right?

All string literals are null terminated after conversion to the
execution character set, so you can pass one to the constructor taking
just a const char*, with no length argument.  The requirement of
Glib::ustring is that this execution character set must be UTF-8.

I have not experimented with C++11 string literals, but as I
understand it you should be OK with a string literal with the 'u8'
prefix, assuming the compiler is able to perform the conversion from
the source character set (whatever your code editor writes out) to this
execution character set (see §2.2/5: "Each source character set member in a
character literal or a string literal, as well as each escape sequence
and universal-character-name in a character literal or a non-raw string
literal, is converted to the corresponding member of the execution
character set"; and §2.14.5/7: "A string literal that begins with u8,
such as u8"asdf", is a UTF-8 string literal and is initialized with the
given characters as encoded in UTF-8").
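
To illustrate the null-termination point with the standard's own
u8"asdf" example, here is a minimal sketch.  It uses std::string as a
stand-in so that it builds without glibmm; as_char_ptr and
literal_size are hypothetical helpers of mine, not glibmm API.  The
cast is only there because C++20 changed u8 literals to char8_t; under
C++11 it converts a const char* to itself.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper (not glibmm): view a narrow or u8 literal as const char*.
template <typename CharT, std::size_t N>
const char* as_char_ptr(const CharT (&literal)[N]) {
    static_assert(sizeof(CharT) == 1, "expects a narrow or u8 string literal");
    return reinterpret_cast<const char*>(literal);
}

// Hypothetical helper: array size of the literal, terminating null included.
template <typename CharT, std::size_t N>
std::size_t literal_size(const CharT (&)[N]) {
    return N;
}

std::string greeting() {
    // The literal is null terminated, so no length argument is needed.
    // With glibmm available, the same pointer could be handed to the
    // const char* constructor instead: Glib::ustring s(p);
    const char* p = as_char_ptr(u8"asdf");
    return std::string(p);
}
```

Here literal_size(u8"asdf") counts the four characters plus the
terminating null, while std::strlen on the same pointer stops at that
null.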

The key to this is my "assuming the compiler ..." above: you have to
let the compiler know what your source character set is in order for it
to perform this conversion.  In the absence of the appropriate switch,
gcc will assume your source file is in your locale encoding, which
makes source files with non-ASCII string literals non-portable unless
you write the non-ASCII characters as universal-character-names
(\uXXXX escapes) in your u8 string, for which purpose the u8 prefix is
genuinely useful.
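
As a sketch of that last case: the source below is pure ASCII, so the
compiler's guess about the source encoding no longer matters.  The
function names and the helper u8_as_char are hypothetical, and
std::string again stands in for Glib::ustring so that it builds without
glibmm.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not glibmm): view a u8 literal as const char*.
// Needed in C++20, where u8 literals are char8_t; harmless in C++11.
template <typename CharT, std::size_t N>
const char* u8_as_char(const CharT (&literal)[N]) {
    static_assert(sizeof(CharT) == 1, "expects a narrow or u8 string literal");
    return reinterpret_cast<const char*>(literal);
}

// U+00E9 (e with acute accent) written as a universal-character-name.
// The u8 prefix asks the compiler to encode it as UTF-8.
std::string name_from_ucn() {
    return std::string(u8_as_char(u8"D\u00E9nes"));
}

// The same text with the UTF-8 bytes of U+00E9 (0xC3 0xA9) spelled out
// by hand.  The literal is split so the hex escape cannot run on.
std::string name_from_bytes() {
    return std::string("D\xC3\xA9" "nes");
}
```

Both functions should return the same six bytes, and either result
could be passed to Glib::ustring's const char* constructor.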

For more on gcc, see
http://gcc.gnu.org/onlinedocs/cpp/Character-sets.html and also the gcc
switch documentation:

  "-finput-charset=charset:  Set the input character set, used for
  translation from the character set of the input file to the source
  character set used by GCC. If the locale does not specify, or GCC
  cannot get this information from the locale, the default is UTF-8.
  This can be overridden by either the locale or this command line
  option. Currently the command line option takes precedence if there's
  a conflict. charset can be any encoding supported by the system's
  iconv library routine."

If you use Windows, I believe Visual Studio uses the Windows ANSI code
page as its default source encoding, but you would need to look it up
if that is your platform.

This makes it almost always better to obtain your non-ASCII strings at
run time, say via gettext(), rather than embedding them in the source
as string literals.

Chris

