Re: Glib::ustring and C++11 utf-8 literals



[Oops, sorry Chris, forgot to CC to the list]

Thank you for the explanation, it made many things clearer for me! So
when developing a desktop application, the best rule would be to use
gettext() almost everywhere, and on the rare occasions when that isn't
possible, to use u8 literals with unicode code points. I will have a
closer look at gettext, as I've never used it before.

Messing with the encoding of the source code seems problematic, and
I'd rather avoid it.

Thanks,
Dennis

On 2012-02-16 13:05, Chris Vine wrote:

On Thu, 16 Feb 2012 11:04:19 +0100
Dénes Almási <denes rudanium org> wrote:

Hi! As one can see on Wikipedia, C++11 offers the ability to create
UTF-8 string literals
(http://en.wikipedia.org/wiki/C%2B%2B11#New_string_literals [1]). Is it
possible to pass these safely to Glib::ustring when constructing them?
I suspect that the Glib::ustring::ustring(const char* src, size_type n)
constructor will do the job. Is this right?

All string literals are null terminated after conversion to the
execution character set, so you can pass one to the constructor taking
a const char*. The requirement of Glib::ustring is that this execution
character set must be UTF-8.
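
For example, something along these lines should work (an untested
sketch, assuming glibmm is set up and the execution character set is
indeed UTF-8):

    #include <glibmm/ustring.h>
    #include <iostream>

    int main()
    {
        // A u8 literal is a null-terminated array of char holding UTF-8
        // bytes, so the plain const char* constructor is sufficient; the
        // (const char*, size_type) overload only matters for data that
        // is not null-terminated.
        Glib::ustring greeting(u8"hello, world");
        std::cout << greeting << std::endl; // glibmm supplies operator<<
        return 0;
    }

With a pure-ASCII literal like this nothing can go wrong; the
interesting cases are the non-ASCII ones below.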

I have not done any playing about with C++11 string literals, but as I
understand it you should be OK with a string literal with the 'u8'
prefix, assuming the compiler is able to perform the conversion from
the source character set (what your code editor spits out) to this
execution character set (see §2.2/5: "Each source character set member
in a character literal or a string literal, as well as each escape
sequence and universal-character-name in a character literal or a
non-raw string literal, is converted to the corresponding member of the
execution character set"; and §2.14.5/7: "A string literal that begins
with u8, such as u8"asdf", is a UTF-8 string literal and is initialized
with the given characters as encoded in UTF-8").

The key to this is my "assuming the compiler ..." above: the problem
is that you have to let the compiler know what your source character
set is in order for it to perform this conversion. In the absence of
the appropriate switch, gcc assumes your source file is in your locale
encoding, which makes source files with non-ASCII string literals
non-portable unless you are importing whole unicode code points (\uXXXX
universal character names) into your u8 string (for which purpose the
u8 prefix is genuinely useful).
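
To make that concrete, here is a sketch of the code-point approach (not
tested; the point is that the source file stays pure ASCII, so its
encoding no longer matters):

    #include <glibmm/ustring.h>
    #include <iostream>

    int main()
    {
        // \u00E9 is LATIN SMALL LETTER E WITH ACUTE. With the u8 prefix
        // it is encoded as the UTF-8 bytes 0xC3 0xA9 whatever encoding
        // the source file happens to be saved in.
        Glib::ustring s(u8"caf\u00E9");
        std::cout << s << std::endl;
        return 0;
    }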

For more on gcc, see
http://gcc.gnu.org/onlinedocs/cpp/Character-sets.html and also the gcc
switch documentation:

"-finput-charset=charset: Set the input character set, used for
translation from the character set of the input file to the source
character set used by GCC. If the locale does not specify, or GCC
cannot get this information from the locale, the default is UTF-8.
This can be overridden by either the locale or this command line
option. Currently the command line option takes precedence if there's
a conflict. charset can be any encoding supported by the system's
iconv library routine."
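
By way of contrast, if you type the non-ASCII character directly into
the literal and save the file in some other encoding, you have to tell
gcc so; otherwise it guesses from the locale. A rough, untested sketch
of that situation:

    // sketch.cc -- imagine this file saved as ISO-8859-1, so the 'é'
    // in the literal below is the single byte 0xE9 on disk.
    #include <glibmm/ustring.h>

    int main()
    {
        // Compile with something like:
        //   g++ -std=c++0x -finput-charset=ISO-8859-1 sketch.cc \
        //       `pkg-config --cflags --libs glibmm-2.4`
        // so that gcc knows how to convert the literal to UTF-8.
        Glib::ustring s(u8"café");
        return 0;
    }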

If you use Windows, I believe VS uses Windows ANSI as its default
source encoding, but you would need to look that up if that is your
platform.

This makes it almost always better to pass in your string literals
programmatically, say via gettext().
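
For completeness, the gettext() route looks roughly like this (an
untested sketch; "myapp" and the locale directory are placeholders for
whatever your build system sets up):

    #include <libintl.h>
    #include <locale.h>
    #include <glibmm/ustring.h>

    #define _(String) gettext(String)

    int main()
    {
        setlocale(LC_ALL, "");
        bindtextdomain("myapp", "/usr/share/locale"); // placeholder
        bind_textdomain_codeset("myapp", "UTF-8");    // keep ustring happy
        textdomain("myapp");

        // The literal in the source stays plain ASCII; translators
        // supply any non-ASCII text via the .po/.mo catalogues.
        Glib::ustring msg(_("Hello, world"));
        return 0;
    }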

Chris



Links:
------
[1] http://en.wikipedia.org/wiki/C%2B%2B11#New_string_literals

