Re: Unicode and C++

From: Nathan Myers <ncm nospam cantrip org>
To: gtk-i18n-list redhat com
Cc: hp redhat com
Subject: Re: Unicode and C++
Date: Mon, 10 Jul 2000 18:17:46 -0700

Owen Taylor <otaylor@redhat.com> wrote:
> Nathan Myers wrote:
> > There is no problem with wide character string literals in C or C++, 
> > and hasn't been for a very long time.  Simply prefix with a letter 'L'.
> > 
> >   wchar_t hello[] = L"hello, world";
> 
> Perhaps it wouldn't be if there was agreement on:
> 
>  - The encoding for source code files. (It certainly shouldn't
>    be dependent on the user's locale.)

All current gettext usage I know of assumes the "source encoding" --
the encoding of literal strings -- is ASCII.  I doubt this assumption
has caused any portability problems for GNOME or Pango thus far, or 
will.

>  - The width of wchar_t

Again, this is 32 bits on all interesting targets.  (Under Win32
it tends to be 16 bits, but that's the least of Win32's problems.)
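
For anyone who would rather check the width than take it on faith, here is a minimal sketch. The helper names (wchar_bits, wchar_holds_any_code_point, check_wchar_width) are made up for illustration, and the 0x10FFFF upper bound is my assumption about the largest code point worth storing; it is not something the C or C++ standard says about wchar_t.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <limits>

// Width of wchar_t in bits on this implementation; the standard does not fix it.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: true when a single wchar_t can hold any code point
// up to 0x10FFFF (assumed upper bound, see the note above).
constexpr bool wchar_holds_any_code_point() {
    return static_cast<unsigned long>(std::numeric_limits<wchar_t>::max())
           >= 0x10FFFFUL;
}

// Sanity checks that should hold on any conforming implementation.
inline void check_wchar_width() {
    assert(wchar_bits() >= CHAR_BIT);
    // A 16-bit wchar_t (as on Win32) cannot hold every code point in one unit.
    assert(wchar_bits() > 16 || !wchar_holds_any_code_point());
}
```

On a target with a 32-bit wchar_t, wchar_bits() should report 32; on Win32 it would presumably report 16 and the second helper would return false.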

>  - The encoding of wchar_t

The 7-bit subset is ASCII on all interesting build environments,
and I believe that this is assumed in all or most of the GNU 
software.  This assumption is compatible with Unicode as well
as all other common wchar_t encodings.
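
The assumption is easy to test on a given build environment. The following is only a sketch: same_code_values and check_ascii_subset are hypothetical names, and the check covers nothing beyond the characters in the literal you hand it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: compares a narrow literal with a wide literal of the
// same length, element by element, as unsigned integer values.
template <std::size_t N>
bool same_code_values(const char (&narrow)[N], const wchar_t (&wide)[N]) {
    for (std::size_t i = 0; i < N; ++i) {
        unsigned long n = static_cast<unsigned char>(narrow[i]);
        unsigned long w = static_cast<unsigned long>(wide[i]);
        if (n != w) {
            return false;
        }
    }
    return true;
}

// Expected to pass wherever the 7-bit subset has the same values in char
// and wchar_t, which is the assumption described above.
inline void check_ascii_subset() {
    assert(same_code_values("hello, world", L"hello, world"));
}
```

If this check ever fails, the build environment is one of the uninteresting ones, and plain gettext usage is likely to be in trouble there as well.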

> The C standard makes no guarantees about any of these, and
> my copy of GCC (CVS snapshot from a few weeks ago, I think)
> certainly doesn't do what I would consider the right thing. 
> 
>  L"utf8-text"
> 
> Gives you a string where each 4-byte wide character contains
> one byte of the UTF-8 string. 

This is exactly what a reasonable person expects.  If the characters
are ASCII, that's also what UCS-4 specifies, and exactly what you want.

Non-ASCII string literals are almost always a mistake; they belong in 
a gettext archive.  

> Maybe I'm missing something,
> but relying on the L"" to do _anything_ predictable in 
> a portable program seems like a very poor idea.

You do appear to be missing this: depending on L"" notation doesn't 
expose you to any additional nonstandard semantics over regular "" 
literals.  You can use "\x" notation in either case to insert values 
that are not in the source encoding, where it seems appropriate.
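
To illustrate, here is a small sketch of "\x" escapes in both kinds of literal. The variable names and the unit_value helper are invented for the example. What the value 0xE9 means in the narrow string depends on whatever execution character set you assume; the compiler just stores the number.

```cpp
#include <cassert>
#include <cstddef>

// Narrow literal: the last char before the terminator has the value 0xE9.
const char narrow_lit[] = "caf\xe9";

// Wide literal: the same escape yields a wchar_t with the value 0xE9.
const wchar_t wide_lit[] = L"caf\xe9";

// Values above 0xFF are only possible in the wide form.
const wchar_t euro[] = L"\x20ac";

// Hex escapes consume every following hex digit, so split the literal when
// the next character is itself a hex digit ('a' here).
const wchar_t split[] = L"\xe9" L"a";

// Hypothetical helpers: view one element as an unsigned value.
inline unsigned long unit_value(char c) {
    return static_cast<unsigned char>(c);
}
inline unsigned long unit_value(wchar_t c) {
    return static_cast<unsigned long>(c);
}

inline void check_hex_escapes() {
    assert(unit_value(narrow_lit[3]) == 0xE9);
    assert(unit_value(wide_lit[3]) == 0xE9);
    assert(unit_value(euro[0]) == 0x20AC);
    assert(sizeof(split) / sizeof(split[0]) == 3);  // 0xE9, 'a', terminator
}
```

The escapes behave the same way in both forms; the only difference is the range of values an element can hold.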

Nathan Myers
ncm at cantrip dot org
