Re: Unicode & C++

From: Nathan Myers <ncm nospam cantrip org>
To: gtk-i18n-list gnome org
Subject: Re: Unicode & C++
Date: Tue, 11 Jul 2000 14:52:02 -0700

Steve Underwood <steveu@coppice.org> wrote:
> Nathan Myers wrote various stuff, then got to the interesting bit:
> 
> > > The C standard makes no guarantees about any of these, and
> > > my copy of GCC (CVS snapshot from a few weeks ago, I think)
> > > certainly doesn't do what I would consider the right thing.
> > >
> > >  L"utf8-text"
> > >
> > > Gives you a string where each 4-byte wide character contains
> > > one byte of the UTF-8 string.
> >
> > This is exactly what a reasonable person expects.  If the characters
> > are ASCII, that's also what UCS32 specifies, and exactly what you want.
> >
> > Non-ASCII string literals are almost always a mistake; they belong in
> > a gettext archive.
> 
> A reasonable mono-lingual English reader may expect this. 

I suspect you have entirely missed the point.

> Most of the population of planet earth expect that something like a UTF-8 
> string in the totally compatible ASCII source code will end up as a UCS32 
> string in memory. 

Most of the population of the planet earth uses neither UTF-8 nor UCS32.
It would be very odd if they were to expect anything to be converted 
to UCS32.

> They might possibly hope for the odd #pragma that 
> allows the source encoding or target coding to be selected, too. What 
> they don't expect is what they actually get - "English readers only. 
> Tough luck to the rest.". 

Note that I said "a reasonable person".  A reasonable person doesn't
expect the compiler to implement encodings it knows nothing about.
The only characters that a compiler is obliged to recognize are 
spelled out in the Standard in the "basic source character set"
(ISO/IEC 14882:1998, section 2.2, paragraph 1):

  abcdefghijklmnopqrstuvwxyz
  ABCDEFGHIJKLMNOPQRSTUVWXYZ
  0123456789
  _{}[]#()<>%:;.?*+-/^&|~!=,\"'

Note this is an ISO (International Organization for Standardization)
standard.  It was worked on, and voted in, by representatives of countries
from all over the world.  Native English speakers cast a distinct minority
of votes (but did, I will note, a disproportionate amount of the hard work).

Paragraph 2 specifies a notation (apparently not supported in gcc-2.95.2)
for entering non-ASCII ISO 10646 characters in hex, called a
"universal-character-name" and written like '\u3ef7'.  Compilers are
expected to convert these values into whatever "execution character set"
encoding they have assumed; of course that is easiest if the latter is
ISO 10646.

> In my experience people play for hours 
> trying to figure out how to get their own language strings into source 
> code, before they finally figure out they can't. 

If they would do a little research, they would find out, further, that
they shouldn't.  String literals are not a reasonable place for normal
interaction text, because that makes a program unportable.

> GCC is nearly 8-bit
> clean (at least I have never found problems). You can put UTF-8, Big-5,
> GB2312, and other things directly into normal byte oriented strings, 
> but only simple ASCII into wide character strings. 

That it is not terribly easy (although possible!) to make non-ASCII 
wide-character literal strings leads the sensitive and aware programmer 
to sounder programming practices.
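
To make the difference concrete, here is a sketch contrasting the two
results.  The first function reproduces, by hand, the behaviour complained
about at the top of this thread (one wide character per byte of the UTF-8
text); the second spells the same word with a hex escape so that the wide
literal holds one element per character.  Both function names are made up
for the example, and the second assumes wchar_t values are ISO 10646 code
points.

```cpp
#include <string>

// One wide character per input byte: what you get when UTF-8 bytes are
// widened without being decoded.
inline std::wstring widen_bytes(const std::string& bytes) {
    std::wstring out;
    for (std::string::size_type i = 0; i < bytes.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 do not sign-extend.
        out.push_back(static_cast<wchar_t>(
            static_cast<unsigned char>(bytes[i])));
    }
    return out;
}

// The same word as a wide literal, written with only basic source
// characters: \xe9 sets the last element to the value 0xE9 directly.
inline std::wstring escaped_wide() {
    return L"caf\xe9";
}
```

Given the five UTF-8 bytes of the word (c, a, f, 0xC3, 0xA9), widen_bytes
returns five elements, while escaped_wide returns four.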

> Am I the only one who thinks that's dumb? For many people it's a 
> good reason to avoid UCS32, and stick with a byte stream encoding. 
> The documentation needs a large:
> 
> A N G L O - P H I L E S      O N L Y

You are certainly far from the only person who thinks that sound 
programming practices are "dumb".  The fix for all cases of ignorance
is learning.  The simple fact is that characters in a source file that 
are not in the "basic source character set" render your code entirely 
unportable.  There are other, better, ways to handle literal text.
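
The general shape of the alternative is a message catalog: the source
carries a plain ASCII key, and the text shown to the user is looked up at
run time.  The sketch below is not gettext; it is a hypothetical in-memory
stand-in (the names Catalog, translate and sample_danish_catalog are all
invented here) meant only to show the lookup-with-fallback idea.  A real
catalog would be loaded from a file rather than written into the program.

```cpp
#include <map>
#include <string>

// Hypothetical catalog type: ASCII message key -> translated text.
typedef std::map<std::string, std::string> Catalog;

// Return the translation for `msgid`, or `msgid` itself when the catalog
// has no entry, so an untranslated program still prints something useful.
inline std::string translate(const Catalog& catalog,
                             const std::string& msgid) {
    Catalog::const_iterator it = catalog.find(msgid);
    return it == catalog.end() ? msgid : it->second;
}

// Sample data for illustration only.
inline Catalog sample_danish_catalog() {
    Catalog c;
    c["Goodbye"] = "Farvel";
    // U+00D8 followed by 'l', stored as UTF-8 bytes via escapes so the
    // source file stays within the basic source character set.
    c["Beer"] = "\xC3\x98l";
    return c;
}
```

With this arrangement the program text contains only the keys; changing
or adding a language means supplying a different catalog, not editing and
recompiling string literals.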

The orientation of C and C++ to American encodings is not a matter of 
English chauvinism, but an accident of history: C was invented here.  
C++, you should notice, was invented by a Dane, but was obliged to adopt
C conventions by inheritance (as it were).  

Nathan Myers
ncm at cantrip dot org