Re: Unicode & C++



Nathan Myers wrote:

> Steve Underwood <steveu@coppice.org> wrote:
> > Nathan Myers wrote various stuff, then got to the interesting bit:
> >
> > > > The C standard makes no guarantees about any of these, and
> > > > my copy of GCC (CVS snapshot from a few weeks ago, I think)
> > > > certainly doesn't do what I would consider the right thing.
> > > >
> > > >  L"utf8-text"
> > > >
> > > > Gives you a string where each 4-byte wide character contains
> > > > one byte of the UTF-8 string.
> > >
> > > This is exactly what a reasonable person expects.  If the characters
> > > are ASCII, that's also what UCS32 specifies, and exactly what you want.
> > >
> > > Non-ASCII string literals are almost always a mistake; they belong in
> > > a gettext archive.
> >
> > A reasonable mono-lingual English reader may expect this.
>
> I suspect you have entirely missed the point.
>
> > Most of the population of planet earth expect something like a UTF-8
> > string in the totally compatible ASCII source code ends up as a UCS32
> > string in memory.
>
> Most of the population of the planet earth uses neither UTF-8 nor UCS32.
> It would be very odd if they were to expect anything to be converted
> to UCS32.

This is true, since most of the population of planet Earth still does not use
a computer. However, sticking to issues relevant to those who do:

Word and Excel have stored their data in Unicode for years, so most of the
world's modern documents are in Unicode. It's UCS-2, and not UCS-4 or UTF-8,
but that's a minor issue.

Both Windows 98 and NT have used a hotchpotch of Unicode and other codes for
years.

In ten years of developing stuff in Chinese I have never seen anyone use wide
characters for anything but Unicode. There isn't much motivation to, and the
compiler and library support has been both patchy and quirky.

From this I deduce that anyone finding shiny new effective wide character
support in the latest compilers and libraries would assume it handles some
form of Unicode. The Micro-drones would probably assume UCS-2, and the rest of
us UTF-8 on the byte stream side and UCS-4 on the wide character side.

Is there some fatal flaw in my logic?
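
For what it's worth, the conversion those people expect is mechanical enough.
Here is a minimal sketch of a UTF-8 to UCS-4 decoder - my own illustration,
not any library's API - which assumes well-formed input and does no error
reporting:

    #include <stddef.h>

    typedef unsigned int ucs4_t;   /* one 32-bit code point */

    /* Decode a NUL-terminated UTF-8 byte stream into UCS-4 code
       points, writing at most 'max' of them.  Returns the count. */
    size_t utf8_to_ucs4(const unsigned char *in, ucs4_t *out, size_t max)
    {
        size_t n = 0;
        while (*in && n < max) {
            ucs4_t c = *in++;
            int extra = 0;
            if      (c >= 0xf0) { c &= 0x07; extra = 3; }
            else if (c >= 0xe0) { c &= 0x0f; extra = 2; }
            else if (c >= 0xc0) { c &= 0x1f; extra = 1; }
            while (extra-- > 0 && (*in & 0xc0) == 0x80)
                c = (c << 6) | (*in++ & 0x3f);
            out[n++] = c;
        }
        return n;
    }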

> > They might possibly hope for the odd #pragma that
> > allows the source encoding or target coding to be selected, too. What
> > they don't expect is what they actually get - "English readers only.
> > Tough luck to the rest.".
>
> Note that I said "a reasonable person".  A reasonable person doesn't
> expect the compiler to implement encodings it knows nothing about.
> The only characters that a compiler is obliged to recognize are
> spelled out in the Standard in the "basic source character set"
> (ISO 14882:1998, section 2.2, paragraph 1):

I think you are still referring to a reasonable Anglo-phile. Most of the rest
of us (OK, I happen to actually be English, but that's irrelevant at the
moment) might look for support for a few selectable representations of large
character sets - like UTF-8, UTF-7 and high-bit-triggered MBCS maybe. I have
seen people disappointed to find no code translation. I wasn't, as I never
expected it. I agree with you that this would represent a rather odd form of
bloat.

>   abcdefghijklmnopqrstuvwxyz
>   ABCDEFGHIJKLMNOPQRSTUVWXYZ
>   0123456789
>   _{}[]#()<>%:;.?*+-/^&|~!=,\"'
>
> Note this is an ISO, or "International Standards Organization", standard.
> It was worked on, and voted in, by representatives of countries from all
> over the world.  Native English speakers cast a distinct minority of votes
> (but did, I will note, a disproportionate amount of the hard work).

> Paragraph 2 specifies a format (apparently not supported in gcc-2.95.2)
> to enter non-ASCII ISO 10646 character literals in hex, using a notation
> like '\u3ef7'.  Compilers are expected to convert these values into
> whatever "execution characters set encoding" they have assumed; of
> course that is easiest if the latter is ISO 10646.

What clown came up with that? I somehow can't imagine all those non-native
English speakers saying "What we really need is a character representation in
our source code that is completely unreadable, and requires that we encode and
decode strings character by character from a code point chart. We've been able
to easily type our Chinese/Japanese/Russian/whatever into our source code
strings for years, and we are heartily sick of it. ****Please**** give us
something meaningless to a human reader".
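
To make that concrete: on a compiler that both accepts 8-bit clean source and
implements the \u notation, the two literals below would denote the same wide
string, yet only one of them means anything to a human reader (U+4E2D and
U+6587 are the code points for 中文):

    /* The same two characters, written two ways. */
    const wchar_t *readable = L"中文";          /* needs 8-bit clean source */
    const wchar_t *standard = L"\u4e2d\u6587";  /* the ISO 14882 notation */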

> > In my experience people play for hours
> > trying to figure out how to get their own language strings into source
> > code, before they finally figure out they can't.
>
> If they would do a little research, they would find out, further, that
> they shouldn't.  String literals are not a reasonable place for normal
> interaction text, because that makes a program unportable.

Are all your programs 100% free of all literal strings? Are 100% of all your
programs certain to find widespread use, and be translated into a multitude of
languages? Of course not. Most apps are quite specialised, and will never be
used beyond the organisation which developed them. It makes no sense to keep
all the literals out of the source code in such cases. It just creates admin
work getting all the strings and references to tie up.

I would fully agree that most apps which will find widespread use are written
without much regard for how tough grafted-on i18n will become later. However,
I wouldn't agree that this means all literals should be kept out of the source
code. Despite its drawbacks, gettext is one of the best tools currently
available for multi-lingualisation. It's based on having the strings in one
language in the source code. In a large percentage of cases that original text
is in English. It's absurd to demand that it be so. Is native standard English
to be a pre-requisite for software development?
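
For the record, the scheme is simple enough to show whole. A minimal sketch of
the usual gettext pattern - the domain name "myapp" and the catalogue
directory here are placeholders:

    #include <libintl.h>
    #include <locale.h>
    #include <stdio.h>

    #define _(s) gettext(s)   /* the conventional shorthand */

    int main(void)
    {
        setlocale(LC_ALL, "");                        /* honour the user's locale */
        bindtextdomain("myapp", "/usr/share/locale"); /* where the .mo files live */
        textdomain("myapp");

        /* The literal stays in the source, in one language; the
           translations come from external catalogues at run time. */
        printf(_("Hello, world\n"));
        return 0;
    }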

> > GCC is nearly 8-bit
> > clean (at least I have never found problems). You can put UTF-8, Big-5,
> > GB2312, and other things directly into normal byte oriented strings,
> > but only simple ASCII into wide character strings.
>
> That it is not terribly easy (although possible!) to make non-ASCII
> wide-character literal strings leads the sensitive and aware programmer
> to sounder programming practices.

I thought tools that are hard to use were usually just dumped.
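
For anyone who hasn't tripped over it yet, here is the quirk in miniature.
The hex escapes keep this source pure ASCII; \xc3\xa9 happens to be the UTF-8
encoding of e-acute:

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* Byte-oriented literal: the UTF-8 bytes pass through intact. */
        const char *narrow = "caf\xc3\xa9";

        /* Wide literal from the same escapes: each byte value becomes
           its own wchar_t - NOT the UCS-4 string for the same text. */
        const wchar_t *wide = L"caf\xc3\xa9";

        for (const char *p = narrow; *p; ++p)
            printf("%02x ", (unsigned char)*p);   /* 63 61 66 c3 a9 */
        printf("\n");
        for (const wchar_t *p = wide; *p; ++p)
            printf("%04lx ", (unsigned long)*p);  /* 0063 0061 0066 00c3 00a9 */
        printf("\n");
        return 0;
    }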

> > Am I the only one who thinks that's dumb. For many people its a
> > good reason to avoid UCS32, and stick with a byte stream encoding.
> > The documentation needs a large:
> >
> > A N G L O - P H I L E S      O N L Y
>
> You are certainly far from the only person who thinks that sound
> programming practices are "dumb".  The fix for all cases of ignorance
> is learning.  The simple fact is that characters in a source file that
> are not in the "basic source character set" render your code entirely
> unportable.  There are other, better, ways to handle literal text.

I don't think sound programming practices are dumb. I think clumsy programming
practices are dumb. They take more time and cause more errors. Think carefully
about paying in time or cash for flexibility - nine times out of ten it turns
out not to be needed.

If it is bad for me to have my Chinese text in my source code, why is it good
for you to have your English text in there? Are you suggesting that people
shouldn't even put comments in their source code in a language they actually
understand?

> The orientation of C and C++ to American encodings is not a matter of
> English chauvinism, but an accident of history: C was invented here.
> C++, you should notice, was invented by a Dane, but was obliged to adopt
> C conventions by inheritance (as it were).

I believe he was working in the US for a US company when he did most of that
work.

Steve





