Re: Unicode & C++



Nathan Myers wrote:

> Steve Underwood <steveu@coppice.org> wrote:
> > In ten years of developing stuff in Chinese I have never seen anyone use
> > wide characters for anything but Unicode.
>
> Evidently you have only developed for Microsoft targets.

Why on Earth would you think that? I have mostly worked on the Unix server side,
and I have never seen anything done in wide-character form. Of course, on the
client side I have developed for DOS+Eten and Windows. What else could I do?
There has been no effective Chinese platform anywhere else. The Mac sucked at
Chinese (I haven't looked recently to see if this has changed), and Unix has
been far worse. I hope that a year from now good work like Pango, and some
other key things in progress, will get the Unixen on par with Windows, and from
there it should be a smooth path to leave Windows in the dust.

> > From this I deduce that anyone finding shiny new effective wide
> > character support in the latest compilers and libraries would assume
> > it handles some form of Unicode. The Micro-drones would probably
> > assume UCS-2, and the rest of us UTF-8 on the byte stream side and
> > UCS-4 on the wide character side.
>
> Chinese and Japanese people, in my experience, are almost unanimously
> hostile to Unicode in any form.

People keep telling me the Japanese are hostile to Unicode. I have no idea. I
do know that most Chinese aren't hostile, but just treat it as irrelevant. They
think they are working with Big-5 or GB-2312, and life is OK that way. However,
since almost all of them are working with Windows, Unicode has been their daily
fare for several years. Odd, really. I used to be hostile to Unicode, but V3
has reformed my attitude - it actually offers enough to make a change from the
older codes worthwhile.

> > Is there some fatal flaw in my logic?
>
> Yes.  You are talking about library support.  What a compiler must
> interpret in source code is only tenuously related to that.

I thought you were talking about the C standard. Isn't that a bundle of
compiler and support library?

> > > > They might possibly hope for the odd #pragma that
> > > > allows the source encoding or target coding to be selected, too. What
> > > > they don't expect is what they actually get - "English readers only.
> > > > Tough luck to the rest.".
> > >
> > > Note that I said "a reasonable person".  A reasonable person doesn't
> > > expect the compiler to implement encodings it knows nothing about.
> > > The only characters that a compiler is obliged to recognize are
> > > spelled out in the Standard in the "basic source character set"
> > > (ISO 14882:1998, section 2.2, paragraph 1):
> > >   abcdefghijklmnopqrstuvwxyz
> > >   ABCDEFGHIJKLMNOPQRSTUVWXYZ
> > >   0123456789
> > >   _{}[]#()<>%:;.?*+-/^&|~!=,\"'
> > >
> >
> > I think you are still referring to a reasonable Anglo-phile.
>
> No, I'm speaking of a reasonable reader-of-the-ISO-C++-Standard.
>
> When languages are created that specify interpretation of arbitrary
> Unicode characters, it will be reasonable to feed such arbitrary
> Unicode characters to compilers for those languages.  C++ is not
> such a language, and neither is C.
>
> > > Paragraph 2 specifies a format (apparently not supported in gcc-2.95.2)
> > > to enter non-ASCII ISO 10646 character literals in hex, using a notation
> > > like '\u3ef7'.  Compilers are expected to convert these values into
> > > whatever "execution characters set encoding" they have assumed; of
> > > course that is easiest if the latter is ISO 10646.
> >
> > What clown came up with that?
>
> Shall I pass along your personal evaluation to the individuals involved?

OK, that was a slight over-reaction on my part.

> > I somehow can't imagine all those
> > non-native English speakers saying "What we really need is a character
> > representation in our source code that is completely unreadable, and
> > requires that we encode and decode strings character by character
> > from a code point chart. We've been able to easily type our
> > Chinese/Japanese/Russian/whatever into our source code strings for
> > years, and we are heartily sick of it. ****Please**** give us
> > something meaningless to a human reader".
>
> No, they reasonably said, "It is sometimes necessary to mention
> particular characters in the 'execution character set' regardless
> of whether they are permitted in the 'basic source character set',
> and we would like a portable way to do it."
>
> This is a step forward, because previously they would be obliged
> to embed a Big-5 or JIS character code in the source code; now,
> on compilers that support it, they can use a universal Unicode
> value and expect it to be converted appropriately.  As they move
> to Unicode execution environments they will not be obliged to
> change their source, and that's good.

They can insert codes, but they can't read them. Not exactly developer
friendly, or conducive to low error rates, is it? As you pointed out, the
standard method isn't portable - it doesn't actually work with major compilers.
On the other hand, I have thrown source code containing Big-5 comments and
literals at every C and C++ compiler I have used in the last 10 years, and
never had a compiler problem. A couple of times a support library routine has
coughed a bit, but it has been a pretty low-hassle experience. I have no doubt
the same would have been true if I had been working with UTF-8. That seems
pretty real-world portable to me.
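
To make the contrast concrete, here is a tiny sketch (中 and 文 are arbitrary
example characters; whether the two literals come out identical depends on the
source and execution encodings the compiler assumes, and on it supporting \u
at all):

    #include <stdio.h>

    int main(void)
    {
        /* The standard's portable spelling: nobody can tell which
           characters these are without a code chart. */
        const char *escaped  = "\u4e2d\u6587";

        /* What people actually type, when the compiler simply passes the
           source bytes through unchanged. */
        const char *embedded = "中文";

        puts(escaped);
        puts(embedded);
        return 0;
    }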

What would have been darned useful in a C spec updated so recently would have
been for it to embrace i18n, and end the fudging we have been doing since way
back when. UTF-8 is totally compatible with ASCII, and the use of UTF-8 in
comments and literals should be considered a proper thing. Even if you have a
deep loathing of anything between double quotes in C source, surely you
wouldn't object to making UTF-8 in comments compliant? I think this was a
missed opportunity, since the latest C update has occurred just as Unicode is
really starting to move on platforms other than Windows.
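
A quick way to see why that compatibility works in practice (assuming the
source file is saved as UTF-8 and the compiler leaves the bytes alone):

    #include <stdio.h>

    int main(void)
    {
        /* 中 (U+4E2D) stored as UTF-8 is the three bytes E4 B8 AD.  Every
           byte of a multi-byte UTF-8 sequence has its high bit set, so
           none of them can collide with '"', '\\' or anything else the
           compiler cares about. */
        const unsigned char zhong[] = "中";
        int i;

        for (i = 0; zhong[i] != '\0'; i++)
            printf("%02X ", (unsigned)zhong[i]);
        printf("\n");
        return 0;
    }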

> Of course some proprietary compilers encourage programmers to use
> non-portable extensions.  That's good for the compiler vendor, but
> bad for programmers who get locked in.

Every real-world compiler extends the spec, for entirely practical reasons. A
lot of the best (or most useful) additions started out as proprietary
extensions. That's how their popularity and value have been assessed prior to
standardisation. It's unreasonable to consider this simply a lock-in tactic,
although it often is. Do you know a compiler that doesn't support "#pragma
pack()"? It's a long time since I saw one. You may consider this a dirty trick,
but it can be a darned useful dirty trick, so everyone provides it.
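
For anyone who hasn't met it, a minimal sketch of the kind of thing I mean
(the exact sizes depend on compiler and target, and the pragma syntax varies
a little between vendors):

    #include <stdio.h>

    struct padded { char tag; int value; };   /* usually padded to 8 bytes */

    #pragma pack(1)                           /* request 1-byte packing    */
    struct packed { char tag; int value; };   /* usually 5 bytes           */
    #pragma pack()                            /* restore default packing   */

    int main(void)
    {
        printf("padded: %lu  packed: %lu\n",
               (unsigned long)sizeof(struct padded),
               (unsigned long)sizeof(struct packed));
        return 0;
    }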

[...]

> > I would fully agree that most apps which will find widespread use
> > are written without much regard for how tough grafted on i18n will
> > become later. However, I wouldn't agree that this means all literals
> > should be out of the source code.
>
> Ease of i18n has nothing to do with it.  As I have pointed out several
> times, Chinese (and Arabic, and Devanagari) characters simply are not
> in the list of characters that compilers are required by ISO Standard
> to recognize.  It happens that most of the ASCII characters are in the
> list, so those are the ones you can put in your source code.

Ease of i18n has everything to do with it. Do you think of standards as your
master or your servant? Our tools are supposed to help us, not hinder us.

> > Despite its drawbacks, gettext is
> > one of the best tools currently available for multi-lingualisation.
> > It's based on having the strings in one language in the source code.
> > In a large percentage of cases that original text is in English. It's
> > absurd to demand that be so. Is native standard English to be a
> > pre-requisite for software development?
>
> There is no requirement that gettext() key strings are in English.
> You can write any of
>
>   gettext("aKfjw82.xc38cjll23ku092lk3")
>   gettext("31337 h4x0rZ R00l")
>   gettext("file not found")
>
> for the same message.  Probably the last is best, for documentation
> purposes, if people reading the program source know English.  Otherwise,
> an ASCII transcription of a Chinese expression might be better.  Since
> only programmers see this string, it need only be unique, not pretty.

Hmm. This seems a little inconsistent, since what we want to put in those
literals is exactly the kind of thing you think we shouldn't.
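
For anyone who hasn't used it, the pattern being discussed looks roughly like
this (the domain name and catalogue path here are invented for the example):

    #include <libintl.h>
    #include <locale.h>
    #include <stdio.h>

    #define _(msgid) gettext(msgid)

    int main(void)
    {
        setlocale(LC_ALL, "");                 /* use the caller's locale  */
        bindtextdomain("example", "/usr/share/locale");
        textdomain("example");

        /* The key is whatever the author typed in the source; the message
           catalogue maps it to the user's language at run time. */
        printf("%s\n", _("file not found"));
        return 0;
    }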

Many of us would like to see future C/C++ compilers accept any
non-special-symbol UTF-8 characters in identifiers. That would greatly enhance
the readability of source for most software authors. It doesn't seem like that
should be much hassle for the compiler, but I guess the psychological leap to
non-English variables might be a bit much for some folk. If by "ASCII
transcription" you mean translation, then the text might be meaningless to
the software's author. If you mean transliteration, it usually ends up as
meaningless to everybody (Chinese Pinyin is really not readable). A
non-English reader already has a tough enough life writing C/C++ when the only
identifier names they can use are as meaningful to them as
"aKfjw82.xc38cjll23ku092lk3" is to me. Please don't take away their last
refuge for producing something they themselves can follow the day after they
wrote it.
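
Purely as a sketch of what is being asked for (no mainstream compiler accepted
this at the time of writing, and the names are invented):

    #include <stdio.h>

    static int 字数 = 0;            /* "character count" */

    static void 增加(int n)         /* "add to the count" */
    {
        字数 += n;
    }

    int main(void)
    {
        增加(3);
        printf("%d\n", 字数);
        return 0;
    }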

Steve





