Re: Unicode & C++



Steve Underwood <steveu@coppice.org> wrote:
> Nathan Myers wrote:
> > Steve Underwood <steveu@coppice.org> wrote:
> > > Most of the population of planet earth expect something like a UTF-8
> > > string in the totally compatible ASCII source code ends up as a UCS32
> > > string in memory.
> >
> > Most of the population of the planet earth uses neither UTF-8 nor UCS32.
> > It would be very odd if they were to expect anything to be converted
> > to UCS4.
> 
> In ten years of developing stuff in Chinese I have never seen anyone use 
> wide characters for anything but Unicode. 

Evidently you have only developed for Microsoft targets.

> From this I deduce that anyone finding shiny new effective wide 
> character support in the latest compilers and libraries would assume 
> it handles some form of Unicode. The Micro-drones would probably 
> assume UCS-2, and the rest of us UTF-8 on the byte stream side and 
> UCS-4 on the wide character side.

Chinese and Japanese people, in my experience, are almost unanimously
hostile to Unicode in any form.

 
> Is there some fatal flaw in my logic?

Yes.  You are talking about library support.  What a compiler must
interpret in source code is only tenuously related to that.

 
> > > They might possibly hope for the odd #pragma that
> > > allows the source encoding or target coding to be selected, too. What
> > > they don't expect is what they actually get - "English readers only.
> > > Tough luck to the rest.".
> >
> > Note that I said "a reasonable person".  A reasonable person doesn't
> > expect the compiler to implement encodings it knows nothing about.
> > The only characters that a compiler is obliged to recognize are
> > spelled out in the Standard in the "basic source character set"
> > (ISO 14882:1998, section 2.2, paragraph 1):
> >   abcdefghijklmnopqrstuvwxyz
> >   ABCDEFGHIJKLMNOPQRSTUVWXYZ
> >   0123456789
> >   _{}[]#()<>%:;.?*+-/^&|~!=,\"'
> >
> 
> I think you are still referring to a reasonable Anglo-phile. 

No, I'm speaking of a reasonable reader-of-the-ISO-C++-Standard.  

When languages are created that specify interpretation of arbitrary 
Unicode characters, it will be reasonable to feed such arbitrary
Unicode characters to compilers for those languages.  C++ is not
such a language, and neither is C.

 
> > Paragraph 2 specifies a format (apparently not supported in gcc-2.95.2)
> > to enter non-ASCII ISO 10646 character literals in hex, using a notation
> > like '\u3ef7'.  Compilers are expected to convert these values into
> > whatever "execution character set encoding" they have assumed; of
> > course that is easiest if the latter is ISO 10646.
> 
> What clown came up with that? 

Shall I pass along your personal evaluation to the individuals involved?


> I somehow can't imagine all those 
> non-native English speakers saying "What we really need is a character 
> representation in our source code that is completely unreadable, and 
> requires that we encode and decode strings character by character 
> from a code point chart. We've been able to easily type our 
> Chinese/Japanese/Russian/whatever into our source code strings for 
> years, and we are heartily sick of it. ****Please**** give us
> something meaningless to a human reader".

No, they reasonably said, "It is sometimes necessary to mention 
particular characters in the 'execution character set' regardless 
of whether they are permitted in the 'basic source character set', 
and we would like a portable way to do it."  

This is a step forward, because previously they would be obliged 
to embed a Big-5 or JIS character code in the source code; now, 
on compilers that support it, they can use a universal Unicode 
value and expect it to be converted appropriately.  As they move 
to Unicode execution environments they will not be obliged to 
change their source, and that's good.

Of course some proprietary compilers encourage programmers to use
non-portable extensions.  That's good for the compiler vendor, but
bad for programmers who get locked in.

 
> > > In my experience people play for hours
> > > trying to figure out how to get their own language strings into source
> > > code, before they finally figure out they can't.
> >
> > If they would do a little research, they would find out, further, that
> > they shouldn't.  String literals are not a reasonable place for normal
> > interaction text, because that makes a program unportable.
> 
> Are all your programs 100% free of all literal strings? Are 100% of 
> all your programs certain to find widespread use, and be translated 
> into a multitude of languages? Of course not. Most apps are quite 
> specialised, and will never be used beyond the organisation which 
> developed them. It makes no sense to keep all the literals out of 
> the source code in such cases. It just creates admin work getting 
> all the strings and references to tie up.

We are getting off-topic, now.  This is the GTK+ internationalization
list.  For throw-away programs you can do whatever you feel like, and
it's nobody's business (least of all mine); so I won't discuss that.

I would like to see GTK+ (and related) library code written using 
sound coding practices. 

 
> I would fully agree that most apps which will find widespread use 
> are written without much regard for how tough grafted on i18n will 
> become later. However, I wouldn't agree that this means all literals 
> should be out of the source code. 

Ease of i18n has nothing to do with it.  As I have pointed out several 
times, Chinese (and Arabic, and Devanagari) characters simply are not 
in the list of characters that compilers are required by ISO Standard
to recognize.  It happens that most of the ASCII characters are in the 
list, so those are the ones you can put in your source code.


> Despite its drawbacks, gettext is 
> one of the best tools currently available for multi-lingualisation. 
> It's based on having the strings in one language in the source code. 
> In a large percentage of cases that original text is in English. It's 
> absurd to demand that be so. Is native standard English to be a 
> pre-requisite for software development?

There is no requirement that gettext() key strings be in English.
You can write any of

  gettext("aKfjw82.xc38cjll23ku092lk3")
  gettext("31337 h4x0rZ R00l")
  gettext("file not found")

for the same message.  Probably the last is best, for documentation
purposes, if people reading the program source know English.  Otherwise,
an ASCII transcription of a Chinese expression might be better.  Since
only programmers see this string, it need only be unique, not pretty.
 
 
> > > Am I the only one who thinks that's dumb? For many people it's a
> > > good reason to avoid UCS32, and stick with a byte stream encoding.
> > > The documentation needs a large:
> > >
> > > A N G L O - P H I L E S      O N L Y
> >
> > You are certainly far from the only person who thinks that sound
> > programming practices are "dumb".  The fix for all cases of ignorance
> > is learning.  The simple fact is that characters in a source file that
> > are not in the "basic source character set" render your code entirely
> > unportable.  There are other, better, ways to handle literal text.
> 
> I don't think sound programming practices are dumb. I think clumsy 
> programming practices are dumb. They take more time and cause more 
> errors. Think carefully about paying in time or cash for flexibility 
> - 9 times out of 10 it is never needed.
> 
> If it is bad for me to have my Chinese text in my source code, why is 
> it good for you to have your English text in there? Are you suggesting 
> that people shouldn't even put comments in their source code in a 
> language they actually understand?

It is not "bad" to have Chinese characters in your source code.  It 
is unsupported by standard-conforming compilers, and thus an unsound
practice.  You are as free to be "bad" or to use unsound practices 
as ever.  Just don't ask others to approve of it.
 

> > The orientation of C and C++ to American encodings is not a matter of
> > English chauvinism, but an accident of history: C was invented here.
> > C++, you should notice, was invented by a Dane, but was obliged to 
> > adopt C conventions by inheritance (as it were).
> 
> I believe he was working in the US for a US company when he did most 
> of that work.

I think this is getting into conspiracy-theory territory.  
Please pardon me if I do not respond to the above.


Nathan Myers
ncm at cantrip dot org
