Re: Quotation marks: Using “” instead of ""

From: Alan Cox <alan lxorguk ukuu org uk>
To: "Alexander Jones" <alex weej com>
Cc: desktop-devel-list gnome org
Subject: Re: Quotation marks: Using “” instead of ""
Date: Fri, 13 Jun 2008 15:44:16 +0100

> Some people are worried about string functions breaking. I really
> don't see how this is the case, seeing as we're doing g_some_function
> (_("Some ASCII string")) which is replaced with a UTF-8 string at
> runtime anyway.
> 
> Does anyone have any actual proof of UTF-8 in our translatable strings
> breaking C?

Your reasoning is completely wrong. Please take a bit of time to
understand how internationalisation and localisation logic actually work.
Your model for making the decision is also wrong. The question is not
"have we made it break yet"; it is "is the action we propose to take one
which has defined, correct behaviour". It's as wrong to say "oh it'll
work" about this as to say "gcc happens to do this in the right order,
who cares about correctness" or "I've never seen a NULL pointer here so
why check".

In the kernel world we have made those assumptions now and then (usually
as an oversight) and when gcc or tools updates broke them the tools
people were quite definitely *not* going to make their compiler work
around our problem. So you can get burned badly in the future even if not
today.

If your string is untranslated then _("foo") -> "foo", so whatever bytes
are in the source string reach the user unchanged. If your locale is not
Unicode then this places UTF-8 symbols into non-UTF-8 locales. Similarly
if you are in the default locale (which is where you end up if you don't
set one, or the environment variable gets lost, etc.) you end up with
_("foo") -> "foo".
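
To make the fallback concrete, here is a minimal sketch of a lookup that behaves like the untranslated case. toy_gettext and toy_catalogue are hypothetical names invented for illustration; this is not the real gettext implementation, only the "no translation found, return the msgid" behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical one-entry catalogue, standing in for a compiled po file. */
struct toy_entry {
    const char *msgid;
    const char *msgstr;
};

static const struct toy_entry toy_catalogue[] = {
    { "Open", "Ouvrir" },
};

/* Look up msgid; when no translation exists, hand the msgid back as-is. */
const char *toy_gettext(const char *msgid)
{
    assert(msgid != NULL);
    size_t n = sizeof toy_catalogue / sizeof toy_catalogue[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(toy_catalogue[i].msgid, msgid) == 0)
            return toy_catalogue[i].msgstr;
    }
    return msgid; /* untranslated: the source bytes pass through unchanged */
}
```

With this sketch, toy_gettext("foo") returns the very pointer it was given, so any UTF-8 bytes placed in the source string would come out the other side whatever the locale is.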

Now if the resulting translation is ASCII all is well, because ASCII is a
strict subset of the character sets of the locales we support. If your
input string is not ASCII then functions like:

	strcoll, strxfrm, strcasecmp, isupper, islower, isalpha, ... etc

all start giving undefined answers.
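
A small sketch of what the byte-oriented functions actually see. The helper names count_alpha_bytes and count_high_bytes are made up for this example, and the octal escapes are the UTF-8 bytes of U+201C and U+201D (assuming those are the quotes being proposed).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Count the bytes of s that isalpha() accepts in the current locale.
 * Each byte is converted to unsigned char first: passing a negative
 * plain char to a <ctype.h> function is undefined behaviour. */
size_t count_alpha_bytes(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (isalpha((unsigned char)*s))
            n++;
    }
    return n;
}

/* Count bytes with the top bit set (0x80-0xFF), i.e. outside ASCII. */
size_t count_high_bytes(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0x80) != 0)
            n++;
    }
    return n;
}
```

In the default "C" locale, count_alpha_bytes("\342\200\234hi\342\200\235") is 2 and count_high_bytes of the same string is 6: the two quote characters are seen as six separate bytes, none of them a letter. In a single-byte locale such as ISO-8859-1 the same bytes would each be classified as a character of that charset (0xE2 is "â" there), which is not what the string means.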

You've also ignored the fact that output of utf-8 bytes in a non utf-8
mode is going to have undefined results as well.
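
For illustration only, a sketch of what a program would have to do at runtime to avoid emitting UTF-8 bytes into a non-UTF-8 locale. open_quote_for and open_quote_for_current_locale are hypothetical helpers; nl_langinfo(CODESET) is the POSIX call, and comparing against the exact name "UTF-8" is an assumption that will not hold on every system.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: pick an opening quote that is safe for a codeset. */
const char *open_quote_for(const char *codeset)
{
    assert(codeset != NULL);
    if (strcmp(codeset, "UTF-8") == 0)
        return "\342\200\234"; /* U+201C as UTF-8 bytes */
    return "\"";               /* plain ASCII fallback */
}

/* Ask the C library which codeset the user's locale uses. */
const char *open_quote_for_current_locale(void)
{
    setlocale(LC_ALL, "");     /* adopt the locale from the environment */
    return open_quote_for(nl_langinfo(CODESET));
}
```

Selecting the string through the message catalogue for each locale gives the same result without this kind of check scattered through the code.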

Keep the "nice" quoting in the translations. If need be generate
en_US.utf-8 from the Makefile using a script. en_GB.utf-8 is already
mostly done this way so teaching en_GB.utf-8 to use nice quoting is
trivial. For French and German the rules are different anyway so will
need to be done in those translations separately.

The po system is designed to let you do smart quoting; it is also
designed so you can do this in a defined, correct and proper manner rather
than trying to cheat and digging a huge hole to fall into later.

> Somebody said that any byte with the MSB set (i.e. 0x80-0xFF) will
> cause some compilers to break. Is this true? 

You are out of the C language spec at that point. The compiler is
entitled to play cribbage if it wants. That of itself is not a problem as
you can use backslash notation for unprintable symbols anyway - it just
looks uggggly. Also don't forget Gnome supports multiple languages, not
just C, and many use po files. Whatever is chosen must work for all of
these.

If you want to embed those bytes in a C program use \xxx notation for
them, i.e. _("\2??\0??hello\2??\0?? said the dog")
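
A worked version of the escaped form, assuming the quotes in question are U+201C and U+201D. Their UTF-8 encodings are E2 80 9C and E2 80 9D, which in octal escapes are \342\200\234 and \342\200\235. The names dog_msg and is_ascii are invented for this sketch.

```c
#include <assert.h>
#include <string.h>

/* The source file stays pure ASCII; the compiler emits the UTF-8 bytes.
 * An octal escape takes at most three digits, so "\234hello" is safe. */
static const char dog_msg[] = "\342\200\234hello\342\200\235 said the dog";

/* Return nonzero if every byte of s is 7-bit ASCII. */
int is_ascii(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0x80) != 0)
            return 0;
    }
    return 1;
}
```

Here strlen(dog_msg) is 24 (3 + 5 + 3 + 13 bytes) and is_ascii(dog_msg) is 0, so the string still carries non-ASCII bytes at runtime; only the source text has been made safe for the compiler.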

Far cleaner to generate an en_US po file really isn't it 8)

Alan

Follow-Ups:
- Re: Quotation marks: Using “” instead of ""
  - From: Wouter Bolsterlee

References:
- Re: Quotation marks: Using “” instead of ""
  - From: Alexander Jones
