Re: UTF-8

From: Damien Donlon - Sun Ireland - Solaris Software - Localisation Engineer <damien donlon sun com>
To: Sven Neumann <sven gimp org>
Cc: Karl Eichwalder <keichwa gmx net>,GNOME i18n list <gnome-i18n gnome org>
Subject: Re: UTF-8
Date: Wed, 10 Jul 2002 17:28:46 +0100

Sven Neumann wrote:

> Hi,
>
> Karl Eichwalder <keichwa@gmx.net> writes:
>
> > Sven Neumann <sven@gimp.org> writes:
> >
> > > If I'd have to check all 157 po files in the GIMP source tree
> > > for correct encoding everytime I do a tarball I'd never get a release
> > > out.
> >
> > Add "msgconv --to UTF-8 |" to the traditional $(MSGFMT) command and you
> > are done.  I don't expect you to do it by hand file by file ;)
>
> I'd do so if the Makefile in the po directory was under my control.
> No, I'm not going to add yet another patch to be applied after
> gettextize and intltoolize did their dirty work. I'd appreciate if
> intltoolize could take care of it.

Hi All,
The 'Great UTF-8 Debate' again! I want to give my 2 cents this
time (my 2 cents - not necessarily Sun's 2 cents!). Let me preface this
by saying that I think Gnome has made great leaps in its adoption of
UTF-8.  Nonetheless, I do wish to make a few points to ensure all
translation team members see the wider impact of continuing to commit
translations in non-UTF-8 encodings.

File Encoding and Tools/Application Development
________________________________________________

As you know, we are delivering back all our translations in UTF-8.
Because we are working across languages it is safer and less confusing for
our tools (and us) if we can depend on the encoding of the files.  I think
some translation team members may overlook the consideration and time
that tools and application developers have to give to multiple encodings
when writing programs to parse/merge/configure/display files.  In this
thread I have seen mention already of changes needed to intltools,
msgconv, msgcat and the Makefiles to accommodate different encodings.  This
represents 4 instances where a developer had to think about file encoding
when he ought not to have had to do so. Yet this is only the tip of the
iceberg as far as programmatically having to deal with different file
encodings.  And this is only in Gnome. And it is still prone to error
because there are subtle differences in the way different platforms handle
different encodings!

I agree with Sander that while it is nice to accommodate individual
language teams in their use of specific encodings, the advantages of being
able to reliably depend on files being in a single encoding outweigh being
accommodating to translation teams at this point.

Responsibility for UTF-8 - Developers or Translation Teams?
____________________________________________________

Arguing over who should have responsibility for UTF-8 - developers or
translation teams - misses the point entirely and defers a solution to the
problem. While translation teams might differentiate themselves from
developers within a project - from a USER perspective we are all
developers insofar as we contribute to a product that they end up
using.  If we don't move wholesale to UTF-8 then what likelihood is there
of users not having to worry about file encodings?  Those of you running
Netscape/Mozilla - click View->Character Set and look at the fine (but by
no means full) range of viewing options.  Should users have to deal with
the likes of this? Should application/tool developers have to program to
allow for it? Yes, UTF-8 is there, but how many people have it as their
default viewing encoding? Why not? Because developers are not programming
to UTF-8 for the web. Because tools for creating HTML are not developed to
generate UTF-8 encoded files, etc.  In general (though not generally in
Gnome!) there is a problem of slow/reluctant developer adoption of UTF-8.
Consequently, developers (particularly tools developers) continue to
suffer having to deal with multiple encodings and users continue to suffer
garbled web pages etc. In short - everybody loses.

Legacy Codesets
______________________

We all know that UTF-8 is better. We are slowly trying to move our users
towards UTF-8. Admittedly, we will still have to cater to legacy codesets
for the foreseeable future.  Maybe UTF-8 itself will be displaced in the
future. However, it is the best we have at the moment.  I think 'legacy
codesets' is a good phrase. It sums up the way we should be thinking about
8859-1, 8859-7 etc., KOI8-R etc.  I'd hate to think that in 5 years' time
we'll all still be dealing with the polyglot babble of encodings we have at
the moment.  However, that is precisely what is going to happen if developers
themselves don't get used to using UTF-8.  Translation teams saying 'we're
not using UTF-8 because we use 8859-1' or 'we're not using UTF-8 because
we're not used to it and so our using it is prone to error' or 'pango and
gettext can deal with it' is not really good enough at this point.

While we do have a responsibility to users to accommodate legacy codesets,
there ought NOT, insofar as is possible, be any such duty to fellow
contributors to a software project.

What to do?
_________________________

I think the following things ought to be done:

[1] Identify the usage limitations of UTF-8 for some translation
    teams and identify how they can be eliminated (the limitations, not
    the translation teams ;-) )

[2] Create a tool that can check whether a file is UTF-8 encoded.
    The tool should not simply read a charset field within the file
    to see whether it says UTF-8; it should analyse the byte stream
    itself. Does such a tool exist already within the community?

    I think it may be impossible to distinguish between UTF-8 and 8859-1
    if no character is outside the 0-127 range. Can anyone confirm? Is
    this a big problem in identifying UTF-8 encoded files?

    The tool would be provided to translation teams to check their files
    prior to the cvs commit.

[3] A commit check script (possibly invoking [2] above)
    placed in CVS that refuses to commit any po file that is not UTF-8.
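
As a rough illustration of items [2] and [3], here is a minimal Python
sketch of a byte-stream check. The function names classify_bytes and
failing_files are hypothetical, made up for this example; it is not an
existing community tool. It relies on Python's strict UTF-8 decoder to
reject malformed byte sequences, and it reports pure 7-bit input
separately, since such a file is byte-for-byte identical in UTF-8 and
8859-1 and so cannot be told apart by its bytes alone.

```python
def classify_bytes(data: bytes) -> str:
    """Return 'ascii', 'utf-8' or 'invalid' for a raw byte string.

    Hypothetical helper: looks only at the bytes, never at a
    charset field inside the file.
    """
    try:
        # The decoder is strict by default and raises on malformed input.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid"
    # Every byte in 0-127: valid UTF-8, but equally valid 8859-1.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    return "utf-8"


def failing_files(paths):
    """Return the subset of paths whose contents are not valid UTF-8."""
    bad = []
    for path in paths:
        with open(path, "rb") as handle:
            if classify_bytes(handle.read()) == "invalid":
                bad.append(path)
    return bad
```

A commit check along the lines of [3] could call failing_files on the po
files being committed and refuse the commit if the returned list is not
empty. Note that the check can only say a file is *consistent* with
UTF-8: an 8859-1 file whose high bytes happen to form valid UTF-8
sequences would pass, although that is rare in real translated text.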

Having said that, I don't think they will be done and we can all enjoy
watching this debate resurface again in another couple of months.  I am
off to put on my flame retardant suit now. ;-)

Best Regards,
Damien

> Salut, Sven

--
Damien Donlon
damien.donlon@sun.com
00 353 1 8199225
x19225
