Re: deploying UTF-8 in new programs



Darin Adler <darin bentspoon com> writes:

> Some Unicode-related questions (maybe I should be asking these on the
> gnome 2.0 list or the gnome i18n list):
> 
>      1) How did the po files for projects like gtk+ get transcoded to UTF-8?
>   Who did it? With what tools?

Mostly Robert Brady. See gtk+/po/README.translators for information
about tools.

>      2) Is there a standard way to detect at runtime that the gettext
> translations are in the wrong charset? Should we bother doing that?

Things will die horribly with warnings all over the place. If we take
the GTK+ approach of putting all .po files in UTF-8 then this is
solely a translator problem, since such .po files will work whether
or not your system has bind_textdomain_codeset().

>      3) How should programs figure out what character set file names are in?
>   Should we add something to glib and/or gnome-vfs to help with this?

Basically, this is a "Unix is screwed" problem. The file system isn't
tagged, and file names are far too short to autodetect. (You might not
do too badly assuming UTF-8 and falling back to the locale if it
isn't legitimate UTF-8, but maybe not. And that doesn't help on saving.)

I don't think anyone has come up with a satisfactory solution yet.
The only thing that is going to half-way work is if everybody simply
switches over to UTF-8 locales and converts all their filenames.
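
For what it's worth, the "assume UTF-8, fall back to the locale" guess
on the reading side might look roughly like the sketch below. The helper
name is made up for illustration; in a GLib program g_utf8_validate()
does the same job. The sketch only answers "could this name be UTF-8?";
the caller would still have to do the locale conversion when the answer
is no, and none of this helps when writing names back out.

```c
#include <stdbool.h>

/* Hypothetical helper (not a GLib function): returns true if the
 * NUL-terminated file name is well-formed UTF-8.  Overlong forms,
 * surrogates and values above U+10FFFF are rejected.
 */
bool
name_looks_like_utf8 (const char *name)
{
  const unsigned char *p = (const unsigned char *) name;

  while (*p)
    {
      unsigned int c = *p++;
      unsigned int min;
      int trail;

      if (c < 0x80)
        continue;                       /* plain ASCII */
      else if ((c & 0xe0) == 0xc0)
        { trail = 1; min = 0x80;    c &= 0x1f; }
      else if ((c & 0xf0) == 0xe0)
        { trail = 2; min = 0x800;   c &= 0x0f; }
      else if ((c & 0xf8) == 0xf0)
        { trail = 3; min = 0x10000; c &= 0x07; }
      else
        return false;                   /* stray continuation or bad lead byte */

      while (trail-- > 0)
        {
          if ((*p & 0xc0) != 0x80)
            return false;               /* truncated sequence (also stops at NUL) */
          c = (c << 6) | (*p++ & 0x3f);
        }

      if (c < min || c > 0x10ffff || (c >= 0xd800 && c <= 0xdfff))
        return false;                   /* overlong, out of range, or surrogate */
    }

  return true;
}
```

A short Latin-1 name such as "caf\xe9.txt" fails the check and would go
through the locale fallback, but plenty of legacy-encoded names are pure
ASCII or happen to pass, which is why this is only a guess.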

>      4) Are there functions in the platform for converting file names
> and paths and the like to and from UTF-8?

Yes:

/* Convert between the operating system (or C runtime)
 * representation of file names and UTF-8.
 */
gchar* g_filename_to_utf8   (const gchar  *opsysstring,
                             gssize        len,
                             gsize        *bytes_read,
                             gsize        *bytes_written,
                             GError      **error);
gchar* g_filename_from_utf8 (const gchar  *utf8string,
                             gssize        len,
                             gsize        *bytes_read,
                             gsize        *bytes_written,
                             GError      **error);

The implementation of these is nothing very magic on Unix:

 * Normally they are no-ops.

 * If the G_BROKEN_FILENAMES environment variable is set, they
   reduce to g_locale_to_utf8()/g_locale_from_utf8().
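
In rough outline, the Unix behaviour described above amounts to
something like the following. This is a sketch, not the GLib source:
both function names are invented, the locale converter is passed in as a
callback standing in for g_locale_to_utf8(), and the len/bytes_read/
bytes_written/error arguments are left out. "No-op" here means the bytes
come back unchanged, as a fresh copy.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: true if the user asked for locale-encoded
 * file names by setting G_BROKEN_FILENAMES (any value counts here,
 * which is an assumption of this sketch).
 */
bool
broken_filenames_requested (void)
{
  return getenv ("G_BROKEN_FILENAMES") != NULL;
}

/* Hypothetical stand-in for g_filename_to_utf8().  The caller frees
 * the result.  locale_to_utf8 is only called when
 * G_BROKEN_FILENAMES is set.
 */
char *
filename_to_utf8_sketch (const char *name,
                         char *(*locale_to_utf8) (const char *))
{
  char *copy;

  if (broken_filenames_requested ())
    return locale_to_utf8 (name);       /* convert from the locale charset */

  /* Default case: assume the name is already UTF-8 and pass it through. */
  copy = malloc (strlen (name) + 1);
  if (copy != NULL)
    strcpy (copy, name);
  return copy;
}
```

The reverse direction would be the mirror image, with a stand-in for
g_locale_from_utf8() as the callback.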

>      5) When making file: URIs, should the % sequences encode the
> actual file names, or the UTF-8 equivalent of the file names, taking
> into account the character set used for file names?

No clue. The one thing I'd consider is that many places should unescape
filenames for display, and the result is only displayable if the
escaped bytes are the filename in UTF-8 form.
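
To make the display issue concrete, here is a minimal sketch of
unescaping the % sequences in a path. The helper names are hypothetical
and this is not what any particular library does; it just decodes %XX
into raw bytes. Whether those bytes can be shown to the user depends
entirely on which encoding was escaped in the first place.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Value of a single hex digit; the caller has checked isxdigit(). */
static int
hex_value (int ch)
{
  if (isdigit (ch))
    return ch - '0';
  return tolower (ch) - 'a' + 10;
}

/* Hypothetical helper: returns a malloc()ed copy of 'escaped' with
 * each valid %XX sequence replaced by the byte it encodes.  Malformed
 * sequences and %00 are copied through literally (a real
 * implementation might prefer to report an error).  The caller frees
 * the result.
 */
char *
unescape_for_display (const char *escaped)
{
  char *out = malloc (strlen (escaped) + 1);  /* output never grows */
  char *q = out;

  if (out == NULL)
    return NULL;

  while (*escaped)
    {
      if (escaped[0] == '%'
          && isxdigit ((unsigned char) escaped[1])
          && isxdigit ((unsigned char) escaped[2]))
        {
          int byte = hex_value ((unsigned char) escaped[1]) * 16
                     + hex_value ((unsigned char) escaped[2]);

          if (byte != 0)
            {
              *q++ = (char) byte;
              escaped += 3;
              continue;
            }
        }
      *q++ = *escaped++;
    }

  *q = '\0';
  return out;
}
```

If the URI was built from the UTF-8 form of the name, "caf%C3%A9.txt"
unescapes to bytes that can be displayed directly. If it was built from
a Latin-1 name, "caf%E9.txt" unescapes to something that is not valid
UTF-8 and needs a charset guess before it can be shown.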

Regards,
                                        Owen




