Re: auto-upgrading strings to utf8



On Sun, 2005-06-05 at 04:28 +0200, Quentin wrote:
> With gtk2-perl, most strings passed as arguments to a glib/gtk function
> are auto-upgraded to utf8. That's not the case in C, where arguments
> must already be in the proper encoding before they are passed to a
> gtk/glib function.
>
> This leads to a problem with the GStreamer bindings, where a filename
> is a string property of a Glib object and is thus auto-upgraded to
> utf8, although it shouldn't be.
> So, following a chat with muppet on IRC, we were wondering if
> automatically upgrading text to utf8 is the right thing to do?

It is :-)

> I tried disabling auto-upgrading in Glib, and my program (a very complex
> jukebox) runs fine because all the data I use is utf8, so there is no
> need to upgrade strings to utf8 in this case.

The problem here is the data, not the upgrade.

Your strings are utf8 but you don't let perl know. That will break
things all over the place, not just in Glib/Gtk2 (regular expressions and
pretty much every other string operator won't work on your data).
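
A minimal sketch of the kind of breakage meant here, assuming a string
that holds utf8 bytes perl has not been told about. The variable names
are made up for illustration; only the core Encode module is used.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "é" encoded as utf8 is the two bytes 0xC3 0xA9.
my $bytes = "caf\xC3\xA9";            # raw bytes, utf8 flag off
my $chars = decode('UTF-8', $bytes);  # character string, utf8 flag on

# length() counts bytes in the first case and characters in the second.
printf "length of undecoded string: %d\n", length $bytes;
printf "length of decoded string:   %d\n", length $chars;

# \w+ stops at the first high byte of the undecoded string,
# but matches the whole word once perl knows the data is utf8.
my ($word_bytes) = $bytes =~ /^(\w+)/;
my ($word_chars) = $chars =~ /^(\w+)/;

printf "matched %d of %d bytes\n",      length $word_bytes, length $bytes;
printf "matched %d of %d characters\n", length $word_chars, length $chars;
```

The same mismatch shows up with substr, uc/lc, sorting and so on, since
they all work on what perl believes the characters are.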

If you know for sure that your data is utf8, call Encode::_utf8_on(...)
on your string. _utf8_on is very cheap; it just flips a bit.
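
As a rough illustration, assuming the bytes really are well-formed utf8:
_utf8_on does no validation, so for data of unknown origin
Encode::decode is the safer (if slower) route. The variable names below
are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $raw = "na\xC3\xAFve";   # utf8 bytes for "naïve", utf8 flag off

# Option 1: flip the flag in place on a copy. Nothing is checked, so this
# is only safe when the bytes are known to be valid utf8.
my $flagged = $raw;
Encode::_utf8_on($flagged);

# Option 2: decode. By default malformed sequences are replaced with a
# substitution character instead of producing a corrupt string.
my $decoded = decode('UTF-8', $raw);

printf "raw:     flag %s, length %d\n",
    (utf8::is_utf8($raw)     ? 'on' : 'off'), length $raw;
printf "flagged: flag %s, length %d\n",
    (utf8::is_utf8($flagged) ? 'on' : 'off'), length $flagged;
printf "same result as decode: %s\n", ($flagged eq $decoded ? 'yes' : 'no');
```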

So from a Perl point of view your strings are broken. Please note that
I'm not saying that Perl's POV is right, but it's the way it is, and it
isn't likely to change before Perl 6. In perl today a string can be either
in the encoding of the locale (usually iso-8859-1 or iso-8859-15 in France
and Denmark) or utf8, in which case the utf8 flag is set on the string.

There are two separate issues here. One is that gtk+ requires strings to
be utf8 (this one is a non-issue for us because Perl knows about utf8, so
with our typemap all is good, unless the strings are not valid).

The second issue is filenames. This one is harder. Some applications use
utf8 filenames regardless of locale (personally I think that is a bug -
but there can be valid reasons to do so I suppose).

> The problem is how to keep existing code working...

That's easy. Don't break the typemap. IMHO the typemap is right as it is
(not a big surprise seeing how I made the first version of it.) 

I do not think that it's a good idea to break the common case to get
filenames right. I don't think it's acceptable to not be able to print
the same string as you would put in a Label.

> Any thoughts on how to fix the problem?

Ideally the glib filename functions should be fixed. Quite a few
glib-based programs have had problems with filenames, which become utf8
even though the locale is, say, iso-8859-15.

I think we should provide a filename helper of some sort, either as a
function that takes a perl string and returns a filename suited for the
locale, or perhaps by handling the conversion in the wrappers for the
functions that access the filesystem.
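
A minimal sketch of the first variant, to make the idea concrete.
filename_for_locale is a hypothetical name, not an existing Glib or Gtk2
function, and the optional second argument is an assumption added so the
codeset can be overridden. It uses only the core Encode and
I18N::Langinfo modules.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: takes a perl character string and returns the bytes
# to hand to the filesystem. If no codeset is given, the codeset of the
# current locale is used (the program may need to call POSIX::setlocale
# first for that to reflect the user's environment).
sub filename_for_locale {
    my ($string, $codeset) = @_;
    $codeset = langinfo(CODESET) unless defined $codeset;

    my $enc = find_encoding($codeset)
        or die "Encode does not know the codeset '$codeset'\n";

    # Characters the codeset cannot represent are replaced with a
    # substitution character, since no CHECK argument is passed.
    return $enc->encode($string);
}

# The same perl string gives different bytes depending on the codeset.
for my $codeset ('iso-8859-15', 'UTF-8') {
    my $name = filename_for_locale("caf\x{e9}.ogg", $codeset);
    printf "%-12s %s\n", $codeset,
        join ' ', map { sprintf '%02x', ord } split //, $name;
}
```

Doing the conversion in the wrappers instead would amount to the same
call, just made inside the binding so that callers never see the bytes.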

./borup



