Re: utf8 odd behavior with Gtk2



On 01.07.2012 14:32, zentara wrote:
What you will notice, or I do with perl 5.14.1, is that the placement
of the "use utf8::all" changes what is decoded properly.  If that use line
comes before the Gtk2 modules, it doesn't decode input. If placed after, it
works fine.  Furthermore, if you comment out the Gtk2 modules, it works
right.

This is due to Gtk2's treatment of @ARGV. When you call Gtk2::init (for example via 'use Gtk2 -init'), it copies @ARGV into a C array and passes it on to gtk_init, which might remove entries from it. To make these changes visible to the Perl programmer, Gtk2::init then clears @ARGV and copies the contents of the C array back into it. The problem you found occurs because all this copying does not take the UTF8 flag into account (it simply uses SvPV and newSVpv).

So when you use utf8::all before Gtk2, @ARGV contains strings whose internal representation is UTF-8 and which have the UTF8 flag set. When Gtk2::init then reconstructs @ARGV from the C array, it creates Perl strings from UTF-8-encoded byte sequences but does not mark the strings as such (i.e. it does not set the UTF8 flag). When you print these strings, perl sees no UTF8 flag, so it assumes they contain Latin-1-encoded byte sequences and converts them to UTF-8. This leads to the doubly-encoded output that you see.
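To make the double encoding concrete, here is a minimal sketch in plain C of the conversion perl effectively applies when it treats unflagged bytes as Latin-1 and encodes them as UTF-8 on output. The function name latin1_to_utf8 is made up for this illustration; it is not part of Perl or Gtk2.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only.
 * Re-encode `len` bytes of `in` as UTF-8, treating every input byte as one
 * Latin-1 character. `out` must have room for 2 * len + 1 bytes.
 * Returns the number of bytes written, not counting the trailing NUL. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            /* ASCII is identical in Latin-1 and UTF-8. */
            out[n++] = c;
        } else {
            /* Code points 0x80..0xFF take two bytes in UTF-8. */
            out[n++] = (unsigned char)(0xC0 | (c >> 6));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

Feeding it the two UTF-8 bytes of "ä" (0xC3 0xA4) yields four bytes, 0xC3 0x83 0xC2 0xA4, which a UTF-8 terminal shows as "Ã¤". That is the kind of garbage the unflagged @ARGV entries turn into.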

So the diagnosis is easy enough. I'm not so certain about the correct fix, though.

* Do we continue to use SvPV/newSVpv but also store the UTF8 flag, and if it was set, restore it?

* Do we switch to always using SvPVutf8/newSVpvn_utf8, assuming that @ARGV always contains UTF-8-encoded data?

* Do we switch to always using SvPVbyte/newSVpv, assuming that @ARGV always contains Latin1-encoded data?

I'm leaning towards the first option, but I'm not sure. I don't have a firm grasp on the Perl/UTF-8/XS complex yet, and I've yet to see clear documentation for XS authors.
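For what it's worth, the first option could look roughly like the following. This is only a toy model in plain C, not XS: struct toy_sv, saved_flag and rebuild_args are hypothetical names standing in for a Perl scalar and the copy-back step. In real XS code the flag would presumably be read with SvUTF8 and restored with SvUTF8_on. The sketch also assumes that gtk_init removes entries by shifting the pointers in argv without copying the strings, so a saved flag can be found again by pointer identity.

```c
#include <stddef.h>

/* Toy stand-in for a Perl scalar: a string pointer plus a UTF8 flag.
 * Hypothetical type, not part of the Perl API. */
struct toy_sv {
    const char *pv;
    int utf8;
};

/* Find the flag that was saved for the string at `p`.
 * Assumption: entries that survive in argv keep their original pointers. */
static int saved_flag(const struct toy_sv *saved, size_t n_saved, const char *p)
{
    for (size_t i = 0; i < n_saved; i++)
        if (saved[i].pv == p)
            return saved[i].utf8;
    return 0; /* unknown string: leave the flag off */
}

/* Rebuild the scalars from a possibly shortened argv, restoring each flag.
 * `out` must have room for `argc` entries. Returns the number written. */
size_t rebuild_args(const struct toy_sv *saved, size_t n_saved,
                    char **argv, size_t argc, struct toy_sv *out)
{
    for (size_t i = 0; i < argc; i++) {
        out[i].pv = argv[i];
        out[i].utf8 = saved_flag(saved, n_saved, argv[i]);
    }
    return argc;
}
```

The lookup by pointer rather than by index matters because entries may have been removed from the middle of the array, so positions before and after the call need not line up.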


