Hi all,

if there are no complaints, I will revert commit
654711d0f940d7480d0f1cdb25a3dc9996f7a706 in master. That way we stay
backwards compatible; the downside is that users have to be really
careful when dealing with str/unicode in Python 2.

Best regards,
Sebastian

On 22.01.2012 12:14, Sebastian Pölsterl wrote:
> On 22.01.2012 10:52, Martin Pitt wrote:
>> Hello friends of PyGI,
>> [...]
>>
>>> As GTK+ only supports utf-8 encoded strings you have to encode every
>>> unicode object to utf-8 before supplying it to GTK+. In most cases a
>>> unicode object is automatically converted to utf-8:
>>>
>>> label = Gtk.Label()
>>> label.set_text(u"l\xf6\xe6man")
>>
>> I still don't know which weird magic Python does to make something
>> sensible out of this. \xf6\xe6 is NOT valid UTF-8, it is ISO-8859-1
>> aka latin-1. Perhaps, because it is so widely used, it tries UTF-8
>> first and then tries to interpret strings as latin-1 when that fails?
>>
> Keep in mind that unicode objects do _not_ represent a specific encoding
> such as latin1 or utf-8; a unicode object is a sequence of code points.
> As you can see at [1], "ö" has the code point U+00F6 and its utf-8 byte
> value is 0xc3 0xb6. The code points up to U+00FF simply coincide with
> latin1 (by design).
>
>> In the other direction it gets it right:
>>
>> >>> print "l\xf6\xe6man".decode('UTF-8')
>> UnicodeDecodeError: 'utf8' codec can't decode byte 0xf6 in position 1: invalid start byte
>>
> You are using a str object here, which means \xf6 is interpreted as a
> byte value. As you said, the byte 0xF6 is not valid utf-8.
>
>> >>> print "l\xf6\xe6man".decode('latin1')
>> löæman
>>
> Same as above, but 0xF6 is a valid value in latin-1.
>
>>> However, "label.get_text()" will return a str (byte representation)
>>> that looks like 'l\xc3\xb6\xc3\xa6man' in Python 2, but a str (unicode
>>> representation) in Python 3.
>>
>> It seems pygobject automatically converts return values to str (i. e.
>> unicode) in Python 3, but keeps the GTK+ UTF-8 byte array (i. e. str
>> in Python 2, bytes in Python 3) for Python 2. I think that's the
>> root of that bug report [1], i. e. that pygobject for Python 2 takes
>> and gives UTF-8 byte arrays, while you would prefer giving and taking
>> unicode.
>>
> That's correct, there's an inconsistency between Python 2 and Python 3.
>
>>> This is a pain if you want to retrieve a
>>> string from a widget and concatenate it with a string, such as:
>>>
>>> u"F\xfd\xdfe " + l.get_text()
>>>
>>> which will give you the infamous UnicodeDecodeError.
>>
>> Indeed, but (apart from the invalid UTF-8 encoding) this is not
>> specific to GTK or pygobject. You get the very same with
>>
> Again, unicode != utf-8 encoded.
>
>> >>> u'ä' + 'ä'
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
>> ordinal not in range(128)
>>
>> because you use two different data types. I guess Python does not want
>> to make any assumptions about the encoding of the second byte array.
>>
> Of course, I just wanted to point out that it is very easy with Python 2
> and GTK+ to get into a lot of trouble by mixing str and unicode objects.
>
>>> Whereas in Python 3 things work fine (you provide a unicode
>>> representation and you get a unicode representation), it is a mess in
>>> Python 2. A working solution is
>>>
>>> u"F\xfd\xdfe " + l.get_text().decode("utf-8")
>>>
>>> or
>>>
>>> u"F\xfc\xdfe ".encode("utf-8") + t
>>
>> Right, then + gets values of the same data type.
>>
>>> I personally would prefer to work with unicode representations all the
>>> time instead of the byte representation, but I don't know how much we
>>> can change this behavior if we want to preserve API/ABI.
>>
>> As I already said in [2], I agree that it would generally make more
>> sense to consistently use unicode objects and transparently encode
>> (arguments) / decode (return values) them for all "utf8" type
>> arguments. It seems we did get that right in pygobject for Python 3 at
>> last, but as this never happened in Python 2, changing it now would
>> break pretty much all existing software out there which uses
>> gi.repository.Gtk. For the same reason I think we should revert the
>> commit from [1]: it makes the situation even worse by doing the
>> unicode conversion for just one particular Gtk method call, while
>> everything else still keeps being byte arrays.
>>
>> So I'm afraid we need to live with the situation that we have for
>> Python 2: ignore unicode and just work with UTF-8 encoded byte arrays
>> all the time.
>>
> So this means we discourage the use of unicode objects in Python 2?
>
>
> [1]: http://www.utf8-chartable.de/
>
> Best regards,
> Sebastian Pölsterl
>
> _______________________________________________
> python-hackers-list mailing list
> python-hackers-list gnome org
> http://mail.gnome.org/mailman/listinfo/python-hackers-list
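
For anyone following along, the difference between code points and
encoded bytes discussed in the quoted thread can be tried out without
GTK+ at all. The following is only a rough sketch, written for Python 3
(where str plays the role of Python 2's unicode and bytes the role of
Python 2's str). The variable names are made up for illustration, and
`raw` merely stands in for what get_text() would hand back under
Python 2; no pygobject call is involved.

```python
# Python 3 sketch: `str` corresponds to Python 2's `unicode`,
# `bytes` corresponds to Python 2's `str`.

# A text object holds code points (U+00F6, U+00E6), not an encoding.
text = "l\xf6\xe6man"

# Encoding turns code points into bytes; the result depends on the codec.
utf8_bytes = text.encode("utf-8")      # two bytes per non-ASCII character
latin1_bytes = text.encode("latin-1")  # one byte per code point below U+0100

# latin-1 bytes are not valid UTF-8, so decoding them as UTF-8 fails,
# matching the UnicodeDecodeError quoted in the thread.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Hypothetical stand-in for the UTF-8 byte string a widget would return
# under Python 2; decode it before concatenating with text.
raw = utf8_bytes
combined = "F\xfc\xdfe " + raw.decode("utf-8")
```

Decoding at the boundary, as in the last line, is the same workaround
as the .decode("utf-8") variant quoted above; under Python 2 it has to
be applied by hand to every return value.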