On 22.01.2012 10:52, Martin Pitt wrote:
> Hello friends of PyGI,
>
> [...]
>
>> As GTK+ only supports utf-8 encoded strings you have to encode every
>> unicode object to utf-8 before supplying it to GTK+. In most cases a
>> unicode object is automatically converted to utf-8:
>>
>>   label = Gtk.Label()
>>   label.set_text(u"l\xf6\xe6man")
>
> I still don't know which weird magic Python does to make something
> sensible out of this. \xf6\xe6 is NOT valid UTF-8, it is ISO-8859-1
> aka latin-1. Perhaps, because it is so widely used, it tries UTF-8
> first and then tries to interpret strings as latin-1 when that fails?
>

Keep in mind that unicode objects do _not_ represent a specific encoding
such as latin1 or utf-8; a unicode object is a sequence of code points.
As you can see at [1], "ö" has the code point U+00F6 and its utf-8 byte
value is 0xc3 0xb6. The first 256 code points coincide with latin1 by
design, which is why \xf6 in a unicode literal looks like latin1.

> In the other direction it gets it right:
>
> >>> print "l\xf6\xe6man".decode('UTF-8')
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xf6 in position 1: invalid start byte
>

You are using a str object here, which means \xf6 is interpreted as a
byte value. As you said, the byte 0xF6 is not valid utf-8.

> >>> print "l\xf6\xe6man".decode('latin1')
> löæman
>

Same as above, but 0xF6 is a valid byte value in latin-1.

>> However, "label.get_text()" will return a str (byte representation) that
>> looks like 'l\xc3\xb6\xc3\xa6man' in Python 2 but str (unicode
>> representation) in Python 3.
>
> It seems pygobject automatically converts return values to str (i. e.
> unicode) in Python 3, but keeps the GTK+ UTF-8 byte array (i. e. str
> in Python 2, unicode in Python 3) for Python 2. I think that's the
> root of that bug report [1], i. e. that pygobject for python 2 takes
> and gives UTF-8 byte arrays, while you would prefer giving and taking
> unicode.
>

That's correct, there's an inconsistency between Python 2 and Python 3.
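
To make the difference between code points and encodings concrete, here
is a minimal sketch. It assumes Python 3 (the sessions quoted above are
Python 2), where bytes plays the role of Python 2's str and str the role
of unicode. It does not touch GTK+ at all; the variable names are only
for illustration.

```python
# Python 3 sketch: "bytes" corresponds to Python 2's str,
# "str" corresponds to Python 2's unicode.
text = "l\xf6\xe6man"  # code points U+00F6 (ö) and U+00E6 (æ), no encoding yet

utf8 = text.encode("utf-8")      # two bytes per non-ASCII character here
latin1 = text.encode("latin-1")  # one byte per character, same values as the code points

print(utf8)    # b'l\xc3\xb6\xc3\xa6man'
print(latin1)  # b'l\xf6\xe6man'

# The latin-1 bytes are not valid utf-8, so decoding them as utf-8 fails.
try:
    latin1.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
print(error.reason)  # invalid start byte

# Decoding with the matching codec gives back the same code points.
assert utf8.decode("utf-8") == text
assert latin1.decode("latin-1") == text
```

The same text therefore has two different byte representations, and only
the utf-8 one is what GTK+ expects.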
>> This is a pain if you want to retrieve a
>> string from a widget and concatenate it with a string, such as:
>>
>>   u"F\xfd\xdfe " + l.get_text()
>>
>> which will give you the infamous UnicodeDecodeError.
>
> Indeed, but (apart from the invalid UTF-8 encoding) this is not
> specific to GTK or pygobject. You get the very same with
>

Again, unicode != utf-8 encoded: the u"..." literal holds code points,
not encoded bytes, so there is no invalid UTF-8 in it.

> >>> u'ä' + 'ä'
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
> ordinal not in range(128)
>
> because you use two different data types. I guess Python does not want
> to make any assumptions about the encoding of the second byte array.
>

Of course. I just wanted to point out that with Python 2 and GTK+ it is
very easy to get into a lot of trouble by mixing str and unicode objects.

>> Whereas in Python 3 things work fine, you provide a unicode
>> representation and you get a unicode representation, it is a mess in
>> Python 2. A working solution is
>>
>>   u"F\xfd\xdfe " + l.get_text().decode("utf-8")
>>
>> or
>>
>>   u"F\xfc\xdfe ".encode("utf-8") + t
>
> Right, then + gets values of the same data type.
>
>> I personally would prefer to work with unicode representations all the
>> time instead of the byte representation, but I don't know how much we
>> can change this behavior if we want to preserve API/ABI.
>
> As I already said in [2] I agree that it would generally make more
> sense to consistently use unicode objects and transparently encode
> (arguments) / decode (return values) them for all "utf8" type
> arguments. It seems we did get that right in Pygobject for Python 3 at
> last, but as in python2 this never happened, changing it now would
> mean to break pretty much all existing software out there which uses
> gi.repository.Gtk. For the same reason I think we should revert the
> commit from [1]: It makes the situation even worse by doing the
> unicode conversion for just one particular Gtk method call, while
> everything else still keeps being byte arrays.
>
> So I'm afraid we need to live with the situation that we have for
> Python 2: Ignore unicode and just work with UTF-8 encoded byte arrays
> all the time.
>

So this means we discourage the use of unicode objects in Python 2?

[1]: http://www.utf8-chartable.de/

Best regards,
Sebastian Pölsterl