Hi all,

I want to start a discussion on how to properly deal with strings and unicode, especially in Python 2. This was triggered by [1]. The goal should be to come up with a "best practices" section for the Python GTK+ 3 Tutorial.

My current understanding, and please correct me if I'm wrong, is that in Python 2 str is a sequence of bytes and unicode is a sequence of code points (stored as 16- or 32-bit integers, depending on how Python was built). A unicode object is therefore an abstract representation of a string, independent of any encoding. As GTK+ only supports UTF-8 encoded strings, every unicode object has to be encoded to UTF-8 before it is passed to GTK+. In most cases a unicode object is converted to UTF-8 automatically:

    label = Gtk.Label()
    label.set_text(u"l\xf6\xe6man")

However, label.get_text() returns a str (byte representation) that looks like 'l\xc3\xb6\xc3\xa6man' in Python 2, but a str (unicode representation) in Python 3. This is a pain if you want to retrieve a string from a widget and concatenate it with a unicode string, such as:

    u"F\xfc\xdfe " + label.get_text()

In Python 2 this gives you the infamous UnicodeDecodeError, because the byte string is implicitly decoded with the ASCII codec before the concatenation. In Python 3 things work fine: you provide a unicode representation and you get a unicode representation back. In Python 2 it is a mess. A working solution is either

    u"F\xfc\xdfe " + label.get_text().decode("utf-8")

or

    u"F\xfc\xdfe ".encode("utf-8") + label.get_text()

I personally would prefer to work with unicode representations all the time instead of the byte representation, but I don't know how much we can change this behavior if we want to preserve API/ABI.

What do you think?

[1]: https://bugzilla.gnome.org/show_bug.cgi?id=663610

-- 
Best regards,
Sebastian Pölsterl
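
P.S. To make the "decode at the boundary" idea concrete, here is a minimal sketch of what such a best practice might look like. The helper name to_unicode is hypothetical and not part of PyGObject, and the byte string raw only stands in for what label.get_text() returns under Python 2, so the snippet runs without GTK+ on both Python 2 and Python 3:

```python
def to_unicode(value, encoding="utf-8"):
    """Return value as a unicode string (hypothetical helper, not PyGObject API)."""
    # On Python 2, bytes is an alias for str, so this branch catches the
    # byte string that get_text() returns there. On Python 3, get_text()
    # already returns a unicode str, which is passed through unchanged.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Assumption: stand-in for label.get_text() under Python 2, i.e. the
# UTF-8 encoded bytes of u"l\xf6\xe6man".
raw = b"l\xc3\xb6\xc3\xa6man"

# Concatenation is safe once both operands are unicode.
greeting = u"F\xfc\xdfe " + to_unicode(raw)
```

The point of the design is that application code only ever handles unicode objects and converts once, right where a value comes back from a widget, instead of sprinkling .decode("utf-8") calls throughout.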