Dealing with strings



Hi all,

I want to start a discussion, triggered by [1], on how to properly deal
with strings and Unicode, especially in Python 2. The goal should be to
come up with a "best practices" section in the Python GTK+ 3 Tutorial.

My current understanding is, and please correct me if I'm wrong, that in
Python 2 str is a sequence of bytes and unicode is a sequence of code
points (stored internally as 16- or 32-bit units, depending on how
Python was built). Therefore, a unicode object is an abstract
representation of a string, independent of any encoding. As GTK+ only
supports UTF-8 encoded strings, you have to encode every unicode object
to UTF-8 before supplying it to GTK+. In most cases a unicode object is
converted to UTF-8 automatically:

	label = Gtk.Label()
	label.set_text(u"l\xf6\xe6man")
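To make the difference between the two representations concrete, here is
a small sketch in plain Python (no GTK+ involved) that encodes the same
text by hand. The variable names are mine and only for illustration; it
should run on both Python 2.7 and Python 3.

```python
# Plain Python, no GTK+ required.
text = u"l\xf6\xe6man"           # unicode representation: six code points
encoded = text.encode("utf-8")   # byte representation, as GTK+ expects it

print(len(text))                             # 6
print(len(encoded))                          # 8, the two non-ASCII characters take two bytes each
print(encoded == b"l\xc3\xb6\xc3\xa6man")    # True
print(encoded.decode("utf-8") == text)       # True, decoding restores the text
```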

However, "label.get_text()" returns a str in both versions, and that
means two different things: in Python 2 it is the byte representation,
which looks like 'l\xc3\xb6\xc3\xa6man', whereas in Python 3 it is the
unicode representation. This is a pain in Python 2 if you want to
retrieve a string from a widget and concatenate it with a unicode
string, such as:

	u"F\xfc\xdfe " + label.get_text()

which will give you the infamous UnicodeDecodeError, because Python 2
implicitly tries to decode the byte string with the ASCII codec.
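The failure can be reproduced without GTK+ by doing explicitly what
Python 2 does behind the scenes when a unicode object and a byte string
are mixed. This is only a sketch: "raw" stands in for the value that
label.get_text() returns on Python 2, and the explicit ASCII decode
mimics the implicit conversion (assuming the default encoding has not
been changed).

```python
# "raw" is a stand-in for label.get_text() on Python 2: UTF-8 encoded bytes.
raw = u"l\xf6\xe6man".encode("utf-8")

try:
    # Python 2 performs this step implicitly for `unicode + str`.
    raw.decode("ascii")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc)   # the byte 0xc3 is outside the ASCII range
```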

In Python 3 things work fine: you provide a unicode representation and
you get a unicode representation back. In Python 2 it is a mess. A
working solution is

	u"F\xfc\xdfe " + label.get_text().decode("utf-8")

or

	u"F\xfc\xdfe ".encode("utf-8") + label.get_text()
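If the same code has to run on both Python versions, one option is to
wrap the decoding in a small helper. The sketch below is only a
suggestion: "to_text" is a name I made up, not part of PyGObject, and
"raw" again stands in for what a widget returns on Python 2.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a unicode representation, decoding bytes if needed."""
    # On Python 2, bytes is an alias for str, so widget return values match.
    # On Python 3, str values fall through unchanged.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = u"l\xf6\xe6man".encode("utf-8")      # stand-in for a Python 2 get_text() result
greeting = u"F\xfc\xdfe " + to_text(raw)   # no UnicodeDecodeError
print(greeting == u"F\xfc\xdfe l\xf6\xe6man")   # True
```

The helper keeps everything in the unicode representation, so the
result can be concatenated with other unicode objects on either version.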

I personally would prefer to work with unicode representations all the
time instead of the byte representation, but I don't know how much we
can change this behavior if we want to preserve API/ABI.

What do you think?

[1]: https://bugzilla.gnome.org/show_bug.cgi?id=663610

-- 
Best regards,
Sebastian Pölsterl
