Re: Dealing with strings



Hello friends of PyGI,

Sebastian Pölsterl [2012-01-18 22:19 +0100]:
> My current understanding is, and please correct me if I'm wrong, that
> str is a list of bytes and unicode is a list of code points (32 bit
> integer each). Therefore, a unicode object is an abstract representation
> of a string independent of the encoding.

Right. Python 2 does not do a very good job of telling these apart,
but Python 3 strictly separates them as distinct data types.

> As GTK+ only supports utf-8 encoded strings you have to encode every
> unicode object to utf-8 before supplying it to GTK+. In most cases a
> unicode object is automatically converted to utf-8:
> 
> 	label = Gtk.Label()
> 	label.set_text(u"l\xf6\xe6man")

I don't think any magic is involved here: in a unicode literal, \xf6
and \xe6 are code point escapes (U+00F6 and U+00E6), not raw bytes, so
there is no encoding to guess. They only look like ISO-8859-1 aka
latin-1 because the first 256 Unicode code points coincide with
latin-1. As a byte string, \xf6\xe6 is NOT valid UTF-8.

Decoding the corresponding byte string shows the difference:

>>> print "l\xf6\xe6man".decode('UTF-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf6 in position 1: invalid start byte

>>> print "l\xf6\xe6man".decode('latin1')
löæman
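
The same distinction can be reproduced under Python 3, where bytes and
str are separate types. This is only a sketch with made-up variable
names; it makes no claim about what pygobject does internally.

```python
# Python 3: "..." is text (code points), b"..." is raw bytes.
text = "l\xf6\xe6man"            # \xf6, \xe6 are code points U+00F6, U+00E6
latin1_bytes = b"l\xf6\xe6man"   # the same escapes, but as two raw bytes

# Latin-1 maps each byte 0x00-0xFF to the code point with the same number,
# so decoding the bytes as latin-1 yields exactly the text above.
assert latin1_bytes.decode("latin-1") == text

# The UTF-8 encoding of the text needs two bytes per non-ASCII character.
utf8_bytes = text.encode("utf-8")
assert utf8_bytes == b"l\xc3\xb6\xc3\xa6man"

# The latin-1 bytes are not valid UTF-8, so decoding them as UTF-8 fails.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
```

The UTF-8 byte sequence here is the same 'l\xc3\xb6\xc3\xa6man' that
get_text() is reported to return under Python 2.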

> However, "label.get_text()" will return a str (byte representation) that
> looks like 'l\xc3\xb6\xc3\xa6man' in Python 2 but str (unicode
> representation) in Python 3.

It seems pygobject automatically converts return values to str (i. e.
unicode) in Python 3, but keeps the GTK+ UTF-8 byte array (i. e. str
in Python 2, bytes in Python 3) for Python 2. I think that's the
root of that bug report [1], i. e. that pygobject for Python 2 takes
and returns UTF-8 byte arrays, while you would prefer giving and
taking unicode.

> This is a pain if you want to retrieve a
> string from a widget and concatenate it with a string, such as:
> 
> 	u"F\xfc\xdfe " + l.get_text()
> 
> which will give you the infamous UnicodeDecodeError.

Indeed, but this is not specific to GTK or pygobject. You get the
very same with

>>> u'ä' + 'ä'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)

because you use two different data types. I guess Python does not want
to make any assumptions about the encoding of the byte array, so it
falls back to ASCII when converting it.
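
For comparison, here is a small Python 3 sketch of the same situation.
Python 3 refuses to mix the two types at all, and the implicit ASCII
decode that Python 2 attempts can be replayed by hand. The variable
names are only illustrative.

```python
text = "\xe4"                        # one code point, U+00E4
utf8_bytes = text.encode("utf-8")    # two bytes: 0xc3 0xa4

# Python 3 does not convert implicitly: str + bytes is a TypeError.
try:
    text + utf8_bytes
    mixed = "ok"
except TypeError:
    mixed = "TypeError"

# Python 2 instead decoded the byte string with the default ASCII codec.
# Doing that step explicitly fails on the first non-ASCII byte.
try:
    utf8_bytes.decode("ascii")
    implicit = "ok"
except UnicodeDecodeError as exc:
    implicit = exc.reason
```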

> Whereas in Python 3 things work fine, you provide a unicode
> representation and you get a unicode representation, it is a mess in
> Python 2. A working solution is
> 
> 	u"F\xfc\xdfe " + l.get_text().decode("utf-8")
> 
> or
> 
> 	u"F\xfc\xdfe ".encode("utf-8") + t

Right, then + gets values of the same data type.
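
Both variants can be sketched without GTK, written in Python 3 syntax.
The widget_bytes value below is a hypothetical stand-in for what
get_text() returns under Python 2; no pygobject call is made.

```python
# Hypothetical stand-in for the UTF-8 bytes a widget would hand back.
widget_bytes = "l\xf6\xe6man".encode("utf-8")
prefix = "F\xfc\xdfe "

# Variant 1: decode the bytes, then concatenate text with text.
as_text = prefix + widget_bytes.decode("utf-8")

# Variant 2: encode the text, then concatenate bytes with bytes.
as_bytes = prefix.encode("utf-8") + widget_bytes

# Both routes describe the same string, once as text and once as UTF-8.
assert as_text.encode("utf-8") == as_bytes
```

Which variant is preferable depends on what the surrounding code
expects: the first keeps unicode inside the program, the second keeps
everything as UTF-8 byte arrays.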

> I personally would prefer to work with unicode representations all the
> time instead of the byte representation, but I don't know how much we
> can change this behavior if we want to preserve API/ABI.

As I already said in [2], I agree that it would generally make more
sense to consistently use unicode objects and transparently encode
(arguments) / decode (return values) them for all "utf8" type
arguments. It seems we did get that right in pygobject for Python 3 at
least, but as this never happened for Python 2, changing it now would
break pretty much all existing software out there which uses
gi.repository.Gtk. For the same reason I think we should revert the
commit from [1]: it makes the situation even worse by doing the
unicode conversion for just one particular Gtk method call, while
everything else still uses byte arrays.
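
A rough sketch of what "transparently encode arguments / decode return
values" could look like, in Python 3 syntax. utf8_boundary and
fake_echo are made-up names for illustration only; this is not how
pygobject implements its marshalling.

```python
import functools

def utf8_boundary(func):
    """Hypothetical wrapper: text in, text out, UTF-8 bytes in between."""
    @functools.wraps(func)
    def wrapper(*args):
        # Encode every text argument to UTF-8; pass other values through.
        raw_args = [a.encode("utf-8") if isinstance(a, str) else a
                    for a in args]
        result = func(*raw_args)
        # Decode a bytes return value back to text; leave the rest alone.
        if isinstance(result, bytes):
            return result.decode("utf-8")
        return result
    return wrapper

@utf8_boundary
def fake_echo(raw):
    # Stand-in for a C function that takes and returns a "utf8" value.
    assert isinstance(raw, bytes)
    return raw + b"!"

result = fake_echo("l\xf6\xe6man")
```

The point of the sketch is consistency: such a conversion only helps
if it applies to every "utf8" value, not to a single method.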

So I'm afraid we need to live with the situation that we have for
Python 2: ignore unicode and just work with UTF-8 encoded byte arrays
all the time.

Thanks,

Martin

[1] https://bugzilla.gnome.org/show_bug.cgi?id=663610
[2] https://bugzilla.gnome.org/show_bug.cgi?id=663610#c13

-- 
Martin Pitt                        | http://www.piware.de
Ubuntu Developer (www.ubuntu.com)  | Debian Developer  (www.debian.org)
