Re: Dealing with strings

From: Sebastian Pölsterl <sebp k-d-w org>
To: Martin Pitt <martin pitt ubuntu com>
Cc: python-hackers-list gnome org
Subject: Re: Dealing with strings
Date: Sun, 22 Jan 2012 12:14:58 +0100

On 22.01.2012 10:52, Martin Pitt wrote:
> Hello friends of PyGI,
> [...]
> 
>> As GTK+ only supports utf-8 encoded strings you have to encode every
>> unicode object to utf-8 before supplying it to GTK+. In most cases a
>> unicode object is automatically converted to utf-8:
>>
>> 	label = Gtk.Label()
>> 	label.set_text(u"l\xf6\xe6man")
> 
> I still don't know which weird magic Python does to make something
> sensible out of this. \xf6\xe6 is NOT valid UTF-8, it is ISO-8859-1
> aka latin-1. Perhaps, because it is so widely used, it tries UTF-8
> first and then tries to interpret strings as latin-1 when that fails?
> 
Keep in mind that unicode objects do _not_ represent a specific encoding
such as latin1 or utf-8; a unicode object is a sequence of code points.
As you can see at [1], "ö" has the code point U+00F6, and its utf-8
encoding is the two bytes 0xc3 0xb6. The code points merely coincide
with latin1 here (by design: the first 256 Unicode code points match
latin1).
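
To make the distinction concrete, here is a minimal sketch (the variable
names are mine, and nothing here touches GTK+). It takes the same literal
as in the label example and shows that the bytes only come into being
once you pick an encoding:

```python
# A unicode object is a sequence of code points, not bytes.
text = u"l\xf6\xe6man"  # same literal as in the quoted example

# Each character maps to one code point; \xf6 here means U+00F6.
code_points = [ord(ch) for ch in text]

# Bytes only exist after choosing an encoding.
utf8_bytes = text.encode("utf-8")      # two bytes per non-ASCII character here
latin1_bytes = text.encode("latin-1")  # one byte per character

print([hex(cp) for cp in code_points])
print(repr(utf8_bytes))
print(repr(latin1_bytes))
```

The latin1 bytes look like the escape sequences in the literal only
because latin1 maps each byte value to the code point with the same
number.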

> In the other direction it gets it right:
> 
>>>> print "l\xf6\xe6man".decode('UTF-8')
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xf6 in position 1: invalid start byte
> 
You are using a str object here, which means \xf6 is interpreted as a
byte value. As you said, the byte 0xF6 is not valid utf-8.

>>>> print "l\xf6\xe6man".decode('latin1')
> löæman
> 
Same as above, but 0xF6 is a valid value in latin-1.
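
The two decode calls can be put side by side in a small sketch. I am
using a b"" literal so the snippet also behaves the same on Python 3,
where it stands in for a Python 2 str; the names raw, utf8_ok and
as_latin1 are just for illustration:

```python
# Bytes as they would appear in a Python 2 str literal.
raw = b"l\xf6\xe6man"

# 0xF6 cannot start a utf-8 sequence, so decoding as utf-8 fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# latin-1 assigns a character to every byte value, so this always succeeds.
as_latin1 = raw.decode("latin-1")

print(utf8_ok)
print(as_latin1 == u"l\xf6\xe6man")
```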

>> However, "label.get_text()" will return a str (byte representation) that
>> looks like 'l\xc3\xb6\xc3\xa6man' in Python 2 but str (unicode
>> representation) in Python 3.
> 
> It seems pygobject automatically converts return values to str (i. e.
> unicode) in Python 3, but keeps the GTK+ UTF-8 byte array (i. e. str
> in Python 2, bytes in Python 3) for Python 2. I think that's the
> root of that bug report [1], i. e. that pygobject for python 2 takes
> and gives UTF-8 byte arrays, while you would prefer giving and taking
> unicode.
> 
That's correct, there's an inconsistency between Python 2 and Python 3.
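
Until that is resolved, application code can paper over the difference
itself. The following is only a sketch of such a workaround; to_text is
a hypothetical helper of mine, not something PyGObject provides, and it
assumes the byte strings coming back from GTK+ are utf-8:

```python
def to_text(value, encoding="utf-8"):
    """Return a unicode object whether value is bytes or already unicode.

    Hypothetical helper: on Python 2, bytes is an alias for str, so a
    utf-8 encoded return value gets decoded; on Python 3 the value is
    already unicode and is passed through unchanged.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Stand-ins for what label.get_text() would hand back on each version.
from_python2 = to_text(b"l\xc3\xb6\xc3\xa6man")
from_python3 = to_text(u"l\xf6\xe6man")
print(from_python2 == from_python3)
```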

>> This is a pain if you want to retrieve a
>> string from a widget and concatenate it with a string, such as:
>>
>> 	u"F\xfc\xdfe " + l.get_text()
>>
>> which will give you the infamous UnicodeDecodeError.
> 
> Indeed, but (apart from the invalid UTF-8 encoding) this is not
> specific to GTK or pygobject. You get the very same with
> 
Again, unicode != utf-8 encoded.

>>>> u'ä' + 'ä'
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
> ordinal not in range(128)
> 
> because you use two different data types. I guess Python does not want
> to make any assumptions about the encoding of the second byte array.
> 
Of course, I just wanted to point out that it is very easy with Python 2
and GTK+ to get into a lot of trouble by mixing str and unicode objects.
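
Here is a rough sketch of what goes wrong. Python 2 decodes the str
operand with the default ASCII codec before concatenating; the helper
concat_like_python2 is hypothetical and merely spells that step out so
the snippet behaves the same on Python 3. widget_text is a stand-in for
the return value of get_text(), no GTK+ involved:

```python
# Stand-in for the utf-8 byte string get_text() returns on Python 2.
widget_text = u"l\xf6\xe6man".encode("utf-8")


def concat_like_python2(prefix, raw):
    """Emulate Python 2's unicode + str: raw is decoded as ASCII first."""
    return prefix + raw.decode("ascii")


# The implicit ASCII decode fails on the first non-ASCII byte.
try:
    concat_like_python2(u"F\xfc\xdfe ", widget_text)
    failed = False
except UnicodeDecodeError:
    failed = True

# Decoding explicitly with the right codec gives two unicode operands.
fixed = u"F\xfc\xdfe " + widget_text.decode("utf-8")

print(failed)
print(fixed == u"F\xfc\xdfe l\xf6\xe6man")
```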

>> Whereas in Python 3 things work fine, you provide a unicode
>> representation and you get a unicode representation, it is a mess in
>> Python 2. A working solution is
>>
>> 	u"F\xfc\xdfe " + l.get_text().decode("utf-8")
>>
>> or
>>
>> 	u"F\xfc\xdfe ".encode("utf-8") + l.get_text()
> 
> Right, then + gets values of the same data type.
> 
>> I personally would prefer to work with unicode representations all the
>> time instead of the byte representation, but I don't know how much we
>> can change this behavior if we want to preserve API/ABI.
> 
> As I already said in [2] I agree that it would generally make more
> sense to consistently use unicode objects and transparently encode
> (arguments) / decode (return values) them for all "utf8" type
> arguments. It seems we did get that right in Pygobject for Python 3 at
> last, but as in python2 this never happened, changing it now would
> mean to break pretty much all existing software out there which uses
> gi.repository.Gtk.  For the same reason I think we should revert the
> commit from [1]: It makes the situation even worse by doing the
> unicode conversion for just one particular Gtk method call, while
> everything else still keeps being byte arrays.
> 
> So I'm afraid we need to live with the situation that we have for
> Python 2: Ignore unicode and just work with UTF-8 encoded byte arrays
> all the time.
> 
So this means we discourage the use of unicode objects in Python 2?


[1]: http://www.utf8-chartable.de/

Best regards,
Sebastian Pölsterl
