Re: Dealing with strings

From: Sebastian Pölsterl <sebp k-d-w org>
To: Martin Pitt <martin pitt ubuntu com>
Cc: python-hackers-list gnome org
Subject: Re: Dealing with strings
Date: Sun, 05 Feb 2012 12:13:18 +0100
Hi all,

if there are no complaints I will revert commit
654711d0f940d7480d0f1cdb25a3dc9996f7a706 in master. That way we stay
backwards compatible; the downside is that users have to be really careful
when dealing with str/unicode in Python 2.

Best regards,
Sebastian

On 22.01.2012 12:14, Sebastian Pölsterl wrote:
> On 22.01.2012 10:52, Martin Pitt wrote:
>> Hello friends of PyGI,
>> [...]
>>
>>> As GTK+ only supports utf-8 encoded strings you have to encode every
>>> unicode object to utf-8 before supplying it to GTK+. In most cases a
>>> unicode object is automatically converted to utf-8:
>>>
>>> 	label = Gtk.Label()
>>> 	label.set_text(u"l\xf6\xe6man")
>>
>> I still don't know which weird magic Python does to make something
>> sensible out of this. \xf6\xe6 is NOT valid UTF-8, it is ISO-8859-1
>> aka latin-1. Perhaps, because it is so widely used, it tries UTF-8
>> first and then tries to interpret strings as latin-1 when that fails?
>>
> Keep in mind that unicode objects do _not_ represent a specific encoding
> such as latin1 or utf-8; a unicode object is a sequence of code points. As
> you can see at [1], "ö" has the code point U+00F6 and its utf-8 byte
> sequence is 0xc3 0xb6. The first 256 code points simply coincide with
> latin1 (by design).
> 
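To make the distinction concrete, here is a minimal sketch (not part of the
original discussion) that uses only built-in string methods. The u"" and b""
prefixes are there so it should behave the same on Python 2.7 and Python 3;
the variable names are purely illustrative.

```python
text = u"l\xf6\xe6man"   # unicode object: a sequence of code points

# The \x escapes in a unicode literal name code points, not bytes.
code_points = [ord(ch) for ch in text]
second = code_points[1]              # 0xf6, i.e. U+00F6

# Encoding turns code points into bytes; the result depends on the codec.
as_utf8 = text.encode("utf-8")       # b"l\xc3\xb6\xc3\xa6man" (8 bytes)
as_latin1 = text.encode("latin-1")   # b"l\xf6\xe6man" (6 bytes)
```

So the latin1 bytes only look like the escapes in the literal because latin1
maps each of its byte values to the code point with the same number.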
>> In the other direction it gets it right:
>>
>>>>> print "l\xf6\xe6man".decode('UTF-8')
>> UnicodeDecodeError: 'utf8' codec can't decode byte 0xf6 in position 1: invalid start byte
>>
> You are using a str object here, which means \xf6 is interpreted as a byte
> value, and as you said, the byte 0xF6 is not valid utf-8.
> 
>>>>> print "l\xf6\xe6man".decode('latin1')
>> löæman
>>
> Same as above, but 0xF6 is a valid value in latin-1.
> 
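The two decode attempts above can be put side by side in a small sketch.
try_decode is a hypothetical helper invented for this illustration, not
something pygobject or Python provides; the b"" prefix marks a byte string,
which is what a plain str literal is in Python 2.

```python
raw = b"l\xf6\xe6man"   # bytes, i.e. a Python 2 str


def try_decode(data, encoding):
    """Return the decoded text, or None if data is not valid in encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


from_utf8 = try_decode(raw, "utf-8")      # None: 0xf6 cannot start a UTF-8 sequence
from_latin1 = try_decode(raw, "latin-1")  # u"l\xf6\xe6man"
```

Nothing gets guessed here: the same bytes are either valid or invalid
depending on the codec you name explicitly.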
>>> However, "label.get_text()" will return a str (byte representation) that
>>> looks like 'l\xc3\xb6\xc3\xa6man' in Python 2 but a str (unicode
>>> representation) in Python 3.
>>
>> It seems pygobject automatically converts return values to str (i.e.
>> unicode) in Python 3, but keeps the GTK+ UTF-8 byte array (i.e. str
>> in Python 2, bytes in Python 3) for Python 2. I think that's the
>> root of that bug report [1], i.e. that pygobject for Python 2 takes
>> and gives UTF-8 byte arrays, while you would prefer giving and taking
>> unicode.
>>
> That's correct, there's an inconsistency between Python 2 and Python 3.
> 
>>> This is a pain if you want to retrieve a
>>> string from a widget and concatenate it with a string, such as:
>>>
>>> 	u"F\xfc\xdfe " + l.get_text()
>>>
>>> which will give you the infamous UnicodeDecodeError.
>>
>> Indeed, but (apart from the invalid UTF-8 encoding) this is not
>> specific to GTK or pygobject. You get the very same with
>>
> Again, unicode != utf-8 encoded.
> 
>>>>> u'ä' + 'ä'
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
>> ordinal not in range(128)
>>
>> because you use two different data types. I guess Python does not want
>> to make any assumptions about the encoding of the second byte array.
>>
> Of course, I just wanted to point out that it is very easy with Python 2
> and GTK+ to get into a lot of trouble by mixing str and unicode objects.
> 
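For anyone wondering where the 'ascii' codec in that traceback comes from:
as far as I understand it, Python 2 implicitly decodes the str operand with
the default encoding (ascii) before concatenating. The sketch below spells
that step out by hand so it can be run on Python 2.7 or Python 3; it is an
approximation of the implicit conversion, not the interpreter's actual code
path.

```python
utf8_bytes = u"\xe4".encode("utf-8")   # b"\xc3\xa4", the UTF-8 bytes for U+00E4

try:
    # Roughly what Python 2 does implicitly for: u"\xe4" + utf8_bytes
    u"\xe4" + utf8_bytes.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc   # 'ascii' codec can't decode byte 0xc3 in position 0

# Decoding with the codec the bytes were actually written in works.
combined = u"\xe4" + utf8_bytes.decode("utf-8")   # u"\xe4\xe4"
```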
>>> Whereas in Python 3 things work fine (you provide a unicode
>>> representation and you get a unicode representation), it is a mess in
>>> Python 2. A working solution is
>>>
>>> 	u"F\xfc\xdfe " + l.get_text().decode("utf-8")
>>>
>>> or
>>>
>>> 	u"F\xfc\xdfe ".encode("utf-8") + l.get_text()
>>
>> Right, then + gets values of the same data type.
>>
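If you go down this road in Python 2 code, one option is to do the
conversion in a single place at the boundary to GTK+. The following is only
a sketch under assumptions: to_text and to_utf8 are hypothetical helpers
(pygobject does not ship them), and widget_text stands in for what
get_text() returns on Python 2, so that the snippet runs without Gtk.

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode byte strings, pass text through unchanged."""
    if isinstance(value, bytes):   # bytes is an alias for str on Python 2
        return value.decode(encoding)
    return value


def to_utf8(value):
    """Hypothetical helper: encode text to UTF-8, pass byte strings through."""
    if isinstance(value, bytes):
        return value
    return value.encode("utf-8")


widget_text = b"l\xc3\xb6\xc3\xa6man"              # stand-in for l.get_text()
greeting = u"F\xfc\xdfe " + to_text(widget_text)   # both operands are text now
as_bytes = to_utf8(greeting)                       # back to UTF-8 bytes for GTK+
```

Because both helpers pass through values that already have the right type,
the same calls should also be harmless on Python 3, where get_text() already
returns text.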
>>> I personally would prefer to work with unicode representations all the
>>> time instead of the byte representation, but I don't know how much we
>>> can change this behavior if we want to preserve API/ABI.
>>
>> As I already said in [2] I agree that it would generally make more
>> sense to consistently use unicode objects and transparently encode
>> (arguments) / decode (return values) them for all "utf8" type
>> arguments. It seems we did get that right in PyGObject for Python 3 at
>> last, but as this never happened in Python 2, changing it now would
>> mean breaking pretty much all existing software out there which uses
>> gi.repository.Gtk. For the same reason I think we should revert the
>> commit from [1]: It makes the situation even worse by doing the
>> unicode conversion for just one particular Gtk method call, while
>> everything else still keeps being byte arrays.
>>
>> So I'm afraid we need to live with the situation that we have for
>> Python 2: Ignore unicode and just work with UTF-8 encoded byte arrays
>> all the time.
>>
> So this means we discourage the use of unicode objects in Python 2?
> 
> 
> [1]: http://www.utf8-chartable.de/
> 
> Best regards,
> Sebastian Pölsterl