Re: Some strings corrupted when inserting into liststore model
- From: Grant McLean <grant mclean net nz>
- To: gtk-perl-list gnome org
- Subject: Re: Some strings corrupted when inserting into liststore model
- Date: Wed, 20 Oct 2021 11:00:43 +1300
On Tue, 2021-10-19 at 11:34 +1100, Daniel Kasak via gtk-perl-list
wrote:
> Right. I found a hack on https://perldoc.perl.org/perlunicode ( which
> you directed me to ) that appears to have fixed *this* particular
> issue ( though it's not clear what I've then broken as a result )
> Calling:
> Encode::_utf8_on($_)
> ... for every value just prior to being pushed into the model
> appears to work. Yay :)
The operative word here is "appears". This hack will work for most
characters but not all.
The general advice for working with encodings from Perl is that you
should:
* decode bytes on input to give you strings in Perl's internal
representation, which supports the full range of Unicode characters; and
* encode strings to bytes in a particular encoding on output.
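As a minimal sketch of that decode-on-input, encode-on-output pattern, assuming the incoming bytes are UTF-8. The variable names are only illustrative, and the core Encode module is used directly here rather than an I/O layer:

```perl
use strict;
use warnings;
use Encode ();

# Five bytes as they might arrive from outside the program:
# "caf" followed by the two-byte UTF-8 sequence for e-acute.
my $input_bytes = "caf\xC3\xA9";

# Decode on input. FB_CROAK makes invalid UTF-8 die instead of being
# silently accepted; LEAVE_SRC stops decode() from modifying the source.
my $string = Encode::decode('UTF-8', $input_bytes,
    Encode::FB_CROAK | Encode::LEAVE_SRC);

# Inside the program we now work with characters, not bytes.
my $char_count = length($string);         # 4 characters
my $byte_count = length($input_bytes);    # 5 bytes

# Encode on output to get bytes back in a specific encoding.
my $output_bytes = Encode::encode('UTF-8', $string);
```

After the decode step, functions such as length() and substr() count characters rather than bytes, which is usually what you want.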
These days the most common encoding you will encounter is UTF-8.
To do the relevant decoding of a UTF-8 file you might open it like
this:
open(my $fh, '<:utf8', $filename) or die "Cannot open $filename: $!";
Or, if the string was not read from a file but was simply defined in
your script, you would tell Perl to decode the bytes of your script
from UTF-8 by including this pragma:
use utf8;
For output to a file you might use:
open(my $fh, '>:utf8', $filename) or die "Cannot open $filename: $!";
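To see the file layers doing the work, here is a rough round-trip sketch. It writes a string to a temporary file, reads it back as characters and then again as raw bytes. It uses the `:encoding(UTF-8)` layer, a stricter relative of `:utf8` that also checks the data is valid; the temporary file and variable names are assumptions made purely for the example:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch file that is removed when the program exits.
my (undef, $filename) = tempfile(UNLINK => 1);

# "cafe" with e-acute, a space, a euro sign, "5" and a newline: 8 characters.
my $text = "caf\x{E9} \x{20AC}5\n";

# Encode on output: the layer turns characters into UTF-8 bytes.
open(my $out, '>:encoding(UTF-8)', $filename)
    or die "Cannot write $filename: $!";
print {$out} $text;
close($out) or die "Cannot close $filename: $!";

# Decode on input: the layer turns UTF-8 bytes back into characters.
open(my $in, '<:encoding(UTF-8)', $filename)
    or die "Cannot read $filename: $!";
my $round_trip = <$in>;
close($in);

# Read the same file with no decoding to see what is actually on disk.
open(my $raw, '<:raw', $filename)
    or die "Cannot read $filename: $!";
my $bytes = <$raw>;
close($raw);

printf "%d characters, %d bytes on disk\n",
    length($round_trip), length($bytes);
```

The character count and the byte count differ because the two non-ASCII characters take two and three bytes respectively in UTF-8.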
Your experience seems to suggest that the Perl Gtk bindings will do the
right thing when presented with a string that has the internal "utf8"
flag set. However, if your string contains non-ASCII characters and does
not already have that flag set, then it seems the decoding step has been
missed.
Data that came from a DB connection rather than a file might need to be
decoded with something like:
$perl_string = Encode::decode_utf8($db_string);
However, most of the DBD drivers allow you to set a flag so that this
happens automatically.
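If the driver flag is not an option, one possible approach is to decode each row as it is fetched. The sketch below assumes the driver hands back undecoded UTF-8 bytes; decode_row is a hypothetical helper, not part of DBI, and the sample row stands in for what a fetch might return:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode every defined column in one fetched row.
# Assumes the columns are undecoded UTF-8 bytes; NULLs (undef) pass through.
sub decode_row {
    my ($row) = @_;
    my @decoded;
    for my $value (@$row) {
        push @decoded, defined($value)
            ? Encode::decode('UTF-8', $value,
                  Encode::FB_CROAK | Encode::LEAVE_SRC)
            : undef;
    }
    return \@decoded;
}

# Stand-in for a fetched row: an id, a name as UTF-8 bytes, and a NULL.
my $raw_row = [ 42, "Z\xC3\xBCrich", undef ];
my $row     = decode_row($raw_row);

printf "name has %d characters\n", length($row->[1]);
```

Only do this if the driver is not already decoding for you, because decoding the same data twice corrupts it. The driver flags have driver-specific names (DBD::SQLite documents sqlite_unicode and DBD::mysql documents mysql_enable_utf8mb4, for example), so check the documentation for the driver you use.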
The reason messing with the utf8 flag on the Perl string appears to
work is that Perl's internal encoding is almost-but-not-quite UTF-8.
For historical reasons (and arguably as a memory optimisation)
sometimes Perl will encode some characters in the range 0x80-0xFF as a
single byte ("Latin-1" encoding) rather than the two bytes that UTF-8
would require.
For example, chr(0x20AC) would return a Perl string which is
represented in memory using UTF-8 bytes, whereas chr(0xE9) would return
a Perl string which is represented in memory using a single Latin-1
byte. Simply setting the utf8 flag on the first string would do no
harm (since it's already set), but it would make a mess of the second
string because it's only one byte long and that byte is not a valid
UTF-8 sequence.
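The difference between those two strings can be checked with a short sketch like the one below. It uses utf8::is_utf8 and utf8::valid to inspect the internal state. utf8::upgrade is included only as a contrast, as an assumption about what a safe change of internal representation looks like for a string that already holds characters; it is not a substitute for decoding bytes properly on input:

```perl
use strict;
use warnings;
use Encode ();

my $euro    = chr(0x20AC);   # needs the UTF-8 internal representation
my $e_acute = chr(0xE9);     # fits in a single Latin-1 byte

my $euro_flagged    = utf8::is_utf8($euro)    ? 1 : 0;   # 1
my $e_acute_flagged = utf8::is_utf8($e_acute) ? 1 : 0;   # 0

# The hack: flip the flag without touching the bytes. The lone byte 0xE9
# is not a complete UTF-8 sequence, so the string is now malformed.
my $hacked = $e_acute;
Encode::_utf8_on($hacked);
my $hacked_is_valid = utf8::valid($hacked) ? 1 : 0;      # 0

# By contrast, utf8::upgrade rewrites the bytes and sets the flag together,
# so the string still holds the same single character afterwards.
my $upgraded = $e_acute;
utf8::upgrade($upgraded);
my $upgraded_is_valid = utf8::valid($upgraded) ? 1 : 0;  # 1

printf "euro flagged: %d, e-acute flagged: %d\n",
    $euro_flagged, $e_acute_flagged;
printf "hacked valid: %d, upgraded valid: %d\n",
    $hacked_is_valid, $upgraded_is_valid;
```

The hack only appears to work when the bytes in the string happen to be a valid UTF-8 sequence already, which is why it fails for strings like the second one.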
If you really want to understand this stuff here's a link to a
conference talk I did on the subject:
https://www.youtube.com/watch?v=cgswnneFp-s
Regards
Grant