Re: Some strings corrupted when inserting into liststore model



On Tue, 2021-10-19 at 11:34 +1100, Daniel Kasak via gtk-perl-list
wrote:
> Right. I found a hack on https://perldoc.perl.org/perlunicode ( which
> you directed me to ) that appears to have fixed *this* particular
> issue ( though it's not clear what I've then broken as a result )
> Calling:
>
> Encode::_utf8_on($_)
>
>  ... for every value just prior to being pushed into the model
> appears to work. Yay :)

The operative word here is "appears".  This hack will work for most
characters but not all.

The general advice for working with encodings from Perl is that you
should:

 * decode bytes on input to give you strings in Perl's internal
   representation which supports multi-byte characters; and

 * encode strings to bytes in a particular encoding on output
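
As a minimal sketch of those two steps, using the core Encode module
on an in-memory value (the variable names are only illustrative):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from outside: "café" encoded as UTF-8.
my $input_bytes = "caf\xC3\xA9";

# Decode on input: five bytes become four characters.
my $string = decode('UTF-8', $input_bytes);

# Work with characters inside the program.
my $upper = uc $string;

# Encode on output: characters become UTF-8 bytes again.
my $output_bytes = encode('UTF-8', $upper);

printf "%d bytes in, %d characters, %d bytes out\n",
    length $input_bytes, length $string, length $output_bytes;
```

Calling uc() on the undecoded bytes would leave the accented letter
untouched, because Perl would see two unrelated bytes rather than one
character.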

These days the most common encoding you will encounter is UTF-8.
To do the relevant decoding of a UTF-8 file you might open it like
this:

    open(my $fh, '<:encoding(UTF-8)', $filename) or die "Cannot open $filename: $!";

Or, if the string was not read from a file but was simply defined in
your script, you would tell Perl to decode the bytes of your script
from UTF-8 by including this pragma:

    use utf8;
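
A small sketch of what the pragma changes, assuming the script itself
is saved as UTF-8. The pragma is lexically scoped, so the two blocks
below see the same literal differently:

```perl
use strict;
use warnings;

# With the pragma, the literal is decoded into one character.
my $with_pragma = do { use utf8; length("é") };

# Without it, the same literal is the two raw bytes from the file.
my $without_pragma = do { no utf8; length("é") };

print "with 'use utf8': $with_pragma, without: $without_pragma\n";
```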

For output to a file you might use:

    open(my $fh, '>:encoding(UTF-8)', $filename) or die "Cannot open $filename: $!";
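
Putting the file input and output together, here is a round-trip
sketch that writes to a temporary file (created with File::Temp purely
for illustration) and reads it back. It uses the ':encoding(UTF-8)'
layer, which validates the data as well as converting it:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the demonstration only.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

# Eight characters, two of them non-ASCII (i-diaeresis and euro sign).
my $text = "na\x{EF}ve \x{20AC}5";

# Encode on output.
open(my $out, '>:encoding(UTF-8)', $filename)
    or die "Cannot write $filename: $!";
print {$out} $text;
close $out or die "Cannot close $filename: $!";

# On disk the two non-ASCII characters take 2 and 3 bytes.
my $size_on_disk = -s $filename;

# Decode on input.
open(my $in, '<:encoding(UTF-8)', $filename)
    or die "Cannot read $filename: $!";
my $read_back = do { local $/; <$in> };
close $in;

printf "%d characters, %d bytes on disk, round trip %s\n",
    length $text, $size_on_disk, ($read_back eq $text ? 'ok' : 'FAILED');
```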

Your experience seems to suggest that the Perl Gtk bindings will do the
right thing when presented with a string that has the internal "utf8"
flag set.  But if your string has non-ASCII characters and does not
already have that flag set, it seems the decoding step has been
missed.

Data that came from a DB connection rather than a file might need to be
decoded with something like:

    $perl_string = Encode::decode_utf8($db_string);

However, most of the DBD drivers allow you to set a flag so that this
happens automatically.
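
As a hedged sketch, these are the connect attributes I would expect to
use for three common drivers. The attribute names are as I recall them
from each driver's documentation, so check them against the version
you have installed. The connect_attrs() helper and the DSN in the
comment are hypothetical:

```perl
use strict;
use warnings;

# Per-driver attributes that ask the driver to hand back decoded strings.
# Assumption: names as documented by DBD::mysql, DBD::Pg and DBD::SQLite.
my %utf8_attr_for = (
    mysql  => { mysql_enable_utf8mb4 => 1 },
    Pg     => { pg_enable_utf8       => 1 },
    SQLite => { sqlite_unicode       => 1 },
);

# Hypothetical helper: build the attribute hash for DBI->connect.
sub connect_attrs {
    my ($driver) = @_;
    my $extra = $utf8_attr_for{$driver} || {};
    return { RaiseError => 1, %$extra };
}

# Usage (needs DBI and the driver installed; the DSN is a placeholder):
# my $dbh = DBI->connect('dbi:SQLite:dbname=example.db', '', '',
#                        connect_attrs('SQLite'));
```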

The reason messing with the utf8 flag on the Perl string appears to
work is that Perl's internal encoding is almost-but-not-quite UTF-8.
For historical reasons (and arguably as a memory optimisation)
sometimes Perl will encode some characters in the range 0x80-0xFF as a
single byte ("Latin-1" encoding) rather than the two bytes that UTF-8
would require.

For example chr(0x20AC) would return a Perl string which was
represented in memory using UTF-8 bytes. Whereas chr(0xE9) would return
a Perl string which was represented in memory using a single Latin-1
byte.  Simply setting the utf8 flag on the first string would do no
harm (since it's already set) but it would make a mess of the second
string because it's only one byte long and not a valid UTF-8 sequence.
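
The difference can be checked with a short sketch using the built-in
utf8::is_utf8() and utf8::valid() functions. It also shows why the
hack appears to work on undecoded UTF-8 bytes, and that
utf8::upgrade() is the safe way to change the internal representation:

```perl
use strict;
use warnings;
use Encode ();

my $euro    = chr(0x20AC);   # stored internally as UTF-8 bytes
my $e_acute = chr(0xE9);     # stored internally as one Latin-1 byte

my $euro_flagged    = utf8::is_utf8($euro)    ? 1 : 0;
my $e_acute_flagged = utf8::is_utf8($e_acute) ? 1 : 0;

# The hack "works" when the scalar holds undecoded UTF-8 bytes:
my $utf8_bytes = "\xC3\xA9";
Encode::_utf8_on($utf8_bytes);
my $hack_ok = ($utf8_bytes eq $e_acute) ? 1 : 0;

# ...but it corrupts a scalar holding a single Latin-1 byte:
my $latin1 = "\xE9";
Encode::_utf8_on($latin1);
my $hack_broken = utf8::valid($latin1) ? 0 : 1;

# utf8::upgrade() converts the storage without changing the string:
my $upgraded = chr(0xE9);
utf8::upgrade($upgraded);
my $upgrade_ok = (utf8::is_utf8($upgraded)
    && utf8::valid($upgraded)
    && $upgraded eq $e_acute) ? 1 : 0;

print "euro flagged: $euro_flagged, e-acute flagged: $e_acute_flagged\n";
print "hack ok on UTF-8 bytes: $hack_ok, broken on Latin-1: $hack_broken\n";
print "upgrade ok: $upgrade_ok\n";
```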

If you really want to understand this stuff here's a link to a
conference talk I did on the subject:

    https://www.youtube.com/watch?v=cgswnneFp-s

Regards
Grant


