Re: [Fwd: Re: filename encoding issues completely broken in glib (and likely also gtk2)?]



On Wed, Mar 19, 2008 at 10:11:31AM +0100, Torsten Schoenfeld <kaffeetisch gmx de> wrote:
> I just realized that you probably hadn't seen my reply...

And sorry for the delay :)

> Since the GPerlFilename typemap and the gperl_filename_from_sv and
> gperl_sv_from_filename functions are public API, we can't just remove
> them.  And since it's pretty clearly documented what they do, I don't
> think we can change their behavior either.

If the behaviour is buggy, I think they should be fixed.

However, these functions are not (IMHO) buggy, they are just very badly
named: they should be called something like gperl_filename_from_unicode and
vice versa (incidentally, this is the name I used for the perl-level API).

What is buggy is enforcing their use for glib functions expecting filenames.
And this is clearly against the documentation:

   c-glib "f" documents a filename as input
   perl does f (filename_from_unicode (string))
   perl expects unicode as filename

The problem is that not all filenames are representable as unicode. The
conversion functions are required to exist, but only the user/developer
knows whether his filename is a unicode filename or a native OS filename;
glib _cannot_ decide this for him.

This is partly a historical accident, as glib erroneously required
"unicode" (utf-8 encoded) filenames for some file functions in ancient
times, but this has since been fixed.

> So, what about two new typemaps: one for filenames in GLib encoding, and
> one for filenames in UTF-8[1]?  We could then look at every usage of
> GPerlFilename and fix it if necessary.

It's very problematic. There are at least three different "encodings":

- raw filenames, these are octet strings and will always stay octet strings
  (and this is what the glib api uses, at least in all cases I have seen).
- displayed filenames - some non-reversible mapping from filenames to
  something to display.
- filenames entered by the user - these of course are locale dependent, which
  in glib/gtk2 means "utf-8". There is a reversible mapping between them
  and filenames.

So.. here are the conversions that work:

   text filename (e.g. entered by the user) => OS filename => text filename
   OS filename => display filename

and these conversions don't:

   OS filename => text filename => OS filename

This doesn't work because OS filenames are not reversibly representable as
a text filename (they might not be text at all).
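
To make the asymmetry concrete, here is a minimal sketch. It uses Python
purely as a neutral illustration (it is not the Perl or C API), it assumes
that the mapping between text filenames and OS filenames is UTF-8, as in the
glib/gtk2 case described above, and the helper names text_to_os, os_to_text
and os_to_display are hypothetical.

```python
# Illustration only: hypothetical helpers, assuming UTF-8 is the mapping
# between text filenames and OS filenames.

def text_to_os(text: str) -> bytes:
    """Text filename (e.g. typed by the user) -> OS filename (octets)."""
    return text.encode("utf-8")

def os_to_text(octets: bytes) -> str:
    """OS filename -> text filename; only defined for valid UTF-8 octets."""
    return octets.decode("utf-8")  # raises UnicodeDecodeError otherwise

def os_to_display(octets: bytes) -> str:
    """OS filename -> something to show the user; lossy, not reversible."""
    return octets.decode("utf-8", errors="replace")

# Works: text filename => OS filename => text filename
typed = "caf\u00e9.txt"
assert os_to_text(text_to_os(typed)) == typed

# A perfectly valid POSIX filename that is not UTF-8 (Latin-1 encoded here).
raw = b"caf\xe9.txt"

# Works: OS filename => display filename (the bad octet becomes U+FFFD).
shown = os_to_display(raw)

# Does not work: OS filename => text filename => OS filename
try:
    os_to_text(raw)
    round_trips = True
except UnicodeDecodeError:
    round_trips = False

# Going back from the display form does not recover the original name either.
recovered = text_to_os(shown)
```

In this sketch round_trips ends up false and recovered differs from raw,
which is the situation an image viewer is in when it is handed a directory
entry it can only pass on as "text".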

Unfortunately, the perl interface forces the latter conversion because the
api requires a "text filename", but there is no way to generate that when one
only has an OS filename.

glib and gtk+ don't have this bug, the bug is solely in the perl interface.

I understand that the perl interface wants to be helpful, but since there
is no way (in perl) to work around this bug, all helpfulness is moot.

> [1] Can't we just use the normal gchar* typemap for filenames that are
> supposed to be encoded as UTF-8?

The gchar typemap I have uses SvPV_nolen after forcing the sv to be utf-8
encoded.

On POSIX, this is wrong in 100% of the cases, _as it doesn't even work for
filenames encoded in utf-8_!

The _only_ correct way (on POSIX) to access sv's that store filenames or
other binary data is to use SvPVbyte (neither SvPV nor SvPVutf8 work).

On windows, it is more complicated, as some perls on windows use the posix
model, and some use the native win32 model (where filenames *are* unicode,
and I suspect utf-8 *is* the correct encoding for filenames regarding
glib, although I do not know this).

(in some versions of perl this is also runtime-switchable).

However, getting it right for POSIX means getting it right for windows,
mostly, too, so this should be the first step.

Let me rehearse the perl unicode model again, as this is the source of much
confusion:

    1. Perl strings can store characters with ordinal values > 255.
       This enables you to store Unicode characters as single characters
       in a Perl string - very natural.

    2. Perl does not associate an encoding with your strings.  Unless
       you force it to, e.g. when matching it against a regex, or
       printing the scalar to a file, in which case Perl either
       interprets your string as locale-encoded text, octets/binary,
       or as Unicode, depending on various settings. In no case is an
       encoding stored together with your data, it is use that decides
       encoding, not any magical metadata.

    3. The internal utf-8 flag has no meaning with regards to the
       encoding of your string.  Just ignore that flag unless you
       debug a Perl bug, a module written in XS or want to dive into
       the internals of perl. Otherwise it will only confuse you,
       as, despite the name, it says nothing about how your string is
       encoded. You can have Unicode strings with that flag set, with
       that flag clear, and you can have binary data with that flag
       set and that flag clear. Other possibilities exist, too.

       If you didn't know about that flag, just the better, pretend it doesn't exist.

    4. A "Unicode String" is simply a string where each character can
       be validly interpreted as a Unicode codepoint.  If you have UTF-8
       encoded data, it is no longer a Unicode string, but a Unicode
       string encoded in UTF-8, giving you a binary string.

    5. A string containing "high" (> 255) character values is not a
       UTF-8 string.  It's a fact. Learn to live with it.
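
Points 4 and 5 can be illustrated with a small sketch. It is written in
Python, which (unlike Perl) distinguishes text and octet strings by type, so
it only shows the difference between the two values, not how Perl stores
them. The variable names are made up for the illustration.

```python
# Illustration only: Python separates str and bytes by type, which Perl
# does not; the point here is just the difference between the two values.

unicode_string = "\u00e9\u20ac"               # two characters: e-acute, euro sign
utf8_octets = unicode_string.encode("utf-8")  # the same text, encoded

# The Unicode string has one element per character...
n_chars = len(unicode_string)
# ...while its UTF-8 encoding is a binary string with one element per octet.
n_octets = len(utf8_octets)

# The Unicode string contains a "high" character (U+20AC > 255), whereas
# every element of the encoded form is an octet in the range 0..255.
highest_char = max(ord(c) for c in unicode_string)
highest_octet = max(utf8_octets)
```

Here the two-character Unicode string becomes a five-octet binary string
once encoded, and the two are different values that have to be handled
differently.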

One of the important things to remember is that strings that contain UTF-8
are not normally encoded in UTF-8 in perl (they can be, however).

So to access a perl value that you expect to contain UTF-8, you have to
use SvPVbyte, not SvPVutf8.
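
Why SvPVutf8 is the wrong accessor here can be modelled outside of Perl. The
sketch below is Python, not XS: a Perl string is modelled as a sequence of
character ordinals, and the two hypothetical functions model_svpvbyte and
model_svpvutf8 only approximate what the real macros return (the real
SvPVbyte croaks on characters above 255, which the model turns into an
exception).

```python
# Illustration only: a rough model of the two accessors, not the XS API.

def model_svpvbyte(perl_string: str) -> bytes:
    """Model of SvPVbyte: each character becomes exactly one octet."""
    return perl_string.encode("latin-1")  # fails for characters > 255

def model_svpvutf8(perl_string: str) -> bytes:
    """Model of SvPVutf8: the characters are encoded as UTF-8."""
    return perl_string.encode("utf-8")

# A filename as stored on disk, UTF-8 encoded ("e-acute" followed by ".txt").
on_disk = b"\xc3\xa9.txt"

# Obtained from the OS, the Perl string holds one character per octet.
perl_string = on_disk.decode("latin-1")

via_bytes = model_svpvbyte(perl_string)  # the original octets
via_utf8 = model_svpvutf8(perl_string)   # each octet encoded again
```

In this model via_bytes equals the name on disk, while via_utf8 is a
double-encoded name that refers to a different (probably nonexistent) file,
even though the original filename was valid UTF-8.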

Glib documentation was a bit fuzzy about filename encodings, but nowadays,
it gets it right as well, and all functions that access files do expect
binary data, and not utf-8 encoded filenames (again, the *binary*
filenames you pass in are often utf-8 encoded, but that is an issue for
the caller, not glib, as not all valid filenames are utf-8 encoded of course).

I can explain this in more detail, but the current situation is that
gtk-perl and glib-perl rule out the use of OS filenames completely. That
means applications like image viewers _cannot_ treat all files correctly,
as gtk-perl only allows the subset of filenames that are encoded to its
liking.

This is not a problem with the C libraries, those do work correctly
(nowadays).

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      pcg goof com
      -=====/_/_//_/\_,_/ /_/\_\


