Re: gnome-vfs and character sets



On Mon, 2004-03-15 at 21:56, Christian Biesinger wrote:
> Alexander Larsson wrote:
> >>o) If I give gnome-vfs a file: URI, how should non-ascii characters be 
> >>encoded? Should I give it escaped UTF-8, or unescaped UTF-8, or should I 
> >>use the filesystem's charset?
> >>(the function I'm most interested in in this context is 
> >>gnome_vfs_get_file_info)
> > 
> > You should use the filesystem's charset (which is undefined; filenames
> > are bytestrings). 
> 
> Cool. Thank you, this is actually what works better for me.
> I can (should?) escape non-ascii characters though, right? (in URIs)

Yes. It's the only way it can work, since the URI itself has to be ASCII.
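
As a minimal sketch of what escaping means here (written in Python rather
than against the gnome-vfs C API; the helper names and the example path
are made up for illustration): the raw filename bytes are percent-escaped
byte for byte, whatever charset they happen to be in, and unescaping gives
back exactly the same bytes.

```python
from urllib.parse import quote, unquote_to_bytes


def path_bytes_to_file_uri(path: bytes) -> str:
    """Hypothetical helper: percent-escape a raw filename bytestring."""
    # Bytes outside the unreserved ASCII set become %XX; "/" is kept as-is.
    return "file://" + quote(path, safe="/")


def file_uri_to_path_bytes(uri: str) -> bytes:
    """Hypothetical helper: recover the raw bytes from a file: URI."""
    prefix = "file://"
    if not uri.startswith(prefix):
        raise ValueError("not a file: URI")
    return unquote_to_bytes(uri[len(prefix):])


# The same visible name stored in two different on-disk encodings
# produces two different URIs, and each one round-trips to its own bytes.
latin1_name = "/home/user/café.txt".encode("latin-1")
utf8_name = "/home/user/café.txt".encode("utf-8")
print(path_bytes_to_file_uri(latin1_name))  # file:///home/user/caf%E9.txt
print(path_bytes_to_file_uri(utf8_name))    # file:///home/user/caf%C3%A9.txt
```

No charset is needed at any point, which is why this works even though
the filesystem's charset is undefined.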

> >>o) What character set is used for the filename member of the 
> >>GnomeVFSFileInfo struct?
> > 
> > None. It's a byte string (without \0 or / in it) corresponding to the
> > filename bytestring on the disk.
> 
> What if this fileinfo does not correspond to a local file (for example, 
> a smb: file)?

That's worse. It depends on the actual URI used. For smb:, starting with
gnome-vfs 2.5.x (whichever release added smb: support inside gnome-vfs),
URIs are all in (escaped) UTF-8, and the smb protocol converts to
whatever is used on the destination system. This requires the server to
be set up correctly, of course. For other types of URIs things are less
nice. For instance, ftp or sftp URIs are just bytestrings, because the
protocol doesn't let you know the encoding of the other side.

I guess the general rules for filenames are:

Whenever we can detect the charset used for the URI type, we try to
convert it to/from UTF-8 automatically inside gnome-vfs.

When we don't/can't know the encoding, we make no guarantees, and we use
whatever byte encoding was used on the target system (escaped properly
to be ASCII according to the RFCs) so that we're certain we can
reference the filename. Displaying filenames like this is trickier, so
in all code you have to keep the display name (which must be UTF-8 for
display) separate from the real name, since the conversion from display
name to a target filename might not be one-to-one.

In the case of an unknown encoding we try to use UTF-8 as much as
possible, on the grounds that that's where we want to go in the future,
but as a fallback we can also try Latin-1 or the local encoding.
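
Roughly, the display-name side could look like the sketch below. It is
Python, not gnome-vfs code; the function name, the local_encoding
parameter and the order of the two fallbacks (local encoding first, then
Latin-1) are my assumptions for illustration. The result is only ever
used for display; the raw bytes stay the real name.

```python
import locale


def display_name(raw: bytes, local_encoding=None) -> str:
    """Hypothetical helper: best-effort text for showing a raw filename."""
    if local_encoding is None:
        local_encoding = locale.getpreferredencoding(False)
    # Try UTF-8 first, then the local encoding.
    for encoding in ("utf-8", local_encoding):
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
    # Latin-1 maps every byte to a character, so this last step never fails.
    return raw.decode("latin-1")


# Two different real names (different bytes on the target system) ...
real_names = [b"caf\xc3\xa9.txt", b"caf\xe9.txt"]
# ... that end up with the same display name, so the mapping back from
# display name to real name is not one-to-one.
shown = [display_name(name, local_encoding="utf-8") for name in real_names]
print(shown[0] == shown[1])  # True
```

This is why the real name has to be carried along next to the display
name instead of being reconstructed from it.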

Additionally, GLib/Gtk/Gnome use the following approach for the encoding
of local filenames:
If G_BROKEN_FILENAMES is not set in the environment, all filenames are
assumed to be in UTF-8, and new filenames are written as UTF-8. (If they
are not UTF-8 we still do our best to display them.)
If G_BROKEN_FILENAMES is set, all filenames are assumed to be in the
charset of the locale.
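
The rule above, written out as a small Python sketch. This is not GLib's
implementation; the function name and both parameters are hypothetical
and only exist so the rule can be tried without touching the real
environment.

```python
import locale
import os


def local_filename_charset(environ=None, locale_charset=None) -> str:
    """Hypothetical helper: charset assumed for local filenames."""
    env = os.environ if environ is None else environ
    if "G_BROKEN_FILENAMES" not in env:
        # Variable not set: filenames are assumed to be UTF-8.
        return "utf-8"
    # Variable set: fall back to the charset of the current locale.
    return locale_charset or locale.getpreferredencoding(False)


print(local_filename_charset(environ={}))  # utf-8
print(local_filename_charset(environ={"G_BROKEN_FILENAMES": "1"},
                             locale_charset="iso-8859-1"))  # iso-8859-1
```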

> > At the moment everything uses UTF-8 here, although I dunno if anything
> > does anything with this behind the scenes. I think we just pass on the
> > strings. However, usernames/passwords rarely use characters > 127, I guess,
> > and for the case where they do, let's bloody hope they use UTF-8. :)
> 
> Hmm, at least HTTP requires special dealing with non-ascii 
> authentication information...

Well. We just pass on whatever string the user typed in, and since it's
from the UI it's typically UTF-8. It's escaped, but I don't think we do
anything more with it. So, if the user's password has a Latin-1 non-ASCII
char in it, then he's pretty hosed... I'm not sure what you can do about
this though; he might as well have the same character, but encoded as
UTF-8, in his password.
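
To make the problem concrete, here is a tiny Python illustration with a
made-up password: the same typed characters turn into different bytes
depending on the encoding, and a server that stored one form will not
match the other.

```python
password = "pässword"  # made-up example with one non-ASCII character

as_latin1 = password.encode("latin-1")
as_utf8 = password.encode("utf-8")

print(as_latin1)             # b'p\xe4ssword'
print(as_utf8)               # b'p\xc3\xa4ssword'
print(as_latin1 == as_utf8)  # False
```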

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                            Red Hat, Inc 
                   alexl redhat com    alla lysator liu se 
He's an oversexed flyboy gentleman spy on the edge. She's a tortured 
cat-loving Valkyrie with the power to bend men's minds. They fight crime! 



