Re: [Nautilus-list] A fix for non-ASCII characters

From: Håvard Wigtil <havardw stud ntnu no>
To: Darin Adler <darin bentspoon com>
Cc: Nautilus <nautilus-list lists eazel com>
Subject: Re: [Nautilus-list] A fix for non-ASCII characters
Date: 18 Jan 2002 23:13:28 +0100

On Fri, 2002-01-18 at 22:23, Darin Adler wrote:
> On 1/18/02 12:25 PM, "Håvard Wigtil" <havardw stud ntnu no> wrote:
> > I built Gnome 2
> > from CVS, and discovered that Nautilus can't even spell my name!
> > ('Håvard' becomes 'H?vard').  Investigation shows that the file name
> > conversion code (make_valid_utf8 in libnautilus-private/nautilus-file.c)
> > doesn't try to convert non-ASCII characters, it just replaces them with
> > question marks.
> 
> That's incorrect.
> 
> What's happening here is that the file name is encoded in ISO-8859-1, and
> glib 2.x treats all file names as if they are UTF-8. It's not all non-ASCII
> characters that are replaced with question marks; illegal UTF-8 sequences
> are replaced with question marks, but you can use all sorts of non-ASCII
> characters.

Well, I had a feeling that I didn't fully understand your code ;) I
think I understand it now.

> If you have a file that has the name 'Håvard' encoded in UTF-8, then it will
> show up fine in Nautilus. But the other tools on your system, like ls and
> the terminal, are using ISO-8859-1, which makes things rather confusing.
> 
> We need to talk to the glib maintainers and other experts about how to deal
> with this.

I'll let you talk for a while and see how it turns out.
 
> > I've filed a bug in bugzilla, and attached my first attempt at a patch:
> > http://bugzilla.gnome.org/show_bug.cgi?id=69059
> 
> I've added comments to that bug report.
> 
> > This does not solve the special case where the UTF-8 escape marker ('Â',
> > capital A with a circumflex),
> 
> That's not "the UTF-8 escape marker". It's true that adding that character
> is a simple-minded way to convert an ISO-8859-1 character to a UTF-8
> sequence, but there are lots of other ways UTF-8 encodes various characters.
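
To illustrate the point above, here is a minimal sketch in Python (not
Nautilus or glib code) of how the UTF-8 encoding of a character looks when
its bytes are misread as ISO-8859-1. The helper name as_seen_in_latin1 is
made up for this example. It suggests that 'Â' only shows up for characters
in the range U+00A0 to U+00BF; other characters get a different lead byte.

```python
def as_seen_in_latin1(ch: str) -> str:
    """Encode ch as UTF-8, then misread those bytes as ISO-8859-1.

    Hypothetical helper for illustration only.
    """
    return ch.encode("utf-8").decode("iso-8859-1")


# U+00A9 (copyright sign) is encoded as C2 A9: 'Â' followed by the sign itself.
print(as_seen_in_latin1("©"))

# U+00E5 (a with ring) is encoded as C3 A5: the lead byte reads as 'Ã', not 'Â'.
print(as_seen_in_latin1("å"))

# U+20AC (euro sign) is outside ISO-8859-1 and takes three bytes in UTF-8.
print(len("€".encode("utf-8")))
```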

OK, I was over-simplifying. And as you can probably tell, English is not
my native language ;)

If my understanding of UTF-8 is correct, the highest bit in a byte is
used to indicate a multi-byte sequence. In that case, all accented
characters defined by any ISO 8859 encoding will not be displayed as
intended: either they will be invalid and replaced by question marks, or
two (or more) characters will combine to show a single character if they
happen to form a valid UTF-8 sequence. Is this a correct interpretation?
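
A small sketch of the two cases described above, written in Python rather
than against the actual make_valid_utf8 code. The function display_name is
a made-up stand-in that decodes bytes as UTF-8 and shows a question mark
for each invalid sequence; the real Nautilus behaviour may differ in
detail.

```python
def display_name(raw: bytes) -> str:
    """Decode raw as UTF-8, showing '?' for each invalid sequence.

    Hypothetical stand-in for illustration, not the Nautilus implementation.
    """
    return raw.decode("utf-8", errors="replace").replace("\ufffd", "?")


# Case 1: an ISO-8859-1 name. The byte E5 has its high bit set but is not
# followed by a continuation byte, so it is not valid UTF-8.
latin1_name = "Håvard".encode("iso-8859-1")   # b'H\xe5vard'
print(display_name(latin1_name))

# Case 2: two ISO-8859-1 characters whose bytes happen to form one valid
# UTF-8 sequence. 'Ã' is C3 and '¥' is A5; together they decode as 'å'.
pair = "Ã¥".encode("iso-8859-1")              # b'\xc3\xa5'
print(display_name(pair))

# The same name stored as UTF-8 decodes unchanged.
utf8_name = "Håvard".encode("utf-8")          # b'H\xc3\xa5vard'
print(display_name(utf8_name))
```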


-- 
    Håvard

mailto:havardw stud ntnu no||http://www.stud.ntnu.no/~havardw||73525576
All it takes to start an avalanche is a single snowflake||Or a snowboarder
        Oh! Un Fraggle! Regarde, maman! J'ai attrapé un Fraggle!
