Dealing with invalid UTF-8 [was: Re: Faster UTF-8 decoding in GLib]

From: Daniel Elstner <daniel kitta googlemail com>
To: Behdad Esfahbod <behdad behdad org>
Cc: gtk-devel-list gnome org
Subject: Dealing with invalid UTF-8 [was: Re: Faster UTF-8 decoding in GLib]
Date: Tue, 30 Mar 2010 00:54:16 +0200

Hi Behdad,

> Well, there's a bit more to it.  Just because some bytes in a file are invalid
> according to the spec doesn't mean your text editor should refuse to open the
> file.  While g_utf8_get_char() and friends do assume valid UTF-8 data, it's an
> unwritten assumption that for invalid bytes they simply skip the byte and
> return -1.  And I want to keep it that way and perhaps even document it.  I
> think I use that in Pango IIRC.

I'd like to bring this up for discussion as a separate matter, because I
think it's a dangerously wrong way of handling things.

First and foremost, if your text editor uses g_utf8_get_char() on data
read from an external file without any validation, then that's a glaring
and serious bug.  Even if you rely on the incomplete checks that are
currently in place, they are nowhere near robust enough to deal with
untrusted input.

There are dedicated functions provided for reading data which may not be
valid UTF-8, such as g_utf8_validate() and g_utf8_get_char_validated(),
and only those should be used.  There is no need to reject the entire
file.  Also, I believe that GIOChannel conveniently does the UTF-8
validation for you on the fly.

Second, it is plain *impossible* for g_utf8_get_char() to handle invalid
UTF-8 sequences in a correct manner, because it does not know where the
buffer ends.  If the first byte is bogus already and does not actually
belong to a UTF-8 sequence, you will have read farther than you were
supposed to by the time you discover that it isn't followed by a proper
continuation byte.  If it was the last byte in the buffer, you have
already read past the end at that point.
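
To make the point concrete, here is a minimal sketch of what a decoder
needs in order to be safe on untrusted bytes: an explicit end pointer.
decode_bounded() and the two DECODE_* constants are hypothetical names
made up for this illustration and are not GLib API; the -1/-2 return
convention only loosely mirrors what g_utf8_get_char_validated() does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DECODE_INVALID    ((uint32_t) -1)  /* malformed sequence */
#define DECODE_INCOMPLETE ((uint32_t) -2)  /* sequence cut off by end of buffer */

/* Hypothetical helper (not part of GLib): decode one code point from the
 * half-open range [p, end) without ever reading at or beyond end. */
uint32_t
decode_bounded (const unsigned char *p, const unsigned char *end)
{
  /* Smallest code point that legitimately needs a sequence of each length. */
  static const uint32_t min_for_len[] = { 0, 0, 0x80, 0x800, 0x10000 };
  size_t avail, len, i;
  uint32_t ch;

  assert (p != NULL && p < end);
  avail = (size_t) (end - p);

  if (p[0] < 0x80)
    return p[0];                          /* plain ASCII */
  else if ((p[0] & 0xE0) == 0xC0) { len = 2; ch = p[0] & 0x1F; }
  else if ((p[0] & 0xF0) == 0xE0) { len = 3; ch = p[0] & 0x0F; }
  else if ((p[0] & 0xF8) == 0xF0) { len = 4; ch = p[0] & 0x07; }
  else
    return DECODE_INVALID;                /* stray continuation or 0xF8..0xFF */

  for (i = 1; i < len; i++)
    {
      if (i >= avail)
        return DECODE_INCOMPLETE;         /* would have to read past end */
      if ((p[i] & 0xC0) != 0x80)
        return DECODE_INVALID;            /* not a continuation byte */
      ch = (ch << 6) | (p[i] & 0x3F);
    }

  /* Reject overlong forms, surrogates and values beyond U+10FFFF. */
  if (ch < min_for_len[len] || ch > 0x10FFFF || (ch >= 0xD800 && ch <= 0xDFFF))
    return DECODE_INVALID;

  return ch;
}
```

The bounds check inside the loop is exactly the one a function taking
only a start pointer has no way of performing.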

Third, g_utf8_get_char() does not actually skip anything.  The skipping
is done in the calling code, usually by means of g_utf8_next_char() to
advance to the next code point after each iteration.  The implementation
of g_utf8_get_char() has no influence whatsoever on that iteration and
how much is being skipped.
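
As a rough illustration of that last point, the sketch below walks a
buffer the way a typical caller would, with the step size derived from
the lead byte alone.  skip_for_lead() and naive_walk_end() are
hypothetical helpers written for this mail, and the table is a
simplified stand-in for the lookup behind g_utf8_next_char(), not a copy
of it.  Whatever the decoder returns never enters the picture.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a lead-byte skip table: the step
 * size depends on the first byte only, never on the bytes that follow. */
size_t
skip_for_lead (unsigned char lead)
{
  if (lead < 0xC0) return 1;  /* ASCII or stray continuation byte */
  if (lead < 0xE0) return 2;  /* claims a 2-byte sequence */
  if (lead < 0xF0) return 3;  /* claims a 3-byte sequence */
  if (lead < 0xF8) return 4;  /* claims a 4-byte sequence */
  return 1;                   /* simplification: 0xF8..0xFF step by one */
}

/* Return the offset at which a lead-byte-driven loop stops.  A result
 * greater than len means the iteration stepped over the end of the buffer.
 * Only offsets below len are ever dereferenced here. */
size_t
naive_walk_end (const unsigned char *buf, size_t len)
{
  size_t off = 0;

  assert (buf != NULL || len == 0);

  while (off < len)
    off += skip_for_lead (buf[off]);  /* the decoder's verdict plays no part */

  return off;
}
```

For the two bytes 'a', 0xE2 this walk ends at offset 4, two bytes past
the end of the buffer.  A loop that tests for a terminating NUL instead
of a length would jump right over the terminator in the same way.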

That being said, it would be a trivial matter to add the same checks to
the glibmm implementation.  However, I'd rather not do so because all it
provides you with is a false sense of security.

--Daniel
