Re: Faster UTF-8 decoding in GLib



Hi,

On Wednesday, 17 March 2010 at 00:17 +0200, Mikhail Zabaluev wrote:

> Yes, though we are already in the buffer overflow territory with all
> implementations of g_utf8_get_char considered so far.

It only reads past the end, so there are no security implications beyond
a potential for DoS in the unlikely event that the memory a few bytes
ahead is not accessible.  And the current implementation has that
"problem", too.

> >> My understanding is that unvalidated decoding should also accept
> >> various software's misconstructions of UTF-8 and produce some
> >> meaningful output.
> >
> > Meaningful in what sense?  And what kind of misconstructions would that
> > be, for example?
> 
> Wikipedia describes a couple:
> http://en.wikipedia.org/wiki/UTF-8#UTF-8_derivations
> 
> I think it's useful to have functions loose enough to interoperate
> with these too, as long as one uses the validating routines for any
> untrusted input.

The NUL character encoded as 0xC0 0x80 is simply an overlong sequence
and should therefore be parsed as U+0000 by my original implementation,
although I wouldn't call that a feature.  And interpreting CESU-8 is
something the mainline implementation does not do either, nor do I
think it should.

Sorry, I think it's completely arbitrary to say "Hey, let's treat this
and that kind of invalid UTF-8 in this and that manner which I happen
to like."  The documentation nowhere says that the function does so,
and the current implementation doesn't do it either, so what's the
point?

--Daniel



