Re: Faster UTF-8 decoding in GLib
- From: Daniel Elstner <daniel kitta googlemail com>
- To: Mikhail Zabaluev <mikhail zabaluev gmail com>
- Cc: gtk-devel-list gnome org
- Subject: Re: Faster UTF-8 decoding in GLib
- Date: Wed, 17 Mar 2010 12:38:33 +0200
Hi,
On Wednesday, 17.03.2010 at 00:17 +0200, Mikhail Zabaluev wrote:
> Yes, though we are already in the buffer overflow territory with all
> implementations of g_utf8_get_char considered so far.
It only reads past the end, so there are no security implications beyond
a potential for DoS in the unlikely event that the memory a few bytes
ahead is not accessible. And the current implementation has that
"problem", too.
> >> My understanding is that unvalidated decoding should also accept
> >> various software's misconstructions of UTF-8 and produce some
> >> meaningful output.
> >
> > Meaningful in what sense? And what kind of misconstructions would that
> > be, for example?
>
> Wikipedia describes a couple:
> http://en.wikipedia.org/wiki/UTF-8#UTF-8_derivations
>
> I think it's useful to have functions loose enough to interoperate
> with these too, as long as one uses the validating routines for any
> untrusted input.
The NUL character encoded as 0xC0 0x80 is simply an overlong sequence
and should therefore be parsed as U+0000 by my original implementation,
although I wouldn't call that a feature. And interpreting CESU-8 is
something the mainline implementation does not do either, nor do I think
it should.
Sorry, I think it's completely arbitrary to say "Hey, let's treat this
and that kind of invalid UTF-8 in this or that manner which I happen to
like." The documentation nowhere says that it does so, and the current
implementation doesn't do it either, so what's the point?
--Daniel