Re: Faster UTF-8 decoding in GLib



Hi,

On Wednesday, 17 March 2010 at 00:17 +0200, Mikhail Zabaluev wrote:

> Yes, though we are already in the buffer overflow territory with all
> implementations of g_utf8_get_char considered so far.

It only reads past the end, so there are no security implications beyond
a potential for DoS in the unlikely event that the memory a few bytes
ahead is not accessible.  And the current implementation has that
"problem", too.

> >> My understanding is that unvalidated decoding should also accept
> >> various software's misconstructions of UTF-8 and produce some
> >> meaningful output.
> >
> > Meaningful in what sense?  And what kind of misconstructions would that
> > be, for example?
> 
> Wikipedia describes a couple:
> http://en.wikipedia.org/wiki/UTF-8#UTF-8_derivations
> 
> I think it's useful to have functions loose enough to interoperate
> with these too, as long as one uses the validating routines for any
> untrusted input.

The NUL character encoded as 0xC0 0x80 is simply an overlong sequence
and should therefore be parsed as U+0000 by my original implementation,
although I wouldn't call that a feature.  And interpreting CESU-8 is
something the mainline implementation does not do either, nor do I
think it should.

Sorry, I think it's completely arbitrary to say "Hey, let's treat this
and that kind of invalid UTF-8 in this and that manner which I happen
to like."  The documentation nowhere says that the function does so,
and the current implementation doesn't do it either, so what's the
point?

--Daniel



