Re: Faster UTF-8 decoding in GLib


On Saturday, 27.03.2010, at 18:04 -0400, Behdad Esfahbod wrote:

> Sure, I wasn't referring to valid data.  In valid UTF-8, there is no 5byte or
> 6byte sequences either.

True, but that restriction was imposed after the fact, when Unicode
was redefined as a 21-bit character set, presumably to match the range
representable by UTF-16.  The 21-bit version of UCS-4 got the new name
UTF-32, but UTF-8 kept its name despite the changed definition.

At the time the UTF-8 decoding routines in GLib and glibmm were written,
a UTF-8 sequence was still defined as being up to six bytes long and
able to encode a full 31-bit UCS-4 code point.

I don't think it's inconceivable that some day the restriction to 21
bits may be lifted again.  After all, Unicode started out as a 16-bit
encoding and was later extended beyond that range.

But even if that never happens, no-longer-valid UTF-8 sequences of five
or six bytes can be interpreted as UCS-4 code points in an unambiguous
and obvious manner.  And since that result falls out of the algorithm
naturally, I see no need to artificially constrain it so that it would
return something else instead.
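
To illustrate how the five- and six-byte cases fall out of the same
loop, here is a minimal sketch of an unchecked decoder following the
original six-byte scheme.  It is not the actual GLib code: the function
name decode_utf8_unchecked() and the (uint32_t)-1 return value for a
non-lead byte are assumptions made for this example only.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper, not GLib's implementation.  Decodes one sequence
 * of up to six bytes.  Only the lead byte is classified; continuation
 * bytes are NOT validated, so the input must be well-formed. */
static uint32_t decode_utf8_unchecked(const unsigned char *p)
{
    uint32_t c = p[0];
    size_t len;

    if (c < 0x80)
        return c;                                         /* 1 byte,   7 bits */
    else if ((c & 0xE0) == 0xC0) { len = 2; c &= 0x1F; }  /* 2 bytes, 11 bits */
    else if ((c & 0xF0) == 0xE0) { len = 3; c &= 0x0F; }  /* 3 bytes, 16 bits */
    else if ((c & 0xF8) == 0xF0) { len = 4; c &= 0x07; }  /* 4 bytes, 21 bits */
    else if ((c & 0xFC) == 0xF8) { len = 5; c &= 0x03; }  /* 5 bytes, 26 bits */
    else if ((c & 0xFE) == 0xFC) { len = 6; c &= 0x01; }  /* 6 bytes, 31 bits */
    else
        return (uint32_t)-1;      /* 0x80..0xBF, 0xFE, 0xFF: not a lead byte */

    /* Each continuation byte contributes its low six bits. */
    for (size_t i = 1; i < len; i++)
        c = (c << 6) | (p[i] & 0x3F);

    return c;
}
```

The five- and six-byte branches need no special handling beyond their
own lead-byte mask; dropping them would take extra code rather than
less.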

However, for other invalid conditions to result in defined behavior,
explicit checks would be required in the code.  I see no reason to pay
the cost of partial validation checks, given that the documentation
explicitly states that the behavior is undefined if the input is not
valid UTF-8.  It might be a different matter if the code wrote past the
end of a buffer or something, but that's not the case here.
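
For comparison, here is a rough sketch of the kind of explicit checks
that fully defined behavior on invalid input would require under the
current 21-bit definition.  Everything in it is an assumption for
illustration: decode_utf8_checked(), DECODE_ERROR and the length
parameter are hypothetical names, not part of GLib.

```c
#include <stdint.h>
#include <stddef.h>

#define DECODE_ERROR ((uint32_t)-1)   /* hypothetical error marker */

/* Hypothetical checked decoder, not GLib's implementation.  Decodes one
 * sequence from the n bytes at p, rejecting anything that is not valid
 * UTF-8 under the 21-bit definition. */
static uint32_t decode_utf8_checked(const unsigned char *p, size_t n)
{
    /* Smallest code point that legitimately needs a sequence of len bytes. */
    static const uint32_t min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t c;
    size_t len;

    if (n == 0)
        return DECODE_ERROR;
    c = p[0];
    if (c < 0x80)
        return c;
    else if ((c & 0xE0) == 0xC0) { len = 2; c &= 0x1F; }
    else if ((c & 0xF0) == 0xE0) { len = 3; c &= 0x0F; }
    else if ((c & 0xF8) == 0xF0) { len = 4; c &= 0x07; }
    else
        return DECODE_ERROR;          /* stray continuation or 5/6-byte lead */

    if (len > n)
        return DECODE_ERROR;          /* truncated sequence */

    for (size_t i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return DECODE_ERROR;      /* not a continuation byte */
        c = (c << 6) | (p[i] & 0x3F);
    }

    if (c < min_value[len])
        return DECODE_ERROR;          /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF)
        return DECODE_ERROR;          /* UTF-16 surrogate */
    if (c > 0x10FFFF)
        return DECODE_ERROR;          /* beyond the 21-bit Unicode range */

    return c;
}
```

Every one of these branches sits on the decoding path, which is the
cost being weighed above; omitting any of them leaves some invalid
inputs undetected.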

Interestingly, g_utf8_get_char() is the only place where the UTF8_GET()
macro is used.  I guess this wasn't always the case, and that some other
piece of code may have relied upon its half-checking behavior in the

