Re: Faster UTF-8 decoding in GLib


2010/3/26 Behdad Esfahbod <behdad behdad org>:
> Another idea, now that people are measuring: What about this:
> static const int utf8_mask_data[7] = {
>  0, 0x7f, 0x1f, 0x0f, 0x07, 0x03, 0x01
> };
> #define UTF8_COMPUTE(Char, Mask, Len) \
>    Len = utf8_skip_data[(guchar)(Char)]; \
>    Mask = utf8_mask_data[Len]; \
>    if (G_UNLIKELY ((guchar)(Char) >= 0xfe)) \
>      Len = -1; \

I have tried this, and contrary to my expectations as well, the result
on Core 2 was worse than with mainline.

There are now two more changes on this branch:
;a=shortlog;h=refs/heads/fast-utf8-elstner

The mask variables now have the explicit type guint32, rather than
gunichar. I think this ensures that a left shift of 32 bits or more
will result in zero, terminating the loop; if this is not enough, a
mask of 0xFFFFFFFF could be thrown in, which hopefully will be
optimized away on 32-bit targets.

g_utf8_get_char() is back to its previous implementation, in the name
of quirk compatibility. So, there are now three "gears" for UTF-8 decoding:

3. g_utf8_iterate() is the fastest, with almost no validation;
2. g_utf8_get_char() is slower, performs (as yet undocumented) checks for
structurally correct UTF-8-ish sequences;
1. g_utf8_get_char_validated() is the slowest, performs thorough UTF-8
validation.

Best regards,
