Re: Faster UTF-8 decoding in GLib
- From: Behdad Esfahbod <behdad behdad org>
- To: Daniel Elstner <daniel kitta googlemail com>
- Cc: gtk-devel-list gnome org
- Subject: Re: Faster UTF-8 decoding in GLib
- Date: Fri, 26 Mar 2010 13:25:57 -0400
Sorry for replying so late. I saw a few replies implying that since the developer
time to implement a feature of (to me) unmeasurable usefulness has already been
spent, I should go ahead and commit it. There are various flaws in that
argument:
- It ignores the fact that writing a patch is only a small part of the time spent
on a change; it leaves out the maintainer's review time as well as future
maintenance. If you think I should commit without spending significant time
on it, well, there's a reason you're not the maintainer :P. In short, it's
the maintainer who is taking the risk, not you or the patch author. Guess
why I'm replying this late? Because reading 18 messages and 20 patches takes
time, time I could spend fixing a bug that has a measurable impact at least.
- It also assumes that the patch is ready and useful. The original patch
series had various flaws. A few I'll list:
* Introduced 256 new relocations!
* Inlined a public function, but only to make an indirect function call
instead. What's the point of inlining then?!
* Had unknown impacts on systems with higher function-call overhead.
* Was not tested in real-life situations. Perf tests are not realistic:
calling g_utf8_next_char a million times in a loop is nothing like real life.
In real life the strings being processed are really short, and memory-cache
effects make any micro-optimization look like noise.
* Changed the semantics of the GLib UTF-8 functions. Dealing with UTF-8
coming from the outside world is a very sensitive matter security-wise. There's
backward compatibility to consider too; we can't just decide to return a
different value from now on.
* Used a construct borrowed from glibmm that, as beautiful as it is, is WRONG
for 6-byte-long UTF-8. It just doesn't work, and we historically support those
sequences (a short sketch of the legacy lead bytes follows right after this list).
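For background on that last point (my own illustration, not anything taken from
the patch series): legacy, pre-RFC-3629 UTF-8 allows 5- and 6-byte sequences
with lead bytes 0xf8-0xfb and 0xfc-0xfd, and a construct that assumes at most
4 bytes per character misreads them. A minimal standalone sketch of the
historical lead-byte classification, in case it helps:

#include <stdio.h>

/* Sketch only, not GLib code: lead-byte classification for the historical
 * 1..6-byte UTF-8 that GLib keeps accepting.  Anything that stops at
 * 4-byte sequences mishandles the 0xf8-0xfd range. */
static int
utf8_len_full (unsigned char c)
{
  if (c < 0x80) return 1;   /* 0xxxxxxx, ASCII */
  if (c < 0xc0) return 0;   /* 10xxxxxx, continuation, not a lead byte */
  if (c < 0xe0) return 2;   /* 110xxxxx */
  if (c < 0xf0) return 3;   /* 1110xxxx */
  if (c < 0xf8) return 4;   /* 11110xxx */
  if (c < 0xfc) return 5;   /* 111110xx, legacy 5-byte */
  if (c < 0xfe) return 6;   /* 1111110x, legacy 6-byte */
  return 0;                 /* 0xfe/0xff never start a sequence */
}

int
main (void)
{
  printf ("len(0xfd) = %d\n", utf8_len_full (0xfd));   /* prints 6 */
  return 0;
}

Those lengths line up with the 7-entry mask table in the idea further down:
index 5 is 0x03 (the two payload bits of 111110xx) and index 6 is 0x01 (the
one payload bit of 1111110x).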
That said, I'm not being unfair to anyone here. I'm a UTF-8 micro-optimizing
geek myself. See for example this blog post:
http://mces.blogspot.com/2008/04/utf-8-bit-manipulation.html
Even so, I'm not willing to commit my own optimization to that code without
seeing real-world numbers first.
Another idea, now that people are measuring: What about this:
static const int utf8_mask_data[7] = {
  /* payload bits of the lead byte, indexed by sequence length 1..6
   * (index 0 is unused) */
  0, 0x7f, 0x1f, 0x0f, 0x07, 0x03, 0x01
};

#define UTF8_COMPUTE(Char, Mask, Len)                                    \
  G_STMT_BEGIN {                                                         \
    Len  = utf8_skip_data[(guchar)(Char)]; /* length from the skip table */ \
    Mask = utf8_mask_data[Len];            /* lead-byte payload mask */     \
    if (G_UNLIKELY ((guchar)(Char) >= 0xfe))                             \
      Len = -1;                            /* 0xfe/0xff never valid */      \
  } G_STMT_END
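To make the intent concrete, here is a standalone sketch (my illustration only)
of how such a table-driven classifier would feed the usual continuation-byte
accumulation loop; decode_one and the branch-based utf8_skip stand-in are made
up for the example, and like the macro above it leaves a stray continuation
byte in the lead position for the validating entry points to catch:

#include <stdio.h>

/* Sketch only, not GLib code. */
static const int utf8_mask_data[7] = {
  0, 0x7f, 0x1f, 0x0f, 0x07, 0x03, 0x01
};

/* Stand-in for GLib's 256-entry skip table: sequence length by lead byte. */
static int
utf8_skip (unsigned char c)
{
  if (c < 0xc0) return 1;
  if (c < 0xe0) return 2;
  if (c < 0xf0) return 3;
  if (c < 0xf8) return 4;
  if (c < 0xfc) return 5;
  if (c < 0xfe) return 6;
  return 1;
}

/* Decode one character starting at p; returns the code point, or -1
 * on a malformed sequence. */
static long
decode_one (const unsigned char *p)
{
  int  len  = utf8_skip (p[0]);
  int  mask = utf8_mask_data[len];
  long wc;
  int  i;

  if (p[0] >= 0xfe)                  /* 0xfe/0xff never lead a sequence */
    return -1;

  wc = p[0] & mask;                  /* payload bits of the lead byte */
  for (i = 1; i < len; i++)
    {
      if ((p[i] & 0xc0) != 0x80)     /* must be a 10xxxxxx continuation */
        return -1;
      wc = (wc << 6) | (p[i] & 0x3f);
    }
  return wc;
}

int
main (void)
{
  const unsigned char euro[] = { 0xe2, 0x82, 0xac, 0 };   /* U+20AC */
  printf ("U+%04lX\n", decode_one (euro));                /* prints U+20AC */
  return 0;
}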
behdad