Re: Some profiling

From: Tim Janik <timj gtk org>
To: Stefan Westerfeld <stefan space twc de>
Cc: beast gnome org
Subject: Re: Some profiling
Date: Wed Jan 28 22:40:10 2004
On Thu, 29 Jan 2004, Stefan Westerfeld wrote:

>    Hi!
>
> I was wondering why beast/tests/perftest is soo slow. It only invokes a
> procedure over and over again. Result (Athlon-XP 1800+, Linux-2.6.0-k7,
> Debian unstable):

if you're running 2.6.0 and debian, you can also facilitate the oprofile
module. for that, you need the oprofile packet installed, the kernel
module enabled, and if you run into:
opgprof: error while loading shared libraries: libbfd-2.14.90.0.6.so: cannot open shared object file: No such file or directory
a simple:
cp -aiv /usr/lib/libbfd-2.14.90.0.7.so /usr/lib/libbfd-2.14.90.0.6.so
will do ;)
this gets you profiling of binaries and shared libraries without
recompilation (albeit without call-graphs also, so there is a point
in going through the hassle of using -pg and gprof).

> stefan@luna1:~/src/beast-plain/beast/tests$ perftest
> 1.371361 seconds for 10000 invocations => 7292.026062 invocations/second, 0.000137 sec per invocation

> So the first assumption I had was that its the context switching (between
> the threads) itself that is soo slow. But writing two programs, one that
> uses select and one that uses pthread conditions, to just test context
> switching, didn't confirm this. The timings on my system for 1000000 round
> trips (switches back and forth) are:

i'm not sure i understand why you did this in the first place.
were these complete stand-alone programs? as for beast, the
data structure used for communication at the thread boundaries
is SfiComPort, and that uses a condition or standard
sfi_thread_wakeup() to wake up the receiving thread.
the condition is used when waiting for a procedure return value,
while sfi_thread_wakeup() is used to wake up an idle thread
once new data arrives, and sfi_thread_wakeup() uses the glib
main loop pipe/select mechanism.
so all in all, with perftest benching procedure invocations,
you get a good mix of condition and select() waits, almost
50:50 i'd guess.

> So, I think that the roundtrip delay from context switching is probably not
> the problem (that would allow 125470 roundtrips per second, which is far better
> than the observed value).

> However, as this wasn't the real reason, I started to build a profiled version
> of perftest, which was quite a problem, as gprof doesn't handle shared libs
> well. I did this:
>
>  - build glib-2.2.3 from scratch with -pg compiler option
>  - build sfi, bse and tests with -pg compiler option
>  - make a list of all object files required by perftest

hm, interesting, how did you do that?

>  - link them together manually, adding -lmad -lz -logg and so on
>
> The full results (gprof output for doing 50000 invocations), including the
> roundtrip tests mentioned above, can be found at
>
>   http://space.twc.de/~stefan/kde/beast-profiling.tar.bz2

unfortunately not:

$ wget -c http://space.twc.de/~stefan/kde/beast-profiling.tar.bz2
04:06:56 ERROR 404: File Not Found.

> What can be seen (or at least what I can see, its definitely worth reading
> the whole result yourself) is that procedure invocation currently places
> _heavy_ stress on the glib type system:
>
> Each sample counts as 0.01 seconds.
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
>   8.43      0.80     0.80 14150793     0.00     0.00  type_check_is_value_type_U
>   7.80      1.54     0.74                             __i686.get_pc_thunk.bx
>   7.80      2.28     0.74                             _g_locale_charset_unalias
>   7.11      2.96     0.68  5250491     0.00     0.00  g_type_is_a
>   3.79      3.32     0.36  1552384     0.00     0.00  g_hash_table_lookup
>   3.58      3.66     0.34  6550946     0.00     0.00  g_type_value_table_peek
>   3.27      3.97     0.31  4000171     0.00     0.00  g_type_check_value_holds
>   3.06      4.26     0.29  1103356     0.00     0.00  g_str_hash
>   2.32      4.48     0.22  1550089     0.00     0.00  g_value_unset
>   2.27      4.69     0.22  5650292     0.00     0.00  g_type_check_value
>   2.16      4.90     0.21   750000     0.00     0.00  sfi_seq_append_copy
>   2.11      5.10     0.20  5000184     0.00     0.00  boxed_nodes_cmp
>   2.11      5.30     0.20  1500447     0.00     0.00  g_data_set_internal
>   1.74      5.46     0.17  1150103     0.00     0.00  g_value_type_compatible
>   1.63      5.62     0.16   350010     0.00     0.00  boxed_proxy_value_init
>
> So the first 60% of the profile is almost completely spent within the type
> system.

no wonder, all that perftest does is calling a very simple procedure, so
the majority of work is spent marshalling and demarshalling the call,
which is all sorts of type system interaction.

> Given that all GUI/Core communication (including things that require
> frequent updates like scopes) will go through the SFI layer, optimization
> of the whole process might be useful.

while the GUI might benefit from faster call marshalling, as far as the
scopes are concerned, i just committed a major optimization yesterday.
that is, SfiComPort (and dependent code) now support bulk transfers,
so multiple scope requests and responses are being processed per
context switch now. these are still marshalled int multiple GValues
of course though ;)

> Besides optimizing the type system itself (which could be hard),

that's a topic coming up over and over again (on gtk-devel-list that is),
which is why a fair amount of fixes did already go into the type system.
so it's unlikely you'll find lots of a big and easy to optimize spot ;(

> I think it
> might also be worthwile to check how to reduce the stress we put on the type
> system (i.e. by trying to avoid redundant checks/conversions of GValues in
> the critical path).

depends on what's redundant here, have you had an actuall look at the
checks in the path and can point out specific spots that are redundant?

> At least I don't see why ~100 g_type_is_a calls per
> procedure invocation should be really necessary.

i think i do see where those come from. for the most part, functions do:
value_set_int (GValue *v, int i)
{
  g_return_if_fail (G_VALUE_HOLDS_INT (v));
  . . .
}
which is a type_check_is_value_type_U() and a g_type_is_a() each time.

a similar performance bottle neck popped up in gtk programs with
object instance checks, up to a point, where we switched to inlined
type checks for GCC, which gave in a bit of the type safety being
checked for. we could do something similar for values, please redo
your benchmarks with the following macro definition applied (best is
to put this in your local copy of gtype.h):

#ifdef  __GNUC__
#  define _G_TYPE_CVH(vl, gt)             (G_GNUC_EXTENSION ({ \
  GValue *__val = (GValue*) vl; GType __t = gt; gboolean __r; \
  if (__val && __val->g_type == __t) \
    __r = TRUE; \
  else \
    __r = g_type_check_value_holds (__val, __t); \
  __r; \
}))
#else  /* !__GNUC__ */
#  define _G_TYPE_CVH(vl, gt)             (g_type_check_value_holds ((GValue*) vl, gt))
#endif /* !__GNUC__ */

and if you could make both available, that'd be great ;)

> It might also be possible to optimize the client language binding to generate
> different, but more performant (in the sense of leaving us again with less
> work to do, until we have the final sfi datastructure we can send) invocation
> code.

i'd first check whether the client language binding is indeed
a major bottleneck. i'd guess not, as the most part of value
magic (copies, validation etc.) is going on in the code performing
a procedure invocation.

thanks for doing the profiling work, if your results show a significant
difference, we can put the above _G_TYPE_CVH() into glib-2.4.

>
>    Cu... Stefan

---
ciaoTJ
Follow-Ups:
- Re: Some profiling
  - From: Stefan Westerfeld
- Re: Some profiling
  - From: Stefan Westerfeld
References:
- Some profiling
  - From: Stefan Westerfeld
[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]