Some profiling



   Hi!

I was wondering why beast/tests/perftest is so slow. It only invokes a
procedure over and over again. Result (Athlon-XP 1800+, Linux-2.6.0-k7,
Debian unstable):

stefan@luna1:~/src/beast-plain/beast/tests$ perftest
1.371361 seconds for 10000 invocations => 7292.026062 invocations/second, 0.000137 sec per invocation

This appears to be quite slow to me, since what happens should only be
the following (a minimal sketch of the round trip follows the list):
 - the C++ language binding packs the procedure name, parameters and
   args into an SfiValue based data structure
 - calling thread wakes up bse thread and suspends
 - bse thread gets procedure name and args from the data structure
 - procedure gets executed
 - bse thread creates SfiValue based data structure, containing the result
 - bse thread wakes up calling thread
 - the C++ language binding extracts the result from the SfiValue and
   returns it
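
To make this concrete, here is a minimal sketch of such a round trip using
plain pthread primitives (the names invoke_remote, bse_thread and Invocation
are made up for illustration; the real code passes SfiValue structures
around, not a single int):

#include <pthread.h>

typedef struct {
  const char     *proc_name;    /* stands in for the SfiValue request */
  int             arg, result;
  int             pending;      /* 1 while the bse thread owns the request */
  pthread_mutex_t lock;
  pthread_cond_t  cond;
} Invocation;

static int
invoke_remote (Invocation *inv, const char *name, int arg)
{
  int result;

  pthread_mutex_lock (&inv->lock);
  inv->proc_name = name;
  inv->arg = arg;
  inv->pending = 1;
  pthread_cond_signal (&inv->cond);     /* wake up the bse thread */
  while (inv->pending)                  /* suspend until the result is there */
    pthread_cond_wait (&inv->cond, &inv->lock);
  result = inv->result;
  pthread_mutex_unlock (&inv->lock);
  return result;
}

static void*
bse_thread (void *data)
{
  Invocation *inv = data;

  pthread_mutex_lock (&inv->lock);
  for (;;)                              /* runs until process exit */
    {
      while (!inv->pending)
        pthread_cond_wait (&inv->cond, &inv->lock);
      inv->result = inv->arg + 1;       /* "execute" the procedure */
      inv->pending = 0;
      pthread_cond_signal (&inv->cond); /* wake up the calling thread */
    }
  return NULL;
}

int
main (void)
{
  Invocation inv = { NULL, 0, 0, 0,
                     PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };
  pthread_t thread;

  pthread_create (&thread, NULL, bse_thread, &inv);
  return invoke_remote (&inv, "noop", 41) == 42 ? 0 : 1;
}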

So the first assumption I had was that it's the context switching between
the threads itself that is so slow. But writing two small programs to test
just the context switching, one using select and one using pthread
conditions, didn't confirm this (the core of the select variant is sketched
below the timings). The timings on my system for 1000000 round trips
(switches back and forth) are:

stefan@luna1:~/prog/roundtrip$ time -p roundtrip-pthread
1823293664 (expected 1823293664)    # <- this is debugging output only
real 9.24
user 1.35
sys 3.55
stefan@luna1:~/prog/roundtrip$ time -p roundtrip-select
1823293664 (expected 1823293664)
real 7.97
user 0.54
sys 3.42
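
For reference, the select()/pipe() round trip boils down to something like
this (a sketch of the idea, not the exact program I timed; error checking
omitted):

#include <pthread.h>
#include <unistd.h>
#include <sys/select.h>

#define ROUNDS 1000000

static int ping[2], pong[2];   /* ping: main -> echo, pong: echo -> main */

static void
wait_readable (int fd)
{
  fd_set set;

  FD_ZERO (&set);
  FD_SET (fd, &set);
  select (fd + 1, &set, NULL, NULL, NULL);
}

static void*
echo_thread (void *data)
{
  char byte;
  int i;

  (void) data;
  for (i = 0; i < ROUNDS; i++)
    {
      wait_readable (ping[0]);
      read (ping[0], &byte, 1);
      write (pong[1], &byte, 1);        /* bounce the byte back */
    }
  return NULL;
}

int
main (void)
{
  pthread_t thread;
  char byte = 0;
  int i;

  pipe (ping);
  pipe (pong);
  pthread_create (&thread, NULL, echo_thread, NULL);
  for (i = 0; i < ROUNDS; i++)
    {
      write (ping[1], &byte, 1);
      wait_readable (pong[0]);
      read (pong[0], &byte, 1);         /* one full round trip */
    }
  pthread_join (thread, NULL);
  return 0;
}

Note that there are no application level locks anywhere in this variant.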

So, I think that the roundtrip delay from context switching is probably not
the problem (the select timing corresponds to 125470 roundtrips per second,
which is far more than the observed 7292 invocations per second). I was also
surprised that select actually performs better than pthread conditions. But
when you look at the code, you can see why that may be: if the kernel decides
to switch threads as soon as a wakeup occurs on a condition, it may need to
switch back and forth more often than necessary, because of locks held by
the other thread (pthread conditions require locks).
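
What I mean is the standard signal-while-locked pattern (a tiny fragment for
illustration):

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int done = 0;

static void
wake_waiter (void)
{
  pthread_mutex_lock (&lock);
  done = 1;
  /* if the kernel switches to the waiting thread right here, that thread
   * wakes up only to block on the mutex we still hold, and has to be
   * scheduled away and back once more after the unlock below */
  pthread_cond_signal (&cond);
  pthread_mutex_unlock (&lock);
}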

But maybe something similar occurs with select() as well? Or maybe it's
simply that the select()/pipe() solution uses no locks in the application
at all, and that is what makes it perform better? I don't know.

However, as this wasn't the real reason, I started to build a profiled version
of perftest, which was quite a problem, as gprof doesn't handle shared libs
well. I did this:

 - build glib-2.2.3 from scratch with the -pg compiler option
 - build sfi, bse and the tests with the -pg compiler option
 - make a list of all object files required by perftest
 - link them together manually, adding -lmad -lz -logg and so on

The full results (gprof output for doing 50000 invocations), including the
roundtrip tests mentioned above, can be found at

  http://space.twc.de/~stefan/kde/beast-profiling.tar.bz2

What can be seen (or at least what I can see; it's definitely worth reading
the whole result yourself) is that procedure invocation currently places
_heavy_ stress on the GLib type system:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
  8.43      0.80     0.80 14150793     0.00     0.00  type_check_is_value_type_U
  7.80      1.54     0.74                             __i686.get_pc_thunk.bx
  7.80      2.28     0.74                             _g_locale_charset_unalias
  7.11      2.96     0.68  5250491     0.00     0.00  g_type_is_a
  3.79      3.32     0.36  1552384     0.00     0.00  g_hash_table_lookup
  3.58      3.66     0.34  6550946     0.00     0.00  g_type_value_table_peek
  3.27      3.97     0.31  4000171     0.00     0.00  g_type_check_value_holds
  3.06      4.26     0.29  1103356     0.00     0.00  g_str_hash
  2.32      4.48     0.22  1550089     0.00     0.00  g_value_unset
  2.27      4.69     0.22  5650292     0.00     0.00  g_type_check_value
  2.16      4.90     0.21   750000     0.00     0.00  sfi_seq_append_copy
  2.11      5.10     0.20  5000184     0.00     0.00  boxed_nodes_cmp
  2.11      5.30     0.20  1500447     0.00     0.00  g_data_set_internal
  1.74      5.46     0.17  1150103     0.00     0.00  g_value_type_compatible
  1.63      5.62     0.16   350010     0.00     0.00  boxed_proxy_value_init

So roughly the first 60% of the profile is spent almost entirely within the
type system. Given that all GUI/Core communication (including things that
require frequent updates, like scopes) will go through the SFI layer,
optimizing the whole process might be useful.

Besides optimizing the type system itself (which could be hard), I think it
might also be worthwhile to check how we can reduce the stress we put on the
type system (e.g. by avoiding redundant checks/conversions of GValues in the
critical path). At least I don't see why ~100 g_type_is_a calls per
procedure invocation (5250491 calls for 50000 invocations) should really be
necessary.
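
To illustrate the kind of redundancy I mean (an illustrative example, not
actual SFI code): in a loop over GValues, every G_VALUE_HOLDS() goes through
g_type_check_value_holds() and thus g_type_is_a(), although for a fundamental
type a plain GType comparison would do:

#include <glib-object.h>

/* per-element G_VALUE_HOLDS() re-enters the type system
 * (g_type_check_value_holds -> g_type_is_a) on every iteration */
static gdouble
sum_doubles_checked (GValue *values, guint n)
{
  gdouble sum = 0;
  guint i;

  for (i = 0; i < n; i++)
    if (G_VALUE_HOLDS (values + i, G_TYPE_DOUBLE))
      sum += g_value_get_double (values + i);
  return sum;
}

/* for a fundamental type like G_TYPE_DOUBLE, an exact GType comparison is
 * equivalent (nothing derives from it) and avoids the is_a traversal;
 * g_value_get_double() still performs its own internal check, which hot
 * paths could skip the same way */
static gdouble
sum_doubles_direct (GValue *values, guint n)
{
  gdouble sum = 0;
  guint i;

  for (i = 0; i < n; i++)
    if (G_VALUE_TYPE (values + i) == G_TYPE_DOUBLE)
      sum += g_value_get_double (values + i);
  return sum;
}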

It might also be possible to optimize the client language binding to
generate different, but more efficient invocation code, in the sense that it
leaves less work to do before we arrive at the final SFI data structure
that gets sent.

   Cu... Stefan
-- 
  -* Stefan Westerfeld, stefan@space.twc.de (PGP!), Hamburg/Germany
     KDE Developer, project infos at http://space.twc.de/~stefan/kde *-         


