Re: Benchmarking BuildStream commands from user point of view



On Mon, 2017-11-06 at 18:38 +0000, Angelos Evripiotis wrote:
I haven't looked at the script, but I can say that we want to shoot for
maintainability and longevity here. In other projects it has taken me a
long time to get benchmarks off the ground again after they have bit
rotted for a year or so.

I agree on the aims. I think we should set up a schedule for the benchmarks to
be run, so that they don't rot away in this manner. One kind of run could
simply check that the suite is still compatible with bst; another kind could
actually produce the results.

Indeed, since we are on GitLab, I am *hoping* we can get all of this
done with GitLab and not introduce other tech - this could potentially
be done with dedicated runner machines and controlled docker images (I
think when you set up your own runners, you are allowed to impose some
restriction to guarantee that your pipeline is exclusive - which seems
appropriate for benchmarking).

However... let's not aim too high for a first iteration - something that
somebody can run themselves to test and compare multiple versions of
BuildStream in the same uncontrolled environment is a good first step.

[...]
While this is not extremely helpful, I feel like I should point to the
benchmarks I worked on back at Openismus. The code is not at all useful
for this, but the same concepts apply:

git repo: https://gitorious.org/openismus-playground/phonebook-benchmarks
sample blog post: https://blogs.gnome.org/tvb/2013/12/03/smart-fast-addressbooks/

Nice, I had a quick look. I like that you made it
reproducible, kept it simple, and did valid comparisons. I think it's
a similar problem.

I worked on this, but most of the credit goes to Mathias Hasselmann, who
initially created it.

I would not say that this was particularly reproducible; getting the
benchmarks up and running was a time sink, and I think that was a weak
point of the project referred to above.

What it did do well though:

 o Accurate readings of CPU and memory consumption

 o Measuring the relevant details (time per record)

 o Comparisons of multiple versions (important for optimization
   work, ability to easily infer the impacts of code changes)

 o Realistic data sets (the vCard generator used a database with
   a wide variety of realistic first and last names, and
   phone numbers that are valid in a wide range of locales).

 o Nice presentation of data; graphing of the results was
   done quite nicely, and with some effort, various graphs
   could be produced to observe different trends from the
   same complete result sets.

   One could take a run of the full benchmarks against 5
   different versions, and then render graphs for selected
   things (like phone-number matching speed for 2 of the
   sampled EDS versions).

[...]
Before we started this thread, I had modified plugin.timed_activity() to
separately log timing and capture a cProfile. It was pretty useful for
focussing on the numbers I was really interested in.

This is an excellent idea I hadn't considered. Very attractive and
tempting.
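
Not having seen your patch, a rough sketch of what I imagine might look
something like the following - purely illustrative, the profiled_activity()
and BST_PROFILE_ACTIVITIES names are made up here, and the real
Plugin.timed_activity() API may well differ:

  import cProfile
  import os
  import time
  from contextlib import contextmanager

  # Illustrative only; not the real Plugin.timed_activity(), and the
  # BST_PROFILE_ACTIVITIES variable is a made-up name.
  @contextmanager
  def profiled_activity(name, logfile="activity-timings.log"):
      profiler = cProfile.Profile() if os.environ.get("BST_PROFILE_ACTIVITIES") else None
      start = time.monotonic()
      if profiler:
          profiler.enable()
      try:
          yield
      finally:
          if profiler:
              profiler.disable()
              profiler.dump_stats("{}.cprofile".format(name.replace(" ", "-")))
          with open(logfile, "a") as f:
              f.write("{}: {:.3f}s\n".format(name, time.monotonic() - start))

  # Usage, where one would otherwise use a plain timed activity:
  #
  #   with profiled_activity("Staging sources"):
  #       ...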

One big problem with benchmarks is having your benchmarking overhead
cloud the results. It would seem that in our case we already make an
attempt to time every operation - so why bother with anything more?

I'm very tempted now to think that a benchmarking solution for
BuildStream should be:

  o A parser of the master BuildStream log output

  o Improvements to the BuildStream logging in general, which is also
    a good thing

    - Ensure that the startup time is logged in the same way as other
      BuildStream activities are - this will have the side effect of
      starting the logger in advance of printing the pipeline summary,
      but that's not a bad thing

    - Ensure that the timings are calculated in the right places. I
      think they already are, and improving on this is only a good
      thing all around

  o Possibly have an option to generate more machine-readable logging
    output.

    This should be as simple as possible and not add significant
    overhead - the purpose here is for benchmarks, so we shouldn't be
    adding any overhead with fancy serialization libraries

    I feel we should avoid this if possible; the current logging is
    already quite machine readable, and perhaps just a tweak here and
    there in the regular master log output will be enough.

This way we have nothing running beside the benchmarked BuildStream,
just an ability to parse the master log file which gets generated under
regular operation *anyway*, and an ability to make nice comparisons
from multiple runs.
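
Just to make that concrete, parsing the log could be as simple as
something along these lines - note that the regular expression is only a
guess at what the master log lines look like (an elapsed [HH:MM:SS]
column, a cache key, the element name and a status word), and would need
adjusting against a real log file:

  import re
  import sys
  from collections import defaultdict

  # Guessed line shape: "[00:01:23][<key>][<element>] SUCCESS <activity>"
  # The elapsed-time column and the status keywords are assumptions.
  LINE_RE = re.compile(
      r"^\[(?P<h>\d+):(?P<m>\d+):(?P<s>\d+)\]"
      r"\[[^\]]*\]\[(?P<element>[^\]]*)\]\s+"
      r"(?P<status>START|SUCCESS|FAILURE)\s+(?P<activity>.*)$")

  def parse_log(path):
      # Returns {(element, activity): elapsed seconds}
      durations = defaultdict(float)
      with open(path) as logfile:
          for line in logfile:
              match = LINE_RE.match(line.rstrip())
              if not match or match.group("status") == "START":
                  continue
              elapsed = (int(match.group("h")) * 3600 +
                         int(match.group("m")) * 60 +
                         int(match.group("s")))
              durations[(match.group("element"), match.group("activity"))] += elapsed
      return durations

  if __name__ == "__main__":
      for (element, activity), seconds in sorted(parse_log(sys.argv[1]).items()):
          print("{:>8.1f}s  {} {}".format(seconds, element, activity))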

Does this sound like a sane way forward?

Also, a note about cProfile: we do have buildstream/_profile.py, which
we use internally, mostly for load-time activities. This is more
interesting when there is a clear performance issue and it's time to
figure out where exactly the time is being spent.
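
For reference, the stock cProfile workflow is roughly the following -
this is just standard library usage, not specifically what _profile.py
does, and load_the_pipeline() here is only a placeholder for whatever
code path is under suspicion:

  import cProfile
  import pstats

  # Profile the suspect code path and dump the raw stats to disk ...
  profiler = cProfile.Profile()
  profiler.enable()
  load_the_pipeline()   # placeholder for the slow code path
  profiler.disable()
  profiler.dump_stats("load.cprofile")

  # ... then inspect where the time went, sorted by cumulative time.
  pstats.Stats("load.cprofile").sort_stats("cumulative").print_stats(20)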

Also, this gets a bit more complex with Source plugins which exercise
host tooling - it should be noted that if we run the benchmarks in 2017
and then a year later on a different host, the performance may improve
or regress depending on developments which have occurred in third-party
tooling. I don't think it's important to set up an identical environment
for benchmarking though; just that we run the full benchmarking suite
against every interesting version of BuildStream on the same host
setup, and know that external factors might change performance.

Very good points, it's really important to have a basis for comparison. We
can't really change environments and BuildStream version at the same time and
expect to have a useful comparison. One variable at a time! Testing on all
interesting versions should give us a good basis; I think if we manage some of
the environment then we can make an even better one.

Right, so circling back - let's take an iterative approach and
prioritize a bit; I think a perfectly controlled environment is low on
the list of priorities.

For instance, if we have a benchmarking setup which allows one to run
multiple versions; it matters much less whether host `git` is slow or
fast.

On a host with slow git, the numbers (durations) will be higher, but
this does not detract from our ability to measure how we have improved
or regressed performance from one version of BuildStream to another.
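
As a tiny illustration of that point, once we have per-activity
durations for two versions (e.g. from a parser like the sketch above),
the interesting number is the relative change, which is independent of
how fast the host happens to be:

  def compare(baseline, candidate):
      # Both arguments map activity name -> seconds
      for activity in sorted(set(baseline) & set(candidate)):
          old, new = baseline[activity], candidate[activity]
          if old > 0:
              print("{:+7.1f}%  {}".format((new - old) / old * 100.0, activity))

  # Made-up numbers: absolute durations differ per host, but the
  # relative regression on "Fetching sources" shows up either way.
  compare({"Fetching sources": 10.0, "Staging sources": 4.0},
          {"Fetching sources": 15.0, "Staging sources": 4.1})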

Cheers,
    -Tristan


