Re: Automated benchmarks of BuildStream




On 26/03/18 09:50, Tristan Van Berkom wrote:
On Fri, 2018-03-23 at 10:08 +0000, Jim MacArthur wrote:
[...]
The goal I have in my mind is to have a directory produced by the GitLab
CI system which contains various artifacts of the benchmarking process.
JSON output from both benchmarks and log scraping will be the primary
form of data, and we'll also be able to produce visual indicators in
HTML and SVG or PNG, along with nicely-formatted tables, and CSV format
for pasting into a spreadsheet to do your own ad-hoc analysis.
Hi Jim,

So I'm sorry if my recent reply here sounded "short"; a lot of this
benchmarking initiative has happened out of sight, and as such it's
impossible for me to know how much effort has been invested in this so
far.
It didn't sound short, no worries. I haven't had a proper chance to do any planning work on benchmarks, and I understand that it looks confusing. Hopefully today I'll be free of interruptions so I can organise some of the work that's been done by various people in the past few weeks.

Since benchmarks has, I feel, veered off course at least once since
its inception (Sam and I did not see exactly eye to eye on this from
the beginning), it worries me if a lot of effort is being spent
without us necessarily being on the same page. I think we fixed that
in our discussions at the hackfest before FOSDEM; let's try to make
sure we remain on the same page.

First, here is the material I was able to gather on the subject:

   Angelos's original email in November, which appears to be a reply
   to a mail of mine whose origin I can no longer find:
   https://mail.gnome.org/archives/buildstream-list/2017-November/msg00001.html

   It's worth reading through the above thread, but here are some of my
   replies in that thread regardless:
   https://mail.gnome.org/archives/buildstream-list/2017-November/msg00005.html
   https://mail.gnome.org/archives/buildstream-list/2017-November/msg00017.html

   Sam's announcement of the beginnings of the benchmarks repo:
   https://mail.gnome.org/archives/buildstream-list/2018-February/msg00012.html

   A flagship issue in the buildstream repo:
   https://gitlab.com/BuildStream/buildstream/issues/205
The three unticked parts here are the things we've been working on recently - Antoine on BuildStream project generation, me on log scraping and Dominic on output formats. All of them are implemented to some degree, just not merged/integrated into the benchmarks repo yet.

   And a README in the benchmarks repo:
   https://gitlab.com/BuildStream/benchmarks


With a re-read of the above things, I *think* we are *mostly* on the
same page here, regarding:

   o This is something standalone that a developer can:
     - Run on their laptop
     - Render and view the results
     - Select which parts of the benchmarks they want to run, for
       quicker observation of the impacts of their code changes

   o Leveraging of BuildStream logging which already features the
     timing of "things" we would want to analyze in benchmarks, in
     order to reduce the observer effect.
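On the logging point just above: the log scraping I mentioned is intended to pull timings straight out of BuildStream's logs rather than wrap BuildStream in external timers, precisely to keep the observer effect down. A minimal sketch of the idea follows; the log line format here is only an assumption for illustration, and the real scraper will match whatever BuildStream actually emits:

    import re
    from collections import defaultdict

    # Assumed log line shape, for illustration only, e.g.:
    #   [00:00:01.234567] core/foo.bst: Loading element
    # The real scraper will match whatever format BuildStream emits.
    TIMESTAMP = re.compile(r'^\[(\d+):(\d+):(\d+)\.(\d+)\]\s+(.*)$')

    def elapsed_seconds(match):
        hours, minutes, seconds, fraction = match.groups()[:4]
        return (int(hours) * 3600 + int(minutes) * 60 + int(seconds)
                + float('0.' + fraction))

    def scrape(logfile):
        # Attribute the gap between consecutive timestamped lines to the
        # earlier message, so the timings come from the log itself.
        timings = defaultdict(list)
        previous_time = previous_msg = None
        with open(logfile) as f:
            for line in f:
                match = TIMESTAMP.match(line)
                if not match:
                    continue
                now = elapsed_seconds(match)
                if previous_time is not None:
                    timings[previous_msg].append(now - previous_time)
                previous_time, previous_msg = now, match.group(5).strip()
        return timings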

So far so good. One thing I'm more concerned about is what exactly we
are measuring, and how we set about measuring that; what exactly do we
want to be observing?

When we say "I want to observe the time it takes to stage files" or...
"I want to observe the time it takes to load .bst files" I want to
observe *time per record* for each "thing" we want to benchmark, I want
to observe if we handle this in linear time or not, and I want to
compare that across versions of BuildStream.
We'll break down the results in terms of time per record and have some means of observing linearity. Some people, however, will only be concerned with how long it takes to build one project, e.g. freedesktop-sdk.
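To be concrete about what I mean by breaking results down into time per record and observing linearity, it's roughly the arithmetic below, applied to the scraped totals (the figures here are invented purely for illustration):

    # Total "time to load the whole project" at each scale, with Python
    # startup and `bst show` rendering already excluded via log scraping.
    # These numbers are made up for the sake of the example.
    totals = {1: 0.004, 10: 0.035, 100: 0.41, 1000: 6.2}   # seconds

    per_record = {n: t / n for n, t in totals.items()}

    # If loading were linear, per-record time would stay roughly flat as
    # the element count grows; a steadily rising ratio flags super-linear
    # behaviour worth profiling.
    scales = sorted(per_record)
    for small, large in zip(scales, scales[1:]):
        ratio = per_record[large] / per_record[small]
        print("{:>5} -> {:>5} elements: per-record time x{:.2f}".format(
            small, large, ratio))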

When I see in the above linked README file:

   "Configurable aspects:

      * Scale of generated projects, e.g. 1 file, 10 files, 100 files...
        lots of data points allow analyzing how a feature scales, but
        also means we have lots of data.
      ...
   "

This raises a flag for me; rather, I am interested in seeing the
results of every run of N items, where N is increasing, in one graph,
and this really should be the default (if it is configurable, it leads
me to suspect we cannot observe non-linear operations from a single
run of the benchmarks).
The benchmarks tool will allow you to write a configuration file which tests builds of 1, 10, 100, and 1000 components. I think it's unlikely we'll have anything wired into benchmarks to test log10 variations when that's easily done externally. I expect we'll have a log10 component size test as part of the standard benchmark, though.
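For reference, the generated projects will be along these lines. This is only a rough illustration of the shape of the thing, not Antoine's actual generator, and the element contents here are the bare minimum I'd expect BuildStream to accept:

    import os

    def generate_project(root, num_elements):
        # Write a trivial BuildStream project with num_elements stack
        # elements chained together by dependencies. Illustration only;
        # the real generator will produce more realistic projects.
        elements = os.path.join(root, 'elements')
        os.makedirs(elements, exist_ok=True)
        with open(os.path.join(root, 'project.conf'), 'w') as conf:
            conf.write('name: benchmark-project\n')
            conf.write('element-path: elements\n')
        for i in range(num_elements):
            path = os.path.join(elements, 'element-{}.bst'.format(i))
            with open(path, 'w') as bst:
                bst.write('kind: stack\n')
                if i > 0:
                    bst.write('depends:\n')
                    bst.write('- element-{}.bst\n'.format(i - 1))

    # One project per scale in the standard configuration:
    for scale in (1, 10, 100, 1000):
        generate_project('projects/scale-{}'.format(scale), scale)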


That said, configuring an upper bound on which numbers we want to
test (or a list of numbers of records) is interesting, so that we
avoid *requiring* that a developer run the benchmarks for hours and
hours.

Ultimately, what I want to see for a given "thing" that we measure, be
it loading a .bst file, staging a file into a sandbox, caching an
artifact, or whatnot, is always a time per record.

Allow me some ascii art here to illustrate more clearly what I am
hoping to see:

                         Loading bst files
  40ms +------------------------------------------------------------+
       |                                                            |
       |                                                            |
  20ms |                                                            |
       |                                                            |
       |                                                            |
  10ms |                                                     o      |
       |                                       o                    |
       |   o          o           o                                 |
   0ms +------------------------------------------------------------+
           |          |           |            |             |
       (1 file)  (10 files)  (100 files) (1,000 files) (10,000 files)


In the above, we would have some lines connecting the dots. Due to the
recursive operations we need to run for circular dependency detection,
and the presorting of dependencies on each element, this function will
most likely be non-linear, but we would ultimately want to make it
linear.

Each sample represented here is the "time it took to load the whole
project, divided by the number of elements being loaded", where the
"time to load the project" is isolated and does not include Python
startup time or the time it takes to, say, display the pipeline in
`bst show` (so we need to use the log parsing to isolate these things).

We can have multiple versions of BuildStream rendered into the same
graph above, with a legend showing which color corresponds to which
version of BuildStream being sampled, so we can easily see how a code
change has affected the performance of a given "thing".
I don't think output presentation is being considered at the moment; we intend to produce data in a convenient manner and you can display that with Excel, R or gnuplot as you see fit. There will be some graphs produced as part of the CI process, but I consider those to be an example of what can be done with the benchmarks tool, rather than a standard. Nonetheless, thanks for the graph above - it does clearly communicate what you want to see out of the benchmarks, which is very valuable.
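As an example of what I mean by displaying it as you see fit, something along these lines would reproduce your graph from the CSV output for several BuildStream versions at once (the file name and column names here are hypothetical):

    import csv
    from collections import defaultdict
    import matplotlib.pyplot as plt

    # Hypothetical CSV columns: version, num_elements, seconds_per_element
    series = defaultdict(list)
    with open('load-times.csv') as f:
        for row in csv.DictReader(f):
            series[row['version']].append(
                (int(row['num_elements']), float(row['seconds_per_element'])))

    for version, points in sorted(series.items()):
        points.sort()
        xs, ys = zip(*points)
        plt.plot(xs, ys, marker='o', label=version)

    plt.xscale('log')
    plt.xlabel('Number of .bst files')
    plt.ylabel('Load time per element (s)')
    plt.title('Loading bst files')
    plt.legend()
    plt.savefig('load-times.svg')
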
If we introduce randomization of data sets in here, which may be
important for generating more realistic data sets (for instance, it is
not meaningful to run the above test on a single target which directly
depends on 10,000 bst files; we need some realistic "depth" of
dependencies), then it becomes important to rerun the same sample of,
say, "10 files" many times (with different randomized datasets), and
to observe the average of those totals.
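Agreed on the randomization. The way I'd expect that to work is that each (scale, version) sample becomes the mean over several differently-seeded generated projects, roughly as below; generate and measure here are placeholders standing in for the project generation and log-scraping steps sketched earlier:

    import random
    import statistics

    def sample_scale(scale, generate, measure, runs=5):
        # Mean load time over several randomized projects at one scale.
        # `generate` and `measure` are placeholders for the project
        # generator and the log-scraped measurement, respectively.
        times = []
        for seed in range(runs):
            project = generate(scale, random.Random(seed))
            times.append(measure(project))
        return statistics.mean(times)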

In the future, we can also extend this to plot out separate graphs for
memory consumption; however, for accurate readings, we will need to
make some extensions to BuildStream's logging Message objects so that
memory consumption snapshots can optionally be observed and reported
at the right places.
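On the memory side, and only as a sketch of the kind of figure such an optional snapshot could carry (this is just the standard library, not a proposal for the actual Message API):

    import resource
    import time

    def memory_snapshot(label):
        # Peak RSS of the current process so far; ru_maxrss is reported
        # in kilobytes on Linux.
        usage = resource.getrusage(resource.RUSAGE_SELF)
        return {
            'label': label,
            'timestamp': time.time(),
            'peak-rss-kb': usage.ru_maxrss,
        }

    print(memory_snapshot('after loading pipeline'))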

At a high level, it's important to keep in mind that we use benchmarks
to identify bottlenecks (and then later we use profiling to inspect the
identified bottleneck to then optimize them).

At the same time, while it can be interesting in some way to observe
something simple like "the time it took to complete a well-known
build" of something, such numbers are not useful for identifying
bottlenecks and improving things (they can only act as a global
monitor of how well, or how badly, we perform).
Both these uses are equally important. The benchmarks repository should be there both for indicative, continuously-run benchmarks and for investigative work. In my experience almost all performance analysis requires custom benchmarks - even on previous projects where we had a third of the team dedicated to benchmarking and about 10 bare-metal servers per engineer running tests continuously, it was rare that the CI system would tell us anything useful for analysis. In most cases we'd have to alter the code under test to add extra instrumentation. So I wouldn't expect the benchmarks run as continuous integration to identify bottlenecks or point at ways to improve things; they should be there as a guard against unexpected performance changes.


In closing: I suspect that we are mostly all on the same page as to
what we are doing with the benchmarks initiative, but since it seems
to me that I have had a hard time communicating this in the past, and
since I have not had feedback in a long time and cannot measure the
efforts being spent here, I just feel that we have to make sure, once
more, that we are really still on the same page.

I think we have a potential difference in expectation between continuous testing and analysis work, but I'm hoping to create a tool that can be used for both. Thanks for all your input above, it will be a great help in putting requirements together.

Jim

