Re: Automated benchmarks of BuildStream





On 2018-03-31 09:20, Tristan Van Berkom wrote:
I made some comments on:

    https://gitlab.com/BuildStream/benchmarks/merge_requests/7

But that is really the wrong place, and to be clear: I don't want my
discussion there to be perceived as blocking the landing of that patch.

That said, I think this is important, and I'd like to reiterate that
comment in this thread a bit more clearly.

While I'm very happy with your response in general, I would really like
to see rendered output be a first-class citizen of this repository and
not just an exercise left to the user.

I'll just reply to some of the things from your mail inline, too:

On Mon, 2018-03-26 at 12:05 +0100, Jim MacArthur wrote:
[...]
I don't think output presentation is being considered at the moment; we
intend to produce data in a convenient manner and you can display that
with Excel, R or gnuplot as you see fit. There will be some graphs
produced as part of the CI process, but I consider those to be an
example of what can be done with the benchmarks tool, rather than a
standard.

[...]
In my experience almost all performance analysis requires custom
benchmarks - even on previous projects, when we had 1/3rd of the team
dedicated to benchmarking and about 10 bare-metal servers per engineer
running tests continuously, it was rare that the CI system would tell
us anything useful for analysis. In most cases we'd have to alter the
code under test to add extra instrumentation. So I wouldn't expect the
benchmarks run as continuous integration to identify bottlenecks or
point at ways to improve things; they should be there as a guard
against unexpected performance changes.

OK, so the more I think about the perspective shown in the above two
points, the more I perceive it as a problem.

One problem is that if the benchmarks are mostly data, and graphs are
just an exercise left to the user, then the distance between running
benchmarks and viewing the results is just too great - in other words,
nobody is going to notice performance differences in the various
previously analyzed places unless they go ahead and write the code to
plot the graph themselves - which just won't happen until a problem is
observed and investigated.

Another problem with this is that we are writing this with the
expectation of doing "throwaway" work - which I don't like.

I fully intend to produce some graphs automatically. However, I've never considered graphs anything other than indicative. In many ways they're the opposite of automation - taking data which can be used to automatically flag performance regressions and turning it into something only human-readable.
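
To make the automated side concrete, a comparison against a stored baseline is roughly what I have in mind - something like the sketch below. The file paths, the column names ("test", "median_seconds") and the 10% threshold are only illustrative, not the actual output format of the benchmarks tool:

    # Hypothetical sketch: flag regressions by comparing a current results CSV
    # against a stored baseline CSV. Paths, column names and the threshold are
    # illustrative only, not the real benchmarks output format.
    import csv
    import sys

    THRESHOLD = 1.10  # flag anything more than 10% slower than the baseline

    def load(path):
        with open(path, newline="") as f:
            return {row["test"]: float(row["median_seconds"])
                    for row in csv.DictReader(f)}

    def main(baseline_path, current_path):
        baseline = load(baseline_path)
        current = load(current_path)
        regressions = [(name, baseline[name], value)
                       for name, value in current.items()
                       if name in baseline and value > baseline[name] * THRESHOLD]
        for name, old, new in regressions:
            print(f"REGRESSION {name}: {old:.2f}s -> {new:.2f}s")
        return 1 if regressions else 0

    if __name__ == "__main__":
        sys.exit(main(sys.argv[1], sys.argv[2]))

Run as, say, "python flag_regressions.py baseline.csv current.csv"; a non-zero exit status could then fail a CI job without any human looking at a graph.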

We can add any type of graph later if anyone asks for one, and in the meantime we can produce the types which have already been suggested. The work to create a graph from a CSV table is about 30 seconds - and if anyone is doing that regularly, I'm happy to automate it. I also think anyone should be able to add new *metrics* (as you describe later), but I'd like to be more selective about graphs: 1-5 graphs are useful, 100 graphs aren't.
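
For illustration, the "30 seconds" looks roughly like this with matplotlib, assuming a results CSV with "commit" and "seconds" columns (that column layout is an assumption on my part, not the real schema):

    # Minimal sketch: plot a results CSV with matplotlib. The "commit" and
    # "seconds" column names are assumptions, not the real output schema.
    import csv
    import matplotlib.pyplot as plt

    with open("results.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    commits = [row["commit"] for row in rows]
    seconds = [float(row["seconds"]) for row in rows]

    plt.plot(commits, seconds, marker="o")
    plt.xlabel("BuildStream commit")
    plt.ylabel("wall-clock time (s)")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.savefig("benchmark.png")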


Where you say:

  "In most cases we'd have to alter the code under test to add extra 
   instrumentation."

I fully agree; what I'd like to see here, though, is:

  o A strategy for upstreaming new log messages to support new
    analysis, landed in both BuildStream and the benchmarks repo.

  o The benchmark plotting should just ignore versions of
    BuildStream which do not yet have support for a given measurement.

  o The ability to fine-tune which kinds of messages BuildStream will
    emit, such that one can run benchmarks against BuildStream with
    only certain messages turned on (in case benchmarking of some
    micro activities slows the whole process down significantly and
    skews other results).

While the above is not going to be possible for 100% of analysis, I
expect that it will come very, very close (otherwise, we are straying
from benchmarking territory, and moving into profiling territory).

It would be a shame, I think, if one were to do the analysis of:

  "The time it takes to run integration commands in compose elements
   in the case where a file is moved into a new directory as a result
   of the integration command - benchmarking whether time is
   exponential depending on the number of new directories created"

... and the result of that analysis were not integrated into a "benchmark suite".

Without first thinking about a policy for extending the suite, the
analysis in this scenario would be done only once, and after that point
the exercise is simply lost - that would really be lacking in foresight.

Rather, once we have done this analysis the first time, we should have
a policy in place for how to add it to the suite, such that we continue
to see these results in rendered graphs every time the full benchmarks
are run, and keep seeing them in 3, 5, or 10 years' time.

Does this make sense?

Yes. As above, adding new metrics to the benchmark suite should be an easy process. Adding more log entries or more instrumentation to BuildStream will have to be done with some care to avoid causing performance problems and code bloat. We'll probably want either a finer-grained log level or some toggles which can show or hide different kinds of output.
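
As a rough illustration of the toggle idea - none of these names exist in BuildStream today, it's only a sketch of the shape - instrumentation could be grouped by category and stay silent unless that category is being measured:

    # Hypothetical sketch only - none of these names exist in BuildStream.
    # The idea: timing messages are grouped by category, and a category that
    # is not being measured emits nothing and costs almost nothing.
    import time
    from contextlib import contextmanager

    ENABLED_CATEGORIES = {"integration-commands"}  # e.g. from a CLI option

    @contextmanager
    def timed(category, message):
        """Emit a timing message only if its category is enabled."""
        if category not in ENABLED_CATEGORIES:
            yield
            return
        start = time.monotonic()
        try:
            yield
        finally:
            print(f"[{category}] {message}: {time.monotonic() - start:.3f}s")

    # Usage at an instrumentation point:
    with timed("integration-commands", "ldconfig in compose element"):
        pass  # ... run the integration command ...

That keeps heavy instrumentation from skewing unrelated measurements, which is the concern you raise in your third point above.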

As for the benchmarks repository, I think we can put any number of tests or analysis scripts into the repository as long as they're curated well. We'll need some mechanism to prioritise tests, as the time available to run them will sometimes be limited. If we keep all the metrics and analysis we've ever done to avoid duplicating work, then we will need to review them periodically and archive older tests or adjust their priorities. Perhaps we'll do this every time BuildStream is released.
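
A minimal sketch of what prioritisation under a time budget could look like - the field names, test names and durations here are made up for illustration:

    # Hypothetical sketch of prioritising tests under a time budget; the
    # field names, test names and durations are illustrative only.
    TESTS = [
        {"name": "startup-time", "priority": 1, "estimated_seconds": 60},
        {"name": "build-alpine", "priority": 2, "estimated_seconds": 1800},
        {"name": "compose-integration-scaling", "priority": 3, "estimated_seconds": 3600},
    ]

    def select_tests(tests, budget_seconds):
        """Pick the highest-priority tests that fit in the time available."""
        selected, remaining = [], budget_seconds
        for test in sorted(tests, key=lambda t: t["priority"]):
            if test["estimated_seconds"] <= remaining:
                selected.append(test["name"])
                remaining -= test["estimated_seconds"]
        return selected

    print(select_tests(TESTS, budget_seconds=2000))
    # -> ['startup-time', 'build-alpine']

The periodic review you mention could then be as simple as adjusting priorities rather than deleting tests outright.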



Cheers,
    -Tristan

