Re: Gtk+ unit tests (brainstorming)



On Tue, 31 Oct 2006 10:26:41 -0800, Carl Worth wrote:
> On Tue, 31 Oct 2006 15:26:35 +0100 (CET), Tim Janik wrote:
> > i.e. using averaging, your numbers include uninteresting outliers
> > that can result from scheduling artefacts (like measuring a whole second
> > for copying a single pixel), and they hide the interesting information,
> > which is the fastest possible performance encountered for your test code.
>
> If computing an average, it's obviously very important to eliminate
> the slow outliers, because they will otherwise skew it radically. What
> cairo-perf is currently doing for outliers is really cheesy,
> (ignoring a fixed percentage of the slowest results). One thing I
> started on was to do adaptive identification of outliers based on the
> "> Q3 + 1.5 * IQR" rule as discussed here:
>
> 	http://en.wikipedia.org/wiki/Outlier
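
A minimal sketch of that rule in Python (hypothetical helper names; the
actual cairo-perf implementation is C inside the suite itself). Only the
slow tail is discarded, since spuriously *fast* timings aren't physically
possible for this kind of measurement:

```python
def quartiles(samples):
    """Return (Q1, Q3), interpolating between order statistics."""
    s = sorted(samples)

    def q(p):
        # Fractional index of the p-th quantile
        i = p * (len(s) - 1)
        lo = int(i)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (i - lo) * (s[hi] - s[lo])

    return q(0.25), q(0.75)

def drop_slow_outliers(samples):
    """Discard samples above Q3 + 1.5 * IQR (the rule above),
    keeping the original sample order for the survivors."""
    q1, q3 = quartiles(samples)
    cutoff = q3 + 1.5 * (q3 - q1)
    return [x for x in samples if x <= cutoff]
```

For example, drop_slow_outliers([10, 11, 10, 12, 980]) discards the
980 (a scheduling artifact, say) and keeps [10, 11, 10, 12].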

For reference (or curiosity), in cairo's performance suite, I've now
changed the cairo-perf program, (which does "show me the performance
for the current cairo revision"), to report minimum (and median) times
and it does do the adaptive outlier detection mentioned above.

But when I take two of these reports generated separately and compare
them, I'm still seeing more noise than I'd like to see, (things like a
40% change when I _know_ that nothing in that area has changed).

I think one problem that is happening here is that even though we're
doing many iterations for any given test, we're doing them all right
together so some system-wide condition might affect all of them and
get captured in the summary.

So I've now taken a new approach which is working much better. What
I'm doing now for cairo-perf-diff which does "show me the performance
difference between two different revisions of cairo" is to save the
raw timing for every iteration of every test. Statistics are then
generated only at comparison time, which makes it easy to go back and
append additional data if some of the results look off. This approach
has several advantages:

 * I can append more data only for tests where the results look bad,
   so that's much faster.

 * I can run fewer iterations in the first place, since I'll be
   appending more later as needed. This makes the whole process much
   faster.

 * Appending data later means that I'm temporally separating runs for
   the same test and library version, so I'm more immune to random
   system-wide disturbances.

 * Also, when re-running the suite with only a small subset of the
   tests, the two versions of the library are compared at very close
   to the same time, so system-wide changes are less likely to make a
   difference in the result.
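The scheme above might be sketched like this (hypothetical file layout
and function names, not cairo-perf-diff's actual code): raw samples are
only ever appended under a (revision, test) key, and the summary
statistic is computed over the accumulated pool at comparison time.

```python
import json
import os

CACHE_DIR = "perf-cache"  # hypothetical location for raw samples

def cache_path(revision, test):
    return os.path.join(CACHE_DIR, revision, test + ".json")

def append_samples(revision, test, new_samples):
    """Append raw per-iteration timings; never overwrite, so later
    runs accumulate alongside (and temporally separated from)
    earlier ones."""
    path = cache_path(revision, test)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    samples = []
    if os.path.exists(path):
        with open(path) as f:
            samples = json.load(f)
    samples.extend(new_samples)
    with open(path, "w") as f:
        json.dump(samples, f)

def summarize(revision, test):
    """Compute statistics only at comparison time, over everything
    accumulated so far; the minimum is the interesting number."""
    with open(cache_path(revision, test)) as f:
        return min(json.load(f))

def diff(rev_a, rev_b, test):
    """Relative change in the summary from rev_a to rev_b."""
    a, b = summarize(rev_a, test), summarize(rev_b, test)
    return (b - a) / a
```

Re-running a suspicious test is then just another append_samples()
call; the comparison automatically picks up the larger pool.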

I'm really happy with the net result now. I don't even bother worrying
about not using my laptop while the performance suite is running
anymore, since it's quick and easy to correct problems later. And when
some of the results look funny, I re-run just
those tests, and sure enough the goofy stuff just disappears,
(validating my assumption that it was bogus), or it sticks around no
matter how many times I re-run it, (leading me to investigate and
learn about some unexpected performance impact).

And it caches all of those timing samples so it doesn't have to
rebuild or re-run the suite to compare against something it has seen
before, (the fact that git has hashes just sitting there for the
content of every directory made this easy and totally free). The
interface looks like this:

# What's the performance impact of the latest commit?
cairo-perf-diff HEAD

# How has performance changed from 1.2.0 to 1.2.6? from 1.2.6 to now?
cairo-perf-diff 1.2.0 1.2.6
cairo-perf-diff 1.2.6 HEAD

# As above, but force a re-run even though there's cached data:
cairo-perf-diff -f 1.2.6 HEAD

# As above, but only re-run the named tests:
cairo-perf-diff -f 1.2.6 HEAD -- stroke fill

The same ideas could be implemented with any library performance
suite, and with pretty much any revision control system. It is handy
that git makes it easy to name ranges of commits. So, if I wanted a
commit-by-commit report of every change that is unique to some branch,
(let's say, what's on HEAD since 1.2 split off), I could do something
like this:

for rev in $(git rev-list 1.2.6..HEAD); do
	cairo-perf-diff $rev
done

-Carl

PS. Yes, it is embarrassing that no matter what the topic I end up
plugging git eventually.
