Re: [cairo] Re: Better coverage from cairo performance suite (and some results)

On Thu, 05 Oct 2006 19:55:01 -0500, Federico Mena Quintero wrote:
> Are those SVGs available anywhere?

The script is here:

As Vlad discusses, the results so far aren't that useful, so I won't
bother linking to any results yet.

> [  0]    image-rgba               paint_solid_rgb_over-64      0.008  0.03%   100
> Is [0] really 100 iterations of 0.008 milliseconds each?

Yes, that really is what it's claiming there. But I do not claim that
that result is accurate---see below.

> One *huge* problem I had in the initial Pango benchmarks is that they
> had way too much noise from the rest of the system.  I was getting
> wildly different numbers each time.

I don't think that's a problem I'm having.

What I'm doing here is 100 independent iterations of each test, timing
each one (using CPU performance counters when available). After the
100 runs, I discard the slowest 15% of them (on the assumption that
they are outliers due to various kinds of system interference), and
then I report the standard deviation over the 85% of runs that are
left. That's the percentage column above.

So, the very low standard deviation shows that the times I'm getting
back from each iteration are very consistent.
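
The discard-and-report step described above can be sketched roughly as
follows. This is only an illustration in Python, not the suite's own
code: the function name trimmed_stats is hypothetical, and reporting
the mean of the kept runs, with the standard deviation expressed as a
percentage of that mean, is an assumption about how the columns are
derived.

```python
import statistics

def trimmed_stats(times, discard_fraction=0.15):
    """Drop the slowest runs, then summarize the runs that are left.

    Returns (mean, std_dev_percent, runs_kept). Using the mean and a
    population standard deviation is an assumption for this sketch.
    """
    if not times:
        raise ValueError("need at least one timing")
    ordered = sorted(times)
    # Number of slowest runs to throw away (15 out of 100 by default).
    n_discard = round(len(ordered) * discard_fraction)
    kept = ordered[:len(ordered) - n_discard]
    mean = statistics.fmean(kept)
    # Standard deviation of the kept runs, as a percentage of their mean.
    std_pct = 100.0 * statistics.pstdev(kept) / mean if mean else 0.0
    return mean, std_pct, len(kept)

# 85 consistent runs plus 15 runs disturbed by the rest of the system.
samples = [0.008] * 85 + [0.5] * 15
mean, std_pct, kept = trimmed_stats(samples)
print(f"{mean:.3f} ms  {std_pct:.2f}%  kept {kept} of {len(samples)}")
```

Sorting first means the discarded runs are always the slowest ones,
wherever they occurred in the sequence.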

Now, in spite of that, a measurement of 8 microseconds still is way
too low to be reliable. There is overhead in the timing framework
itself (such as function calls to start and stop the timer), so
you're definitely justified in being suspicious of that.
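
To get a feel for that kind of overhead, one rough approach is to time
an empty region many times and look at the smallest result. The sketch
below is an illustration only, under its own assumptions: it uses
Python's time.perf_counter_ns instead of the CPU performance counters
the suite uses, and timer_overhead_ns is a hypothetical name.

```python
import time

def timer_overhead_ns(samples=10000):
    """Smallest observed cost, in nanoseconds, of a start/stop pair
    wrapped around an empty region."""
    best = None
    for _ in range(samples):
        start = time.perf_counter_ns()
        stop = time.perf_counter_ns()
        delta = stop - start
        # Keep the minimum: it is the least disturbed measurement.
        if best is None or delta < best:
            best = delta
    return best

print(f"timer start/stop overhead: about {timer_overhead_ns()} ns")
```

Any test whose reported time is not comfortably larger than this
figure is mostly measuring the timer.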

In fact, I spent some time yesterday measuring what the overhead
is. The conclusion I came to is that for the image backend any result
that is not several tens of microseconds or more is unreliable, and for
the xlib backend a result that is not at least in the hundreds of
microseconds is not reliable. Here are the details of how I came up
with those values:

So I'll change the paint test to not bother with testing any image
size smaller than 256x256, since those tests are currently reporting
times of 28 microseconds or less (compared to 100 microseconds for a
size of 256x256).

I've considered making the test suite automatically measure the
overhead or otherwise adaptively adjust the number of iterations a
test should run for in order to get reliable results. But I don't like
the impact that would have on how tests would have to be written, so
instead I think I'll just hard-code the thresholds I've measured and
have the test suite issue an "unreliable result" warning if a test
ever reports a time that's less than a particular backend's minimum.
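
A minimal sketch of what such a hard-coded check might look like,
written in Python for illustration. The table name, the function name,
and above all the threshold values are placeholders chosen to be
roughly in line with the ranges mentioned above; they are not the
measured minimums.

```python
# Hypothetical per-backend minimums, in microseconds. Placeholder
# values only, not the thresholds actually measured.
MIN_RELIABLE_US = {
    "image": 30.0,
    "xlib": 100.0,
}

def reliability_warning(backend, time_us):
    """Return a warning string if time_us is below the backend's
    minimum, or None if the result is acceptable or the backend has
    no known minimum."""
    minimum = MIN_RELIABLE_US.get(backend)
    if minimum is not None and time_us < minimum:
        return (f"unreliable result: {time_us:g} us is below the "
                f"{backend} minimum of {minimum:g} us")
    return None

for size, time_us in [(64, 8.0), (256, 100.0)]:
    warning = reliability_warning("image", time_us)
    print(f"paint {size}x{size}: {warning or 'ok'}")
```

Keeping the thresholds in one table means individual tests need no
changes; only the reporting step consults it.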

>                                     The problem is that the test was
> running only for a few seconds; simply increasing the number of
> iterations so that the whole test suite runs for a few minutes (instead
> of less than 10 seconds) gave me very stable numbers.

The tunable parameter I have now is the 100 "outer" iterations which
are there to be able to measure and report on how stable the numbers
are. I'm suspicious of any measurement that doesn't include something
like a standard deviation or some other "margin of error" with it.

I don't have a suite-wide parameter for the "inner" iterations because
I would like to be able to tune each test independently to get it to
run as quickly as possible while still returning a reliable result.

And can't it be the case that _reducing_ the length of individual
tests can make the results more reliable? For example, if the run time
can be short enough to make the probability of the most significant
system disturbance low, then we can eliminate those disturbances
completely by discarding the same percentage of the slowest
results. Right?
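
A back-of-the-envelope model of that argument, as a sketch only. It
assumes disturbances arrive independently at a fixed average rate (a
Poisson process), which is a simplification and not something that has
been measured; the function names and the rate used in the example are
hypothetical.

```python
import math

def hit_probability(run_seconds, disturbances_per_second):
    """Chance that at least one disturbance lands inside a single run,
    under the assumed Poisson model."""
    return 1.0 - math.exp(-disturbances_per_second * run_seconds)

def trimming_likely_sufficient(run_seconds, disturbances_per_second,
                               discard_fraction=0.15):
    """True when the expected share of disturbed runs is smaller than
    the share of slowest runs that gets discarded."""
    return hit_probability(run_seconds, disturbances_per_second) < discard_fraction

# Assumed rate of 10 disturbances per second, for illustration.
for run_seconds in (0.001, 0.01, 0.1, 1.0):
    p = hit_probability(run_seconds, 10.0)
    ok = trimming_likely_sufficient(run_seconds, 10.0)
    print(f"run of {run_seconds:g} s: {100 * p:.1f}% of runs hit, "
          f"trimming sufficient: {ok}")
```

Under this model shorter runs are hit less often, so a fixed 15%
discard is more likely to remove every disturbed run. The number of
disturbed runs in any one batch of 100 still varies around the
expected share, so the check is indicative, not a guarantee.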

> How long does the test suite take to run on your machine?  If it takes
> seconds rather than minutes, I'd be somewhat suspicious of the numbers
> it gives you.  If so, just increase the number of iterations.  You may
> not get accurate timings if num_iters*time_per_test is very close to the
> kernel's HZ.

I just ran it and it took over 34 minutes. (This is with an XAA X
server and I'm quite sure that it was a lot faster with my KAA X
server, but I haven't actually measured that.) Another way of looking
at that is that the test suite really only takes about 20 seconds to
run, but we do that 100 times in order to be able to report the
standard deviation for each test. I'm guessing we might be able to get
away with fewer than 100 iterations here and still have useful
numbers, but I haven't investigated that much yet.

And I am very interested in getting results that are as reliable as
possible. So if people can point out mistakes in the technique we're
using, or improvements that can be made, I'd be glad to hear them.

> This is *excellent* stuff, Carl.  Keep us posted :)

Thanks, I will. By the way, most of the interesting work here is not
my doing at all. So many thanks to the following individuals:

* Vladimir Vukicevic - Earlier cairo performance suite that provided
  inspiration (and initial code). First attempt at significant
  coverage in the test suite.

* Benjamin Otte - Proposal (and implementation) of many of the details
  of the approach toward measurement and statistics reporting.

* David Schleef - Performance counter implementation (from liboil).

* Keith Packard - Guidance on how to synchronize with X server.


