On Thu, 05 Oct 2006 19:55:01 -0500, Federico Mena Quintero wrote:
> Are those SVGs available anywhere?

The script is here:

http://lists.freedesktop.org/archives/cairo/2006-September/007863.html

As Vlad discusses, the results so far aren't that useful, so I won't bother linking to any results yet.

> [ 0]  image-rgba  paint_solid_rgb_over-64  0.008  0.03%  100
>
> Is [0] really 100 iterations of 0.008 milliseconds each?

Yes, that really is what it's claiming there. But I do not claim that that result is accurate---see below.

> One *huge* problem I had in the initial Pango benchmarks is that they
> had way too much noise from the rest of the system. I was getting
> wildly different numbers each time.

I don't think that's a problem I'm having. What I'm doing here is 100 independent iterations of each test, timing each one, (using CPU performance counters when available). After the 100 runs, I discard the slowest 15% of the runs, (on the assumption that they are outliers due to various kinds of system interference), and then I report the standard deviation over the 85% of runs that are left. That's the percentage column above.

So the very low standard deviation shows that the times I'm getting back from each iteration are very consistent.

Now, in spite of that, a measurement of 8 microseconds is still far too small to be reliable. There is overhead in the timing framework itself, (such as the function calls to start and stop the timer), so you're definitely justified in being suspicious of that. In fact, I spent some time yesterday measuring what that overhead is. The conclusion I came to is that for the image backend any result that is not at least several tens of microseconds is unreliable, and for the xlib backend any result that is not at least in the hundreds of microseconds is unreliable. Here are the details on how I came up with those values:

http://lists.freedesktop.org/archives/cairo/2006-October/008119.html

So I'll change the paint test to not bother testing any image size smaller than 256x256, since those tests are currently reporting times of 28 microseconds or less, (compared to 100 microseconds for a size of 256x256).

I've considered making the test suite automatically measure the overhead, or otherwise adaptively adjust the number of iterations each test runs in order to get a reliable result. But I don't like the impact that would have on how tests would have to be written, so instead I think I'll just hard-code the thresholds I've measured and have the test suite issue an "unreliable result" warning whenever a test reports a time below a particular backend's minimum, (there's a sketch of what I mean at the end of this message).

> The problem is that the test was
> running only for a few seconds; simply increasing the number of
> iterations so that the whole test suite runs for a few minutes (instead
> of less than 10 seconds) gave me very stable numbers.

The tunable parameter I have now is the 100 "outer" iterations, which exist to measure and report how stable the numbers are. I'm suspicious of any measurement that doesn't come with a standard deviation or some other "margin of error". I don't have a suite-wide parameter for the "inner" iterations because I would like to be able to tune each test independently, to get it to run as quickly as possible while still returning a reliable result.

And can't it be the case that _reducing_ the length of individual tests makes the results more reliable? For example, if the run time is short enough that the probability of hitting the most significant system disturbances is low, then we can eliminate those disturbances entirely by discarding the same percentage of the slowest results. Right?
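In case it helps to see that discard-and-report step in code, here's a rough sketch in C, (all of the names here are invented for illustration; this is not the actual cairo-perf code, and the 15% figure is just the value I quoted above):

    #include <stdlib.h>
    #include <math.h>

    static int
    compare_times (const void *a, const void *b)
    {
        double ta = *(const double *) a;
        double tb = *(const double *) b;
        return (ta > tb) - (ta < tb);
    }

    /* Given 'count' raw per-iteration timings, sort them, discard the
     * slowest 15% as suspected system-interference outliers, then
     * compute mean and standard deviation over the runs that remain.
     * Note: sorts 'times' in place. */
    static void
    report_stats (double *times, int count, double *mean, double *stddev)
    {
        int    kept = count - count * 15 / 100;  /* keep fastest 85% */
        double sum = 0.0, var = 0.0;
        int    i;

        qsort (times, count, sizeof (double), compare_times);

        for (i = 0; i < kept; i++)
            sum += times[i];
        *mean = sum / kept;

        for (i = 0; i < kept; i++)
            var += (times[i] - *mean) * (times[i] - *mean);
        *stddev = sqrt (var / kept);
    }

The percentage column in the output is then just the standard deviation expressed relative to the mean.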
> How long does the test suite take to run on your machine? If it takes
> seconds rather than minutes, I'd be somewhat suspicious of the numbers
> it gives you. If so, just increase the number of iterations. You may
> not get accurate timings if num_iters*time_per_test is very close to the
> kernel's HZ.

I just ran it and it took over 34 minutes. (This is with an XAA X server; I'm quite sure it was a lot faster with my KAA X server, but I haven't actually measured that.)

Another way of looking at it is that the test suite really only takes about 20 seconds to run, but we do that 100 times in order to be able to report the standard deviation for each test. I'm guessing we might be able to get away with fewer than 100 iterations here and still have useful numbers, but I haven't investigated that much yet.

And I am very interested in getting results that are as reliable as possible. So if people can point out mistakes in the technique we're using, or improvements that can be made, I'd be glad to hear them.

> This is *excellent* stuff, Carl. Keep us posted :)

Thanks, I will. By the way, most of the interesting work here is not my doing at all. So many thanks to the following individuals:

 * Vladimir Vukicevic - Earlier cairo performance suite that provided
   inspiration, (and initial code). First attempt at significant
   coverage in the test suite.

 * Benjamin Otte - Proposal (and implementation) of much of the details
   of the approach toward measurement and statistics reporting.

 * David Schleef - Performance counter implementation (from liboil).

 * Keith Packard - Guidance on how to synchronize with the X server.

-Carl
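P.S. For the curious, the "unreliable result" warning mentioned above would look something like the following. This is just an untested sketch with invented names; the threshold values are placeholders standing in for the measurements I described, ("several tens of microseconds" for image, "hundreds of microseconds" for xlib):

    #include <stdio.h>
    #include <string.h>

    /* Minimum time (in microseconds) below which a reported result is
     * dominated by timing-framework overhead. Placeholder values based
     * on the measurements in
     * http://lists.freedesktop.org/archives/cairo/2006-October/008119.html */
    static double
    min_reliable_time_us (const char *backend)
    {
        if (strcmp (backend, "image") == 0)
            return 30.0;   /* "several tens of microseconds" */
        if (strcmp (backend, "xlib") == 0)
            return 100.0;  /* "hundreds of microseconds" */
        return 0.0;        /* no threshold measured yet */
    }

    /* Warn if a test's reported time falls below its backend's
     * hard-coded minimum for a reliable measurement. */
    static void
    check_result (const char *backend, const char *test, double time_us)
    {
        double min_us = min_reliable_time_us (backend);

        if (time_us < min_us)
            fprintf (stderr,
                     "Warning: unreliable result for %s/%s: "
                     "%g us is below this backend's %g us minimum\n",
                     backend, test, time_us, min_us);
    }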