Re: Fixed point cairo.. or no cairo?

From: Behdad Esfahbod <behdad behdad org>
To: michael meeks novell com
Cc: performance-list gnome org
Subject: Re: Fixed point cairo.. or no cairo?
Date: Mon, 21 Aug 2006 11:54:56 -0400

On Mon, 2006-08-21 at 07:16 -0400, Michael Meeks wrote:
> On Mon, 2006-08-21 at 12:49 +0300, Aivars Kalvans wrote:
> > I did include simple C program that produced such result in previous
> > mail. It uses http://en.wikipedia.org/wiki/RDTSC to count CPU ticks.
> 
> 	Ah - I missed that; sorry. I get a huge amount of jitter from these
> numbers - but I guess the lowest numbers are static; I imagine
> time-slicing screws up RDTSC, indeed - (for me) these numbers look
> unreliable enough to be pretty scary ;-)

Exactly.  The first time I tried it I got only zeros.  I have now played
with it some more, and here are my findings:

  - If I compile with anything other than -O0, all zeros are printed: at
any optimization level gcc optimizes the inline assembly away.
Incidentally, an unoptimized build is hardly the best thing to profile.

  - I added a call to the function before the measurement, to resolve
the symbol first.

  - I get:

branching:              61 cycles
function call:          690 cycles
inline multiply:        548 cycles

  - Needless to say, that's junk, which is hardly unexpected on a
HZ=1000 system.  With a raised scheduling priority:

  $ sudo nice -n -30 ./a
branching:              61 cycles
function call:          128 cycles
inline multiply:        60 cycles

That's almost as good as I can get.  The branching and inline multiply
values do not change across runs.  The function call value fluctuates in
the range 130..180, which means the FPU in this Centrino system sucks
more than the ones in AMD64s.
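
The all-zeros output under optimization matches gcc's documented
behaviour: an asm statement that has outputs but is not marked volatile
may be deleted, merged with an identical one, or moved.  A minimal sketch
of a counter read that should survive -O2 follows.  The name read_ticks
and the clock_gettime() fallback for non-x86 targets are assumptions for
illustration, not part of the original test program:

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: return the current value of a tick counter. */
static inline uint64_t read_ticks(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t lo, hi;
    /* volatile tells the compiler not to delete, merge, or hoist the asm.
       Separate "=a"/"=d" outputs work on both 32-bit and 64-bit x86. */
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    /* Assumed fallback: nanoseconds from a monotonic clock, not cycles. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}
```

Two back-to-back calls, as in `start = read_ticks(); end = read_ticks();`,
then give the measurement overhead even in an optimized build.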

  - Adding G_UNLIKELY to the branch doesn't change the results, as
optimization is off.

  - Inspecting the generated assembly, it's clear that the high number
for branching comes from saving the 64-bit timer value to memory.  So I
added an empty measurement block to measure that overhead:

    asm(".byte 0x0f, 0x31" : "=A" (start));
    asm(".byte 0x0f, 0x31" : "=A" (end));
    printf("overhead:\t\t%llu cycles\n", end - start);

I actually put in two of them to make sure the cache line involved is
already in cache.  This results in (taking the lowest of multiple runs):

$ sudo nice -n -30 ./a
overhead:               44 cycles
overhead:               44 cycles
branching:              50 cycles
function call:          125 cycles
inline multiply:        60 cycles

Now that looks a lot better!  The 44 values are fixed.  I sometimes get
60 instead of 50 for branching.  I think that has to do with cache
lines, but I'm not sure.

This means:

  branching (with no optimization): 6 cycles
  inline multiply: 16 cycles
  function call overhead: 65 cycles (on top of the inline multiply)

That's a lot more realistic I guess.
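
The subtraction behind those three figures can be spelled out as
follows.  The helper names are hypothetical, and reading the 65 as
125 - 60 (function call minus inline multiply) is an interpretation
that happens to reproduce the number exactly:

```c
#include <stdint.h>
#include <stdio.h>

/* Subtract a baseline from a raw reading, clamping at zero. */
static uint64_t net_cycles(uint64_t raw, uint64_t baseline)
{
    return raw > baseline ? raw - baseline : 0;
}

/* Print net figures from the lowest readings quoted in this mail. */
static void print_net_costs(void)
{
    const uint64_t overhead = 44, branching = 50, multiply = 60, call = 125;

    printf("branching:              %llu cycles\n",
           (unsigned long long)net_cycles(branching, overhead));
    printf("inline multiply:        %llu cycles\n",
           (unsigned long long)net_cycles(multiply, overhead));
    /* Assumed baseline: the inline multiply, not the empty block. */
    printf("function call overhead: %llu cycles\n",
           (unsigned long long)net_cycles(call, multiply));
}
```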

-- 
behdad
http://behdad.org/

"Commandment Three says Do Not Kill, Amendment Two says Blood Will Spill"
        -- Dan Bern, "New American Language"
