Re: Fast factor 2 resampling



On Mon, 6 Mar 2006, Stefan Westerfeld wrote:

  Hi!

I have worked quite a bit now on writing code for fast factor two
resampling based on FIR filters. So the task is similar to the last
posting I made. The differences are:

- I tried to really optimize for speed; I used gcc's SIMD primitives to
 design a version that runs really fast on my machine:

 model name      : AMD Athlon(tm) 64 Processor 3400+
 stepping        : 10
 cpu MHz         : 2202.916

 running in 64bit mode.

- I put some more effort into designing the coefficients for the filter;
 I used octave to do it; the specifications I tried to meet are listed
 in the coeffs.h file.

hm, can you put up a description of how to derive the coefficients with
octave (or with some other tool) then, so they can be reproduced by someone
else?

The resamplers are designed for streaming use; they do smart history
keeping. Thus a possible use case I designed them for would be to
upsample BEAST at the input devices and downsample BEAST at the output
devices.
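
(To make the streaming/history-keeping idea concrete, here is a minimal
scalar sketch of such a factor-2 upsampler; the class name and API are
made up for illustration and are not the API of the attached code, which
uses a polyphase structure and SIMD instead of the naive zero-stuffing
done here.)

    #include <vector>

    // Hypothetical sketch, not the attached code: a streaming factor-2
    // upsampler that keeps the last (order - 1) zero-stuffed samples so
    // that consecutive process_block() calls behave like one long stream.
    class Upsampler2Sketch
    {
      std::vector<float> taps;     // FIR coefficients (assumed non-empty)
      std::vector<float> history;  // last order - 1 zero-stuffed samples
    public:
      Upsampler2Sketch (const std::vector<float>& coeffs)
        : taps (coeffs), history (coeffs.size() - 1, 0.f) {}

      // reads n input samples, writes 2 * n output samples
      void process_block (const float *input, unsigned int n, float *output)
      {
        std::vector<float> buf (history);        // old samples first
        for (unsigned int i = 0; i < n; i++)     // zero-stuff new samples
          {
            buf.push_back (2 * input[i]);        // gain 2 compensates stuffing
            buf.push_back (0.f);
          }
        for (unsigned int j = 0; j < 2 * n; j++) // plain FIR dot product
          {
            float accum = 0;
            for (unsigned int k = 0; k < taps.size(); k++)
              accum += taps[k] * buf[j + k];
            output[j] = accum;
          }
        // remember the tail for the next call
        history.assign (buf.end() - (taps.size() - 1), buf.end());
      }
    };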

The benefits of using the code for these tasks are:

- the filters are linear-phase

*why* exactly is this a benefit?

- the implementation should be fast enough (at least on my machine)
- the implementation should be precise enough (near -96 dB error == 16
 bit precision)

what would be required to beef this up to -120 dB, or to provide an
alternative implementation? i'm asking because float or 24bit datahandles
are not at all unlikely in the future.
likewise, a 12bit variant may make sense as well for some handles (maybe
even an 8bit variant, in case that's still significantly faster than the
12bit version).
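
(As a rough guide for mapping these dB figures to bit depths: one bit of
precision corresponds to 20*log10(2) ~= 6 dB, so

    -96 dB  ~= 16 bit
    -120 dB ~= 20 bit
    -72 dB  ~= 12 bit
    -48 dB  ~=  8 bit

i.e. a -120 dB target sits between the 16 bit case and a full 24 bit
case at roughly -144 dB.)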

The downside may be the delay of the filters.

I put some effort into making this code easy to test, with four kinds of
tests:

(p) Performance tests measure how fast the code runs

   I tried it on my machine with both gcc-3.4 and gcc-4.0; you'll see the
   results below. The speedup gain achieved using SIMD instructions
   (SSE3 or whatever AMD64 uses) is

                  gcc-4.0    gcc-3.4
   -------------+---------------------
   upsampling   |   2.82      2.85
   downsampling |   2.54      2.50
   oversampling |   2.70      2.64

   where oversampling means first performing upsampling and then
   performing downsampling. Note that there is a bug in gcc-3.3 which
   does not allow combining C++ code with SIMD instructions.

   The other output should be self-explanatory (if not, feel free to
   ask).

hm, these figures are pretty much meaningless without knowing:
- what exactly was performed that took 2.82 or 2.85
- what is the unit of those figures? milliseconds? hours? dollars?

(a) Accuracy tests, which compare what should be the result with what is
   the result; you'll see that using SIMD instructions means a small
   loss of precision, but it should be acceptable. It occurs because
   the code doesn't use doubles to store the accumulated sum, but
   floats, to enable SIMD speedup.

what's the cost of using doubles for intermediate values anyway (is that
possible at all?)
and what does the precision loss mean in dB?
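
(To make the float-vs-double accumulation point concrete: a scalar FIR
dot product with a double accumulator might look like the sketch below;
the name is made up and this is not the attached code. The SSE path has
to keep the running sum in four packed floats instead, which is where
the small extra rounding error comes from.)

    // Sketch only: FIR dot product with a double accumulator.  Each
    // float * float product is exact when carried out in double; only
    // the final result is rounded back to float, once.
    static inline float
    fir_dot_product_double_accum (const float *input, const float *taps,
                                  unsigned int n_taps)
    {
      double accum = 0.0;                       // 53-bit mantissa
      for (unsigned int k = 0; k < n_taps; k++)
        accum += (double) input[k] * taps[k];
      return (float) accum;
    }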

(g) Gnuplot; much the same as accuracy, but it writes out data which
    can be plotted by gnuplot. So it is possible to "see" the
    interpolation error, rather than just getting it as output.

(i) Impulse response; this one can be used for debugging - it will give
    the impulse response of the combined (sub- and oversampling)
    system, so you can for instance see the delay in the time domain or
    plot the filter response in the frequency domain.
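
(Measuring such an impulse response for a streaming resampler of the
shape sketched further up boils down to feeding in a single 1.0 followed
by zeros and printing what comes out; again just a sketch, not the
attached test code:)

    #include <cstdio>
    #include <vector>

    // Sketch: dump the impulse response of the hypothetical upsampler
    // sketched above in a gnuplot-friendly "index value" format.
    void print_impulse_response (Upsampler2Sketch &up, unsigned int n)
    {
      std::vector<float> in (n, 0.f), out (2 * n);
      in[0] = 1.f;                              // unit impulse
      up.process_block (&in[0], n, &out[0]);
      for (unsigned int i = 0; i < out.size(); i++)
        printf ("%u %.9g\n", i, out[i]);
    }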

So I am attaching all code and scripts that I produced so far. For
compiling, I use g++ -O3 -funroll-loops as options; however, I suppose
on x86 machines you need to tell the compiler to generate SSE code.
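
(On x86 that would presumably mean something along the lines of

    g++ -O3 -funroll-loops -msse -mfpmath=sse ...

with the actual source file names filled in; -msse and -mfpmath=sse are
the flags that make gcc emit SSE code for floats on 32bit x86.)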

I just tried it briefly on my laptop, and the SSE version there is much
slower than the non-SSE version. Currently I can't say why this is. I
know that AMD64 has extra registers compared to standard SSE. However,
I designed the inner loop (fir_process_4samples_sse) keeping in mind
not to use more than 8 registers: out0..out3, input, taps, and an
intermediate sum/product - that is 7 registers. Well, maybe AMD64 isn't
faster because of more registers, but because of better addressing
modes, or whatever.
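
(For readers who have not looked at the attachment yet: the rough shape
of such a 4-samples-at-once inner loop, written with gcc's vector
extensions, could look like the sketch below. Names, data layout and the
unaligned-load handling are made up here; the actual
fir_process_4samples_sse in the attached code will differ.)

    // Sketch only, not the attached fir_process_4samples_sse: compute 4
    // adjacent output samples of a FIR dot product with gcc's 4-float
    // vector type.
    typedef float v4sf __attribute__ ((vector_size (16)));

    static inline void
    fir_process_4samples_sketch (const float *input,   // input incl. history
                                 const v4sf  *taps,    // coefficients, 4 per vector
                                 unsigned int n_taps,  // multiple of 4
                                 float       *output)  // 4 output samples
    {
      v4sf out0 = { 0, }, out1 = { 0, }, out2 = { 0, }, out3 = { 0, };
      for (unsigned int i = 0; i < n_taps / 4; i++)
        {
          v4sf in0, in1, in2, in3;
          // unaligned-safe loads via memcpy; the real code presumably
          // arranges its data so that aligned loads can be used instead
          __builtin_memcpy (&in0, input + i * 4 + 0, sizeof in0);
          __builtin_memcpy (&in1, input + i * 4 + 1, sizeof in1);
          __builtin_memcpy (&in2, input + i * 4 + 2, sizeof in2);
          __builtin_memcpy (&in3, input + i * 4 + 3, sizeof in3);
          out0 += taps[i] * in0;   // element-wise multiply-accumulate
          out1 += taps[i] * in1;
          out2 += taps[i] * in2;
          out3 += taps[i] * in3;
        }
      // horizontal sums of the four partial accumulators
      union { v4sf v; float f[4]; } u0 = { out0 }, u1 = { out1 },
                                    u2 = { out2 }, u3 = { out3 };
      output[0] = u0.f[0] + u0.f[1] + u0.f[2] + u0.f[3];
      output[1] = u1.f[0] + u1.f[1] + u1.f[2] + u1.f[3];
      output[2] = u2.f[0] + u2.f[1] + u2.f[2] + u2.f[3];
      output[3] = u3.f[0] + u3.f[1] + u3.f[2] + u3.f[3];
    }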

Maybe it's just my laptop being slow, and other non-AMD64 systems will
perform better.

Maybe we need to write three versions of the inner loop. One for AMD64,
one for x86 with SSE and one for FPU.

In any case, I invite you to try it out, play with the code, and give
feedback about it.

first, i'd like to thank you for working on this.

but then, what you're sending here is still pretty rough and
looks cumbersome to deal with.
can you please provide more details on the exact API you intend to add
(best is to have this in bugzilla), and give precise build instructions
(best is usually down to the level of shell commands, so the reader just
needs to paste those).

also, more details of what exactly your performance tests do and how
to use them would be appreciated.

  Cu... Stefan

---
ciaoTJ


