Re: Fast factor 2 resampling



   Hi!

On Tue, Mar 28, 2006 at 03:27:14PM +0200, Tim Janik wrote:
> On Mon, 6 Mar 2006, Stefan Westerfeld wrote:
> >I have worked quite a bit now on writing code for fast factor two
> >resampling based on FIR filters. So the task is similar to the last
> >posting I made. The differences are:
> >
> >- I tried to really optimize for speed; I used gcc's SIMD primitives to
> > design a version that runs really fast on my machine:
> >
> > model name      : AMD Athlon(tm) 64 Processor 3400+
> > stepping        : 10
> > cpu MHz         : 2202.916
> >
> > running in 64bit mode.
> >
> >- I put some more effort into designing the coefficients for the filter;
> > I used octave to do it; the specifications I tried to meet are listed
> > in the coeffs.h file.
> 
> hm, can you put up a description about how to derive the coefficients with
> octave or with some other tool then. so they can be reproduced by someone
> else?

As I have done it, it requires extra octave code (a bunch of .m files
implementing the ultraspherical window). I copy-pasted the code from
a paper and hacked around until it worked (more or less) in octave.

But if we want to include it as octave code in the BEAST
distribution, it might be worth investing a little more work into
this window, so that we can provide a matlab/octave implementation we
really understand, and then a C implementation as well, so it can be
used from BEAST directly.

> >The resamplers are designed for streaming use; they do smart history
> >keeping. Thus a possible use case I designed them for would be to
> >upsample BEAST at the input devices and downsample BEAST at the output
> >devices.
> >
> >The benefits of using the code for these tasks are:
> >
> >- the filters are linear-phase
> 
> *why* exactly is this a benefit?

Linear phase filtering means three things:

* we do "real interpolation", in the sense that for factor 2 upsampling,
  every other sample is exactly kept as it is; this means that we don't
  have to compute it

* we keep the shape of the signal intact, thus operations that modify
  the shape of the signal (non-linear operations, such as saturation)
  will sound the same when oversampling them

* we have the same delay for all frequencies - not having the same
  delay for all frequencies may result in audible differences between
  the original and up/downsampled signal

    http://en.wikipedia.org/wiki/Group_delay

  gives a table which, however, seems to indicate that being "not quite"
  linear phase wouldn't lead to audible problems
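
To illustrate the first point: a linear phase half-band filter has only
one non-zero even-indexed tap (the center), so factor 2 upsampling
decomposes into copying input samples and one short FIR for the samples
in between. A minimal sketch (no SIMD, no history keeping; the function
name and signature are made up for illustration, the real code in
ssefir.h is organized differently):

#include <cstddef>

/* in:       n_in input samples
 * taps_odd: the odd polyphase branch of a symmetric half-band lowpass
 *           (the even branch reduces to a single unit center tap)
 * out:      2 * n_in output samples
 */
void
upsample2_sketch (const float *in, size_t n_in,
                  const float *taps_odd, size_t n_taps,
                  float *out)
{
  const size_t delay = n_taps / 2;
  for (size_t n = 0; n < n_in; n++)
    {
      /* even output samples: kept exactly, no arithmetic needed */
      out[2 * n] = (n >= delay) ? in[n - delay] : 0.0f;

      /* odd output samples: FIR over the odd polyphase branch */
      float sum = 0.0f;
      for (size_t j = 0; j < n_taps; j++)
        if (n >= j)
          sum += taps_odd[j] * in[n - j];
      out[2 * n + 1] = sum;
    }
}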
 
> >- the implementation should be fast enough (at least on my machine)
> >- the implementation should be precise enough (near -96 dB error == 16
> > bit precision)
> 
> what is required to beef this up to -120dB, or provide an alternative
> implementation. i'm asking because float or 24bit datahandles are not at
> all unlikely for the future.

Why -120dB? 6 * 24 = 144...?
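
(Each bit of precision is worth 20 * log10(2) ~= 6.02 dB, so

  16 bit -> 16 * 6.02 ~=  96 dB
  24 bit -> 24 * 6.02 ~= 144 dB

which is where the -96dB spec above comes from.)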

The first factor that influences the precision is of course the
filter (and the resampling code doesn't hardcode the filter
coefficients). The filter can be tweaked to offer a -144dB (or
-120dB) frequency response by redesigning the coefficients (with the
octave method I used); it will then be longer (more delay, slower
computation).

The second factor is the SSE code itself, because SSE limits us to
float precision. My implementation also uses a computation order that
is quite fast, but not too good for precision: usually, for FIR
filters, it's good to accumulate the influence of the small
coefficients first and then that of the larger ones. However, I
compute the influence of the coefficients in the order they occur in
the impulse response.
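
To make the ordering issue concrete, here is a sketch (names made up,
not from ssefir.h) contrasting the two accumulation orders for one
FIR output sample:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

static bool
abs_less (float a, float b)
{
  return std::fabs (a) < std::fabs (b);
}

/* fast order: accumulate in impulse response order (what the SSE
 * code does); a small contribution added to an already large sum
 * can lose its low-order bits */
float
dot_in_order (const std::vector<float>& h, const std::vector<float>& x)
{
  float sum = 0.0f;
  for (size_t i = 0; i < h.size(); i++)
    sum += h[i] * x[i];
  return sum;
}

/* precise order: accumulate the small products first, so they are
 * not swallowed by a large partial sum; clearly slower */
float
dot_small_first (const std::vector<float>& h, const std::vector<float>& x)
{
  std::vector<float> products (h.size());
  for (size_t i = 0; i < h.size(); i++)
    products[i] = h[i] * x[i];
  std::sort (products.begin(), products.end(), abs_less);
  float sum = 0.0f;
  for (size_t i = 0; i < products.size(); i++)
    sum += products[i];
  return sum;
}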

In conclusion: it might be that the SSE code - at least as
implemented - cannot attain the precision we desire for 24bit audio.
How good it gets probably can't be determined without trying it.

> likewise, a 12bit variant may make sense as well for some handles (maybe
> even an 8bit variant in case that's still significantly faster than the
> 12bit version).

That should be no problem; it can be done simply by designing new
coefficients.

> >The downside may be the delay of the filters.
> >
> >I put some effort into making this code easy to test, with four kinds of
> >tests:
> >
> >(p) Performance tests measure how fast the code runs
> >
> >   I tried on my machine with both: gcc-3.4 and gcc-4.0; you'll see the
> >   results below. The speedup gain achieved using SIMD instructions
> >   (SSE3 or whatever AMD64 uses) is
> >
> >                  gcc-4.0    gcc-3.4
> >   -------------+---------------------
> >   upsampling   |   2.82      2.85
> >   downsampling |   2.54      2.50
> >   oversampling |   2.70      2.64
> >
> >   where oversampling is first performing upsampling and then
> >   performing downsampling. Note that there is a bug in gcc-3.3 which
> >   will not allow combining C++ code with SIMD instructions.
> >
> >   The other output should be self-explanatory (if not, feel free to
> >   ask).
> 
> hm, these figures are pretty much meaningless without knowing:
> - what exactly was performed that took 2.82 or 2.85
> - what is the unit of those figures? milliseconds? hours? dollars?

These are speedup gains. A speedup gain is the factor between the
"normal" implementation and the SSE implementation:

speedup_gain = time_normal / time_sse

It has no unit, because the "seconds" unit of the two times cancels
in the division.

If you want to know the times, and the number of samples processed in
that time, you should read the RESULTS file. It is much more detailed
than the table I gave above.
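
For example, taking the pu/puf upsampling times quoted further below:

speedup_gain = 3.667876 / 1.346511 ~= 2.72

(the exact factor varies a bit between runs and compilers, which
presumably explains the difference to the 2.82 in the table).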

> >(a) Accuracy tests, which compare what should be the result with what is
> >   the result; you'll see that using SIMD instructions means a small
> >   loss of precision, but it should be acceptable. It occurs because
> >   the code doesn't use doubles to store the accumulated sum, but
> >   floats, to enable SIMD speedup.
> 
> what's the cost of using doubles for intermediate values anyway (is that
> possible at all?)
> and what does the precision loss mean in dB?

The non-SSE implementation does use doubles for intermediate values.
The SSE implementation could only use doubles if we relied on some
higher version of SSE (I think SSE2 or SSE3). However, the price
would be that the vectorized operations don't do four operations at
once, but only two. That means the SSE version would become a lot
slower.
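
For illustration (using intrinsics here; the actual code uses gcc's
SIMD primitives, but the register widths are the same): a 128 bit SSE
register holds four floats, while doubles need SSE2 and only fit two
per register, halving the data parallelism:

#include <xmmintrin.h>   /* SSE:  4 x float per register  */
#include <emmintrin.h>   /* SSE2: 2 x double per register */

void
example (const float *a, const float *b, const double *c, const double *d,
         float *fresult, double *dresult)
{
  /* four float multiplies per instruction */
  __m128 fprod = _mm_mul_ps (_mm_loadu_ps (a), _mm_loadu_ps (b));
  _mm_storeu_ps (fresult, fprod);

  /* only two double multiplies per instruction */
  __m128d dprod = _mm_mul_pd (_mm_loadu_pd (c), _mm_loadu_pd (d));
  _mm_storeu_pd (dresult, dprod);
}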

As outlined above, the "real" precision loss is hard to predict.
However, I can give you one data point here for the -96dB filter:

$ ssefir au
accuracy test for factor 2 upsampling using FPU instructions
input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz)
max difference between correct and computed output: 0.000012 = -98.194477 dB
$ ssefir auf
accuracy test for factor 2 upsampling using SSE instructions
input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz)
max difference between correct and computed output: 0.000012 = -98.080477 dB

As you see, the variant which uses doubles for intermediate values is
not much better than the SSE variant, and both fulfill the spec without
problems.
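
(The dB figures are just the linear error converted logarithmically:

error_db = 20 * log10 (max_difference)

the last digits differ from 20 * log10 (0.000012) ~= -98.4 because
the printed linear error is rounded.)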

However, as dB is a logarithmic measure, care must be taken when
extrapolating what this would mean for a -144dB (or -120dB) filter.
The other aspects affecting precision mentioned above will also
influence the result.

> but then, what you're sending here is still pretty rough and
> looks cumbersome to deal with.
> can you please provide more details on the exact API you intend to add
> (best is to have this in bugzilla), and give precise build instructions
> (best is usually down to the level of shell commands, so the reader just
> needs to paste those).

I've uploaded a more recent version of the sources to bugzilla: #336366.
It also contains build instructions for the standalone thingy. For
ssefir.h, I added documentation comments

/**
 *...
 */

for those functions/classes that may be interesting for others. I also
marked a few more functions protected, so that only the interesting part
of the main classes, Upsampler2 and Downsampler2, remains public.
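
To give an idea of the intended streaming use - note that while
Upsampler2 and Downsampler2 are the real class names, the method name
and signature below are hypothetical, made up for illustration; the
real API is documented in ssefir.h:

#include "ssefir.h"

void
oversample_block (Upsampler2& up, Downsampler2& down,
                  const float *in, float *out, unsigned int n_samples)
{
  float tmp[2 * 1024];   /* assumes n_samples <= 1024 */

  /* hypothetical call: n_samples in -> 2 * n_samples oversampled */
  up.process_block (in, n_samples, tmp);

  /* ... apply non-linear processing to tmp at 2 * samplerate ... */

  /* hypothetical call: 2 * n_samples -> n_samples out */
  down.process_block (tmp, 2 * n_samples, out);
}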

> also, more details of what exactly your performance tests do and how
> to use them would be appreciated.

Basically, they perform the resampling for the same block of data
500000 times; the block size can be modified. By timing this
operation, a throughput can be computed, which can then be given as
samples per second, or for instance as the CPU usage for resampling a
single 44100 Hz stream.
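
A rough sketch of the measurement (the constants and the callback
name are placeholders; the real tests are more detailed):

#include <stdio.h>
#include <time.h>

void
measure (void (*process_block) (const float *in, float *out,
                                unsigned int n_samples))
{
  const unsigned int block_size = 1024, repetitions = 500000;
  static float in[1024], out[2 * 1024];

  clock_t start = clock();
  for (unsigned int i = 0; i < repetitions; i++)
    process_block (in, out, block_size);
  double secs = (double) (clock() - start) / CLOCKS_PER_SEC;

  double samples_per_sec = (double) block_size * repetitions / secs;
  double streams = samples_per_sec / 44100.0;
  printf ("samples / second = %f\n", samples_per_sec);
  printf ("which means the resampler can process %.2f 44100 Hz streams\n",
          streams);
  printf ("or one 44100 Hz stream takes %f %% CPU usage\n",
          100.0 / streams);
}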

If you run the shell script (or read the RESULTS file I attached to
the initial mail), you may understand it a bit better, because the
output is somewhat verbose. On my system:

$ ssefir pu
performance test for factor 2 upsampling using FPU instructions
  (performance will be normalized to upsampler input samples)
  total samples processed = 64000000
  processing_time = 3.667876
  samples / second = 17448790.501572
  which means the resampler can process 395.66 44100 Hz streams simultaneously
  or one 44100 Hz stream takes 0.252740 % CPU usage
$ ssefir puf
performance test for factor 2 upsampling using SSE instructions
  (performance will be normalized to upsampler input samples)
  total samples processed = 64000000
  processing_time = 1.346511
  samples / second = 47530250.673020
  which means the resampler can process 1077.78 44100 Hz streams simultaneously
  or one 44100 Hz stream takes 0.092783 % CPU usage

The arguments here are:

p = performance
u = upsampling
f = fast -> SSE implementation

Run ssefir without arguments for help.

   Cu... Stefan
-- 
Stefan Westerfeld, Hamburg/Germany, http://space.twc.de/~stefan


