Re: Fast factor 2 resampling



On Tue, 28 Mar 2006, Stefan Westerfeld wrote:

  Hi!

On Tue, Mar 28, 2006 at 03:27:14PM +0200, Tim Janik wrote:
On Mon, 6 Mar 2006, Stefan Westerfeld wrote:
I have worked quite a bit now on writing code for fast factor two
resampling based on FIR filters. So the task is similar to the last
posting I made. The differences are:

- I tried to really optimize for speed; I used gcc's SIMD primitives to
design a version that runs really fast on my machine:

model name      : AMD Athlon(tm) 64 Processor 3400+
stepping        : 10
cpu MHz         : 2202.916

running in 64bit mode.

- I put some more effort into designing the coefficients for the filter;
I used octave to do it; the specifications I tried to meet are listed
in the coeffs.h file.

hm, can you put up a description of how to derive the coefficients with
octave or some other tool then, so they can be reproduced by someone
else?

As I have done it, it requires extra octave code (a bunch of .m files
implementing the ultraspherical window). I've copypasted the code from a
paper, and hacked around until it worked (more or less) in octave.

But if we want to include it as octave code in the BEAST distribution,
it might be worth investing a little more work into this window so that
we can provide a matlab/octave implementation we really understand and
then can provide a C implementation as well, so it can be used from
BEAST directly.

ok, ok, first things first ;)

as far as i see, we only have a couple of use cases at hand; presumably a
comprehensive set of filter setups would be:
-  8bit:  48dB
- 12bit:  72dB
- 16bit:  96dB
- 20bit: 120dB
- 24bit: 144dB

if we have those 5 cases covered by coefficient sets, that'd be good enough
to check the stuff into CVS and have production-ready up/down sampling.
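The bit-depth to attenuation mapping in the list above is just the usual ~6.02 dB per bit (20·log10(2)), which the list rounds to 6 dB per bit. A quick sketch to reproduce it (the function name is mine, not from the discussed code):

```python
import math

def bits_to_db(bits):
    # dynamic range of a signal quantized to `bits` bits:
    # 20 * log10(2^bits) ~= 6.02 dB per bit
    return 20.0 * math.log10(2.0 ** bits)

for bits in (8, 12, 16, 20, 24):
    print("%2dbit: %6.2f dB" % (bits, bits_to_db(bits)))
```

For 16bit this gives 96.33 dB, matching the roughly -96 dB error target mentioned elsewhere in the thread.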

then, if the octave files and the paper you pasted from permit, it'd be good
to put the relevant octave/matlab files into CVS under LGPL, so the coefficient
creation process can be reconstructed later on (and by other contributors).

last, once all of the above has been settled and there is a valid use case
for creating these filters from C, the octave/matlab code can be translated
to be available during runtime. do you actually see a use case for this?

The resamplers are designed for streaming use; they do smart history
keeping. Thus a possible use case I designed them for would be to
upsample BEAST at the input devices and downsample BEAST at the output
devices.

The benefits of using the code for these tasks are:

- the filters are linear-phase

*why* exactly is this a benefit?

Linear phase filtering means three things:

* we do "real interpolation", in the sense that for factor 2 upsampling,
 every other sample is exactly kept as it is; this means that we don't
 have to compute it

* we keep the shape of the signal intact, thus operations that modify
 the shape of the signal (non-linear operations, such as saturation)
 will sound the same when oversampling them

* we have the same delay for all frequencies - not having the same
 delay for all frequencies may result in audible differences between
 the original and up/downsampled signal

   http://en.wikipedia.org/wiki/Group_delay

 gives a table which, however, seems to indicate that not being quite
 linear phase wouldn't lead to audible problems
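The first point can be made concrete: a linear-phase half-band filter has zeros at every even tap except the center, so after zero-stuffing, each even output sample is just the center tap times one input sample. A minimal pure-Python sketch (my own illustration with Hann-windowed coefficients, not the SSE code under discussion):

```python
import math

def halfband_coeffs(half_width=8):
    # Truncated ideal half-band lowpass with a Hann window:
    # h[0] = 0.5, h[n] = 0 for even n != 0,
    # h[n] = sin(pi*n/2)/(pi*n) * w[n] for odd n.
    h = []
    for i in range(2 * half_width + 1):
        n = i - half_width
        if n == 0:
            h.append(0.5)
        elif n % 2 == 0:
            h.append(0.0)  # linear-phase half-band: even taps vanish
        else:
            w = 0.5 + 0.5 * math.cos(math.pi * n / (half_width + 1))
            h.append(math.sin(math.pi * n / 2.0) / (math.pi * n) * w)
    return h

def upsample2(x, h):
    # zero-stuff, convolve, scale by 2; output is aligned so that
    # out[2*m] lines up with the input sample x[m]
    half_width = (len(h) - 1) // 2
    zx = [0.0] * (2 * len(x))
    for m, s in enumerate(x):
        zx[2 * m] = s
    out = []
    for i in range(len(zx)):
        acc = 0.0
        for k, c in enumerate(h):
            j = i - k + half_width
            if 0 <= j < len(zx):
                acc += c * zx[j]
        out.append(2.0 * acc)
    return out
```

Because every other tap that could touch an even output sample is exactly zero, `out[2*m] == x[m]` holds bit-exactly, which is why those samples need not be computed at all.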

ok, thanks for explaining this. we should actually have this and similar things
available in our documentation. either on a wiki page on synthesis,
or even a real documentation chapter about synthesis. thoughts?

- the implementation should be fast enough (at least on my machine)
- the implementation should be precise enough (near -96 dB error == 16
bit precision)

what is required to beef this up to -120dB, or provide an alternative
implementation. i'm asking because float or 24bit datahandles are not at
all unlikely for the future.

Why -120dB? 6 * 24 = 144...?

yeah, thanks for pointing this out. both are valid use cases, 20bit
samples and 24bit samples.

The first factor that influences the precision is of course the filter
(and the resampling code doesn't hardcode the filter coefficients). The
filter can be tweaked to offer a -144dB (or -120dB) frequency response
by redesigning the coefficients (with the octave method I used), but it
will then be longer (more delay, slower computation).

The second factor is the SSE code itself, because SSE limits us to
float precision. My implementation also uses a computation order that is
quite fast - but not too good for precision. Usually, for FIR filters it's
good to compute the influence of the small coefficients first and then the
influence of the larger ones. However, I compute the influence of the
coefficients in the order they occur in the impulse response.
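The effect of accumulation order on single-precision sums can be demonstrated with a contrived dot product (my own illustration, emulating float rounding in pure Python; the real code uses SSE registers):

```python
import struct

def f32(v):
    # round a Python float (double precision) to the nearest IEEE-754 single
    return struct.unpack('f', struct.pack('f', v))[0]

def dot_f32(coeffs, samples):
    # single-precision dot product, accumulated in the given order
    acc = 0.0
    for c, s in zip(coeffs, samples):
        acc = f32(acc + f32(f32(c) * f32(s)))
    return acc

# one large tap followed by many tiny ones, like a long impulse-response tail
coeffs = [1.0] + [1e-8] * 1000
samples = [1.0] * 1001
ref = sum(c * s for c, s in zip(coeffs, samples))  # double-precision reference

err_large_first = abs(dot_f32(coeffs, samples) - ref)
err_small_first = abs(dot_f32(coeffs[::-1], samples[::-1]) - ref)
```

Adding the large term first makes every subsequent tiny term fall below float epsilon and vanish; summing the small terms first lets them accumulate before meeting the large one, so the small-first order ends up orders of magnitude more accurate.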

In conclusion: it might be that SSE code - at least as implemented -
cannot attain the precision we desire for 24bit audio. How good it gets
probably can't be determined without trying it.

yeah, right. float might fall short on 20bit or 24bit (definitely the latter,
since 32bit floats have only 23bit of mantissa).
but as you say, we'll see once we have the other coefficient sets, and even
if at 144dB only the slow FPU variant can keep precision, the SSE code will
still speed up the most common use case which is 16bit.

what worries me a bit though is that you mentioned one of your machines
runs the SSE variant slower than the FPU variant. did you investigate
more here?

likewise, a 12bit variant may make sense as well for some handles (maybe
even an 8bit variant in case that's still significantly faster than the
12bit version).

That should be no problem, simply by designing new coefficients.

ok, as i understand, you're going to do this manually, using octave or
matlab as a tool for now, right?

                 gcc-4.0    gcc-3.4
  -------------+---------------------
  upsampling   |   2.82      2.85
  downsampling |   2.54      2.50
  oversampling |   2.70      2.64

  where oversampling is first performing upsampling and then
  performing downsampling. Note that there is a bug in gcc-3.3 which
  will not allow combining C++ code with SIMD instructions.

  The other output should be self-explanatory (if not, feel free to
  ask).

hm, these figures are pretty much meaningless without knowing:
- what exactly was performed that took 2.82 or 2.85
- what is the unit of those figures? milliseconds? hours? dollars?

These are speedup gains. A speedup gain is the factor between the "normal"
implementation and the SSE implementation.

ah, good you point this out.

If you want to know the times, and the number of samples processed in
that time, you should read the RESULTS file. It is much more detailed
than the table I gave above.

well, i did read through it now. first, what's oversampling? how's that
different from upsampling?
and second, reading through the output doesn't lend itself to proper
comparisons of the figures in a good way. compiling them into a table
would be better for that.
and third, since this is the output of a single run, i have no idea how
much those figures are affected by processor/system/etc. jitter, so even
if all the performance figures were next to each other, i'd still have
no idea within what ranges they are actually comparable.

i.e. a bit of postprocessing on your side would have helped to make the
information acquired properly digestible ;)

i agree that the output itself is nicely verbose though.

(a) Accuracy tests, which compare the expected result with the actual
  result; you'll see that using SIMD instructions means a small
  loss of precision, but it should be acceptable. It occurs because
  the code doesn't use doubles to store the accumulated sum, but
  floats, to enable SIMD speedup.

what's the cost of using doubles for intermediate values anyway (is that
possible at all?)
and what does the precision loss mean in dB?

The non-SSE implementation does use doubles for intermediate values. The
SSE implementation could only use doubles if we rely on some higher
version of SSE (I think SSE2 or SSE3). However, the price of doing it
would be that the vectorized operations don't do four operations at
once, but two. That means it would become a lot slower to use SSE at
all.

depending on sse2 also limits portability, e.g. out of 2 laptops here, only
1 has sse2 (both have sse), and out of 2 athlons here only one has sse (and
none sse2). the story is different with mmx of course, which is supported by
all 4 processors...

As outlined above, the "real" performance loss is hard to predict.
However, I can give you one sample here for the -96dB filter:

max difference between correct and computed output: 0.000012 = -98.194477 dB

accuracy test for factor 2 upsampling using SSE instructions

max difference between correct and computed output: 0.000012 = -98.080477 dB

thanks. now lets see those figures for 120dB and 144dB ;)

As you see, the variant which uses doubles for intermediate values is
not much better than the SSE variant, and both fulfill the spec without
problems.

have you by any chance benched the FPU variant with doubles against the
FPU variant with floats btw?

However, as dB is a logarithmic measure, care has to be taken when
extrapolating what it would mean for a -144dB (or -120dB) filter.
And the other aspects that affect precision I mentioned above will also
affect the result.

yeah, agree.

but then, what you're sending here is still pretty rough and
looks cumbersome to deal with.
can you please provide more details on the exact API you intend to add
(best is to have this in bugzilla), and give precise build instructions
(best is usually down to the level of shell commands, so the reader just
needs to paste those).

I've uploaded a more recent version of the sources to bugzilla: #336366.
It also contains build instructions for the standalone thingy. For
ssefir.h, I added documentation comments

/**
*...
*/

for those functions/classes that may be interesting for others. I also
marked a few more functions protected, so that only the interesting part
of the main classes, Upsampler2 and Downsampler2, remains public.

thanks for the good work, will have a look at it later.

  Cu... Stefan

---
ciaoTJ


