Re: Fast factor 2 resampling



   Hi!

On Tue, Mar 28, 2006 at 08:06:07PM +0200, Tim Janik wrote:
> On Tue, 28 Mar 2006, Stefan Westerfeld wrote:
> >>>- I put some more effort into designing the coefficients for the filter;
> >>>I used octave to do it; the specifications I tried to meet are listed
> >>>in the coeffs.h file.
> >>
> >>hm, can you put up a description about how to derive the coefficients with
> >>octave or with some other tool then. so they can be reproduced by someone
> >>else?
> >
> >As I have done it, it requires extra octave code (a bunch of .m files
> >implementing the ultraspherical window). I've copypasted the code from a
> >paper, and hacked around until it worked (more or less) in octave.
> >
> >But if we want to include it as octave code in the BEAST distribution,
> >it might be worth investing a little more work into this window so that
> >we can provide a matlab/octave implementation we really understand and
> >then can provide a C implementation as well, so it can be used from
> >BEAST directly.
> 
> ok, ok, first things first ;)
> 
> as far as i see, we only have a couple use cases at hand, supposedly
> comprehensive filter setups are:
> -  8bit:  48dB
> - 12bit:  72dB
> - 16bit:  96dB
> - 20bit: 120dB
> - 24bit: 144dB
> 
> if we have those 5 cases covered by coefficient sets, that'd be good enough
> to check the stuff in to CVS and have production ready up/down sampling.

Yes, these sound reasonable. Although picking which filter setup to use
may not be as easy as looking at the precision of the input data.

For example, ogg input data could be resampled with the 96dB coefficients for
performance reasons, or 8bit input data could be resampled with a higher
order filter to get better transition steepness.

Anyway, I'll design coefficients for these 5 cases, and if we want to
have more settings later on, we still can design new coefficients.

> then, if the octave files and the paper you pasted from permit, it'd be good
> to put the relevant octave/matlab files into CVS under LGPL, so the 
> coefficient
> creation process can be reconstructed later on (and by other contributors).

I've asked the author of the paper, and he said we can put his code into
our LGPL project. I still need to polish the octave code a bit, because I
broke it somewhat when porting it from matlab to octave.

The original version has more readable/understandable filter design
parameters than my version; I hope I can get the octave code back to that state.

> last, once all of the above has been settled and there is a valid use case
> for creating these filters from C, the octave/matlab code can be translated
> to be available during runtime. do you actually see a use case for this?

One case I can think of is a FIR module with a GUI that allows designing
custom filters. But even then, tweaking the ultraspherical window
parameters may be too much work for everyday use. The typical user will
probably prefer "some" window that is reasonably good without a lot of
tweaking over an "almost optimal" window - such as the ultraspherical -
that requires a lot of manual tweaking.

One could also try to automate the tweaking of the two window parameters
by using some kind of search algorithm. Then, it would be as easy to use
as a normal window.

> >Linear phase filtering means three things:
> >
> >* we do "real interpolation", in the sense that for factor 2 upsampling,
> > every other sample is exactly kept as it is; this means that we don't
> > have to compute it
> >
> >* we keep the shape of the signal intact, thus operations that modify
> > the shape of the signal (non-linear operations, such as saturation)
> > will sound the same when oversampling them
> >
> >* we have the same delay for all frequencies - not having the same
> > delay for all frequencies may result in audible differences between
> > the original and up/downsampled signal
> >
> >   http://en.wikipedia.org/wiki/Group_delay
> >
> > gives a table, which however seems to indicate that "not being quite"
> > linear phase wouldn't lead to audible problems
> 
> ok, thanks for explaining this. we should have this and similar things
> available in our documentation actually. either on a wiki page on synthesis,
> or even a real documentation chapter about synthesis. thoughts?

Maybe a new doxi file on synthesis details? I could write a few
paragraphs on the resampler.
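
To give an idea what such a paragraph could explain: for factor 2
upsampling with a symmetric (linear phase) half-band FIR, every other
filter coefficient is zero, so the even output samples are just the input
samples copied through. A rough sketch of that structure (hypothetical
code, not the actual resampler):

// Rough sketch (hypothetical code, not the actual resampler): factor 2
// upsampling with a symmetric half-band FIR. Every other coefficient is
// zero, so the even output samples keep the input samples exactly and
// only the odd output samples need to be computed.
#include <vector>
#include <cstddef>

void
upsample2 (const std::vector<float>& in,
           const std::vector<float>& odd_taps,   // the non-zero coefficients
           std::vector<float>&       out)
{
  out.assign (in.size() * 2, 0.0f);
  const int center = odd_taps.size() / 2;

  for (size_t i = 0; i < in.size(); i++)
    {
      out[2 * i] = in[i];          // "real interpolation": sample kept exactly

      double accum = 0.0;          // interpolate the sample in between
      for (size_t k = 0; k < odd_taps.size(); k++)
        {
          int j = int (i) + int (k) - center;
          if (j >= 0 && j < int (in.size()))   // boundary handling simplified
            accum += odd_taps[k] * in[j];
        }
      out[2 * i + 1] = float (accum);
    }
}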

> >Why -120dB? 6 * 24 = 144...?
> 
> yeah, thanks for pointing this out. both are valid use cases, 20bit
> samples and 24bit samples.

Although, by the way, -120dB should be ok for almost any practical use
case, because the human ear probably won't be able to hear the difference.

Since these are relative values (unlike when talking about integer
precisions for samples), even signals which are not very loud will get
really good resampling.

Thus the error scenario looks like this: a loud desired signal (a sine
wave at 0 dB) plus a small error signal (a sine wave at -120 dB). I doubt
that the human ear can pick up the error signal; I even doubt it for the
-96 dB case. But we could perform listening tests to find out.
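
For reference, the dB figures in the accuracy tests below are just the
maximum difference between reference and computed output, converted via
20*log10, roughly like this (a sketch, not the actual test code):

// Sketch (not the actual accuracy test code) of how an error figure like
// "0.000012 = -98 dB" is obtained from two signals.
#include <algorithm>
#include <cmath>
#include <cstddef>

double
error_db (const float *reference, const float *computed, size_t n)
{
  double max_diff = 0.0;
  for (size_t i = 0; i < n; i++)
    max_diff = std::max (max_diff, double (std::fabs (reference[i] - computed[i])));
  return 20.0 * std::log10 (max_diff);   // e.g. max_diff = 0.000012 -> about -98 dB
}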

> yeah, right. float might fall short on 20bit or 24bit (definitely the 
> latter,
> since 32bit floats have only 23bit of mantissa).
> but as you say, we'll see once we have the other coefficient sets, and even
> if at 144dB only the slow FPU variant can keep precision, the SSE code will
> still speed up the most common use case which is 16bit.

Yes. We need to try it once I have the coefficient sets. As I argued
above, the errors may be well below what the human ear can perceive.

> what worries me a bit though is that you mentioned one of your machines
> runs the SSE variant slower than the FPU variant. did you investigate
> more here?

Not yet.

> >>likewise, a 12bit variant may make sense as well for some handles (maybe
> >>even an 8bit variant in case that's still significantly faster than the
> >>12bit version).
> >
> >That should be no problems, simply by designing new coefficients.
> 
> ok, as i understand, you're going to do this manually, using octave or
> matlab as a tool now, right?

Yes.

> >>>                 gcc-4.0    gcc-3.4
> >>>  -------------+---------------------
> >>>  upsampling   |   2.82      2.85
> >>>  downsampling |   2.54      2.50
> >>>  oversampling |   2.70      2.64
> >>>
> >>>  where oversampling is first performing upsampling and then
> >>>  performing downsampling. Note that there is a bug in gcc-3.3 which
> >>>  will not allow combining C++ code with SIMD instructions.
> >>>
> >>>  The other output should be self-explaining (if not, feel free to
> >>>  ask).
> >>
> >>hm, these figures are pretty much meaningless without knowing:
> >>- what exactly was performed that took 2.82 or 2.85
> >>- what is the unit of those figures? milli seconds? hours? dollars?
> >
> >These are speedup gains. A speedup gain is a factor between the "normal"
> >implementation and the SSE implementation.
> 
> ah, good you point this out.
> 
> >If you want to know the times, and the number of samples processed in
> >that time, you should read the RESULTS file. It is much more detailed
> >than the table I gave above.
> 
> well, i did read through it now. first, what's oversampling? how's that
> different from upsampling?

Oversampling is first upsampling a 44100 Hz signal to 88200 Hz, and then
downsampling it again to 44100 Hz. It's what I first designed the filters
for: oversampling in the engine. Thus I benchmarked it as a separate
case.

> and second, reading through the output doesn't lend itself for proper
> comparisons of the figures in a good way. compiling them into a table
> would be better for that.

Right.

> and third, since this is the output of a single run, i have no idea how
> much those figures are affected by processor/system/etc. jitter, so even
> if all the performance figures were next to each other, i'd still have
> no idea within what ranges they are actually comparable.

Well, the idea of putting the code in bugs.gnome.org was that you could
see for yourself how much jitter etc. it produces.

> >>>(a) Accuracy tests, which compare what should be the result with what is
> >>>  the result; you'll see that using SIMD instructions means a small
> >>>  loss of precision, but it should be acceptable. It occurs because
> >>>  the code doesn't use doubles to store the accumulated sum, but
> >>>  floats, to enable SIMD speedup.
> >>
> >>what's the cost of using doubles for intermediate values anyway (is that
> >>possible at all?)
> >>and what does the precision loss mean in dB?
> >
> >The non-SSE implementation does use doubles for intermediate values. The
> >SSE implementation could only use doubles if we rely on some higher
> >version of SSE (I think SSE2 or SSE3). However, the price of doing it
> >would be that the vectorized operations don't do four operations at
> >once, but two. That means it would become a lot slower to use SSE at
> >all.
> 
> depending on sse2 also limits portability, e.g. out of 2 laptops here, only
> 1 has sse2 (both have sse), and out of 2 athlons here only one has sse (and
> none sse2). the story is different with mmx of course, which is supported by
> all 4 processors...

But MMX only accelerates integer operations, which doesn't help much for
our floating point based data handles.
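
Just to illustrate where the factor four comes from in the SSE case: with
single precision, one SSE instruction handles four taps of the FIR sum at
once, roughly like this sketch (not the actual resampler code):

// Sketch (not the actual resampler code): SSE1 inner loop of a FIR dot
// product, four float multiply-adds per vector instruction, float
// accumulator. Assumes n_taps is a multiple of 4.
#include <xmmintrin.h>

static float
fir_dot_sse (const float *input, const float *taps, int n_taps)
{
  __m128 accum = _mm_setzero_ps();
  for (int k = 0; k < n_taps; k += 4)
    accum = _mm_add_ps (accum, _mm_mul_ps (_mm_loadu_ps (input + k),
                                           _mm_loadu_ps (taps + k)));
  float lanes[4];
  _mm_storeu_ps (lanes, accum);
  return lanes[0] + lanes[1] + lanes[2] + lanes[3];   // horizontal sum
}

With doubles (SSE2) each of these steps would only cover two values, so
the vector speedup would be halved before even considering portability.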

> >As outlined above, the "real" performance loss is hard to predict.
> >However, I can give you one sample here for the -96dB filter:
> 
> >max difference between correct and computed output: 0.000012 = -98.194477 
> >dB
> 
> >accuracy test for factor 2 upsampling using SSE instructions
> 
> >max difference between correct and computed output: 0.000012 = -98.080477 
> >dB
> 
> thanks. now lets see those figures for 120dB and 144dB ;)

Later. I want to cleanup the octave files first (see above), before
doing the final design of the coefficients.

> >As you see, the variant which uses doubles for intermediate values is
> >not much better than the SSE variant, and both fulfill the spec without
> >problems.
> 
> have you by any chance benched the FPU variant with doubles against the
> FPU variant with floats btw?

Well, I tried it now: the FPU variant without doubles is quite a bit (15%)
faster than the variant which uses doubles as intermediate values.

If you want *really cool* speedups, you can use gcc-4.1 with float
temporaries, -ftree-vectorize and -ffast-math. That auto vectorization
thing really works, and replaces the FPU instructions with SSE
instructions automagically. It's not much slower than my hand crafted
version. But then again, we wanted the FPU variant precisely so that we
have an FPU variant, right?
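
To be concrete about what "float temporaries" means: the loop the
autovectorizer gets to see is just the plain scalar FIR sum, along the
lines of this sketch (not the actual code); the only difference between
the two FPU runs below is the type of the accumulator:

// Sketch of the plain FPU inner loop (not the actual code). With
// -ffast-math -ftree-vectorize, gcc-4.1 turns this into SSE code on its
// own; the "temporary DOUBLES" variant simply uses 'double accum' instead.
static float
fir_dot_fpu (const float *input, const float *taps, int n_taps)
{
  float accum = 0.0f;
  for (int k = 0; k < n_taps; k++)
    accum += input[k] * taps[k];
  return accum;
}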

I haven't benchmarked a version which uses double filter coefficients,
because it would have required a lot of rewriting to get it to work with
my current source tree.

But here are the tests I described above:

### FPU CODE with temporary DOUBLES.
$ g++-4.1 -o ssefir ssefir.cc -O3 -funroll-loops
$ ./ssefir pu
performance test for factor 2 upsampling using FPU instructions
  (performance will be normalized to upsampler input samples)
  total samples processed = 64000000
  processing_time = 3.804806
  samples / second = 16820831.364426
  which means the resampler can process 381.42 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.262175 % CPU usage

### FPU CODE with temporary FLOATS.
$ g++-4.1 -o ssefir ssefir.cc -O3 -funroll-loops
$ ./ssefir pu
performance test for factor 2 upsampling using FPU instructions
  (performance will be normalized to upsampler input samples)
  total samples processed = 64000000
  processing_time = 3.317954
  samples / second = 19288996.585134
  which means the resampler can process 437.39 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.228628 % CPU usage

### SSE CODE with temporary FLOATS generated by AUTOVECTORIZER
$ g++-4.1 -o ssefir ssefir.cc -O3 -funroll-loops -ffast-math -ftree-vectorize
$ ./ssefir pu
performance test for factor 2 upsampling using FPU instructions
  (performance will be normalized to upsampler input samples)
  total samples processed = 64000000
  processing_time = 1.482929
  samples / second = 43157831.814407
  which means the resampler can process 978.64 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.102183 % CPU usage

### SSE CODE with temporary FLOATS generated by HAND
$ g++-4.1 -o ssefir ssefir.cc -O3 -funroll-loops
$ ./ssefir puf
performance test for factor 2 upsampling using SSE instructions
  (performance will be normalized to upsampler input samples)
  total samples processed = 64000000
  processing_time = 1.323285
  samples / second = 48364483.105296
  which means the resampler can process 1096.70 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.091183 % CPU usage

I must admit, I am really impressed with the quality of auto
vectorization. It's the first time I've been able to try it out (I
installed gcc-4.1 today). The hand written code is somewhat faster, but
not by much. Another advantage of the hand written code is that it will
work with any compiler down to gcc 3.4.
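
Expressed as speedups relative to the FPU/doubles run (computed from the
processing times above):

  FPU, float temporaries:    3.804806 / 3.317954 = 1.15
  SSE via autovectorizer:    3.804806 / 1.482929 = 2.57
  SSE, hand written:         3.804806 / 1.323285 = 2.88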

> >I've uploaded a more recent version of the sources to bugzilla: #336366.
> >[...]
> 
> thanks for the good work, will have a look at it later.

I uploaded a new version with my current sources.

   Cu... Stefan
-- 
Stefan Westerfeld, Hamburg/Germany, http://space.twc.de/~stefan
