Re: Fast factor 2 resampling

From: Tim Janik <timj gtk org>
To: Stefan Westerfeld <stefan space twc de>
Cc: beast gnome org
Subject: Re: Fast factor 2 resampling
Date: Wed, 5 Apr 2006 02:05:37 +0200 (CEST)

On Tue, 4 Apr 2006, Stefan Westerfeld wrote:

  Hi!

On Tue, Mar 28, 2006 at 08:06:07PM +0200, Tim Janik wrote:

On Tue, 28 Mar 2006, Stefan Westerfeld wrote:

- I put some more effort into designing the coefficients for the filter;
I used octave to do it; the specifications I tried to meet are listed
in the coeffs.h file.


hm, can you put up a description about how to derive the coefficients with
octave or with some other tool then. so they can be reproduced by someone
else?


As I have done it, it requires extra octave code (a bunch of .m files
implementing the ultraspherical window). I've copypasted the code from a
paper, and hacked around until it worked (more or less) in octave.

But if we want to include it as octave code in the BEAST distribution,
it might be worth investing a little more work into this window so that
we can provide a matlab/octave implementation we really understand and
then can provide a C implementation as well, so it can be used from
BEAST directly.


ok, ok, first things first ;)

as far as i see, we only have a couple use cases at hand, supposedly
comprehensive filter setups are:
-  8bit:  48dB
- 12bit:  72dB
- 16bit:  96dB
- 20bit: 120dB
- 24bit: 144dB

if we have those 5 cases covered by coefficient sets, that'd be good enough
to check the stuff in to CVS and have production ready up/down sampling.
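
As a side note on where the dB figures in the list come from: they follow the usual rule of thumb of roughly 6 dB per bit. The sketch below (plain Python; the function name is made up for illustration) compares that rule of thumb with 20*log10(2^bits).

```python
import math

def dynamic_range_db(bits):
    """Ratio between full scale and one quantization step, in dB."""
    return 20.0 * math.log10(2.0 ** bits)

for bits in (8, 12, 16, 20, 24):
    rule_of_thumb = 6 * bits           # the figures used in the list above
    exact = dynamic_range_db(bits)     # about 6.02 dB per bit
    print(f"{bits:2d}bit: {rule_of_thumb:3d}dB (rule of thumb), {exact:7.2f}dB (exact)")
```

The two columns differ by less than 1 dB for all five cases, so the rounded figures are fine as names for the coefficient sets.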


Yes, these sound reasonable. Although picking which filter setup to use
may not be as easy as looking at the precision of the input data.

For example ogg input data could be resampled with 96dB coefficients for
performance reasons, or 8bit input data could be resampled with a higher
order filter to get better transition steepness.


but that'd just be another choice out of those 5, other than the obvious one.
or am i misunderstanding you and you want to point out a missing setup?


Anyway, I'll design coefficients for these 5 cases, and if we want to
have more settings later on, we still can design new coefficients.


yeah.

then, if the octave files and the paper you pasted from permit, it'd be good
to put the relevant octave/matlab files into CVS under LGPL, so the coefficient
creation process can be reconstructed later on (and by other contributors).


I've asked the author of the paper, and he said we can put his code in
our LGPL project. I still need to put some polishing into the octave
code, because I somewhat broke it when porting it from matlab to octave.


ugh. i think putting just the link to the paper into our docs or the code
comments would be enough, given it is publicly available. but we can mirror
it on site if we got permission for redistribution and there is reason to
believe the original location may be non-permanent in any way.

Linear phase filtering means three things:

* we do "real interpolation", in the sense that for factor 2 upsampling,
every other sample is exactly kept as it is; this means that we don't
have to compute it

* we keep the shape of the signal intact, thus operations that modify
the shape of the signal (non-linear operations, such as saturation)
will sound the same when oversampling them

* we have the same delay for all frequencies - not having the same
delay for all frequencies may result in audible differences between
the original and up/downsampled signal

  http://en.wikipedia.org/wiki/Group_delay

gives a table, which however seems to indicate that "not being quite"
linear phase wouldn't lead to audible problems
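
The first and third points can be illustrated with a small sketch. It assumes a Hann-windowed sinc half-band filter, which is only a stand-in and not the ultraspherical-window design discussed in this thread; the function names and the tap count are made up for illustration. The taps are symmetric (which is what gives linear phase), and because all even-offset taps except the center one are zero, the input samples pass through untouched and only the samples in between need to be computed.

```python
import math

def halfband_odd_taps(k_max=15):
    """Odd-offset taps h[k] of a Hann-windowed sinc half-band low-pass.

    Illustrative design only, not the coefficients from coeffs.h.
    k_max must be odd.
    """
    taps = {}
    for k in range(-k_max, k_max + 1, 2):          # odd offsets only
        sinc = math.sin(math.pi * k / 2) / (math.pi * k / 2)
        window = 0.5 * (1.0 + math.cos(math.pi * k / (k_max + 1)))
        taps[k] = sinc * window                    # h[k] == h[-k]: linear phase
    return taps

def upsample2(x, taps):
    """Factor 2 upsampling: copy the inputs, interpolate the samples between."""
    y = [0.0] * (2 * len(x))
    for i in range(len(x)):
        y[2 * i] = x[i]                            # kept as is, nothing to compute
        acc = 0.0
        for k, h in taps.items():
            j = (2 * i + 1 - k) // 2               # input sample under tap k
            if 0 <= j < len(x):
                acc += h * x[j]
        y[2 * i + 1] = acc                         # halfway between x[i] and x[i+1]
    return y

taps = halfband_odd_taps()
x = [math.sin(2 * math.pi * 0.05 * i) for i in range(64)]
y = upsample2(x, taps)
print(y[0::2] == x)   # checks that every other output sample is the input
```

This is written as a zero-delay (non-causal) filter to keep the indexing short; a real implementation would delay the output by half the filter length.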


ok, thanks for explaining this. we should have this and similar things
available in our documentation actually. either on a wiki page on synthesis,
or even a real documentation chapter about synthesis. thoughts?


Maybe a new doxi file on synthesis details? I could write a few
paragraphs on the resampler.


that'd be good. will you check that in to docs/ then?
one file that may be remotely suitable is:
  http://beast.gtk.org/architecture.html
but it's probably much better to just start synthesis-details.doxi.

Why -120dB? 6 * 24 = 144...?


yeah, thanks for pointing this out. both are valid use cases, 20bit
samples and 24bit samples.


Although by the way -120dB should be ok for almost any practical use
case, because the human ear probably won't be able to hear the difference.

Since these are relative values (unlike when talking about integer
precisions for samples), even signals which are not very loud will get
really good resampling.

Thus you have error scenarios like this: a signal with a loud desired
signal (sine wave with 0 dB) and a small error signal (sine wave with
-120dB).  I doubt that the human ear can pick up the error signal. I
even doubt it for the -96 dB case. But well, we could perform listening
tests to try it out.


well, since we're writing a modular synthesis application here, keep in mind
that examining just one signal in isolation isn't good enough for all cases.
that'd be ok for a media player with one pluggable filter in its output
chain, but our signals are used for various purposes and *may* be strongly
amplified.
i'm not saying 144dB will be the common use case, but i think it's reasonably
within the range of filters we might want to offer synthesis users.

yeah, right. float might fall short on 20bit or 24bit (definitely the latter,
since 32bit floats have only 23bit of mantissa).
but as you say, we'll see once we have the other coefficient sets, and even
if at 144dB only the slow FPU variant can keep precision, the SSE code will
still speed up the most common use case which is 16bit.
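
To put a number on the float concern: a 32bit float stores 23 mantissa bits plus an implicit leading bit, so a single rounding step can already introduce a relative error of up to 2^-24, which is about -144.5 dB. The sketch below emulates 32bit float rounding in Python via the struct module; the helper names and the test data are made up for illustration, and it says nothing about how the actual FPU or SSE code behaves.

```python
import math
import struct

def to_float32(value):
    """Round a Python float (a double) to the nearest 32bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]

def fir_dot(coeffs, samples, single_precision):
    """One FIR output sample, accumulated in emulated float or in double."""
    acc = 0.0
    for c, s in zip(coeffs, samples):
        if single_precision:
            product = to_float32(to_float32(c) * to_float32(s))
            acc = to_float32(acc + product)       # round after every step
        else:
            acc += c * s
    return acc

# 2^-24 relative to 1.0 is lost in a float, 2^-23 survives
print(to_float32(1.0 + 2.0 ** -24) == 1.0)
print(to_float32(1.0 + 2.0 ** -23) == 1.0)
print(20.0 * math.log10(2.0 ** -24))              # rounding step in dB

coeffs = [math.sin(0.1 * k) for k in range(32)]   # arbitrary test data
samples = [math.cos(0.3 * k) for k in range(32)]
error = abs(fir_dot(coeffs, samples, True) - fir_dot(coeffs, samples, False))
print(error)
```

Whether errors of this size matter in practice is exactly what needs to be measured once the coefficient sets exist.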


Yes. We need to try it once I have the coefficient sets. As I argued
above, the errors may be well below what the human ear can perceive.


i'll keep an eye on it. with our FFT scope. which allows almost arbitrary
signal boosts ;)

what worries me a bit though is that you mentioned one of your machines
runs the SSE variant slower than the FPU variant. did you investigate
more here?


Not yet.


ok, please keep posting once you've done that then ;)

well, i did read through it now. first, what's oversampling? how's that
different from upsampling?


Oversampling is first upsampling a 44100 Hz signal to 88200 Hz, and then
downsampling it again to 44100 Hz. It's what I first designed the filters
for: for oversampling the engine. Thus I benchmarked it as a separate
case.
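
A minimal sketch of that round trip, with some processing step run at the doubled rate in between. It assumes a Hann-windowed sinc half-band filter as a stand-in for the real coefficient sets; all function names are made up for illustration, and the code is written for readability rather than speed.

```python
import math

def halfband_taps(k_max=15):
    """Center tap plus odd-offset taps of a Hann-windowed sinc half-band
    filter. Illustrative only; k_max must be odd."""
    taps = {0: 1.0}
    for k in range(-k_max, k_max + 1, 2):          # odd offsets only
        arg = math.pi * k / 2
        window = 0.5 * (1.0 + math.cos(math.pi * k / (k_max + 1)))
        taps[k] = math.sin(arg) / arg * window
    return taps

def upsample2(x, taps):
    """E.g. 44100 Hz -> 88200 Hz: zero-stuff, then low-pass."""
    y = []
    for n in range(2 * len(x)):
        acc = 0.0
        for k, h in taps.items():
            m = n - k                              # position in zero-stuffed signal
            if m % 2 == 0 and 0 <= m // 2 < len(x):
                acc += h * x[m // 2]
        y.append(acc)
    return y

def downsample2(y, taps):
    """E.g. 88200 Hz -> 44100 Hz: low-pass, then keep every other sample."""
    out = []
    for i in range(len(y) // 2):
        acc = 0.0
        for k, h in taps.items():
            m = 2 * i - k
            if 0 <= m < len(y):
                acc += 0.5 * h * y[m]              # 0.5 gives unity gain at DC
        out.append(acc)
    return out

def oversample(x, process, taps):
    """Run process() per sample at twice the sample rate."""
    return downsample2([process(v) for v in upsample2(x, taps)], taps)

taps = halfband_taps()
x = [math.sin(2 * math.pi * 0.05 * i) for i in range(64)]
unchanged = oversample(x, lambda v: v, taps)                  # round trip only
saturated = oversample([2.0 * v for v in x], math.tanh, taps) # non-linear step
```

With the identity as the processing step, the output should closely match the input away from the block edges; with a saturation such as tanh, the harmonics it creates get more headroom before they alias.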


hm, i still don't have a good idea if we won't need n-times oversampling
for the whole engine. basically, because i have a rough idea on what usual
input output rates are or could be (44.1k, 48k, 88.2k, 96k), but not what
good rates are to run the synthesis engine at (48K, 56K, 66.15K, 64K, 72K)...

The non-SSE implementation does use doubles for intermediate values. The
SSE implementation could only use doubles if we rely on some higher
version of SSE (I think SSE2 or SSE3). However, the price of doing it
would be that the vectorized operations don't do four operations at
once, but two. That means it would become a lot slower to use SSE at
all.


depending on sse2 also limits portability, e.g. out of 2 laptops here, only
1 has sse2 (both have sse), and out of 2 athlons here only one has sse (and
none sse2). the story is different with mmx of course, which is supported by
all 4 processors...


But MMX only accelerates integer operations, which doesn't help much for
our floating point based data handles.


sure, i'm just pointing out the availability of different technologies here.
i.e. mmx vs. sse vs. sse2. and the athlons of course also have 3dnow.
to sum it up, SSE seems feasible at the moment, SSE2 not so, out of the
available instruction sets.

As you see, the variant which uses doubles for intermediate values is
not much better than the SSE variant, and both fulfill the spec without
problems.


have you by any chance benched the FPU variant with doubles against the
FPU variant with floats btw?


Well, I tried it now: the FPU variant without doubles is quite a bit (15%)
faster than the variant which uses doubles as intermediate values.

If you want *really cool* speedups, you can use gcc-4.1 with float
temporaries, -ftree-vectorize and -ffast-math. That auto vectorization
thing really works, and replaces the FPU instructions with SSE
instructions automagically. It's not much slower than my hand crafted
version. But then again, we wanted a FPU variant to have a FPU variant,
right?


erm, i can't believe your gcc did that without also specifying a
processor type...
and when we get processor specific, we have to provide alternative
compilation objects and need a mechanism to clearly identify and
select the required instruction sets during runtime.

I've uploaded a more recent version of the sources to bugzilla: #336366.
[...]


thanks for the good work, will have a look at it later.


I uploaded a new version with my current sources.


rock.

  Cu... Stefan


---
ciaoTJ

