*From*: Stefan Westerfeld <stefan space twc de>
*To*: Tim Janik <timj gtk org>
*Cc*: beast gnome org
*Subject*: Re: Fast factor 2 resampling
*Date*: Tue, 28 Mar 2006 21:44:05 +0200

   Hi!

On Tue, Mar 28, 2006 at 03:27:14PM +0200, Tim Janik wrote:
> On Mon, 6 Mar 2006, Stefan Westerfeld wrote:
> > I have worked quite a bit now on writing code for fast factor two
> > resampling based on FIR filters. So the task is similar to the last
> > posting I made. The differences are:
> >
> > - I tried to really optimize for speed; I used gcc's SIMD primitives to
> >   design a version that runs really fast on my machine:
> >
> >     model name : AMD Athlon(tm) 64 Processor 3400+
> >     stepping   : 10
> >     cpu MHz    : 2202.916
> >
> >   running in 64bit mode.
> >
> > - I put some more effort into designing the coefficients for the filter;
> >   I used octave to do it; the specifications I tried to meet are listed
> >   in the coeffs.h file.
>
> hm, can you put up a description about how to derive the coefficients with
> octave or with some other tool then, so they can be reproduced by someone
> else?

As I have done it, it requires extra octave code (a bunch of .m files
implementing the ultraspherical window). I've copy-pasted the code from a
paper and hacked around until it worked (more or less) in octave. But if we
want to include it as octave code in the BEAST distribution, it might be
worth investing a little more work into this window, so that we can provide
a matlab/octave implementation we really understand, and then a C
implementation as well, so it can be used from BEAST directly.

> > The resamplers are designed for streaming use; they do smart history
> > keeping. Thus a possible use case I designed them for would be to
> > upsample BEAST at the input devices and downsample BEAST at the output
> > devices.
> >
> > The benefits of using the code for this task are:
> >
> > - the filters are linear-phase
>
> *why* exactly is this a benefit?
Linear-phase filtering means three things:

* we do "real interpolation", in the sense that for factor 2 upsampling,
  every other sample is kept exactly as it is; this means that we don't
  have to compute it

* we keep the shape of the signal intact, so operations that modify the
  shape of the signal (non-linear operations, such as saturation) will
  sound the same when oversampling them

* we have the same delay for all frequencies; not having the same delay
  for all frequencies may result in audible differences between the
  original and the up/downsampled signal

  http://en.wikipedia.org/wiki/Group_delay gives a table, which however
  seems to indicate that "not being quite" linear-phase wouldn't lead to
  audible problems

> > - the implementation should be fast enough (at least on my machine)
> > - the implementation should be precise enough (near -96 dB error == 16
> >   bit precision)
>
> what is required to beef this up to -120dB, or provide an alternative
> implementation? i'm asking because float or 24bit datahandles are not at
> all unlikely for the future.

Why -120dB? 6 * 24 = 144...?

The first factor that influences the precision is of course the filter
(and the resampling code doesn't hardcode the filter coefficients). The
filter can be tweaked to offer a -144dB (or -120dB) frequency response by
redesigning the coefficients (with the octave method I used); it will be
longer then (more delay, slower computation).

The second factor is the SSE code itself, because SSE limits us to float
precision. My implementation also uses a computation order that is quite
fast, but not too good for precision. Usually, for FIR filters it's good
to compute the influence of small coefficients first and then the
influence of larger ones. However, I compute the influence of the
coefficients in the order they occur in the impulse response.

In conclusion: it might be that SSE code, at least as implemented, cannot
attain the precision we desire for 24bit audio.
How good it gets probably can't be determined without trying it.

> likewise, a 12bit variant may make sense as well for some handles (maybe
> even an 8bit variant in case that's still significantly faster than the
> 12bit version).

That should be no problem, simply by designing new coefficients.

> > The downside may be the delay of the filters.
> >
> > I put some effort into making this code easy to test, with four kinds
> > of tests:
> >
> > (p) Performance tests measure how fast the code runs
> >
> >     I tried on my machine with both gcc-3.4 and gcc-4.0; you'll see
> >     the results below. The speedup gain achieved using SIMD
> >     instructions (SSE3 or whatever AMD64 uses) is
> >
> >                    gcc-4.0   gcc-3.4
> >     -------------+-------------------
> >     upsampling   |   2.82      2.85
> >     downsampling |   2.54      2.50
> >     oversampling |   2.70      2.64
> >
> >     where oversampling is first performing upsampling and then
> >     performing downsampling. Note that there is a bug in gcc-3.3 which
> >     will not allow combining C++ code with SIMD instructions.
> >
> >     The other output should be self-explanatory (if not, feel free to
> >     ask).
>
> hm, these figures are pretty much meaningless without knowing:
> - what exactly was performed that took 2.82 or 2.85
> - what is the unit of those figures? milli seconds? hours? dollars?

These are speedup gains. A speedup gain is the factor between the "normal"
implementation and the SSE implementation:

  speedup_gain = time_normal / time_sse

It has no unit, because the "seconds" unit both times carry disappears
when dividing them.

If you want to know the times, and the number of samples processed in that
time, you should read the RESULTS file. It is much more detailed than the
table I gave above.

> > (a) Accuracy tests, which compare what should be the result with what
> >     is the result; you'll see that using SIMD instructions means a
> >     small loss of precision, but it should be acceptable.
> >     It occurs because the code doesn't use doubles to store the
> >     accumulated sum, but floats, to enable SIMD speedup.
>
> what's the cost of using doubles for intermediate values anyway (is that
> possible at all?)
> and what does the precision loss mean in dB?

The non-SSE implementation does use doubles for intermediate values. The
SSE implementation could only use doubles if we relied on some higher
version of SSE (I think SSE2 or SSE3). However, the price of doing so
would be that the vectorized operations don't do four operations at once,
but two. That means it would become a lot slower to use SSE at all.

As outlined above, the "real" precision loss is hard to predict. However,
I can give you one sample here for the -96dB filter:

  $ ssefir au
  accuracy test for factor 2 upsampling using FPU instructions
  input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz)
  max difference between correct and computed output: 0.000012 = -98.194477 dB

  $ ssefir auf
  accuracy test for factor 2 upsampling using SSE instructions
  input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz)
  max difference between correct and computed output: 0.000012 = -98.080477 dB

As you can see, the variant which uses doubles for intermediate values is
not much better than the SSE variant, and both fulfill the spec without
problems. However, as dB is a logarithmic measure, care has to be taken
when extrapolating what this would mean for a -144dB (or -120dB) filter.
And the other aspects that affect precision I mentioned above will also
affect the result.

> but then, what you're sending here is still pretty rough and
> looks cumbersome to deal with.
> can you please provide more details on the exact API you intend to add
> (best is to have this in bugzilla), and give precise build instructions
> (best is usually down to the level of shell commands, so the reader just
> needs to paste those).

I've uploaded a more recent version of the sources to bugzilla: #336366.
It also contains build instructions for the standalone thingy.

For ssefir.h, I added documentation comments /** ... */ for those
functions/classes that may be interesting for others. I also marked a few
more functions protected, so that only the interesting part of the main
classes, Upsampler2 and Downsampler2, remains public.

> also, more details of what exactly your performance tests do and how
> to use them would be appreciated.

Basically, they do the resampling processing for the same block of data
500000 times. You can modify the block size used. By timing this
operation, a throughput can be computed, which then can be given as
samples per second, or for instance as CPU usage for resampling a single
44100 Hz stream.

If you run the shell script (or read the test file RESULTS I had attached
to the initial mail), you may understand it a bit more, because the output
is somewhat verbose. On my system:

  $ ssefir pu
  performance test for factor 2 upsampling using FPU instructions
  (performance will be normalized to upsampler input samples)
  total samples processed = 64000000
  processing_time = 3.667876
  samples / second = 17448790.501572
  which means the resampler can process 395.66 44100 Hz streams simultaneously
  or one 44100 Hz stream takes 0.252740 % CPU usage

  $ ssefir puf
  performance test for factor 2 upsampling using SSE instructions
  (performance will be normalized to upsampler input samples)
  total samples processed = 64000000
  processing_time = 1.346511
  samples / second = 47530250.673020
  which means the resampler can process 1077.78 44100 Hz streams simultaneously
  or one 44100 Hz stream takes 0.092783 % CPU usage

The arguments here are:

  p = performance
  u = upsampling
  f = fast -> SSE implementation

Run ssefir without args for help.

   Cu... Stefan
-- 
Stefan Westerfeld, Hamburg/Germany, http://space.twc.de/~stefan

**Follow-Ups**:
- **Re: Fast factor 2 resampling**, *From:* Tim Janik

**References**:
- **Fast factor 2 resampling**, *From:* Stefan Westerfeld
- **Re: Fast factor 2 resampling**, *From:* Tim Janik
