Fast factor 2 resampling

From: Stefan Westerfeld <stefan space twc de>
To: beast gnome org
Subject: Fast factor 2 resampling
Date: Mon, 6 Mar 2006 23:05:07 +0100

   Hi!

I have worked quite a bit now on writing code for fast factor two
resampling based on FIR filters. So the task is similar to the last
posting I made. The differences are:

- I tried to really optimize for speed; I used gcc's SIMD primitives to
  design a version that runs really fast on my machine:

  model name      : AMD Athlon(tm) 64 Processor 3400+
  stepping        : 10
  cpu MHz         : 2202.916

  running in 64bit mode.

- I put some more effort into designing the coefficients for the filter;
  I used octave to do it; the specifications I tried to meet are listed
  in the coeffs.h file.

The resamplers are designed for streaming use; they do smart history
keeping. Thus a possible use case I designed them for would be to
upsample BEAST at the input devices and downsample BEAST at the output
devices.

The benefits of using the code for this tasks are:

- the filters are linear-phase
- the implementation should be fast enough (at least on my machine)
- the implementation should be precise enough (near -96 dB error == 16
  bit precision)

The downside may be the delay of the filters.

I put some effort into making this code easy to test, with four kinds of
tests:

(p) Performance tests measure how fast the code runs

    I tried on my machine with both: gcc-3.4 and gcc-4.0; you'll see the
    results below. The speedup gain achieved using SIMD instructions
    (SSE3 or whatever AMD64 uses) is

                   gcc-4.0    gcc-3.4
    -------------+---------------------
    upsampling   |   2.82      2.85
    downsampling |   2.54      2.50
    oversampling |   2.70      2.64

    where oversampling is first performing upsampling and then
    performing downsampling. Note that there is a bug in gcc-3.3 which
    will not allow combining C++ code with SIMD instructions.

    The other output should be self-explaining (if not, feel free to
    ask).

(a) Accuracy tests, which compare what should be the result with what is
    the result; you'll see that using SIMD instructions means a small
    loss of precision, but it should be acceptable. It occurs because
    the code doesn't use doubles to store the accumulated sum, but
    floats, to enable SIMD speedup.

(g) Gnuplot; much the same like accuracy, but it writes out data which
    can be plottet by gnuplot. So it is possible to "see" the
    interpolation error, rather than just get it as output.

(i) Impulse response; this one can be used for debugging - it will give
    the impulse response for the (for sub- and oversampling combined)
    system, so you can for instance see the delay in the time domain or
    plot the filter response in the frequency domain.

So I am attaching all code and scripts that I produced so far. For
compiling, I use g++ -O3 -funroll-loops as options; however, I suppose
on x86 machines you need to tell the compiler to generate SSE code.

I just tried it briefly on my laptop, and the SSE version there is much
slower than the non-SSE version. Currently I can't say why this is. I
know that AMD64 has extra registers compared to standard SSE. However,
I designed the inner loop (fir_process_4samples_sse) with having in mind
not to use more than 8 registers: out0..out3, input, taps, intermediate
sum/product whatever. These are 7 registers. Well, maybe AMD64 isn't
faster because of more registers, but better adressing mode, or
whatever.

Maybe its just my laptop being slow, and other non-AMD64-systems will
perform better.

Maybe we need to write three versions of the inner loop. One for AMD64,
one for x86 with SSE and one for FPU.

In any case, I invite you to try it out, play with the code, and give
feedback about it.

   Cu... Stefan
-- 
Stefan Westerfeld, Hamburg/Germany, http://space.twc.de/~stefan

/* SSE optimized FIR Resampling code
 * Copyright (C) 2006 Stefan Westerfeld
 *
 * This library is free software; you can redistribute it and/or
 * modify it under the terms of the GNU Lesser General Public
 * License as published by the Free Software Foundation; either
 * version 2 of the License, or (at your option) any later version.
 *
 * This library is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * Lesser General Public License for more details.
 *
 * You should have received a copy of the GNU Lesser General
 * Public License along with this library; if not, write to the
 * Free Software Foundation, Inc., 59 Temple Place, Suite 330,
 * Boston, MA 02111-1307, USA.
 */

/* 
 * halfband FIR filter for factor 2 resampling, created with octave
 *
 * design method: windowed sinc,  using ultraspherical window
 *
 *   coefficients = 32
 *             x0 = 1.01065
 *          alpha = 0.75
 *
 * design criteria (44100 Hz => 88200 Hz):
 * 
 *       passband = [     0, 18000 ]  1 - 2^-16 <= H(z) <= 1+2^-16
 *     transition = [ 18000, 26100 ]
 *       stopband = [ 26100, 44100 ]  | H(z) | <= -96 dB
 *
 * and for 48 kHz => 96 kHz:
 *
 *       passband = [     0, 19589 ]  1 - 2^-16 <= H(z) <= 1+2^-16
 *     transition = [ 19588, 29386 ]
 *       stopband = [ 29386, 48000 ]  | H(z) | <= -96 dB
 *
 * in order to keep the coefficient number down to 32, the filter
 * does only "almost" fulfill the spec, but its really really close
 * (no stopband ripple > -95 dB)
 */

const double halfband_fir_upsample2_96db_coeffs[32] = {
    -3.48616530828033e-05,
     0.000112877490936198,
    -0.000278961878372482,
     0.000590495306376081,
    -0.00112566995029848,
     0.00198635062559427,
    -0.00330178798332932,
     0.00523534239035401,
    -0.00799905465189065,
     0.0118867161189188,
    -0.0173508611368417,
     0.0251928452706978,
    -0.0370909694665106,
     0.057408291607388,
    -0.102239638342325,
     0.317002929635456,
     /* here, a 0.5 coefficient will be used */
     0.317002929635456,
    -0.102239638342325,
     0.0574082916073878,
    -0.0370909694665105,
     0.0251928452706976,
    -0.0173508611368415,
     0.0118867161189186,
    -0.00799905465189052,
     0.0052353423903539,
    -0.00330178798332923,
     0.00198635062559419,
    -0.00112566995029842,
     0.000590495306376034,
    -0.00027896187837245,
     0.000112877490936177,
    -3.48616530827983e-05
};

/* SSE optimized FIR Resampling code
 * Copyright (C) 2006 Stefan Westerfeld
 *
 * This library is free software; you can redistribute it and/or
 * modify it under the terms of the GNU Lesser General Public
 * License as published by the Free Software Foundation; either
 * version 2 of the License, or (at your option) any later version.
 *
 * This library is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * Lesser General Public License for more details.
 *
 * You should have received a copy of the GNU Lesser General
 * Public License along with this library; if not, write to the
 * Free Software Foundation, Inc., 59 Temple Place, Suite 330,
 * Boston, MA 02111-1307, USA.
 */
#include <malloc.h>
#include <stdio.h>
#include <math.h>
#include <assert.h>
#include <vector>
#include <sys/time.h>
#include <time.h>

using std::vector;
using std::min;
using std::max;
using std::copy;

/* see: http://ds9a.nl/gcc-simd/ */
typedef float v4sf __attribute__ ((vector_size (16))); //((mode(V4SF))); // vector of four single floats
union f4vector 
{
  v4sf v;
  float f[4];
};

template<int N>
inline float fir_process_one_sample (const float *input, const float *taps)
{
  double out = 0;
  for (int i = 0; i < N; i++)
    out += input[i] * taps[i];
  return out;
}

template<int N>
inline void fir_process_4samples_sse (const float *input, const float *taps,
				      float& out0, float& out1, float& out2, float& out3)
{
  assert (N >= 4 && (N % 4) == 0);

  /* input and taps must be 16-byte aligned */
  const f4vector *input_v = reinterpret_cast<const f4vector *> (input);
  const f4vector *taps_v = reinterpret_cast<const f4vector *> (taps);
  f4vector out0_v, out1_v, out2_v, out3_v;

  out0_v.v = input_v[0].v * taps_v[0].v;
  out1_v.v = input_v[0].v * taps_v[1].v;
  out2_v.v = input_v[0].v * taps_v[2].v;
  out3_v.v = input_v[0].v * taps_v[3].v;

  for (int i = 1; i < (N+4)/4; i++) /* FIXME: upper loop boundary may be wrong; only tested with (N % 4) == 0 */
    {
      out0_v.v += input_v[i].v * taps_v[i*4+0].v;
      out1_v.v += input_v[i].v * taps_v[i*4+1].v;
      out2_v.v += input_v[i].v * taps_v[i*4+2].v;
      out3_v.v += input_v[i].v * taps_v[i*4+3].v;
    }

  out0 = out0_v.f[0] + out0_v.f[1] + out0_v.f[2] + out0_v.f[3];
  out1 = out1_v.f[0] + out1_v.f[1] + out1_v.f[2] + out1_v.f[3];
  out2 = out2_v.f[0] + out2_v.f[1] + out2_v.f[2] + out2_v.f[3];
  out3 = out3_v.f[0] + out3_v.f[1] + out3_v.f[2] + out3_v.f[3];
}

/*
 * we require a special ordering of the FIR taps, to get maximum benefit of the SSE SIMD operations
 *
 * example: suppose the FIR taps are [ x1 x2 x3 x4 x5 x6 x7 x8 x9 ], then the SSE taps become
 *
 * [ x1 x2 x3 x4   0 x1 x2 x3   0  0 x1 x2   0  0  0 x1      <- for input[0]
 *   x5 x6 x7 x8  x4 x5 x6 x7  x3 x4 x5 x6  x2 x3 x4 x5      <- for input[1]
 *   x9  0  0  0  x8 x9  0  0  x7 x8 x9  0  x6 x7 x8 x9 ]    <- for input[2]
 * \------------/\-----------/\-----------/\-----------/
 *    for out0     for out1      for out2     for out3
 *
 * so that we can compute out0, out1, out2 and out3 simultaneously
 * from input[0]..input[2]
 */
vector<float> fir_compute_taps_sse (const vector<float>& no_sse_taps)
{
  const int T = no_sse_taps.size();
  vector<float> sse_taps ((T + 4) * 4); /* FIXME: may be wrong; code only tested with (T % 4) == 0 */

  for (int j = 0; j < 4; j++)
    for (int i = 0; i < T; i++)
      {
	int k = i + j;
	sse_taps[(k/4)*16+(k%4)+j*4] = no_sse_taps[i];
      }

  return sse_taps;
}

template<class T, int ALIGN>
class AlignedMem {
  unsigned char *unaligned_mem;
  T *data;
  unsigned int n_elements;

  void allocate_aligned_data()
  {
    assert ((ALIGN % sizeof(T)) == 0);
    unaligned_mem = static_cast <unsigned char *> (malloc (n_elements * sizeof(T) + (ALIGN-1)));
    
    unsigned char *aligned_mem = unaligned_mem;
    if (ptrdiff_t (unaligned_mem) % ALIGN)
      aligned_mem += ALIGN - ptrdiff_t (unaligned_mem) % ALIGN;

    data = reinterpret_cast<T *> (aligned_mem);
  }
  /*
   * no copy constructor
   */
  AlignedMem (const AlignedMem&);
  /*
   * no assignment operator
   */
  AlignedMem& operator= (const AlignedMem&);
public:
  AlignedMem (const vector<T>& elements)
    : n_elements (elements.size())
  {
    allocate_aligned_data();

    for (unsigned int i = 0; i < n_elements; i++)
      new (data + i) T(elements[i]);
  }
  AlignedMem (unsigned int n_elements)
    : n_elements (n_elements)
  {
    allocate_aligned_data();

    for (unsigned int i = 0; i < n_elements; i++)
      new (data + i) T();
  }
  ~AlignedMem()
  {
    /*
     * C++ destruction order: last allocated element is deleted first
     */
    while (n_elements)
      data[--n_elements].~T();

    free (unaligned_mem);
  }
  T& operator[] (unsigned int pos)
  {
    return data[pos];
  }
  const T& operator[] (unsigned int pos) const
  {
    return data[pos];
  }
  unsigned int size()
  {
    return n_elements;
  }
};

template<unsigned int T, bool USE_SSE>
class Upsampler2 {
  vector<float>        no_sse_taps;
  AlignedMem<float,16> history;
  AlignedMem<float,16> sse_taps;
public:
  Upsampler2 (float *init_taps)
    : no_sse_taps (init_taps, init_taps + T),
      history (2 * T),
      sse_taps (fir_compute_taps_sse (no_sse_taps))
  {
    assert ((T & 1) == 0);    /* even order filter */
  }
  /*
   * fast SSE optimized convolution
   */
  void process_4samples_aligned (const float *input /* aligned */, float *output)
  {
    const int H = (T / 2) - 1; /* half the filter length */

    output[0] = input[H];
    output[2] = input[H+1];
    output[4] = input[H+2];
    output[6] = input[H+3];

    fir_process_4samples_sse<T> (&input[0], &sse_taps[0], output[1], output[3], output[5], output[7]);
  }
  /*
   * slow non-SSE convolution
   */
  void process_sample_unaligned (const float *input, float *output)
  {
    const int H = (T / 2) - 1; /* half the filter length */
    output[0] = input[H];
    output[1] = fir_process_one_sample<T> (&input[0], &no_sse_taps[0]);
  }
  void process_block_aligned (const float *input, unsigned n_input_samples, float *output)
  {
    unsigned int i = 0;
    if (USE_SSE)
      {
	while (i + 3 < n_input_samples)
	  {
	    process_4samples_aligned (&input[i], &output[i*2]);
	    i += 4;
	  }
      }
    while (i < n_input_samples)
      {
	process_sample_unaligned (&input[i], &output[2*i]);
	i++;
      }
  }
  void process_block_unaligned (const float *input, unsigned n_input_samples, float *output)
  {
    const int H = (T / 2) - 1; /* half the filter length */
    unsigned int i = 0;
    if (USE_SSE)
      {
	while ((reinterpret_cast<ptrdiff_t> (&input[i]) & 15) && i < n_input_samples)
	  {
	    process_sample_unaligned (&input[i], &output[2*i]);
	    i++;
	  }
      }
    process_block_aligned (&input[i], n_input_samples - i, &output[2*i]);
  }
  void process_block (const float *input, unsigned int n_input_samples, float *output)
  {
    unsigned int history_todo = min (n_input_samples, T);

    copy (input, input + history_todo, &history[T]);
    process_block_aligned (&history[0], history_todo, output);
    if (n_input_samples >= T)
      {
	process_block_unaligned (input, n_input_samples - history_todo, &output [2*history_todo]);

	// build new history from new input
	copy (input + n_input_samples - T, input + n_input_samples, &history[0]);
      }
    else
      {
	// build new history from end of old history
	// (very expensive if n_input_samples tends to be a lot smaller than T often)
	memmove (&history[0], &history[n_input_samples], sizeof (history[0]) * T);
      }
  }
};

template<unsigned int T, bool USE_SSE>
class Downsampler2 {
  vector<float>        no_sse_taps;
  AlignedMem<float,16> history_even;
  AlignedMem<float,16> history_odd;
  AlignedMem<float,16> sse_taps;
public:
  Downsampler2 (float *init_taps)
    : no_sse_taps (init_taps, init_taps + T),
      history_even (2 * T),
      history_odd (2 * T),
      sse_taps (fir_compute_taps_sse (no_sse_taps))
  {
    assert ((T & 1) == 0);    /* even order filter */
  }
  /*
   * fast SSE optimized convolution
   */
  template<int ODD_STEPPING>
  void process_4samples_aligned (const float *input_even /* aligned */, const float *input_odd, float *output)
  {
    const int H = (T / 2) - 1; /* half the filter length */

    fir_process_4samples_sse<T> (&input_even[0], &sse_taps[0], output[0], output[1], output[2], output[3]);

    output[0] += 0.5 * input_odd[H*ODD_STEPPING];
    output[1] += 0.5 * input_odd[(H+1)*ODD_STEPPING];
    output[2] += 0.5 * input_odd[(H+2)*ODD_STEPPING];
    output[3] += 0.5 * input_odd[(H+3)*ODD_STEPPING];
  }
  /*
   * slow non-SSE convolution
   */
  template<int ODD_STEPPING>
  float process_sample_unaligned (const float *input_even, const float *input_odd)
  {
    const int H = (T / 2) - 1; /* half the filter length */

    return fir_process_one_sample<T> (&input_even[0], &no_sse_taps[0]) + 0.5 * input_odd[H * ODD_STEPPING];
  }
  template<int ODD_STEPPING>
  void
  process_block_aligned (const float *input_even, const float *input_odd, float *output, unsigned int n_output_samples)
  {
    unsigned int i = 0;
    if (USE_SSE)
      {
	while (i + 3 < n_output_samples)
	  {
	    process_4samples_aligned<ODD_STEPPING> (&input_even[i], &input_odd[i * ODD_STEPPING], &output[i]);
	    i += 4;
	  }
      }
    while (i < n_output_samples)
      {
	output[i] = process_sample_unaligned<ODD_STEPPING> (&input_even[i], &input_odd[i * ODD_STEPPING]);
	i++;
      }
  }
  template<int ODD_STEPPING>
  void
  process_block_unaligned (const float *input_even, const float *input_odd, float *output, unsigned int n_output_samples)
  {
    const int H = (T / 2) - 1; /* half the filter length */
    unsigned int i = 0;
    if (USE_SSE)
      {
	while ((reinterpret_cast<ptrdiff_t> (&input_even[i]) & 15) && i < n_output_samples)
	  {
	    output[i] = process_sample_unaligned<ODD_STEPPING> (&input_even[i], &input_odd[i * ODD_STEPPING]);
	    i++;
	  }
      }
    process_block_aligned<ODD_STEPPING> (&input_even[i], &input_odd[i * ODD_STEPPING], &output[i], n_output_samples);
  }
  void deinterleave2 (const float *data, unsigned int n_data_values, float *output)
  {
    for (unsigned int i = 0; i < n_data_values; i += 2)
      output[i / 2] = data[i];
  }
  void process_block (const float *input, unsigned int n_input_samples, float *output)
  {
    assert ((n_input_samples & 1) == 0);

    const unsigned int BLOCKSIZE = 1024;

    f4vector  block[BLOCKSIZE / 4]; /* using f4vector ensures 16-byte alignment */
    float    *input_even = &block[0].f[0];

    while (n_input_samples)
      {
	unsigned int n_input_todo = min (n_input_samples, BLOCKSIZE * 2);

	/*
	 * since the halfband filter contains zeros every other sample
	 * and since we're using SIMD instructions, which expect the
	 * data to be consecutively represented in memory, we prepare
	 * a block of samples containing only even-indexed samples
	 *
	 * we keep the deinterleaved data on the stack (instead of per-class
	 * allocated memory), to ensure that even running a lot of these
	 * downsampler streams will not result in cache trashing
	 */
	/*
	 * FIXME: this implementation is suboptimal for non-SSE, because it
	 * performs an extra deinterleaving step in any case, but deinterleaving
	 * is only required for SSE instructions
	 */
	deinterleave2 (input, n_input_todo, input_even);

	const float       *input_odd = input + 1; /* we process this one with a stepping of 2 */

	const unsigned int n_output_todo = n_input_todo / 2;
	const unsigned int history_todo = min (n_output_todo, T);

	copy (input_even, input_even + history_todo, &history_even[T]);
	deinterleave2 (input_odd, history_todo * 2, &history_odd[T]);

	process_block_aligned <1> (&history_even[0], &history_odd[0], output, history_todo);
	if (n_output_todo >= T)
	  {
	    process_block_unaligned<2> (input_even, input_odd, &output[history_todo], n_output_todo - history_todo);

	    // build new history from new input
	    copy (input_even + n_output_todo - T, input_even + n_output_todo, &history_even[0]);
	    deinterleave2 (input_odd + n_input_todo - T * 2, T * 2, &history_odd[0]); /* FIXME: can be optimized */
	  }
	else
	  {
	    // build new history from end of old history
	    // (very expensive if n_output_todo tends to be a lot smaller than T often)
	    memmove (&history_even[0], &history_even[n_output_todo], sizeof (history_even[0]) * T);
	    memmove (&history_odd[0], &history_odd[n_output_todo], sizeof (history_odd[0]) * T);
	  }

	n_input_samples -= n_input_todo;
	input += n_input_todo;
	output += n_output_todo;
      }
  }
};

#include "coeffs.h"

void exit_usage (int status)
{
  printf ("usage: ssefir <options> [ <block_size> ] [ <test_frequency> ]\n");
  printf ("\n");
  printf ("  p - performance\n");
  printf ("  a - accuracy\n");
  printf ("  g - generate output for gnuplot\n");
  printf ("  i - impulse response\n");
  printf ("\n");
  printf ("  d - downsample\n");
  printf ("  u - upsample\n");
  printf ("  s - subsample (= downsample and upsample after that)\n");
  printf ("  o - oversample (= upsample and downsample after that)\n");
  printf ("\n");
  printf ("  f - fast implementation (sse)\n");
  printf ("\n");
  printf ("  block_size defaults to 128 values\n");
  printf ("  test_frequency defaults to 440 Hz\n");
  printf ("\n");
  printf ("examples:\n");
  printf ("  ssefir puf 256     # check performance of fast upsampling with 256 value blocks\n");
  printf ("  ssefir au          # check accuracy of standard upsampling\n");
  printf ("  ssefir au 128 500  # check accuracy of standard upsampling using a 500 Hz frequency\n");
  exit (status);
}

enum TestType { TEST_PERFORMANCE, TEST_ACCURACY, TEST_GNUPLOT, TEST_IMPULSE, TEST_NONE } test_type = TEST_NONE;
enum ResampleType { RES_DOWNSAMPLE, RES_UPSAMPLE, RES_SUBSAMPLE, RES_OVERSAMPLE, RES_NONE } resample_type = RES_NONE;
enum ImplType { IMPL_SSE, IMPL_NORMAL } impl_type = IMPL_NORMAL;

unsigned int block_size = 128;
double test_frequency = 440.0;

double
gettime ()
{
  timeval tv;
  gettimeofday (&tv, 0);

  return double(tv.tv_sec) + double(tv.tv_usec) * (1.0 / 1000000.0);
}

template <int TEST, int RESAMPLE, int IMPL> int perform_test()
{
  if (TEST == TEST_IMPULSE) /* need to have space for impulse response for all 4 tests */
    block_size = 150;

  /* initialize up- and downsampler */
  const int T = 32;
  const bool USE_SSE = (IMPL == IMPL_SSE);
  float utaps[T], dtaps[T];
  for (int i = 0; i < T; i++)
    {
      utaps[i] = halfband_fir_upsample2_96db_coeffs[i] * 2;
      dtaps[i] = halfband_fir_upsample2_96db_coeffs[i];
    }

  Upsampler2<T,USE_SSE> ups (utaps);
  Downsampler2<T,USE_SSE> downs (dtaps);

  f4vector in_v[block_size / 2 + 1], out_v[block_size / 2 + 1], out2_v[block_size / 2 + 1];
  float *input = &in_v[0].f[0], *output = &out_v[0].f[0], *output2 = &out2_v[0].f[0]; /* ensure aligned data */

  if (TEST == TEST_PERFORMANCE)
    {
      for (int i = 0; i < block_size; i++)
	input[i] = sin (i * test_frequency/44100.0 * 2 * M_PI);

      double start_time = gettime();
      long long k = 0;
      for (int i = 0; i < 500000; i++)
	{
	  if (RESAMPLE == RES_DOWNSAMPLE || RESAMPLE == RES_SUBSAMPLE)
	    {
	      downs.process_block (input, block_size, output);
	      if (RESAMPLE == RES_SUBSAMPLE)
		ups.process_block (output, block_size / 2, output2);
	    }
	  if (RESAMPLE == RES_UPSAMPLE || RESAMPLE == RES_OVERSAMPLE)
	    {
	      ups.process_block (input, block_size, output);
	      if (RESAMPLE == RES_OVERSAMPLE)
		downs.process_block (output, block_size * 2, output2);
	    }
	  k += block_size;
	}
      double end_time = gettime();
      if (RESAMPLE == RES_DOWNSAMPLE)
	{
	  printf ("  (performance will be normalized to downsampler output samples)\n");
	  k /= 2;
	}
      else if (RESAMPLE == RES_UPSAMPLE)
	{
	  printf ("  (performance will be normalized to upsampler input samples)\n");
	}
      printf ("  total samples processed = %lld\n", k);
      printf ("  processing_time = %f\n", end_time - start_time);
      printf ("  samples / second = %f\n", k / (end_time - start_time));
      printf ("  which means the resampler can process %.2f 44100 Hz streams simultaneusly\n",
	      k / (end_time - start_time) / 44100.0);
      printf ("  or one 44100 Hz stream takes %f %% CPU usage\n", 
	        100.0 / (k / (end_time - start_time) / 44100.0));
    }
  else if (TEST == TEST_ACCURACY || TEST == TEST_GNUPLOT)
    {
      long long k = 0;
      double phase = 0;
      double max_diff = 0;
      for (int b = 0; b < 1000; b++)
	{
	  int misalign = rand() % 4;
	  int bs = rand() % (block_size - misalign);

	  if (RESAMPLE == RES_DOWNSAMPLE || RESAMPLE == RES_SUBSAMPLE)
	    bs -= bs & 1;

	  for (int i = 0; i < bs; i++)
	    {
	      input[i+misalign] = sin (phase);
	      phase += test_frequency/44100.0 * 2 * M_PI;
	    }
	  if (RESAMPLE == RES_DOWNSAMPLE || RESAMPLE == RES_SUBSAMPLE)
	    {
	      downs.process_block (input + misalign, bs, output);
	      if (RESAMPLE == RES_SUBSAMPLE)
		ups.process_block (output, bs / 2, output2);
	    }
	  if (RESAMPLE == RES_UPSAMPLE || RESAMPLE == RES_OVERSAMPLE)
	    {
	      ups.process_block (input + misalign, bs, output);
	      if (RESAMPLE == RES_OVERSAMPLE)
		downs.process_block (output, bs * 2, output2);
	    }

	  /* validate output */
	  double sin_shift;
	  double freq_factor;
	  unsigned int out_bs;
	  float *check = output;

	  if (RESAMPLE == RES_UPSAMPLE)
	    {
	      sin_shift = 34;
	      freq_factor = 0.5;
	      out_bs = bs * 2;
	    }
	  else if (RESAMPLE == RES_DOWNSAMPLE)
	    {
	      sin_shift = 16.5;
	      freq_factor = 2;
	      out_bs = bs / 2;
	    }
	  else if (RESAMPLE == RES_OVERSAMPLE)
	    {
	      sin_shift = 33.5;
	      freq_factor = 1;
	      check = output2;
	      out_bs = bs;
	    }
	  else if (RESAMPLE == RES_SUBSAMPLE)
	    {
	      sin_shift = 67;
	      freq_factor = 1;
	      check = output2;
	      out_bs = bs;
	    }

	  for (unsigned int i = 0; i < out_bs; i++, k++)
	    if (k > 100)
	      {
		double correct_output = sin ((k - sin_shift) * 2 * freq_factor * test_frequency/44100.0 * M_PI);
		if (TEST == TEST_GNUPLOT)
		  printf ("%lld %.17f %.17f\n", k, check[i], correct_output);
		else
		  max_diff = max (max_diff, check[i] - correct_output);
	      }
	}
      double max_diff_db = 20 * log (max_diff) / log (10);
      if (TEST == TEST_ACCURACY)
	{
	  printf ("  input frequency used to perform test = %.2f Hz (SR = 44100.0 Hz)\n", test_frequency);
	  printf ("  max difference between correct and computed output: %f = %f dB\n", max_diff, max_diff_db);
	}
    }
  else if (TEST == TEST_IMPULSE)
    {
      input[0] = 1;
      for (int i = 1; i < block_size; i++)
	{
	  input[i] = 0;
	  output[i] = 0;
	  output2[i] = 0;
	}

      if (RESAMPLE == RES_DOWNSAMPLE || RESAMPLE == RES_SUBSAMPLE)
	{
	  downs.process_block (input, block_size, output);
	  if (RESAMPLE == RES_SUBSAMPLE)
	    ups.process_block (output, block_size / 2, output2);
	}
      if (RESAMPLE == RES_UPSAMPLE || RESAMPLE == RES_OVERSAMPLE)
	{
	  ups.process_block (input, block_size, output);
	  if (RESAMPLE == RES_OVERSAMPLE)
	    downs.process_block (output, block_size * 2, output2);
	}

      float *check = output;
      if (RESAMPLE == RES_OVERSAMPLE || RESAMPLE == RES_SUBSAMPLE)
	check = output2;

      for (int i = 0; i < block_size; i++)
	printf ("%.17f\n", check[i]);
    }
}

template <int TEST, int RESAMPLE> int perform_test()
{
  switch (impl_type)
    {
    case IMPL_SSE:    printf ("using SSE instructions\n"); return perform_test<TEST, RESAMPLE, IMPL_SSE> ();
    case IMPL_NORMAL: printf ("using FPU instructions\n"); return perform_test<TEST, RESAMPLE, IMPL_NORMAL> ();
    }
}

template <int TEST> int perform_test ()
{
  switch (resample_type)
    {
    case RES_DOWNSAMPLE:  printf ("for factor 2 downsampling "); return perform_test<TEST, RES_DOWNSAMPLE> ();
    case RES_UPSAMPLE:	  printf ("for factor 2 upsampling "); return perform_test<TEST, RES_UPSAMPLE> ();
    case RES_SUBSAMPLE:	  printf ("for factor 2 subsampling "); return perform_test<TEST, RES_SUBSAMPLE> ();
    case RES_OVERSAMPLE:  printf ("for factor 2 oversampling "); return perform_test<TEST, RES_OVERSAMPLE> ();
    }
}

int perform_test()
{
  switch (test_type)
    {
    case TEST_PERFORMANCE:  printf ("performance test "); return perform_test<TEST_PERFORMANCE> ();
    case TEST_ACCURACY:	    printf ("accuracy test "); return perform_test<TEST_ACCURACY> ();
    case TEST_GNUPLOT:	    printf ("# gnuplot test "); return perform_test<TEST_GNUPLOT> ();
    case TEST_IMPULSE:	    printf ("# impulse response test "); return perform_test<TEST_IMPULSE> ();
    }
}

int main (int argc, char **argv)
{
  if (argc >= 2)
    {
      for (int i = 0; i < strlen (argv[1]); i++)
	{
	  switch (argv[1][i])
	    {
	    case 'p': test_type = TEST_PERFORMANCE; break;
	    case 'a': test_type = TEST_ACCURACY; break;
	    case 'g': test_type = TEST_GNUPLOT; break;
	    case 'i': test_type = TEST_IMPULSE; break;

	    case 'd': resample_type = RES_DOWNSAMPLE; break;
	    case 'u': resample_type = RES_UPSAMPLE; break;
	    case 's': resample_type = RES_SUBSAMPLE; break;
	    case 'o': resample_type = RES_OVERSAMPLE; break;

	    case 'f': impl_type = IMPL_SSE; break;
	    }
	}
    }

  if (argc >= 3)
    block_size = atoi (argv[2]);

  if (argc >= 4)
    test_frequency = atoi (argv[3]);

  if ((block_size & 1) == 1)
    block_size++;

  if (block_size < 2)
    block_size = 2;

  if (test_type == TEST_NONE || resample_type == RES_NONE)
    exit_usage (1);

  return perform_test();
}

Attachment: ssefirtests.sh
Description: Bourne shell script

# g++-4.0 -funroll-loops -O3    -c -o ssefir.o ssefir.cc
# g++-4.0 -funroll-loops -O3  -o ssefir ssefir.o

performance test for factor 2 upsampling using FPU instructions
  (performance will be normalized to upsampler input samples)
  total samples processed = 64000000
  processing_time = 3.582959
  samples / second = 17862330.233790
  which means the resampler can process 405.04 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.246888 % CPU usage
performance test for factor 2 upsampling using SSE instructions
  (performance will be normalized to upsampler input samples)
  total samples processed = 64000000
  processing_time = 1.268400
  samples / second = 50457270.836486
  which means the resampler can process 1144.16 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.087401 % CPU usage
performance test for factor 2 downsampling using FPU instructions
  (performance will be normalized to downsampler output samples)
  total samples processed = 32000000
  processing_time = 2.099055
  samples / second = 15244955.091818
  which means the resampler can process 345.69 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.289276 % CPU usage
performance test for factor 2 downsampling using SSE instructions
  (performance will be normalized to downsampler output samples)
  total samples processed = 32000000
  processing_time = 0.826667
  samples / second = 38709658.514582
  which means the resampler can process 877.77 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.113925 % CPU usage
performance test for factor 2 oversampling using FPU instructions
  total samples processed = 64000000
  processing_time = 7.717842
  samples / second = 8292473.356379
  which means the resampler can process 188.04 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.531808 % CPU usage
performance test for factor 2 oversampling using SSE instructions
  total samples processed = 64000000
  processing_time = 2.863337
  samples / second = 22351542.660578
  which means the resampler can process 506.84 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.197302 % CPU usage
accuracy test for factor 2 upsampling using FPU instructions
  input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz)
  max difference between correct and computed output: 0.000012 = -98.194477 dB
accuracy test for factor 2 upsampling using SSE instructions
  input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz)
  max difference between correct and computed output: 0.000012 = -98.080477 dB
accuracy test for factor 2 downsampling using FPU instructions
  input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz)
  max difference between correct and computed output: 0.000006 = -104.811480 dB
accuracy test for factor 2 downsampling using SSE instructions
  input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz)
  max difference between correct and computed output: 0.000006 = -104.761055 dB
accuracy test for factor 2 oversampling using FPU instructions
  input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz)
  max difference between correct and computed output: 0.000012 = -98.194477 dB
accuracy test for factor 2 oversampling using SSE instructions
  input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz)
  max difference between correct and computed output: 0.000012 = -98.080477 dB

# g++-3.4 -funroll-loops -O3    -c -o ssefir.o ssefir.cc
# g++-3.4 -funroll-loops -O3  -o ssefir ssefir.o

performance test for factor 2 upsampling using FPU instructions
  (performance will be normalized to upsampler input samples)
  total samples processed = 64000000
  processing_time = 3.797012
  samples / second = 16855359.553811
  which means the resampler can process 382.21 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.261638 % CPU usage
performance test for factor 2 upsampling using SSE instructions
  (performance will be normalized to upsampler input samples)
  total samples processed = 64000000
  processing_time = 1.332213
  samples / second = 48040368.623546
  which means the resampler can process 1089.35 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.091798 % CPU usage
performance test for factor 2 downsampling using FPU instructions
  (performance will be normalized to downsampler output samples)
  total samples processed = 32000000
  processing_time = 2.284448
  samples / second = 14007760.860869
  which means the resampler can process 317.64 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.314825 % CPU usage
performance test for factor 2 downsampling using SSE instructions
  (performance will be normalized to downsampler output samples)
  total samples processed = 32000000
  processing_time = 0.913186
  samples / second = 35042155.471063
  which means the resampler can process 794.61 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.125848 % CPU usage
performance test for factor 2 oversampling using FPU instructions
  total samples processed = 64000000
  processing_time = 8.265052
  samples / second = 7743448.102385
  which means the resampler can process 175.59 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.569514 % CPU usage
performance test for factor 2 oversampling using SSE instructions
  total samples processed = 64000000
  processing_time = 3.125921
  samples / second = 20473965.840909
  which means the resampler can process 464.26 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.215395 % CPU usage
accuracy test for factor 2 upsampling using FPU instructions
  input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz)
  max difference between correct and computed output: 0.000012 = -98.194477 dB
accuracy test for factor 2 upsampling using SSE instructions
  input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz)
  max difference between correct and computed output: 0.000012 = -98.080477 dB
accuracy test for factor 2 downsampling using FPU instructions
  input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz)
  max difference between correct and computed output: 0.000006 = -104.811480 dB
accuracy test for factor 2 downsampling using SSE instructions
  input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz)
  max difference between correct and computed output: 0.000006 = -104.761055 dB
accuracy test for factor 2 oversampling using FPU instructions
  input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz)
  max difference between correct and computed output: 0.000012 = -98.194477 dB
accuracy test for factor 2 oversampling using SSE instructions
  input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz)
  max difference between correct and computed output: 0.000012 = -98.080477 dB

Follow-Ups:
- Re: Fast factor 2 resampling
  - From: Tim Janik

[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]