Re: [tim-janik/beast] TESTS: testresampler: fix resampler tests for clang9 (#140)



> Did you find out why this is needed?
> I.e. peeked at the generated assembly to figure if -ffast-math related options possibly allow transformations that could become problematic for us in the long term?

Short answer:
I spent some time debugging this. The -mfma flag is the cause: removing it gives us the old behaviour. With -mfma, clang uses fused multiply-add instructions in the inner loop of the resampler. I don't believe that this optimization has a negative impact for us, so adjusting the threshold is the sane thing to do here.

Long answer:

Let's first look at what exactly is failing here. We have (in explicit form):

$ out/tests/suite1 --resampler accuracy --fpu --precision=24 --subsample --freq-scan=90,9000,983 --freq-scan-verbose --verbose

############## clang9 without fma

# accuracy test for factor 2 subsampling using FPU instructions
#   input frequency range used [ 90.00 Hz, 9000.00 Hz ] (SR = 44100.0 Hz, freq increment = 983.00)
90.00000000000000000 -129.45954766716391759
1073.00000000000000000 -130.16677118621706200
2056.00000000000000000 -129.70751203525446726
3039.00000000000000000 -132.36528831269109219
4022.00000000000000000 -128.39849621726884266
5005.00000000000000000 -128.95512230052295877
5988.00000000000000000 -131.49506213647262598
6971.00000000000000000 -131.66641240173382243
7954.00000000000000000 -131.26927901299603718
8937.00000000000000000 -134.06944494072371299
#   max difference between correct and computed output: 0.000000 = -128.398496 dB

############## clang9 with fma

# accuracy test for factor 2 subsampling using FPU instructions
#   input frequency range used [ 90.00 Hz, 9000.00 Hz ] (SR = 44100.0 Hz, freq increment = 983.00)
90.00000000000000000 -129.87588683579858184
1073.00000000000000000 -126.11881934685534645
2056.00000000000000000 -128.02398873831205606
3039.00000000000000000 -128.56669380739046460
4022.00000000000000000 -124.92935085933235939
5005.00000000000000000 -128.10076816664792432
5988.00000000000000000 -128.49709000891502342
6971.00000000000000000 -127.39047422816582866
7954.00000000000000000 -127.65962334132801459
8937.00000000000000000 -130.40537251311835121
#   max difference between correct and computed output: 0.000001 = -124.929351 dB
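For reference, the dB figures in these listings appear consistent with the usual amplitude conversion 20 * log10 (difference). A minimal sketch of that conversion follows; the helper names amplitude_to_db and db_to_amplitude are made up for illustration and are not taken from the test suite:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not from testresampler): convert a linear
// amplitude difference to decibels, assuming the 20 * log10 convention.
inline double
amplitude_to_db (double amplitude)
{
  assert (amplitude > 0);       // log10 is only meaningful for positive input
  return 20.0 * std::log10 (amplitude);
}

// Hypothetical inverse helper: convert decibels back to a linear amplitude.
inline double
db_to_amplitude (double db)
{
  return std::pow (10.0, db / 20.0);
}
```

Under that assumption, -124.929351 dB corresponds to a difference of roughly 5.7e-7, which a six-digit fixed-point format prints as 0.000001, while -128.398496 dB is roughly 3.8e-7 and prints as 0.000000.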

So we're testing 24 bit downsampling followed by upsampling with FPU instructions here. With fma, the 4022 Hz sine wave performs worse (-124.93 dB instead of -128.40 dB). Note that 24 bit is the most problematic test we have, since the accuracy of the floating point computations is not really good enough to reliably evaluate the convolution with the large FIR filter. Anyway, if we look at the source and assembly of the FPU code, we'll see the difference:

Source Code:

template<class Accumulator> static inline Accumulator
fir_process_one_sample (const float *input,
                        const float *taps, /* [0..order-1] */
                        const uint   order)
{
  Accumulator out = 0;
  for (uint i = 0; i < order; i++)
    out += input[i] * taps[i];
  return out;
}

Both assembly dumps use loop unrolling, so I'm truncating the assembly code. Also I only show the upsampling step here, but downsampling looks the same.

Code generated with clang9 & -mfma

0000000000000000 <Bse::Resampler2::Upsampler2<52u, true>::process_sample_unaligned(float const*, float*)>:
   0:   48 8b 47 08             mov    0x8(%rdi),%rax
   4:   c5 fa 10 06             vmovss (%rsi),%xmm0
   8:   c5 fa 10 4e 04          vmovss 0x4(%rsi),%xmm1
   d:   c5 f2 59 48 04          vmulss 0x4(%rax),%xmm1,%xmm1
  12:   c4 e2 79 b9 08          vfmadd231ss (%rax),%xmm0,%xmm1
  17:   c5 fa 10 46 08          vmovss 0x8(%rsi),%xmm0
  1c:   c4 e2 71 99 40 08       vfmadd132ss 0x8(%rax),%xmm1,%xmm0
  22:   c5 fa 10 4e 0c          vmovss 0xc(%rsi),%xmm1
  27:   c4 e2 79 99 48 0c       vfmadd132ss 0xc(%rax),%xmm0,%xmm1
  2d:   c5 fa 10 46 10          vmovss 0x10(%rsi),%xmm0
  32:   c4 e2 71 99 40 10       vfmadd132ss 0x10(%rax),%xmm1,%xmm0
  38:   c5 fa 10 4e 14          vmovss 0x14(%rsi),%xmm1
  3d:   c4 e2 79 99 48 14       vfmadd132ss 0x14(%rax),%xmm0,%xmm1
  43:   c5 fa 10 46 18          vmovss 0x18(%rsi),%xmm0
  48:   c4 e2 71 99 40 18       vfmadd132ss 0x18(%rax),%xmm1,%xmm0
...

Code generated with clang9 without -mfma

0000000000000000 <Bse::Resampler2::Upsampler2<52u, true>::process_sample_unaligned(float const*, float*)>:
   0:   48 8b 47 08             mov    0x8(%rdi),%rax
   4:   c5 fa 10 06             vmovss (%rsi),%xmm0
   8:   c5 fa 10 4e 04          vmovss 0x4(%rsi),%xmm1
   d:   c5 fa 59 00             vmulss (%rax),%xmm0,%xmm0
  11:   c5 f2 59 48 04          vmulss 0x4(%rax),%xmm1,%xmm1
  16:   c5 fa 58 c1             vaddss %xmm1,%xmm0,%xmm0
  1a:   c5 fa 10 4e 08          vmovss 0x8(%rsi),%xmm1
  1f:   c5 f2 59 48 08          vmulss 0x8(%rax),%xmm1,%xmm1
  24:   c5 fa 10 56 0c          vmovss 0xc(%rsi),%xmm2
  29:   c5 ea 59 50 0c          vmulss 0xc(%rax),%xmm2,%xmm2
  2e:   c5 f2 58 ca             vaddss %xmm2,%xmm1,%xmm1
  32:   c5 fa 58 c1             vaddss %xmm1,%xmm0,%xmm0
  36:   c5 fa 10 4e 10          vmovss 0x10(%rsi),%xmm1
  3b:   c5 f2 59 48 10          vmulss 0x10(%rax),%xmm1,%xmm1
  40:   c5 fa 10 56 14          vmovss 0x14(%rsi),%xmm2
  45:   c5 ea 59 50 14          vmulss 0x14(%rax),%xmm2,%xmm2
  4a:   c5 f2 58 ca             vaddss %xmm2,%xmm1,%xmm1
  4e:   c5 fa 10 56 18          vmovss 0x18(%rsi),%xmm2
  53:   c5 ea 59 50 18          vmulss 0x18(%rax),%xmm2,%xmm2
  58:   c5 f2 58 ca             vaddss %xmm2,%xmm1,%xmm1
  5c:   c5 fa 58 c1             vaddss %xmm1,%xmm0,%xmm0
  60:   c5 fa 10 4e 1c          vmovss 0x1c(%rsi),%xmm1
  65:   c5 f2 59 48 1c          vmulss 0x1c(%rax),%xmm1,%xmm1
...

So the difference here is that without -mfma we are multiplying and adding in two steps, rounding to float precision after each step.

With -mfma we are multiplying and adding in one step (with "infinite resolution"), and rounding to float precision only once, at the end.

This means that we're getting a different result. "Different" can mean better or worse: the errors caused by the limited precision of floating point math are somewhat random, and in this particular case they happen to add up less favourably for the -mfma code than for the version using individual multiply/add instructions. But both appear to be valid resamplers, and both appear to be permitted translations of C++ to assembly code.
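To make the one-rounding versus two-roundings difference concrete, here is a small sketch using std::fma. It is not the resampler code; the function names mul_then_add, fused_mul_add, dot_separate and dot_fused are made up for illustration, and the volatile temporary is only there to keep a compiler from fusing the "separate" variant on its own:

```cpp
#include <cmath>

// Two separately rounded operations, like vmulss followed by vaddss.
// The volatile temporary forces the product to be rounded to float
// and prevents contraction of the expression into an FMA.
inline float
mul_then_add (float a, float b, float c)
{
  volatile float product = a * b;   // rounding #1
  return product + c;               // rounding #2
}

// Fused multiply-add, like vfmadd*ss: a * b + c is computed exactly
// and rounded to float once.
inline float
fused_mul_add (float a, float b, float c)
{
  return std::fma (a, b, c);
}

// FIR-style dot product using separate multiply and add.
inline float
dot_separate (const float *input, const float *taps, unsigned int order)
{
  float out = 0;
  for (unsigned int i = 0; i < order; i++)
    out = mul_then_add (input[i], taps[i], out);
  return out;
}

// The same dot product using fused multiply-add.
inline float
dot_fused (const float *input, const float *taps, unsigned int order)
{
  float out = 0;
  for (unsigned int i = 0; i < order; i++)
    out = fused_mul_add (input[i], taps[i], out);
  return out;
}
```

With a = 1 + 2^-12 and c = -(1 + 2^-11), the exact value of a * a + c is 2^-24. The separate version rounds a * a to 1 + 2^-11 first and returns 0, while the fused version returns 2^-24. In this constructed case the fused result is the exact one; over a long convolution the individual differences can push the total error in either direction.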

How does it perform?

Finally, just for fun, let's benchmark things.

clang9 & -mfma:

$ out/tests/suite1 --resampler perf --fpu --precision=24 --subsample
performance test for factor 2 subsampling using FPU instructions
  total samples processed = 64000000
  processing_time = 1.333207
  samples / second = 48004552.319213
  which means the resampler can process 1088.54 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.091866 % CPU usage

clang9 without -mfma:

$ out/tests/suite1 --resampler perf --fpu --precision=24 --subsample
performance test for factor 2 subsampling using FPU instructions
  total samples processed = 64000000
  processing_time = 0.792904
  samples / second = 80715936.375136
  which means the resampler can process 1830.29 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.054636 % CPU usage

clang9 with -mfma with SSE:

$ out/tests/suite1 --resampler perf --precision=24 --subsample
performance test for factor 2 subsampling using SSE instructions
  total samples processed = 64000000
  processing_time = 0.449175
  samples / second = 142483403.990601
  which means the resampler can process 3230.92 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.030951 % CPU usage

clang9 without -mfma with SSE:

$ out/tests/suite1 --resampler perf --precision=24 --subsample
performance test for factor 2 subsampling using SSE instructions
  total samples processed = 64000000
  processing_time = 0.344894
  samples / second = 185564296.725403
  which means the resampler can process 4207.81 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.023765 % CPU usage

Remarks:

The SSE implementation is always faster than the FPU implementation. The best throughput is provided by the SSE version without -mfma.

On my Ryzen-7 machine, the "optimizations" that clang9 does with -mfma make the code slower. I would have assumed that reducing the instruction count should have a positive effect here. However, maybe mulss and addss are faster because they don't require "infinite precision", so they are cheaper to implement on the CPU.

-mfma also has an effect on the SSE results. A quick investigation with perf showed that this is due to the test code intentionally testing resampling of non-SSE-aligned memory, so that the unaligned parts need to be computed using the FPU (where -mfma has an effect).
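As an illustration of what "non-SSE-aligned" means here, the following sketch counts how many leading floats of a buffer sit before the next 16-byte boundary. This is not the Bse::Resampler2 code; the helper name unaligned_head_floats is hypothetical, and the sketch assumes 4-byte floats that are themselves 4-byte aligned:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of floats at the start of 'ptr' that lie
// before the next 16-byte boundary. Aligned 128-bit SSE loads can only
// start at such a boundary, so these samples need some other code path.
inline std::size_t
unaligned_head_floats (const float *ptr)
{
  const std::uintptr_t addr = reinterpret_cast<std::uintptr_t> (ptr);
  const std::uintptr_t misalignment = addr % 16;
  return misalignment ? (16 - misalignment) / sizeof (float) : 0;
}
```

For a 16-byte aligned buffer buf, unaligned_head_floats (buf) is 0 and unaligned_head_floats (buf + 1) is 3, so a test that deliberately starts at buf + 1 always exercises the non-SSE path for at least part of the data.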

