Did you find out why this is needed?
I.e., did you peek at the generated assembly to figure out whether -ffast-math-related options allow transformations that could become problematic for us in the long term?
Short answer:
I spent some time debugging it: -mfma is causing this. Removing the flag gives us the old behaviour; with -mfma, clang uses fused multiply-add instructions in the inner loop of the resampler. I don't believe this optimization has a negative impact for us, so adjusting the threshold is the sane thing to do here.
Long answer:
Let's first look at what exactly is failing here. We have (in explicit form):
$ out/tests/suite1 --resampler accuracy --fpu --precision=24 --subsample --freq-scan=90,9000,983 --freq-scan-verbose --verbose
############## clang9 without fma
# accuracy test for factor 2 subsampling using FPU instructions
# input frequency range used [ 90.00 Hz, 9000.00 Hz ] (SR = 44100.0 Hz, freq increment = 983.00)
90.00000000000000000 -129.45954766716391759
1073.00000000000000000 -130.16677118621706200
2056.00000000000000000 -129.70751203525446726
3039.00000000000000000 -132.36528831269109219
4022.00000000000000000 -128.39849621726884266
5005.00000000000000000 -128.95512230052295877
5988.00000000000000000 -131.49506213647262598
6971.00000000000000000 -131.66641240173382243
7954.00000000000000000 -131.26927901299603718
8937.00000000000000000 -134.06944494072371299
# max difference between correct and computed output: 0.000000 = -128.398496 dB
############## clang9 with fma
# accuracy test for factor 2 subsampling using FPU instructions
# input frequency range used [ 90.00 Hz, 9000.00 Hz ] (SR = 44100.0 Hz, freq increment = 983.00)
90.00000000000000000 -129.87588683579858184
1073.00000000000000000 -126.11881934685534645
2056.00000000000000000 -128.02398873831205606
3039.00000000000000000 -128.56669380739046460
4022.00000000000000000 -124.92935085933235939
5005.00000000000000000 -128.10076816664792432
5988.00000000000000000 -128.49709000891502342
6971.00000000000000000 -127.39047422816582866
7954.00000000000000000 -127.65962334132801459
8937.00000000000000000 -130.40537251311835121
# max difference between correct and computed output: 0.000001 = -124.929351 dB
So we're testing 24-bit downsampling followed by upsampling with FPU instructions here. A 4022 Hz sine wave, for instance, performs worse with FMA. Note that the 24-bit test is the most problematic one we have, since the accuracy of the floating point computations is barely good enough to reliably evaluate the convolution of the large FIR filter. Anyway, if we look at the source and assembly of the FPU code, we see the difference:
Source Code:
template<class Accumulator> static inline Accumulator
fir_process_one_sample (const float *input,
                        const float *taps, /* [0..order-1] */
                        const uint   order)
{
  Accumulator out = 0;
  for (uint i = 0; i < order; i++)
    out += input[i] * taps[i];
  return out;
}
Both assembly dumps use loop unrolling, so I'm truncating them. I also only show the upsampling step here; downsampling looks the same.
Code generated with clang9 & -mfma
0000000000000000 <Bse::Resampler2::Upsampler2<52u, true>::process_sample_unaligned(float const*, float*)>:
0: 48 8b 47 08 mov 0x8(%rdi),%rax
4: c5 fa 10 06 vmovss (%rsi),%xmm0
8: c5 fa 10 4e 04 vmovss 0x4(%rsi),%xmm1
d: c5 f2 59 48 04 vmulss 0x4(%rax),%xmm1,%xmm1
12: c4 e2 79 b9 08 vfmadd231ss (%rax),%xmm0,%xmm1
17: c5 fa 10 46 08 vmovss 0x8(%rsi),%xmm0
1c: c4 e2 71 99 40 08 vfmadd132ss 0x8(%rax),%xmm1,%xmm0
22: c5 fa 10 4e 0c vmovss 0xc(%rsi),%xmm1
27: c4 e2 79 99 48 0c vfmadd132ss 0xc(%rax),%xmm0,%xmm1
2d: c5 fa 10 46 10 vmovss 0x10(%rsi),%xmm0
32: c4 e2 71 99 40 10 vfmadd132ss 0x10(%rax),%xmm1,%xmm0
38: c5 fa 10 4e 14 vmovss 0x14(%rsi),%xmm1
3d: c4 e2 79 99 48 14 vfmadd132ss 0x14(%rax),%xmm0,%xmm1
43: c5 fa 10 46 18 vmovss 0x18(%rsi),%xmm0
48: c4 e2 71 99 40 18 vfmadd132ss 0x18(%rax),%xmm1,%xmm0
...
Code generated with clang9 without -mfma
0000000000000000 <Bse::Resampler2::Upsampler2<52u, true>::process_sample_unaligned(float const*, float*)>:
0: 48 8b 47 08 mov 0x8(%rdi),%rax
4: c5 fa 10 06 vmovss (%rsi),%xmm0
8: c5 fa 10 4e 04 vmovss 0x4(%rsi),%xmm1
d: c5 fa 59 00 vmulss (%rax),%xmm0,%xmm0
11: c5 f2 59 48 04 vmulss 0x4(%rax),%xmm1,%xmm1
16: c5 fa 58 c1 vaddss %xmm1,%xmm0,%xmm0
1a: c5 fa 10 4e 08 vmovss 0x8(%rsi),%xmm1
1f: c5 f2 59 48 08 vmulss 0x8(%rax),%xmm1,%xmm1
24: c5 fa 10 56 0c vmovss 0xc(%rsi),%xmm2
29: c5 ea 59 50 0c vmulss 0xc(%rax),%xmm2,%xmm2
2e: c5 f2 58 ca vaddss %xmm2,%xmm1,%xmm1
32: c5 fa 58 c1 vaddss %xmm1,%xmm0,%xmm0
36: c5 fa 10 4e 10 vmovss 0x10(%rsi),%xmm1
3b: c5 f2 59 48 10 vmulss 0x10(%rax),%xmm1,%xmm1
40: c5 fa 10 56 14 vmovss 0x14(%rsi),%xmm2
45: c5 ea 59 50 14 vmulss 0x14(%rax),%xmm2,%xmm2
4a: c5 f2 58 ca vaddss %xmm2,%xmm1,%xmm1
4e: c5 fa 10 56 18 vmovss 0x18(%rsi),%xmm2
53: c5 ea 59 50 18 vmulss 0x18(%rax),%xmm2,%xmm2
58: c5 f2 58 ca vaddss %xmm2,%xmm1,%xmm1
5c: c5 fa 58 c1 vaddss %xmm1,%xmm0,%xmm0
60: c5 fa 10 4e 1c vmovss 0x1c(%rsi),%xmm1
65: c5 f2 59 48 1c vmulss 0x1c(%rax),%xmm1,%xmm1
...
So the difference here is that without -mfma we multiply and add in two steps, rounding down to float precision after each step. With -mfma we multiply and add in one step (the intermediate product is kept at "infinite resolution") and round down to float precision only once. This means we get a different result. That in this particular case the -mfma code performs worse than the version using individual multiply/add instructions is probably chance: "different result" can mean better or worse, given the somewhat random errors caused by the limited precision of floating point math. But both appear to be valid resamplers, and both appear to be permitted translations of the C++ code to assembly.
How does it perform?
Finally, just for fun, let's benchmark things.
clang9 & -mfma:
$ out/tests/suite1 --resampler perf --fpu --precision=24 --subsample
performance test for factor 2 subsampling using FPU instructions
total samples processed = 64000000
processing_time = 1.333207
samples / second = 48004552.319213
which means the resampler can process 1088.54 44100 Hz streams simultaneusly
or one 44100 Hz stream takes 0.091866 % CPU usage
clang9 without -mfma:
$ out/tests/suite1 --resampler perf --fpu --precision=24 --subsample
performance test for factor 2 subsampling using FPU instructions
total samples processed = 64000000
processing_time = 0.792904
samples / second = 80715936.375136
which means the resampler can process 1830.29 44100 Hz streams simultaneusly
or one 44100 Hz stream takes 0.054636 % CPU usage
clang9 with -mfma with SSE:
$ out/tests/suite1 --resampler perf --precision=24 --subsample
performance test for factor 2 subsampling using SSE instructions
total samples processed = 64000000
processing_time = 0.449175
samples / second = 142483403.990601
which means the resampler can process 3230.92 44100 Hz streams simultaneusly
or one 44100 Hz stream takes 0.030951 % CPU usage
clang9 without -mfma with SSE:
$ out/tests/suite1 --resampler perf --precision=24 --subsample
performance test for factor 2 subsampling using SSE instructions
total samples processed = 64000000
processing_time = 0.344894
samples / second = 185564296.725403
which means the resampler can process 4207.81 44100 Hz streams simultaneusly
or one 44100 Hz stream takes 0.023765 % CPU usage
Remarks:
The SSE implementation is always faster than the FPU implementation. The best throughput is provided by the SSE version without -mfma.
On my Ryzen 7 machine, the "optimizations" clang9 performs with -mfma make the code slower. I'd have assumed that reducing the instruction count would have a positive effect here. However, maybe mulss and addss are faster because they don't require the "infinite precision" intermediate, so they are cheaper to implement on the CPU.
There is also an effect of -mfma on the SSE code. A quick investigation with perf showed that this is because the test code intentionally resamples non-SSE-aligned memory, so the unaligned parts need to be computed using the FPU code path (where -mfma has an effect).