Description
In my audio DSP application, the simple function that applies a biquad filter to a waveform sometimes runs much slower than expected.
The results are correct, just the performance is affected.
This only happens on Intel CPUs, when compiling with the default target settings.
See the discussion and investigation here: https://users.rust-lang.org/t/unexplained-order-of-magnitude-drop-in-performance
The affected function:
fn process_single(&mut self, input: f64) -> f64 {
let out = self.s1 + self.coeffs.b0 * input;
self.s1 = self.s2 + self.coeffs.b1 * input - self.coeffs.a1 * out;
self.s2 = self.coeffs.b2 * input - self.coeffs.a2 * out;
out
}
#[inline(never)]
fn process_waveform(&mut self, waveform: &mut [f64]) {
for item in waveform.iter_mut() {
*item = self.process_single(*item);
}
}
This is taken from the example published here: https://github.com/HEnquist/sse_subnormal_issue
The slowness has been traced down to subnormal numbers. This slows down Intel CPUs by an order of magnitude, while AMD cpus are nearly unaffected.
There are however no subnormals in any of the involved variables. The problem comes from that the compiler uses an SSE2 vector multiplication to do a single multiplication, without ensuring that the unused lane in the registers contain proper floating point values. The speed depends on what happens to be in the xmm4 register when entering the function. When it's a "bad" value, performance is terrible.
This is the assembly of the process_waveform function:
00000000000085f0 <_ZN7xmmtest6Biquad16process_waveform17hff1442b2f1181681E>:
85f0: f2 0f 10 47 20 movsd 0x20(%rdi),%xmm0
85f5: f2 0f 10 57 28 movsd 0x28(%rdi),%xmm2
85fa: 66 0f 10 0f movupd (%rdi),%xmm1
85fe: 66 0f 10 5f 10 movupd 0x10(%rdi),%xmm3
8603: 66 0f 16 67 30 movhpd 0x30(%rdi),%xmm4
8608: 31 c0 xor %eax,%eax
860a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
8610: f2 0f 10 2c c6 movsd (%rsi,%rax,8),%xmm5
8615: 66 0f 28 f0 movapd %xmm0,%xmm6
8619: f2 0f 59 f5 mulsd %xmm5,%xmm6
861d: f2 0f 58 f1 addsd %xmm1,%xmm6
8621: f2 0f 11 34 c6 movsd %xmm6,(%rsi,%rax,8)
8626: 48 8d 48 01 lea 0x1(%rax),%rcx
862a: 66 0f 28 fa movapd %xmm2,%xmm7
862e: f2 0f 59 fd mulsd %xmm5,%xmm7
8632: 66 0f 15 c9 unpckhpd %xmm1,%xmm1
8636: 66 0f 58 cf addpd %xmm7,%xmm1
863a: 66 0f 14 fd unpcklpd %xmm5,%xmm7
863e: 66 0f 59 fc mulpd %xmm4,%xmm7
8642: 66 0f c6 cf 02 shufpd $0x2,%xmm7,%xmm1
8647: 66 0f 14 f6 unpcklpd %xmm6,%xmm6
864b: 66 0f 59 f3 mulpd %xmm3,%xmm6
864f: 66 0f 5c ce subpd %xmm6,%xmm1
8653: 48 89 c8 mov %rcx,%rax
8656: 48 81 f9 00 04 00 00 cmp $0x400,%rcx
865d: 75 b1 jne 8610 <_ZN7xmmtest6Biquad16process_waveform17hff1442b2f1181681E+0x20>
865f: 66 0f 11 0f movupd %xmm1,(%rdi)
8663: c3 ret
8664: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
866b: 00 00 00
866e: 66 90 xchg %ax
The problem occurs in xmm4. This register only occurs twice, first when loading a value into the high lane, which leaves the lower lane unchanged:
8603: 66 0f 16 67 30 movhpd 0x30(%rdi),%xmm4
Then it is multiplied with xmm7, storing the result in xmm7:
863e: 66 0f 59 fc mulpd %xmm4,%xmm7
Finally, only the high lane is copied from xmm7 to xmm1, and the garbage result in the lower lane is never used:
8642: 66 0f c6 cf 02 shufpd $0x2,%xmm7,%xmm1
This gives the correct result, but the unused lower lane will multiply whatever was in xmm4 before entering the function. There is a high likelihood that this is not a normal floating point value, for example uint64 value 2 gets interpreted as a very small subnormal number (9.88e-324). And this makes the multiplication very slow.
I suppose it uses the MULPD instruction to keep using SSE and avoiding mixing SSE and normal FPU instructions. But when doing this, I would expect it to ensure that any unused lane contains good values that don't cause any trouble.
Meta
rustc --version --verbose
:
rustc 1.75.0-nightly (e0d7ed1f4 2023-10-01)
binary: rustc
commit-hash: e0d7ed1f453fb54578cc96dfea859b0e7be15016
commit-date: 2023-10-01
host: x86_64-unknown-linux-gnu
release: 1.75.0-nightly
LLVM version: 17.0.2
and
rustc 1.72.1 (d5c2e9c34 2023-09-13)
binary: rustc
commit-hash: d5c2e9c342b358556da91d61ed4133f6f50fc0c3
commit-date: 2023-09-13
host: x86_64-unknown-linux-gnu
release: 1.72.1
LLVM version: 16.0.5
Backtrace
(compiles without issue, nothing to see)