Description
Update (2025-02-03)
The semantics of round
will not be changed. However, the docs for round
should point out that on most hardware, round_ties_even
is faster.
Original issue
The scalar fallback for the sinewave benchmark in fearless_simd is very slow as of the current commit, and the reason is the f32::round() operation. When that's changed to (x + 0.5).floor() it goes from 1622ns to 347ns, and 205ns with target_cpu=haswell. With default x86_64 cpu, floorf() is a function call, but it's an efficient one. The asm of roundf() that I looked at was very unoptimized (it moved the float value into int registers and did bit fiddling there). In addition, round() doesn't get auto-vectorized, but floor() does.
I think there's a rich and sordid history behind this. The C standard library has 3 different functions for rounding: round
, rint
, and nearbyint
. Of these, the first rounds values with a 0.5 fraction away from zero, and the other two use the stateful rounding direction mode. This last is arguably a wart on C and it's a good thing the idea doesn't exist in Rust. In any case, the default value is FE_TONEAREST, which rounds these values to the nearest even integer (see Gnu libc documentation and Wikipedia; the latter does a reasonably good job of motivating why you'd want to do this, the tl;dr is that it avoids some biases).
The implementation of f32::floor is usually intrinsics::floorf32 (but it's intrinsics::floorf64 on msvc, for reasons described there). That in turn is llvm.floor.f32. Generally the other round functions are similar, til it gets to llvm. Inside llvm, one piece of evidence that "round" is special is that it's not listed in the list of instrinsics that get auto-vectorized.
Neither the C standard library nor llvm intrinsics have a function that rounds with "round half to even" behavior. This is arguably a misfeature. A case can be made that Rust should have this function; in cases where a recent Intel CPU is set as target_cpu or target_feature, it compiles to roundps $8
(analogous to $9
and $a
for floor and ceil, respectively), and in compatibility mode the asm shouldn't be any slower than the existing code. I haven't investigated non-x86 architectures though.
For signal processing (the main use case of fearless_simd) I don't care much about the details of rounding of exactly 0.5 fraction values, and just want rounding to be fast. Thus, I think I'll use the _mm_round intrinsics in simd mode (with round half to even behavior) and (x + 0.5).floor() in fallback mode (with round half up behavior). It's not the case now (where I call f32::round) that the rounding behavior matches the SIMD case anyway. If there were a function with "round half to even" behavior, it would match the SIMD, would auto-vectorize well, and would have dramatically better performance with modern target_cpu.