float rounding is slow #55107

Open
@raphlinus

Description

Update (2025-02-03)

The semantics of round will not be changed. However, the docs for round should point out that on most hardware, round_ties_even is faster.
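To make the difference concrete, here is a small sketch comparing the two methods on tie cases. It assumes a toolchain recent enough to have `f32::round_ties_even` in stable; `both_roundings` is a hypothetical helper name used only for illustration.

```rust
/// Hypothetical helper: returns (round, round_ties_even) for the same input
/// so the two tie-breaking rules can be compared side by side.
fn both_roundings(x: f32) -> (f32, f32) {
    // `round` breaks ties away from zero; `round_ties_even` breaks ties
    // toward the nearest even integer.
    (x.round(), x.round_ties_even())
}
```

The two only disagree on exact ties: `both_roundings(2.5)` gives `(3.0, 2.0)`, while `both_roundings(2.4)` gives `(2.0, 2.0)`.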

Original issue

The scalar fallback for the sinewave benchmark in fearless_simd is very slow as of the current commit, and the reason is the f32::round() operation. When that's changed to (x + 0.5).floor(), the benchmark goes from 1622ns to 347ns, and to 205ns with target_cpu=haswell. With the default x86_64 cpu, floorf() is a function call, but it's an efficient one. The asm of roundf() that I looked at was poorly optimized (it moved the float value into int registers and did bit fiddling there). In addition, round() doesn't get auto-vectorized, but floor() does.
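For reference, the replacement described above can be sketched as follows. The function names are hypothetical and this is not the fearless_simd code itself; it only illustrates the `(x + 0.5).floor()` substitution and two inputs where it differs from `f32::round`.

```rust
/// Round half up: the `(x + 0.5).floor()` replacement.
/// Ties go toward positive infinity, unlike `f32::round` (away from zero).
fn round_half_up(x: f32) -> f32 {
    (x + 0.5).floor()
}

/// Hypothetical slice helper: a plain loop over `floor`, the kind of loop
/// the issue reports as auto-vectorizing.
fn round_half_up_in_place(xs: &mut [f32]) {
    for x in xs.iter_mut() {
        *x = round_half_up(*x);
    }
}

/// The largest f32 strictly below 0.5. Adding 0.5 to it is not exactly
/// representable and rounds up to 1.0, so `round_half_up` returns 1.0
/// where `f32::round` returns 0.0.
fn just_below_half() -> f32 {
    f32::from_bits(0.5f32.to_bits() - 1)
}
```

The differences are limited to negative ties (`-2.5` becomes `-2.0` instead of `-3.0`) and inputs where `x + 0.5` itself rounds, such as `just_below_half()`. Whether that matters depends on the application.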

I think there's a rich and sordid history behind this. The C standard library has three different functions for rounding: round, rint, and nearbyint. Of these, the first rounds values with a 0.5 fraction away from zero, and the other two use the stateful rounding direction mode. This last is arguably a wart on C and it's a good thing the idea doesn't exist in Rust. In any case, the default mode is FE_TONEAREST, which rounds these values to the nearest even integer (see the GNU libc documentation and Wikipedia; the latter does a reasonably good job of motivating why you'd want to do this, the tl;dr is that it avoids some biases).

The implementation of f32::floor is usually intrinsics::floorf32 (but it's intrinsics::floorf64 on msvc, for reasons described there). That in turn is llvm.floor.f32. Generally the other rounding functions are similar, until it gets to LLVM. Inside LLVM, one piece of evidence that "round" is special is that it's not listed in the list of intrinsics that get auto-vectorized.

Neither the C standard library nor the LLVM intrinsics have a function that rounds with "round half to even" behavior. This is arguably a misfeature. A case can be made that Rust should have this function; in cases where a recent Intel CPU is set as target_cpu or target_feature, it compiles to roundps $8 (analogous to $9 and $a for floor and ceil, respectively), and in compatibility mode the asm shouldn't be any slower than the existing code. I haven't investigated non-x86 architectures though.
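As an illustration of the `roundps $8` encoding, here is a minimal sketch using the `std::arch` SSE4.1 intrinsic `_mm_round_ps`. The function names are hypothetical, and the non-SSE4.1 path uses `f32::round_ties_even`, which assumes a recent stable toolchain. The immediate `_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC` is the value 8.

```rust
/// Hypothetical helper: rounds four lanes with ties-to-even behavior,
/// using SSE4.1 when it is available at runtime.
fn round4_ties_even(x: [f32; 4]) -> [f32; 4] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.1") {
            // Safe to call: the feature was just detected.
            return unsafe { round4_sse41(x) };
        }
    }
    // Portable path with the same tie-breaking rule.
    x.map(f32::round_ties_even)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn round4_sse41(x: [f32; 4]) -> [f32; 4] {
    use std::arch::x86_64::*;
    let mut out = [0.0f32; 4];
    unsafe {
        let v = _mm_loadu_ps(x.as_ptr());
        // Immediate 0x8: round to nearest (ties to even), exceptions suppressed.
        let r = _mm_round_ps::<{ _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC }>(v);
        _mm_storeu_ps(out.as_mut_ptr(), r);
    }
    out
}
```

Both paths agree on ties, so `round4_ties_even([0.5, 1.5, 2.5, -2.5])` yields `[0.0, 2.0, 2.0, -2.0]` regardless of which one runs.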

For signal processing (the main use case of fearless_simd) I don't care much about the details of rounding of exactly 0.5 fraction values, and just want rounding to be fast. Thus, I think I'll use the _mm_round intrinsics in simd mode (with round half to even behavior) and (x + 0.5).floor() in fallback mode (with round half up behavior). The current code, which calls f32::round, doesn't match the rounding behavior of the SIMD case anyway. If there were a function with "round half to even" behavior, it would match the SIMD, would auto-vectorize well, and would have dramatically better performance with modern target_cpu.

Metadata

    Labels

    A-LLVM (Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.)
    A-docs (Area: Documentation for any part of the project, including the compiler, standard library, and tools)
    A-floating-point (Area: Floating point numbers and arithmetic)
    E-help-wanted (Call for participation: Help is requested to fix this issue.)
    I-slow (Issue: Problems and improvements with respect to performance of generated code.)
    T-libs-api (Relevant to the library API team, which will review and decide on the PR/issue.)