Description
Today I learned about the existence of the simd_reduce_add_unordered intrinsic. When called on a float vector, this compiles to LLVM's vector.reduce.fadd with the "fast" flag set, which means that passing in NaN or infinity is UB, and that optimizations are allowed "to treat the sign of a zero argument or zero result as insignificant" (which I take to mean the sign of input zeros may be swapped non-deterministically, and returned zeros have a non-deterministic sign).
This intrinsic is not used a lot in stdarch, but it has a total of 8 uses (all in avx512f.rs). 4 of these are integer intrinsics, where this should be entirely equivalent to simd_reduce_add; not sure why the "unordered" version is used. The other 4 are float intrinsics, _mm512_reduce_add_ps being the first:
/// Reduce the packed single-precision (32-bit) floating-point elements in a by addition. Returns the sum of all elements in a.
///
/// [Intel's documentation](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_reduce_add_ps&expand=4562)
#[inline]
#[target_feature(enable = "avx512f")]
#[unstable(feature = "stdarch_x86_avx512", issue = "111137")]
pub unsafe fn _mm512_reduce_add_ps(a: __m512) -> f32 {
simd_reduce_add_unordered(a.as_f32x16())
}
Neither the docs here nor Intel's docs mention that this is UB on NaN or infinity, nor do they mention the concerns around signed zeros and the addition being performed in an unspecified order. Given that the Intel docs should be the authoritative documentation (since this is a vendor intrinsic), why is it even correct to use fast-math flags here? Either the docs need to be updated to state the fast-math preconditions, or the implementation needs to be updated to avoid the fast-math flag. Maybe it should only use "reassoc", not the full but unsafe "fast" flag? But even that should probably be mentioned in the docs.
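For illustration only (this is my own sketch, not the stdarch code, and not a concrete proposal): if the implementation were changed to avoid fast-math semantics entirely, one option would be to sum the lanes in a fixed order. The left-to-right lane order below is an assumption; Intel's pseudocode may well prescribe a different (e.g. pairwise tree) order, which would round differently.

// Hypothetical sketch, not the stdarch implementation: a fixed-order
// reduction with no fast-math flags. Left-to-right lane order is an
// assumption; a pairwise tree order would be another fully defined choice.
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::__m512;

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx512f")]
pub unsafe fn reduce_add_ps_strict(a: __m512) -> f32 {
    // Reinterpret the 512-bit vector as its 16 f32 lanes.
    let lanes: [f32; 16] = unsafe { core::mem::transmute(a) };
    // Sequential addition: fully defined for NaN, infinities and signed
    // zeros, at the cost of a long dependency chain.
    lanes.into_iter().reduce(|acc, x| acc + x).unwrap()
}

The obvious tradeoff is the serial dependency chain, which is presumably why the fast/unordered version is used in the first place; that tradeoff is exactly what the docs should spell out either way.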