Proposal
Problem statement
Checking whether two target features are enabled is currently inefficient. In zlib-rs we see a 3% slowdown in one test case from checking for an additional target feature.
Performing a runtime check for two target features requires roughly double the number of instructions compared with checking for just one.
Motivating examples or use cases
In zlib-rs, we want to check for both the avx2 and bmi2 features, but that check is slower than just checking for avx2.
Looking at just the happy path (where the features are already cached and both are available):
https://godbolt.org/z/f935sP6dr
// using `pclmulqdq` here because avx2 and bmi2 use the same integer constant
pub fn foo() -> bool {
    std::is_x86_feature_detected!("pclmulqdq")
}

pub fn bar() -> bool {
    std::is_x86_feature_detected!("avx2") && std::is_x86_feature_detected!("pclmulqdq")
}
example::foo::h4a487a8c8dbb996a:
    mov rax, qword ptr [rip + std_detect::detect::cache::CACHE::h6b648acf387db542@GOTPCREL]
    mov rax, qword ptr [rax]
    test rax, rax
    je .LBB0_1
    and eax, 2
    xor ecx, ecx
    or rax, rcx
    setne al
    ret

example::bar::h1992ebebbee721d0:
    push rbx
    mov rbx, qword ptr [rip + std_detect::detect::cache::CACHE::h6b648acf387db542@GOTPCREL]
    mov rax, qword ptr [rbx]
    test rax, rax
    je .LBB1_1
    and eax, 32768
    xor ecx, ecx
    or rax, rcx
    je .LBB1_3
.LBB1_4:
    mov rax, qword ptr [rbx]
    test rax, rax
    je .LBB1_5
.LBB1_6:
    and eax, 2
    xor ecx, ecx
    or rax, rcx
    setne al
    pop rbx
    ret
So checking for 2 features roughly doubles the number of instructions, and performs 2 (atomic) loads.
This all makes sense, given that the cache is stored in an atomic, so the read value cannot be reused, and the expansion looks like this:
pub fn bar() -> bool {
    (false || ::std_detect::detect::__is_feature_detected::avx2()) &&
        (false || ::std_detect::detect::__is_feature_detected::pclmulqdq())
}
Solution sketch
I'd like the macro to expand to something like this instead, where __is_feature_detected() returns a bitmap of enabled features:
pub fn bar() -> bool {
    false || {
        let mask = ::std_detect::detect::AVX2 | ::std_detect::detect::PCLMULQDQ;
        ::std_detect::detect::__is_feature_detected() & mask == mask
    }
}
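To make the difference concrete, here is a minimal sketch that compares the two shapes against a stand-in cache. Everything in it is an assumption made for illustration: CACHE, set_detected, two_loads and one_load are hypothetical names, and the bit values are simply the immediates from the listing above (2 and 32768). The real std_detect cache layout is an implementation detail and may differ.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical stand-in for the std_detect feature cache.
static CACHE: AtomicU64 = AtomicU64::new(0);

// Bit values copied from the `and eax, ...` immediates in the listing;
// they are assumptions about the layout, not a documented API.
pub const PCLMULQDQ: u64 = 2;
pub const AVX2: u64 = 32768;

/// Overwrites the stand-in cache, as if detection had already run.
pub fn set_detected(bits: u64) {
    CACHE.store(bits, Ordering::Relaxed);
}

/// Shape of the current expansion: one atomic load per feature.
pub fn two_loads() -> bool {
    (CACHE.load(Ordering::Relaxed) & AVX2) != 0
        && (CACHE.load(Ordering::Relaxed) & PCLMULQDQ) != 0
}

/// Shape of the proposed expansion: one atomic load, one combined mask.
pub fn one_load() -> bool {
    let mask = AVX2 | PCLMULQDQ;
    (CACHE.load(Ordering::Relaxed) & mask) == mask
}
```

Both functions return the same answer for any cache value; the second simply gives the compiler a single load and a single compare to work with.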
For that to work, a single call to an is_*_feature_detected macro must be able to accept multiple target features. I can see two ways to do that:
1. is_x86_feature_detected!("avx2", "bmi2")
2. is_x86_feature_detected!("avx2,bmi2")
Option 2 has precedent in e.g. #[target_feature(enable = "avx2,bmi2")], but option 1 can (I believe) be implemented with macro_rules! and also works better with e.g. #[cfg(...)]. I personally prefer option 1.
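As a rough illustration of why option 1 seems implementable with macro_rules!, the sketch below accepts a comma-separated list of feature literals and folds them into one mask. It is not the std_detect implementation: DETECTED, set_detected, feature_bit!, is_feature_detected_all! and has_avx2_and_bmi2 are hypothetical names, and the bit positions are arbitrary. The features are captured as tt rather than literal so that the inner macro can still match them against string-literal patterns.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical stand-in for the detection cache.
static DETECTED: AtomicU64 = AtomicU64::new(0);

/// Overwrites the stand-in cache, as if detection had already run.
pub fn set_detected(bits: u64) {
    DETECTED.store(bits, Ordering::Relaxed);
}

// Hypothetical name-to-bit mapping; the positions are arbitrary.
macro_rules! feature_bit {
    ("pclmulqdq") => { 1u64 << 1 };
    ("bmi2") => { 1u64 << 9 };
    ("avx2") => { 1u64 << 15 };
}

// Accepts one or more feature names (with an optional trailing comma),
// ORs their bits into a single mask, and performs a single atomic load.
macro_rules! is_feature_detected_all {
    ($($feature:tt),+ $(,)?) => {{
        let mask: u64 = 0 $(| feature_bit!($feature))+;
        (DETECTED.load(Ordering::Relaxed) & mask) == mask
    }};
}

/// Example caller combining two features in one check.
pub fn has_avx2_and_bmi2() -> bool {
    is_feature_detected_all!("avx2", "bmi2")
}
```

An unknown feature name fails to compile because feature_bit! has no matching arm, which mirrors how the existing macros reject unrecognized names.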
Alternatives
There is a workaround:
#[inline(always)]
pub fn is_enabled_avx2_and_bmi2() -> bool {
    #[cfg(any(target_arch = "x86_64", target_arch = "x86"))]
    #[cfg(feature = "std")]
    {
        use std::sync::atomic::{AtomicU8, Ordering};
        static CACHE: AtomicU8 = AtomicU8::new(2);
        return match CACHE.load(Ordering::Relaxed) {
            0 => false,
            1 => true,
            _ => {
                let detected = std::is_x86_feature_detected!("avx2")
                    && std::is_x86_feature_detected!("bmi2");
                CACHE.store(u8::from(detected), Ordering::Relaxed);
                detected
            }
        };
    }
    false
}
Links and related work
What happens now?
This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.
Possible responses
The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):
- We think this problem seems worth solving, and the standard library might be the right place to solve it.
- We think that this probably doesn't belong in the standard library.
Second, if there's a concrete solution:
- We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
- We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.