I was looking into optimizing a function that checks that all values in a slice are in range. It is not that surprising that the version using `all` does not get autovectorized, because it returns early (although in theory Rust should be allowed to read more elements from the slice before breaking). It is surprising, however, that adding `copied` before folding makes a difference in autovectorization.
Sample code (https://rust.godbolt.org/z/5eznWbMcf):
pub fn check_range_all(keys: &[u32], max: u32) -> bool {
    keys.iter().all(|x| *x < max)
}

pub fn check_range_fold(keys: &[u32], max: u32) -> bool {
    keys.iter().fold(true, |a, x| a && *x < max)
}

pub fn check_range_copied_fold(keys: &[u32], max: u32) -> bool {
    keys.iter().copied().fold(true, |a, x| a && x < max)
}
- `check_range_all` compares one element per loop iteration. Using `copied` does not change the assembly at all (both functions are merged).
- `check_range_fold` unrolls the check 8 times. Each iteration is branchless, but it does not use any vector instructions.
- `check_range_copied_fold` uses AVX instructions and checks 32 elements per loop iteration.
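
For comparison, here is a minimal sketch of two further variants that might be worth adding to the Godbolt link. Neither is from the original sample: the function names and the `CHUNK` size of 32 are assumptions made for illustration, and I have not verified what assembly they produce. The first replaces the short-circuiting `&&` with a plain `&`. The second keeps an early exit, but only between fixed-size blocks, so each block can be checked without branches.

```rust
/// Hypothetical variant: same fold as `check_range_fold`, but with the
/// non-short-circuiting `&` so the closure body contains no branch.
pub fn check_range_fold_bitand(keys: &[u32], max: u32) -> bool {
    keys.iter().fold(true, |a, x| a & (*x < max))
}

/// Hypothetical variant: check fixed-size blocks without branching inside
/// a block, and return early only between blocks.
pub fn check_range_chunked(keys: &[u32], max: u32) -> bool {
    // Assumed block size; chosen to match the 32 elements per iteration
    // reported for `check_range_copied_fold`.
    const CHUNK: usize = 32;

    let mut chunks = keys.chunks_exact(CHUNK);
    for chunk in chunks.by_ref() {
        let ok = chunk.iter().copied().fold(true, |a, x| a & (x < max));
        if !ok {
            return false;
        }
    }
    // Fewer than CHUNK elements are left over; check them one by one.
    chunks.remainder().iter().all(|&x| x < max)
}
```

Both functions return the same result as the three functions in the sample code for any input. Whether either one is autovectorized depends on the compiler version and target features, so the generated assembly would need to be checked.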
Metadata
Labels
- Area: Autovectorization, which can impact perf or code size
- Area: Code generation
- Category: An issue highlighting optimization opportunities or PRs implementing such
- Issue: Problems and improvements with respect to binary size of generated code.
- Issue: Problems and improvements with respect to performance of generated code.