I'm currently writing code that counts the occurrences of a particular byte (newlines in this example) in a large byte slice (>1 GB):
```rust
fn count(b: &[u8]) -> usize {
    b.iter().filter(|&&x| x == b'\n').count()
}
```
When compiled with `-C target-feature=+avx2`, AVX2 instructions are emitted by autovectorization, but the result is still around 2x slower than `bytecount`.
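For reference, here is the scalar baseline as a self-contained program with a quick correctness check; the sample data is mine, not from the benchmark:

```rust
// Scalar baseline: count newline bytes in a slice.
fn count(b: &[u8]) -> usize {
    b.iter().filter(|&&x| x == b'\n').count()
}

fn main() {
    let data: &[u8] = b"alpha\nbeta\ngamma\n";
    assert_eq!(count(data), 3);
    assert_eq!(count(b"no newlines here"), 0);
}
```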
Using `portable_simd`, the code can be made faster:
```rust
#![feature(portable_simd)]
use std::simd::prelude::*;

fn count_simd(b: &[u8]) -> usize {
    // Split the slice into an unaligned head/tail and an aligned middle
    // of 64-byte SIMD vectors.
    let (begin, mid, end) = b.as_simd::<64>();
    count(begin)
        + count(end)
        + mid
            .iter()
            .map(|x| {
                // 1 per matching lane, then horizontal sum (max 64, fits in u8).
                x.simd_eq(Simd::splat(b'\n'))
                    .select(Simd::splat(1u8), Simd::splat(0u8))
                    .reduce_sum()
            })
            .map(|x| x as usize)
            .sum::<usize>()
}
```
This has performance similar to `bytecount::count`.
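The same lane-counting idea can also be expressed on stable Rust with a plain fixed-width array that LLVM autovectorizes; this is a sketch of the technique, not code from the issue, and the function name, lane count, and flush interval are my own choices:

```rust
// Count occurrences of `needle` in `haystack` on stable Rust.
// The fixed-width inner loop over `LANES` u8 counters mirrors the
// portable_simd version's per-lane accumulation.
fn count_byte(haystack: &[u8], needle: u8) -> usize {
    const LANES: usize = 32;
    let split = haystack.len() - haystack.len() % LANES;
    let (body, tail) = haystack.split_at(split);
    let mut total = 0usize;
    // Flush the u8 lane counters every 255 rounds so they cannot overflow.
    for group in body.chunks(LANES * 255) {
        let mut lanes = [0u8; LANES];
        for chunk in group.chunks_exact(LANES) {
            for i in 0..LANES {
                lanes[i] += (chunk[i] == needle) as u8;
            }
        }
        total += lanes.iter().map(|&c| c as usize).sum::<usize>();
    }
    // Scalar cleanup for the unaligned remainder.
    total + tail.iter().filter(|&&x| x == needle).count()
}

fn main() {
    assert_eq!(count_byte(b"a\nbb\n\nccc\n", b'\n'), 4);
}
```

The 255-round flush is what lets the counters stay 8 bits wide, which keeps the whole accumulation inside vector registers.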
Labels
- Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
- Area: Autovectorization, which can impact perf or code size
- Category: An issue highlighting optimization opportunities or PRs implementing such
- Relevant to the compiler team, which will review and decide on the PR/issue.