BitView SIMD filtering #5356
base: develop
Conversation
Signed-off-by: Nicholas Gates <nick@nickgates.com>
```rust
let mut read_ptr = data;
let mut write_ptr = data;

for word in mask.iter_words() {
```
This outer loop is duplicative; maybe lift it into the parent.
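A possible shape for that refactor, as a hedged scalar sketch: the parent owns the single pass over mask words and dispatches each word to a per-word kernel. `compress_word` and `filter` are hypothetical names for illustration, not code from this PR.

```rust
// Scalar sketch: lift the word loop into the parent so each kernel only
// handles one 64-bit mask word. Names here are hypothetical.

/// Compress the 64 bytes covered by one mask word into `out`, returning
/// how many bytes were written.
fn compress_word(word: u64, src: &[u8], out: &mut Vec<u8>) -> usize {
    let start = out.len();
    for bit in 0..64 {
        if word & (1u64 << bit) != 0 {
            out.push(src[bit]);
        }
    }
    out.len() - start
}

/// Parent loop: iterates mask words exactly once and dispatches per word.
fn filter(src: &[u8], mask_words: &[u64]) -> Vec<u8> {
    let mut out = Vec::with_capacity(src.len());
    for (i, &word) in mask_words.iter().enumerate() {
        compress_word(word, &src[i * 64..(i + 1) * 64], &mut out);
    }
    out
}
```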
CodSpeed Performance Report: merging #5356 will degrade performance by 14.99%.
Codecov Report: ❌ patch coverage is … — view the full report in Codecov by Sentry.
```rust
/// In theory, we could use vqtbl1q_u8 for 16-byte vectors, but that would require a 1MB LUT
/// (65,536 entries × 16 bytes each), which is too large for practical use as it thrashes
/// the CPU cache.
#[cfg(target_arch = "aarch64")]
```
What's the point of only adding NEON intrinsics, as opposed to also having x86? To explore whether this outperforms auto-vectorization?
There is kind of no reasonable auto-vectorization for filtering that I've seen.
@connortsui20 was working on x86 / AVX512 so I figured I'd poke at neon!
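For context on the table-lookup approach under discussion, here is a scalar model of how the small LUT for the 8-byte path can be built and applied. The names, layout, and helper functions are my assumptions for illustration, not necessarily what this PR does; 256 entries × 8 bytes = 2 KB, versus the 1 MB a 16-byte `vqtbl1q_u8` variant would need (65,536 entries × 16 bytes).

```rust
// Scalar model of the 8-byte table-lookup compress: for each of the 256
// possible mask bytes, the LUT stores the shuffle indices that gather the
// selected lanes to the front (what vtbl1-style NEON shuffles consume).

fn build_lut() -> [[u8; 8]; 256] {
    let mut lut = [[0u8; 8]; 256];
    for mask in 0..256usize {
        let mut k = 0;
        for lane in 0..8u8 {
            if mask & (1 << lane) != 0 {
                lut[mask][k] = lane; // selected lane goes to slot k
                k += 1;
            }
        }
    }
    lut
}

/// Scalar equivalent of the NEON shuffle: gather the selected lanes to the
/// front, returning the compressed lanes and how many are valid.
fn compress8(values: [u8; 8], mask: u8, lut: &[[u8; 8]; 256]) -> ([u8; 8], usize) {
    let idx = &lut[mask as usize];
    let mut out = [0u8; 8];
    let n = mask.count_ones() as usize;
    for i in 0..n {
        out[i] = values[idx[i] as usize];
    }
    (out, n)
}
```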
```rust
_ => {
    // Finally, use the lookup table to compress selected elements.
    // Load uint8x8 values and compress them using the lookup table.
    let values = vld1_u8(read_ptr);
```
My assumption is that branching and branch mispredictions have significant overhead here if we do it at a per-byte level. I think it's worth benchmarking keeping only the `_ =>` case.
What might be reasonable is keeping these checks at the word level:
```rust
0u64 => {
    // Skip empty words
}
u64::MAX => {
    // All 64 bits set - fast path
    ptr::copy(read_ptr, write_ptr, 64);
    write_ptr = write_ptr.add(64);
}
```
But same here: it would be interesting to look at different workloads and see whether introducing the branches at the word level is actually a win over running `_ =>` unconditionally.
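A scalar sketch of what those word-level fast paths might look like end to end, with the unconditional kernel stubbed in scalar form. All names are hypothetical, and whether the branches win over always running the dense path is exactly the benchmarking question raised above.

```rust
// Scalar sketch: branch once per 64-bit mask word instead of per byte.
// `compress_dense` stands in for the unconditional (`_ =>`) kernel.

fn compress_dense(word: u64, src: &[u8], out: &mut Vec<u8>) {
    for bit in 0..64 {
        if word & (1u64 << bit) != 0 {
            out.push(src[bit]);
        }
    }
}

fn filter_word_fast_paths(src: &[u8], mask_words: &[u64]) -> Vec<u8> {
    let mut out = Vec::with_capacity(src.len());
    for (i, &word) in mask_words.iter().enumerate() {
        let chunk = &src[i * 64..(i + 1) * 64];
        match word {
            0 => {}                                   // empty word: skip entirely
            u64::MAX => out.extend_from_slice(chunk), // all bits set: bulk copy
            _ => compress_dense(word, chunk, &mut out),
        }
    }
    out
}
```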
```rust
}

/// Runs the provided function `f` for each index of a `true` bit in the view.
pub fn iter_ones<F>(&self, mut f: F)
```
```rust
while raw != 0 {
    let bit_pos = raw.trailing_zeros();
    f(bit_idx + bit_pos as usize);
    raw &= raw - 1; // Clear the bit at `bit_pos`
}
```
Not introduced as part of this PR, but iterating with `while raw != 0` seems bad for pipelining, since each iteration depends on the previous one.
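To illustrate the dependency being pointed at, here is the serial clear-lowest-bit loop next to an independent per-bit scan. This is a sketch with hypothetical names; the scan does 64 tests regardless of density, so which variant is faster depends on the workload and would need benchmarking.

```rust
// The `raw &= raw - 1` loop forms a serial dependency chain: each
// iteration's `raw` depends on the previous iteration's result. A per-bit
// scan makes every test depend only on the original `raw`, trading more
// iterations for instruction-level parallelism.

fn iter_ones_serial(mut raw: u64, base: usize, f: &mut impl FnMut(usize)) {
    while raw != 0 {
        let bit_pos = raw.trailing_zeros() as usize;
        f(base + bit_pos);
        raw &= raw - 1; // clear the lowest set bit (serial dependency)
    }
}

fn iter_ones_scan(raw: u64, base: usize, f: &mut impl FnMut(usize)) {
    for bit in 0..64 {
        if raw & (1u64 << bit) != 0 {
            f(base + bit); // each test reads only the original `raw`
        }
    }
}
```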
Actually, I think this deserves some cleaning up / monomorphizing before we merge (I will make the scalar implementation better in my PR #5399).