
Conversation

@gatesn (Contributor) commented Nov 16, 2025

No description provided.

Signed-off-by: Nicholas Gates <nick@nickgates.com>
@gatesn added the `feature` label (release label indicating a new feature or request) on Nov 16, 2025
```rust
let mut read_ptr = data;
let mut write_ptr = data;

for word in mask.iter_words() {
```
@gatesn (author) commented:
This outer loop is duplicative; maybe lift it into the parent (see the sketch below).
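
A minimal sketch of that refactor, with illustrative names (`filter_word`, and a plain `&[u64]` slice standing in for the PR's mask type): the parent owns the single word loop and the child compresses one 64-element chunk.

```rust
/// Hypothetical parent: owns the one-and-only word loop.
/// Each u64 mask word covers 64 elements; assumes `data` is long enough
/// to back every set bit.
fn filter_bytes(data: &[u8], mask_words: &[u64], out: &mut Vec<u8>) {
    for (word_idx, &word) in mask_words.iter().enumerate() {
        filter_word(&data[word_idx * 64..], word, out);
    }
}

/// Child: compresses the bytes selected by a single 64-bit mask word.
fn filter_word(chunk: &[u8], mut word: u64, out: &mut Vec<u8>) {
    while word != 0 {
        let bit = word.trailing_zeros() as usize;
        out.push(chunk[bit]);
        word &= word - 1; // clear the lowest set bit
    }
}
```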

Signed-off-by: Nicholas Gates <nick@nickgates.com>
@codspeed-hq (bot) commented Nov 18, 2025

CodSpeed Performance Report

Merging #5356 will degrade performance by up to 14.99%

Comparing ngates/bitview-filter (a783a69) with develop (35d12a3)

Summary

❌ 15 regressions
✅ 1395 untouched
🆕 157 new
⏩ 631 skipped¹
🗄️ 28 archived benchmarks run²

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | BASE | HEAD | Change |
|---|---|---|---|
| take_indices[(1000, 16, 0.005)] | 20.8 µs | 24.4 µs | -14.95% |
| take_indices[(1000, 16, 0.01)] | 20.9 µs | 24.6 µs | -14.97% |
| take_indices[(1000, 16, 0.03)] | 22.8 µs | 26.4 µs | -13.62% |
| take_indices[(1000, 256, 0.005)] | 20.5 µs | 24.2 µs | -14.99% |
| take_indices[(1000, 256, 0.01)] | 20.7 µs | 24.4 µs | -14.97% |
| take_indices[(1000, 256, 0.03)] | 21.1 µs | 24.8 µs | -14.84% |
| take_indices[(1000, 4, 0.005)] | 23.3 µs | 26.9 µs | -13.29% |
| take_indices[(1000, 4, 0.01)] | 22.1 µs | 25.7 µs | -14.01% |
| take_indices[(1000, 4, 0.03)] | 25.1 µs | 28.7 µs | -12.55% |
| take_indices[(10000, 16, 0.005)] | 24.4 µs | 28 µs | -12.85% |
| take_indices[(10000, 16, 0.01)] | 27.3 µs | 30.9 µs | -11.66% |
| take_indices[(10000, 256, 0.005)] | 22.7 µs | 26.4 µs | -13.94% |
| take_indices[(10000, 256, 0.01)] | 24.8 µs | 28.4 µs | -12.84% |
| take_indices[(10000, 4, 0.005)] | 27 µs | 30.6 µs | -11.77% |
| take_indices[(10000, 4, 0.01)] | 30.8 µs | 34.4 µs | -10.48% |
| 🆕 filter_u128[ActualFilter, 0.01] | N/A | 823.3 ns | N/A |
| 🆕 filter_u128[ActualFilter, 0.05] | N/A | 865.8 ns | N/A |
| 🆕 filter_u128[ActualFilter, 0.1] | N/A | 880.8 ns | N/A |
| 🆕 filter_u128[ActualFilter, 0.25] | N/A | 1 µs | N/A |
| 🆕 filter_u128[ActualFilter, 0.5] | N/A | 1.2 µs | N/A |
| … | … | … | … |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Footnotes

  1. 631 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. 28 benchmarks were run but are now archived. If they were deleted in another branch, consider rebasing to remove them from the report. If instead they were added back, click here to restore them.

Signed-off-by: Nicholas Gates <nick@nickgates.com>
@codecov (bot) commented Nov 18, 2025

Codecov Report

❌ Patch coverage is 0% with 43 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.89%. Comparing base (1aac9ca) to head (492dffd).
⚠️ Report is 1 commit behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| vortex-compute/src/filter/slice/scalar.rs | 0.00% | 36 Missing ⚠️ |
| vortex-buffer/src/bit/view.rs | 0.00% | 4 Missing ⚠️ |
| vortex-compute/src/filter/slice/mod.rs | 0.00% | 3 Missing ⚠️ |

☔ View full report in Codecov by Sentry.

Signed-off-by: Nicholas Gates <nick@nickgates.com>
@gatesn marked this pull request as ready for review November 18, 2025 16:01
@gatesn requested a review from connortsui20 November 18, 2025 16:02
```rust
/// In theory, we could use vqtbl1q_u8 for 16-byte vectors, but that would
/// require a 1 MiB LUT (2^16 entries × 16 bytes each), which is too large for
/// practical use as it thrashes the CPU cache.
#[cfg(target_arch = "aarch64")]
```
A contributor commented:

What's the point of only adding NEON intrinsics, as opposed to also adding x86? To explore whether this outperforms auto-vectorization?

@gatesn (author) replied:

I've seen essentially no reasonable auto-vectorization for filtering.

@connortsui20 was working on x86 / AVX512, so I figured I'd poke at NEON!
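
For context, here is a minimal sketch of the LUT-based NEON compress being discussed, at 8-element granularity (256 entries × 8 bytes ≈ 2 KiB of table). `SHUFFLE_LUT`, `compress8`, and the table construction are illustrative, not the PR's actual code:

```rust
use std::arch::aarch64::{vld1_u8, vst1_u8, vtbl1_u8};

/// Illustrative 2 KiB table: entry `m` lists, in order, the lane indices of the
/// set bits in mask byte `m`. Unused lanes are padded with 0xFF, which vtbl1
/// maps to zero.
const fn build_shuffle_lut() -> [[u8; 8]; 256] {
    let mut lut = [[0xFFu8; 8]; 256];
    let mut m = 0;
    while m < 256 {
        let (mut bit, mut out) = (0, 0);
        while bit < 8 {
            if (m >> bit) & 1 == 1 {
                lut[m][out] = bit as u8;
                out += 1;
            }
            bit += 1;
        }
        m += 1;
    }
    lut
}
static SHUFFLE_LUT: [[u8; 8]; 256] = build_shuffle_lut();

/// Compress 8 bytes according to an 8-bit mask; returns the bytes written.
#[cfg(target_arch = "aarch64")]
unsafe fn compress8(read_ptr: *const u8, write_ptr: *mut u8, mask: u8) -> usize {
    let values = vld1_u8(read_ptr);
    let indices = vld1_u8(SHUFFLE_LUT[mask as usize].as_ptr());
    // vtbl1_u8 gathers the selected lanes to the front of the vector.
    vst1_u8(write_ptr, vtbl1_u8(values, indices));
    mask.count_ones() as usize
}
```

Note that `vst1_u8` always stores all 8 lanes, so the output buffer needs a few bytes of slack past the compacted length; the write pointer only advances by the popcount.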

```rust
_ => {
    // Fall through: load eight u8 values and compact the selected
    // elements to the front using the shuffle lookup table.
    let values = vld1_u8(read_ptr);
```
@0ax1 (Contributor) commented Nov 19, 2025:
My assumption is that branching and branch mispredictions add significant overhead here when done at the per-byte level. I think it's worth benchmarking a version that keeps only the `_ =>` case.

What might be reasonable is keeping these checks at the word level:

```rust
0u64 => {
    // Skip empty words.
}
u64::MAX => {
    // All 64 bits set - fast path: copy the whole chunk.
    ptr::copy(read_ptr, write_ptr, 64);
    write_ptr = write_ptr.add(64);
}
```

But same here: it would be interesting to look at different workloads and see whether introducing the branches at the word level is actually a win over running `_ =>` unconditionally.
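
For the record, one shape the word-level version could take — again only a sketch, reusing the illustrative `compress8` helper from above:

```rust
/// Sketch: branch once per 64-bit mask word, and run the per-byte LUT compress
/// unconditionally inside mixed words. Returns the number of bytes written.
#[cfg(target_arch = "aarch64")]
unsafe fn filter_words(mut read_ptr: *const u8, mut write_ptr: *mut u8, words: &[u64]) -> usize {
    let out_start = write_ptr;
    for &word in words {
        match word {
            0 => {
                // No bits set: skip this 64-byte chunk entirely.
                read_ptr = read_ptr.add(64);
            }
            u64::MAX => {
                // All bits set: one bulk copy, no per-byte work.
                std::ptr::copy_nonoverlapping(read_ptr, write_ptr, 64);
                read_ptr = read_ptr.add(64);
                write_ptr = write_ptr.add(64);
            }
            _ => {
                // Mixed word: compress 8 bytes at a time, unconditionally.
                for mask_byte in word.to_le_bytes() {
                    let written = compress8(read_ptr, write_ptr, mask_byte);
                    read_ptr = read_ptr.add(8);
                    write_ptr = write_ptr.add(written);
                }
            }
        }
    }
    write_ptr.offset_from(out_start) as usize
}
```

Whether the two extra branches pay for themselves will, as noted above, depend on the selectivity distribution of the workload.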

```rust
}

/// Runs the provided function `f` for each index of a `true` bit in the view.
pub fn iter_ones<F>(&self, mut f: F)
```
A contributor commented:
```rust
while raw != 0 {
    let bit_pos = raw.trailing_zeros();
    f(bit_idx + bit_pos as usize);
    raw &= raw - 1; // clear the lowest set bit (the one at `bit_pos`)
}
```

Not introduced as part of this PR, but iterating with `while raw != 0` seems bad for pipelining, since each iteration depends on the previous one.
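
One commonly used way to shorten that dependency chain — a sketch only, not something this PR does — is to split each word into bytes and take set-bit positions from a small table, so the serial bit-clearing chain never spans more than 8 bits:

```rust
/// Illustrative 2 KiB table: entry `b` lists, in order, the positions of the
/// set bits in byte `b`.
const fn build_bit_pos() -> [[u8; 8]; 256] {
    let mut lut = [[0u8; 8]; 256];
    let mut b = 0;
    while b < 256 {
        let (mut bit, mut n) = (0, 0);
        while bit < 8 {
            if (b >> bit) & 1 == 1 {
                lut[b][n] = bit as u8;
                n += 1;
            }
            bit += 1;
        }
        b += 1;
    }
    lut
}
static BIT_POS: [[u8; 8]; 256] = build_bit_pos();

/// Visit every set bit in `word`, offset by `base`.
fn iter_ones_word<F: FnMut(usize)>(word: u64, base: usize, f: &mut F) {
    for (i, byte) in word.to_le_bytes().into_iter().enumerate() {
        let count = byte.count_ones() as usize;
        // The 8 byte lookups are independent of one another, unlike the
        // up-to-64-step trailing_zeros chain over the full word.
        for &pos in &BIT_POS[byte as usize][..count] {
            f(base + i * 8 + pos as usize);
        }
    }
}
```

Whether this beats the plain `trailing_zeros` loop depends on bit density; for very sparse masks the serial loop does little work anyway.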

@connortsui20 (Contributor) commented:

Actually, I think this deserves some cleaning up / monomorphizing before we merge (I will make the scalar implementation better in my PR #5399).
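
As a rough illustration of what monomorphizing the scalar path could look like (assuming nothing about the actual contents of #5399), the filter can be made generic over the element type so each width compiles to its own specialized loop:

```rust
/// Hypothetical generic scalar filter: monomorphizes per element type, so the
/// u8/u16/u32/u64 paths each compile down to their own specialized loop.
fn filter_scalar<T: Copy>(data: &[T], mask_words: &[u64], out: &mut Vec<T>) {
    for (w, &word) in mask_words.iter().enumerate() {
        let mut raw = word;
        while raw != 0 {
            let bit = raw.trailing_zeros() as usize;
            out.push(data[w * 64 + bit]);
            raw &= raw - 1; // clear the lowest set bit
        }
    }
}

// Each call site pins T, producing a distinct monomorphized copy:
// filter_scalar::<u8>(...), filter_scalar::<u128>(...), etc.
```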

