
feature: Add fastlanes bit unpacking cuda kernels #6145

Merged: robert3005 merged 16 commits into develop from rk/cudabitpacking on Jan 28, 2026
Conversation

@robert3005
Contributor

@robert3005 robert3005 commented Jan 26, 2026

I'm happy to big-bang the whole thing in one PR, or we can merge this as an intermediate step that generates the CUDA kernels.

Signed-off-by: Robert Kruszewski <github@robertk.io>

@robert3005 robert3005 added the feature (A feature request) and changelog/feature (A new feature) labels and removed the feature (A feature request) label Jan 26, 2026
@joseph-isaacs
Contributor

Shall we have this in vortex-cuda for now?

@robert3005
Contributor Author

We should not. We should move things out of vortex-cuda; there are too many things there.

@robert3005 robert3005 requested a review from 0ax1 January 26, 2026 18:16
@codspeed-hq

codspeed-hq bot commented Jan 26, 2026

Merging this PR will degrade performance by 82.84%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 7 improved benchmarks
❌ 11 regressed benchmarks
✅ 1143 untouched benchmarks
⏩ 1323 skipped benchmarks [1]

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | canonical_into_non_nullable[(10000, 1, 0.0)] | 25.7 µs | 36.2 µs | -29.12% |
| Simulation | canonical_into_non_nullable[(10000, 10, 0.01)] | 222.6 µs | 306.1 µs | -27.28% |
| Simulation | canonical_into_non_nullable[(10000, 1, 0.01)] | 32.2 µs | 41.1 µs | -21.61% |
| Simulation | canonical_into_non_nullable[(10000, 10, 0.0)] | 195.5 µs | 279 µs | -29.91% |
| Simulation | canonical_into_non_nullable[(10000, 1, 0.1)] | 48 µs | 57 µs | -15.76% |
| Simulation | canonical_into_non_nullable[(10000, 10, 0.1)] | 382.1 µs | 471.6 µs | -18.98% |
| Simulation | canonical_into_nullable[(10000, 100, 0.0)] | 5 ms | 4.4 ms | +14.03% |
| Simulation | into_canonical_non_nullable[(10000, 1, 0.01)] | 46.3 µs | 39.1 µs | +18.27% |
| Simulation | into_canonical_non_nullable[(10000, 10, 0.01)] | 229.2 µs | 309.2 µs | -25.87% |
| Simulation | into_canonical_non_nullable[(10000, 10, 0.0)] | 201.6 µs | 282.4 µs | -28.59% |
| Simulation | into_canonical_non_nullable[(10000, 10, 0.1)] | 385.1 µs | 471.5 µs | -18.34% |
| Simulation | into_canonical_non_nullable[(10000, 1, 0.0)] | 40.4 µs | 33.1 µs | +22.14% |
| Simulation | into_canonical_non_nullable[(10000, 1, 0.1)] | 62.8 µs | 55.2 µs | +13.74% |
| Simulation | into_canonical_nullable[(10000, 10, 0.0)] | 540.8 µs | 458.2 µs | +18.02% |
| Simulation | into_canonical_nullable[(10000, 100, 0.0)] | 5.1 ms | 4.3 ms | +16.84% |
| Simulation | into_canonical_nullable[(10000, 10, 0.1)] | 632 µs | 718.9 µs | -12.09% |
| Simulation | into_canonical_nullable[(10000, 100, 0.1)] | 6.9 ms | 6.1 ms | +13.74% |
| WallTime | u8_FoR[10M] | 6.9 µs | 40.5 µs | -82.84% |

Comparing rk/cudabitpacking (960ce72) with develop (68130ce)
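
For readers checking the report by hand: the Efficiency figures look consistent with `BASE / HEAD - 1`, though that formula is an inference from the numbers shown here, not something the report states. A minimal sketch, using the rounded timings as displayed (so results differ slightly from the reported percentages, presumably because the report uses unrounded measurements); the function name `efficiency_pct` is made up for illustration:

```python
def efficiency_pct(base: float, head: float) -> float:
    """Relative change in percent, ASSUMING the report computes base / head - 1."""
    return (base / head - 1.0) * 100.0


# (BASE, HEAD) pairs copied from the table, in the units shown there.
rows = {
    "u8_FoR[10M]": (6.9, 40.5),  # reported as -82.84%
    "into_canonical_non_nullable[(10000, 1, 0.01)]": (46.3, 39.1),  # reported as +18.27%
}

for name, (base, head) in rows.items():
    # Expect values close to, but not exactly equal to, the reported figures.
    print(f"{name}: {efficiency_pct(base, head):+.2f}%")
```

Under this reading, a negative value means HEAD is slower than BASE, which matches the headline regression being the WallTime `u8_FoR[10M]` row.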


Footnotes

  1. 1323 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them on CodSpeed to remove them from the performance reports.

@joseph-isaacs
Contributor

Can we have a test running a kernel?
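
One common shape for such a test is to compare kernel output against a simple CPU reference. The sketch below shows only the reference side, and it makes loud assumptions: it uses a plain contiguous LSB-first layout, not the interleaved FastLanes layout this PR's kernels target, and `pack_bits` / `unpack_bits` are hypothetical helper names, not functions from this repository.

```python
def pack_bits(values, bit_width):
    """Pack non-negative ints LSB-first into bytes (contiguous layout, NOT FastLanes)."""
    mask = (1 << bit_width) - 1
    acc = 0
    for i, v in enumerate(values):
        if v < 0 or v > mask:
            raise ValueError(f"value {v} does not fit in {bit_width} bits")
        acc |= v << (i * bit_width)  # place value i at bit offset i * bit_width
    nbytes = (len(values) * bit_width + 7) // 8  # round up to whole bytes
    return acc.to_bytes(nbytes, "little")


def unpack_bits(packed, bit_width, count):
    """Inverse of pack_bits: recover `count` values of `bit_width` bits each."""
    acc = int.from_bytes(packed, "little")
    mask = (1 << bit_width) - 1
    return [(acc >> (i * bit_width)) & mask for i in range(count)]


# Round-trip check of the kind a kernel test could mirror:
# unpacked output should equal the values that were packed.
values = [i % 8 for i in range(1024)]
assert unpack_bits(pack_bits(values, 3), 3, len(values)) == values
```

A real test for this PR would swap `unpack_bits` for a launch of the generated kernel and use the project's own packed layout as input.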

Signed-off-by: Robert Kruszewski <github@robertk.io>
@robert3005 robert3005 changed the title from "feature: Add generator for fastlanes bit unpacking cuda kernels" to "feature: Add fastlanes bit unpacking cuda kernels" Jan 28, 2026
@robert3005 robert3005 enabled auto-merge (squash) January 28, 2026 14:20
@robert3005 robert3005 enabled auto-merge (squash) January 28, 2026 16:05
@robert3005 robert3005 merged commit 6ab6b5f into develop Jan 28, 2026
45 of 46 checks passed
@robert3005 robert3005 deleted the rk/cudabitpacking branch January 28, 2026 16:15
danking pushed a commit that referenced this pull request Feb 6, 2026
Signed-off-by: Robert Kruszewski <github@robertk.io>

Labels

changelog/feature A new feature
