feat[gpu]: nvidia cub filter kernel #6188

Merged
0ax1 merged 8 commits into develop from ad/cub-device-select
Jan 28, 2026
0ax1 commented Jan 28, 2026

No description provided.

0ax1 marked this pull request as ready for review January 28, 2026 15:50
0ax1 added the feature (A feature request) label Jan 28, 2026
0ax1 force-pushed the ad/cub-device-select branch from ea450ab to c1531a5 January 28, 2026 15:51
0ax1 added the changelog/feature (A new feature) label and removed the feature (A feature request) label Jan 28, 2026
0ax1 changed the title from "feat[gpu]: nvidia cub gpu filter kernel" to "feat[gpu]: nvidia cub filter kernel" Jan 28, 2026

0ax1 commented Jan 28, 2026

Filter_cuda_i64/1M_10pct/100000
                        time:   [35.605 µs 35.867 µs 36.041 µs]
                        thrpt:  [206.72 GiB/s 207.73 GiB/s 209.26 GiB/s]
Filter_cuda_i64/1M_50pct/500000
                        time:   [38.558 µs 38.703 µs 38.785 µs]
                        thrpt:  [192.10 GiB/s 192.51 GiB/s 193.23 GiB/s]
Filter_cuda_i64/1M_90pct/900000
                        time:   [45.271 µs 45.448 µs 45.596 µs]
                        thrpt:  [163.40 GiB/s 163.93 GiB/s 164.58 GiB/s]
Filter_cuda_i64/10M_10pct/1000000
                        time:   [216.37 µs 217.62 µs 218.51 µs]
                        thrpt:  [340.98 GiB/s 342.37 GiB/s 344.35 GiB/s]
Filter_cuda_i64/10M_50pct/5000000
                        time:   [287.96 µs 288.69 µs 289.13 µs]
                        thrpt:  [257.69 GiB/s 258.08 GiB/s 258.74 GiB/s]
Filter_cuda_i64/10M_90pct/9000000
                        time:   [346.53 µs 347.00 µs 347.57 µs]
                        thrpt:  [214.36 GiB/s 214.72 GiB/s 215.01 GiB/s]
Found 2 outliers among 10 measurements (20.00%)
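
For readers checking the numbers above: the reported throughput appears to be input bytes divided by wall time. The sketch below is a minimal illustration under assumptions not stated in the PR: that "1M" and "10M" mean 1,000,000 and 10,000,000 elements, that each i64 element is 8 bytes, and that only the input column is counted regardless of selectivity. The function name `throughput_gib_per_s` is hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB


def throughput_gib_per_s(num_elements: int, element_bytes: int, seconds: float) -> float:
    """Input bytes processed per second, expressed in GiB/s."""
    return num_elements * element_bytes / seconds / GIB


# (assumed input element count, median time in seconds) taken from the output above.
cases = {
    "1M_10pct": (1_000_000, 35.867e-6),
    "1M_50pct": (1_000_000, 38.703e-6),
    "10M_10pct": (10_000_000, 217.62e-6),
}

for name, (n, t) in cases.items():
    # 8 bytes per element is an assumption based on the i64 benchmark name.
    print(f"{name}: {throughput_gib_per_s(n, 8, t):.2f} GiB/s")
```

Under these assumptions the computed values agree with the reported medians to within rounding, which would also explain why throughput falls as selectivity rises: the numerator stays fixed while the kernel writes more output.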

0ax1 added 6 commits January 28, 2026 16:18
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
0ax1 force-pushed the ad/cub-device-select branch from 11ffecc to a454644 January 28, 2026 16:18
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
codspeed-hq bot commented Jan 28, 2026

CodSpeed Performance Report

Merging this PR will degrade performance by 29.91%

Comparing ad/cub-device-select (cd0dd0c) with develop (68130ce) [1]

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

Summary

⚡ 7 improved benchmarks
❌ 10 regressed benchmarks
✅ 1144 untouched benchmarks
🆕 18 new benchmarks
⏩ 1323 skipped benchmarks [2]

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| | Simulation | canonical_into_non_nullable[(10000, 1, 0.0)] | 25.7 µs | 36.2 µs | -29.12% |
| | Simulation | canonical_into_non_nullable[(10000, 10, 0.0)] | 195.5 µs | 279 µs | -29.91% |
| | Simulation | canonical_into_non_nullable[(10000, 10, 0.1)] | 382.1 µs | 471.6 µs | -18.98% |
| | Simulation | canonical_into_non_nullable[(10000, 1, 0.01)] | 32.2 µs | 41.1 µs | -21.61% |
| | Simulation | canonical_into_non_nullable[(10000, 10, 0.01)] | 222.6 µs | 306.1 µs | -27.28% |
| | Simulation | canonical_into_non_nullable[(10000, 1, 0.1)] | 48 µs | 57 µs | -15.76% |
| | Simulation | canonical_into_nullable[(10000, 100, 0.0)] | 5 ms | 4.4 ms | +14.03% |
| | Simulation | into_canonical_non_nullable[(10000, 1, 0.01)] | 46.3 µs | 39.1 µs | +18.27% |
| | Simulation | into_canonical_non_nullable[(10000, 1, 0.0)] | 40.4 µs | 33.1 µs | +22.14% |
| | Simulation | into_canonical_non_nullable[(10000, 10, 0.0)] | 201.6 µs | 282.4 µs | -28.59% |
| | Simulation | into_canonical_non_nullable[(10000, 10, 0.01)] | 229.2 µs | 309.2 µs | -25.87% |
| | Simulation | into_canonical_nullable[(10000, 10, 0.1)] | 632 µs | 718.9 µs | -12.09% |
| | Simulation | into_canonical_non_nullable[(10000, 10, 0.1)] | 385.1 µs | 471.5 µs | -18.34% |
| | Simulation | into_canonical_nullable[(10000, 100, 0.1)] | 6.9 ms | 6.1 ms | +13.74% |
| | Simulation | into_canonical_nullable[(10000, 100, 0.0)] | 5.1 ms | 4.3 ms | +16.84% |
| | Simulation | into_canonical_non_nullable[(10000, 1, 0.1)] | 62.8 µs | 55.2 µs | +13.74% |
| | Simulation | into_canonical_nullable[(10000, 10, 0.0)] | 540.8 µs | 458.2 µs | +18.02% |
| 🆕 | WallTime | 1M_10pct[100000] | N/A | 45.8 µs | N/A |
| 🆕 | WallTime | 10M_10pct[1000000] | N/A | 132.6 µs | N/A |
| 🆕 | WallTime | 1M_50pct[500000] | N/A | 21.9 µs | N/A |
| ... | ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Footnotes

  1. No successful run was found on develop (6ab6b5f) during the generation of this report, so 68130ce was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

  2. 1323 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

a10y left a comment

sweet

0ax1 merged commit 60530b1 into develop Jan 28, 2026
81 of 85 checks passed
0ax1 deleted the ad/cub-device-select branch January 28, 2026 16:55
danking pushed a commit that referenced this pull request Feb 6, 2026
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Labels

changelog/feature A new feature
