perf: CUDA FoR loop unrolling #6017
Merged
554c118 to d55e783
d55e783 to 46dab08
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
46dab08 to 979292b
joseph-isaacs approved these changes on Jan 19, 2026
danking pushed a commit that referenced this pull request on Feb 6, 2026
This PR changes the FoR implementation to use loop unrolling, which in
some scenarios increases throughput by up to ~85% (u16 at 10K and 1M
elements, about 46% less time per iteration). As part of that, new
benchmarks and tests are introduced.
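To illustrate the general technique, here is a minimal sketch of an unrolled kernel. It is not the kernel from this PR. It assumes FoR means frame-of-reference encoding, where decoding adds a base value back to every element. The names `for_decode_unrolled` and `for_decode` and the constants `kThreadsPerBlock` and `kElemsPerThread` are hypothetical and chosen only for illustration.

```cuda
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical tuning constants; not taken from the PR.
constexpr int kThreadsPerBlock = 256;
constexpr int kElemsPerThread = 8;

// Adds `reference` to every element in place. Each block owns a contiguous
// tile of blockDim.x * kElemsPerThread values. Within a tile, thread t
// touches t, t + blockDim.x, ... so neighbouring threads hit neighbouring
// addresses and the accesses stay coalesced.
template <typename T>
__global__ void for_decode_unrolled(T* values, T reference, size_t len) {
    const size_t stride = blockDim.x;
    const size_t tile_start = static_cast<size_t>(blockIdx.x) * stride * kElemsPerThread;

    if (tile_start + stride * kElemsPerThread <= len) {
        // Full tile: the trip count is a compile-time constant and there is
        // no bounds check, so the compiler can fully unroll the loop.
        #pragma unroll
        for (int k = 0; k < kElemsPerThread; ++k) {
            values[tile_start + k * stride + threadIdx.x] += reference;
        }
    } else {
        // Tail tile: same access pattern, but every index is bounds-checked.
        for (int k = 0; k < kElemsPerThread; ++k) {
            const size_t i = tile_start + k * stride + threadIdx.x;
            if (i < len) {
                values[i] += reference;
            }
        }
    }
}

// Host helper (assumed, for illustration): copies the data to the device,
// launches one block per tile, and copies the decoded values back.
template <typename T>
void for_decode(std::vector<T>& host, T reference) {
    if (host.empty()) return;
    const size_t bytes = host.size() * sizeof(T);
    T* dev = nullptr;
    cudaError_t err = cudaMalloc(reinterpret_cast<void**>(&dev), bytes);
    assert(err == cudaSuccess);
    err = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    assert(err == cudaSuccess);

    const size_t tile = static_cast<size_t>(kThreadsPerBlock) * kElemsPerThread;
    const unsigned blocks = static_cast<unsigned>((host.size() + tile - 1) / tile);
    for_decode_unrolled<T><<<blocks, kThreadsPerBlock>>>(dev, reference, host.size());
    err = cudaGetLastError();
    assert(err == cudaSuccess);

    // This blocking copy on the default stream waits for the kernel to finish.
    err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(dev);
}
```

Calling `for_decode<uint16_t>(v, 1000)` on a 5,000-element vector exercises both paths: two full tiles take the unrolled branch and the last, partial tile takes the bounds-checked one. Splitting the full-tile and tail cases is one common way to keep the hot loop free of per-element branches. Whether the PR does the same is not stated here.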
Benchmarks were run on an A10:
```
FoR_cuda_u8/u8_FoR/1K time: [4.6619 µs 4.6818 µs 4.7073 µs]
thrpt: [202.59 MiB/s 203.70 MiB/s 204.57 MiB/s]
change:
time: [−6.9968% −6.1892% −5.2750%] (p = 0.00 < 0.05)
thrpt: [+5.5687% +6.5976% +7.5232%]
Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
FoR_cuda_u8/u8_FoR/10K time: [4.4583 µs 4.4809 µs 4.5139 µs]
thrpt: [2.0632 GiB/s 2.0784 GiB/s 2.0890 GiB/s]
change:
time: [−0.0233% +0.7648% +1.5942%] (p = 0.10 > 0.05)
thrpt: [−1.5692% −0.7590% +0.0233%]
No change in performance detected.
FoR_cuda_u8/u8_FoR/100K time: [4.5262 µs 4.5439 µs 4.5736 µs]
thrpt: [20.363 GiB/s 20.496 GiB/s 20.576 GiB/s]
change:
time: [−8.9445% −7.7897% −6.6445%] (p = 0.00 < 0.05)
thrpt: [+7.1175% +8.4477% +9.8231%]
Performance has improved.
FoR_cuda_u8/u8_FoR/1M time: [4.4380 µs 4.4598 µs 4.4891 µs]
thrpt: [207.46 GiB/s 208.83 GiB/s 209.85 GiB/s]
change:
time: [+0.0679% +0.9488% +1.7990%] (p = 0.05 > 0.05)
thrpt: [−1.7672% −0.9399% −0.0679%]
No change in performance detected.
FoR_cuda_u8/u8_FoR/10M time: [4.4880 µs 4.5013 µs 4.5293 µs]
thrpt: [2056.2 GiB/s 2069.0 GiB/s 2075.2 GiB/s]
change:
time: [−3.4123% −2.6199% −1.6909%] (p = 0.00 < 0.05)
thrpt: [+1.7200% +2.6903% +3.5328%]
Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
2 (20.00%) high mild
FoR_cuda_u16/u16_FoR/1K time: [4.7696 µs 4.7820 µs 4.8017 µs]
thrpt: [397.22 MiB/s 398.86 MiB/s 399.90 MiB/s]
change:
time: [−31.818% −31.216% −30.430%] (p = 0.00 < 0.05)
thrpt: [+43.739% +45.384% +46.666%]
Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high severe
FoR_cuda_u16/u16_FoR/10K
time: [4.9942 µs 5.0030 µs 5.0114 µs]
thrpt: [3.7168 GiB/s 3.7231 GiB/s 3.7296 GiB/s]
change:
time: [−46.715% −46.619% −46.535%] (p = 0.00 < 0.05)
thrpt: [+87.039% +87.332% +87.671%]
Performance has improved.
FoR_cuda_u16/u16_FoR/100K
time: [11.371 µs 11.387 µs 11.396 µs]
thrpt: [16.345 GiB/s 16.358 GiB/s 16.381 GiB/s]
change:
time: [−26.577% −26.455% −26.344%] (p = 0.00 < 0.05)
thrpt: [+35.767% +35.972% +36.197%]
Performance has improved.
FoR_cuda_u16/u16_FoR/1M time: [4.9764 µs 4.9958 µs 5.0073 µs]
thrpt: [371.98 GiB/s 372.84 GiB/s 374.29 GiB/s]
change:
time: [−46.584% −46.382% −46.210%] (p = 0.00 < 0.05)
thrpt: [+85.906% +86.503% +87.210%]
Performance has improved.
FoR_cuda_u16/u16_FoR/10M
time: [11.157 µs 11.211 µs 11.248 µs]
thrpt: [1656.0 GiB/s 1661.4 GiB/s 1669.5 GiB/s]
change:
time: [−26.916% −26.493% −26.164%] (p = 0.00 < 0.05)
thrpt: [+35.435% +36.042% +36.828%]
Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) low mild
FoR_cuda_u32/u32_FoR/1K time: [5.2116 µs 5.2613 µs 5.3102 µs]
thrpt: [718.37 MiB/s 725.05 MiB/s 731.96 MiB/s]
change:
time: [−26.511% −25.998% −25.368%] (p = 0.00 < 0.05)
thrpt: [+33.990% +35.132% +36.075%]
Performance has improved.
FoR_cuda_u32/u32_FoR/10K
time: [5.5475 µs 5.5554 µs 5.5633 µs]
thrpt: [6.6962 GiB/s 6.7057 GiB/s 6.7152 GiB/s]
change:
time: [−39.450% −39.349% −39.240%] (p = 0.00 < 0.05)
thrpt: [+64.582% +64.877% +65.153%]
Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) low mild
FoR_cuda_u32/u32_FoR/100K
time: [9.3362 µs 9.3806 µs 9.4250 µs]
thrpt: [39.525 GiB/s 39.713 GiB/s 39.902 GiB/s]
change:
time: [−25.760% −25.359% −24.988%] (p = 0.00 < 0.05)
thrpt: [+33.312% +33.974% +34.698%]
Performance has improved.
FoR_cuda_u32/u32_FoR/1M time: [13.072 µs 13.168 µs 13.267 µs]
thrpt: [280.78 GiB/s 282.91 GiB/s 284.98 GiB/s]
change:
time: [−18.493% −14.861% −9.3593%] (p = 0.00 < 0.05)
thrpt: [+10.326% +17.455% +22.689%]
Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) low mild
FoR_cuda_u32/u32_FoR/10M
time: [174.68 µs 174.95 µs 175.20 µs]
thrpt: [212.63 GiB/s 212.94 GiB/s 213.26 GiB/s]
change:
time: [−1.4814% −1.2404% −1.0022%] (p = 0.00 < 0.05)
thrpt: [+1.0124% +1.2560% +1.5036%]
Performance has improved.
FoR_cuda_u64/u64_FoR/1K time: [5.8007 µs 5.8204 µs 5.8478 µs]
thrpt: [1.2741 GiB/s 1.2801 GiB/s 1.2844 GiB/s]
change:
time: [−18.401% −18.040% −17.687%] (p = 0.00 < 0.05)
thrpt: [+21.488% +22.010% +22.551%]
Performance has improved.
FoR_cuda_u64/u64_FoR/10K
time: [13.322 µs 13.378 µs 13.445 µs]
thrpt: [5.5417 GiB/s 5.5695 GiB/s 5.5925 GiB/s]
change:
time: [−17.451% −17.049% −16.645%] (p = 0.00 < 0.05)
thrpt: [+19.969% +20.553% +21.140%]
Performance has improved.
FoR_cuda_u64/u64_FoR/100K
time: [12.205 µs 12.319 µs 12.462 µs]
thrpt: [59.788 GiB/s 60.478 GiB/s 61.044 GiB/s]
change:
time: [−19.829% −19.168% −18.499%] (p = 0.00 < 0.05)
thrpt: [+22.697% +23.713% +24.734%]
Performance has improved.
FoR_cuda_u64/u64_FoR/1M time: [33.368 µs 33.405 µs 33.464 µs]
thrpt: [222.65 GiB/s 223.04 GiB/s 223.28 GiB/s]
change:
time: [−7.8447% −7.4612% −7.0847%] (p = 0.00 < 0.05)
thrpt: [+7.6249% +8.0627% +8.5124%]
Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
FoR_cuda_u64/u64_FoR/10M
time: [341.44 µs 341.72 µs 342.10 µs]
thrpt: [217.79 GiB/s 218.03 GiB/s 218.21 GiB/s]
change:
time: [−1.8864% −1.6840% −1.4901%] (p = 0.00 < 0.05)
thrpt: [+1.5126% +1.7128% +1.9226%]
Performance has improved.
```
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
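The `time` and `thrpt` rows in the benchmark output above can be cross-checked by hand, which also shows why a ~46% drop in time corresponds to a ~87% gain in throughput. The check below uses the point estimates for `FoR_cuda_u16/u16_FoR/10K` and assumes that 10K denotes 10,000 elements of 2 bytes each; that assumption is consistent with the reported numbers.

$$
\text{thrpt} = \frac{\text{bytes}}{\text{time}} = \frac{10\,000 \times 2\ \text{B}}{5.0030\ \mu\text{s}} \approx 3.9976 \times 10^{9}\ \text{B/s} \approx 3.723\ \text{GiB/s}
$$

$$
\Delta_{\text{thrpt}} = \frac{1}{1 + \Delta_{\text{time}}} - 1 = \frac{1}{1 - 0.46619} - 1 \approx 0.8733 \quad (+87.33\%)
$$

Both results agree with the reported 3.7231 GiB/s and +87.332%.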