-
-
Notifications
You must be signed in to change notification settings - Fork 434
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Look into this alleged optimal algorithm for bounded random integers #1172
Comments
Lemire seems to think it is correct https://twitter.com/lemire/status/1433523690696757251 |
Indeed, this looks nice, especially the practical option of fixing the maximum number of random bits used. Some notes:
Then of course it would be nice to compare to #1154. This approach looks good, but it can require up to two widening multiplies (or one and a half). Still worth benchmarking. @TheIronBorn would you be interested (again)? |
Sure. Working on constant-range benchmarks as well |
I put together a quick-and-dirty implementation and a benchmark. Here's the branch https://github.com/TheIronBorn/rand/tree/unif-benches. I added a modulo to I didn't implement 128-bit integers (what do we do if using https://github.com/BurntSushi/cargo-benchcmp, name (w/ Xoshiro256PlusPlus) oneill ns/iter canon ns/iter diff ns/iter diff % speedup
+_distr_uniform_high_reject_i16 37,595 (1063 MB/s) 37,065 (1079 MB/s) -530 -1.41% x 1.01
+_distr_uniform_high_reject_i32 200,248 (399 MB/s) 110,348 (724 MB/s) -89,900 -44.89% x 1.81
+_distr_uniform_high_reject_i64 201,781 (792 MB/s) 110,587 (1446 MB/s) -91,194 -45.19% x 1.82
+_distr_uniform_high_reject_i8 37,887 (527 MB/s) 37,231 (537 MB/s) -656 -1.73% x 1.02
-_distr_uniform_low_reject_i16 37,451 (1068 MB/s) 40,636 (984 MB/s) 3,185 8.50% x 0.92
-_distr_uniform_low_reject_i32 37,554 (2130 MB/s) 40,591 (1970 MB/s) 3,037 8.09% x 0.93
+_distr_uniform_low_reject_i64 35,988 (4445 MB/s) 35,958 (4449 MB/s) -30 -0.08% x 1.00
-_distr_uniform_low_reject_i8 37,605 (531 MB/s) 40,507 (493 MB/s) 2,902 7.72% x 0.93
+_gen_range_i16_high 47,645 (839 MB/s) 47,396 (843 MB/s) -249 -0.52% x 1.01
+_gen_range_i16_low 48,485 (824 MB/s) 43,836 (912 MB/s) -4,649 -9.59% x 1.11
+_gen_range_i32_high 221,542 (361 MB/s) 133,900 (597 MB/s) -87,642 -39.56% x 1.65
+_gen_range_i32_low 40,666 (1967 MB/s) 40,596 (1970 MB/s) -70 -0.17% x 1.00
+_gen_range_i64_high 239,002 (669 MB/s) 131,817 (1213 MB/s) -107,185 -44.85% x 1.81
+_gen_range_i64_low 40,780 (3923 MB/s) 39,284 (4072 MB/s) -1,496 -3.67% x 1.04
-_gen_range_i8_high 40,472 (494 MB/s) 47,257 (423 MB/s) 6,785 16.76% x 0.86
-_gen_range_i8_low 40,575 (492 MB/s) 43,446 (460 MB/s) 2,871 7.08% x 0.93 It's a lot faster when rejections are likely and about the same otherwise. 8-bit and 16-bit have less speedup because the internal 32/64 bit rejections are less likely and perhaps because of the more expensive 64->128 bit multiplication. Note: I used name (w/ Pcg64Mcg) oneill ns/iter canon ns/iter diff ns/iter diff % speedup
-_distr_uniform_high_reject_i16 43,841 (912 MB/s) 126,301 (316 MB/s) 82,460 188.09% x 0.35
+_distr_uniform_high_reject_i32 244,767 (326 MB/s) 127,360 (628 MB/s) -117,407 -47.97% x 1.92
+_distr_uniform_high_reject_i64 247,989 (645 MB/s) 139,844 (1144 MB/s) -108,145 -43.61% x 1.77
-_distr_uniform_high_reject_i8 43,864 (455 MB/s) 127,611 (156 MB/s) 83,747 190.92% x 0.34
-_distr_uniform_low_reject_i16 47,197 (847 MB/s) 139,575 (286 MB/s) 92,378 195.73% x 0.34
-_distr_uniform_low_reject_i32 47,280 (1692 MB/s) 139,337 (574 MB/s) 92,057 194.71% x 0.34
-_distr_uniform_low_reject_i64 40,512 (3949 MB/s) 131,778 (1214 MB/s) 91,266 225.28% x 0.31
-_distr_uniform_low_reject_i8 47,263 (423 MB/s) 139,405 (143 MB/s) 92,142 194.96% x 0.34
-_gen_range_i16_high 54,133 (738 MB/s) 60,703 (658 MB/s) 6,570 12.14% x 0.89
-_gen_range_i16_low 48,056 (832 MB/s) 60,850 (657 MB/s) 12,794 26.62% x 0.79
+_gen_range_i32_high 259,113 (308 MB/s) 165,511 (483 MB/s) -93,602 -36.12% x 1.57
-_gen_range_i32_low 45,581 (1755 MB/s) 55,439 (1443 MB/s) 9,858 21.63% x 0.82
+_gen_range_i64_high 266,040 (601 MB/s) 184,480 (867 MB/s) -81,560 -30.66% x 1.44
-_gen_range_i64_low 47,443 (3372 MB/s) 64,619 (2476 MB/s) 17,176 36.20% x 0.73
-_gen_range_i8_high 47,831 (418 MB/s) 63,884 (313 MB/s) 16,053 33.56% x 0.75
-_gen_range_i8_low 45,598 (438 MB/s) 63,041 (317 MB/s) 17,443 38.25% x 0.72 |
Thanks. I believe
The Swift impl just uses generics (
Interesting. Ideally we'd want to benchmark across several RNGs each across several different CPUs. But using Another question here is: exactly how much bias is acceptable for the cut-off? |
You’re referring to the modulo in constant? The author recommended that (here) though perhaps I’m not applying it correctly. It’s not intended to reduce bias, just to make the second generation and multiplication less likely |
And perhaps we could change the comments there to discourage that understanding |
Added 128-bit implementation let (new_hi_order, _) = (rng.gen::<u64>() as $u_extra_large).wmul(range as $u_extra_large); name (w/ Xoshiro256PlusPlus) oneill ns/iter canon ns/iter diff ns/iter diff % speedup
+_distr_uniform_high_reject_i128 460,530 (694 MB/s) 157,020 (2037 MB/s) -303,510 -65.90% x 2.93
+_distr_uniform_low_reject_i128 368,029 (869 MB/s) 108,908 (2938 MB/s) -259,121 -70.41% x 3.38
+_gen_range_i128_high 515,440 (620 MB/s) 363,085 (881 MB/s) -152,355 -29.56% x 1.42
-_gen_range_i128_low 233,882 (1368 MB/s) 238,505 (1341 MB/s) 4,623 1.98% x 0.98 |
our shuffle benchmark is almost 2x faster
(with |
SIMD performance is surprising. I thought the lack of an easy SIMD 64->128 multiplication would slow things down too much but it seems for name oneill ns/iter canon ns/iter diff ns/iter diff % speedup
-_distr_uniform_high_reject_i16x8 100,478 (796 MB/s) 172,737 (463 MB/s) 72,259 71.92% x 0.58
+_distr_uniform_high_reject_i32x4 212,538 (376 MB/s) 152,175 (525 MB/s) -60,363 -28.40% x 1.40
+_distr_uniform_high_reject_i64x2 155,429 (514 MB/s) 69,997 (1142 MB/s) -85,432 -54.97% x 2.22
-_distr_uniform_low_reject_i16x8 25,793 (3101 MB/s) 226,827 (352 MB/s) 201,034 779.41% x 0.11
-_distr_uniform_low_reject_i32x4 26,227 (3050 MB/s) 145,611 (549 MB/s) 119,384 455.20% x 0.18
-_distr_uniform_low_reject_i64x2 32,843 (2435 MB/s) 95,787 (835 MB/s) 62,944 191.65% x 0.34
+_gen_range_i16x8_high_reject 301,268 (265 MB/s) 112,707 (709 MB/s) -188,561 -62.59% x 2.67
+_gen_range_i16x8_low_reject 817,562 (97 MB/s) 168,693 (474 MB/s) -648,869 -79.37% x 4.85
+_gen_range_i32x4_high_reject 150,803 (530 MB/s) 75,148 (1064 MB/s) -75,655 -50.17% x 2.01
+_gen_range_i32x4_low_reject 475,460 (168 MB/s) 138,031 (579 MB/s) -337,429 -70.97% x 3.44
+_gen_range_i64x2_high_reject 267,958 (298 MB/s) 42,753 (1871 MB/s) -225,205 -84.04% x 6.27
+_gen_range_i64x2_low_reject 453,003 (176 MB/s) 103,087 (776 MB/s) -349,916 -77.24% x 4.39 |
Interesting results. But you didn't push your SIMD code? Why does Of course, the new method requires generating
It might end up being better using 32-bit steps for (some) SIMD impls since it reduces the amount of data up-front. The drawback is that we may need up to two extra steps to get bias below 1-in-2^64 samples. If there are significant performance implications, we could consider variants with differing bias limits. In fact, providing it is well documented, we could probably simply reduce the bias tolerance to |
I did push SIMD commits? https://github.com/TheIronBorn/rand/blob/9e91003d5cdb429bc2a53e4f8a33a34613534be5/src/distributions/uniform.rs#L725-L759 I've since also added benchmarks for those comparisons (branchless, branchy, mod_branchy) you mentioned but haven't run them bc it would take a while. I think I should just worry about the 128-bit (XMM) sized types for now.
As you say two steps of |
If we had math for exactly how many bits are required we might be able to provide guarantees for 8-bit widths. I don't understand the proposal well enough for that though. |
We never know in advance how many bits are required. The method works by adding bits until either it is known that further precision cannot change the result or the expectation of an incorrect result is sufficiently small (no more than one biased result in 2^64 samples). The reason the algorithm works well for single results is that on a 64-bit CPU, throwing 64-bits at a problem expected to require far less is cheap. But throwing 64×8=512 bits at We could even use 16-bit steps, but I'm less convinced, because we'd require up to four extra steps and testing is probably also a significant cost. It may simply be better to keep the old method for some of these SIMD impls. |
I couldn't figure out how to do two-step method for the extra bits (without just doing a 64-bit mul using 32-bit operations that is) But I put together a comparison of branchless/branchy/mod here: name canon_branchless ns/iter canon_branchy_no_mod ns/iter diff ns/iter diff % speedup
-_gen_range_i16x8_high_reject 370,063 (864 MB/s) 381,762 (838 MB/s) 11,699 3.16% x 0.97
+_gen_range_i16x8_low_reject 419,860 (762 MB/s) 330,372 (968 MB/s) -89,488 -21.31% x 1.27
-_gen_range_i32x4_high_reject 230,870 (1386 MB/s) 237,903 (1345 MB/s) 7,033 3.05% x 0.97
+_gen_range_i32x4_low_reject 257,390 (1243 MB/s) 142,934 (2238 MB/s) -114,456 -44.47% x 1.80
-_gen_range_i64x2_high_reject 206,744 (1547 MB/s) 264,210 (1211 MB/s) 57,466 27.80% x 0.78
+_gen_range_i64x2_low_reject 224,032 (1428 MB/s) 146,566 (2183 MB/s) -77,466 -34.58% x 1.53
-_gen_range_i8x16_high_reject 665,987 (480 MB/s) 691,560 (462 MB/s) 25,573 3.84% x 0.96
+_gen_range_i8x16_low_reject 739,970 (432 MB/s) 623,880 (512 MB/s) -116,090 -15.69% x 1.19
name canon_branchless ns/iter canon_branchy_mod ns/iter diff ns/iter diff % speedup
-_distr_uniform_i16x8_high_reject 363,250 (880 MB/s) 365,540 (875 MB/s) 2,290 0.63% x 0.99
+_distr_uniform_i16x8_low_reject 353,415 (905 MB/s) 68,905 (4644 MB/s) -284,510 -80.50% x 5.13
-_distr_uniform_i32x4_high_reject 210,531 (1519 MB/s) 234,356 (1365 MB/s) 23,825 11.32% x 0.90
+_distr_uniform_i32x4_low_reject 246,272 (1299 MB/s) 83,541 (3830 MB/s) -162,731 -66.08% x 2.95
-_distr_uniform_i64x2_high_reject 146,966 (2177 MB/s) 199,202 (1606 MB/s) 52,236 35.54% x 0.74
+_distr_uniform_i64x2_low_reject 157,588 (2030 MB/s) 75,960 (4212 MB/s) -81,628 -51.80% x 2.07
+_distr_uniform_i8x16_high_reject 715,660 (447 MB/s) 669,060 (478 MB/s) -46,600 -6.51% x 1.07
+_distr_uniform_i8x16_low_reject 706,100 (453 MB/s) 134,419 (2380 MB/s) -571,681 -80.96% x 5.25 The modulo seems to make little difference though: name canon_branchy_no_mod ns/iter canon_branchy_mod ns/iter diff ns/iter diff % speedup
+_distr_uniform_i16x8_high_reject 367,512 (870 MB/s) 365,540 (875 MB/s) -1,972 -0.54% x 1.01
+_distr_uniform_i16x8_low_reject 76,440 (4186 MB/s) 68,905 (4644 MB/s) -7,535 -9.86% x 1.11
-_distr_uniform_i32x4_high_reject 224,735 (1423 MB/s) 234,356 (1365 MB/s) 9,621 4.28% x 0.96
+_distr_uniform_i32x4_low_reject 83,820 (3817 MB/s) 83,541 (3830 MB/s) -279 -0.33% x 1.00
-_distr_uniform_i64x2_high_reject 177,215 (1805 MB/s) 199,202 (1606 MB/s) 21,987 12.41% x 0.89
+_distr_uniform_i64x2_low_reject 83,113 (3850 MB/s) 75,960 (4212 MB/s) -7,153 -8.61% x 1.09
+_distr_uniform_i8x16_high_reject 674,635 (474 MB/s) 669,060 (478 MB/s) -5,575 -0.83% x 1.01
+_distr_uniform_i8x16_low_reject 183,900 (1740 MB/s) 134,419 (2380 MB/s) -49,481 -26.91% x 1.37 I thought at first a series of 64->128 name oneill ns/iter canon_branchy_no_mod ns/iter diff ns/iter diff % speedup
+_distr_uniform_i16x8_high_reject 552,320 (579 MB/s) 367,512 (870 MB/s) -184,808 -33.46% x 1.50
+_distr_uniform_i16x8_low_reject 79,120 (4044 MB/s) 76,440 (4186 MB/s) -2,680 -3.39% x 1.04
+_distr_uniform_i32x4_high_reject 582,245 (549 MB/s) 224,735 (1423 MB/s) -357,510 -61.40% x 2.59
-_distr_uniform_i32x4_low_reject 77,865 (4109 MB/s) 83,820 (3817 MB/s) 5,955 7.65% x 0.93
+_distr_uniform_i64x2_high_reject 720,860 (443 MB/s) 177,215 (1805 MB/s) -543,645 -75.42% x 4.07
+_distr_uniform_i64x2_low_reject 130,595 (2450 MB/s) 83,113 (3850 MB/s) -47,482 -36.36% x 1.57
+_distr_uniform_i8x16_high_reject 851,180 (375 MB/s) 674,635 (474 MB/s) -176,545 -20.74% x 1.26
-_distr_uniform_i8x16_low_reject 111,113 (2879 MB/s) 183,900 (1740 MB/s) 72,787 65.51% x 0.60 (the slowdown in Here's the full bench logs https://github.com/TheIronBorn/rand/blob/unif-benches/benches/unif_bench_logs |
Aha, this is the stop condition for Lemire's method. Regardless, it's only a useful optimisation where the range is known at compile-time, which may be the case in your benchmarks but would require an API extension to expose it properly (e.g. we don't want to use it for shuffling), so it's probably only worth considering if there is a significant advantage (only really seen in |
There are some issues with your benchmarks unfortunately:
I'm still going over stuff and will push a draft PR when done. |
This is based on @TheIronBorn's work (#1154, #1172), with some changes.
This is based on @TheIronBorn's work (#1154, #1172), with some changes.
This is based on @TheIronBorn's work (#1154, #1172), with some changes.
This is based on @TheIronBorn's work (#1154, #1172), with some changes.
Background
What is your motivation?
I came across this interesting pull request that someone made on the official Swift programming language repository
swiftlang/swift#39143
It makes some very intriguing claims, stating such as:
and
And I was wondering if this might similarly be useful in your rand crate
What type of application is this? (E.g. cryptography, game, numerical simulation)
All applications of random number generation. And from what I understand could make SIMD improvements for random number generation possible also?
Feature request
I request that the rand crate authors have a look at this algorithm swiftlang/swift#39143 and that they determine whether it could be implemented into the rand crate to improve performance for the generation of random numbers.
And alternatively, if it is something that needs language support inside of Rust itself. I am not able to determine on my own whether such is the case or not.
The text was updated successfully, but these errors were encountered: