Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Look into this alleged optimal algorithm for bounded random integers #1172

Closed
ctsrc opened this issue Sep 2, 2021 · 17 comments
Closed

Look into this alleged optimal algorithm for bounded random integers #1172

ctsrc opened this issue Sep 2, 2021 · 17 comments
Labels
A-new Propose usage of a new algorithm

Comments

@ctsrc
Copy link

ctsrc commented Sep 2, 2021

Background

What is your motivation?

I came across this interesting pull request that someone made on the official Swift programming language repository

swiftlang/swift#39143

It makes some very intriguing claims, stating such as:

reduces the worst-case expected number of random bits required from O(log₂(upperBound)) to log₂(upperBound) + O(1), which is optimal

and

this algorithm can be made unconditional by removing the early out, so that every value computed requires word size + 64 bits from the stream, which breaks the loop-carried dependency for fast generators, unlocking vectorization and parallelization where it was previously impossible

And I was wondering if this might similarly be useful in your rand crate

What type of application is this? (E.g. cryptography, game, numerical simulation)

All applications of random number generation. And from what I understand could make SIMD improvements for random number generation possible also?

Feature request

I request that the rand crate authors have a look at this algorithm swiftlang/swift#39143 and that they determine whether it could be implemented into the rand crate to improve performance for the generation of random numbers.

And alternatively, if it is something that needs language support inside of Rust itself. I am not able to determine on my own whether such is the case or not.

@TheIronBorn
Copy link
Contributor

Lemire seems to think it is correct https://twitter.com/lemire/status/1433523690696757251

@dhardy
Copy link
Member

dhardy commented Sep 3, 2021

Indeed, this looks nice, especially the practical option of fixing the maximum number of random bits used. Some notes:

  • A limitation of RngCore is that we don't know whether the RNG generates 32- or 64-bits natively. Sampling a 32-bit number using this method, with the same cut-off threshold of ≤2⁻⁶⁴ of bias, would require up to three 32-bit samples or up to two 64-bit samples; possibly the best approach here would be different logic for 32-bit vs 64-bit CPUs.
  • In theory sampling a u128 requires up to three u64 values which may be preferable to sampling up to two u128 values, though in practice the difference is unlikely to be important.
  • For SIMD implementations, we could take the same approach, terminating early only if all lanes are within the bound.

Then of course it would be nice to compare to #1154. This approach looks good, but it can require up to two widening multiplies (or one and a half). Still worth benchmarking.

@TheIronBorn would you be interested (again)?

@TheIronBorn
Copy link
Contributor

Sure. Working on constant-range benchmarks as well

@TheIronBorn
Copy link
Contributor

TheIronBorn commented Sep 5, 2021

I put together a quick-and-dirty implementation and a benchmark. Here's the branch https://github.com/TheIronBorn/rand/tree/unif-benches. I added a modulo to Uniform::sample to improve the early exit. I think benchmarking on multiple different hardware would be good (especially 32-bit machines with different PRNGs), you can just run cargo bench --bench distributions.

I didn't implement 128-bit integers (what do we do if range > 2^64? just generate 128 bits?)

using https://github.com/BurntSushi/cargo-benchcmp, cargo benchcmp oneill canon <saved_bench_output>

 name (w/ Xoshiro256PlusPlus)    oneill ns/iter      canon ns/iter        diff ns/iter   diff %  speedup
+_distr_uniform_high_reject_i16  37,595 (1063 MB/s)  37,065 (1079 MB/s)           -530   -1.41%   x 1.01
+_distr_uniform_high_reject_i32  200,248 (399 MB/s)  110,348 (724 MB/s)        -89,900  -44.89%   x 1.81
+_distr_uniform_high_reject_i64  201,781 (792 MB/s)  110,587 (1446 MB/s)       -91,194  -45.19%   x 1.82
+_distr_uniform_high_reject_i8   37,887 (527 MB/s)   37,231 (537 MB/s)            -656   -1.73%   x 1.02
-_distr_uniform_low_reject_i16   37,451 (1068 MB/s)  40,636 (984 MB/s)           3,185    8.50%   x 0.92
-_distr_uniform_low_reject_i32   37,554 (2130 MB/s)  40,591 (1970 MB/s)          3,037    8.09%   x 0.93
+_distr_uniform_low_reject_i64   35,988 (4445 MB/s)  35,958 (4449 MB/s)            -30   -0.08%   x 1.00
-_distr_uniform_low_reject_i8    37,605 (531 MB/s)   40,507 (493 MB/s)           2,902    7.72%   x 0.93
+_gen_range_i16_high             47,645 (839 MB/s)   47,396 (843 MB/s)            -249   -0.52%   x 1.01
+_gen_range_i16_low              48,485 (824 MB/s)   43,836 (912 MB/s)          -4,649   -9.59%   x 1.11
+_gen_range_i32_high             221,542 (361 MB/s)  133,900 (597 MB/s)        -87,642  -39.56%   x 1.65
+_gen_range_i32_low              40,666 (1967 MB/s)  40,596 (1970 MB/s)            -70   -0.17%   x 1.00
+_gen_range_i64_high             239,002 (669 MB/s)  131,817 (1213 MB/s)      -107,185  -44.85%   x 1.81
+_gen_range_i64_low              40,780 (3923 MB/s)  39,284 (4072 MB/s)         -1,496   -3.67%   x 1.04
-_gen_range_i8_high              40,472 (494 MB/s)   47,257 (423 MB/s)           6,785   16.76%   x 0.86
-_gen_range_i8_low               40,575 (492 MB/s)   43,446 (460 MB/s)           2,871    7.08%   x 0.93

It's a lot faster when rejections are likely and about the same otherwise. 8-bit and 16-bit have less speedup because the internal 32/64 bit rejections are less likely and perhaps because of the more expensive 64->128 bit multiplication.

Note: I used SmallRng (Xoshiro256PlusPlus) as Pcg64Mcg was significantly slower, likely due to the heavier use of 64->128 bit multiplication. Such a significant contextual slowdown is concerning for a general use API.

 name (w/ Pcg64Mcg)              oneill ns/iter      canon ns/iter        diff ns/iter   diff %  speedup
-_distr_uniform_high_reject_i16  43,841 (912 MB/s)   126,301 (316 MB/s)         82,460  188.09%   x 0.35
+_distr_uniform_high_reject_i32  244,767 (326 MB/s)  127,360 (628 MB/s)       -117,407  -47.97%   x 1.92
+_distr_uniform_high_reject_i64  247,989 (645 MB/s)  139,844 (1144 MB/s)      -108,145  -43.61%   x 1.77
-_distr_uniform_high_reject_i8   43,864 (455 MB/s)   127,611 (156 MB/s)         83,747  190.92%   x 0.34
-_distr_uniform_low_reject_i16   47,197 (847 MB/s)   139,575 (286 MB/s)         92,378  195.73%   x 0.34
-_distr_uniform_low_reject_i32   47,280 (1692 MB/s)  139,337 (574 MB/s)         92,057  194.71%   x 0.34
-_distr_uniform_low_reject_i64   40,512 (3949 MB/s)  131,778 (1214 MB/s)        91,266  225.28%   x 0.31
-_distr_uniform_low_reject_i8    47,263 (423 MB/s)   139,405 (143 MB/s)         92,142  194.96%   x 0.34
-_gen_range_i16_high             54,133 (738 MB/s)   60,703 (658 MB/s)           6,570   12.14%   x 0.89
-_gen_range_i16_low              48,056 (832 MB/s)   60,850 (657 MB/s)          12,794   26.62%   x 0.79
+_gen_range_i32_high             259,113 (308 MB/s)  165,511 (483 MB/s)        -93,602  -36.12%   x 1.57
-_gen_range_i32_low              45,581 (1755 MB/s)  55,439 (1443 MB/s)          9,858   21.63%   x 0.82
+_gen_range_i64_high             266,040 (601 MB/s)  184,480 (867 MB/s)        -81,560  -30.66%   x 1.44
-_gen_range_i64_low              47,443 (3372 MB/s)  64,619 (2476 MB/s)         17,176   36.20%   x 0.73
-_gen_range_i8_high              47,831 (418 MB/s)   63,884 (313 MB/s)          16,053   33.56%   x 0.75
-_gen_range_i8_low               45,598 (438 MB/s)   63,041 (317 MB/s)          17,443   38.25%   x 0.72

@dhardy
Copy link
Member

dhardy commented Sep 5, 2021

Thanks. I believe sample_single_inclusive_canon is correct (code), but the comments indicate you don't quite understand the maths. This method doesn't use rejection sampling on biased results, instead it step-wise approaches the result of multiplying a true-real (infinite precision) random value r in [0, 1) multiplied by range. The extra step(s) are not about bias, but about performing as many steps as required until we know that extra steps cannot change the result after integer rounding — except that we stop after one extra step since the bounds on bias are insignificantly small. (So the change in sample_constant_inclusive_canon doesn't make sense.)

I didn't implement 128-bit integers (what do we do if range > 2^64? just generate 128 bits?)

The Swift impl just uses generics (T: unsigned-fixed-width-int, where bits(T) ≥ 64) if I read it correctly. We could do better: conceptually, three steps with u64 random values multiplied into a 192-bit target. Practically though it makes sense to combine the first two steps (construct a 128-bit random value and use widening multiply into 128-bit high and low parts, with optional early exit). Next, multiply the third u64 by the range into a 192-bit target, take the upper 128 bits and add to the 128-bit low part of the first step, checking for overflow.

Pcg64Mcg was significantly slower

Interesting. Ideally we'd want to benchmark across several RNGs each across several different CPUs. But using sample_single_inclusive_canon instead of sample_constant_inclusive_canon (see above) I don't think the results look so bad.


Another question here is: exactly how much bias is acceptable for the cut-off? 2⁻⁶⁴ seems reasonable, but e.g. if we accept 2⁻⁴⁸ then we could yield u8 and u16 results from a single u64. @vks, others?

@TheIronBorn
Copy link
Contributor

TheIronBorn commented Sep 5, 2021

The change in constant is incorrect

You’re referring to the modulo in constant? The author recommended that (here) though perhaps I’m not applying it correctly.

It’s not intended to reduce bias, just to make the second generation and multiplication less likely

@TheIronBorn
Copy link
Contributor

And perhaps we could change the comments there to discourage that understanding

@TheIronBorn
Copy link
Contributor

Added 128-bit implementation

let (new_hi_order, _) = (rng.gen::<u64>() as $u_extra_large).wmul(range as $u_extra_large);
 name (w/ Xoshiro256PlusPlus)     oneill ns/iter       canon ns/iter        diff ns/iter   diff %  speedup
+_distr_uniform_high_reject_i128  460,530 (694 MB/s)   157,020 (2037 MB/s)      -303,510  -65.90%   x 2.93
+_distr_uniform_low_reject_i128   368,029 (869 MB/s)   108,908 (2938 MB/s)      -259,121  -70.41%   x 3.38
+_gen_range_i128_high             515,440 (620 MB/s)   363,085 (881 MB/s)       -152,355  -29.56%   x 1.42
-_gen_range_i128_low              233,882 (1368 MB/s)  238,505 (1341 MB/s)         4,623    1.98%   x 0.98

@TheIronBorn
Copy link
Contributor

our shuffle benchmark is almost 2x faster

 name              oneill ns/iter  canon ns/iter  diff ns/iter   diff %  speedup
 _seq_shuffle_100  597             311                    -286  -47.91%   x 1.92

(with Xoshiro128PlusPlus, 1.5x with Pcg32)

@vks vks mentioned this issue Sep 7, 2021
24 tasks
@TheIronBorn
Copy link
Contributor

TheIronBorn commented Sep 9, 2021

SIMD performance is surprising. I thought the lack of an easy SIMD 64->128 multiplication would slow things down too much but it seems for sample_single the lack of threshold/bitmask generation and fewer random bits needed helps a lot. (I did the math once for the expected iterations for the blend method but lost it)
In some cases slower than just scalar thanks to the scalar improvements.

 name                              oneill ns/iter      canon ns/iter       diff ns/iter   diff %  speedup
-_distr_uniform_high_reject_i16x8  100,478 (796 MB/s)  172,737 (463 MB/s)        72,259   71.92%   x 0.58
+_distr_uniform_high_reject_i32x4  212,538 (376 MB/s)  152,175 (525 MB/s)       -60,363  -28.40%   x 1.40
+_distr_uniform_high_reject_i64x2  155,429 (514 MB/s)  69,997 (1142 MB/s)       -85,432  -54.97%   x 2.22
-_distr_uniform_low_reject_i16x8   25,793 (3101 MB/s)  226,827 (352 MB/s)       201,034  779.41%   x 0.11
-_distr_uniform_low_reject_i32x4   26,227 (3050 MB/s)  145,611 (549 MB/s)       119,384  455.20%   x 0.18
-_distr_uniform_low_reject_i64x2   32,843 (2435 MB/s)  95,787 (835 MB/s)         62,944  191.65%   x 0.34
+_gen_range_i16x8_high_reject      301,268 (265 MB/s)  112,707 (709 MB/s)      -188,561  -62.59%   x 2.67
+_gen_range_i16x8_low_reject       817,562 (97 MB/s)   168,693 (474 MB/s)      -648,869  -79.37%   x 4.85
+_gen_range_i32x4_high_reject      150,803 (530 MB/s)  75,148 (1064 MB/s)       -75,655  -50.17%   x 2.01
+_gen_range_i32x4_low_reject       475,460 (168 MB/s)  138,031 (579 MB/s)      -337,429  -70.97%   x 3.44
+_gen_range_i64x2_high_reject      267,958 (298 MB/s)  42,753 (1871 MB/s)      -225,205  -84.04%   x 6.27
+_gen_range_i64x2_low_reject       453,003 (176 MB/s)  103,087 (776 MB/s)      -349,916  -77.24%   x 4.39

@dhardy
Copy link
Member

dhardy commented Sep 11, 2021

Interesting results. But you didn't push your SIMD code?

Why does distr_uniform look so bad (I assume compared to the old method) for i16x8 and i32x4? A possible alternative would be to drop the test and always do the second step for x4 and x8 SIMD impls. Another, which might be acceptable (see note above) would be to never do a second step for integers ≤16 bits. It would be nice to see comparators between all four (the previous method, the new method with test for early exit, without test in 1-step and 2-step variants... but then considering multiple ranges and multiple RNGs then requires quite a lot more testing...).

Of course, the new method requires generating u64 or u128 for each lane, which is potentially a lot more random data than with the rejection method (although that could potentially have rather poor performance in some cases).

SIMD performance is surprising. I thought the lack of an easy SIMD 64->128 multiplication

It might end up being better using 32-bit steps for (some) SIMD impls since it reduces the amount of data up-front. The drawback is that we may need up to two extra steps to get bias below 1-in-2^64 samples.


If there are significant performance implications, we could consider variants with differing bias limits. 2^-32 should be enough for many things (though potentially with detectable bias in large studies) while 2^-16 is probably fine for most games. Using 2^-64 is a nice "one size fits all" approach, but probably does leave significant room for improvement in SIMD impls — should anyone care.

In fact, providing it is well documented, we could probably simply reduce the bias tolerance to 2^-32 for selected SIMD impls.

@TheIronBorn
Copy link
Contributor

TheIronBorn commented Sep 11, 2021

I did push SIMD commits? https://github.com/TheIronBorn/rand/blob/9e91003d5cdb429bc2a53e4f8a33a34613534be5/src/distributions/uniform.rs#L725-L759

I've since also added benchmarks for those comparisons (branchless, branchy, mod_branchy) you mentioned but haven't run them bc it would take a while. I think I should just worry about the 128-bit (XMM) sized types for now.

i16x8 "constant" performance is so much slower because it requires generating and full-width multiplying a u64x8, at least 4x as much "work" as a 16x8 multiplication. gen_range is so much better because 8 modulos is very slow.

As you say two steps of u32x8 or even just one (if the bias is small enough) could significantly speed things up. I'll see about two-step 32xN benchmarks.

@TheIronBorn
Copy link
Contributor

If we had math for exactly how many bits are required we might be able to provide guarantees for 8-bit widths. I don't understand the proposal well enough for that though.

@dhardy
Copy link
Member

dhardy commented Sep 12, 2021

We never know in advance how many bits are required. The method works by adding bits until either it is known that further precision cannot change the result or the expectation of an incorrect result is sufficiently small (no more than one biased result in 2^64 samples).

The reason the algorithm works well for single results is that on a 64-bit CPU, throwing 64-bits at a problem expected to require far less is cheap. But throwing 64×8=512 bits at i16x8 is less cheap, however 32×8=256 bits is still very likely enough bits. Makes sense? Sorry, you probably already understood most of that.

We could even use 16-bit steps, but I'm less convinced, because we'd require up to four extra steps and testing is probably also a significant cost.

It may simply be better to keep the old method for some of these SIMD impls.

@TheIronBorn
Copy link
Contributor

TheIronBorn commented Sep 23, 2021

I couldn't figure out how to do two-step method for the extra bits (without just doing a 64-bit mul using 32-bit operations that is)

But I put together a comparison of branchless/branchy/mod here:

 name                               canon_branchless ns/iter  canon_branchy_no_mod ns/iter  diff ns/iter   diff %  speedup
-_gen_range_i16x8_high_reject       370,063 (864 MB/s)        381,762 (838 MB/s)                  11,699    3.16%   x 0.97
+_gen_range_i16x8_low_reject        419,860 (762 MB/s)        330,372 (968 MB/s)                 -89,488  -21.31%   x 1.27
-_gen_range_i32x4_high_reject       230,870 (1386 MB/s)       237,903 (1345 MB/s)                  7,033    3.05%   x 0.97
+_gen_range_i32x4_low_reject        257,390 (1243 MB/s)       142,934 (2238 MB/s)               -114,456  -44.47%   x 1.80
-_gen_range_i64x2_high_reject       206,744 (1547 MB/s)       264,210 (1211 MB/s)                 57,466   27.80%   x 0.78
+_gen_range_i64x2_low_reject        224,032 (1428 MB/s)       146,566 (2183 MB/s)                -77,466  -34.58%   x 1.53
-_gen_range_i8x16_high_reject       665,987 (480 MB/s)        691,560 (462 MB/s)                  25,573    3.84%   x 0.96
+_gen_range_i8x16_low_reject        739,970 (432 MB/s)        623,880 (512 MB/s)                -116,090  -15.69%   x 1.19

 name                               canon_branchless ns/iter  canon_branchy_mod ns/iter  diff ns/iter   diff %  speedup
-_distr_uniform_i16x8_high_reject   363,250 (880 MB/s)        365,540 (875 MB/s)                2,290    0.63%   x 0.99
+_distr_uniform_i16x8_low_reject    353,415 (905 MB/s)        68,905 (4644 MB/s)             -284,510  -80.50%   x 5.13
-_distr_uniform_i32x4_high_reject   210,531 (1519 MB/s)       234,356 (1365 MB/s)              23,825   11.32%   x 0.90
+_distr_uniform_i32x4_low_reject    246,272 (1299 MB/s)       83,541 (3830 MB/s)             -162,731  -66.08%   x 2.95
-_distr_uniform_i64x2_high_reject   146,966 (2177 MB/s)       199,202 (1606 MB/s)              52,236   35.54%   x 0.74
+_distr_uniform_i64x2_low_reject    157,588 (2030 MB/s)       75,960 (4212 MB/s)              -81,628  -51.80%   x 2.07
+_distr_uniform_i8x16_high_reject   715,660 (447 MB/s)        669,060 (478 MB/s)              -46,600   -6.51%   x 1.07
+_distr_uniform_i8x16_low_reject    706,100 (453 MB/s)        134,419 (2380 MB/s)            -571,681  -80.96%   x 5.25

The modulo seems to make little difference though:

 name                               canon_branchy_no_mod ns/iter  canon_branchy_mod ns/iter  diff ns/iter   diff %  speedup
+_distr_uniform_i16x8_high_reject   367,512 (870 MB/s)            365,540 (875 MB/s)               -1,972   -0.54%   x 1.01
+_distr_uniform_i16x8_low_reject    76,440 (4186 MB/s)            68,905 (4644 MB/s)               -7,535   -9.86%   x 1.11
-_distr_uniform_i32x4_high_reject   224,735 (1423 MB/s)           234,356 (1365 MB/s)               9,621    4.28%   x 0.96
+_distr_uniform_i32x4_low_reject    83,820 (3817 MB/s)            83,541 (3830 MB/s)                 -279   -0.33%   x 1.00
-_distr_uniform_i64x2_high_reject   177,215 (1805 MB/s)           199,202 (1606 MB/s)              21,987   12.41%   x 0.89
+_distr_uniform_i64x2_low_reject    83,113 (3850 MB/s)            75,960 (4212 MB/s)               -7,153   -8.61%   x 1.09
+_distr_uniform_i8x16_high_reject   674,635 (474 MB/s)            669,060 (478 MB/s)               -5,575   -0.83%   x 1.01
+_distr_uniform_i8x16_low_reject    183,900 (1740 MB/s)           134,419 (2380 MB/s)             -49,481  -26.91%   x 1.37

I thought at first a series of 64->128 mulx would perform better for wmul (used in previous posts) but did some analysis and the LLVM version we already had was better. It improved things vs. the Oneill method:

 name                               oneill ns/iter        canon_branchy_no_mod ns/iter  diff ns/iter   diff %  speedup
+_distr_uniform_i16x8_high_reject   552,320 (579 MB/s)    367,512 (870 MB/s)                -184,808  -33.46%   x 1.50
+_distr_uniform_i16x8_low_reject    79,120 (4044 MB/s)    76,440 (4186 MB/s)                  -2,680   -3.39%   x 1.04
+_distr_uniform_i32x4_high_reject   582,245 (549 MB/s)    224,735 (1423 MB/s)               -357,510  -61.40%   x 2.59
-_distr_uniform_i32x4_low_reject    77,865 (4109 MB/s)    83,820 (3817 MB/s)                   5,955    7.65%   x 0.93
+_distr_uniform_i64x2_high_reject   720,860 (443 MB/s)    177,215 (1805 MB/s)               -543,645  -75.42%   x 4.07
+_distr_uniform_i64x2_low_reject    130,595 (2450 MB/s)   83,113 (3850 MB/s)                 -47,482  -36.36%   x 1.57
+_distr_uniform_i8x16_high_reject   851,180 (375 MB/s)    674,635 (474 MB/s)                -176,545  -20.74%   x 1.26
-_distr_uniform_i8x16_low_reject    111,113 (2879 MB/s)   183,900 (1740 MB/s)                 72,787   65.51%   x 0.60

(the slowdown in i8x16_low_reject may be codesize or the relatively greater prob of 64->128 mul)

Here's the full bench logs https://github.com/TheIronBorn/rand/blob/unif-benches/benches/unif_bench_logs
Also a look at the best for each lane-width: https://github.com/TheIronBorn/rand/blob/unif-benches/benches/bench_sorts, 128-bit is generally faster in scalar, 8/16 with high rejections and constant ranges, not sure what's up with gen_range_i64_low_reject

@dhardy
Copy link
Member

dhardy commented Sep 28, 2021

You’re referring to the modulo in constant? The author recommended that (here) though perhaps I’m not applying it correctly.

Aha, this is the stop condition for Lemire's method. Regardless, it's only a useful optimisation where the range is known at compile-time, which may be the case in your benchmarks but would require an API extension to expose it properly (e.g. we don't want to use it for shuffling), so it's probably only worth considering if there is a significant advantage (only really seen in i8x16 low reject in your latest benchmarks).

@dhardy
Copy link
Member

dhardy commented Sep 28, 2021

There are some issues with your benchmarks unfortunately:

  • the oneill_distr_uniform are not actually oneill since this uses the sample method which was not updated in switch sample_single int to O'neill's modulo method & update gen_rang… #1154
  • comparison between sample (pre-constructed distribution) vs sample_single_inclusive without noting this
  • some of the benches increment the high bound between each iteration, others don't (judging by your selected bounds, I guess you don't expect any of them to)

I'm still going over stuff and will push a draft PR when done.

dhardy added a commit that referenced this issue Oct 20, 2021
dhardy added a commit that referenced this issue Oct 20, 2021
dhardy added a commit that referenced this issue Feb 24, 2022
dhardy added a commit that referenced this issue Feb 24, 2022
dhardy added a commit that referenced this issue Feb 24, 2022
dhardy added a commit that referenced this issue Feb 24, 2022
dhardy added a commit that referenced this issue Feb 28, 2022
@dhardy dhardy closed this as completed Jul 20, 2024
@dhardy dhardy added the A-new Propose usage of a new algorithm label Jul 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-new Propose usage of a new algorithm
Projects
None yet
Development

No branches or pull requests

3 participants