Description
I was going to directly PR the performance improvement, but since this is my first time writing a bunch of unsafe code, and because x.py
will not work for me (#61611), I decided to open an issue and point to this repository filled with all the benchmarks I wrote.
I tried over 20 different variations on algorithms before settling on this one.
The only thing I am concerned with is that I am directly using T
values in algorithm 1 instead of indirect copies and swaps, which the microbenchmarks did not like. I am pretty sure that it is panic safe, since there is no T: clone
anywhere.
I benchmarked on both Intel and AMD CPUs, and found that only on the x1031s149_new
benchmark did my algorithm perform significantly worse (probably due to the fact that 1031 and 149 are both primes), and the u8_new
benchmark on my AMD CPU. I do not know what is causing the u8_new
to be much worse on the new algorithm on my AMD CPU, and if the problem persists on newer AMD hardware. Every other benchmark indicates that my algorithm is as fast or faster than the old one (note that because there are so many benchmarks, there may be outliers which you should benchmark again).
Of course, when I do the PR I am not going to include all of those benchmarks, but I am not sure what subset to include.