Add/Rework benchmarks to track initialization cost #272
Conversation
This PR adds more benchmarks so we can get an accurate idea about two things:

- What is the cost of having to zero the buffer before calling `getrandom`?
- What is the performance on aligned, 32-byte buffers?
  - This is by far the most common use, as it is used to seed userspace CSPRNGs.

I ran the benchmarks on my system:

- CPU: AMD Ryzen 7 5700G
- OS: Linux 5.15.52-1-lts
- Rust Version: 1.62.0-nightly (ea92b0838 2022-05-07)

I got the following results:

```
test bench_large      ... bench:   3,759,323 ns/iter (+/- 177,100) = 557 MB/s
test bench_large_init ... bench:   3,821,229 ns/iter (+/- 39,132)  = 548 MB/s
test bench_page       ... bench:       7,281 ns/iter (+/- 59)      = 562 MB/s
test bench_page_init  ... bench:       7,290 ns/iter (+/- 69)      = 561 MB/s
test bench_seed       ... bench:         206 ns/iter (+/- 3)       = 155 MB/s
test bench_seed_init  ... bench:         206 ns/iter (+/- 1)       = 155 MB/s
```

These results were very consistent across multiple runs, and roughly behave as we would expect:

- The throughput is highest with a buffer large enough to amortize the syscall overhead, but small enough to stay in the L1D cache.
- There is a _very_ small cost to zeroing the buffer beforehand.
  - This cost is imperceptible in the common 32-byte use case, where the syscall overhead dominates.
  - The cost is slightly higher (1%) with multi-megabyte buffers, as the data gets evicted from the L1 cache between the `memset` and the call to `getrandom`.

I would love to see results for other platforms. Could we get someone to run this on an M1 Mac?

Signed-off-by: Joe Richey <joerichey@google.com>
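For readers who want to reproduce or extend this on another platform, here is a minimal sketch of how a pair of benchmarks like `bench_seed` / `bench_seed_init` could be structured with the nightly `test` harness, assuming the crate's `getrandom::getrandom` entry point. The helper names (`bench_fill`, `bench_fill_init`) and exact structure are illustrative and may not match the PR's actual code:

```rust
// benches/mod.rs (sketch) -- requires a nightly toolchain for the `test` harness.
#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

// Timed body only calls `getrandom` on an existing (already zeroed) buffer.
fn bench_fill<const N: usize>(b: &mut Bencher) {
    let mut buf = [0u8; N];
    b.iter(|| {
        getrandom::getrandom(&mut buf).unwrap();
        black_box(&buf);
    });
    b.bytes = N as u64;
}

// Timed body zero-initializes the buffer and then calls `getrandom`,
// so the reported time also includes the memset.
fn bench_fill_init<const N: usize>(b: &mut Bencher) {
    b.iter(|| {
        let mut buf = [0u8; N];
        getrandom::getrandom(&mut buf).unwrap();
        black_box(&buf);
    });
    b.bytes = N as u64;
}

// 32 bytes: the common "seed a userspace CSPRNG" size.
#[bench]
fn bench_seed(b: &mut Bencher) {
    bench_fill::<32>(b);
}

#[bench]
fn bench_seed_init(b: &mut Bencher) {
    bench_fill_init::<32>(b);
}
```

The `_init` variant differs only in moving the zero-initialization inside the timed closure, which is what makes the `memset` cost show up in the numbers above.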
I also locally patched the crate to use the RDRAND implementation. Again, these results were quite stable over multiple runs, showing a small improvement from not having to initialize the buffer. For this and the above x86_64 Linux benchmark, I used `RUSTFLAGS="-C opt-level=3 -C codegen-units=1 -C embed-bitcode=yes -C lto=fat -C target-cpu=native"`.
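For context, here is what the measured initialization cost looks like from the caller's side. This is a purely illustrative sketch assuming the crate's `getrandom::getrandom(&mut [u8])` API; the function name `fresh_seed` is made up for the example:

```rust
// Caller-side view of the cost being measured.
fn fresh_seed() -> Result<[u8; 32], getrandom::Error> {
    // The API takes `&mut [u8]`, so the buffer must be initialized first;
    // this zeroing is the only extra work the `_init` benchmarks add.
    let mut seed = [0u8; 32];
    // The call itself; at 32 bytes its overhead dominates, which is why
    // bench_seed and bench_seed_init report essentially identical times.
    getrandom::getrandom(&mut seed)?;
    Ok(seed)
}
```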
@newpavlov anything blocking merging in these benchmarks? If we merge them in, it will be easier for people to run them on different platforms. This will, in turn, make it easier to figure out if #226 and #271 are worth it.
On another system:
Linux implementation (default):
RDRAND implementation (patched):
Again, the difference is detectable, but very, very small.
On a
Linux implementation: