feature: switch to xxHash for digests #7392
Conversation
Force-pushed from 0405e79 to 964fe82 |
ref: #1282. |
Some preliminary bench results:
|
xxHash is a far faster algorithm than md5. Signed-off-by: Rudi Grinberg <me@rgrinberg.com>
Force-pushed from 964fe82 to bb61fb7 |
This should play nicely with #7372 (although some tweaks still remain) |
@jchavarri let's see if this has an effect on your internal builds |
@Alizter thanks for the benchmark. With this PR, I hope it will be enough to catch up with make in the HoTT build. |
We can't use weak hashing algorithms, so this should obviously not be used by default, but also I'm really not keen on adding a 7KLOC dependency for something that isn't used by default.
What is this building? |
@snowleopard that is the bench you get from […]. Why is a weak hashing algorithm not usable? |
What's md5 if not a weak hashing algorithm? :) Suggestions for other hashing algorithms are welcome of course. But it seems like we're making the wrong trade-off if we're defaulting to a cryptographic hashing algorithm. I don't use any build artifacts that I don't produce myself, and I highly doubt any of our users do. Things might change if we have a distributed build cache, but it remains to be seen what adoption would look like. Or are there other situations where a strong hash would be useful? |
I could remove the variants of the hashing functions that we don't use, but honestly it doesn't seem like it's worth the trouble. I doubt our 11 meg binary would shrink much. |
I see. 5-6% difference sounds close to the noise you'd get from any change, e.g. due to the executable layout changing or just plain build non-determinism, so I'm not convinced by this figure. Did you run the benchmark just once? On your personal machine that was running a bunch of other applications? I'd like to see the results of the monorepo benchmark in a controlled environment.
When collisions are computationally easy, one will hit them whether on purpose or not, and hitting a hash collision in Dune means getting incorrect build results. All levels of Dune caching (workspace-local, shared and distributed caches) assume no hash collisions, which means we need to stick to cryptographically secure hashes (or redesign Dune caching to deal with hash collisions).
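To make the collision point concrete, here is a minimal sketch (illustrative only, not Dune's actual cache code; names are made up) of a content-addressed store keyed purely by digest: if two different action inputs ever collide on the same digest, the second build silently reuses the wrong artifact.

(* Illustrative OCaml sketch: a cache keyed only by digest assumes digests never collide. *)
let cache : (string, string) Hashtbl.t = Hashtbl.create 16

let build_or_reuse ~digest ~(build : unit -> string) : string =
  match Hashtbl.find_opt cache digest with
  | Some artifact -> artifact (* reused blindly; a collision hands back the wrong artifact *)
  | None ->
    let artifact = build () in
    Hashtbl.add cache digest artifact;
    artifact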
Finding MD5 collisions is computationally pretty hard. Not impossible these days, though, so I'd welcome stronger hashes.
I suggest trying BLAKE3, which is available in Cryptokit.
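For illustration, a minimal OCaml sketch (not part of this PR) of what swapping the digest primitive looks like, comparing the stdlib's MD5-based Digest module with a Cryptokit hash. SHA-256 is shown as a stand-in; the exact BLAKE3 constructor name depends on the Cryptokit version installed.

(* Hedged sketch: stdlib Digest (MD5) vs. a Cryptokit hash. SHA-256 is a
   stand-in; substitute the BLAKE3 constructor if the installed Cryptokit
   provides one. *)
let md5_hex (s : string) : string = Digest.to_hex (Digest.string s)

let cryptokit_hex (s : string) : string =
  let raw = Cryptokit.hash_string (Cryptokit.Hash.sha256 ()) s in
  Cryptokit.transform_string (Cryptokit.Hexa.encode ()) raw

let () =
  let data = "contents of some build artifact" in
  Printf.printf "md5:    %s\nsha256: %s\n" (md5_hex data) (cryptokit_hex data)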
I'm not quite sure what you are suggesting. To trade off build correctness for a 5% performance gain? :) |
Also, just as a general point: we should keep MD5 available while we are transitioning from Jenga to Dune internally. We can't just switch to a different hashing scheme: that would mean that we can't reuse the same build artifact cache between Jenga and Dune, thus doubling disk usage, which would be a severe problem for us. So, whatever new hashing algorithm we introduce, we should keep MD5 as a configuration option. |
To add to @snowleopard's point: I agree that xxhash is not suited to what we're doing with dune. |
So I am under the impression that there are two critical points to Dune hashing being slow: the hashing algorithm itself, and the marshalling of the data before it is hashed.
Since Rudi chose an algorithm which is extremely fast, this confirms my suspicion that both together are causing the observable slowdown seen in builds such as HoTT. In fact I ran some detailed benchmarks comparing this with main and make: HoTT/Coq-HoTT#1687 (comment). I'll resummarise here:
|
@Alizter what's the size of the .vo build for HoTT? You can indeed just put all the .vo files in a folder and measure with a simple program how long it takes to hash them. |
@ejgallego The total size of all the vo files is […]. Here is a hyperfine bench of the vo files with md5sum vs xxhsum (I put all vo files in a single dir):
[ali@allosaurus:~/HoTT]$ hyperfine -w 10 'find all_vo/|xargs md5sum |tee all_vo.md5'
Benchmark 1: find all_vo/|xargs md5sum |tee all_vo.md5
  Time (mean ± σ):     130.0 ms ±   1.7 ms    [User: 115.1 ms, System: 19.3 ms]
  Range (min … max):   126.6 ms … 133.0 ms    23 runs
[ali@allosaurus:~/HoTT]$ hyperfine -w 10 'find all_vo/|xargs xxhsum |tee all_vo.md5'
Benchmark 1: find all_vo/|xargs xxhsum |tee all_vo.md5
  Time (mean ± σ):       3.7 ms ±   0.2 ms    [User: 0.9 ms, System: 4.4 ms]
  Range (min … max):     3.1 ms …   4.4 ms    510 runs
I'm no expert, but that looks fast. |
Thanks for getting the numbers; 100 ms is within the expected range (though we should use the same exact hash Dune uses to be sure). If so, that would hardly explain the HoTT slowdown. |
I can see significant gains on our internal builds when using this branch. In particular, the Melange build gets 1.16x faster. An OCaml executable build I tested (for the api server) is 1.1x faster. 🎉 I'll keep an eye on https://ocaml.github.io/dune/dev/bench/ after it's merged. |
@ejgallego that's why I said it's not just the hashing algorithm contributing to the slowdown but also the marshaling. |
What data are we marshalling? |
dune/src/dune_digest/dune_digest.ml Lines 68 to 76 in bb61fb7
|
Ummm, that's the generic hash, sure, but that part should be rarely used in Coq rules, certainly not for .vo files I think. Are we generically hashing some large in-memory structs? If so, which ones? |
Every single dune action is marshalled (to be digested). Every single internal database is written using Marshal.
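For readers following along, here is a minimal sketch (illustrative, not the actual dune_digest code linked above) of the generic-hash pattern being discussed: the value is serialised with Marshal and the resulting bytes are digested, so the marshalling step itself can dominate for large values.

(* Illustrative OCaml sketch of digesting an arbitrary value via Marshal. *)
let generic_digest (v : 'a) : string =
  let serialized = Marshal.to_string v [ Marshal.No_sharing ] in
  Digest.to_hex (Digest.string serialized)

let () =
  (* A made-up stand-in for a dune action: program, flags, dependencies. *)
  let action = ("ocamlc", [ "-c"; "foo.ml" ], [ "foo.ml" ]) in
  print_endline (generic_digest action)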