
feature: switch to xxHash for digests #7392

Closed
wants to merge 1 commit

Conversation

rgrinberg
Member

@rgrinberg rgrinberg commented Mar 23, 2023

xxHash is a far faster algorithm than md5.

Signed-off-by: Rudi Grinberg <me@rgrinberg.com>

@rgrinberg rgrinberg force-pushed the ps/rr/feature__switch_to_xxhash_for_digests branch 3 times, most recently from 0405e79 to 964fe82 on March 23, 2023 19:14
@jchavarri
Collaborator

ref: #1282.

@Alizter
Collaborator

Alizter commented Mar 23, 2023

Some preliminary bench results:

|           | Main               | PR                 | %diff     |
| --------- | ------------------ | ------------------ | --------- |
| Clean     | 108.8421618938446  | 102.72804498672485 | -5.61742% |
| Null      | 0.9267170429229736 | 0.8870739936828613 |           |
|           | 0.910944938659668  | 0.8958990573883057 |           |
|           | 0.9389081001281738 | 0.9134318828582764 |           |
|           | 0.9436089992523193 | 0.901669979095459  |           |
|           | 0.9183239936828613 | 0.9315218925476074 |           |
| Avg. Null | 0.9277006149292    | 0.9059193611145    | -2.34788% |

xxHash is a far faster algorithm than md5.

Signed-off-by: Rudi Grinberg <me@rgrinberg.com>

@rgrinberg rgrinberg force-pushed the ps/rr/feature__switch_to_xxhash_for_digests branch from 964fe82 to bb61fb7 on March 23, 2023 20:06
@rgrinberg
Member Author

This should play nicely with #7372

(although some tweaks still remain)

@rgrinberg
Member Author

@jchavarri let's see if this has an effect on your internal builds

@rgrinberg
Member Author

@Alizter thanks for the benchmark. With this PR, I hope it will be enough to catch up to make in the HoTT build.

@rgrinberg rgrinberg requested a review from snowleopard March 24, 2023 02:41
Collaborator

@snowleopard snowleopard left a comment


We can't use weak hashing algorithms, so this should obviously not be used by default, but also I'm really not keen on adding a 7KLOC dependency for something that isn't used by default.

@snowleopard
Collaborator

> Some preliminary bench results:

What is this building?

@Alizter
Collaborator

Alizter commented Mar 24, 2023

@snowleopard that is the bench you get from make bench. I'm not sure what it is building, but I think it is just the dune executable.

Why is a weak hashing algorithm not usable?

@rgrinberg
Member Author

> We can't use weak hashing algorithms, so this should obviously not be used by default, but also I'm really not keen on adding a 7KLOC dependency for something that isn't used by default.

What's md5 if not a weak hashing algorithm? :)

Suggestions for other hashing algorithms are welcome, of course. But it seems like we're making the wrong trade-off if we default to a cryptographic hashing algorithm. I don't use any build artifacts that I don't produce myself, and I highly doubt any of our users do. Things might change if we have a distributed build cache, but it remains to be seen what adoption would look like. Or are there other situations where a strong hash would be useful?

@rgrinberg
Member Author

> 7KLOC dependency

I could remove the variants of the hashing functions that we don't use, but honestly it doesn't seem like it's worth the trouble. I doubt our 11 MB binary would shrink much.

@snowleopard
Collaborator

snowleopard commented Mar 24, 2023

> @snowleopard that is the bench you get from make bench. I'm not sure what it is building, but I think it is just the dune executable.

I see. 5-6% difference sounds close to the noise you'd get from any change, e.g. due to the executable layout changing or just plain build non-determinism, so I'm not convinced by this figure. Did you run the benchmark just once? On your personal machine that was running a bunch of other applications? I'd like to see the results of the monorepo benchmark in a controlled environment.

> Why is a weak hashing algorithm not usable?

When collisions are computationally easy, one will hit them whether on purpose or not, and hitting a hash collision in Dune means getting incorrect build results. All levels of Dune caching (workspace-local, shared and distributed caches) assume no hash collisions, which means we need to stick to cryptographically secure hashes (or redesign Dune caching to deal with hash collisions).
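
To make that assumption concrete, here is a minimal sketch (hypothetical types, not Dune's actual cache code) of a cache keyed only by digest: any two inputs that ever share a key silently share one artifact.

```ocaml
(* Hypothetical illustration, not Dune's real cache: entries are looked up
   by digest alone, so a collision silently returns the wrong artifact. *)
let cache : (Digest.t, string (* artifact path *)) Hashtbl.t =
  Hashtbl.create 16

let lookup_or_build ~(key : Digest.t) ~(build : unit -> string) : string =
  match Hashtbl.find_opt cache key with
  | Some artifact -> artifact (* colliding inputs would both land here *)
  | None ->
    let artifact = build () in
    Hashtbl.add cache key artifact;
    artifact
```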

> What's md5 if not a weak hashing algorithm? :)

Finding MD5 collisions is computationally pretty hard. Not impossible these days, though, so I'd welcome stronger hashes.

> Suggestions for other hashing algorithms are welcome, of course.

I suggest trying BLAKE3, which is available in Cryptokit.
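
As a sketch only, such a swap could go through Cryptokit's generic hash interface. The Hash.blake3 constructor taking a digest size in bits is an assumption about recent Cryptokit releases, so check the installed version:

```ocaml
(* Hedged sketch: assumes Cryptokit exposes [Hash.blake3] taking the digest
   size in bits; any other [Cryptokit.hash] constructor slots in the same way. *)
let blake3_hex (data : string) : string =
  let h = Cryptokit.Hash.blake3 256 in
  Cryptokit.transform_string (Cryptokit.Hexa.encode ())
    (Cryptokit.hash_string h data)
```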

> But it seems like we're making the wrong trade-off if we default to a cryptographic hashing algorithm

I'm not quite sure what you are suggesting. To trade off build correctness for a 5% performance gain? :)

@snowleopard
Collaborator

Also, just as a general point: we should keep MD5 available while we are transitioning from Jenga to Dune internally. We can't just switch to a different hashing scheme: that would mean that we can't reuse the same build artifact cache between Jenga and Dune, thus doubling disk usage, which would be a severe problem for us. So, whatever new hashing algorithm we introduce, we should keep MD5 as a configuration option.

@emillon
Collaborator

emillon commented Mar 24, 2023

To add to @snowleopard's point: I agree that xxhash is not suited to what we're doing with dune.
I think the threat model that dune uses is not too different from git's use of a hashing function: a collision is a sign that something should be fixed, but it is not an immediate problem. Concretely, you can set up some files and rules that will confuse dune, but there's no way at the moment to use that weakness to inject a different file. You need a preimage attack for that, and even md5 is still relatively immune to that (the cost is > 2**100). This kind of attack is computationally easy with xxhash.
So, yes, md5 is broken and slow, which are two reasons to look for alternatives, but the replacement function needs some properties that xxhash does not have.
(In contrast, it could be a good replacement for Hashtbl.hash, but I'm not sure how important that function is in our profiles.)
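
As a sketch of that last idea: with a functorial hashtable, the hash function is a swappable parameter, so a fast non-cryptographic hash could serve in-memory tables only. The Xxhash.hash name below is a placeholder, not a real binding:

```ocaml
(* Sketch: the hash function is just a module parameter here. [Xxhash.hash]
   is a hypothetical binding name, standing in for whatever is chosen. *)
module String_key = struct
  type t = string
  let equal = String.equal
  let hash s = Hashtbl.hash s (* e.g. replace with: Xxhash.hash s *)
end

module String_tbl = Hashtbl.Make (String_key)
```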

@Alizter
Collaborator

Alizter commented Mar 24, 2023

So I am under the impression that there are two main contributors to Dune hashing being slow:

  1. Marshalling
  2. Hashing Algorithm

Since Rudi chose an algorithm which is extremely fast, this supports my suspicion that the two together are causing the observable slowdown seen in builds such as HoTT.

In fact I ran some detailed benchmarks comparing this with main and make: HoTT/Coq-HoTT#1687 (comment)

I'll resummarise here:

| Builder              | make   | dune   | xxH dune |
| -------------------- | ------ | ------ | -------- |
| Mean (s)             | 65.519 | 68.799 | 66.285   |
| Standard deviation σ | 0.325  | 0.292  | 0.222    |
| Min                  | 65.068 | 68.450 | 66.041   |
| Max                  | 66.031 | 69.175 | 66.728   |

@ejgallego
Collaborator

ejgallego commented Mar 24, 2023

@Alizter what's the size of the .vo build for HoTT?

You can indeed just put all the .vo files in a folder and measure with a simple program how long it takes to hash them.
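
A minimal sketch of such a program, using OCaml's Digest (MD5, the hash Dune currently uses); the program is hypothetical and assumes the directory contains only regular files:

```ocaml
(* Hypothetical standalone program (requires the unix library): MD5-hash
   every file in the directory given as argv(1) and report the wall time. *)
let () =
  let dir = Sys.argv.(1) in
  let files = Sys.readdir dir in
  let t0 = Unix.gettimeofday () in
  Array.iter (fun f -> ignore (Digest.file (Filename.concat dir f))) files;
  Printf.printf "Hashed %d files in %.3f s\n"
    (Array.length files)
    (Unix.gettimeofday () -. t0)
```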

@Alizter
Collaborator

Alizter commented Mar 24, 2023

@ejgallego The total size of all the vo files is 98M according to du.

Here is a hyperfine bench of the vo files with md5sum vs xxhsum (I put all vo files in a single dir):

```console
[ali@allosaurus:~/HoTT]$ hyperfine -w 10 'find all_vo/|xargs md5sum |tee all_vo.md5'
Benchmark 1: find all_vo/|xargs md5sum |tee all_vo.md5
  Time (mean ± σ):     130.0 ms ±   1.7 ms    [User: 115.1 ms, System: 19.3 ms]
  Range (min … max):   126.6 ms … 133.0 ms    23 runs

[ali@allosaurus:~/HoTT]$ hyperfine -w 10 'find all_vo/|xargs xxhsum |tee all_vo.md5'
Benchmark 1: find all_vo/|xargs xxhsum |tee all_vo.md5
  Time (mean ± σ):       3.7 ms ±   0.2 ms    [User: 0.9 ms, System: 4.4 ms]
  Range (min … max):     3.1 ms …   4.4 ms    510 runs
```

I'm no expert, but that looks fast.

@ejgallego
Collaborator

Thanks for getting the numbers; 100 ms is within the expected range (though we should use the exact same hash Dune uses to be sure).

If so, that would hardly explain the HoTT slowdown.

@jchavarri
Collaborator

I can see significant gains on our internal builds when using this branch. In particular, the Melange build gets 1.16x faster. An OCaml executable build I tested (for the api server) is 1.1x faster. 🎉

I'll keep an eye on https://ocaml.github.io/dune/dev/bench/ after it's merged.

@Alizter
Collaborator

Alizter commented Mar 25, 2023

@ejgallego that's why I said it's not just the hashing algorithm contributing to the slowdown but also the marshaling.

@ejgallego
Collaborator

> @ejgallego that's why I said it's not just the hashing algorithm contributing to the slowdown but also the marshaling.

What data are we marshalling?

@Alizter
Collaborator

Alizter commented Mar 27, 2023

@ejgallego

```ocaml
(* We use [No_sharing] to avoid generating different digests for inputs that
   differ only in how they share internal values. Without [No_sharing], if a
   command line contains duplicate flags, such as multiple occurrences of the
   flag [-I], then [Marshal.to_string] will produce different digests
   depending on whether the corresponding strings ["-I"] point to the same
   memory location or to different memory locations. *)
let generic a =
  Metrics.Timer.record "generic_digest" ~f:(fun () ->
      string (Marshal.to_string a [ No_sharing ]))
```

@ejgallego
Collaborator

> @ejgallego

```ocaml
(* We use [No_sharing] to avoid generating different digests for inputs that
   differ only in how they share internal values. Without [No_sharing], if a
   command line contains duplicate flags, such as multiple occurrences of the
   flag [-I], then [Marshal.to_string] will produce different digests
   depending on whether the corresponding strings ["-I"] point to the same
   memory location or to different memory locations. *)
let generic a =
  Metrics.Timer.record "generic_digest" ~f:(fun () ->
      string (Marshal.to_string a [ No_sharing ]))
```

Hmm, that's the generic hash, sure, but that part should rarely be used in Coq rules, and certainly not for .vo files, I think.

Are we generically hashing some large in-memory structs? If so, which ones?

@rgrinberg
Member Author

Every single dune action is marshalled (to be digested). Every single internal database is written using Marshal.
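
In other words, an action's digest is computed over its marshalled bytes, roughly as in this hedged sketch (the action type below is a stand-in for illustration, not Dune's real one):

```ocaml
(* Stand-in action type; Dune's actual representation is much richer. *)
type action =
  | Run of string * string list
  | Copy of string * string

(* Digest the marshalled bytes, mirroring [Digest.generic] quoted above. *)
let digest_of_action (a : action) : Digest.t =
  Digest.string (Marshal.to_string a [ Marshal.No_sharing ])
```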

@rgrinberg rgrinberg closed this Apr 5, 2023