
feature: switch to xxHash for digests #7392

Closed
wants to merge 1 commit

Conversation

rgrinberg
Member

@rgrinberg rgrinberg commented Mar 23, 2023

xxHash is a far faster algorithm than md5.

Signed-off-by: Rudi Grinberg <me@rgrinberg.com>

@rgrinberg rgrinberg force-pushed the ps/rr/feature__switch_to_xxhash_for_digests branch 3 times, most recently from 0405e79 to 964fe82 on March 23, 2023 19:14
@jchavarri
Collaborator

ref: #1282.

@Alizter
Collaborator

Alizter commented Mar 23, 2023

Some preliminary bench results:

|           | Main               | PR                 | %diff     |
| --------- | ------------------ | ------------------ | --------- |
| Clean     | 108.8421618938446  | 102.72804498672485 | -5.61742% |
| Null      | 0.9267170429229736 | 0.8870739936828613 |           |
|           | 0.910944938659668  | 0.8958990573883057 |           |
|           | 0.9389081001281738 | 0.9134318828582764 |           |
|           | 0.9436089992523193 | 0.901669979095459  |           |
|           | 0.9183239936828613 | 0.9315218925476074 |           |
| Avg. Null | 0.9277006149292    | 0.9059193611145    | -2.34788% |

xxHash is a far faster algorithm than md5.

Signed-off-by: Rudi Grinberg <me@rgrinberg.com>

@rgrinberg rgrinberg force-pushed the ps/rr/feature__switch_to_xxhash_for_digests branch from 964fe82 to bb61fb7 on March 23, 2023 20:06
@rgrinberg
Member Author

This should play nicely with #7372

(although some tweaks still remain)

@rgrinberg
Member Author

@jchavarri let's see if this has an effect on your internal builds

@rgrinberg
Member Author

@Alizter thanks for the benchmark. With this PR, I hope it will be enough to catch up to make in the HoTT build.

@rgrinberg rgrinberg requested a review from snowleopard March 24, 2023 02:41
Collaborator

@snowleopard snowleopard left a comment


We can't use weak hashing algorithms, so this should obviously not be used by default, but also I'm really not keen on adding a 7KLOC dependency for something that isn't used by default.

@snowleopard
Collaborator

> Some preliminary bench results:

What is this building?

@Alizter
Collaborator

Alizter commented Mar 24, 2023

@snowleopard that is the bench you get from make bench. I'm not sure what it is building, but I think it is just the dune executable.

Why is a weak hashing algorithm not usable?

@rgrinberg
Member Author

> We can't use weak hashing algorithms, so this should obviously not be used by default, but also I'm really not keen on adding a 7KLOC dependency for something that isn't used by default.

What's md5 if not a weak hashing algorithm? :)

Suggestions for other hashing algorithms are welcome, of course. But it seems like we're making the wrong trade-off if we default to a cryptographic hashing algorithm. I don't use any build artifacts that I don't produce myself, and I highly doubt any of our users do. Things might change if we have a distributed build cache, but it remains to be seen what adoption would look like. Or are there other situations where a strong hash would be useful?

@rgrinberg
Member Author

> 7KLOC dependency

I could remove the variants of the hashing functions that we don't use, but honestly it doesn't seem like it's worth the trouble. I doubt our 11 MB binary would shrink much.

@snowleopard
Collaborator

snowleopard commented Mar 24, 2023

> @snowleopard that is the bench you get from make bench. I'm not sure what it is building, but I think it is just the dune executable.

I see. 5-6% difference sounds close to the noise you'd get from any change, e.g. due to the executable layout changing or just plain build non-determinism, so I'm not convinced by this figure. Did you run the benchmark just once? On your personal machine that was running a bunch of other applications? I'd like to see the results of the monorepo benchmark in a controlled environment.

> Why is a weak hashing algorithm not usable?

When collisions are computationally easy, one will hit them whether on purpose or not, and hitting a hash collision in Dune means getting incorrect build results. All levels of Dune caching (workspace-local, shared and distributed caches) assume no hash collisions, which means we need to stick to cryptographically secure hashes (or redesign Dune caching to deal with hash collisions).
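
To make that assumption concrete, here is a minimal sketch (hypothetical types, not Dune's actual cache code) of a cache keyed only by digest: any two inputs that ever share a key silently share one artifact.

```ocaml
(* Hypothetical illustration, not Dune's real cache: entries are looked up
   by digest alone, so a collision silently returns the wrong artifact. *)
let cache : (Digest.t, string (* artifact path *)) Hashtbl.t =
  Hashtbl.create 16

let lookup_or_build ~(key : Digest.t) ~(build : unit -> string) : string =
  match Hashtbl.find_opt cache key with
  | Some artifact -> artifact (* colliding inputs would both land here *)
  | None ->
    let artifact = build () in
    Hashtbl.add cache key artifact;
    artifact
```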

> What's md5 if not a weak hashing algorithm? :)

Finding MD5 collisions is computationally pretty hard. Not impossible these days, though, so I'd welcome stronger hashes.

> Suggestions for other hashing algorithms are welcome, of course.

I suggest trying BLAKE3, which is available in Cryptokit.
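
As a sketch only, such a swap could go through Cryptokit's generic hash interface. The Hash.blake3 constructor taking a digest size in bits is an assumption about recent Cryptokit releases, so check the installed version:

```ocaml
(* Hedged sketch: assumes Cryptokit exposes [Hash.blake3] taking the digest
   size in bits; any other [Cryptokit.hash] constructor slots in the same way. *)
let blake3_hex (data : string) : string =
  let h = Cryptokit.Hash.blake3 256 in
  Cryptokit.transform_string (Cryptokit.Hexa.encode ())
    (Cryptokit.hash_string h data)
```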

> But it seems like we're making the wrong trade-off if we default to a cryptographic hashing algorithm

I'm not quite sure what you are suggesting. To trade off build correctness for a 5% performance gain? :)

@snowleopard
Collaborator

Also, just as a general point: we should keep MD5 available while we are transitioning from Jenga to Dune internally. We can't just switch to a different hashing scheme: that would mean that we can't reuse the same build artifact cache between Jenga and Dune, thus doubling disk usage, which would be a severe problem for us. So, whatever new hashing algorithm we introduce, we should keep MD5 as a configuration option.

@emillon
Collaborator

emillon commented Mar 24, 2023

To add to @snowleopard's point: I agree that xxhash is not suited to what we're doing with dune.
I think the threat model that dune uses is not too different from git's use of a hashing function: a collision is a sign that something should be fixed, but it is not an immediate problem. Concretely, you can set up some files and rules that will confuse dune, but there's no way at the moment to use that weakness to inject a different file. You need a preimage attack for that, and even md5 is still relatively immune to that (the cost is > 2**100). This kind of attack is computationally easy with xxhash.
So, yes, md5 is broken and slow, which are two reasons to look for alternatives, but the replacement function needs some properties that xxhash does not have.
(In contrast, it could be a good replacement for Hashtbl.hash, but I'm not sure how important that function is in our profiles.)
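
As a sketch of that last idea: with a functorial hashtable, the hash function is a swappable parameter, so a fast non-cryptographic hash could serve in-memory tables only. The Xxhash.hash name below is a placeholder, not a real binding:

```ocaml
(* Sketch: the hash function is just a module parameter here. [Xxhash.hash]
   is a hypothetical binding name, standing in for whatever is chosen. *)
module String_key = struct
  type t = string
  let equal = String.equal
  let hash s = Hashtbl.hash s (* e.g. replace with: Xxhash.hash s *)
end

module String_tbl = Hashtbl.Make (String_key)
```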

@Alizter
Collaborator

Alizter commented Mar 24, 2023

So I am under the impression that there are two main contributors to Dune hashing being slow:

  1. Marshalling
  2. Hashing Algorithm

Since Rudi chose an algorithm which is extremely fast, this supports my suspicion that the two together are causing the observable slowdown seen in builds such as HoTT.

In fact I ran some detailed benchmarks comparing this with main and make: HoTT/Coq-HoTT#1687 (comment)

I'll resummarise here:

| Builder              | make   | dune   | xxH dune |
| -------------------- | ------ | ------ | -------- |
| Mean (s)             | 65.519 | 68.799 | 66.285   |
| Standard deviation σ | 0.325  | 0.292  | 0.222    |
| Min                  | 65.068 | 68.450 | 66.041   |
| Max                  | 66.031 | 69.175 | 66.728   |

@ejgallego
Collaborator

ejgallego commented Mar 24, 2023

@Alizter what's the size of the .vo build for HoTT?

You can indeed just put all the .vo files in a folder and measure with a simple program how long it takes to hash them.
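
A minimal sketch of such a program, using OCaml's Digest (MD5, the hash Dune currently uses); the program is hypothetical and assumes the directory contains only regular files:

```ocaml
(* Hypothetical standalone program (requires the unix library): MD5-hash
   every file in the directory given as argv(1) and report the wall time. *)
let () =
  let dir = Sys.argv.(1) in
  let files = Sys.readdir dir in
  let t0 = Unix.gettimeofday () in
  Array.iter (fun f -> ignore (Digest.file (Filename.concat dir f))) files;
  Printf.printf "Hashed %d files in %.3f s\n"
    (Array.length files)
    (Unix.gettimeofday () -. t0)
```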

@Alizter
Collaborator

Alizter commented Mar 24, 2023

@ejgallego The total size of all the vo files is 98M according to du.

Here is a hyperfine bench of the vo files with md5sum vs xxhsum (I put all vo files in a single dir):

```console
[ali@allosaurus:~/HoTT]$ hyperfine -w 10 'find all_vo/|xargs md5sum |tee all_vo.md5'
Benchmark 1: find all_vo/|xargs md5sum |tee all_vo.md5
  Time (mean ± σ):     130.0 ms ±   1.7 ms    [User: 115.1 ms, System: 19.3 ms]
  Range (min … max):   126.6 ms … 133.0 ms    23 runs

[ali@allosaurus:~/HoTT]$ hyperfine -w 10 'find all_vo/|xargs xxhsum |tee all_vo.md5'
Benchmark 1: find all_vo/|xargs xxhsum |tee all_vo.md5
  Time (mean ± σ):       3.7 ms ±   0.2 ms    [User: 0.9 ms, System: 4.4 ms]
  Range (min … max):     3.1 ms …   4.4 ms    510 runs
```

I'm no expert, but that looks fast.

@ejgallego
Collaborator

Thanks for getting the numbers; 100 ms is within the expected range (though we should use the exact same hash Dune uses to be sure).

If so, that would hardly explain the HoTT slowdown.

@jchavarri
Collaborator

I can see significant gains on our internal builds when using this branch. In particular, the Melange build gets 1.16x faster. An OCaml executable build I tested (for the api server) is 1.1x faster. 🎉

I'll keep an eye on https://ocaml.github.io/dune/dev/bench/ after it's merged.

@Alizter
Collaborator

Alizter commented Mar 25, 2023

@ejgallego that's why I said it's not just the hashing algorithm contributing to the slowdown but also the marshaling.

@ejgallego
Collaborator

> @ejgallego that's why I said it's not just the hashing algorithm contributing to the slowdown but also the marshaling.

What data are we marshalling?

@Alizter
Collaborator

Alizter commented Mar 27, 2023

@ejgallego

```ocaml
(* We use [No_sharing] to avoid generating different digests for inputs that
   differ only in how they share internal values. Without [No_sharing], if a
   command line contains duplicate flags, such as multiple occurrences of the
   flag [-I], then [Marshal.to_string] will produce different digests
   depending on whether the corresponding strings ["-I"] point to the same
   memory location or to different memory locations. *)
let generic a =
  Metrics.Timer.record "generic_digest" ~f:(fun () ->
      string (Marshal.to_string a [ No_sharing ]))
```

@ejgallego
Collaborator

> @ejgallego

```ocaml
(* We use [No_sharing] to avoid generating different digests for inputs that
   differ only in how they share internal values. Without [No_sharing], if a
   command line contains duplicate flags, such as multiple occurrences of the
   flag [-I], then [Marshal.to_string] will produce different digests
   depending on whether the corresponding strings ["-I"] point to the same
   memory location or to different memory locations. *)
let generic a =
  Metrics.Timer.record "generic_digest" ~f:(fun () ->
      string (Marshal.to_string a [ No_sharing ]))
```

Hmm, that's the generic hash, sure, but that part should rarely be used in Coq rules, and certainly not for .vo files, I think.

Are we generically hashing some large in-memory structs? If so, which ones?

@rgrinberg
Member Author

Every single dune action is marshalled (to be digested). Every single internal database is written using Marshal.
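
In other words, an action's digest is computed over its marshalled bytes, roughly as in this hedged sketch (the action type below is a stand-in for illustration, not Dune's real one):

```ocaml
(* Stand-in action type; Dune's actual representation is much richer. *)
type action =
  | Run of string * string list
  | Copy of string * string

(* Digest the marshalled bytes, mirroring [Digest.generic] quoted above. *)
let digest_of_action (a : action) : Digest.t =
  Digest.string (Marshal.to_string a [ Marshal.No_sharing ])
```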

@rgrinberg rgrinberg closed this Apr 5, 2023