Description
For incremental compilation we hash lots of things: HIR, crate metadata, and soon also type checking tables. These hashes are used purely as fingerprints; that is, we compare them in order to find out whether the hashed values are equal. Consequently, the primary qualities we require from the hash algorithm are:
1. It acts as a pseudo-random function, i.e. whatever we feed in, the resulting hashes should have a collision probability roughly the same as that of random numbers.
2. It produces hash values wide enough that the collision probabilities are low (128 bits should be fine).
3. It is fast.
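To illustrate the fingerprint use case, here is a minimal sketch. The `Fingerprint` type and the prefix-byte trick are hypothetical, just a way to get two roughly independent 64 bit halves out of std's `DefaultHasher`, which stands in for whichever 128 bit function we end up picking:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical 128-bit fingerprint built from two 64-bit hashes.
/// The only operation we ever need on it is equality comparison.
#[derive(PartialEq, Eq, Debug, Clone, Copy)]
struct Fingerprint(u64, u64);

fn fingerprint<T: Hash>(value: &T) -> Fingerprint {
    // First 64 bits: hash the value directly.
    let mut h1 = DefaultHasher::new();
    value.hash(&mut h1);
    // Second 64 bits: domain-separate with a prefix byte so the two
    // halves are (approximately) independent.
    let mut h2 = DefaultHasher::new();
    1u8.hash(&mut h2);
    value.hash(&mut h2);
    Fingerprint(h1.finish(), h2.finish())
}

fn main() {
    // Equal inputs yield equal fingerprints; unequal inputs should
    // collide only with negligible probability.
    assert_eq!(fingerprint(&"fn foo() {}"), fingerprint(&"fn foo() {}"));
    assert_ne!(fingerprint(&"fn foo() {}"), fingerprint(&"fn bar() {}"));
    println!("fingerprints compare as expected");
}
```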
It remains to be clarified whether we need a cryptographic hash function, i.e. one for which it is hard to find or construct collisions. My guess is no:
- Incremental compilation caches are not meant to be distributed. They are platform-dependent files created locally by your compiler, and there is little use in distributing them.
- A 128 bit hash is probably susceptible to a brute-force collision search anyway (a birthday attack needs only about 2^64 work).
- I cannot think of an attack that exploits a hash collision here. If one wants to tamper with an incremental compilation cache, one could just replace the object file in the cache without ever needing to touch the item hashes.
I'd be very interested, though, if someone could come up with an attack scenario. But even then, the solution would probably be to implement some general code-signing scheme for these caches.
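For reference, the *accidental* (non-adversarial) collision probability of a 128 bit hash can be estimated with the birthday bound, p ≈ n²/2^(b+1) for n hashed items and b bits. A quick back-of-the-envelope calculation:

```rust
fn main() {
    // Birthday bound: with n items and a b-bit hash, the probability of
    // any collision among them is roughly n^2 / 2^(b+1).
    let bits = 128.0_f64;
    let n = 1.0e9_f64; // a billion hashed items, far more than any build produces
    let p = n * n / 2f64.powf(bits + 1.0);
    println!("accidental collision probability ~ {:e}", p);
    // Even at a billion items this is astronomically small.
    assert!(p < 1e-18);
}
```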
At the moment we are using BLAKE2 as our hashing algorithm. It is a cryptographic hash function and thus fulfills requirements (1) and (2) very well, but compared to non-cryptographic hash functions it is rather slow. In terms of raw throughput, SipHash would be about three times as fast, and fast hash functions like MetroHash can be 20 times as fast. Before you get too excited though: most of the time spent computing hashes goes into gathering the data that is fed into the hash function, not into actually running the function on it. Still, there are some gains to be had. Here are some preliminary timings using BLAKE2 -- our "worst case" -- and MetroHash -- which is among the fastest hash functions providing 128 bit hashes:
| | BLAKE2 | MetroHash | SipHash 1-3 | SipHash 2-4 |
|---|---|---|---|---|
| libcore | 0.258s | 0.193s (-25%) | 0.189s (-26%) | 0.201s (-22%) |
| librustc | 0.435s | 0.345s (-20%) | 0.320s (-26%) | 0.327s (-25%) |
| syntex_syntax (HIR) | 0.221s | 0.166s (-25%) | 0.168s (-24%) | 0.176s (-20%) |
| syntex_syntax (Metadata) | 0.456s | 0.326s (-28%) | 0.312s (-31%) | 0.347s (-24%) |
So it seems that a fast hash function allows saving 20-30% of the time spent on ICH computation. If we could gain some confidence that:
- a function like MetroHash provides a low enough collision probability, and
- we really don't need a cryptographic hash function,

then we could get some free speedups here.
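For anyone wanting to reproduce rough throughput numbers, a minimal timing harness might look like the sketch below. It uses std's `DefaultHasher` (SipHash-based, exact variant unspecified) purely as a stand-in; a real comparison would swap BLAKE2 and MetroHash implementations in behind the same interface:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::time::{Duration, Instant};

// Time a single pass of a hash function over `data` and return the
// digest together with the elapsed wall-clock time.
fn time_hash(data: &[u8]) -> (u64, Duration) {
    let start = Instant::now();
    let mut h = DefaultHasher::new();
    h.write(data);
    (h.finish(), start.elapsed())
}

fn main() {
    let data = vec![0xABu8; 16 * 1024 * 1024]; // 16 MiB of dummy input
    let (digest, elapsed) = time_hash(&data);
    println!(
        "hashed {} bytes in {:?} (digest {:016x})",
        data.len(),
        elapsed,
        digest
    );
}
```

Note that such a micro-benchmark only measures raw throughput; as mentioned above, most of the real cost is in gathering the data, so end-to-end numbers like those in the table are more meaningful.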
UPDATE: I added the 128 bit version of SipHash to the table; its timings are as good as those for MetroHash.