Description
Currently, HashMap uses a load factor of approximately 0.9. While this is appropriate for reads (get), since Robin Hood hashing keeps the distance from the ideal bucket low with a very low, effectively fixed variance, the same is not true for writes (insert).
When writing into an occupied bucket, the insert incurs a displacement involving reads and writes over several buckets, either for the new value or for values already in the map. Not only is this number larger than for reads, it also has an increasing variance, quickly becoming very large as the load approaches 1.
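To make the write amplification concrete, here is a minimal sketch, not std's implementation, of Robin Hood insertion with linear probing over a toy table, instrumented to count how many buckets a single insert writes. The `Vec<Option<(u64, u64)>>` layout, the `key % capacity` "hash", and the `robin_hood_insert` name are illustrative assumptions.

```rust
// Minimal sketch, NOT std's implementation: Robin Hood insertion with linear
// probing over a toy table (Vec<Option<(key, value)>>, ideal bucket = key % cap),
// instrumented to count how many buckets a single insert writes.
fn robin_hood_insert(table: &mut Vec<Option<(u64, u64)>>, key: u64, value: u64) -> usize {
    let cap = table.len();
    let mut idx = (key as usize) % cap; // toy "hash": the ideal (home) bucket
    let mut probe = 0usize;             // displacement of the entry being carried
    let mut entry = (key, value);
    let mut writes = 0usize;

    loop {
        let slot = table[idx]; // buckets are Copy, so this is just a read
        match slot {
            None => {
                table[idx] = Some(entry);
                return writes + 1; // final write into the empty bucket
            }
            Some(existing) => {
                // Displacement of the entry already sitting in this bucket.
                let existing_probe = (idx + cap - (existing.0 as usize) % cap) % cap;
                if existing_probe < probe {
                    // Robin Hood: evict the "richer" entry, rewriting this bucket,
                    // and carry the evicted entry forward instead.
                    table[idx] = Some(entry);
                    entry = existing;
                    probe = existing_probe;
                    writes += 1;
                }
            }
        }
        idx = (idx + 1) % cap;
        probe += 1;
    }
}

fn main() {
    // Fill a table to 89% load, then measure how many buckets each insert
    // writes while the load factor climbs from 0.89 to 0.90.
    let cap = 1 << 16;
    let mut table: Vec<Option<(u64, u64)>> = vec![None; cap];
    let warm = (cap as f64 * 0.89) as u64;
    let target = (cap as f64 * 0.90) as u64;
    for i in 0..warm {
        let key = i.wrapping_mul(0x9E37_79B9_7F4A_7C15); // scatter the toy keys
        robin_hood_insert(&mut table, key, i);
    }
    let mut writes = 0;
    for i in warm..target {
        let key = i.wrapping_mul(0x9E37_79B9_7F4A_7C15);
        writes += robin_hood_insert(&mut table, key, i);
    }
    println!(
        "avg buckets written per insert near load 0.9: {:.1}",
        writes as f64 / (target - warm) as f64
    );
}
```

The capacity and toy hash are arbitrary; the point is only that every steal in the loop is an extra bucket write, and that is what grows quickly as the load approaches 1.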
For a load factor a: (UPDATED, previous lookup values were for random probing)
- the average number of probes in a read is ~~(1/a)ln(1/(1-a))~~ (1 + 1/(1-a))/2
- the average number of written buckets in a write is 1/(1-a)
- the probability of writing no more than k buckets in a write is 1 - a^k
The table below shows, for different load factors, the average number of buckets written per insert (Avg W), the 95th percentile of buckets written per insert (95% W), and the average number of buckets probed in a get (Avg R):
a | Avg W | 95% W | Avg R
---|---|---|---
0.500 | 2 | 4.3 | 1.5
0.666 | 3 | 7.4 | 2.0
0.750 | 4 | 10.4 | 2.5
0.800 | 5 | 13.4 | 3.0
0.833 | 6 | 16.4 | 3.5
0.857 | 7 | 19.4 | 4.0
0.875 | 8 | 22.4 | 4.5
0.888 | 9 | 25.4 | 5.0
0.900 | 10 | 28.4 | 5.5
0.950 | 20 | 58.4 | 10.5
0.990 | 100 | 298.1 | 50.5
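The table entries follow directly from the three formulas above; a short illustrative sketch that recomputes them (rounding will differ slightly from the table, which uses exact fractions such as 8/9 for 0.888):

```rust
// Evaluates the three formulas above for the load factors in the table.
fn main() {
    let loads = [0.500, 0.666, 0.750, 0.800, 0.833, 0.857, 0.875, 0.888, 0.900, 0.950, 0.990];
    for &a in loads.iter() {
        let avg_w = 1.0 / (1.0 - a);               // average buckets written per insert
        let w95 = (0.05f64).ln() / a.ln();         // k such that 1 - a^k = 0.95 (95th percentile)
        let avg_r = (1.0 + 1.0 / (1.0 - a)) / 2.0; // average probes per read
        println!("{:.3} | {:.1} | {:.1} | {:.1}", a, avg_w, w95, avg_r);
    }
}
```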
It can be seen that for a = 0.9 the average number of probes when reading is just ~~2.56~~ 5.5, but the average number of buckets written when inserting is 10 and the 95th percentile is 28 buckets. Deletes will not be much better when doing the backwards shift, as most keys will be displaced from their origin (the very low variance behaviour of RH).
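For the delete side, here is a companion sketch of the backwards shift (again illustrative, not std's code), reusing the toy table layout and `key % capacity` home bucket from the insert sketch above and counting the buckets rewritten by a single removal:

```rust
// Minimal sketch, NOT std's code: backward-shift deletion over the same toy
// layout as the insert sketch, counting the buckets rewritten by one removal.
fn backward_shift_remove(table: &mut Vec<Option<(u64, u64)>>, key: u64) -> usize {
    let cap = table.len();
    let mut idx = (key as usize) % cap;

    // Linear probe until the key is found (give up on an empty bucket).
    loop {
        let slot = table[idx];
        match slot {
            None => return 0, // key not present
            Some((k, _)) if k == key => break,
            Some(_) => idx = (idx + 1) % cap,
        }
    }

    // Shift the following entries back by one slot until an empty bucket or an
    // entry already in its ideal bucket is reached; each shift rewrites a bucket.
    let mut writes = 1; // at minimum, one bucket gets cleared
    loop {
        let next = (idx + 1) % cap;
        let slot = table[next];
        match slot {
            Some((k, v)) if (k as usize) % cap != next => {
                table[idx] = Some((k, v)); // move the displaced entry back
                idx = next;
                writes += 1;
            }
            _ => {
                table[idx] = None; // the hole ends here
                return writes;
            }
        }
    }
}

fn main() {
    // An 8-slot table with a small cluster: ideal buckets are 8 % 8 = 0,
    // 16 % 8 = 0 and 1 % 8 = 1, so 16 and 1 are each displaced by one slot.
    let mut table: Vec<Option<(u64, u64)>> = vec![None; 8];
    table[0] = Some((8, 0));
    table[1] = Some((16, 0));
    table[2] = Some((1, 0));
    // Removing key 8 shifts 16 and 1 back and clears bucket 2: 3 buckets written.
    println!("buckets written by remove: {}", backward_shift_remove(&mut table, 8));
}
```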
While in many cases the average over the map's lifetime may not be much affected, if a map is operated near capacity, with elements constantly being inserted and others removed, e.g., when implementing an LRU cache, the number of buckets (re)written per insert (and also per delete) will be markedly higher than desired.
While the exact load factor choice can be argued about, 0.9 seems manifestly too high. Going from 0.9 to 0.8 adds about 12.5% memory overhead (1/0.8 is 1.125 times 1/0.9 buckets for the same number of elements), while going from 0.8 to 0.9 doubles the average number of buckets written when inserting and more than doubles the 95th percentile. Therefore, a load factor around 0.8 seems more reasonable.