
Decrease HashMap Load Factor #38003

Closed
@pssalmeida

Description


Currently HashMap uses a load factor of approximately 0.9. While this is appropriate for reads (get), since Robin Hood hashing keeps the probe distance from the starting bucket low, with very low, effectively fixed variance, the same is not true for writes (insert).

When writing over an occupied bucket, the insert incurs a displacement involving reads and writes over several buckets, either for the new entry or for entries already in the map. Not only is this number larger than for reads, it also has increasing variance, quickly becoming very large as the load approaches 1.
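To make that displacement concrete, below is a minimal sketch of Robin Hood insertion into an open-addressing table. This is an illustration only, not the standard library's implementation: the `RobinHoodMap` and `Slot` names, the toy multiplicative hash, the linear probing step, and the fixed table size are all assumptions made here.

```rust
// Minimal sketch of Robin Hood insertion into an open-addressing table.
// Illustration only: std's HashMap uses SipHash, a power-of-two capacity,
// and stores hashes; everything named here is assumed for the example.
#[derive(Clone)]
struct Slot {
    key: u64,
    value: u64,
    dist: usize, // probe distance from the key's ideal bucket
}

struct RobinHoodMap {
    slots: Vec<Option<Slot>>,
}

impl RobinHoodMap {
    fn with_capacity(cap: usize) -> Self {
        RobinHoodMap { slots: vec![None; cap] }
    }

    fn ideal_bucket(&self, key: u64) -> usize {
        // Toy multiplicative hash, reduced modulo the table size.
        (key.wrapping_mul(0x9E37_79B9_7F4A_7C15) % self.slots.len() as u64) as usize
    }

    /// Inserts `key`/`value` and returns the number of buckets written,
    /// i.e. the per-insert cost discussed above. Assumes the table never
    /// becomes completely full.
    fn insert(&mut self, key: u64, value: u64) -> usize {
        let len = self.slots.len();
        let mut idx = self.ideal_bucket(key);
        let mut entry = Slot { key, value, dist: 0 };
        let mut writes = 0;

        loop {
            // Empty bucket: place whatever entry we are currently carrying.
            if self.slots[idx].is_none() {
                self.slots[idx] = Some(entry);
                return writes + 1;
            }
            let occupant = self.slots[idx].as_mut().unwrap();
            if occupant.key == entry.key {
                // Same key: overwrite the value in place.
                occupant.value = entry.value;
                return writes + 1;
            }
            // Robin Hood rule: if the occupant is closer to its ideal bucket
            // than the entry we are carrying, evict it and keep probing with
            // the evicted entry. Every eviction is an extra bucket written.
            if occupant.dist < entry.dist {
                std::mem::swap(occupant, &mut entry);
                writes += 1;
            }
            idx = (idx + 1) % len;
            entry.dist += 1;
        }
    }
}

fn main() {
    // Fill a 16-slot table towards ~0.9 load and watch per-insert writes grow.
    let mut map = RobinHoodMap::with_capacity(16);
    for k in 0..14 {
        let writes = map.insert(k, k * 10);
        println!("insert {k}: {writes} bucket(s) written");
    }
}
```

Every swap in the loop is one more bucket written, which is why the write cost grows so quickly with the load factor.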

For a load factor a: (UPDATED, previous lookup values were for random probing)

  • the average number of probes in a read is (1 + 1/(1-a))/2 (previously listed as (1/a) * ln(1/(1-a)), which holds for random probing)
  • the average number of written buckets in a write is 1/(1-a)
  • the probability of writing no more than k buckets in a write is 1 - a^k

The table below shows, for different load factors, the average number of buckets written per insert (Avg W), the 95th percentile of buckets written per insert (95% W), and the average number of buckets probed in a get (Avg R):

a       Avg W   95% W   Avg R*   Avg R
0.500       2     4.3     1.39     1.5
0.666       3     7.4     1.65     2.0
0.750       4    10.4     1.84     2.5
0.800       5    13.4     2.01     3.0
0.833       6    16.4     2.15     3.5
0.857       7    19.4     2.27     4.0
0.875       8    22.4     2.38     4.5
0.888       9    25.4     2.47     5.0
0.900      10    28.4     2.56     5.5
0.950      20    58.4     3.15    10.5
0.990     100   298.1     4.65    50.5

(* superseded values, computed for random probing)
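These columns can be reproduced directly from the formulas above; the sketch below (my own illustration, not part of the original report) prints Avg W = 1/(1-a), the 95th-percentile write count obtained by solving 1 - a^k = 0.95 for k, and Avg R = (1 + 1/(1-a))/2:

```rust
// Reproduce the table above from the three formulas.
fn main() {
    let loads: [f64; 11] = [0.5, 0.666, 0.75, 0.8, 0.833, 0.857, 0.875, 0.888, 0.9, 0.95, 0.99];
    println!("{:>6} {:>7} {:>7} {:>7}", "a", "Avg W", "95% W", "Avg R");
    for a in loads {
        let avg_w = 1.0 / (1.0 - a);              // average buckets written per insert
        let p95_w = (0.05_f64).ln() / a.ln();     // k such that 1 - a^k = 0.95
        let avg_r = (1.0 + 1.0 / (1.0 - a)) / 2.0; // average buckets probed per get
        println!("{a:>6.3} {avg_w:>7.1} {p95_w:>7.1} {avg_r:>7.2}");
    }
}
```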

It can be seen that for a = 0.9 the average number of probes when reading is just 5.5 (the earlier figure of 2.56 was computed for random probing), but the average number of buckets written when inserting is 10, and the 95th percentile is 28 buckets. Deletes will not fare much better when doing the backwards shift, as most keys end up displaced from their original bucket (the very low variance behaviour of Robin Hood hashing).

While in many cases the average over the map's lifetime may not be much affected, if a map is operated near capacity, with elements constantly being inserted and others removed, e.g. when implementing an LRU cache, the number of buckets (re)written per insert (and also per delete) will be manifestly higher than desired.

While the exact choice of load factor can be argued about, 0.9 seems manifestly too high. Going from 0.9 to 0.8 costs about 12.5% more memory (a table holding n elements needs n/0.8 instead of n/0.9 buckets, a factor of 0.9/0.8 = 1.125), while going from 0.8 to 0.9 doubles the average number of buckets written when inserting and more than doubles the 95th percentile. A load factor around 0.8 therefore seems more reasonable.


    Labels

    C-feature-request (Category: A feature request, i.e. not implemented / a PR)
    I-needs-decision (Issue: In need of a decision)
    T-libs-api (Relevant to the library API team, which will review and decide on the PR/issue)
