-
Notifications
You must be signed in to change notification settings - Fork 13.6k
Early exit for empty HashMap (issue #38880) #48035
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
kennytm
merged 8 commits into
rust-lang:master
from
technicalguy:Early-exit-empty-hashmap-38880
Feb 15, 2018
Merged
Changes from 3 commits
Commits
Show all changes
8 commits
Select commit
Hold shift + click to select a range
9bc5986
38880 don't compute hash when searching an empty HashMap
technicalguy dcdd2c4
38880 use search_mut function rather than search_hashed
technicalguy 29f7148
38880 remove redundant extra function
technicalguy fd78621
38880 fixup add missing mut
technicalguy a295ec1
38880 restore original entry(key) method
technicalguy 94c3c84
38880 hashmap check size=0, not just capacity=0
technicalguy f3330ce
38880 fix incorrect negation
technicalguy e034ddd
38880 remove unnecessary self.table.size check
technicalguy File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be checking for size() for the optimization to apply for any empty map? My understanding is that checking for 0 capacity only works for not yet allocated maps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be the explanation for that: #38880 (comment)
As it is written, it's a rearrangement of code that should result in no extra branches.
If I understand correctly,
self.table.capacity() != 0
is a precondition. We can simplify this by using.size()
instead of it (with a comment), and still have equivalent code — but not in the path that we use to find a VacantEntry in an empty map. It would be good to use the.size() != 0
check where possible. Here's it's also important to keep the cross-method logical dependencies clear.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I wanted to do, but as @bluss points out, the problem is that when the map is empty but allocated (as opposed to empty and unallocated) and it doesn't find a key, it is supposed to return
InternalEntry::Vacant
. The crucial problem here is thatInternalEntry::Vacant
requires the hash. This means that you can't get away from computing the hash when theHashMap
is allocated, because otherwise you can't return anInternalEntry::Vacant
.This is also a problem for linear search as suggested in #38880. (I did not fully realise this problem until after writing this pull request.) Since an
InternalEntry::Vacant
requires the hash value, a linear search cannot avoid computing the hash value, so it is not possible to achieve a speedup by avoiding the computation of a hash value. I think this only way to fix this is to change the semantics ofInternalEntry::Vacant
. (Which would require a not-insignificant reworking, but would allow for thesize() == 0
that @arthurprs is considering.)Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
N.B. I did actually try the
size() == 0
akais_empty()
check first, before I realised about theInternalEntry::Vacant
semantics: caaabeThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. HashMaps are hard 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the linear search, for that to yield a net win the backing table would need to support real fast iteration (packed KVs), otherwise iterating the buckets gets more expensive than the hashing. Potentially of interest for https://github.com/bluss/ordermap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some benchmarks for a linear search, and I think it might still be advantageous for some cases even if the map is not densely packed. Here's my reasoning – tell me what you think:
Let m be the allocated size of the hashmap and let n be the number of items it contains. Let S be the cost of accessing the next bucket in the hashmap (regardless of if said bucket is empty), let E be the cost of calculating equality for two elements of a given type T, and let H be the cost of calculating the hash for an element of a given type T. Assume that E and H are constant across all elements of type T (this is not true for all types, e.g. strings).
If mS + nE < H then it is faster to perform a linear search lookup for some key than to compute the hash and perform a direct lookup. The average case of a linear search for a key that exists in the map is half of the worst case.
Now if you have a simple type (e.g. integers) then equality is very cheap and constant time, and computing the hash is (surprisingly) expensive (perhaps because of the DoS resilience?), so a linear search is faster for integer-keyed hashmaps with size <= ~ 20 items.
It gets more complicated for other types, such as strings, because the equality operation doesn't take constant time and/or the hash operation doesn't take constant time. (Although for strings I think it is still the case that the hash operation always takes longer than the equality operation.)
Of course all of this is moot if
InternalEntry::Vacant
requires returning the hash value.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, it depends on a lot of variables, specially an expensive hasher.
You can try something like this, but I'm afraid calculating VacantEntryState on misses is too expensive (edit: or not, as you don't need to compare keys anymore).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So then it would be faster when searching for keys that exist, but slower when searching for keys that don't exist, right? I don't think that optimisation is worth it if it slows down the case when a key is not found.