So, I just put up a sketch of what I've been trying to say abstractly at https://github.com/c-blake/adix, in response to a variety of changes @timotheecour has been making to the stdlib tables, e.g. nim-lang/Nim#13440, nim-lang/Nim#13393, and nim-lang/Nim#13418.
It is pretty drop-in compatible with the stdlib sets & tables, and I tried to keep the internal names similar as well. I just added a few generic procs for the CountTable functionality; sorting does not yet reconstitute the index.
I didn't commit all my test programs, but I have been using the primary linear probing (with Robin Hood re-org) variant for the past week or two as a replacement for the stdlib tables (via that metab.nim module, short for metaTab, which makes compile-time switching easy). I'm not sure I'm happy with a lot of things about this code -- the getKey/setKey, tables written as wrappers around sets, various things in triplicate instead of template bodies, etc.
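To make the switching idea concrete, here is a minimal sketch of such a facade module; the `adixTables` define and the `lptabset` module name are made up for illustration, and adix's real metab.nim differs in detail:

```nim
# metab-style facade (a sketch, not adix's actual metab.nim): one import
# site picks the table implementation for the whole program.
when defined(adixTables):
  import lptabset      # hypothetical linear-probing/Robin Hood module
  export lptabset
else:
  import std/tables
  export tables

# Client code only uses the shared, stdlib-compatible surface:
var t = initTable[string, int]()
t["abc"] = 1
# Compiling with `nim c -d:adixTables app.nim` flips implementations.
```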
Nevertheless, it does show concretely how we could do both unordered and ordered tables with just linear probing, with run-time, per-table optional Robin Hood hashing and also run-time, per-table hash()-rehashing, the latter two modes "automatically activated" on overdeep probes into sparse tables. It also shows another rendition of the current algorithms (cleaner, I believe, via the probeSeq iterator). The weak-hash mitigation check mostly lives in a proc tooFull in the various ??set.nim implementations.
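A rough sketch of those two pieces, assuming a power-of-two capacity and with made-up threshold numbers (adix's real probeSeq and tooFull differ in detail):

```nim
iterator probeSeq(hc, mask: int): int =
  ## Yield successive linear-probe slots for hash code `hc`;
  ## `mask` is capacity - 1 for a power-of-two capacity.
  var i = hc and mask
  while true:
    yield i
    i = (i + 1) and mask     # linear probing: just walk to the next slot

proc tooFull(depth, used, cap: int): bool =
  ## Hypothetical policy: a probe ran overdeep even though the table is
  ## sparse, suggesting a weak hash.  Thresholds here are invented.
  depth > 4 and used * 4 < cap * 3   # deep probe despite < 75% load

# A lookup loop then reads naturally:
#   for i in probeSeq(hash(key), mask):
#     if slots[i].isEmpty: return notFound
#     if slots[i].key == key: return i
```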
From a mathematical-property point of view, I believe the non-tombstone variants let users who know what they are doing suffer no performance penalty in time or space on delete-heavy workloads, while at the same time automatically adapting for users who are blissfully unaware of hash table performance pitfalls - up to a point. { Beyond the point we can adapt to, the user really needs to provide inequality comparison of elements, since there just isn't enough diversity in the output of the hash() they're using (by definition of "up to a point"). As far as I can tell that inequality requirement is ineliminable. } I sort of feel like it's almost a theorem that this is a better approach for tables of unknown workloads (unless you want to force performance-sensitive people to do their own table impls or use this library). Even so, a broader conversation than just comments on pull requests seemed warranted. Hence this RFC issue.
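For concreteness, the classic tombstone-free (backshift) deletion for linear probing looks roughly like this; the slot layout and names are invented for the sketch, not adix's code:

```nim
type
  Slot[K] = object
    hcode: int   # 0 marks an empty slot in this sketch
    key: K

proc backshiftDel[K](data: var seq[Slot[K]], hole: int) =
  ## Tombstone-free delete for linear probing: shift later entries of the
  ## same cluster back over the hole, so lookups never pay for past deletes.
  let mask = data.len - 1              # data.len assumed a power of two
  var i = hole
  var j = hole
  while true:
    j = (j + 1) and mask
    if data[j].hcode == 0:             # end of cluster: clear the hole
      data[i].hcode = 0
      break
    let home = data[j].hcode and mask  # slot where data[j] wants to live
    # data[j] stays reachable iff its home lies cyclically in (i, j];
    # otherwise its probe chain passes through the hole, so move it back.
    let reachable =
      if i <= j: home > i and home <= j
      else: home > i or home <= j
    if not reachable:
      data[i] = data[j]
      i = j
```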
Beyond the above virtues, probe-depth-triggered resize (in any of the variants) has a bonus: if the user does have a great hash function then they get to enjoy near 100% space utilization. For example, with my current hash-the-output-of-hash() multiply-rotate gizmo, the running example of only high bits varying (like what started this whole arc of development) becomes almost a perfect hash table with Robin Hood and rehashing both activated.
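Something in the spirit of that gizmo, with an illustrative constant and rotation count rather than althash's exact ones:

```nim
import std/[hashes, bitops]

proc hashRoMu(x: uint64): Hash =
  ## Multiply by an odd 64-bit constant, then rotate so the well-mixed
  ## high bits also land in the low bits that table index masking uses.
  cast[Hash](rotateLeftBits(x * 0x9E3779B97F4A7C15'u64, 27))

# The failure mode that started this: keys differing only in high bits
# all collide under an identity-style hash(int), but spread out here.
```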
In terms of performance, Robin Hood usually wins big when you have a "miss heavy" workload - e.g., building a pair of sets and then intersecting them where the result is mostly empty. For a more "hit heavy" workload like a histogram where the counts average much greater than 1, vanilla linear probing is usually best. I don't believe there is a single algorithmic solution that is fastest for all workloads, though. Timing things naively in hot loops can also mislead, especially when generalizing to situations where cache prefetching is not so perfectly oiled.
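To make the workload distinction concrete with stdlib containers (the key ranges here are arbitrary):

```nim
import std/[sets, tables]

# Miss-heavy: two mostly-disjoint sets; intersecting them probes mostly
# for absent keys, which is where Robin Hood tends to win.
var a, b: HashSet[int]
for i in 0 ..< 100_000: a.incl i
for i in 90_000 ..< 190_000: b.incl i
echo card(a * b)          # 10_000 shared keys; ~90% of probes are misses

# Hit-heavy: a histogram whose counts average much greater than 1; almost
# every probe after warm-up is a hit, where vanilla linear probing shines.
var h: CountTable[int]
for i in 0 ..< 1_000_000: h.inc(i mod 1_000)
echo h[0]                 # 1000
```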
Personally, I think all the recent churn in table stuff could profitably just be unwound and replaced with advice to use such a better integer hash function, possibly with a run-time warning to guide users, and then possibly a few alternative hashes in the standard library under names other than just `hash`, so users can write `proc hash(x: int): Hash = hash2(x)` to activate them. Then we could add hash3, hash4 (or give them better names). The althash module in adix is sort of like that.
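The activation mechanism is just ordinary Nim symbol shadowing; `hash2` below is a stand-in name for whatever such an alternative would be called:

```nim
import std/hashes

# `hash2` stands in for a hypothetical alternative stdlib integer hash;
# any stronger mixer would do for illustration.
proc hash2(x: int): Hash =
  cast[Hash](cast[uint64](x) * 0x9E3779B97F4A7C15'u64)

# A user-defined `hash` shadows the one imported from std/hashes, so any
# Table/HashSet of int instantiated in this module uses the new mixer.
proc hash(x: int): Hash = hash2(x)
```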
Anyway, there are also NOTES.md and TODO.md files, as well as doc comments at the tops of modules.