So, I just put up a sketch of what I've been trying to say abstractly at https://github.com/c-blake/adix, in response to a variety of changes @timotheecour has been making to the stdlib tables, e.g. nim-lang/Nim#13440, nim-lang/Nim#13393, and nim-lang/Nim#13418.
It is pretty drop-in compatible with the stdlib sets & tables, and I tried to keep the internal names similar as well. I just added a few generic procs for the CountTable functionality; sorting does not yet reconstitute the index.
I didn't commit all my test programs, but I have been using the primary linear probing (with Robin Hood re-org) variant for the past week or two as a replacement for the stdlib tables (via that metab.nim module, short for metaTab, which makes compile-time switching easy). I'm not sure I'm happy with a lot of things about this code -- the getKey/setKey, tables written as wrappers around sets, various things in triplicate instead of template bodies, etc.
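To make the switching idea concrete, here is a minimal sketch of such a facade module; the `adixTables` define and the `lptabset` module name are made up for illustration, and adix's real metab.nim differs in detail:

```nim
# metab-style facade (a sketch, not adix's actual metab.nim): one import
# site picks the table implementation for the whole program.
when defined(adixTables):
  import lptabset      # hypothetical linear-probing/Robin Hood module
  export lptabset
else:
  import std/tables
  export tables

# Client code only uses the shared, stdlib-compatible surface:
var t = initTable[string, int]()
t["abc"] = 1
# Compiling with `nim c -d:adixTables app.nim` flips implementations.
```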
Nevertheless, it does show concretely how we could do both unordered and ordered tables with just linear probing, with run-time, per-table optional Robin Hood hashing and also run-time, per-table hash()-rehashing, the latter two modes "automatically activated" on overdeep probes into sparse tables. It also shows another rendition of the current algorithms (cleaner, I believe, via the probeSeq iterator). The weak-hash mitigation check mostly lives in a proc tooFull in the various ??set.nim implementations.
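A rough sketch of those two pieces, assuming a power-of-two capacity and with made-up threshold numbers (adix's real probeSeq and tooFull differ in detail):

```nim
iterator probeSeq(hc, mask: int): int =
  ## Yield successive linear-probe slots for hash code `hc`;
  ## `mask` is capacity - 1 for a power-of-two capacity.
  var i = hc and mask
  while true:
    yield i
    i = (i + 1) and mask     # linear probing: just walk to the next slot

proc tooFull(depth, used, cap: int): bool =
  ## Hypothetical policy: a probe ran overdeep even though the table is
  ## sparse, suggesting a weak hash.  Thresholds here are invented.
  depth > 4 and used * 4 < cap * 3   # deep probe despite < 75% load

# A lookup loop then reads naturally:
#   for i in probeSeq(hash(key), mask):
#     if slots[i].isEmpty: return notFound
#     if slots[i].key == key: return i
```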
From a mathematical-property point of view, I believe the non-tombstone variants let users who know what they are doing suffer no performance penalty in time or space on delete-heavy workloads, while at the same time automatically adapting for users who are blissfully unaware of hash table performance pitfalls - up to a point. { Beyond the point we can adapt to, the user really needs to provide inequality comparison of elements, since there just isn't enough diversity in the output of the hash() they're using (by definition of "up to a point"). As far as I can tell that inequality requirement is ineliminable. } I sort of feel like it's almost a theorem that this is a better approach for tables of unknown workloads (unless you want to force performance-sensitive people to do their own table impls or use this library). Even so, a broader conversation than just comments on pull requests seemed warranted. Hence this RFC issue.
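For concreteness, the classic tombstone-free (backshift) deletion for linear probing looks roughly like this; the slot layout and names are invented for the sketch, not adix's code:

```nim
type
  Slot[K] = object
    hcode: int   # 0 marks an empty slot in this sketch
    key: K

proc backshiftDel[K](data: var seq[Slot[K]], hole: int) =
  ## Tombstone-free delete for linear probing: shift later entries of the
  ## same cluster back over the hole, so lookups never pay for past deletes.
  let mask = data.len - 1              # data.len assumed a power of two
  var i = hole
  var j = hole
  while true:
    j = (j + 1) and mask
    if data[j].hcode == 0:             # end of cluster: clear the hole
      data[i].hcode = 0
      break
    let home = data[j].hcode and mask  # slot where data[j] wants to live
    # data[j] stays reachable iff its home lies cyclically in (i, j];
    # otherwise its probe chain passes through the hole, so move it back.
    let reachable =
      if i <= j: home > i and home <= j
      else: home > i or home <= j
    if not reachable:
      data[i] = data[j]
      i = j
```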
Beyond the above virtues, probe-depth-triggered resize (in any of the variants) has a bonus: if the user does have a great hash function then they get to enjoy near 100% space utilization. For example, with my current hash-the-output-of-hash() multiply-rotate gizmo, the running example of only high bits varying (like what started this whole arc of development) becomes almost a perfect hash table with Robin Hood and rehashing both activated.
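Something in the spirit of that gizmo, with an illustrative constant and rotation count rather than althash's exact ones:

```nim
import std/[hashes, bitops]

proc hashRoMu(x: uint64): Hash =
  ## Multiply by an odd 64-bit constant, then rotate so the well-mixed
  ## high bits also land in the low bits that table index masking uses.
  cast[Hash](rotateLeftBits(x * 0x9E3779B97F4A7C15'u64, 27))

# The failure mode that started this: keys differing only in high bits
# all collide under an identity-style hash(int), but spread out here.
```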
In terms of performance, Robin Hood usually wins big when you have a "miss heavy" workload - e.g., building a pair of sets and then intersecting them where the result is mostly empty. For a more "hit heavy" workload like a histogram where the counts average much greater than 1, vanilla linear probing is usually best. I don't believe there is a single algorithmic solution that is fastest for all workloads, though. Timing things naively in hot loops can also mislead, especially when generalizing to situations where cache prefetching is not so perfectly oiled.
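To make the workload distinction concrete with stdlib containers (the key ranges here are arbitrary):

```nim
import std/[sets, tables]

# Miss-heavy: two mostly-disjoint sets; intersecting them probes mostly
# for absent keys, which is where Robin Hood tends to win.
var a, b: HashSet[int]
for i in 0 ..< 100_000: a.incl i
for i in 90_000 ..< 190_000: b.incl i
echo card(a * b)          # 10_000 shared keys; ~90% of probes are misses

# Hit-heavy: a histogram whose counts average much greater than 1; almost
# every probe after warm-up is a hit, where vanilla linear probing shines.
var h: CountTable[int]
for i in 0 ..< 1_000_000: h.inc(i mod 1_000)
echo h[0]                 # 1000
```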
Personally, I think all the recent churn in table stuff could profitably just be unwound and replaced with advice to use such a better integer hash function, possibly with a run-time warning to guide users, and then possibly a few alternative hashes in the standard library under names other than just `hash`, so users can write `proc hash(x: int): Hash = hash2(x)` to activate them. Then we could add hash3, hash4 (or give them better names). The althash module in adix is sort of like that.
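The activation mechanism is just ordinary Nim symbol shadowing; `hash2` below is a stand-in name for whatever such an alternative would be called:

```nim
import std/hashes

# `hash2` stands in for a hypothetical alternative stdlib integer hash;
# any stronger mixer would do for illustration.
proc hash2(x: int): Hash =
  cast[Hash](cast[uint64](x) * 0x9E3779B97F4A7C15'u64)

# A user-defined `hash` shadows the one imported from std/hashes, so any
# Table/HashSet of int instantiated in this module uses the new mixer.
proc hash(x: int): Hash = hash2(x)
```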
Anyway, there are also NOTES.md and TODO.md files, as well as doc comments at the tops of modules.