
pseudorandom probing for hash collision #13418

Merged: 19 commits into nim-lang:devel on Feb 19, 2020

Conversation

@timotheecour (Member) commented Feb 16, 2020

(EDIT: see followup #13440 with a "best of both worlds" approach and additional benchmarks)

Performance benchmarks show this can work; see below.

This PR investigates pseudorandom probing for hash collisions, as done in Python.
The main idea is to use this recursion in nextTry:

perturb = perturb shr PERTURB_SHIFT # with perturb initialized to hash(key) before recursion
h = ((5*h) + 1 + perturb) and maxHash

instead of linear probing:
h = (h + 1) and maxHash
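
A minimal sketch of what such a nextTry could look like (the actual proc in the PR differs in details; PERTURB_SHIFT = 5 is the constant CPython uses, maxHash is the slot mask 2^k - 1, and perturb is seeded with the full hash(key) by the caller):

  const PERTURB_SHIFT = 5

  proc nextTry(h, maxHash: int, perturb: var int): int {.inline.} =
    # `perturb` decays to 0 over the probe sequence; once it does, the
    # recursion degenerates to (5*h + 1) and maxHash, which visits every slot.
    perturb = perturb shr PERTURB_SHIFT
    result = (5*h + 1 + perturb) and maxHash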

consequences

  • at least similar (within 10%) performance in all cases I tested compared to before this PR (see #13410 for test cases)
  • no catastrophic slowdowns (issue #13393), which were causing 100x-1000x (and likely unbounded) slowdowns
  • hash key computation is now only required to produce good hash keys modulo 64 bits, not modulo n bits for n < 64, since the whole hash key is used in the probing sequence. That means hash key computation for primitive types can remain fast.
  • for the theory justifying ((5*h) + 1 + perturb) and maxHash, refer to the Python internals docs (the CPython dictionary implementation), which explain it well. The basic idea is that the ((5*h) + 1 + perturb) and maxHash recursion is affected by all bits of hash(key) (so different hash keys do not follow exactly the same path, and clustering is avoided), and if a free slot still hasn't been found by the time perturb reaches 0, the recursion degenerates to h = ((5*h) + 1) and maxHash, which is guaranteed to visit each slot exactly once. The visit occurs in pseudorandom order, and that order depends on the full 64-bit hash(key).
  • should be resilient to adversarial attacks (but can still be made more resilient by adding an (unguessable) fixed seed to the recursion, eg h = ((5*h) + 1 + perturb + seed) and maxHash with seed initialized upon application startup)

consequence for deletion

  • there is no way to adapt Algorithm R (deletion with linear probing) from Knuth Volume 3 to the new recursion, since multiple unrelated probe paths can pass through a given index (unlike with linear probing h = (h + 1) and maxHash, or even h = (h + c) and maxHash with c constant). So I instead keep track of deletions and adapt the table implementation as follows:
  • each cell is in one of 3 states: empty (hcode = 0), deleted (hcode = -1), valid (hcode != 0 and hcode != -1)
  • rehash when count(empty) < numSlots/3
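
A rough sketch of those two rules (the marker values follow the bullets above; the helper names are illustrative, not necessarily the ones used in the PR):

  const
    freeMarker = 0       # slot never used
    deletedMarker = -1   # tombstone left behind by del

  proc isFilled(hc: int): bool =
    hc != freeMarker and hc != deletedMarker

  proc mustRehash(numSlots, emptyCount: int): bool =
    # rehash once fewer than a third of the slots are truly empty, so that
    # tombstones cannot accumulate and make probe sequences unboundedly long
    emptyCount * 3 < numSlots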

note:

this is a slight modification from Python: I'm using translateBits, which results in large (eg 30x) speedups in cases like #13393 at zero extra cost; it makes the hash influence the probe path earlier on. Python doesn't do translateBits and does suffer from #13393 (although not catastrophically (eg 1000x), thanks to PERTURB_SHIFT), as you can see with:

  n = 10_000_000
  t={}
  for i in range(n):
    key = i << 32 # 68 sec 
    # key = i # 3 sec
    t[key] = str(key)

but with this PR, this isn't the case.
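
For comparison, a roughly equivalent Nim snippet (an illustrative benchmark of my own, not one from the PR) exercising the same i shl 32 key pattern against Table:

  import tables, times

  let n = 10_000_000
  var t = initTable[int64, string]()
  let start = epochTime()
  for i in 0 ..< n:
    let key = int64(i) shl 32   # pathological pattern from #13393
    # let key = int64(i)        # benign pattern, for comparison
    t[key] = $key
  echo epochTime() - start, " seconds"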

TODO for future PRs

  • do the same with:
    strtabs
    compiler/astalgo.nim
    compiler/treetab.nim
    system/cellsets
    (I don't quite understand why we have so many Table variants; it seems to me at least some of them could reuse existing ones)

  • investigate whether the following refinement would improve speed even further (see the sketch after this list):

  if iteration < thres: # eg thres = 4
    result = (h + 1) and maxHash
  else:
    perturb = perturb shr PERTURB_SHIFT
    result = ((5*h) + 1 + perturb) and maxHash
  iteration.inc

The idea is to improve cache locality unless we detect that we're in a dense collision cluster, in which case we switch to pseudorandom probing to break out of the cluster. In some benchmarks this led to speedups, but it would require a bit of refactoring to avoid introducing code duplication, so it is left for future work.

  • [EDIT] OrderedTable.del is O(n) (including before this PR), because every del rebuilds the whole table; using tombstones as introduced in this PR would instead make it O(1)

  • [EDIT] support dec (and zero counts) in CountTable using tombstones

  • SharedTable should auto-init like other tables; it should have some iterators, and simply ask the user to protect access via withLock instead of the current API, which is rather cumbersome and unlike other tables
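
Here is the threshold refinement from the list above as a self-contained sketch (the names depthThres and nextTryHybrid are mine, for illustration; the PR leaves this to future work):

  const
    PERTURB_SHIFT = 5
    depthThres = 4   # eg: stay linear for the first few probes

  proc nextTryHybrid(h, maxHash: int; perturb, depth: var int): int {.inline.} =
    if depth < depthThres:
      # short collision chain: linear probing keeps accesses cache-local
      result = (h + 1) and maxHash
    else:
      # dense cluster: switch to pseudorandom probing to break out of it
      perturb = perturb shr PERTURB_SHIFT
      result = (5*h + 1 + perturb) and maxHash
    inc depth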

@krux02 (Contributor) left a comment:

Here are some superficial comments on your WIP status. Feel free to ignore them if you don't like them.

@nim-lang nim-lang deleted a comment from krux02 Feb 17, 2020
@timotheecour timotheecour changed the title from "[WIP] pseudorandom probing for hash collision" to "pseudorandom probing for hash collision" Feb 19, 2020
@Araq Araq added the merge_when_passes_CI mergeable once green label Feb 19, 2020
@Araq (Member) commented Feb 19, 2020:

Superb work, should also be backported, if possible. @narimiran

@Clyybber (Contributor) left a comment:

Awesome!

Comment on lib/pure/testutils.nim, lines +7 to +15:
proc sortedPairs*[T](t: T): auto =
  ## helps when writing tests involving tables in a way that's robust to
  ## changes in hashing functions / table implementation.
  toSeq(t.pairs).sorted

template sortedItems*(t: untyped): untyped =
  ## helps when writing tests involving tables in a way that's robust to
  ## changes in hashing functions / table implementation.
  sorted(toSeq(t))
Contributor:

IMO you should use toSeq(t.pairs).sorted and sorted(toSeq(t)) in the tests instead.
This is bloat IMO and saves like 5 keypresses.

Member (Author):

  • sure, it's a bit "bloaty", but it simplifies tests (stdlib + other ones) and encourages using it instead of flaky alternatives like $t, which has caused much churn in the past every time hashes/table/etc algorithms change; I've had to fix a number of tests (both in the stdlib and in other nimble packages) to make them non-flaky, and this simplifies that.
  • it's a bit more than 5 keypresses in practice: toSeq(t.pairs).sorted prevents using method call syntax/UFCS (because pairs is an iterator), but sortedPairs can be used with MCS; in practice tests often involve expressions to define t, so having MCS makes the test simpler to read/write
  • (also, it needs 1 import testutils instead of 2 imports algorithm, sequtils to achieve this goal, but that argument is very minor; in practice import sequtils can cause further issues (see nim issues involving import sequtils except toSeq), so it'd in fact be import algorithm, from sequtils import toSeq; etc)

all these complications I've had in the past when fixing tests led me to the simple realization that testutils is needed.
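
For illustration, this is how the two spellings compare at a call site (a small self-contained example; sortedPairs is inlined here rather than imported, since testutils may not end up in the stdlib):

  import tables, algorithm, sequtils

  proc sortedPairs[T](t: T): auto = toSeq(t.pairs).sorted

  let t = {"b": 2, "a": 1}.toTable
  # with the helper, method call syntax reads naturally:
  doAssert t.sortedPairs == @[("a", 1), ("b", 2)]
  # spelled out, it must be written in prefix form because `pairs` is an iterator:
  doAssert sorted(toSeq(t.pairs)) == @[("a", 1), ("b", 2)]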

Contributor:

As I said, use toSeq(.... in the tests.
Even if it is only "a bit bloaty" it is too much to be a candidate for adding to the stdlib IMO.
All the other arguments are very minor, as you admit yourself.
Please remove this module.

@narimiran narimiran merged commit 8c22518 into nim-lang:devel Feb 19, 2020
@timotheecour timotheecour deleted the pr_pseudorand_probing branch February 19, 2020 20:43
@c-blake (Contributor) commented Feb 19, 2020:

Looking over the one set of benchmarks I could find that you used to justify "low performance impact" (https://github.com/timotheecour/vitanim/blob/master/testcases/tests/t0129b.nim, right???), I see only tests with about a million 8-byte keys in hot loops. That's 100% in L3 cache, which has roughly 10x better latency than dynamic RAM. In short, it doesn't really test anything interesting about this table structure change.

@c-blake commented Feb 19, 2020:

Perhaps more constructively: Robin Hood hashing is honestly pretty simple, arguably simpler than the explanation for this new probe sequence. It just keeps collision chains sorted by their "distance from their home cell" (where they would hash to in an empty table). That's it!

You can shift cells around with moveMem. Since we save hash codes, we can easily and quickly compute the home cell. As a bonus, for unsuccessful lookups (such as at insert-new-key time) you can stop, on average, halfway down the collision chain (once you pass the spot where a key "should have been" if it were present).

There are a lot of articles on this if you search for them, e.g. http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion/. I've done it twice before, once in C and once in Nim. If you need me to, I can do a PR, but it might have to wait until next week.
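
To make that concrete, here is a minimal sketch of Robin Hood insertion based on the description above (illustrative only; key updates, resizing, and backward-shift deletion are omitted, and the names are not from any existing Nim implementation):

  import hashes

  type
    Cell[K, V] = object
      hcode: Hash   # 0 means "empty"
      key: K
      val: V
    RHTable[K, V] = object
      data: seq[Cell[K, V]]   # length is a power of two
      counter: int

  proc probeDist(hcode: Hash, slot, mask: int): int =
    # distance of `slot` from this entry's home slot, wrapping around the table
    (slot - (hcode and mask) + mask + 1) and mask

  proc rhInsert[K, V](t: var RHTable[K, V], key: K, val: V) =
    # assumes there is room; a real implementation would resize first
    let mask = t.data.len - 1
    var c = Cell[K, V](hcode: hash(key), key: key, val: val)
    if c.hcode == 0: c.hcode = 1        # reserve 0 for "empty"
    var i = c.hcode and mask
    var dist = 0                        # how far `c` currently is from its home
    while true:
      if t.data[i].hcode == 0:          # empty slot: claim it
        t.data[i] = c
        inc t.counter
        return
      let resident = probeDist(t.data[i].hcode, i, mask)
      if resident < dist:               # resident is closer to home: swap ("rob the rich")
        swap(t.data[i], c)
        dist = resident
      i = (i + 1) and mask              # keep scanning linearly
      inc dist

An unsuccessful lookup can stop as soon as the resident cell's probeDist is smaller than the current search distance, which is the early-exit property mentioned above.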

@c-blake commented Feb 19, 2020:

Rust used a Robin Hood table for quite a few years (until it switched to the much more complex port of Google's "Swiss" Tables).

@c-blake commented Feb 19, 2020:

...and those "Swiss" tables mostly only get speed through vectorization tricks and much fine-tuning, which is an ongoing maintenance burden as CPUs evolve/diversify, and also IIRC they do not have the minimum-variance guarantees of Robin Hood or its consequent 90+% memory utilization. And RH has no extra hard-to-tune parameters: same old "when to resize", that's it.

RH is even friendly to being used as a disk-resident table, like B-trees. These days fast computers have a budget of roughly 500 cycles multiplied by 4-way or 5-way superscalar issue, i.e. 2000-2500 dynamic CPU instructions, per cache miss. At that scale, almost anything that is supposed to handle larger-than-L3-cache data should be done almost the same way as if you did it "on disk". Random probing decidedly does not have that trait.

@c-blake commented Feb 19, 2020:

Oh, and incidentally: to protect users who don't know how tables and hashes work from long collision chains, we should just count the loops made in the comparisons. Experts should be able to opt out with a special define (or maybe as part of danger)... somehow.

Then you could write out some kind of run-time warning message (probably exactly once per Table instance, or maybe use a global variable to make it once per program or once per thread using a Table). I'm not sure whether there are other similar run-time "things are not going as planned but maybe we should not quit/raise exceptions" warnings in the stdlib, or if so, how they are reported. It seems unlikely that excessive hash collisions are the only case of something like this, though we may currently punt on all of them.

Anyway, with that in place, if you're a naive user and hashes misbehave, then you get told what to do: use a better hash function. Also, if you didn't take care against adversaries and you got attacked, this would at least write a message somewhere, e.g. if you had code not designed to be safe from attack that was then run in an unsafe environment.
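
A sketch of what that counting could look like (the threshold, define name, and message are all illustrative, not an existing API); the table's probe loop would call checkProbeDepth once per iteration:

  const deepProbeWarnThreshold = 64     # purely illustrative cutoff

  when not defined(noDeepProbeWarning): # hypothetical opt-out define (or tie it to danger)
    var deepProbeWarned = false         # once per program/thread; could live per table instead

  template checkProbeDepth(depth: int) =
    when not defined(noDeepProbeWarning):
      if depth > deepProbeWarnThreshold and not deepProbeWarned:
        deepProbeWarned = true
        stderr.writeLine "Table: unusually long collision chain; consider a stronger hash() or a B-tree"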

@timotheecour (Member, Author) commented Feb 20, 2020:

I see only tests with like a million 8 byte keys in hot loops. That's 100% in L3 cache which has approximately 10X better latency than dynamic RAM. In short, it doesn't really test anything interesting about this table structure change.

I've also run tests on 20M keys and 100M keys (with key, val, hcode each int64, ie cell size with padding = 24B); that is 2.4GB, which definitely doesn't fit in my 16MB cache.

robin hood hashing

If you need me to, I can do a PR, but it might have to wait until next week.

@c-blake if you think you can improve on the numbers I posted in #13440, by all means please do :) ; a draft PR is all that's needed for a start (ie no need to fix everything, just enough to run the benchmark I posted in that PR and compare apples to apples: #13440 against Robin Hood hashing or Swiss tables or your own version of either).
Note that, whether with pseudorandom probing or Robin Hood hashing or other collision algorithms, the major difficulty is not really implementing the algorithm, but adapting all the existing code to work with it; there is quite a bit of code duplication that needs to be refactored before we make further improvements.

One concern: it seems to me Robin Hood suffers from #13393 as well (eg try with keys of the form i shl 32); if you use a better hash with Robin Hood, you'll pay a price to compute the hash (see the performance numbers I have); if you don't, you'll end up with a few large clusters and large probe sizes.
However, IMO the main idea of Robin Hood can be combined with the mixed approach of #13440: for low depths we proceed with a linear probe and swap along the search as done in Robin Hood, and for depth > thres we switch to pseudorandom probing.

@c-blake commented Feb 20, 2020:

As mentioned, tombstones cause performance-lumpiness grief for steady-state, constant-membership, deletion-heavy workloads like a cache.

Also, while I am generally in favor of data-drivenness and benchmarks, as mentioned, hot-loop benchmarks can be so misleading that theoretical arguments tell you more when the primary cost driver is DIMM latency. The CPU branch predictor detects the hot loop and works ahead by ~1000 instructions to mask dynamic RAM latency, which severely pollutes these numbers. Even if you benchmark array_access[random_element] where you use a PRNG for the element, you get misled about what happens when you don't have a perfectly oiled loop machine. CPU performance counters for aggregate cache misses do not even tell the whole story because of pre-fetching logic. And when you try to use more complicated things as benchmarks, then all that complication pollutes your numbers, too. There are model-specific registers you can use to turn off pre-fetching, but then the numbers are suspect for other reasons.

You don't have to fly completely blind. On my main box at home I have 65 ns latency and 32 GB/s bandwidth, so a scan of up to 65*32 = 2080 bytes, or half a memory page, can be hidden. That is very fast RAM, though; most machines would be more like (80-120)*(20..24), or 1600..2880 bytes. These are all really large regions for hash cells < 32B. Regular linear probing with a good hash has a longest probe depth per table of about log2(table size). So, for lookup latency, by far the most sensible thing to do is just fall back on "well, you cannot do better than 1 fetch plus a linear-in-physical-memory scan, as long as the scan is short-ish". Your threshold thing kind of gets there, but tombstones create unnecessary aging/rehashing trouble and the temptation to use slow probe sequences whose slowness is not detected by naive benchmarks, as happened here.
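
Spelling out the arithmetic in those figures (just a restatement of the numbers above):

  # bytes of sequential scanning that one cache miss's latency can hide:
  # latency [ns] * bandwidth [GB/s] gives bytes, since 1 ns * 1 GB/s = 1 byte
  echo 65.0 * 32.0                        # fast RAM:    ~2080 bytes, about half a page
  echo 80.0 * 20.0, " .. ", 120.0 * 24.0  # typical RAM: ~1600 .. 2880 bytes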

While you can fall down a rabbit hole of ever more complex probe sequences, and there are many papers on this, you know what? That's all about memory systems from 40+ years ago. None will be as optimized by the CPU+memory system as a linear scan. Fortunately or unfortunately, the entire CPU+memory system has been optimized to hide latency in benchmarks. Even having an isolated nextTry proc, tombstones, and "ease of experimentation" is probably asking for trouble like what happened here. But yeah, you do need your hash to do some kind of minimal spreading out.

Appeal to "but Python did it this way" is also wrong for multiple reasons. Python was always slow and uses these things as part of its basic object model so much that it's all selected for in-L1-cache tiny tables than broad workloads. CPython also no longer even does it this way for tables, only sets. Also, some 1.5-2.0x performance hit at 75% full without the threshold and tombstones would not bother Python users anyway who would "outsource" performance sensitive stuff to some more carefully done C thing.

This depend-on-the-whole-hash-code sequence doesn't help with attacks - attackers can just collide in the code instead of the modulus. We all get that. It only helps with weak user hashes, but those could also collide the whole code. So, it's still a band-aid on totally awful hashes, and @Araq is absolutely correct that the right solution for scaling guarantees is a B-tree. Some warning message like "far more than expected hash loops - use a better hash function or B-tree" might well direct someone to do exactly that.

Anyway, in short: A) your hot loops measure something not of interest besides the hot loop itself, so your 1.06x number is not quite meaningless but not what you should base decisions/comparisons upon; B) the real thing of interest is hard enough to measure that people agreeing on "realistic workloads" will be nearly impossible; C) there exists a simple alternative dodging all this, namely Robin Hood, used by performance-freak/non-outsourcing Rustaceans for like 5 years until they moved to something with CPU-architecture-specific tuning and its obvious ongoing maintenance burden; D) counting loops in rawGet or whatever and warning users 1) will always be necessary, 2) is super easy to have, to disable, and to regulate to avoid zillions of messages, and 3) can immediately guide them to a better solution - the only thing needing agreement is how the runtime should warn.

@Araq (Member) commented Feb 20, 2020:

If you need me to, I can do a PR, but it might have to wait until next week.

Oh yeah please!

@Araq commented Feb 20, 2020:

with further code improvements + refactorings, I expect it will eventually become easier to both experiment with and customize the hashing+collision resolution and have it work across all Table types (there are many); right now there is much code duplication making this hard.

The code duplication also helps. The compiler hashes Nim identifiers, not "random numbers", and so a specialized implementation can exploit this where a stdlib impl can't. The compiler also never deletes anything inside a table, so you can leave out the tombstone logic.

@timotheecour commented Feb 20, 2020:

@c-blake you can take a look at this WIP for robin hood hashing:
timotheecour@96a2d49

it doesn't yet allow bootstrapping Nim, but it should be usable as a library, at least for Table; it can be used as follows:

nim c --lib:lib -r -d:danger $vitanim_D/testcases/tableutils/benchmain.nim

It seems to work, but performance seems unequivocally worse compared to #13440 using the same benchmark as in that PR; of course that could very well be due to bugs in my implementation, which I wrote quickly as a POC; feel free to improve it / come up with a better one in your own PR and hopefully report better performance numbers. As I mentioned earlier, it does suffer from #13393, and better hashing would also incur a runtime cost.

The compiler hashes Nim identifiers, not "random numbers" and so a specialized implementation can exploit this, a stdlib impl can't.

I would hope, with all the work done to improve tables/hashes, that the stdlib implementation is now good enough for such use cases without any compromise on performance, and that at least some of strtabs, compiler/astalgo, compiler/treetab, system/cellsets, and sets could simply reuse tables, or at least a common denominator that would allow fine-tuning based on runtime/compile-time properties, eg depthThres, mustRehashFillRatio, canDeleteCell. I don't think this would create a monster, quite the opposite.

@Araq commented Feb 20, 2020:

I don't think this would create a monster, quite the opposite.

Maybe, but I don't see the point: instead of optimizing N different implementations you "optimize" a single one for diverse use cases. That makes the problem much harder to solve; optimization is specialization. For example, cellsets hashes pointers, nothing else. The only reason so much effort has to be put into the stdlib's hash tables is that we cannot anticipate how they're used.

@c-blake commented Feb 20, 2020:

I agree having >1 entry point makes sense, and "optimization is specialization" is a very quotable saying. Some uses require no delete, which is another wrinkle besides the ones @Araq mentioned. I'm not sure the right number is the exact current number, but each candidate for conversion should be considered very carefully. Adversarial inputs or bad user hashes are other dimensions the impls do not all share. So, at least 3 bits / 8 possibilities; maybe 2 bits / 4 if you want to coalesce adversarial & bad. There are surely others.

There is this VLDB 2015 paper, "A Seven-Dimensional Analysis of Hashing Methods", with a crazy complicated flowchart at the end. Nothing is perfect, but Robin Hood has the most "arrows in" of the bunch. (I also bet it'd be easy to poke holes in their benchmarks, though.)

@dom96 dom96 mentioned this pull request Mar 28, 2020
narimiran pushed a commit that referenced this pull request Mar 31, 2020
…nges (#13816)

* Unwind just the "pseudorandom probing" (whole hash-code-keyed variable
stride double hashing) part of recent sets & tables changes (which has
still been causing bugs over a month later (e.g., two days ago
#13794) as well as still having
several "figure this out" implementation question comments in them (see
just diffs of this PR).

This topic has been discussed in many places:
  #13393
  #13418
  #13440
  #13794

Alternative/non-mandatory stronger integer hashes (or vice-versa opt-in
identity hashes) are a better solution that is more general (no illusion
of one hard-coded sequence solving all problems) while retaining the
virtues of linear probing such as cache obliviousness and age-less tables
under delete-heavy workloads (still untested after a month of this change).

The only real solution for truly adversarial keys is a hash keyed off of
data unobservable to attackers.  That all fits better with a few families
of user-pluggable/define-switchable hashes which can be provided in a
separate PR more about `hashes.nim`.

This PR carefully preserves the better (but still hard coded!) probing
of the  `intsets` and other recent fixes like `move` annotations, hash
order invariant tests, `intsets.missingOrExcl` fixing, and the move of
`rightSize` into `hashcommon.nim`.

* Fix `data.len` -> `dataLen` problem.
@c-blake c-blake mentioned this pull request Apr 9, 2020
Araq pushed a commit that referenced this pull request Apr 15, 2020

* This is an alternate resolution to #13393
(which arguably could be resolved outside the stdlib).

Add version1 of Wang Yi's hash specialized to 8 byte integers.  This gives
simple help to users having trouble with overly colliding hash(key)s.  I.e.,
  A) `import hashes; proc hash(x: myInt): Hash = hashWangYi1(int(x))`
      in the instantiation context of a `HashSet` or `Table`
or
  B) more globally, compile with `nim c -d:hashWangYi1`.

No hash can be all things to all use cases, but this one is A) vetted to
scramble well by the SMHasher test suite (a necessarily limited but far
more thorough test than prior proposals here), B) only a few ALU ops on
many common CPUs, and C) possesses an easy via "grade school multi-digit
multiplication" fall back for weaker deployment contexts.

Some people might want to stampede ahead unbridled, but my view is that a
good plan is to
  A) include this in the stdlib for a release or three to let people try it
     on various key sets nim-core could realistically never access/test
     (maybe mentioning it in the changelog so people actually try it out),
  B) have them report problems (if any),
  C) if all seems good, make the stdlib more novice friendly by adding
     `hashIdentity(x)=x` and changing the default `hash() = hashWangYi1`
     with some `when defined` rearranging so users can `-d:hashIdentity`
     if they want the old behavior back.
This plan is compatible with any number of competing integer hashes if
people want to add them.  I would strongly recommend they all *at least*
pass the SMHasher suite since the idea here is to become more friendly to
novices who do not generally understand hashing failure modes.

* Re-organize to work around `when nimvm` limitations; Add some tests; Add
a changelog.md entry.

* Add less than 64-bit CPU when fork.

* Fix decl instead of call typo.

* First attempt at fixing range error on 32-bit platforms; Still do the
arithmetic in doubled up 64-bit, but truncate the hash to the lower
32-bits, but then still return `uint64` to be the same.  So, type
correct but truncated hash value.  Update `thashes.nim` as well.

* A second try at making 32-bit mode CI work.

* Use a more systematic identifier convention than Wang Yi's code.

* Fix test that was wrong for as long as `toHashSet` used `rightSize` (a
very long time, I think).  `$a`/`$b` depend on iteration order which
varies with table range reduced hash order which varies with range for
some `hash()`.  With 3 elements, 3!=6 is small and we've just gotten
lucky with past experimental `hash()` changes.  An alternate fix here
would be to not stringify but use the HashSet operators, but it is not
clear that doesn't alter the "spirit" of the test.

* Fix another stringified test depending upon hash order.

* Oops - revert the string-keyed test.

* Fix another stringify test depending on hash order.

* Add a better than always zero `defined(js)` branch.

* It turns out to be easy to just work all in `BigInt` inside JS and thus
guarantee the same low order bits of output hashes (for `isSafeInteger`
input numbers).  Since `hashWangYi1` output bits are equally random in
all their bits, this means that tables will be safely scrambled for table
sizes up to 2**32 or 4 gigaentries which is probably fine, as long as the
integer keys are all < 2**53 (also likely fine).  (I'm unsure why the
infidelity with C/C++ back ends cut off is 32, not 53 bits.)

Since HashSet & Table only use the low order bits, a quick corollary of
this is that `$` on most int-keyed sets/tables will be the same in all
the various back ends which seems a nice-to-have trait.

* These string hash tests fail for me locally.  Maybe this is what causes
the CI hang for testament pcat collections?

* Oops. That failure was from me manually patching string hash in hashes.  Revert.

* Import more test improvements from #13410

* Fix bug where I swapped order when reverting the test.  Ack.

* Oh, just accept either order like more and more hash tests.

* Iterate in the same order.

* `return` inside `emit` made us skip `popFrame` causing weird troubles.

* Oops - do Windows branch also.

* `nimV1hash` -> multiply-mnemonic, type-scoped `nimIntHash1` (mnemonic
resolutions are "1 == identity", 1 for Nim Version 1, 1 for
first/simplest/fastest in a series of possibilities.  Should be very
easy to remember.)

* Re-organize `when nimvm` logic to be a strict `when`-`else`.

* Merge other changes.

* Lift constants to a common area.

* Fall back to identity hash when `BigInt` is unavailable.

* Increase timeout slightly (probably just real-time perturbation of CI
system performance).
dom96 pushed a commit that referenced this pull request Jun 10, 2020

* Add neglected API call `find` to heapqueue.

* Add a changelog.md entry, `since` annotation and rename parameter to be
`heap` like all the other procs for consistency.

* Add missing import.
@timotheecour timotheecour changed the title from "[TODO] pseudorandom probing for hash collision" to "pseudorandom probing for hash collision" Sep 16, 2021