Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HashSet[uint64] slow insertion depending on values #11764

Closed
cwpearson opened this issue Jul 17, 2019 · 58 comments
Closed

HashSet[uint64] slow insertion depending on values #11764

cwpearson opened this issue Jul 17, 2019 · 58 comments

Comments

@cwpearson
Copy link

The rate of inserting uint64s into a hash set varies wildly with the order and the range of integers inserted

Example

# slow_set.nim
import sets
import times

when isMainModule:
    var hs1: HashSet[uint64]
    var hs2: HashSet[uint64]
    var hs3: HashSet[uint64]

    # insert 0..200k
    var time = cpuTime()
    for i in 0..200_000:
        let k1 = uint64(i)
        hs1.incl(k1)
    echo "time ", (cpuTime() - time)

    # interleave insert 0..100k and 100k..200k
    time = cpuTime()
    for i in 0..100_000:
        let k1 = uint64(i)
        let k2 = uint64(i + 100_000)
        hs2.incl(k1)
        hs2.incl(k2)
    echo "time ", (cpuTime() - time)

    # interleave insert 0..100k and 1.0M..1.1M
    time = cpuTime()
    for i in 0..100_000:
        let k1 = uint64(i)
        let k2 = uint64(i + 1_000_000)
        hs3.incl(k1)
        hs3.incl(k2)
    echo "time ", (cpuTime() - time)

Current Output

Compiled with nim c -d:release -r slow_set.nim.

time 0.016778
time 1.01831
time 14.043752

Expected Output

I would expect the three different insertion loops to take roughly the same amount of time.
They are all inserting the same amount of unique values.
In the second loop, I just interleave the insertion order, and in the third loop, I insert some larger numbers.

Possible Solution

Additional Information

This issues #10097 talks about integer hashing and collisions, but all of my uint64s are well below the max uint32.

This also happens with hashTable keys, presumably for similar reasons?

In my actual project I am deduplicating an edge list for a graph where the source and destination vertex for a node are sort of similar to loop 3, so the performance is not good.

$ nim -v
Nim Compiler Version 0.20.2 [MacOSX: amd64]
Compiled at 2019-07-17
Copyright (c) 2006-2019 by Andreas Rumpf

git hash: 88a0edba4b1a3d535b54336fd589746add54e937
active boot switches: -d:release
@krux02
Copy link
Contributor

krux02 commented Jul 17, 2019

In general I have to say that you can always construct a sequence of keys that works very bad for the hash set implementation no matter what the hashing function does. But this does indeed look strange, as these keys don't look like as if they would cause particularly many collisions. Did you measure what part exactly causes this slowdown?

EDIT: I did the benchmark, this loop here is where all the time is wasted:

while isFilled(t.data[h].hcode):

@cwpearson
Copy link
Author

I added 3 more experiments, with resizing the hashSet first:

import sets
import times

when isMainModule:
    var hs1: HashSet[uint64]
    var hs2: HashSet[uint64]
    var hs3: HashSet[uint64]
    var hs4 = initHashSet[uint64](rightSize(100_000))
    var hs5 = initHashSet[uint64](rightSize(200_000))
    var hs6 = initHashSet[uint64](rightSize(1_100_000))

    # insert 0..200k
    var time = cpuTime()
    for i in 0..200_000:
        let k1 = uint64(i)
        hs1.incl(k1)
    echo "time ", (cpuTime() - time)

    # interleave insert 0..100k and 100k..200k
    time = cpuTime()
    for i in 0..100_000:
        let k1 = uint64(i)
        let k2 = uint64(i + 100_000)
        hs2.incl(k1)
        hs2.incl(k2)
    echo "time ", (cpuTime() - time)

    # interleave insert 0..100k and 1.0M..1.1M
    time = cpuTime()
    for i in 0..100_000:
        let k1 = uint64(i)
        let k2 = uint64(i + 1_000_000)
        hs3.incl(k1)
        hs3.incl(k2)
    echo "time ", (cpuTime() - time)

    # interleave insert 0..100k and 1.0M..1.1M
    # but insert into a hashSet with space for100k
    time = cpuTime()
    for i in 0..100_000:
        let k1 = uint64(i)
        let k2 = uint64(i + 1_000_000)
        hs4.incl(k1)
        hs4.incl(k2)
    echo "time ", (cpuTime() - time)

    # interleave insert 0..100k and 1.0M..1.1M
    # but insert into a hashSet with space for 200K
    time = cpuTime()
    for i in 0..100_000:
        let k1 = uint64(i)
        let k2 = uint64(i + 1_000_000)
        hs5.incl(k1)
        hs5.incl(k2)
    echo "time ", (cpuTime() - time)

    # interleave insert 0..100k and 1.0M..1.1M
    # but insert into a hashSet with space for 1.1M
    time = cpuTime()
    for i in 0..100_000:
        let k1 = uint64(i)
        let k2 = uint64(i + 1_000_000)
        hs6.incl(k1)
        hs6.incl(k2)
    echo "time ", (cpuTime() - time)

By resizing the hashSet large enough to cover the largest expected integer, we recover all the performance (loop 6) and then some.
Making it big enough to just probably hold all of the integers recovers some of the performance (loop 5).
Making it large but still too small doesn't have much impact (loop 4).

time 0.017222
time 1.267755
time 16.218246
time 15.879905            # initHashSet(rightSize(100k))
time 9.667273999999999.   # initHashSet(rightSize(200k))
time 0.003250000000001307 # initHashSet(rightSize(1.1M))

It's been a while since I took my CS algorithms course, but is it possible that the logic which controls when to resize the HashSet as it grows has some problems?

@cwpearson
Copy link
Author

@cwpearson
Copy link
Author

Hi, now that 1.0.0 has come out I re-ran this https://github.com/cwpearson/nim-issue-11764:

Nim Compiler Version 1.0.0 [Linux: amd64]
Compiled at 2019-09-23
Copyright (c) 2006-2019 by Andreas Rumpf

git hash: f7a8fc46c0012033917582eb740dc0343c093e35
active boot switches: -d:release
(1) time 0.015417049
(2) time 0.952415456
(3) time 12.335662364
(4) time 12.212460556
(5) time 7.530930397999999
(6) time 0.003257626999996432
  • (1): insert 0.200k into set in order
  • (2): insert 0..100k and 100k..200k interleaved
  • (3): insert 0..100k and 1M..1.1M interleaved
  • (4): (3), but hashset is sized for 100k insertions
  • (5): (3), but hashset is sized for 200k insertions
  • (6): (3), but hashset is sized for 1.1M insertions

I'd expect 1-3 to be roughly the same performance, and 4-6 to be somewhat better than 1-3 due to the pre-allocation. In practice, that is still not what we see.

@nim-lang nim-lang deleted a comment from krux02 Oct 9, 2019
@Araq
Copy link
Member

Araq commented Oct 9, 2019

Looks like you crafted the numbers to produce collisions, that is always a danger when using hashing.

@cwpearson
Copy link
Author

cwpearson commented Oct 9, 2019

Thanks!

It looks like the hash for int is just that int value:

Nim/lib/pure/hashes.nim

Lines 115 to 117 in e9f7455

proc hash*(x: int): Hash {.inline.} =
## Efficient hashing of integers.
result = x
.

It seems like the hashset implementation tries the next consecutive bucket when there is a collision:

proc nextTry(h, maxHash: Hash): Hash {.inline.} =
result = (h + 1) and maxHash

The general wisdom for open addressing sets seems to be that the hash function needs to avoid placing consecutive values in consecutive buckets (wikipedia). "Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent."

  1. I agree that collisions are always possible with a hash function, but this choice of hash function seems like it has collisions in a common scenario, where other choices of hash functions would only have collisions in a pathological case. The "efficient" hash implementation for ints is actually causing a big performance issue.
  2. Even if you don't agree with 1), I think the concern about clustering summarized in Wikipedia may argue for a redesign of the hash/hashset implementation in the context of integers. When folks are hashing they probably want to avoid collisions, so it might be worth considering using the MurmurHash that is used for other data on ints as well, even if the cost of each hash is higher.

As an aside, I gave IntSet from the Nim lib a try and the performance is much better (case (8) in https://github.com/cpwearson/nim-issue-11764).

@Araq
Copy link
Member

Araq commented Oct 9, 2019

But my point is that

   let k1 = uint64(i)
   let k2 = uint64(i + 1_000_000)

is not common. Usually IDs are mostly consecutive. May I ask where your integers come from?

narimiran added a commit to narimiran/Nim that referenced this issue Oct 10, 2019
narimiran added a commit to narimiran/Nim that referenced this issue Oct 14, 2019
narimiran added a commit to narimiran/Nim that referenced this issue Oct 14, 2019
@Araq Araq closed this as completed in 734da9e Oct 15, 2019
narimiran added a commit that referenced this issue Oct 23, 2019
@rockcavera
Copy link
Contributor

I noticed the same problem with the sets package and tables. In order not to open a new issue, I decided to comment here.

For testing I used the developer version 2019-11-14 of Nim. I compiled with gcc, clang and vcc.

I created a repository with demo code and problem reporting. https://github.com/rockcavera/nim-problem

I believe the issue should be reopened.

@Araq Araq reopened this Nov 15, 2019
@rockcavera
Copy link
Contributor

I believe the C# HashSet implementation is good and can be used as an example for Nim.

I believe this is the source code

Here the highu64.txt file was processed in 0.220885200 seconds against 88.57006502151489 seconds for my Nim implementation with Sets.

@narimiran
Copy link
Member

I'm already working on it, and the Python's way of probing (link) seems to work nice with the current Nim's implementation.

The remaining problem is: I was not able to make Nim to bootstrap anymore with the new probing.

alehander92 pushed a commit to alehander92/Nim that referenced this issue Dec 2, 2019
alehander92 pushed a commit to alehander92/Nim that referenced this issue Dec 2, 2019
alehander92 pushed a commit to alehander92/Nim that referenced this issue Dec 2, 2019
@rurban
Copy link

rurban commented Apr 8, 2020

Well, inlining is crucial for hash functions, so they need to be small. wyhash is not really that small.

Swiss tables and friends do make a huge difference, esp. for such VM's. The only problem that all 3 good implementations do need C++ so far. In my VM's I was not yet comfortable with that requirement. But it's only a matter of time until someone ports it over to C. Improvements in C++ had been amazing, I used the best one for SMHasher, and this is even thread-safe

@c-blake
Copy link
Contributor

c-blake commented Apr 8, 2020

@rurban, the fallback is big, but the have-128-bit-double-product-on-CPU is not. It's not Rotate-Multiply small, but it's less than 3x more. You may be thinking of the Wang Yi string hash which is much more ornate through much manual unrolling activity, not my specialized 8-byte int adaptation.

Re: Swiss tables, it's all about "compared to what along what dimension?". Time/memory/etc. I get L1 int-keyed hot-everything lookups of like 1/2 a pipeline depth with vanilla LP soup to nuts whole lookup time. On the giant side, time is dominated by the initial uncached memory load or 65-120ns unless you "cook the books" with hot loop benchmarks. In the middle is a muddle of partial prediction/partial caching. Work prediction/prefetching in general creates an enormous amount of confusion in benchmarkers, even among pros, and is a big conversation best had elsewhere (e.g. over at adix).

@c-blake
Copy link
Contributor

c-blake commented Apr 8, 2020

(I do realize I was kind of talking about "both" string hashes & int hashes and so no offense, and that conversation going afield is what makes me keep saying to take the topic elsewhere from this uint64 issue.)

@c-blake
Copy link
Contributor

c-blake commented Apr 8, 2020

Unless you know, as you well might!, of an 8 byte integer hash that passes SMHasher distro tests and is smaller/faster than roughly:

proc hiXorLo(a, b: uint64): uint64 {.inline.} =
  {.emit: "__uint128_t r=a; r*=b; `result` = (r>>64)^r;".}
proc hashWangYi1*(x: int64|uint64|Hash): Hash {.inline.} =
  const P0 = 0xa0761d6478bd642f'u64
  const P1 = 0xe7037ed1a0b428db'u64
  const P5x8 = 0xeb44accab455d165'u64 xor 8'u64
  Hash(hiXorLo(hiXorLo(P0, uint64(x) xor P1), P5x8))

@tommyettinger
Copy link

tommyettinger commented Apr 10, 2020

I'd definitely do some double-checking and make sure that hashWangYi1 is a bijection; if it isn't a bijection but has 64-bit input, 64-bit output, then some outputs are not possible. Most testing suites have a hard time detecting gaps (or repeats) in 64-bit output ranges, and it may not be an issue because the 64-bit number space is so vast. I usually estimate functions like hiXorLo as being unable to produce about 1/3 of all possible results, based on testing with a 32-bit version of a similar wyhash-derived function that can be fully explored. If that's the case here, then two hiXorLo calls would not be able to produce roughly 5/9 of all possible outputs, and the remaining 4/9 of possible outputs would be unevenly distributed. This may be acceptable for some usages.

EDIT: It is definitely not a bijection. I ran a test by M.E. O'Neill meant for detecting random number generators that can't produce duplicates, which is a problem for some PRNGs, but a necessity for hash functions that should be able to return any hash. After about 45 minutes of testing, it had generated 50916406 64-bit values given unique inputs to hashWangYi1. Of those, there were 8 repeats, which is twice as many as it expected for a fairly-distributed PRNG (it gave a p-value of just over 0.95 when treating it as a PRNG), but as a unary hash it should have none.

@c-blake
Copy link
Contributor

c-blake commented Apr 10, 2020

@tommyettinger - Solid point. The xor folding definitely makes it non-invertible. { Psychology Replication Crisis-wise, I'm obligated to request a statistically independent second trial to confirm that p=0.95 result. :-) :-) What was your input set? 1..50916406? }

A hot spot in the table only 2x overloaded isn't bad if those whole-64-bit repeats don't translate into >10s of times expected collisions in just lower bits. It's not Cuckoo hashing where slightly weak hashes wreak havoc. ;-) But, of course, they could be collide-y. We do and-mask reduction of hash outputs to table address. So, we care most about the lower bits.

Anyway, I absolutely see how invertibility would add safety if we don't have to give up the SMHasher avalanche-y/high-sensitivity-to-input property or much speed. Do you (or anyone else) know of a comparably small, fast, mixing/sensitive integer hash that is invertible? I have always thought the right answer was to provide a few alternatives with easy switching. So, if you have a better recommendation I am all ears. Thanks!

@c-blake
Copy link
Contributor

c-blake commented Apr 10, 2020

(and I know about the Thomas Wang one already mentioned in this thread, but that seems like an awful lot of ALU operations..like 15..22 compared to the 5 of that WangYi1, and I don't know anyone has ever even run the relevant SMHasher 8-byte pattern tests on it or elsewise really vetted it. Someone should do an SMHasher/BigCrush suite specialized for integer hashes..Maybe you!)

@tommyettinger
Copy link

Pelle Evensen has an incredibly hard-core routine he evaluates unary hashes with, based on many runs of PractRand (for testing random number generators) with altered input patterns. Anything that passes it is probably more expensive than 5 operations (although I counted more in the wyhash-based hash, some may not run on the ALU: 3 xor, 2 shift, 2 128-bit multiply). There is a nice middle ground with Moremur, which is invertible and only requires 3 xor, 3 shift, 2 64-bit multiply, although they're all sequential. Moremur doesn't pass Evensen's full test battery, but it passes a lot more of it than similar hashes, like MurmurHash3's mixer.

uint64_t moremur(uint64_t x) {
  x ^= x >> 27;
  x *= 0x3C79AC492BA7B653UL;
  x ^= x >> 33;
  x *= 0x1C69B3F74AC4AE35UL;
  x ^= x >> 27;
  return x;
}

One that does pass the full several-day battery of tests is NASAM, which looks like

uint64_t nasam(uint64_t x) {
  // ror64(a, r) is a 64-bit right rotation of a by r bits.
  // the xor of an odd number of rotations of the same term is invertible.
  // here, one of the rotations is effectively a rotation by 0.
  x ^= ror64(x, 25) ^ ror64(x, 47);
  x *= 0x9E6C63D0676A9A99UL;
  // all xorshifts in the same direction, even with multiple xors and shifts, are invertible.
  x ^= x >> 23 ^ x >> 51;
  x *= 0x9E6D62D06F6A9A9BUL;
  x ^= x >> 23 ^ x >> 51;
  return x;
}

Unless you anticipate extremely strange input sets, Moremur probably provides more than enough speed and collision resistance for int hashing.

@peteroupc
Copy link

Do you (or anyone else) know of a comparably small, fast, mixing/sensitive integer hash that is invertible?

In this respect, I am aware of the Hash Prospector (by @skeeto), which does a computer search for good hash functions that follow a given sequence of reversible operations. See also the author's article "Prospecting for Hash Functions".

@c-blake
Copy link
Contributor

c-blake commented Apr 10, 2020

@tommyettinger, hiXorLo is just 1 multiply (with a double register output) and one xor. In C you "spell register access" with a 64-bit shift and a mask. So, I was counting it as 1 initial xor with P0, then two products and two xors in the machine code. Just reading at these other two make them seem quite a bit more costly, like Thomas Wang's, but maybe more scrambling/uniform. (There are insn latency/dependency questions, too.)

As per @peteroupc 's concern (thanks also for those links), spending too much time to get safety can undermine overall goals and be a false sense of safety against intelligent attack. True attacks remain possible until you "mix in" information an attacker cannot access (which I try to address in adix, but is far afield of this topic). So, 4 hashes - identity, multiply-rotate, WangYi-ish, and maybe one of those two you mention might give people a good selection on the "fast <-> safe" axis. I will look into all of this and thanks much for the pointers.

@c-blake
Copy link
Contributor

c-blake commented Apr 10, 2020

Also, @tommyettinger that only 4/9 of output effect you mention may be "mosty canceled" by the two rounds of hiXorLo to get 8/9 with avalanche with low bias..No? If SMHasher is not detecting this bias you suspect that is probably an issue to report. Also, what was your sequence giving p just under .05 (p-value is conventionally P(by chance|null) not 1-that)?

@c-blake
Copy link
Contributor

c-blake commented Apr 10, 2020

For example/food for thought, this program:

# Just writes 8 byte binary hashes of 0..<n to stdout for input to PractRand.
import althash, bitops, cligen, cligen/osUt
type HashFun = enum WangYi, MoreMur
proc writeHash(n=99, r=0, fun=WangYi) =
  var h: Hash
  for i in 0 ..< n:
    let x = rotateLeftBits(uint64(i), r)
    case fun
    of WangYi:  h = hashWangYi1(x)
    of MoreMur: h = hashMoreMur(x)
    discard stdout.uriteBuffer(cast[cstring](h.addr), Hash.sizeof.Natural)
dispatch(writeHash, help = { "n": "samples", "r": "rotation",
                             "fun": "hash function"})

and this command (with PractRand-pre0.95 from sourceforge):

./writeHash -n $((1<<32)) | PractRand stdin64 -tf 2

Produce "no anomalies" every time for the first 4 GiNumbers. A good initial indicator, anyway. Just something I ran in 7 minutes. I realize it is not as thorough Pelle Evensen's tests who does like 1000x more than that and then with a bunch of input pattern transforms on top. I am working on reproducing those and tracking down Melissa O'Neill's code (Harvey Mudd's entire CS website was down for me).

@c-blake
Copy link
Contributor

c-blake commented Apr 10, 2020

So, using that same program and doing

for r in {0..63}; { echo ROTATION $r; ./writeHash -r$r -n$((1<<26)) -fWangYi | PractRand stdin64 -tf 2 }

does detect 7 "anomalies" but no actual "failures" (not even "mildly suspicious") at ROTATION's 5, 6, 19, 20, 34, 58, and 60:

  Test Name                         Raw       Processed     Evaluation
  [Low8/64]BCFN(2+0,13-3,T)         R= +10.0  p =  1.6e-4   unusual
  ...and 562 test result(s) without anomalies
--
  Test Name                         Raw       Processed     Evaluation
  [Low8/64]FPF-14+6/16:all          R=  +5.9  p =  4.7e-5   unusual
  ...and 562 test result(s) without anomalies
--
  Test Name                         Raw       Processed     Evaluation
  [Low4/16]BCFN(2+1,13-3,T)         R=  -7.7  p =1-2.6e-4   unusual
  ...and 520 test result(s) without anomalies
--
  Test Name                         Raw       Processed     Evaluation
  [Low1/32]mod3n(5):(4,9-6)         R= +13.8  p =  1.4e-5   unusual
  ...and 520 test result(s) without anomalies
--
  Test Name                         Raw       Processed     Evaluation
  [Low1/8]Gap-16:A                  R=  -5.0  p =1-1.1e-4   unusual
  ...and 562 test result(s) without anomalies
--
  Test Name                         Raw       Processed     Evaluation
  [Low8/32]BCFN(2+1,13-3,T)         R=  +9.8  p =  2.0e-4   unusual
  ...and 562 test result(s) without anomalies
--
  Test Name                         Raw       Processed     Evaluation
  [Low4/16]DC6-9x1Bytes-1           R=  -5.0  p =1-1.5e-3   unusual
  ...and 562 test result(s) without anomalies

Meanwhile your moremur (re-written in Nim):

proc hashMoreMur*(x: int64|uint64|Hash): Hash {.inline.} =
  ## This is from https://github.com/tommyettinger
  var x = uint64(x)
  x = x xor (x shr 27)
  x = x * 0x3C79AC492BA7B653'u64
  x = x xor (x shr 33)
  x = x * 0x1C69B3F74AC4AE35'u64
  x = x xor (x shr 27)
  result = Hash(x)

and this:

for r in {0..63}; { echo ROTATION $r; ./writeHash -r$r -n$((1<<26)) -fMore | PractRand stdin64 -tf 2 }

produces 247 warnings and errors..too many to list outright, but a summary is:

37 unusual
10 mildly_suspicious
14 suspicious
18 very suspicious
22 VERY SUSPICIOUS
146 FAIL

I would have to say, that WangYi seems both faster, smaller, and stronger than MoreMur from this preliminary evaluation. (unless I screwed up translating to Nim. I get hashMoreMur(123456) = -5874516080525399851)

@c-blake
Copy link
Contributor

c-blake commented Apr 10, 2020

Actually, that uint64_t nasam(uint64_t x) you posted gets even more "unusual" flags than my hashWangYi1 above - 12 flags vs the7 of WangYi. So, that means A) Pelle Evensen's tests aren't excluding only "unsual" flags and B) hashWangY1 might well pass his whole suite going against your expecations. I realize I'm only doing 2**26 * 8 = 1/2 a GiB of tests and he's doing a terabyte or whatever, but I don't really feel like waiting 7 days and 26/64 bits isn't so different from 40/64. They're both partial explorations. I didn't try bit reversal, though. Maybe I should just send Pelle the code and have him try it. :-)

@c-blake
Copy link
Contributor

c-blake commented Apr 10, 2020

Also, I'm not sure if any of these PractRand tests (and possibly Melissa's) are Bonferroni corrected. I did grep -i for multiple.c and bonferroni and found nothing. With so many tests you would definitely want to correct for the "omnibus pass/fail conclusion". A single test (or 5% of them) failing at a significance threshold of 5% is to be expected even under the null.

@peteroupc
Copy link

With "Melissa O'Neill's tests", I think you may be referring to the Birthday Problem test at https://github.com/imneme/birthday-problem-test (see also: https://www.pcg-random.org/posts/birthday-test.html ; @imneme).

@c-blake
Copy link
Contributor

c-blake commented Apr 10, 2020

I'm referring to whatever @tommyettinger was referring to. :-) I suspect it was a false positive for badness. It sounded borderline at worst, and the multiple comparisons point definitely matters. But thanks for those links, Peter. I'll take a look later! Cheers

@c-blake
Copy link
Contributor

c-blake commented Apr 10, 2020

I realize I made a performance comment I should maybe have backed up. So, here is a little table I have for 1e6 numbers on 3 computers. Time is min(5 runs) in ns/hash. Compiled with Nim ...d0942d42a1; gcc-10.0.1 with PGO. Source code is in adix/tests/writeHash.nim.

CPU/Hash  WangYi  MoreMur  NASAM
i7-6700k  1.2843  1.4986   1.6379  (4.70 GHz)
amd2950x  1.7293  1.9127   2.2522  (3.78 GHz)
i5-540m   2.9324  3.1191  11.9170  (2.53 GHz)

(That time actually also counts the bit rotation that is part of the test for output randomness and whatever loop overheads if you look at writeHash.nim.) This is only a super-hot-everything benchmark, but it tracks with how the functions "look" to me. I don't see why any of the three would have a particular advantage for various warm-ups.

@brentp
Copy link
Contributor

brentp commented Apr 10, 2020

is there a reason you're not comparing splitmix? that is quite fast for my tests (which include the hashtable insert+lookup). i'm curious how it performs in yours.

@c-blake
Copy link
Contributor

c-blake commented Apr 10, 2020

Nope. No good reason. :-) { Well, other than that there are a lot of hashes out there and I had no ambition of ever doing an exhaustive comparison..just finding one quick, well mixing one that passed a lot of statistical tests. }

@c-blake
Copy link
Contributor

c-blake commented Apr 10, 2020

Speed-wise, splitmix64 from (http://xorshift.di.unimi.it/splitmix64.c) is slower than NASAM on that Skylake/i7-6700k (1.7123 ns = 8 cycles). Not sure I see the point in running stat tests on it since it's the slowest of the bunch. OTOH, you might find some modest 2 cycle gain (in some amortized sense) by moving to hashWangYi1. (EDIT: I got 12 "unusual" PractRand reports for that splitmix64. So, the same as NASAM and more than WangYi1, but none of them are even at the "mildly suspicious" level, except Moremur which breaks all over the place.)

I say "amortized" because, of course, a hot loop like that is really overlaying as much parallel execution as the CPU can discover/use which could be quite a lot. It may be that, e.g., it takes 12 cycles to do one Wang Yi hash but on average there are enough CPU execution units to be doing 2 at a time. I am very aware that almost all CPU/memory system engineering since the early 1990s has made interpreting microbenchmarks harder. So, I try to use careful language. The best benchmark is always your actual application.

@tommyettinger
Copy link

Moremur is meant to be an improved set of constants for the same algo as SplitMix64, it just doesn't include the state increment by 0x9e3779b97f4a7c15 that Vigna's code uses. SplitMix64 has known PractRand failures when that state increment (AKA gamma) is too small (1 and 3 are known problem gammas, as is 0xAAAAAAAAAAAAAAAB), and the actual set of problematic gammas, and to what degree, is unknown. Moremur doesn't fully avoid the problems in splitmix64, and it still absolutely has problem gammas, but it should do better with consecutive states than splitmix64. If you test SplitMix64 with a gamma of 1 instead of 0x9e3779b97f4a7c15 , I'd expect many more than 12 anomalies, and more severe. You can look at the analysis Pelle Evensen already did on SplitMix64 and its relative from MurmurHash3, too: http://mostlymangling.blogspot.com/2018/07/on-mixing-functions-in-fast-splittable.html

The set of numbers I used for the birthday problem test doesn't really matter in a test for the presence of repeated outputs, as long as the inputs are distinct. It does matter that it found any repeats, because it means there are fewer outputs than inputs. I'm trying to find the non-cryptographic hash construction guidelines I've been going by, and I think this was it, from Bob Jenkins: https://burtleburtle.net/bob/hash/evahash.html#Funneling (I'm considering hashWangYi1 a mixing function). I'll have to check on a different computer to see what initial seed the birthday problem test used (it was 32-bit), but it incremented that seed repeatedly by 0x6C8E9CF570932BD5, which makes a reasonable Weyl sequence (needs to generate 2 to the 64 states before it cycles or repeats an input; 0x9e3779b97f4a7c15 is used as the state transition in Vigna's splitmix64.c file), and gave that state to hashWangYi1. Repeatedly giving the output as the input to the next call would be a bad idea; I didn't do that. I can run the test on consecutive u64 as well.

@c-blake
Copy link
Contributor

c-blake commented Apr 10, 2020

I cannot generate a single repeat using that same adix/test/writeHash.nim piped to a new adix/tests/repeats.nim. I'm just using that same 2**26 length input sequence as before with 64 rotations, but that is 4 GiNumbers. At a minimum, if this repeating effect exists at all, it must be tiny or require very specially constructed sequences which probably makes it a non-issue since any hash has some attack pattern.

Also, this follow-up makes your initial report:

generated 50916406 64-bit values given unique inputs to hashWangYi1. Of those, there were 8 repeats, which is twice as many as it expected

hard to understand - 8 repeats being 2x expected, but now you are saying we want zero repeats which is not at all 8/2=4. So, I feel like that test must be "repeats when range-reduced/masked" which is, of course, a different story than the full 64-bit range and definitely requires "repeated trials". And definitely be careful about Melissa's confusing (or enlightend?!?!) tendency to write p-values as their complements (p-value here, 1 - p-value there, etc.).

@c-blake
Copy link
Contributor

c-blake commented Apr 10, 2020

Also, for a non-repeating sequence of inputs mod 2**anyPower, can't we simply use an odd number increment mod 2**thatPower? So, for 2**64 an set of large number|1 increments? I guess what I am thinking is to try many sequences with various odd offsets (perhaps exponentially expanding in size). This is more direct than trying to assess just via worse than Poisson collision rates. I'm also unsure this hash fits the model Bob has for his funneling theorem. It may escape his invertibility condition.

@c-blake
Copy link
Contributor

c-blake commented Apr 11, 2020

Ok. I added a couple lines to adix/tests/writeHash.nim to write in hex format instead of binary and reproduced the greater than random collisions in the full 64-bit address space with (in the interests of reproducibility for anyone who wants):

$ writeHash -H -fW -n$[1<<32] -s7822362180758744021|sort -T. -S$[40<<30]|uniq -d
23367F3A36E60564
3AF61091CD4B167A
$ writeHash -H -fW -n$[1<<32] -s5432154321987654321|sort -T. -S$[40<<30]|uniq -d
41D5793C53CDAD25
5A5ACC1FC095E739
9897D8E21FD55055
CCCBC84FA87326E1
FDE47C0D9C053AE2
$ writeHash -H -fW -n$[1<<32] -s4321543219876543215|sort -T. -S$[40<<30]|uniq -d
<EMPTY>
$

(Related to the above - if you have an 8B-int binary external merge sort, it will use 8/17 the RAM/disk space/bandwidth, but coreutils sort does not do this, AFAIK, and you want "." in the -T. to be /dev/shm or at least an NVMe device)

So, on average for 2**32 entries (2+5+0)/3= 2.33 slots with a collision when I believe one expects k**2/(2N) = 1/2 for the full 64-bit range. So, about 4x too many in this very limited measurement (which took 3 hours, but I think is about a 4 sigma result to @tommyettinger 's 2sigma result). So, this slight elevation of 2-5x is likely real. Maybe that 4/9 gets squared (from nested application) and it's 81./16 =~ 5x.

Anyhow, I don't think it's disqualifying unless it's the tip of a very nasty iceberg which seems unlikely given how it passes SMHasher and PractRand (and how many hours it takes to even measure the defect reliably). If we were targeting Cuckoo hashing with 32-bit hash codes then it might be catastrophic, but it should be a fine default for linear probing/robin hood LP which are robust to collisions in hash codes. It still seems better in a vague/omnibus sense than Tommy's MoreMur recommendation and I'm very sympathetic (before he ever mentioned it) to @peteroupc 's concern about unnecessary randomness optimization. If there weren't 64-bit CPUs where NASAM was 4x slower then that could be the default, but as-is, it makes the most sense (to me) to provide NASAM as a fallback and a few less-random-on-purpose "fall forwards" like identity & RoMu1. (As mentioned, I'm not personally against RoMu1 or identity as defaults if there is an auto-fallback mechanism, as I am working on in adix, or at least a probe depth too large warning mechanism, and I hope to someday get such smarts incorporated into the Nim stdlib.)

Thanks for everyone's participation/help on this, btw, especially @tommyettinger (who I hope to have saved a little time/work by following up like this). Open source is too often thankless.

@peteroupc
Copy link

peteroupc commented Apr 11, 2020

My concern was less about "over-optimization" of hash functions and more about using alternative approaches than the all-too-fashionable SipHash and other keyed hash functions to mitigate security attacks, as long as performance is maintained by using one or more of these approaches. In this sense, I am favorable of the OpenJDK strategy of switching a hash bucket to a binary tree (which has logarithmic time complexity) when it detects a high number of collisions in that bucket.

@c-blake
Copy link
Contributor

c-blake commented Apr 11, 2020

Well, we agree 100% and adix is all about such auto/staged mitigations. I should stop trying to characterize the opinions of others. Sorry. It's often hard to voice support without some pithy reference. :-)

@Clyybber
Copy link
Contributor

Since #13823 is merged, this can be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.