
Commit f16326f

New design: interleave fibonacci skips with finding non-repeating elements
Goal: Hash approximately log(N) entries with a higher density of hashed elements weighted towards the end and special consideration for repeated values. Colliding hashes will often subsequently be compared by equality -- and equality between arrays works elementwise forwards and is short-circuiting. This means that a collision between arrays that differ by elements at the beginning is cheaper than one where the difference is towards the end. Furthermore, blindly choosing log(N) entries from a sparse array will likely only choose the same element repeatedly (zero in this case).

To achieve this, we work backwards, starting by hashing the last element of the array. After hashing each element, we skip the next `fibskip` elements, where `fibskip` is pulled from the Fibonacci sequence -- Fibonacci was chosen as a simple ~O(log(N)) algorithm that ensures we don't hit a common divisor of a dimension and only end up hashing one slice of the array (as might happen with powers of two). Finally, we find the next distinct value from the one we just hashed.
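
To make the skipping pattern concrete, here is a minimal standalone sketch (not the committed code) of which linear indices the Fibonacci skips visit, ignoring the distinct-value search; the helper name `fib_skip_indices` is invented for illustration:

# Collect the ~O(log(N)) linear indices visited by walking backwards from the
# last element and growing the skip along the Fibonacci sequence, mirroring
# the loop structure described above (without the `findprev` refinement).
function fib_skip_indices(N::Integer)
    visited = Int[]
    idx = N
    fibskip = prevfibskip = 1
    while true
        push!(visited, idx)
        idx <= fibskip && break
        idx -= fibskip
        fibskip, prevfibskip = fibskip + prevfibskip, fibskip
    end
    return visited
end

fib_skip_indices(100)  # -> [100, 99, 97, 94, 89, 81, 68, 47, 13]

Note how the sampled indices cluster towards the end of the array, where differences are most expensive for `isequal` to discover.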
1 parent 5aae6ff commit f16326f

File tree: 1 file changed (+39 −33 lines)


base/abstractarray.jl

Lines changed: 39 additions & 33 deletions
@@ -2067,40 +2067,46 @@ function hash(A::AbstractArray, h::UInt)
     h = hash(map(last, axes(A)), h)
     isempty(A) && return h
 
-    # Now work backwards and hash (up to) three distinct key-value pairs
-    # Working backwards introduces an asymmetry with isequal; in many cases
-    # arrays that hash equally will be compared via isequal, which iteratively
-    # works forwards and _short-circuits_. Therefore the elements at the
-    # beginning of the array are not as valuable to include in the hash computation
-    # as they are "cheaper" to compare within `isequal`.
-    # A small number of distinct elements are included in the hashing algorithm
-    # in order to emphasize distinctions between arrays that are nearly all the
-    # same constant value but have a handful of differences the O(log(n)) skipping
-    # algorithm might miss (in particular, this includes sparse matrices).
-    I = keys(A)
-    i = last(I)
-    v1 = A[i]
-    h = hash(i=>v1, h)
-    i = let v1=v1; findprev(x->!isequal(x, v1), A, i); end
-    i === nothing && return h
-    v2 = A[i]
-    h = hash(i=>v2, h)
-    i = let v1=v1, v2=v2; findprev(x->!isequal(x, v1) && !isequal(x, v2), A, i); end
-    i === nothing && return h
-    h = hash(i=>A[i], h)
-
-    # Now launch into an ~O(log(n)) hashing of values, continuing from the
-    # last-found distinct index. The Fibonacci series is used here to avoid
-    # repeating common divisors and potentially only including a single slice
-    # of an array (as might be the case with powers of two and a matrix with
-    # an evenly divisible size).
-    J = vec(I) # Reshape the (potentially cartesian) keys to more efficiently compute the linear skips
-    j = LinearIndices(I)[i]
-    fibskip = prevfibskip = oneunit(j)
-    while j > fibskip
-        j -= fibskip
-        h = hash(A[J[j]], h)
+    # Goal: Hash approximately log(N) entries with a higher density of hashed elements
+    # weighted towards the end and special consideration for repeated values. Colliding
+    # hashes will often subsequently be compared by equality -- and equality between arrays
+    # works elementwise forwards and is short-circuiting. This means that a collision
+    # between arrays that differ by elements at the beginning is cheaper than one where the
+    # difference is towards the end. Furthermore, blindly choosing log(N) entries from a
+    # sparse array will likely only choose the same element repeatedly (zero in this case).
+
+    # To achieve this, we work backwards, starting by hashing the last element of the
+    # array. After hashing each element, we skip the next `fibskip` elements, where
+    # `fibskip` is pulled from the Fibonacci sequence -- Fibonacci was chosen as a simple
+    # ~O(log(N)) algorithm that ensures we don't hit a common divisor of a dimension and
+    # only end up hashing one slice of the array (as might happen with powers of two).
+    # Finally, we find the next distinct value from the one we just hashed.
+
+    # This is a little tricky since skipping an integer number of values inherently works
+    # with linear indices, but `findprev` uses `keys`. Hoist out the conversion "maps":
+    ks = keys(A)
+    key_to_linear = LinearIndices(ks) # Index into this map to compute the linear index
+    linear_to_key = vec(ks) # And vice-versa
+
+    # Start at the last index
+    keyidx = last(ks)
+    linidx = key_to_linear[keyidx]
+    fibskip = prevfibskip = oneunit(linidx)
+    while true
+        # Hash the current key-index and its element
+        elt = A[keyidx]
+        h = hash(keyidx=>elt, h)
+
+        # Skip backwards a Fibonacci number of indices -- this is a linear index operation
+        linidx = key_to_linear[keyidx]
+        linidx <= fibskip && break
+        linidx -= fibskip
+        keyidx = linear_to_key[linidx]
         fibskip, prevfibskip = fibskip + prevfibskip, fibskip
+
+        # Find a key index with a value distinct from `elt` -- might be `keyidx` itself
+        keyidx = findprev(!isequal(elt), A, keyidx)
+        keyidx === nothing && break
     end
 
     return h
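
As an illustrative sanity check (not part of this commit), any rewrite of this method still has to satisfy the documented contract that `isequal` arrays hash equally, regardless of element type or array wrapper; a hypothetical check under that assumption:

# The generic AbstractArray hash only depends on the axes and the selected
# key => value pairs, so isequal arrays must produce the same hash.
v = [0.0, 0.0, 0.0, 1.0]
w = view([0.0, 0.0, 0.0, 1.0, 9.9], 1:4)   # same values and axes, different wrapper
@assert isequal(v, w) && hash(v) == hash(w)
@assert hash([1, 2, 3]) == hash([1.0, 2.0, 3.0])  # isequal elements hash alike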

0 commit comments