Skip to content

Conversation

imartayan
Copy link
Collaborator

As discussed before, rotating by 1 at each step of nthash cause the high bits of the bits of the hash to be very correlated. A larger rotation mitigates this issue (though some other bits will still be correlated), as suggested experimentally here: https://github.com/imartayan/pfp-distribution. I ended up choosing R=25 since:
1/ Along with 7, it's the value such that (R * n) mod 32 takes the longest to be equal to 1 or -1
2/ I think rotating right by 7 (i.e. left by 25) is slightly better when we only keep the upper 16 bits for minimizers, since the 7 lowest bits (which are truncated) will become the 7 highest for the next iteration.

This change will affect all hash values including the default ones, so we might want to mark this as a breaking change (e.g. with version 2.0), and in this case it would probably be relevant to move nthash to a separate crate in the same release (#6).

@RagnarGrootKoerkamp
Copy link
Contributor

rotation direction shouldn't matter: rotating by R going 1 step forward is similar to rotating by -R going 1 step backward, and NtHash is in principle non-directional (although the specific seed values will have a small effect)

@imartayan
Copy link
Collaborator Author

I agree that the direction of the rotation doesn't matter too much, my point was that when we truncate the hash to its upper 16 bits (to store position in the lower bits) then I'd rather pull the new 7 highest bits from the lower bits (which were truncated) than from the high part.

BTW, the only tests that are failing are doctests with hardcoded minimizer positions, nothing to worry about.

@RagnarGrootKoerkamp
Copy link
Contributor

In the end the corralation is going to be the same no matter whether we rotate left or right. But yes, LGTM.
Can you just fix the doctests?
And then at the same time a major version bump and changelog entry.

@RagnarGrootKoerkamp
Copy link
Contributor

This was done in seq-hash: https://github.com/rust-seq/seq-hash/blob/master/src/nthash.rs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants