Skip to content
This repository has been archived by the owner on Jul 22, 2024. It is now read-only.

Make set hashing robust to bitwise XOR cancellations #20

Merged
merged 1 commit into from
Nov 22, 2020

Conversation

aaronlehmann
Copy link
Contributor

The way sets (maps, structs, slices with the "set" tag) are hashed
relies on an XOR to mix the hashes of their elements. This allows
multiple independent items to interfere with each other through XOR
operations. For example, consider a "set" slice with the following two
structs:

[
  {Key: "a", Value: "v1"},
  {Key: "a", Value: "v2"}
]

Changing both keys from "a" to "b" will not change the hash value of the
overall object, since in either case the two keys will cancel out under
XOR arithmetic.

This change fixes the problem by adding a step after every set of
unordered hash operations that "hardens" the result by applying another
hash. It will prevent components of one unordered hashing operation (for
example, Key: "a" above) from interacting with the same data
encountered later in a different context.

Note that this changes hashes produced by the package for given inputs.

Closes: #18

The way sets (maps, structs, slices with the "set" tag) are hashed
relies on an XOR to mix the hashes of their elements. This allows
multiple independent items to interfere with each other through XOR
operations. For example, consider a "set" slice with the following two
structs:

[
  {Key: "a", Value: "v1"},
  {Key: "a", Value: "v2"}
]

Changing both keys from "a" to "b" will not change the hash value of the
overall object, since in either case the two keys will cancel out under
XOR arithmetic.

This change fixes the problem by adding a step after every set of
unordered hash operations that "hardens" the result by applying another
hash. It will prevent components of one unordered hashing operation (for
example, `Key: "a"` above) from interacting with the same data
encountered later in a different context.

Note that this changes hashes produced by the package for given inputs.

Closes: mitchellh#18
@aaronlehmann
Copy link
Contributor Author

An alternative approach would be to replace the XOR entirely by using the following approach for unordered collections:

  • Hash all the elements
  • Sort those hashes
  • Take an ordered hash over the sorted set of hashes

This has a minor performance cost, but it's easier to demonstrate the correctness, so it might be a better solution. I'd be fine with rewriting it that way.

@mitchellh
Copy link
Owner

Thanks for doing this! Yeah, that XORing always felt... wrong, and I regret not spending more time thinking about the problem. In the back of my mind I always felt there was an issue there, so I'm glad you've articulated it so clearly and pointed it out.

Changing the has could be problematic. One way we might want to do this is via semver and just bumping the major version and documenting that as the backwards incompatibility.

Note I haven't reviewed the code at all yet, just thinking of the high level.

Thoughts?

@aaronlehmann
Copy link
Contributor Author

I like the semver idea.

I have a private fork of this package that uses cryptographic hashes instead of FNV, to prevent collisions. If you are open to bumping the major version, it's worth considering a switch from uint64 hash values to []byte to support these longer hashes. I'd be happy to contribute the change upstream.

Also, since opening this pull request, I changed my fork to use the approach I proposed above with sorting hashes instead of using XOR at all. I like it a lot better. Let me know if you would like me to submit that version instead of reviewing what I have here.

@mitchellh
Copy link
Owner

Almost every project at HashiCorp depends on hashstructure in some way, so let me reach out and talk to those teams and maybe get a few more eyes on this and if we feel comfortable with the SemVer change (I think its fine, we're switching to Go Modules anyways).

@F21
Copy link
Contributor

F21 commented Feb 13, 2019

@mitchellh Any updates for this PR?

@mitchellh
Copy link
Owner

Hey @aaronlehmann, I'm finally going to merge this into master. We're going to tag a v2 since hash formats are changing by default. I think going forward we can add support for alternate hash formats including a HashBytes or something to use []byte instead. I'd much prefer []byte now but given this library is in use so many places I wonder if an alternate codepath is required for that to preserve the option for 64-bit int hash values.

@mitchellh mitchellh merged commit d5de409 into mitchellh:master Nov 22, 2020
@mitchellh
Copy link
Owner

Note that master (v2) also changed the Hash API from Hash(value, opts) to Hash(value, format, opts) where format is an enum. This is required. This will let us introduce new formats in the future without forcing a major version bump in the whole library. I also used this to add a FormatV1 so that users can gracefully upgrade from v1 to v2 more easily.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bitwise XOR to combine hashes for set/struct fails miserably even for basic cases
3 participants