Skip to content

[Foundation] Modernize hashing in Foundation's Swift-only types #23832

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 32 commits into from
Apr 26, 2019

Conversation

lorentey
Copy link
Member

@lorentey lorentey commented Apr 6, 2019

This PR upgrades the Foundation overlay to use the new hashing API introduced in SE-0206. This gets rid of the deprecation warnings for hashValue implementations and brings the overlay in sync with current Swift best practices.

Background

SE-0206 radically changed how hashing works in Swift 4.2+. It is now the stdlib's responsibility to choose a hash function; types implementing Hashable only need to worry about selecting what they want to feed to the hasher. To make this possible, Hashable now requires hashing to be implemented through the hash(into:) method.

The old hashValue property is deprecated as a Hashable requirement, and it's only kept to support source-level compatibility with existing code. Set and Dictionary have switched over to using hash(into:) as their entry point for hashing; they do not directly call hashValue at all.

Beyond simplifying hashing, the intent of SE-0206 is to enable Swift to provide certain guarantees about its quality. In particular: as long as hash(into:) feeds enough data the hasher to unambiguously decide equality, Swift attempts to guarantee that collision attacks won't be possible. For this to work, it is critically important for Hashable implementations to include everything that Equatable.== looks at; and this is especially the case for the basic boundary types that come built-in with Swift, like Data.

The exact hash algorithm used is a private implementation detail of the stdlib; the Hasher API was carefully designed to leak as little information about it as possible, enabling future versions of the stdlib to switch to other algorithms without worrying about ABI compatibility issues.

While random seeding makes the hash algorithm opaque, it is technically possible for Swift code to rely on the particular hash encoding of a stdlib- (or SDK-)provided type. This is highly unlikely to happen in practice, but the longer we wait to fix hashing issues, the more likely it will be that someone starts relying on hashing particulars. For example, I could imagine this to happen through a well-meaning desire to work around a broken hash implementation. It is therefore important to do this sooner rather than later.

Notes

I'll highlight points specific to a particular changes as self-review comments. Here is a list of general remarks:

  1. It is neither required nor desirable for Swift's hashValue to return reproducible values.

  2. We don't need to ever match hash values between Swift types and the Cocoa classes that they bridge with. The act of bridging involves a full rehashing of all Set/Dictionary values.

  3. The only hard requirement is that hashing needs to be consistent with the equivalence classes implemented by == within the same type.

  4. Hashing is meaningless without equality. Equatable only defines equality for values of a single type. Therefore, Hashable makes no requirements or expectations about the hash values produced by values of different types.

    • It is perfectly fine for hash values produced by different types to not match. (0 as Int8).hashValue and (0 as Int16).hashValue will (typically) not be equal in Swift.

    • It is also perfectly fine for values of two distinct types to produce matching hash values. Measurement<UnitLength>.hash(into:) and Measurement<UnitDuration>.hash(into:) are allowed to use the same hash encoding, since there is no way they can ever produce large-scale collisions within the same hash table.

  5. While this is not a hard requirement for user code, for boundary types provided in the stdlib/SDK, we require that hashing isn't just consistent with equality, but that it's equivalent to it. The Swift test suite has checks to actively enforce this -- this is possible through repeatedly salting the hash function.

    "Optimizing" hashing by omitting some of the data compared by == is generally a mistake in Swift, because it completely breaks all guarantees about the strength of hashing, and opens the door to (accidental or deliberate) collision attacks.

    It's perfectly acceptable to hash a gigabyte of data if someone inserts some large value (such as a big collection) as a key in a hash table. Multi-megabyte String keys are easy to protect against; hidden hashing weaknesses aren't.

  1. It would be nice if we could somehow improve hashing in Foundation's Objective-C classes, but this is outside the scope of this PR (or indeed, this repository). However, we should at least make sure Swift programmers can rely on the quality of hashes produced by native Swift value types.

  2. Modernize hashing in Foundation's Swift-only types swift-corelibs-foundation#2118 is the corresponding PR for swift-corelibs-foundation. These PRs should land at the same time.

  3. Replacing hashValue implementations on types with hash(into:) has no effect on the symbols exposed as ABI -- the compiler automatically synthesizes hashValue from hash(into:), and vice versa. (This is not the case in protocol extensions, but we've fixed those in the 4.2/5.0 timeframe.) In the Swift 5.0 ABI, all of these types include definitions for both hashValue and `hash(into:) so we don't need to worry about availability issues.

    The sole exception is URL, which implements hashValue but isn't Hashable. Unfortunately we can't add the conformance in 5.1, but we should add a (versioned) hash(into:) definition to discourage people from trying to implement it on their own. (Edit: URL is not a special case; it's Hashable like the rest. I got mislead by an unrelated error.)

rdar://problem/43394032

lorentey added 4 commits April 4, 2019 13:56
…*all* components

The new definition diverges from the existing hashValue: it feeds all path components to the hasher, rather than just a limited subset.

Hashing all the bits that are compared in == is necessary to ensure proper operation of the hash table.
This replaces the compiler-generated variant (based on hashValue) with a better implementation.
@lorentey lorentey force-pushed the foundation-hashing branch from 5cec1e5 to b711ed9 Compare April 6, 2019 01:07
return Int(bitPattern: __CFHashDouble(_time))
} else { // 10.11 and previous behavior fallback; this must allocate a date to reference the hash value and then throw away the reference
return NSDate(timeIntervalSinceReferenceDate: _time).hash
}
Copy link
Member Author

@lorentey lorentey Apr 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: There is no need to emulate NSDate's hashing behavior, or the behavior of any previous version of Date.hashValue. Swift is very explicit that the values returned by hashValue aren't to be treated as stable, and Swift trains programmers to expect randomized hashes.

Granted, in this case (like with many other hashValues removed by this PR), 5.0 did not vend properly seeded hashes, so in theory someone could be relying on the exact values returned. I believe this is still a low-risk change; and since this is not inlinable code, we have the option to re-add specific hashValue implementations with an SDK check if it turns some of them have ABI impact.

@@ -155,15 +155,11 @@ public struct DateInterval : ReferenceConvertible, Comparable, Hashable, Codable
return false
}

public var hashValue: Int {
var buf: (UInt, UInt) = (UInt(start.timeIntervalSinceReferenceDate), UInt(end.timeIntervalSinceReferenceDate))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the end date here was problematic, since == compares start and duration, not start and end.

end does not always include all the information that's present in duration, so doing this results in weaker hashes.

The new code is careful to match the implementation of ==.

let range = startIndex ..< Swift.min(startIndex + 80, endIndex)
storage.withUnsafeBytes(in: range) {
// To ensure strong hashing, all bytes must be fed into the hasher.
withUnsafeBytes {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the cases where the old implementation deliberately neglected to feed all relevant information into the hasher. This makes it trivial to induce an arbitrary number of collisions, despite all the effort that went into implementing a high-quality universal hash function.

This is particularly a bad idea for Data, which is the standard type for safely exchanging byte buffers across module boundaries. It is reasonable to expect Data to implement the same high quality hashing as Array does; silently doing otherwise is not a good idea.

We previously discussed this issue in #21754.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not convinced about this without Hasher having some guarantees about not being O(n), but I think we should try to set up a meeting to discuss about strategy here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK! I've spun off this change into #23876. Data is already implementing the right APIs, so this is just a matter of changing the implementation.

@@ -36,8 +36,19 @@ public struct Measurement<UnitType : Unit> : ReferenceConvertible, Comparable, E
self.unit = unit
}

public var hashValue: Int {
return Int(bitPattern: __CFHashDouble(value))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original implementation of Measurement.hashValue is not a valid hash, because it is in conflict with ==.

(A version of this change was originally submitted as #20879.)

.ascii,
.utf8,
]
checkHashable(instances, equalityOracle: { $0 == $1 })
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkHashable verifies that instances provided don't violate Equality and Hashable requirements, including the soft requirement for strong hashing. (I.e., that non-equal values must differ in what they feed into the hasher.)

The equalityOracle closure provides the expected equivalence relation over the indices in the collection passed in as the first argument.

expectEqual(AffineTransform.self, type(of: anyHashables[2].base))
expectNotEqual(anyHashables[0], anyHashables[1])
expectEqual(anyHashables[1], anyHashables[2])
checkHashableGroups(values)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkHashableGroups is like checkHashable, but it represents the equivalence relation as nested arrays: each element in values is supposed to contain values of a distinct equivalence class.

I.e., a and bridged(a) must compare equal, but a must not compare equal to any item outside its own array.

@lorentey lorentey requested review from phausler and itaiferber April 6, 2019 01:28
@lorentey
Copy link
Member Author

I submitted a PR for swift-corelibs-foundation with matching changes: swiftlang/swift-corelibs-foundation#2118

@lorentey lorentey marked this pull request as ready for review April 17, 2019 03:26
@lorentey
Copy link
Member Author

@swift-ci please test

@swift-ci

This comment has been minimized.

@swift-ci

This comment has been minimized.

@lorentey lorentey mentioned this pull request Apr 18, 2019
@lorentey
Copy link
Member Author

@swift-ci please test macOS platform

@swift-ci
Copy link
Contributor

Build failed
Swift Test OS X Platform
Git Sha - 13bc567

@lorentey
Copy link
Member Author

@swift-ci please test macOS platform

@lorentey
Copy link
Member Author

@swift-ci please test linux platform

@swift-ci

This comment has been minimized.

@swift-ci

This comment has been minimized.

@lorentey lorentey force-pushed the foundation-hashing branch from 71833f4 to 6647c50 Compare April 24, 2019 06:07
@lorentey
Copy link
Member Author

@swift-ci please test

@swift-ci

This comment has been minimized.

@swift-ci

This comment has been minimized.

@lorentey
Copy link
Member Author

02:13:44 ******************** TEST 'Swift(iphonesimulator-i386) :: stdlib/TestPersonNameComponents.swift' FAILED ********************
...
02:13:44 [ RUN      ] TestPersonNameComponents.test_Hashing
02:13:44 stdout>>> check failed at /Users/buildnode/jenkins/workspace/swift-PR-osx/branch-master/swift/test/stdlib/TestPersonNameComponents.swift, line 54
02:13:44 stdout>>> stacktrace:
02:13:44 stdout>>>   #0: /Users/buildnode/jenkins/workspace/swift-PR-osx/branch-master/swift/stdlib/private/StdlibUnittest/StdlibUnittest.swift:2530
02:13:44 stdout>>> expected: 230665695 (of type Swift.Int)
02:13:44 stdout>>> actual: -422011617 (of type Swift.Int)
02:13:44 stdout>>> hash(into:) expected to match, found to differ
02:13:44 stdout>>> lhs (at index 0): givenName: Kevin familyName: Frank 
02:13:44 stdout>>> rhs (at index 1): givenName: Kevin familyName: Frank 

Hm; NSPersonNameComponents.hash seems not to work correctly on iphonesimulator-i386.

@lorentey
Copy link
Member Author

@swift-ci please test

@swift-ci

This comment has been minimized.

@airspeedswift
Copy link
Member

airspeedswift commented Apr 26, 2019

@parkera @millenomi @itaiferber are we good to go ahead with this? (with the Data part subsetted out for further discussion)

@parkera
Copy link
Contributor

parkera commented Apr 26, 2019

Yup

Copy link
Contributor

@itaiferber itaiferber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest changes LGTM

@lorentey lorentey merged commit d9c166f into swiftlang:master Apr 26, 2019
@lorentey lorentey deleted the foundation-hashing branch April 26, 2019 21:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants