Skip to content

Remove HashStable impl for collection types with unstable iteration order #533

Closed
@michaelwoerister

Description

@michaelwoerister

Proposal

This MCP has the goal of removing unsound HashStable impls that currently exist for various widely used collection types, in particular collections derived from std::collections::HashMap and std::collections::HashSet (e.g. FxHashMap, DefIdSet, etc), and to restrict the HashStable impls for Ord-based collections (like BTreeMap and SortedMap) to instances with keys that have a stable sort order across compilation session boundaries.

HashMap / HashSet

The iteration order of HashMap and HashSet depends on the hash value of keys. This hash value, however, very often is not stable across compilation session boundaries because it can be computed from memory addresses (e.g. Interned<T>) or unstable IDs (like DefIndex or CrateNum). The current HashStable implementation "solves" this problem by computing an iteration-order-independent fingerprint for these collections -- but this is not sound since the iteration order can influence the outcome of query computation and thus the fingerprint must take it into account. Worst case, this fingerprint inaccuracy can lead to silent miscompilations.

In addition to the problem of hash values, the iteration order of these collections is not defined in any way, i.e. the documentation gives no guarantees. In particular, it is not guaranteed that hash_set.iter() has the same order as decode(encode(hash_set)).iter(), so even for keys with stable hash values (e.g. String), we cannot rely on iteration order staying the same across compilation session boundaries.

What to use instead of HashMap / HashSet

Various forms of HashMap and HashSet are widely used in the compiler. Removing their HashStable implementation means that they cannot be used in query keys and query results anymore. Luckily, there are a couple of good alternatives:

  1. We already use FxIndexMap and FxIndexSet in many places. These are high-performance hashing-based collections with the additional guarantee of having a stable iteration order. Their HashStable implementation is valid and they can be used as a drop-in replacement of HashMap and HashSet in most circumstances.

  2. The MCP proposes to add two new collection types UnorderedMap and UnorderedSet to rustc_data_structures that provide the same interface as other map and set collection types, but that do a better job of hiding iteration order. These types can be a good choice in cases where a collection does not have any ordering at the conceptual level. One advantage they have, is that they "erase" the iteration order and thus can safely be built from non-deterministic sources -- such as a regular FxHashMap or via multi-threaded code.

Interaction with potential_query_instability lint

MCP 465 introduced the potential_query_instability lint, which tries to solve much of the same problem as the changes proposed here. However, as discussed on Zulip, neither the lint, nor the changes here will catch all problematic cases. E.g. with UnorderedMap we can allow safe instances of collection iteration (like map.iter().any(predicate)) in a single place at the API-level, where with the lint each usage instance would have to be audited and #[allow()]ed individually. At the same time, the lint can catch cases where we cannot enforce a HashStable bound, like when using an FxHashMap locally inside a query implementation.

So this MCP does not propose to change anything about the lint (other than applying it more widely :))

Ord-based collections

In a way, Ord-based collections like BTreeMap and SortedMap are in a better position than HashMap because they have a well-defined iteration order. We just have to make sure that the keys being sorted by do not introduce instability. To this end, the MCP proposes to add a StableOrd trait:

/// Trait for marking a type as having a sort order that is
/// stable across compilation session boundaries. More formally:
///
/// Ord::cmp(a1, b1) == Ord:cmp(a2, b2)
///    where a2 = decode(encode(a1, context1), context2)
///          b2 = decode(encode(b1, context1), context2)
///
/// i.e. the result of `Ord::cmp` is not influenced by encoding
/// the values in one session and then decoding them in another
/// session.
/// 
/// This is trivially true for types where encoding and decoding
/// don't change the bytes of the values that are used during 
/// comparison (e.g. u32, String, Path, ...).
///
/// But it is not true for:
///  - `*const T` and `*mut T` because the values of these pointers
///    will change between sessions.
///  - `DefIndex`, `CrateNum`, `LocalDefId`, because their concrete
///    values depend on state that might be different between 
///    compilation sessions.
trait StableOrd : Ord {}

With this trait available, we can implement HashStable only for collection instances where the key has such a stable sort order.

The MCP proposes to implement the trait sparingly, since there is no good, generic way to writing regression tests that make sure a given type has a stable sort order (if you know of one, tell us!).

Mentors or Reviewers

Someone from @rust-lang/wg-incr-comp?

Process

The main points of the Major Change Process are as follows:

  • File an issue describing the proposal.
  • A compiler team member or contributor who is knowledgeable in the area can second by writing @rustbot second.
    • Finding a "second" suffices for internal changes. If however, you are proposing a new public-facing feature, such as a -C flag, then full team check-off is required.
    • Compiler team members can initiate a check-off via @rfcbot fcp merge on either the MCP or the PR.
  • Once an MCP is seconded, the Final Comment Period begins. If no objections are raised after 10 days, the MCP is considered approved.

You can read more about Major Change Proposals on forge.

Comments

This issue is not meant to be used for technical discussion. There is a Zulip stream for that. Use this issue to leave procedural comments, such as volunteering to review, indicating that you second the proposal (or third, etc), or raising a concern that you would like to be addressed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    T-compilerAdd this label so rfcbot knows to poll the compiler teammajor-changeA proposal to make a major change to rustcmajor-change-acceptedA major change proposal that was accepted

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions