Skip to content

Document current justification for not requiring recursive reference validity (in particular, &mut uninit not being immediate UB) #346

Open
@CAD97

Description

@CAD97

This post is a draft of my understanding of the problem space, and justification for &mut uninit not being UB to hold or pass around between functions.

What is UB?

This is a very brief summary; see the glossary, various blog posts, and #253 for more.

In short, UB is a language-level contract that some situation does not happen. Formally speaking, a program cannot "have UB;" UB is a property of some execution resulting in some behavior considered undefined. Note that this is the formal meaning of undefined; encountering UB retroactively removes any and all guarantees about the program execution1.

Depending on who you ask, UB in C++ may have originally been about allowing implementation-defined behavior and implementations to diverge on how they implement the language. Even if this is the case, though, every commercial C++ compiler uses UB under the modern understanding for optimization, and a language without UB is one that is very difficult if not impossible to optimize2. And more importantly, Rust is not C++.

With the benefit of hindsight from the experience of C/C++ and other language design in the past 50 years, Rust takes a much more deliberate approach to UB. In particular:

  • UB should be detectable. It should be practical to write a perfect sanitizing implementation of the Abstract Machine which can say with certainty whether a specific execution is defined (did not attempt to execute UB).
  • UB should be justified. All other things being equal, it is better for more programs to be defined, because we want it to be reasonable to write programs without UB. As such, making some operation UB should be backed by the properties learned about the program outweighing the cost of developers having to manually prove that invalid operations do not occur.
  • UB should be operational. This ties into the previous two points; an axiomatic assertion of some property prevents clear diagnosis of UB and doesn't serve the language-provided-guarantee property of promising that some set of operations are invalid and will not occur.

Why isn't &mut uninit currently considered UB by Miri?

In short: because it does no operation that is undefined. Expanding on that a little:

  • Validity of memory (e.g. that bytes are not undefined) is asserted when the AM does a typed copy of the memory from one place to another place.
    • The memory making up the reference itself is asserted to be initialized, non-null, and aligned.
  • Validity of borrowing is asserted when references are considered "used."
    • This includes at least when the reference is converted into a place (dereferenced) and when a referenced is used as a function argument, as well as when a function taking the reference returns.
    • Until the memory at the referenced place undergoes a typed copy, its validity is not asserted.

Additionally, writing to an uninitialized place is allowed as this consists of

  • running the drop glue for the place at the given type,
    • If the type does not have drop glue, this trivially does not do a typed copy from the place.
  • then doing a typed copy of the value to write into the place,
  • neither of which assert the validity of the preexisting memory.

Note, however, that writing uninit into &mut init is always UB. This is because this does a typed copy of the written value, which asserts that the written value is initialized.

Why would &mut uninit be considered UB?

There are two operational ways that references to uninitialized memory could be made operationally UB: in borrow validation or during conversion between references and places.

  • Borrow validation: in addition to the retag operations, the memory validity of the place would be asserted.
  • Ref-to-place: whenever a reference is converted into a place (dereferenced), the memory validity of the place would be asserted.
  • Place-to-ref: whenever a place is converted into a reference, the memory validity of the place would be asserted.

But how much memory validity? The easy answer is full memory validity at the referenced type; the minimal answer for the desired property is just that the memory is initialized. However, checking bytes are initialized still requires full type information to know which bytes are potentially padding and thus are allowed to be uninitialized, so full memory validity is simpler to check and cost us nothing extra on the implementation.

There are subtle differences to the properties derivable from when exactly memory validity is asserted, but the purpose of this document is to discuss the fundamental reasons for/against using any of them generally.

What otherwise valid programs are made UB?

There's at least two notable losses, one obvious, and one not so obvious.

The obvious one is just any program using a type like &mut [u8] to reference potentially-uninitialized memory. Many existing implementations of io::Read are written to carefully avoid reading the provided buffer before writing to it, such that they might be used to read into an uninitialized buffer. There is an existing accepted RFC allowing for safely reading into an uninitialized buffer3, but it would be very unfortunate to make nearly all existing code unsound.

The less obvious one is with pointers. Writing into an uninitialized place (via = assignment expression) becomes UB, even if that place is behind a raw pointer. This is because the drop glue semantically calls std::ptr::drop_in_place(&mut place), creating a mutable reference to the place. You can potentially recover writes to places not asserting memory validity of the place by semantically only creating the reference if the place's type has drop glue, but this has further complications around generics (as MIR is produced for the generic function in polymorphic form, and ptr::drop_in_place must be called there). It is perhaps a better idea to use ptr::write for writing into uninitialized memory anyway, but this adds an additional subtle pitfall to what are supposed to be raw pointers with mostly C-like semantics (so long as you don't create any references).

What benefit is to be gained?

What &mut uninit being UB would theoretically provide is that references could be marked as "dereferencable(noundef N)" where they are currently marked dereferencable(N) in the LLVM backend. Pointee memory validation during borrow validation would likely be enough to justify this; validation, and ref-to-place or place-to-ref time could be enough for "dereferencable_on_entry(noundef N)" if reborrowing for a function argument counts as doing a ref-to-place-to-ref conversion (how you would write it in source, &*ref).

However, there is at the time of writing no known optimization benefit to eagerly marking references as pointing to known-init memory, neither practical nor theoretical. This is due to a simple observation: when the memory is read by the source program, it then undergoes a typed copy which asserts that it is memory valid.

So the only potential optimization lost is speculative reads. However, we already justify that references must be dereferencable by the borrow validity rules, so it is perfectly fine to speculatively read memory from behind a reference. It is even valid to make decisions based on the value before it is semantically guaranteed to be read by the source program, so long as the speculative execution can deal with speculation being driven by uninit (e.g. by freezeing it to an arbitrary-but-consistent noundef byte value).

So if there's no optimization benefit to eagerly checking for references pointing to uninit memory, the benefit is solely in diagnosing ill-formed programs. By eagerly checking, the existence of references-to-uninit can be diagnosed when they are created rather than when the uninitialized memory is read4, properly blaming the creator of the reference rather than the place just doing safe reads of a safe reference.

However, this "victim blaming" is actually fundamental to how Miri works to diagnose operational UB. Miri does not (and cannot) understand your library's safety preconditions. The only thing Miri diagnoses is when the code violates the conditions of the Abstract Machine (exhibits UB), and will point at the point where the violation happened as the culprit, even if the bug in the code is instead a far-removed violation of a library's contract invoking "library UB5". Miri only cares about and diagnoses whether a specific execution of a specific complete compilation graph encounters language UB, and using this to show soundness is left to careful application of tests.

Why is this contentious?

It is the author's belief that &mut uninit is somewhat unique in the space of defined-but-unsafe Rust, in that this is a safety invariant on a otherwise very strict primitive type. Of the primitive types,

  • ! is always invalid.
  • () is always valid.
  • Simple numerics iNN, uNN, fNN, only have the trivial noundef validity invariant.
    • (mumbles mumbles provenance; IIUC the current plan is to strip on load and by-value-transmute?)
  • char, bool only have validity invariants.
  • [_; N] and [_] inherit all safety/validity invariants from the contained type.
  • *const _ and *mut _ only have the trivial noundef validity invariant.
    • (mumbles mumbles provenance; IIUC the current plan is to angelically have provided the correct provenance if used?)
  • str has officially been decided to have the validity invariant of [u8] and for valid UTF-8 to be a safety invariant.
    • However, this decision is still somewhat contentious,
    • and technically the decision was made for &str and not str-by-value.
    • The author posits that the reason this is contentious is that it means the str primitive has a nonempty safety invariant
    • and perhaps the resolution to this is to say str isn't (semantically) a primitive, it just looks like one, but is actually just struct str([u8]).
  • Box is... special, but generally not considered as a primitive type outside of the compiler.

References thus are special among primitive types in that they

  • have a nontrivial memory validity invariant,
    • noundef, nonnull, aligned
  • have complicated borrow validity requirements, and
    • see: Stacked Borrows retags
  • have a safety invariant that the pointee is both memory-valid and safe.

This is likely unavoidable, as memory validity being shallow / not following references is itself a very desirable property, both for reasoning about unsafe code and for implementing the sanitizer. But I think this can be resolved as solely a teaching problem. The answer to "can a reference point to uninitialized memory" should be "no*, use MaybeUninit or a wrapper of it," where the asterisk is "unsafe can break the rules, but unless you break the rules, you don't have to worry about it." I think in many ways many people are too eager to be correct to remember when it's okay and even better to put forward a simplifying lie, and then refine later as necessary.

Having an operational model of "what you can do" with unsafe is important for being able to reason about unsafe code and to write complicated unsafe code. But for teaching unsafe, it's almost certainly better to stick to relaxing the safe rules for a significant period.

TL;DR

It is the author's conclusion that:

  • References to uninit primitives are clearly nonproblematic to allow in the opsem.
  • Retags of enums need to do a read if we want "active variant" MaybeUninit tracking.
  • Teaching proper "choose your weakenings" use of unsafe is difficult.

Footnotes

  1. The smallest possible caveat may informally apply here: external synchronization via observable effects. At each point the Abstract Machine does something observable outside of the Abstract Machine (i.e. does FFI e.g. IO), the observable state of the Abstract Machine must be in the state defined by the execution to this point. The only way for UB to retroactively unguarantee the already observed behavior is if it is not a valid execution to stop the AM at the observable point before the UB occurs. This is, however, merely an informal argument; the guarantee does in fact no longer exist, it is just that there is no known way for a compiler to take advantage of this. However, note that not all behavior you may expect to be externally observable necessarily is, and neither is all behavior implemented via FFI. So long as the Abstract Machine has a definition of the operation which does not leave the AM, it is not considered externally observable. The canonical example of this is that while allocation is implemented by calling into the host OS, the AM has an internal definition of allocation and an implementation may arbitrarily call the OS allocation APIs in any defined manner.

  2. You may disagree here, and point to scripting languages like ECMAScript or even Safe Rust as languages that can be optimized while not having UB. But the insight here is that they still have UB; the difference is solely that the UB is statically prevented from happening in the surface language. As soon as you go to lower the higher-level language to another target, the set of syntactically valid possibilities extends beyond that of the surface language, and any operation which would be forbidden in the surface language is UB.

  3. In general, you should prefer using types like &mut [MaybeUninit<u8>] rather than &[u8] for writing into potentially uninitialized memory. Even if the existence of the reference is not in and of itself UB, it is still wildly unsafe, and using types that allow and potentially track uninitialized memory much better describes the semantics of your program and prevent accidentally exposing references to uninit to downstream code (which is still unsound).

  4. It is for this reason that if one of the measures for making &mut uninit UB eagerly were to be taken, the author suggests putting the check on place-to-ref conversions.

  5. We say an execution's use of a library API exhibits "library UB" if it violates the documented preconditions of the library functions. If "library UB" is caused, the library has full "permission" to cause language-level UB at any later point. The program's behavior is still defined unless language UB is encountered.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions