Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update/relax str/String utf8 safety docs #134598

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

zachs18
Copy link
Contributor

@zachs18 zachs18 commented Dec 21, 2024

Relax UTF-8 requirements of String as well as some str/String-related functions from "must be UTF-8 or UB" to "invalid UTF-8 is not immediate UB but is unsafe" to match the validity requirement for str.

cc @joshtriplett #71033 (comment)

I wonder if we might be able, in the future, to carefully exclude a few of str's functions from the "library UB" requirements.


Current state:

str currently documents that it's UTF-8 requirement is not a language-level requirement (i.e. having a non-UTF-8 str is not immediately UB, but it can cause later UB because other code can assume that str's are valid UTF-8), though std::str::from_utf8_unchecked currently still has a stronger requirement.

String does not currently have such docs, though some of its associated functions' docs imply it1.


(first commit) This PR relaxes std::str::from_utf8_unchecked's and str::as_bytes_mut's UTF-8 requirements to match that of str itself. (and updates the safety comments and implementation in std::str::from_utf8_unchecked(_mut))

(second commit) This PR also adds an "Invariant" section to String's docs to match str, which documents that

Rust libraries may assume that Strings are always valid UTF-8, just like strs.

Constructing a non-UTF-8 String is not immediate undefined behavior, but any function called on a String may assume that it is valid UTF-8, which means that a non-UTF-8 String can lead to undefined behavior down the road.

(third commit) This PR also adds lists of "exceptions" to str/String's "Invariant" sections, of functions that are safe to call on invalid-UTF-8 str/Strings. The specific list is not set in stone, I mostly just chose functions that don't do any string manipulation. It could also probably be formatted better?, maybe as a list of "categories" with the functions separated by commas, not as their own lines. If this section is added, then removing a (stable) function from it would be a breaking change, and adding new (stable) functions to it would be a new stable API guarantee. Alternately, this guarantee could be added as a short note to the functions' docs individually (with a link to the "Invariants" section) to prevent the list becoming inaccurate (e.g. "This function is safe to call on a str/String containing [invalid UTF-8](link to invariant section)").

As this relaxes stable API requirements, I think this needs a T-libs-api FCP?

@rustbot label T-libs-api

r? @joshtriplett (or other T-libs-api)

Footnotes

  1. String::from_utf8_unchecked's docs say "This function is unsafe because it does not check that the bytes passed to it are valid UTF-8. If this constraint is violated, it may cause memory unsafety issues with future users of the String, as the rest of the standard library assumes that Strings are valid UTF-8." Note that it mentions only future users, implying that calling String::from_utf8_unchecked with invalid UTF-8 alone is not immediate UB.

* core::str::from_utf8_unchecked: doc: do not declare that invalid UTF-8 is immediate UB, analogous to `String::from_utf8_unchecked`.
* core::str::from_utf8_unchecked(_mut): update internal safety comments, and use pointer cast instead of transmute
* str::as_bytes_mut: doc: do not declare that invalid UTF-8 after borrow ends is immediate UB, analogous to `String::as_mut_vec`.
Documents that invalid UTF-8 in a `String` is not immediate UB, but other code may assume that `String`s are valid UTF-8,
so it can lead to UB later.
@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Dec 21, 2024
@rust-log-analyzer

This comment has been minimized.

Comment on lines +367 to +391
/// * `String::as_bytes`
/// * `String::as_bytes_mut`
/// * `String::as_str`
/// * `String::as_mut_str`
/// * `String::as_ptr`
/// * `String::as_mut_ptr`
/// * `String::as_mut_vec`
/// * `String::capacity`
/// * `String::len`
/// * `String::clear`
/// * `<String as Drop>::drop` (i.e. dropping a `String` that contains invalid UTF-8 does not alone cause UB)
/// * `String::leak`
/// * `String::into_boxed_str`
/// * `String::reserve`
/// * `String::reserve_exact`
/// * `String::try_reserve`
/// * `String::try_reserve_exact`
/// * `<String as Deref>::deref`
/// * `<String as DerefMut>::deref_mut`
/// * `<String as AsRef<str>>::as_ref`
/// * `<String as AsMut<str>>::as_mut`
/// * `<String as Borrow<str>>::borrow`
/// * `<String as BorrowMut<str>>::borrow_mut`
/// * `<String as Clone>::clone`
/// * `<String as Clone>::clone_from`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this list should be cut way down.

There should be a couple of things that are the gateways to using it in ways that temporarily break invariants that make this promise, but everything else should be allowed to check if it feels like.

TBH, I'd be tempted to say that the only thing that explicitly allows it is as_mut_vec, since you can always do whatever you wanted on that Vec<u8>.

(Let others make the case why they need anything else.)

@@ -197,10 +202,10 @@ pub const unsafe fn from_utf8_unchecked(v: &[u8]) -> &str {
#[rustc_const_stable(feature = "const_str_from_utf8_unchecked_mut", since = "1.83.0")]
#[rustc_diagnostic_item = "str_from_utf8_unchecked_mut"]
pub const unsafe fn from_utf8_unchecked_mut(v: &mut [u8]) -> &mut str {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this unsafe method has no # Safety section at all :(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm positive there's a lint for this. Is that not enabled in core?

@@ -145,7 +145,10 @@ pub const fn from_utf8_mut(v: &mut [u8]) -> Result<&mut str, Utf8Error> {
///
/// # Safety
///
/// The bytes passed in must be valid UTF-8.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep this requirement, to preserve the freedom for tooling (miri, ubchecks, kani, whatever) to validate it, even if we don't in release builds.

People can always pointer-cast if they want to do turn a &[u8] into &str without involving any standard library methods.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants