Update/relax str/String utf8 safety docs #134598

zachs18 · 2024-12-21T04:07:03Z

Relax UTF-8 requirements of String as well as some str/String-related functions from "must be UTF-8 or UB" to "invalid UTF-8 is not immediate UB but is unsafe" to match the validity requirement for str.

cc @joshtriplett #71033 (comment)

I wonder if we might be able, in the future, to carefully exclude a few of str's functions from the "library UB" requirements.

Current state:

str currently documents that it's UTF-8 requirement is not a language-level requirement (i.e. having a non-UTF-8 str is not immediately UB, but it can cause later UB because other code can assume that str's are valid UTF-8), though std::str::from_utf8_unchecked currently still has a stronger requirement.

String does not currently have such docs, though some of its associated functions' docs imply it¹.

(first commit) This PR relaxes std::str::from_utf8_unchecked's and str::as_bytes_mut's UTF-8 requirements to match that of str itself. (and updates the safety comments and implementation in std::str::from_utf8_unchecked(_mut))

(second commit) This PR also adds an "Invariant" section to String's docs to match str, which documents that

Rust libraries may assume that Strings are always valid UTF-8, just like strs.

Constructing a non-UTF-8 String is not immediate undefined behavior, but any function called on a String may assume that it is valid UTF-8, which means that a non-UTF-8 String can lead to undefined behavior down the road.

(third commit) This PR also adds lists of "exceptions" to str/String's "Invariant" sections, of functions that are safe to call on invalid-UTF-8 str/Strings. The specific list is not set in stone, I mostly just chose functions that don't do any string manipulation. It could also probably be formatted better?, maybe as a list of "categories" with the functions separated by commas, not as their own lines. If this section is added, then removing a (stable) function from it would be a breaking change, and adding new (stable) functions to it would be a new stable API guarantee. Alternately, this guarantee could be added as a short note to the functions' docs individually (with a link to the "Invariants" section) to prevent the list becoming inaccurate (e.g. "This function is safe to call on a str/String containing [invalid UTF-8](link to invariant section)").

As this relaxes stable API requirements, I think this needs a T-libs-api FCP?

@rustbot label T-libs-api

r? @joshtriplett (or other T-libs-api)

String::from_utf8_unchecked's docs say "This function is unsafe because it does not check that the bytes passed to it are valid UTF-8. If this constraint is violated, it may cause memory unsafety issues with future users of the String, as the rest of the standard library assumes that Strings are valid UTF-8." Note that it mentions only future users, implying that calling String::from_utf8_unchecked with invalid UTF-8 alone is not immediate UB. ↩

* core::str::from_utf8_unchecked: doc: do not declare that invalid UTF-8 is immediate UB, analogous to `String::from_utf8_unchecked`. * core::str::from_utf8_unchecked(_mut): update internal safety comments, and use pointer cast instead of transmute * str::as_bytes_mut: doc: do not declare that invalid UTF-8 after borrow ends is immediate UB, analogous to `String::as_mut_vec`.

Documents that invalid UTF-8 in a `String` is not immediate UB, but other code may assume that `String`s are valid UTF-8, so it can lead to UB later.

scottmcm · 2024-12-21T05:54:38Z

library/alloc/src/string.rs

+/// * `String::as_bytes`
+/// * `String::as_bytes_mut`
+/// * `String::as_str`
+/// * `String::as_mut_str`
+/// * `String::as_ptr`
+/// * `String::as_mut_ptr`
+/// * `String::as_mut_vec`
+/// * `String::capacity`
+/// * `String::len`
+/// * `String::clear`
+/// * `<String as Drop>::drop` (i.e. dropping a `String` that contains invalid UTF-8 does not alone cause UB)
+/// * `String::leak`
+/// * `String::into_boxed_str`
+/// * `String::reserve`
+/// * `String::reserve_exact`
+/// * `String::try_reserve`
+/// * `String::try_reserve_exact`
+/// * `<String as Deref>::deref`
+/// * `<String as DerefMut>::deref_mut`
+/// * `<String as AsRef<str>>::as_ref`
+/// * `<String as AsMut<str>>::as_mut`
+/// * `<String as Borrow<str>>::borrow`
+/// * `<String as BorrowMut<str>>::borrow_mut`
+/// * `<String as Clone>::clone`
+/// * `<String as Clone>::clone_from`


I think this list should be cut way down.

There should be a couple of things that are the gateways to using it in ways that temporarily break invariants that make this promise, but everything else should be allowed to check if it feels like.

TBH, I'd be tempted to say that the only thing that explicitly allows it is as_mut_vec, since you can always do whatever you wanted on that Vec<u8>.

(Let others make the case why they need anything else.)

scottmcm · 2024-12-21T05:56:04Z

library/core/src/str/converts.rs

@@ -197,10 +202,10 @@ pub const unsafe fn from_utf8_unchecked(v: &[u8]) -> &str {
 #[rustc_const_stable(feature = "const_str_from_utf8_unchecked_mut", since = "1.83.0")]
 #[rustc_diagnostic_item = "str_from_utf8_unchecked_mut"]
 pub const unsafe fn from_utf8_unchecked_mut(v: &mut [u8]) -> &mut str {


I see this unsafe method has no # Safety section at all :(

I'm positive there's a lint for this. Is that not enabled in core?

scottmcm · 2024-12-21T05:58:42Z

library/core/src/str/converts.rs

@@ -145,7 +145,10 @@ pub const fn from_utf8_mut(v: &mut [u8]) -> Result<&mut str, Utf8Error> {
 ///
 /// # Safety
 ///
-/// The bytes passed in must be valid UTF-8.


I think we should keep this requirement, to preserve the freedom for tooling (miri, ubchecks, kani, whatever) to validate it, even if we don't in release builds.

People can always pointer-cast if they want to do turn a &[u8] into &str without involving any standard library methods.

zachs18 added 3 commits December 20, 2024 21:13

Add Invariant section to String's docs, analogous to str.

803eca4

Documents that invalid UTF-8 in a `String` is not immediate UB, but other code may assume that `String`s are valid UTF-8, so it can lead to UB later.

Document explicit exceptions to str/String's UTF-8 requirements.

bb003e4

rustbot assigned joshtriplett Dec 21, 2024

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Dec 21, 2024

This comment has been minimized.

Sign in to view

scottmcm reviewed Dec 21, 2024

View reviewed changes

Disambiguate str doc-links

a29cb96

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update/relax str/String utf8 safety docs #134598

Update/relax str/String utf8 safety docs #134598

zachs18 commented Dec 21, 2024

This comment has been minimized.

scottmcm Dec 21, 2024

scottmcm Dec 21, 2024

jhpratt Dec 21, 2024

scottmcm Dec 21, 2024

Update/relax str/String utf8 safety docs #134598

Are you sure you want to change the base?

Update/relax str/String utf8 safety docs #134598

Conversation

zachs18 commented Dec 21, 2024

Footnotes

This comment has been minimized.

scottmcm Dec 21, 2024

Choose a reason for hiding this comment

scottmcm Dec 21, 2024

Choose a reason for hiding this comment

jhpratt Dec 21, 2024

Choose a reason for hiding this comment

scottmcm Dec 21, 2024

Choose a reason for hiding this comment