-
Notifications
You must be signed in to change notification settings - Fork 12.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update/relax str/String utf8 safety docs #134598
base: master
Are you sure you want to change the base?
Conversation
* core::str::from_utf8_unchecked: doc: do not declare that invalid UTF-8 is immediate UB, analogous to `String::from_utf8_unchecked`. * core::str::from_utf8_unchecked(_mut): update internal safety comments, and use pointer cast instead of transmute * str::as_bytes_mut: doc: do not declare that invalid UTF-8 after borrow ends is immediate UB, analogous to `String::as_mut_vec`.
Documents that invalid UTF-8 in a `String` is not immediate UB, but other code may assume that `String`s are valid UTF-8, so it can lead to UB later.
This comment has been minimized.
This comment has been minimized.
/// * `String::as_bytes` | ||
/// * `String::as_bytes_mut` | ||
/// * `String::as_str` | ||
/// * `String::as_mut_str` | ||
/// * `String::as_ptr` | ||
/// * `String::as_mut_ptr` | ||
/// * `String::as_mut_vec` | ||
/// * `String::capacity` | ||
/// * `String::len` | ||
/// * `String::clear` | ||
/// * `<String as Drop>::drop` (i.e. dropping a `String` that contains invalid UTF-8 does not alone cause UB) | ||
/// * `String::leak` | ||
/// * `String::into_boxed_str` | ||
/// * `String::reserve` | ||
/// * `String::reserve_exact` | ||
/// * `String::try_reserve` | ||
/// * `String::try_reserve_exact` | ||
/// * `<String as Deref>::deref` | ||
/// * `<String as DerefMut>::deref_mut` | ||
/// * `<String as AsRef<str>>::as_ref` | ||
/// * `<String as AsMut<str>>::as_mut` | ||
/// * `<String as Borrow<str>>::borrow` | ||
/// * `<String as BorrowMut<str>>::borrow_mut` | ||
/// * `<String as Clone>::clone` | ||
/// * `<String as Clone>::clone_from` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this list should be cut way down.
There should be a couple of things that are the gateways to using it in ways that temporarily break invariants that make this promise, but everything else should be allowed to check if it feels like.
TBH, I'd be tempted to say that the only thing that explicitly allows it is as_mut_vec
, since you can always do whatever you wanted on that Vec<u8>
.
(Let others make the case why they need anything else.)
@@ -197,10 +202,10 @@ pub const unsafe fn from_utf8_unchecked(v: &[u8]) -> &str { | |||
#[rustc_const_stable(feature = "const_str_from_utf8_unchecked_mut", since = "1.83.0")] | |||
#[rustc_diagnostic_item = "str_from_utf8_unchecked_mut"] | |||
pub const unsafe fn from_utf8_unchecked_mut(v: &mut [u8]) -> &mut str { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this unsafe
method has no # Safety
section at all :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm positive there's a lint for this. Is that not enabled in core
?
@@ -145,7 +145,10 @@ pub const fn from_utf8_mut(v: &mut [u8]) -> Result<&mut str, Utf8Error> { | |||
/// | |||
/// # Safety | |||
/// | |||
/// The bytes passed in must be valid UTF-8. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep this requirement, to preserve the freedom for tooling (miri, ubchecks, kani, whatever) to validate it, even if we don't in release builds.
People can always pointer-cast if they want to do turn a &[u8]
into &str
without involving any standard library methods.
Relax UTF-8 requirements of
String
as well as somestr
/String
-related functions from "must be UTF-8 or UB" to "invalid UTF-8 is not immediate UB but is unsafe" to match the validity requirement forstr
.cc @joshtriplett #71033 (comment)
Current state:
str
currently documents that it's UTF-8 requirement is not a language-level requirement (i.e. having a non-UTF-8str
is not immediately UB, but it can cause later UB because other code can assume thatstr
's are valid UTF-8), thoughstd::str::from_utf8_unchecked
currently still has a stronger requirement.String
does not currently have such docs, though some of its associated functions' docs imply it1.(first commit) This PR relaxes
std::str::from_utf8_unchecked
's andstr::as_bytes_mut
's UTF-8 requirements to match that ofstr
itself. (and updates the safety comments and implementation instd::str::from_utf8_unchecked(_mut)
)(second commit) This PR also adds an "Invariant" section to
String
's docs to matchstr
, which documents that(third commit) This PR also adds lists of "exceptions" to
str
/String
's "Invariant" sections, of functions that are safe to call on invalid-UTF-8str
/String
s. The specific list is not set in stone, I mostly just chose functions that don't do any string manipulation. It could also probably be formatted better?, maybe as a list of "categories" with the functions separated by commas, not as their own lines. If this section is added, then removing a (stable) function from it would be a breaking change, and adding new (stable) functions to it would be a new stable API guarantee. Alternately, this guarantee could be added as a short note to the functions' docs individually (with a link to the "Invariants" section) to prevent the list becoming inaccurate (e.g. "This function is safe to call on astr
/String
containing [invalid UTF-8](link to invariant section)").As this relaxes stable API requirements, I think this needs a T-libs-api FCP?
@rustbot label T-libs-api
r? @joshtriplett (or other T-libs-api)
Footnotes
String::from_utf8_unchecked
's docs say "This function is unsafe because it does not check that the bytes passed to it are valid UTF-8. If this constraint is violated, it may cause memory unsafety issues with future users of the String, as the rest of the standard library assumes that Strings are valid UTF-8." Note that it mentions only future users, implying that callingString::from_utf8_unchecked
with invalid UTF-8 alone is not immediate UB. ↩