&str and &[u8] have the same layout #1848

Open: 2 commits to be merged into base: master

@sanbox-irl sanbox-irl commented Jun 7, 2025

Currently, str and [u8] are promised to have the same layout, but &str and &[u8] are not promised to have the same layout. The std currently assumes that they are promised to have the same layout (https://doc.rust-lang.org/src/core/str/converts.rs.html#172), so this change would have no impact beyond codifying what is already in practice. This PR defines &str and &[u8] to have the same layout, though what that layout is continues to be unspecified.
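As a rough illustration of what is being compared here, the sketch below checks, on whatever target it is compiled for, that the unsized values and the references report the same size and alignment. The function names are hypothetical, and the checks only observe what the compiler currently does; they say nothing about what the Reference guarantees.

```rust
use std::mem::{align_of, align_of_val, size_of, size_of_val};

/// Hypothetical helper: do a `str` and the `[u8]` holding the same bytes
/// report the same size and alignment?
pub fn unsized_values_agree(s: &str) -> bool {
    let bytes: &[u8] = s.as_bytes();
    size_of_val(s) == size_of_val(bytes) && align_of_val(s) == align_of_val(bytes)
}

/// Hypothetical helper: do the two reference types report the same size
/// and alignment? This is an observation about the current compiler only.
pub fn references_agree() -> bool {
    size_of::<&str>() == size_of::<&[u8]>() && align_of::<&str>() == align_of::<&[u8]>()
}
```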

There are some further steps here that I didn't take:

  1. Every rule about slices should probably also apply to str. I have added str in several places in the reference where it otherwise referred to slices, but likely the definition of a slice should also simply include str. This is a bigger conversation and frankly unimportant if...
  2. Some version of "Make str into a libcore struct (redux)" (rust#107939) ever gets stabilized. In that case, none of this matters and str would be removed from the reference. This seems to me to be obviously the better choice.

In any case, this PR represents a fairly incrementalist approach.

Thanks for the insight of those on the Zulip thread here

@rustbot rustbot added the S-waiting-on-review Status: The marked PR is awaiting review from a maintainer label Jun 7, 2025
@workingjubilee
Member

@rustbot label: +I-lang-nominated +T-lang

@workingjubilee
Member

@rustbot label: +I-lang-easy-decision

@rustbot
Collaborator

rustbot commented Jun 7, 2025

Unknown labels: I-lang-easy-decision

@chorman0773
Contributor

chorman0773 commented Jun 7, 2025

FTR, the standard library has every right to make assumptions about the implementation of the language beyond what the language guarantees, because it is intrinsically tied to rustc. Not necessarily a point against making a decision here, but I don't think it's a strong point in favour of stabilizing the equivalence either.

@sanbox-irl
Author

FTR, the standard library has every right to make assumptions about the implementation of the language beyond what the language guarantees, because it is intrinsically tied to rustc. Not necessarily a point against making a decision here, but I don't think it's a strong point in favour of stabilizing the equivalence either.

Agreed -- I am actually going to reword this to make it clear that I mean this isn't a change for rustc, only a codification of existing decisions.

@ia0

ia0 commented Jun 9, 2025

The std currently assumes that they are promised to have the same layout (https://doc.rust-lang.org/src/core/str/converts.rs.html#172)

The layout for transmute doesn't matter. I guess the safety comment was about str and [u8] having the same layout. For the transmute what matters is:

  • The size of &str and &[u8] (that's part of the layout, but alignment doesn't matter as explained in the documentation of transmute).
  • The validity invariant of &[u8] must imply the validity invariant of &str (i.e. every valid value of type &[u8] must be a valid value of type &str); this is where the layout of str and [u8] matters (among other things).

That's only for safety. For correctness, we also need valid values of &[u8] to have the same representation at &str for that same value. In other words, the representation relation of &[u8] must be included in the one of &str (it's not enough for their domains to be included; they must map to at least the same values). In practice they are equal, but transmute only needs one direction (the direction of the transmute).

So I'm not sure guaranteeing that &[u8] and &str have the same layout is a correct answer to making the std code look like code that users can write. That said, such a guarantee could be useful for other purposes; I'm just saying that the motivation in OP doesn't seem to justify the change.
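The direction argument can be sketched in user code. This is a hypothetical counterpart to the std conversion, not the std implementation: it uses a raw-pointer cast instead of transmute, so it does not depend on the two reference types having equal size, and it still leaves the UTF-8 requirement to the caller. The names `bytes_as_str` and `checked_bytes_as_str` are made up for illustration.

```rust
/// Hypothetical user-side conversion from bytes to a string slice.
///
/// # Safety
/// `bytes` must be valid UTF-8, the same requirement that
/// `core::str::from_utf8_unchecked` documents.
pub unsafe fn bytes_as_str(bytes: &[u8]) -> &str {
    // The cast keeps the address and the length of the slice pointer.
    // Only the direction bytes -> str needs an extra promise from the
    // caller; str -> bytes is always available through `as_bytes`.
    unsafe { &*(bytes as *const [u8] as *const str) }
}

/// Safe wrapper: validate first, then reuse the unchecked conversion.
pub fn checked_bytes_as_str(bytes: &[u8]) -> Option<&str> {
    if std::str::from_utf8(bytes).is_ok() {
        // SAFETY: the bytes were just validated as UTF-8.
        Some(unsafe { bytes_as_str(bytes) })
    } else {
        None
    }
}
```

The checked wrapper is the part that makes the "same value" half visible: for bytes that pass validation, the resulting `&str` compares equal to the string the bytes encode.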

@workingjubilee
Member

@ia0 What would you suggest for how the documented guarantees should be strengthened to match what the PR author obviously wants?

@ia0

ia0 commented Jun 11, 2025

What would you suggest for how the documented guarantees should be strengthened to match what the PR author obviously wants?

I'm going to assume "what the PR author obviously wants" is "guarantees that transmuting between &str and &[u8] is safe". In that case the title of the PR should be "&str and &[u8] have the same representation relation".

The problem is that the Reference doesn't yet have this concept. Ralf asked for such a concept in #1752 (comment):

Conceptually, what we eventually need is for every type a description of which byte sequences are valid for this type, and which value is represented by each valid byte sequence.

This was lost in the middle of a PR review so I guess it didn't get the attention it could have. But I'm assuming this request has gone through other channels given how critical it is for users writing unsafe code and relying solely on the Reference.

The only type AFAICT that has a representation relation defined at this time is bool under [type.bool.repr] (along with [type.bool.layout] to know it's a single byte, although it's somehow implicit already in [type.bool.repr]).
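For reference, the bool case can be written down as a pair of functions, one per direction of the relation. This is only a sketch with hypothetical names; it relies on the Reference's statement that false has the bit pattern 0x00, true has the bit pattern 0x01, and no other bit pattern is a valid bool.

```rust
/// Value -> byte: which byte represents a given `bool`.
pub fn bool_to_byte(value: bool) -> u8 {
    value as u8
}

/// Byte -> value: which `bool`, if any, a given byte represents.
/// Bytes other than 0x00 and 0x01 are outside the domain of the relation.
pub fn byte_to_bool(byte: u8) -> Option<bool> {
    match byte {
        0x00 => Some(false),
        0x01 => Some(true),
        _ => None,
    }
}
```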

Today, one can somehow get close to the domain of the representation relation (i.e. which byte sequences are valid, but not which value they represent, which matters for unsafe code where some amount of correctness matters) by combining information about the type layout (its size, relative offset of fields, and discriminant representation, but not its alignment) and the bit validity. However the bit validity is currently often underspecified. For example I didn't find the bit validity of u8. I would expect a sentence "all initialized bytes are valid at u8" under [type.numeric.validity]. We would also need something like that for pointers and metadata such that we can express "what the PR author obviously wants" using type layout and bit validity. And ideally, we would also map the representation to the value (unsigned integers would take endianness into account for example, and signed integers would specify two's complement).
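To make the "which value does each byte sequence represent" half concrete for integers, here is a small sketch built on the standard `from_ne_bytes`, `from_le_bytes`, and `from_be_bytes` constructors. It illustrates the idea only and is not wording proposed for the Reference; the function names are made up.

```rust
/// Every initialized byte is some `u8` value, so decoding cannot fail.
pub fn decode_u8(byte: [u8; 1]) -> u8 {
    u8::from_ne_bytes(byte)
}

/// Signed integers use two's complement: 0xFF decodes to -1.
pub fn decode_i8(byte: [u8; 1]) -> i8 {
    i8::from_ne_bytes(byte)
}

/// The same four bytes represent different values depending on endianness.
pub fn decode_u32_le(bytes: [u8; 4]) -> u32 {
    u32::from_le_bytes(bytes)
}

pub fn decode_u32_be(bytes: [u8; 4]) -> u32 {
    u32::from_be_bytes(bytes)
}
```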

In my opinion, it would be better to wait until we have a notion of representation relation, such that all such guarantees for unsafe users can be specified in a uniform way. In the meantime, unsafe users should refer to the Unsafe Code Guidelines and other documentation like MiniRust, but this is somehow in opposition with rust-lang/unsafe-code-guidelines#566 (comment):

Frankly, my $.02 is that we deprecate the UCG as a whole and transition to making guarantees in the Reference (and other documents like the minirust spec for a more programmatic definition).

So maybe adding the representation relation to the Reference should be prioritized?

@sanbox-irl
Author

sanbox-irl commented Jun 11, 2025

I think, though that's a valid goal, your point is really to make a more rigorous definition of "layout", particularly using the term "representation relation" instead with some well defined meaning. A change like that, as you note, would require some large work in the spec (at the minimum, looking at all the usages of "layout") and so I think should come in as a separate PR/RFC. Particularly, I'm comfortable saying "whatever 'layout' means, &str and &[u8] have the same one." I can amend the PR to make that clearer.

How does that sound?

@ia0

ia0 commented Jun 11, 2025

your point is really to make a more rigorous definition of "layout"

That would be an editorial question. I'm not saying that. I'm saying a new concept of "representation relation" (in addition to "layout") should be added. How that's implemented is up to editors of the Reference. There are at least 3 options:

  • Define one or more common concepts shared by both "layout" and "representation relation" to factorize common aspects like size, field offsets, and discriminant. At its extreme, all those subconcepts would be their own concepts. Each type defines its size, its alignment, its field offsets, its ABI, etc independently (properties can be specified on those concepts, e.g. size is a multiple of alignment). Then higher-level concepts such as "layout" or "representation relation" may use those definitions, possibly only packaging them without additional processing like "layout". (If I were to choose an option, it would be this extreme option.)
  • Create a new independent concept of "representation relation". This will repeat some stuff with "layout", but there's already some form of repeating in the Reference (most of it unavoidable like the bool example).
  • Try to butcher one concept into the other. Not sure how this would work, but I don't think it would be better than either of the two options above.

I'm comfortable saying "whatever 'layout' means, &str and &[u8] have the same one."

But how would this solve the transmute problem? What transmute asks is that the sequence of bytes being transmuted is valid at both the source and destination type. The layout of those types is not what you need to satisfy this requirement. You need the validity invariant (aka the domain of the representation relation). The layout is neither sufficient (it needs bit validity) nor necessary (it talks about alignment).

To be clear, I'm not against this change (it seems reasonable to me), I'm just arguing that we should not believe that it will solve the transmute problem. And thus we should document the motivation for this change (assuming changes to the Reference need to be motivated).
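The claim that layout is neither sufficient nor necessary for a transmute can be illustrated with two pairs of ordinary types. This is a sketch under the assumption of a typical target where u32 is more strictly aligned than [u8; 4]; the helper names are hypothetical.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical helper: compare only size and alignment of two types.
pub fn same_size_and_align<A, B>() -> bool {
    size_of::<A>() == size_of::<B>() && align_of::<A>() == align_of::<B>()
}

/// Not necessary: `[u8; 4]` and `u32` usually differ in alignment, yet this
/// transmute is sound, because the sizes match and every initialized
/// 4-byte sequence is a valid `u32`. The value is moved, so the alignment
/// of the source does not matter.
pub fn u32_from_bytes(bytes: [u8; 4]) -> u32 {
    unsafe { std::mem::transmute::<[u8; 4], u32>(bytes) }
}

// Not sufficient: `same_size_and_align::<u8, bool>()` is true, but
// transmuting the byte 2 into a `bool` would be undefined behavior,
// because 2 is not a valid bit pattern for `bool`.
```

In real code `u32::from_ne_bytes` is the idiomatic spelling of the second function; the transmute is kept here only because it is the operation under discussion.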

@sanbox-irl
Author

When I say "layout" (and I suspect when the std says the same), I am including field offsets in that -- that is part of the validity of the transmute. I should be more careful in the text about that, but since "layout" is still used vaguely in the reference, I think it would be best to wait on that.

So the transmute I referenced is saying "the offsets where the pointer and the length are stored are the same in &[u8] and &str". The std can currently say that is true as a point of fact and this RFC is to make that true as a point of reference.

I might not be following you perfectly though -- you seem much more versed in programming language theory -- so let me know if I'm missing your point!

@traviscross traviscross added the P-lang-drag-2 Lang team prioritization drag level 2. label Jun 11, 2025
@ia0

ia0 commented Jun 11, 2025

When I say "layout" (and I suspect when the std says the same), I am including field offsets in that

Yes, layout is "size + align + field offset + discriminant" from layout.intro. I think the problem is on what follows.

that is part of the validity of the transmute

Indeed, &str and &[u8] having the same layout can be used as part of the argument (but doesn't need to).

saying "the offsets where the pointer and the length are stored are the same in &[u8] and &str"

This is not implied by &[u8] and &str having the same layout. The reference doesn't say anything about a possible pointer and metadata field for wide pointers. I would expect to see this under layout.pointer.unsized but it only talks about size and alignment (only giving a precise definition in a note). And if it did, it would also need to talk more precisely about the validity invariant of those fields.

In other words, while the sentence "&[u8] and &str have the same layout" could be used to prove a transmute between those types, it is neither necessary nor sufficient in theory, and can't be used in practice with today's Reference. On the contrary, the sentence "&[u8] and &str have the same validity invariant" is exactly what's needed to prove a transmute between those types. The Reference doesn't have this notion yet.

So you could see this PR as a step towards proving the transmute with the Reference, but it's not a step perfectly aligned with that goal, because it also guarantees something about alignment which is not needed (and currently not guaranteed although true in practice now and most probably always).
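For comparison, user code can already take a string slice apart and rebuild it through library APIs that expose the data pointer and the length, without knowing at which offsets a wide reference stores them. The sketch below uses hypothetical function names; the two-word size check only observes current compiler behavior, which the Reference mentions in a note and tells readers not to rely on.

```rust
use std::mem::size_of;

/// Rebuild a `&str` from its data pointer and length using only library
/// APIs, with no assumption about field order inside the wide reference.
pub fn rebuild(s: &str) -> &str {
    let ptr: *const u8 = s.as_ptr();
    let len: usize = s.len();
    // SAFETY: `ptr` and `len` come from a live `&str`, so they describe
    // `len` initialized bytes of valid UTF-8 that outlive the result.
    unsafe { std::str::from_utf8_unchecked(std::slice::from_raw_parts(ptr, len)) }
}

/// Observation only: wide references currently occupy two machine words.
pub fn wide_ref_is_two_words() -> bool {
    size_of::<&str>() == 2 * size_of::<usize>()
        && size_of::<&[u8]>() == 2 * size_of::<usize>()
}
```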

@workingjubilee
Member

@ia0 Sorry, I do not think you are contributing anything further here. It seems to be an obviously preexisting problem. Please open a PR against the reference to address the concerns you have.

@scottmcm
Member

Given the existing text that

String slices are a UTF-8 representation of characters that have the same layout as slices of type [u8].

then I think any concerns I'd have about what "layout" means exactly would also apply there, so overall I think this guarantee makes sense.

That said, doing it just for &str was surprising to me. Why specifically & but not &mut nor *const nor *mut? Or is that leaning on some other statement that those are already necessarily the same, so doing it for & implicitly does the others?


I'll also cc rust-lang/rfcs#3775, which I think if it lands will necessarily make this guarantee as well.

@sanbox-irl
Author

@scottmcm :

That said, doing it just for &str was surprising to me. Why specifically & but not &mut nor *const nor *mut? Or is that leaning on some other statement that those are already necessarily the same, so doing it for & implicitly does the others?

Exactly -- that mirrors what @kpreid referenced above too. Because all pointers have the same layout:

Pointers and references have the same layout. Mutability of the pointer or reference does not change the layout.

I think we don't need, therefore, to restate that in this section. I think, as @kpreid said, it would be best if we had a term like "all primitive pointers to str have the same layout as all primitive pointers to [u8]" and then a link to that subsection.

However, since we don't currently have any rhetoric for "primitive pointer", I think that can be saved for another RFC which could then be backported to this subsection. In the meantime, I wouldn't mind adding a link to https://doc.rust-lang.org/stable/reference/type-layout.html#r-layout.pointer.intro so readers understand that &str having the same layout as &[u8] implies that &mut, *const, and *mut str have the same layout as their counterparts to [u8].
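The implication can be spot-checked on a given target by comparing size and alignment across all four pointer kinds for both pointee types. This is a sketch with hypothetical helper names; it reports what the compiler does today and is not a substitute for the guarantee being discussed.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical helper: (size, alignment) of a type.
pub fn size_and_align<T>() -> (usize, usize) {
    (size_of::<T>(), align_of::<T>())
}

/// Do all references and raw pointers to `str` and `[u8]` agree with
/// `&[u8]` on size and alignment?
pub fn all_pointer_kinds_agree() -> bool {
    let expected = size_and_align::<&[u8]>();
    let others = [
        size_and_align::<&mut [u8]>(),
        size_and_align::<*const [u8]>(),
        size_and_align::<*mut [u8]>(),
        size_and_align::<&str>(),
        size_and_align::<&mut str>(),
        size_and_align::<*const str>(),
        size_and_align::<*mut str>(),
    ];
    others.iter().all(|observed| *observed == expected)
}
```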

I'll also cc rust-lang/rfcs#3775, which I think if it lands will necessarily make this guarantee as well.

Yes, I think if we're ready to pull the trigger on that RFC, then this PR is naturally included with that. However, assuming accepting that RFC is less imminent, accepting this PR first will simplify that RFC -- it will only have to talk directly about &/&mut/*const/*mut [T] and could simply include a short note that since &/&mut/*const/*mut str have the same layout as the others to [T], that RFC also applies to str. Currently, that RFC both declares that &str and &[u8] have the same layout and describes that layout -- it would be nice for it to only have to do one thing at a time.

Labels
I-lang-nominated P-lang-drag-2 Lang team prioritization drag level 2. S-waiting-on-review Status: The marked PR is awaiting review from a maintainer T-lang Team: Lang
8 participants