RFC: Add byte and byte string literals #69
Previously: rust-lang/rust#4334 |
Apparently, GitHub’s auto-linking does not apply when rendering in-repo Markdown files.
byte string literals of type `&'static [u8]` (or `[u8]`, post-DST).
They are identical to the existing character and string literals, except that:

* They are prefixed with a `b` (for "binary"), to distinguish them
-1 for `b` as a prefix - I don't see anything more or less binary about these chars/strs than regular ones
`b` is taken from Python, but I’m not especially attached to it. I’d be fine with another syntax. How about one of these? `a'\t'` (`a` for ASCII), `'\t'u8` (the latter doesn’t really work for strings, though)
I prefer the last one. Why wouldn't it work for strings?
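To make the prefix discussion concrete, here is a minimal sketch of how the `b` prefix behaves in present-day Rust. The constant names are hypothetical, and the array-reference type mentioned in the comments reflects the current language rather than the RFC text as originally proposed.

```rust
/// A byte literal: a single `u8`, written with the `b` prefix.
const TAB: u8 = b'\t';

/// A byte string literal. In current Rust its type is a reference to a
/// fixed-size array (`&'static [u8; 5]` here), which coerces to a slice.
const HELLO: &[u8] = b"hello";

/// The same text as an ordinary (UTF-8) string literal, for comparison.
const HELLO_STR: &str = "hello";

fn main() {
    // The byte literal is just the ASCII code of the character.
    assert_eq!(TAB, 9);
    // For ASCII-only content, both literal kinds hold the same bytes;
    // only the type (and the UTF-8 guarantee) differs.
    assert_eq!(HELLO, HELLO_STR.as_bytes());
    println!("{:?}", HELLO);
}
```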
+1 all round. +1 for raw strings, though I would use +1 for removing @nick29581 with raw strings having come since the discussion in rust-lang/rust#4334, |
I didn't know we had support for raw strings, so I feel a bit better about a |
# Unresolved questions

Should there be "raw byte string" literals?
E.g. `pdf_file.write(rb"<< /Title (FizzBuzz \(Part one\)) >>")`
Python precedent is for allowing `br` and forbidding `rb` (syntax error). Also: yes.
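As an illustration of the raw byte string question, the sketch below uses the `br` prefix order as it exists in present-day Rust and compares it with the equivalent escaped form. The constant names are hypothetical; the PDF snippet is the one from the RFC text.

```rust
/// Raw byte string: backslashes are kept as literal bytes,
/// with no escape processing.
const RAW: &[u8] = br"<< /Title (FizzBuzz \(Part one\)) >>";

/// Equivalent non-raw byte string: each backslash has to be doubled.
const ESCAPED: &[u8] = b"<< /Title (FizzBuzz \\(Part one\\)) >>";

fn main() {
    // Both spellings produce exactly the same bytes.
    assert_eq!(RAW, ESCAPED);
    // The raw form contains literal backslash bytes.
    assert!(RAW.contains(&b'\\'));
    println!("{} bytes", RAW.len());
}
```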
I strongly support this RFC. I was actually planning on writing almost exactly the same RFC myself, so thanks @SimonSapin. |
Very strong +1. Every day I spend writing Rust I wish it had byte string literals. |
👍 This seems really nice, regardless of the specific syntax it ends up being. |
I too was going to argue about syntax, but the precedent from Python is good enough for me. +1 on all fronts. An extra +1 to enforcing |
Do we really want to 'overload' |
Any chance to borrow some binary pattern matching stuff from Erlang? I find it very powerful and pleasant to use at the same time, e.g. Erlang bit syntax. |
I do. It follows the precedent of other languages of
It was deliberate to exclude |
Yeah, but it means
What they mean in regular string literals. I mean, really byte string literals are just regular string literals without the UTF-8 invariant and hence a different type, the syntax doesn't need to be completely different. |
I don’t see a problem here. This difference is precisely what makes byte literals different from Unicode literals in the first place…
Meaning "Just assume UTF-8". I’m opposed to this. The point of working with bytes rather than Unicode is that you don’t necessarily know the encoding (other than it’s ASCII-compatible), so assuming a particular encoding is not appropriate. It could cause mojibake or other related bugs. I suppose we have a different vision of what |
I like this as a potential solution for paths. |
Removing |
@pcwalton what about paths? Filenames on Unix are fundamentally bytes that should only be interpreted in some encoding (nowadays often UTF-8, but not always, if you have an external hard drive from 1995). But on Windows they’re UTF-16. (Or maybe UCS-2.) I don’t see how byte literals would help |
How about restricting it (for Unicode literals) to the ASCII range, where it maps to a single UTF-8 byte? |
I can live with that. My main concern is people writing something like So if we can restrict the range allowed by |
One thing though... what purpose would that serve? If we restrict it to the ASCII range, you might as well write |
Same as removing it: avoid the debate of rust-lang/rust#2800
Yeah of course. But you may still want some of the "non-printable" code points of the ASCII range: U+0000 to U+001F and U+007F. |
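The escape being discussed is not named in this extract; assuming it is `\x`, the sketch below shows how present-day Rust treats it: in Unicode string literals `\x` is limited to the ASCII range (0x00 to 0x7F), while in byte string literals it can denote any byte. The constant names are hypothetical.

```rust
/// In a Unicode string literal, `\x` may only name an ASCII code point,
/// which always encodes to a single UTF-8 byte.
const DEL_STR: &str = "\x7F";

/// Non-printable ASCII code points such as U+0000 and U+001F are allowed.
const CONTROLS: &str = "\x00\x1F";

/// In a byte string literal, `\x` may be any byte value.
/// Writing "\x80" in a `&str` literal would be rejected by the compiler.
const HIGH_BYTES: &[u8] = b"\x7F\x80\xFF";

fn main() {
    // One code point, one byte.
    assert_eq!(DEL_STR.len(), 1);
    assert!(CONTROLS.chars().all(|c| c.is_ascii_control()));
    // The byte string holds the raw values, including non-ASCII ones.
    assert_eq!(HIGH_BYTES.to_vec(), vec![0x7Fu8, 0x80, 0xFF]);
}
```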
Thank you for the contribution. Accepted as RFC 23, per https://github.com/mozilla/rust/wiki/Meeting-weekly-2014-06-03. cc rust-lang/rust#14646 |
For the record, I realized while implementing this that the combination of decisions in this RFC has two consequences I did not anticipate:
|
Doesn't sound like a big deal.
Seems reasonable enough; the limitation is there only for raw byte strings, not plain byte strings. And if you want to put unicode chars in a byte string, you are using escapes either way, so there obviously isn't any sensible reason to want to use a raw byte string. In other words, the user can't have it both ways; you can't say "I want |
with the |
@ben0x539 The plan is to remove |
Yeah, what I'm saying is that it isn't entirely because it lets you combine differently typed things into a single block of bytes. |
See #14646 (tracking issue) and rust-lang/rfcs#69. This does not close the tracking issue, as the `bytes!()` macro still needs to be removed. It will be later, after a snapshot is made with the changes in this PR, so that the new syntax can be used when bootstrapping the compiler.
I noticed that current Rust nightly shows a deprecation warning on

```rust
static DATA: &'static [u8] = bytes!(
    0, 0, 0, 0, 0, 0, 0, 3, // # of paths
    0, 8, "/a/b/c/d",
    0, 0, // theoretically possible
    0, 1, "/"
);
```

With byte string literals it will look like this:

```rust
static DATA: &'static [u8] = b"\0\0\0\0\0\0\0\x03\0\x08/a/b/c/d\0\0\0\x01/";
```

It looks awful compared to |
Yeah, I'm not quite convinced that we should remove |
@netvl Like in Unicode strings, you can use "escaped newlines", which resolve to nothing: (note the backslashes at the very end of lines.)

```rust
static DATA: &'static [u8] = b"\0\0\0\0\0\0\0\x03\
                               \0\x08/a/b/c/d\
                               \0\0\
                               \0\x01/";
```

This is only half of what you asked for in that you can’t have comments in the middle of a literal, but I have to say this looks very unusual. Also, out of context I have no idea what this data represents, so I don’t know what syntax makes sense to you. Perhaps we could use Python’s idea that consecutive (byte) string literals are concatenated:

```rust
static DATA: &'static [u8] = b"\0\0\0\0\0\0\0\x03" // # of paths
                             b"\0\x08/a/b/c/d"
                             b"\0\0" // theoretically possible
                             b"\0\x01/";
```

Removing |
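A quick way to check the escaped-newline suggestion is to compare the multi-line spelling with the single-line one. This is a sketch in present-day Rust with hypothetical constant names; it relies on a backslash at the end of a line skipping the newline and the leading whitespace of the next line.

```rust
/// The data from the comment above, written on a single line.
const ONE_LINE: &[u8] = b"\0\0\0\0\0\0\0\x03\0\x08/a/b/c/d\0\0\0\x01/";

/// The same data split with escaped newlines. The trailing backslash
/// drops the line break and the indentation that follows it.
const MULTI_LINE: &[u8] = b"\0\0\0\0\0\0\0\x03\
                            \0\x08/a/b/c/d\
                            \0\0\
                            \0\x01/";

fn main() {
    // Both spellings yield identical bytes.
    assert_eq!(ONE_LINE, MULTI_LINE);
    println!("{} bytes", ONE_LINE.len());
}
```

Comments still cannot appear between the pieces in this form, which is the limitation noted in the comment above.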
Escaped newlines certainly make things better, but still I'm not sure whether a suggestion to keep |
I meant RFCs on this repo or issues on the rust repo. I don’t know which is more appropriate in this case. Maybe chat on IRC with one of the core team to see what they prefer. To recap:
But as I said, this RFC is done as far as I’m concerned. It’ll be up to you to champion something else through the process. I think anything based on macros is more likely to get accepted than a language change. |
There's one use case for |
If it’s not in a "static" context, you can use May I ask why |
But I need it in a static context.
Precisely because the text isn't known to be UTF-8. Many existing APIs just manipulate byte sequences without caring what's in them. Sometimes those APIs will get ASCII data, sometimes UTF-8, sometimes binary data. An existing networking API for example would be wrapped as accepting a My current use case involves talking to an API that takes There needs to be some way to handle that, otherwise there's a hole. You can get non-ASCII text as |
I believe that That said, I won’t try to block the un-deprecation of |
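For the static-context use case raised above, here is a sketch of how non-ASCII text can be obtained as bytes in present-day Rust. It is not what was available when this thread was written: it assumes a compiler where `str::as_bytes` is usable in constant contexts. The names `GREETING` and `GREETING_ESCAPED` are hypothetical.

```rust
/// UTF-8 bytes of non-ASCII text in a static. This compiles in current
/// Rust because `str::as_bytes` is a `const fn`.
static GREETING: &[u8] = "héllo".as_bytes();

/// The same bytes spelled with `\x` escapes in a byte string literal.
/// Writing `b"héllo"` directly is rejected, since byte string literals
/// only accept ASCII characters and escapes.
static GREETING_ESCAPED: &[u8] = b"h\xC3\xA9llo";

fn main() {
    // 'é' (U+00E9) encodes to the two UTF-8 bytes 0xC3 0xA9.
    assert_eq!(GREETING, GREETING_ESCAPED);
    println!("{:?}", GREETING);
}
```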
@SimonSapin Is there a followup we need to do here to fix the issues you identified? |