Skip to content

Add an "ascii character" type to reduce unsafe needs in conversions #179

Closed

Description

Proposal

Problem statement

For individual bytes, the backend can plausibly be expected to optimize things when it knows that a char came from an ASCII-ranged value. However, for compound values, there's no realistic way that codegen backends can track enough information to optimize out UTF-8 validity checks. That leads to lots of "look, it's ASCII" comments on unsafe blocks because the safe equivalents have material performance impact.

We should offer a nice way that people can write such "the problem fundamentally only produces ASCII Strings" code without needing unsafe and without needing spurious UTF-8 validation checks.

After all, to quote std::ascii,

However, at times it makes more sense to only consider the ASCII character set for a specific operation.

Motivation, use-cases

I was reminded about this by this comment in rust-lang/rust#105076:

    pub fn as_str(&self) -> &str {
        // SAFETY: self.data[self.alive] is all ASCII characters.
        unsafe { crate::str::from_utf8_unchecked(self.as_bytes()) }
    }

But I've been thinking about this since this Reddit thread: https://www.reddit.com/r/rust/comments/yaft60/zerocost_iterator_abstractionsnot_so_zerocost/. "base85" encoding is an examplar of problems where problem is fundamentally only producing ASCII. But the code is doing a String::from_utf8(outdata).unwrap() at the end because other options aren't great.

One might say "oh, just build a String as you go" instead, but that doesn't work as well as you'd hope. Pushing a byte onto a Vec<u8> generates substantially less complicated code than pushing one to a String (https://rust.godbolt.org/z/xMYxj5WYr) since a 0..=255 USV might still take 2 bytes in UTF-8. That problem could be worked around with a BString instead, going outside the standard library, but that's not a fix for the whole thing because then there's still a check needed later to get back to a &str or String.

There should be a core type for an individual ASCII character so that having proven to LLVM at the individual character level that things are in-range (which it can optimize well, and does in other similar existing cases today), the library can offer safe O(1) conversions taking advantage of that type-level information.

[Edit 2023-02-16] This conversation on zulip https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/.E2.9C.94.20Iterate.20ASCII-only.20.26str/near/328343781 made me think about this too -- having a type gives clear ways to get "just the ascii characters from a string" using something like .filter_map(AsciiChar::new).

[Edit 2023-04-27] Another conversation on zulip https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/core.3A.3Astr.3A.3Afrom_ascii/near/353452589 about how on embedded the "is ascii" check is much simpler than the "is UTF-8" check, and being able to use that where appropriate can save a bunch of binary size on embedded. cc @kupiakos

Solution sketches

In core::ascii,

/// One of the 128 Unicode characters from U+0000 through U+007F, often known as
/// the [ASCII](https://www.unicode.org/glossary/index.html#ASCII) subset.
///
/// AKA the characters codes from ANSI X3.4-1977, ISO 646-1973,
/// or [NIST FIPS 1-2](https://nvlpubs.nist.gov/nistpubs/Legacy/FIPS/fipspub1-2-1977.pdf).
///
/// # Layout
///
/// This type is guaranteed to have a size and alignment of 1 byte.
#[derive(Copy, Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
#[repr(transparent)]
struct Char(u8 is 0..=127);

impl Debug for Char {}
impl Display for Char {}

impl Char {
    const fn new(c: char) -> Option<Self> {}
    const fn from_u8(x: u8) -> Option<Self> {}
    const unsafe fn from_u8_unchecked(x: u8) -> Self {}
}

impl From<Char> for char {}
impl From<&[Char]> for &str {}

In alloc::string:

impl From<Vec<ascii::Char>> for String {}

^ this From being the main idea of the whole thing

Safe code can Char::new(…).unwrap() since LLVM easily optimizes that for known values (https://rust.godbolt.org/z/haabhb6aq) or they can do it in consts, then use the non-reallocating infallible Froms later if they need Strings or &strs.

Other possibilities

I wouldn't put any of these in an initial PR, but as related things

  • This could be a 128-variant enum with repr(u8). That would allow as casting it, for better or worse.
  • There could be associated constants (or variants) named ACK and DEL and such.
  • Lots of AsRef<str>s are possible, like on ascii::Char itself or arrays/vectors thereof
    • And potentially AsRef<[u8]>s too
  • Additional methods like String::push_ascii
  • More implementations like String: Extend<ascii::Char> or String: FromIterator<ascii::Char>
  • Checked conversions (using the well-known ASCII fast paths) from &str (or &[u8]) back to &[ascii::Char]
  • The base85 example would really like a [u8; N] -> [ascii::Char; N] operation it can use in a const so it can have something like const BYTE_TO_CHAR85: [ascii::Char; 85] = something(b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~").unwrap();. Long-term that's called array::map and unwrapping each value, but without const closures that doesn't work yet -- for now it could open-code it, though.

And of course there's the endless bikeshed on what to call the type in the first place. Would it be worth making it something like ascii::AsciiChar, despite the stuttering name, to avoid Char vs char confusion?

What happens now?

This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

No one assigned

    Labels

    ACP-acceptedAPI Change Proposal is accepted (seconded with no objections)T-libs-apiapi-change-proposalA proposal to add or alter unstable APIs in the standard libraries

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions