Remove branch in optimized is_ascii #74562

Merged
merged 1 commit into from
Aug 17, 2020

Conversation

pickfire
Contributor

Performs slightly better on short or medium inputs by eliminating
the final branch check on byte_pos == len and always checking the
last word, since the remaining tail is always at most one usize.
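
A minimal sketch of the idea in safe Rust, for illustration only. The name `tail_is_ascii` is hypothetical and not part of this PR, and the real code uses `read_unaligned` on a raw pointer instead of copying into a buffer. Because the slice is already known to be at least one `usize` long, the final `usize` worth of bytes can always be checked, even when it overlaps words that were examined earlier, so no `byte_pos == len` branch is needed.

```rust
use std::mem::size_of;

/// High bit of every byte in a word; a set bit means some byte is >= 128.
const NONASCII_MASK: usize = 0x80808080_80808080u64 as usize;

/// Hypothetical helper: checks the last `size_of::<usize>()` bytes of `s`.
/// Assumes the caller has already verified `s.len() >= size_of::<usize>()`.
fn tail_is_ascii(s: &[u8]) -> bool {
    let tail = &s[s.len() - size_of::<usize>()..];
    // Safe stand-in for the unaligned load used in the PR.
    let mut buf = [0u8; size_of::<usize>()];
    buf.copy_from_slice(tail);
    usize::from_ne_bytes(buf) & NONASCII_MASK == 0
}
```

When the length is an exact multiple of the word size, this simply re-reads the last aligned word, which is the case the removed branch used to skip.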

Benchmark, before libcore, after libcore_new. It improves
medium and short by 1ns but regresses unaligned_tail by 2ns,
either way we can get unaligned_tail have a tiny chance of 1/8
on a 64 bit machine. I don't think we should bet on that, the
probability is worse than dice.

test long::case00_libcore                     ... bench:          38 ns/iter (+/- 1) = 183947 MB/s
test long::case00_libcore_new                 ... bench:          38 ns/iter (+/- 1) = 183947 MB/s
test long::case01_iter_all                    ... bench:         227 ns/iter (+/- 6) = 30792 MB/s
test long::case02_align_to                    ... bench:          40 ns/iter (+/- 1) = 174750 MB/s
test long::case03_align_to_unrolled           ... bench:          19 ns/iter (+/- 1) = 367894 MB/s
test medium::case00_libcore                   ... bench:           5 ns/iter (+/- 0) = 6400 MB/s
test medium::case00_libcore_new               ... bench:           4 ns/iter (+/- 0) = 8000 MB/s
test medium::case01_iter_all                  ... bench:          20 ns/iter (+/- 1) = 1600 MB/s
test medium::case02_align_to                  ... bench:           6 ns/iter (+/- 0) = 5333 MB/s
test medium::case03_align_to_unrolled         ... bench:           5 ns/iter (+/- 0) = 6400 MB/s
test short::case00_libcore                    ... bench:           7 ns/iter (+/- 0) = 1000 MB/s
test short::case00_libcore_new                ... bench:           6 ns/iter (+/- 0) = 1166 MB/s
test short::case01_iter_all                   ... bench:           5 ns/iter (+/- 0) = 1400 MB/s
test short::case02_align_to                   ... bench:           5 ns/iter (+/- 0) = 1400 MB/s
test short::case03_align_to_unrolled          ... bench:           5 ns/iter (+/- 1) = 1400 MB/s
test unaligned_both::case00_libcore           ... bench:           4 ns/iter (+/- 0) = 7500 MB/s
test unaligned_both::case00_libcore_new       ... bench:           4 ns/iter (+/- 0) = 7500 MB/s
test unaligned_both::case01_iter_all          ... bench:          26 ns/iter (+/- 0) = 1153 MB/s
test unaligned_both::case02_align_to          ... bench:          13 ns/iter (+/- 2) = 2307 MB/s
test unaligned_both::case03_align_to_unrolled ... bench:          11 ns/iter (+/- 0) = 2727 MB/s
test unaligned_head::case00_libcore           ... bench:           5 ns/iter (+/- 0) = 6200 MB/s
test unaligned_head::case00_libcore_new       ... bench:           5 ns/iter (+/- 0) = 6200 MB/s
test unaligned_head::case01_iter_all          ... bench:          19 ns/iter (+/- 1) = 1631 MB/s
test unaligned_head::case02_align_to          ... bench:          10 ns/iter (+/- 0) = 3100 MB/s
test unaligned_head::case03_align_to_unrolled ... bench:          14 ns/iter (+/- 0) = 2214 MB/s
test unaligned_tail::case00_libcore           ... bench:           3 ns/iter (+/- 0) = 10333 MB/s
test unaligned_tail::case00_libcore_new       ... bench:           5 ns/iter (+/- 0) = 6200 MB/s
test unaligned_tail::case01_iter_all          ... bench:          19 ns/iter (+/- 0) = 1631 MB/s
test unaligned_tail::case02_align_to          ... bench:          10 ns/iter (+/- 0) = 3100 MB/s
test unaligned_tail::case03_align_to_unrolled ... bench:          13 ns/iter (+/- 0) = 2384 MB/s

Rough (unfair) maths on improvements for fun: 1ns * 7/8 - 2ns * 1/8 = 0.625ns

Inspired by fish and zsh clever trick to highlight missing linefeeds (⏎)
and branchless implementation of binary_search in rust.

cc @thomcc #74066
r? @nagisa

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jul 20, 2020
@thomcc
Member

thomcc commented Jul 20, 2020

either way we can get unaligned_tail have a tiny chance of 1/8 on a 64 bit machine. I don't think we should bet on that, the probability is worse than dice.

I think it's the other way around -- the tail being aligned has a probability of 1/8. I only added a bench for one possible misalignment, but there are several.

Edit: To be clear, I'm not opposed to this change in principle, but I think it's worth keeping in mind that the unaligned tail case is the common case (an argument could be made either way about unaligned head).

Edit 2: It's also worth running these benchmarks on a platform with slower unaligned loads -- this trades an aligned load + a branch for an unaligned load. I'm unsure that this is a good trade.

@pickfire
Contributor Author

I think it's the other way around -- the tail being aligned has a probability of 1/8. I only added a bench for one possible misalignment, but there are several.

Ah, I meant that the tail being aligned has a probability of 1/8. Sorry if I was not clear.

Edit: To be clear, I'm not opposed to this change in principle, but I think it's worth keeping in mind that the unaligned tail case is the common case (an argument could be made either way about unaligned head).

I don't understand how an unaligned tail, with a probability of 1/8, is the common case (> 1/2). Isn't it the other way around? I believe most of these checks run on data that comes from user input or is read from somewhere, and I don't think users will deliberately type lengths that are multiples of 8 just to be aligned for the compiler.

@thomcc
Member

thomcc commented Jul 20, 2020

I don't understand how an unaligned tail, with a probability of 1/8, is the common case (> 1/2). Isn't it the other way around? I believe most of these checks run on data that comes from user input or is read from somewhere, and I don't think users will deliberately type lengths that are multiples of 8 just to be aligned for the compiler.

That's exactly why unaligned tail is the common case. Strings whose length modulo 8 is 1, 2, 3, 4, 5, 6, or 7 all have unaligned tails.
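
As a rough illustration of that point, the sketch below counts how many lengths leave an unaligned tail. The function name `count_unaligned_tails` is made up for this example, and it assumes the slice itself starts on a word boundary, so only the length matters.

```rust
use std::mem::size_of;

/// Counts how many lengths in `1..=max_len` are not a multiple of the word
/// size, i.e. leave an unaligned tail when the slice starts aligned.
fn count_unaligned_tails(max_len: usize) -> usize {
    (1..=max_len)
        .filter(|len| len % size_of::<usize>() != 0)
        .count()
}
```

On a 64-bit target, `count_unaligned_tails(64)` counts 56 of the 64 lengths, which is the 7/8 share described above; only the 8 exact multiples of 8 have an aligned tail.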

@pickfire
Contributor Author

That's exactly why unaligned tail is the common case. Strings whose length modulo 8 is 1, 2, 3, 4, 5, 6, or 7 all have unaligned tails.

Oh, I thought it was the other way around. Sorry, my bad. But still, won't we have a more consistent worst case?

@nagisa
Member

nagisa commented Jul 21, 2020

It improves medium and short by 1ns but regresses unaligned_tail by 2ns

This amount of improvement sounds like it's most likely within the error margin. At this point we would probably need to start measuring cycles (with e.g. llvm-mca), rather than nanoseconds.

Much like @thomcc I’m worried that the impact of having an unconditional unaligned load will move the performance hit outside of the error margin on architectures where unaligned loads are expensive (and possibly simulated in generated code).

One thing we could try to mitigate the cost somewhat is to replace the unaligned load with a plain byte-by-byte loop if we are making that part of code unconditional... though it would probably make the improvement also vanish...

All that said, I guess I'm not super against the idea of landing this, but it does not look like a slam dunk either.
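
For reference, the byte-by-byte alternative mentioned above could look roughly like the sketch below. This is only an illustration: the name `tail_is_ascii_bytewise` is hypothetical, and `byte_pos` is assumed to be the index at which the aligned word loop stopped, as in the PR's code.

```rust
/// Hypothetical variant of the tail check: instead of one unaligned word
/// load, scan the bytes left after the aligned loop one at a time.
/// Assumes `byte_pos <= s.len()`.
fn tail_is_ascii_bytewise(s: &[u8], byte_pos: usize) -> bool {
    s[byte_pos..].iter().all(|b| b.is_ascii())
}
```

With this form the `byte_pos == len` case needs no special handling, since the remaining slice is empty and `all` returns `true`, but the loop reintroduces a branch per byte.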

@thomcc
Member

thomcc commented Jul 22, 2020

One thing we could try to mitigate the cost somewhat is to replace the unaligned load with a plain byte-by-byte loop if we are making that part of code unconditional... though it would probably make the improvement also vanish...

It's not uncommon for these byte-for-byte loops to be slower than the unaligned load on ARM, so I don't know if I'd assume right out of the gate that that's a better option.

@pickfire
Contributor Author

Maybe we should run a benchmark on ARM? Should I try it on my Raspberry Pi and see which is the fastest?

@tesuji
Contributor

tesuji commented Jul 22, 2020

The GCC Compile Farm has ARM machines, but I guess you already know that.

@pickfire
Contributor Author

The GCC Compile Farm has ARM machines, but I guess you already know that.

Oh, I didn't know about that.

@pickfire
Contributor Author

pickfire commented Jul 22, 2020

I ran the same benchmark on an ARM APM X-Gene Mustang board (gcc115 at https://cfarm.tetaneutral.net/machines/list/#). libcore_new is always faster than libcore, but I am surprised that align_to_unrolled performs quite well.

test long::case00_libcore                     ... bench:          63 ns/iter (+/- 0) = 110952 MB/s
test long::case00_libcore_new                 ... bench:          62 ns/iter (+/- 0) = 112741 MB/s
test long::case01_iter_all                    ... bench:         461 ns/iter (+/- 0) = 15162 MB/s
test long::case02_align_to                    ... bench:          65 ns/iter (+/- 0) = 107538 MB/s
test long::case03_align_to_unrolled           ... bench:          49 ns/iter (+/- 0) = 142653 MB/s
test medium::case00_libcore                   ... bench:          11 ns/iter (+/- 0) = 2909 MB/s
test medium::case00_libcore_new               ... bench:           9 ns/iter (+/- 0) = 3555 MB/s
test medium::case01_iter_all                  ... bench:          43 ns/iter (+/- 0) = 744 MB/s
test medium::case02_align_to                  ... bench:          13 ns/iter (+/- 0) = 2461 MB/s
test medium::case03_align_to_unrolled         ... bench:          12 ns/iter (+/- 0) = 2666 MB/s
test short::case00_libcore                    ... bench:          20 ns/iter (+/- 0) = 350 MB/s
test short::case00_libcore_new                ... bench:          15 ns/iter (+/- 0) = 466 MB/s
test short::case01_iter_all                   ... bench:          12 ns/iter (+/- 0) = 583 MB/s
test short::case02_align_to                   ... bench:          13 ns/iter (+/- 0) = 538 MB/s
test short::case03_align_to_unrolled          ... bench:          13 ns/iter (+/- 0) = 538 MB/s
test unaligned_both::case00_libcore           ... bench:          12 ns/iter (+/- 0) = 2500 MB/s
test unaligned_both::case00_libcore_new       ... bench:          10 ns/iter (+/- 0) = 3000 MB/s
test unaligned_both::case01_iter_all          ... bench:          37 ns/iter (+/- 0) = 810 MB/s
test unaligned_both::case02_align_to          ... bench:          31 ns/iter (+/- 0) = 967 MB/s
test unaligned_both::case03_align_to_unrolled ... bench:          28 ns/iter (+/- 0) = 1071 MB/s
test unaligned_head::case00_libcore           ... bench:          11 ns/iter (+/- 0) = 2818 MB/s
test unaligned_head::case00_libcore_new       ... bench:          10 ns/iter (+/- 0) = 3100 MB/s
test unaligned_head::case01_iter_all          ... bench:          43 ns/iter (+/- 0) = 720 MB/s
test unaligned_head::case02_align_to          ... bench:          23 ns/iter (+/- 0) = 1347 MB/s
test unaligned_head::case03_align_to_unrolled ... bench:          31 ns/iter (+/- 0) = 1000 MB/s
test unaligned_tail::case00_libcore           ... bench:          10 ns/iter (+/- 0) = 3100 MB/s
test unaligned_tail::case00_libcore_new       ... bench:           9 ns/iter (+/- 0) = 3444 MB/s
test unaligned_tail::case01_iter_all          ... bench:          43 ns/iter (+/- 0) = 720 MB/s
test unaligned_tail::case02_align_to          ... bench:          22 ns/iter (+/- 0) = 1409 MB/s
test unaligned_tail::case03_align_to_unrolled ... bench:          29 ns/iter (+/- 0) = 1068 MB/s
src/lib.rs
use std::mem;

/// Returns `true` if any byte in the word `v` is nonascii (>= 128). Snarfed
/// from `../str/mod.rs`, which does something similar for utf8 validation.
#[inline]
fn contains_nonascii(v: usize) -> bool {
    const NONASCII_MASK: usize = 0x80808080_80808080u64 as usize;
    (NONASCII_MASK & v) != 0
}

/// Optimized ASCII test that will use usize-at-a-time operations instead of
/// byte-at-a-time operations (when possible).
///
/// The algorithm we use here is pretty simple. If `s` is too short, we just
/// check each byte and be done with it. Otherwise:
///
/// - Read the first word with an unaligned load.
/// - Align the pointer, read subsequent words until end with aligned loads.
/// - If there's a tail, read the last `usize` from `s` with an unaligned load.
///
/// If any of these loads produces something for which `contains_nonascii`
/// (above) returns true, then we know the answer is false.
#[inline]
pub fn is_ascii(s: &[u8]) -> bool {
    const USIZE_SIZE: usize = mem::size_of::<usize>();

    let len = s.len();
    let align_offset = s.as_ptr().align_offset(USIZE_SIZE);

    // If we wouldn't gain anything from the word-at-a-time implementation, fall
    // back to a scalar loop.
    //
    // We also do this for architectures where `size_of::<usize>()` isn't
    // sufficient alignment for `usize`, because it's a weird edge case.
    if len < USIZE_SIZE || len < align_offset || USIZE_SIZE < mem::align_of::<usize>() {
        return s.iter().all(|b| b.is_ascii());
    }

    // We always read the first word unaligned, which means that if
    // `align_offset` is 0, we'd read the same value again for the aligned read.
    let offset_to_aligned = if align_offset == 0 { USIZE_SIZE } else { align_offset };

    let start = s.as_ptr();
    // SAFETY: We verify `len >= USIZE_SIZE` above (shorter slices return early).
    let first_word = unsafe { (start as *const usize).read_unaligned() };

    if contains_nonascii(first_word) {
        return false;
    }
    // We checked this above, somewhat implicitly. Note that `offset_to_aligned`
    // is either `align_offset` or `USIZE_SIZE`, both of which are explicitly
    // checked above.
    debug_assert!(offset_to_aligned <= len);

    // word_ptr is the (properly aligned) usize ptr we use to read the middle chunk of the slice.
    let mut word_ptr = unsafe { start.add(offset_to_aligned) as *const usize };

    // `byte_pos` is the byte index of `word_ptr`, used for loop end checks.
    let mut byte_pos = offset_to_aligned;

    // Paranoia check about alignment, since we're about to do a bunch of
    // aligned loads. In practice a failure should be impossible barring a bug
    // in `align_offset` though.
    debug_assert_eq!((word_ptr as usize) % mem::align_of::<usize>(), 0);

    while byte_pos <= len - USIZE_SIZE {
        debug_assert!(
            // Sanity check that the read is in bounds
            (word_ptr as usize + USIZE_SIZE) <= (start.wrapping_add(len) as usize) &&
            // And that our assumptions about `byte_pos` hold.
            (word_ptr as usize) - (start as usize) == byte_pos
        );

        // Safety: We know `word_ptr` is properly aligned (because of
        // `align_offset`), and we know that we have enough bytes between `word_ptr` and the end
        let word = unsafe { word_ptr.read() };
        if contains_nonascii(word) {
            return false;
        }

        byte_pos += USIZE_SIZE;
        // SAFETY: We know that `byte_pos <= len - USIZE_SIZE`, which means that
        // after this `add`, `word_ptr` will be at most one-past-the-end.
        word_ptr = unsafe { word_ptr.add(1) };
    }

    // If we have anything left over, it should be at-most 1 usize worth of bytes,
    // which we check with a read_unaligned.
    if byte_pos == len {
        return true;
    }

    // Sanity check to ensure there really is only one `usize` left. This should
    // be guaranteed by our loop condition.
    debug_assert!(byte_pos < len && len - byte_pos < USIZE_SIZE);

    // SAFETY: This relies on `len >= USIZE_SIZE`, which we check at the start.
    let last_word = unsafe { (start.add(len - USIZE_SIZE) as *const usize).read_unaligned() };

    !contains_nonascii(last_word)
}

/// Optimized ASCII test that will use usize-at-a-time operations instead of
/// byte-at-a-time operations (when possible).
///
/// The algorithm we use here is pretty simple. If `s` is too short, we just
/// check each byte and be done with it. Otherwise:
///
/// - Read the first word with an unaligned load.
/// - Align the pointer, read subsequent words until end with aligned loads.
/// - Always read the last `usize` from `s` with an unaligned load (it may
///   overlap words that were already checked).
///
/// If any of these loads produces something for which `contains_nonascii`
/// (above) returns true, then we know the answer is false.
#[inline]
pub fn is_ascii_new(s: &[u8]) -> bool {
    const USIZE_SIZE: usize = mem::size_of::<usize>();

    let len = s.len();
    let align_offset = s.as_ptr().align_offset(USIZE_SIZE);

    // If we wouldn't gain anything from the word-at-a-time implementation, fall
    // back to a scalar loop.
    //
    // We also do this for architectures where `size_of::<usize>()` isn't
    // sufficient alignment for `usize`, because it's a weird edge case.
    if len < USIZE_SIZE || len < align_offset || USIZE_SIZE < mem::align_of::<usize>() {
        return s.iter().all(|b| b.is_ascii());
    }

    // We always read the first word unaligned, which means that if
    // `align_offset` is 0, we'd read the same value again for the aligned read.
    let offset_to_aligned = if align_offset == 0 { USIZE_SIZE } else { align_offset };

    let start = s.as_ptr();
    // SAFETY: We verify `len >= USIZE_SIZE` above (shorter slices return early).
    let first_word = unsafe { (start as *const usize).read_unaligned() };

    if contains_nonascii(first_word) {
        return false;
    }
    // We checked this above, somewhat implicitly. Note that `offset_to_aligned`
    // is either `align_offset` or `USIZE_SIZE`, both of which are explicitly
    // checked above.
    debug_assert!(offset_to_aligned <= len);

    // word_ptr is the (properly aligned) usize ptr we use to read the middle chunk of the slice.
    let mut word_ptr = unsafe { start.add(offset_to_aligned) as *const usize };

    // `byte_pos` is the byte index of `word_ptr`, used for loop end checks.
    let mut byte_pos = offset_to_aligned;

    // Paranoia check about alignment, since we're about to do a bunch of
    // aligned loads. In practice a failure should be impossible barring a bug
    // in `align_offset` though.
    debug_assert_eq!((word_ptr as usize) % mem::align_of::<usize>(), 0);

    while byte_pos < len - USIZE_SIZE {
        debug_assert!(
            // Sanity check that the read is in bounds
            (word_ptr as usize + USIZE_SIZE) <= (start.wrapping_add(len) as usize) &&
            // And that our assumptions about `byte_pos` hold.
            (word_ptr as usize) - (start as usize) == byte_pos
        );

        // Safety: We know `word_ptr` is properly aligned (because of
        // `align_offset`), and we know that we have enough bytes between `word_ptr` and the end
        let word = unsafe { word_ptr.read() };
        if contains_nonascii(word) {
            return false;
        }

        byte_pos += USIZE_SIZE;
        // SAFETY: We know that `byte_pos <= len - USIZE_SIZE`, which means that
        // after this `add`, `word_ptr` will be at most one-past-the-end.
        word_ptr = unsafe { word_ptr.add(1) };
    }

    // Sanity check to ensure there really is only one `usize` left. This should
    // be guaranteed by our loop condition.
    debug_assert!(byte_pos <= len && len - byte_pos <= USIZE_SIZE);

    // SAFETY: This relies on `len >= USIZE_SIZE`, which we check at the start.
    let last_word = unsafe { (start.add(len - USIZE_SIZE) as *const usize).read_unaligned() };

    !contains_nonascii(last_word)
}
benches/bench.rs
#![feature(test)]
extern crate test;

use ascii::*;
use test::black_box;
use test::Bencher;

macro_rules! repeat {
    ($s: expr) => {
        concat!($s, $s, $s, $s, $s, $s, $s, $s, $s, $s)
    };
}

const SHORT: &'static str = "Alice's";
const MEDIUM: &'static str = "Alice's Adventures in Wonderland";
const LONG: &'static str = repeat!(
    r#"
    La Guida di Bragia, a Ballad Opera for the Marionette Theatre (around 1850)
    Alice's Adventures in Wonderland (1865)
    Phantasmagoria and Other Poems (1869)
    Through the Looking-Glass, and What Alice Found There
        (includes "Jabberwocky" and "The Walrus and the Carpenter") (1871)
    The Hunting of the Snark (1876)
    Rhyme? And Reason? (1883) – shares some contents with the 1869 collection,
        including the long poem "Phantasmagoria"
    A Tangled Tale (1885)
    Sylvie and Bruno (1889)
    Sylvie and Bruno Concluded (1893)
    Pillow Problems (1893)
    What the Tortoise Said to Achilles (1895)
    Three Sunsets and Other Poems (1898)
    The Manlet (1903)[106]
"#
);

macro_rules! benches {
    ($( fn $name: ident($arg: ident: &[u8]) $body: block )+) => {
        benches!(mod short SHORT[..] $($name $arg $body)+);
        benches!(mod medium MEDIUM[..] $($name $arg $body)+);
        benches!(mod long LONG[..] $($name $arg $body)+);
        // Ensure we benchmark cases where the functions are called with strings
        // that are not perfectly aligned or have a length which is not a
        // multiple of size_of::<usize>() (or both)
        benches!(mod unaligned_head MEDIUM[1..] $($name $arg $body)+);
        benches!(mod unaligned_tail MEDIUM[..(MEDIUM.len() - 1)] $($name $arg $body)+);
        benches!(mod unaligned_both MEDIUM[1..(MEDIUM.len() - 1)] $($name $arg $body)+);
    };

    (mod $mod_name: ident $input: ident [$range: expr] $($name: ident $arg: ident $body: block)+) => {
        mod $mod_name {
            use super::*;
            $(
                #[bench]
                fn $name(bencher: &mut Bencher) {
                    bencher.bytes = $input[$range].len() as u64;
                    let mut vec = $input.as_bytes().to_vec();
                    bencher.iter(|| {
                        let $arg: &[u8] = &black_box(&mut vec)[$range];
                        black_box($body)
                    })
                }
            )+
        }
    };
}

benches! {
    fn case00_libcore(bytes: &[u8]) {
        is_ascii(bytes)
    }

    fn case00_libcore_new(bytes: &[u8]) {
        is_ascii_new(bytes)
    }

    fn case01_iter_all(bytes: &[u8]) {
        bytes.iter().all(|b| b.is_ascii())
    }

    fn case02_align_to(bytes: &[u8]) {
        is_ascii_align_to(bytes)
    }

    fn case03_align_to_unrolled(bytes: &[u8]) {
        is_ascii_align_to_unrolled(bytes)
    }
}

// These are separate since it's easier to debug errors if they don't go through
// macro expansion first.
fn is_ascii_align_to(bytes: &[u8]) -> bool {
    if bytes.len() < core::mem::size_of::<usize>() {
        return bytes.iter().all(|b| b.is_ascii());
    }
    // SAFETY: transmuting a sequence of `u8` to `usize` is always fine
    let (head, body, tail) = unsafe { bytes.align_to::<usize>() };
    head.iter().all(|b| b.is_ascii())
        && body.iter().all(|w| !contains_nonascii(*w))
        && tail.iter().all(|b| b.is_ascii())
}

fn is_ascii_align_to_unrolled(bytes: &[u8]) -> bool {
    if bytes.len() < core::mem::size_of::<usize>() {
        return bytes.iter().all(|b| b.is_ascii());
    }
    // SAFETY: transmuting a sequence of `u8` to `[usize; 2]` is always fine
    let (head, body, tail) = unsafe { bytes.align_to::<[usize; 2]>() };
    head.iter().all(|b| b.is_ascii())
        && body.iter().all(|w| !contains_nonascii(w[0] | w[1]))
        && tail.iter().all(|b| b.is_ascii())
}

#[inline]
fn contains_nonascii(v: usize) -> bool {
    const NONASCII_MASK: usize = 0x80808080_80808080u64 as usize;
    (NONASCII_MASK & v) != 0
}
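
Since the change only moves the loop boundary and drops one early return, one way to gain confidence in it is to compare a candidate against the byte-at-a-time definition over many offsets and lengths. The helper below is a sketch and not part of the benchmark crate; the name `agrees_with_naive` and the 40-byte buffer size are arbitrary choices for illustration.

```rust
/// Hypothetical cross-check: returns true if `candidate` agrees with the
/// byte-at-a-time definition for every tested start offset, end offset, and
/// position of a single non-ASCII byte.
fn agrees_with_naive(candidate: fn(&[u8]) -> bool) -> bool {
    let naive = |s: &[u8]| s.iter().all(|b| b.is_ascii());
    let mut buf = [b'a'; 40];
    for bad in 0..=buf.len() {
        // `bad == buf.len()` leaves the buffer fully ASCII.
        if bad < buf.len() {
            buf[bad] = 0x80;
        }
        for start in 0..buf.len() {
            for end in start..=buf.len() {
                let s = &buf[start..end];
                if candidate(s) != naive(s) {
                    return false;
                }
            }
        }
        // Restore the byte before trying the next position.
        if bad < buf.len() {
            buf[bad] = b'a';
        }
    }
    true
}
```

Calling `agrees_with_naive(is_ascii_new)` and `agrees_with_naive(is_ascii)` from a test would exercise aligned and unaligned heads and tails, including lengths that are exact multiples of the word size.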

Can someone please help benchmark this on a Raspberry Pi? If anything, I think we should benchmark on the slowest machine rather than a fast one.

@bors
Contributor

bors commented Jul 28, 2020

☔ The latest upstream changes (presumably #73265) made this pull request unmergeable. Please resolve the merge conflicts.

@JohnCSimon JohnCSimon added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Aug 11, 2020
@JohnCSimon
Member

@pickfire can you please fix the merge conflicts?

@nagisa
Member

nagisa commented Aug 16, 2020

@bors r+

@bors
Contributor

bors commented Aug 16, 2020

📌 Commit 8ec348a has been approved by nagisa

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Aug 16, 2020
@bors
Contributor

bors commented Aug 16, 2020

⌛ Testing commit 8ec348a with merge 4bb4b96...

@bors
Contributor

bors commented Aug 17, 2020

☀️ Test successful - checks-actions, checks-azure
Approved by: nagisa
Pushing 4bb4b96 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Aug 17, 2020
@bors bors merged commit 4bb4b96 into rust-lang:master Aug 17, 2020
@pickfire pickfire deleted the is_ascii_branchless branch August 17, 2020 06:04
@cuviper cuviper added this to the 1.47.0 milestone May 2, 2024
Labels
merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
8 participants