Byte count has poor codegen with autovectorization

I'm currently writing some code that counts the occurrences of a certain byte (newlines in this example) in a large byte slice (>1GB):

```rs
fn count(b: &[u8]) -> usize {
    b.iter().filter(|&&x| x == b'\n').count()
}
```

When compiled with `-C target-feature=+avx2`, autovectorization does emit AVX2 instructions, but the result is still around 2x slower than [bytecount](https://github.com/llogiq/bytecount).
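For anyone who wants to reproduce the comparison, here is a rough timing sketch. `time_count` is a hypothetical helper that is not part of the original report, and a single `Instant` measurement is only a crude estimate; a proper benchmark harness would give more reliable numbers.

```rs
use std::hint::black_box;
use std::time::{Duration, Instant};

fn count(b: &[u8]) -> usize {
    b.iter().filter(|&&x| x == b'\n').count()
}

/// Hypothetical helper: runs `f` once over `b` and returns the count
/// together with the elapsed wall-clock time.
fn time_count(f: impl Fn(&[u8]) -> usize, b: &[u8]) -> (usize, Duration) {
    let start = Instant::now();
    // black_box keeps the optimizer from discarding or precomputing the call.
    let n = black_box(f(black_box(b)));
    (n, start.elapsed())
}
```

Calling `time_count(count, &data)` and then passing another counting function over the same buffer gives two durations to compare, ideally in a release build with the same `-C target-feature` flags.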

Using `portable_simd`, the code can be made faster:

```rs
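// Requires nightly with `#![feature(portable_simd)]` at the crate root.
use std::simd::prelude::*;

fn count_simd(b: &[u8]) -> usize {
    // Split into a scalar head, aligned 64-byte vectors, and a scalar tail.
    let (begin, mid, end) = b.as_simd::<64>();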
    count(begin)
        + count(end)
        + mid
            .iter()
            .map(|x| {
                x.simd_eq(Simd::splat(b'\n'))
                    .select(Simd::splat(1u8), Simd::splat(0u8))
                    .reduce_sum()
            })
            .map(|x| x as usize)
            .sum::<usize>()
}
```

This has similar performance to `bytecount::count`.
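The `portable_simd` version sums matches into a narrow `u8` per 64-byte vector and only then widens to `usize`. A stable-Rust sketch with the same shape is below; `count_chunked` is a hypothetical name, and whether the compiler vectorizes it any better than the iterator version has not been measured here.

```rs
/// Hypothetical stable variant: count matches in a u8 per chunk,
/// then widen to usize once per chunk.
fn count_chunked(b: &[u8]) -> usize {
    let mut total = 0usize;
    // At most 255 bytes per chunk, so the u8 counter cannot overflow.
    for chunk in b.chunks(255) {
        let mut n = 0u8;
        for &x in chunk {
            n += (x == b'\n') as u8;
        }
        total += n as usize;
    }
    total
}
```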

https://rust.godbolt.org/z/6b5bTKoa9

Byte count has poor codegen with autovectorization #136500
