
Byte count has poor codegen with autovectorization #136500

Open
@danielhuang


I'm currently writing some code that counts the occurrences of a particular byte (newlines in this example) in a large byte slice (>1 GB):

fn count(b: &[u8]) -> usize {
    b.iter().filter(|&&x| x == b'\n').count()
}

When compiled with -C target-feature=+avx2, autovectorization does emit AVX2 instructions, but the result is still around 2x slower than the bytecount crate.
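A rough way to reproduce this kind of comparison is sketched below. The helpers `make_input` and `time_count` are hypothetical names introduced only for illustration, and a single wall-clock measurement like this is much noisier than a real benchmark harness, so treat any numbers it gives as indicative only.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// The scalar version from this issue.
fn count(b: &[u8]) -> usize {
    b.iter().filter(|&&x| x == b'\n').count()
}

// Hypothetical helper: builds a buffer with one newline every 64 bytes.
fn make_input(len: usize) -> Vec<u8> {
    (0..len)
        .map(|i| if i % 64 == 63 { b'\n' } else { b'a' })
        .collect()
}

// Hypothetical helper: runs `f` once and returns its result and the elapsed time.
// `black_box` keeps the optimizer from removing or precomputing the call.
fn time_count(b: &[u8], f: impl Fn(&[u8]) -> usize) -> (usize, Duration) {
    let start = Instant::now();
    let n = black_box(f(black_box(b)));
    (n, start.elapsed())
}
```

Calling `time_count(&make_input(1 << 30), count)` in a release build (with the same `-C target-feature=+avx2` flag) gives a count and a duration that can be compared against other implementations run on the same buffer.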

Using portable_simd, the code can be made faster:

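// Requires nightly; the feature attribute goes at the crate root.
#![feature(portable_simd)]
use std::simd::prelude::*;

fn count_simd(b: &[u8]) -> usize {
    // Split into an unaligned head, aligned 64-byte vectors, and an unaligned tail.
    let (begin, mid, end) = b.as_simd::<64>();
    count(begin)
        + count(end)
        + mid
            .iter()
            .map(|x| {
                // 1 in each matching lane, 0 elsewhere; the sum is at most 64, so it fits in a u8.
                x.simd_eq(Simd::splat(b'\n'))
                    .select(Simd::splat(1u8), Simd::splat(0u8))
                    .reduce_sum()
            })
            .map(|x| x as usize)
            .sum::<usize>()
}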

This performs similarly to bytecount::count.

https://rust.godbolt.org/z/6b5bTKoa9
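For reference, a variant that stays on stable Rust is sketched below. The idea is to sum matches into a u8 over chunks of at most 255 bytes and widen to usize only once per chunk, which may let the autovectorizer keep 8-bit lanes in the inner loop. The names `count_naive` and `count_chunked` and the chunk length are assumptions made for this illustration; the sketch only shows that the results agree and has not been benchmarked against the versions above.

```rust
// Baseline from this issue, generalized over the byte being searched for.
fn count_naive(b: &[u8], needle: u8) -> usize {
    b.iter().filter(|&&x| x == needle).count()
}

// Illustrative variant: accumulate in u8 within each chunk, widen once per chunk.
fn count_chunked(b: &[u8], needle: u8) -> usize {
    b.chunks(255)
        .map(|chunk| {
            // A chunk holds at most 255 bytes, so the u8 sum cannot overflow.
            chunk.iter().map(|&x| (x == needle) as u8).sum::<u8>() as usize
        })
        .sum()
}
```

Whether this actually closes the gap depends on the LLVM version and target features, so it would need to be measured in the same setup as the functions above.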

    Labels

    A-LLVM (Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.)
    A-autovectorization (Area: Autovectorization, which can impact perf or code size)
    C-optimization (Category: An issue highlighting optimization opportunities or PRs implementing such)
    T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)
