Optimize checksum calculation #1065

Open. 3 commits to be merged into base branch main.

caobug commented Jun 19, 2025

Replace slice iteration with indexed access to reduce overhead and improve performance - CPU usage dropped by 8% on Apple M1.


codecov bot commented Jun 19, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.17%. Comparing base (e2b75e3) to head (6f649b0).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1065      +/-   ##
==========================================
- Coverage   81.17%   81.17%   -0.01%     
==========================================
  Files          81       81              
  Lines       28955    28954       -1     
==========================================
- Hits        23503    23502       -1     
  Misses       5452     5452              


datdenkikniet (Contributor) commented Jun 19, 2025

As a question: could you also compare this against an implementation using slice::chunks_exact? If it isn't any slower (or is maybe even faster), it may be worth using, since it's a nice built-in :)

From what I know chunks_exact is also quite amenable to bounds-check elision, which should make it at least equally fast.

Code for reference
    /// Compute an RFC 1071 compliant checksum (without the final complement).
    pub fn data(data: &[u8]) -> u16 {
        let mut accum = 0;

        let mut chunks = data.chunks_exact(2);
        for chunk in &mut chunks {
            accum += u16::from_be_bytes([chunk[0], chunk[1]]) as u32;
        }

        // Add the last remaining odd byte, if any.
        if let Some(data) = chunks.remainder().get(0) {
            accum += (*data as u32) << 8;
        }

        propagate_carries(accum)
    }

ETA: additional question: why does this use the 32-byte chunking? Is it a sort of cache optimization? On my PC the benchmarks run equally quickly when just using data.chunks_exact(2) and getting rid of the chunking.

ETA2: Yes, it does seem that the 32-byte chunking was added to "aid autovectorization" back in 2017. I wonder if that is still valid, since I imagine a lot has changed in the last 8(!) years :) Heh, turns out that slice::chunks_exact wasn't even added until December 6th, 2018, which probably contributes to it not being used here.

ETA3: assuming this is measured using the normal benchmarks, it seems that using bare indexing (so removing the chunking logic and letting the compiler do its thing on the non-chunked version) is actually the fastest. Please try that in your comparison too.
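For reference, a minimal sketch of what the "bare indexing" variant from ETA3 might look like, assuming it simply drops the 32-byte outer loop and walks the slice two bytes at a time. The name `checksum_indexed_flat` is made up for illustration and does not appear in the PR; `propagate_carries` is repeated so that the snippet stands alone.

```rust
/// Fold the carries of the 32-bit accumulator back into 16 bits.
const fn propagate_carries(word: u32) -> u16 {
    let sum = (word >> 16) + (word & 0xffff);
    ((sum >> 16) as u16) + (sum as u16)
}

/// Hypothetical non-chunked variant: plain indexing, no 32-byte outer loop.
/// Like the other variants, it omits the final one's complement.
pub fn checksum_indexed_flat(data: &[u8]) -> u16 {
    let mut accum: u32 = 0;

    // Sum every complete big-endian 16-bit word.
    let mut i = 0;
    while i + 1 < data.len() {
        accum += u16::from_be_bytes([data[i], data[i + 1]]) as u32;
        i += 2;
    }

    // Add the last remaining odd byte, if any, as the high byte of a word.
    if i < data.len() {
        accum += (data[i] as u32) << 8;
    }

    propagate_carries(accum)
}
```

Whether this is actually faster will depend on the compiler version and target CPU, so it is best treated as one more candidate for the benchmark rather than a known winner.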

whitequark (Contributor) left a comment

I'm not sure how to feel about the fact that there's now a combination of indexed and slice accesses in the checksum function--it's definitely a lot harder to read.

caobug (Author) commented Jun 20, 2025

@datdenkikniet @whitequark

I benchmarked three implementations: slice-based (the original), indexed, and chunks_exact. The latter two (indexed and chunks_exact) showed nearly identical performance:

bench_checksum_original:        60.35 ns/iter (+/- 4.56)
bench_checksum_indexed:         41.73 ns/iter (+/- 1.86)
bench_checksum_chunks_exact:    41.86 ns/iter (+/- 0.56)

Note: I did not use NetworkEndian::read_u16 in the new variants. The results suggest it's not the fastest option.
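As a small illustration of what replaces `NetworkEndian::read_u16` in the new variants: network byte order is big-endian, so the standard library's `u16::from_be_bytes` decodes the same 16-bit word. The helper names below are hypothetical and exist only for this sketch.

```rust
/// Hypothetical helper: decode the big-endian word at the start of `d`
/// using only the standard library. Panics if `d` has fewer than 2 bytes.
pub fn read_u16_be(d: &[u8]) -> u16 {
    u16::from_be_bytes([d[0], d[1]])
}

/// The same decoding spelled out with shifts, for comparison.
pub fn read_u16_be_manual(d: &[u8]) -> u16 {
    ((d[0] as u16) << 8) | (d[1] as u16)
}
```

Both helpers return 0x1234 for the bytes `[0x12, 0x34]`. Any speed difference between them and the byteorder call would have to come from how the compiler optimizes the surrounding loop, not from the decoding itself.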

Code:
mod checksum {
    extern crate test;

    use byteorder::{ByteOrder, NetworkEndian};

    #[bench]
    fn bench_checksum_original(b: &mut test::Bencher) {
        let data: Vec<u8> = (0..1460).map(|x| (x % 256) as u8).collect();

        b.iter(|| {
            test::black_box(checksum_original(&data));
        });
    }

    #[bench]
    fn bench_checksum_indexed(b: &mut test::Bencher) {
        let data: Vec<u8> = (0..1460).map(|x| (x % 256) as u8).collect();

        b.iter(|| {
            test::black_box(checksum_indexed(&data));
        });
    }

    #[bench]
    fn bench_checksum_chunks_exact(b: &mut test::Bencher) {
        let data: Vec<u8> = (0..1460).map(|x| (x % 256) as u8).collect();

        b.iter(|| {
            test::black_box(checksum_chunks_exact(&data));
        });
    }

    pub fn checksum_original(mut data: &[u8]) -> u16 {
        let mut accum = 0;

        // For each 32-byte chunk...
        const CHUNK_SIZE: usize = 32;
        while data.len() >= CHUNK_SIZE {
            let mut d = &data[..CHUNK_SIZE];
            // ... take by 2 bytes and sum them.
            while d.len() >= 2 {
                accum += NetworkEndian::read_u16(d) as u32;
                d = &d[2..];
            }

            data = &data[CHUNK_SIZE..];
        }

        // Sum the rest that does not fit the last 32-byte chunk,
        // taking by 2 bytes.
        while data.len() >= 2 {
            accum += NetworkEndian::read_u16(data) as u32;
            data = &data[2..];
        }

        // Add the last remaining odd byte, if any.
        if let Some(&value) = data.first() {
            accum += (value as u32) << 8;
        }

        propagate_carries(accum)
    }

    pub fn checksum_indexed(mut data: &[u8]) -> u16 {
        let mut accum = 0;

        // For each 32-byte chunk...
        const CHUNK_SIZE: usize = 32;
        while data.len() >= CHUNK_SIZE {
            let chunk = &data[..CHUNK_SIZE];
            let mut i = 0;
            // ... take by 2 bytes and sum them.
            while i + 1 < CHUNK_SIZE {
                accum += u16::from_be_bytes([chunk[i], chunk[i + 1]]) as u32;
                i += 2;
            }

            data = &data[CHUNK_SIZE..];
        }

        // Sum the rest that does not fit the last 32-byte chunk,
        // taking by 2 bytes.
        let mut i = 0;
        while i + 1 < data.len() {
            accum += u16::from_be_bytes([data[i], data[i + 1]]) as u32;
            i += 2;
        }

        // Add the last remaining odd byte, if any.
        if i < data.len() {
            accum += (data[i] as u32) << 8;
        }

        propagate_carries(accum)
    }

    pub fn checksum_chunks_exact(data: &[u8]) -> u16 {
        let mut accum = 0;

        // For each 32-byte chunk...
        const CHUNK_SIZE: usize = 32;
        const WORD_SIZE: usize = 2;
        let mut chunks = data.chunks_exact(CHUNK_SIZE);
        for chunk in &mut chunks {
            // ... take by 2 bytes and sum them.
            for pair in chunk.chunks_exact(WORD_SIZE) {
                accum += u16::from_be_bytes([pair[0], pair[1]]) as u32;
            }
        }

        // Sum the rest that does not fit the last 32-byte chunk,
        // taking by 2 bytes.
        let remainder = chunks.remainder();
        let mut word_pairs = remainder.chunks_exact(WORD_SIZE);
        for pair in &mut word_pairs {
            accum += u16::from_be_bytes([pair[0], pair[1]]) as u32;
        }

        // Add the last remaining odd byte, if any.
        if let Some(&byte) = word_pairs.remainder().first() {
            accum += (byte as u32) << 8;
        }

        propagate_carries(accum)
    }

    const fn propagate_carries(word: u32) -> u16 {
        let sum = (word >> 16) + (word & 0xffff);
        ((sum >> 16) as u16) + (sum as u16)
    }
}
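To make the carry handling concrete, here is a small sketch of how `propagate_carries` folds a 32-bit accumulator, plus a hypothetical `finish` helper that applies the final one's complement the functions above leave out. `finish` is not part of the PR; it is only here to show the usual last step of an RFC 1071 style checksum.

```rust
/// Fold a 32-bit sum of 16-bit words into 16 bits (end-around carry).
/// The first line can itself produce a carry, so it is folded once more.
const fn propagate_carries(word: u32) -> u16 {
    let sum = (word >> 16) + (word & 0xffff);
    ((sum >> 16) as u16) + (sum as u16)
}

/// Hypothetical helper: apply the final one's complement that the
/// checksum functions in the benchmark deliberately leave out.
pub fn finish(accum: u32) -> u16 {
    !propagate_carries(accum)
}
```

For the words 0x0001, 0xf203, 0xf4f5, 0xf6f7 the raw sum is 0x2ddf0, which folds to 0xddf2, and `finish` yields 0x220d. Adding that checksum word back to the raw sum gives 0x2fffd, which folds to 0xffff. That is the property a receiver checks.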

datdenkikniet (Contributor) commented Jun 20, 2025

I think CPU optimizations play a big role in how fast this is(n't)... When I run the benchmark on my Ryzen 7 5700X, I get the following, so the new and old impls are practically equally quick:

Bench Result
bench_checksum_chunks_exact 57.20 ns/iter (+/- 2.01)
bench_checksum_indexed 57.96 ns/iter (+/- 1.58)
bench_checksum_original 57.32 ns/iter (+/- 1.35)

Adding a no-big-chunk version of chunks_exact (so it only calls chunks_exact(2) once, with no intermediate chunks_exact(32); see the code below) has this result:

Code
    pub fn checksum_chunks_exact_no_bigchunk(data: &[u8]) -> u16 {
        let mut accum = 0;

        // ... take by 2 bytes and sum them.
        let mut chunks = data.chunks_exact(2);
        for pair in &mut chunks {
            accum += u16::from_be_bytes([pair[0], pair[1]]) as u32;
        }

        // Add the last remaining odd byte, if any.
        if let Some(&byte) = chunks.remainder().first() {
            accum += (byte as u32) << 8;
        }

        propagate_carries(accum)
    }
Bench Result
bench_checksum_chunks_exact_no_bigchunk 64.44 ns/iter (+/- 0.32)

so it's clearly a little slower. However, re-running with RUSTFLAGS="-C target-cpu=native" (native == znver3) turns the picture upside down: the previous impls slow down by almost 2x (I'm quite unsure why), while the no-bigchunk impl gets about 2x faster:

Bench Result
bench_checksum_chunks_exact 100.88 ns/iter (+/- 0.67)
bench_checksum_indexed 106.34 ns/iter (+/- 0.88)
bench_checksum_original 91.44 ns/iter (+/- 0.49)
bench_checksum_chunks_exact_no_bigchunk 33.78 ns/iter (+/- 0.50)

Edit:

In the target-cpu=native benches, the optimizer seems to deal with the manual chunking very poorly (it seems to unroll the loop into 16 iterations or something?), so that is probably why it is slower.

x86 assembly of manually chunked code with `target-cpu=native` (warning: long!)
.LCPI0_1:
        .byte   1
        .byte   0
        .byte   3
        .byte   2
        .byte   5
        .byte   4
        .byte   7
        .byte   6
        .byte   9
        .byte   8
        .byte   11
        .byte   10
        .byte   13
        .byte   12
        .byte   15
        .byte   14
example::checksum_indexed::h5c5cab1f5c9c8f52:
        sub     rsp, 88
        xor     r8d, r8d
        cmp     rsi, 32
        jb      .LBB0_1
        lea     rcx, [rsi - 32]
        xor     r8d, r8d
        cmp     rcx, 992
        jae     .LBB0_4
        mov     rax, rdi
        jmp     .LBB0_7
.LBB0_1:
        mov     rax, rdi
        jmp     .LBB0_9
.LBB0_4:
        shr     rcx, 5
        vpxor   xmm1, xmm1, xmm1
        vpxor   xmm6, xmm6, xmm6
        vpxor   xmm3, xmm3, xmm3
        vpxor   xmm2, xmm2, xmm2
        inc     rcx
        mov     rdx, rcx
        and     rdx, -32
        mov     r8, rdx
        shl     r8, 5
        lea     rax, [rdi + r8]
        sub     rsi, r8
        xor     r8d, r8d
.LBB0_5:
        mov     r9, r8
        shl     r9, 5
        vmovdqu ymmword ptr [rsp - 32], ymm6
        vmovdqu ymmword ptr [rsp - 96], ymm1
        add     r8, 32
        movzx   r10d, byte ptr [rdi + r9]
        vmovd   xmm0, r10d
        movzx   r10d, byte ptr [rdi + r9 + 256]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 32], 1
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 64], 2
        vmovd   xmm4, r10d
        movzx   r10d, byte ptr [rdi + r9 + 512]
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 288], 1
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 96], 3
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 320], 2
        vmovd   xmm5, r10d
        movzx   r10d, byte ptr [rdi + r9 + 768]
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 544], 1
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 128], 4
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 352], 3
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 576], 2
        vmovd   xmm6, r10d
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 800], 1
        movzx   r10d, byte ptr [rdi + r9 + 1]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 160], 5
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 384], 4
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 608], 3
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 832], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 192], 6
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 416], 5
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 640], 4
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 864], 3
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 448], 6
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 672], 5
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 896], 4
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 480], 7
        vpinsrb xmm7, xmm5, byte ptr [rdi + r9 + 704], 6
        vpinsrb xmm5, xmm6, byte ptr [rdi + r9 + 928], 5
        vmovd   xmm6, r10d
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 33], 1
        movzx   r10d, byte ptr [rdi + r9 + 257]
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 736], 7
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 65], 2
        vpinsrb xmm8, xmm5, byte ptr [rdi + r9 + 960], 6
        vpinsrb xmm5, xmm6, byte ptr [rdi + r9 + 97], 3
        vpinsrb xmm6, xmm0, byte ptr [rdi + r9 + 224], 7
        vmovd   xmm0, r10d
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 289], 1
        movzx   r10d, byte ptr [rdi + r9 + 513]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 321], 2
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 129], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 353], 3
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 161], 5
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 385], 4
        vpinsrb xmm9, xmm5, byte ptr [rdi + r9 + 193], 6
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 417], 5
        vpinsrb xmm10, xmm9, byte ptr [rdi + r9 + 225], 7
        vpinsrb xmm5, xmm0, byte ptr [rdi + r9 + 449], 6
        vmovd   xmm0, r10d
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 545], 1
        movzx   r10d, byte ptr [rdi + r9 + 769]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 577], 2
        vpinsrb xmm13, xmm5, byte ptr [rdi + r9 + 481], 7
        vpunpcklbw      xmm1, xmm10, xmm6
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 609], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 641], 4
        vpunpcklbw      xmm4, xmm13, xmm4
        vpmovzxwd       ymm4, xmm4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 673], 5
        vpinsrb xmm11, xmm0, byte ptr [rdi + r9 + 705], 6
        vpinsrb xmm0, xmm8, byte ptr [rdi + r9 + 992], 7
        vmovd   xmm8, r10d
        movzx   r10d, byte ptr [rdi + r9 + 2]
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 801], 1
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 833], 2
        vpinsrb xmm14, xmm11, byte ptr [rdi + r9 + 737], 7
        vmovd   xmm9, r10d
        vpinsrb xmm5, xmm9, byte ptr [rdi + r9 + 34], 1
        movzx   r10d, byte ptr [rdi + r9 + 258]
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 865], 3
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 66], 2
        vmovd   xmm9, r10d
        movzx   r10d, byte ptr [rdi + r9 + 514]
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 290], 1
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 897], 4
        vpunpcklbw      xmm6, xmm14, xmm7
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 98], 3
        vmovdqa xmmword ptr [rsp - 128], xmm6
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 322], 2
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 929], 5
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 130], 4
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 961], 6
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 162], 5
        vpinsrb xmm12, xmm8, byte ptr [rdi + r9 + 993], 7
        vpinsrb xmm8, xmm9, byte ptr [rdi + r9 + 354], 3
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 194], 6
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 386], 4
        vpinsrb xmm11, xmm5, byte ptr [rdi + r9 + 226], 7
        vmovd   xmm5, r10d
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 546], 1
        movzx   r10d, byte ptr [rdi + r9 + 770]
        vpunpcklbw      xmm0, xmm12, xmm0
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 418], 5
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 578], 2
        vmovdqa xmmword ptr [rsp - 64], xmm0
        vmovd   xmm9, r10d
        movzx   r10d, byte ptr [rdi + r9 + 3]
        vpinsrb xmm15, xmm9, byte ptr [rdi + r9 + 802], 1
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 450], 6
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 610], 3
        vmovd   xmm10, r10d
        movzx   r10d, byte ptr [rdi + r9 + 259]
        vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 35], 1
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 482], 7
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 642], 4
        vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 67], 2
        vmovd   xmm13, r10d
        vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 291], 1
        movzx   r10d, byte ptr [rdi + r9 + 515]
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 674], 5
        vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 323], 2
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 706], 6
        vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 355], 3
        vpinsrb xmm9, xmm5, byte ptr [rdi + r9 + 738], 7
        vpinsrb xmm5, xmm15, byte ptr [rdi + r9 + 834], 2
        vpinsrb xmm15, xmm10, byte ptr [rdi + r9 + 99], 3
        vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 387], 4
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 866], 3
        vpinsrb xmm7, xmm13, byte ptr [rdi + r9 + 419], 5
        vmovd   xmm13, r10d
        vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 547], 1
        movzx   r10d, byte ptr [rdi + r9 + 771]
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 898], 4
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 451], 6
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 930], 5
        vmovd   xmm0, r10d
        movzx   r10d, byte ptr [rdi + r9 + 4]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 803], 1
        vpinsrb xmm14, xmm7, byte ptr [rdi + r9 + 483], 7
        vpinsrb xmm7, xmm13, byte ptr [rdi + r9 + 579], 2
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 962], 6
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 835], 2
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 611], 3
        vpinsrb xmm10, xmm5, byte ptr [rdi + r9 + 994], 7
        vpinsrb xmm5, xmm15, byte ptr [rdi + r9 + 131], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 867], 3
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 643], 4
        vpunpcklbw      xmm8, xmm14, xmm8
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 163], 5
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 899], 4
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 675], 5
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 195], 6
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 931], 5
        vpinsrb xmm13, xmm7, byte ptr [rdi + r9 + 707], 6
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 227], 7
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 963], 6
        vpinsrb xmm12, xmm13, byte ptr [rdi + r9 + 739], 7
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 995], 7
        vpunpcklbw      xmm7, xmm5, xmm11
        vmovd   xmm5, r10d
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 36], 1
        movzx   r10d, byte ptr [rdi + r9 + 260]
        vpunpcklbw      xmm9, xmm12, xmm9
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 68], 2
        vpunpcklbw      xmm10, xmm0, xmm10
        vpmovzxwd       ymm9, xmm9
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 100], 3
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 132], 4
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 164], 5
        vpinsrb xmm11, xmm5, byte ptr [rdi + r9 + 196], 6
        vmovd   xmm5, r10d
        movzx   r10d, byte ptr [rdi + r9 + 516]
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 292], 1
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 228], 7
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 324], 2
        vmovd   xmm12, r10d
        vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 548], 1
        movzx   r10d, byte ptr [rdi + r9 + 5]
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 356], 3
        vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 580], 2
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 388], 4
        vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 612], 3
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 420], 5
        vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 644], 4
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 452], 6
        vpinsrb xmm0, xmm12, byte ptr [rdi + r9 + 676], 5
        vmovd   xmm12, r10d
        vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 37], 1
        movzx   r10d, byte ptr [rdi + r9 + 261]
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 484], 7
        vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 69], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 708], 6
        vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 101], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 740], 7
        vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 133], 4
        vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 165], 5
        vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 197], 6
        vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 229], 7
        vpunpcklbw      xmm11, xmm12, xmm11
        vmovd   xmm12, r10d
        vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 293], 1
        movzx   r10d, byte ptr [rdi + r9 + 517]
        vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 325], 2
        vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 357], 3
        vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 389], 4
        vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 421], 5
        vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 453], 6
        vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 485], 7
        vpunpcklbw      xmm12, xmm12, xmm5
        vmovd   xmm5, r10d
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 549], 1
        movzx   r10d, byte ptr [rdi + r9 + 772]
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 581], 2
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 613], 3
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 645], 4
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 677], 5
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 709], 6
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 741], 7
        vpunpcklbw      xmm15, xmm5, xmm0
        vmovd   xmm0, r10d
        movzx   r10d, byte ptr [rdi + r9 + 773]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 804], 1
        vpmovzxwd       ymm15, xmm15
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 836], 2
        vmovd   xmm5, r10d
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 805], 1
        movzx   r10d, byte ptr [rdi + r9 + 6]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 868], 3
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 837], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 900], 4
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 869], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 932], 5
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 901], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 964], 6
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 933], 5
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 996], 7
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 965], 6
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 997], 7
        vpunpcklbw      xmm6, xmm5, xmm0
        vpmovzxwd       ymm5, xmm1
        vpaddd  ymm0, ymm5, ymmword ptr [rsp - 96]
        vmovd   xmm5, r10d
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 38], 1
        movzx   r10d, byte ptr [rdi + r9 + 7]
        vpmovzxwd       ymm1, xmmword ptr [rsp - 64]
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 70], 2
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 102], 3
        vmovdqu ymmword ptr [rsp - 96], ymm0
        vpaddd  ymm0, ymm4, ymmword ptr [rsp - 32]
        vpaddd  ymm1, ymm2, ymm1
        vpmovzxwd       ymm2, xmm6
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 134], 4
        vmovdqu ymmword ptr [rsp - 32], ymm1
        vpinsrb xmm4, xmm5, byte ptr [rdi + r9 + 166], 5
        vmovd   xmm5, r10d
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 39], 1
        movzx   r10d, byte ptr [rdi + r9 + 262]
        vmovdqu ymmword ptr [rsp + 16], ymm0
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 71], 2
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 198], 6
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 103], 3
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 230], 7
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 135], 4
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 167], 5
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 199], 6
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 231], 7
        vpunpcklbw      xmm13, xmm5, xmm4
        vmovd   xmm4, r10d
        movzx   r10d, byte ptr [rdi + r9 + 263]
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 294], 1
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 326], 2
        vmovd   xmm5, r10d
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 295], 1
        movzx   r10d, byte ptr [rdi + r9 + 518]
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 358], 3
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 327], 2
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 390], 4
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 359], 3
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 422], 5
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 391], 4
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 454], 6
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 423], 5
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 486], 7
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 455], 6
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 487], 7
        vpunpcklbw      xmm14, xmm5, xmm4
        vmovd   xmm4, r10d
        movzx   r10d, byte ptr [rdi + r9 + 519]
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 550], 1
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 582], 2
        vmovd   xmm5, r10d
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 551], 1
        movzx   r10d, byte ptr [rdi + r9 + 774]
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 614], 3
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 583], 2
        vmovd   xmm1, r10d
        movzx   r10d, byte ptr [rdi + r9 + 775]
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 806], 1
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 646], 4
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 615], 3
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 838], 2
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 678], 5
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 647], 4
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 870], 3
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 710], 6
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 679], 5
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 902], 4
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 742], 7
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 711], 6
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 934], 5
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 743], 7
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 966], 6
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 998], 7
        vpunpcklbw      xmm4, xmm5, xmm4
        vpmovzxwd       ymm5, xmmword ptr [rsp - 128]
        vpmovzxwd       ymm4, xmm4
        vpaddd  ymm0, ymm3, ymm5
        vpmovzxwd       ymm5, xmm8
        vpmovzxwd       ymm8, xmm11
        vpmovzxwd       ymm11, xmm12
        vmovdqu ymmword ptr [rsp + 48], ymm0
        vpmovzxwd       ymm0, xmm7
        vpmovzxwd       ymm7, xmm10
        vpaddd  ymm12, ymm11, ymm5
        vpaddd  ymm11, ymm9, ymm15
        vpaddd  ymm6, ymm11, ymmword ptr [rsp + 48]
        vpaddd  ymm8, ymm8, ymm0
        vmovd   xmm0, r10d
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 807], 1
        movzx   r10d, byte ptr [rdi + r9 + 8]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 839], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 871], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 903], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 935], 5
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 967], 6
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 999], 7
        vpunpcklbw      xmm1, xmm0, xmm1
        vmovd   xmm0, r10d
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 40], 1
        movzx   r10d, byte ptr [rdi + r9 + 264]
        vpmovzxwd       ymm1, xmm1
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 72], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 104], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 136], 4
        vpinsrb xmm10, xmm0, byte ptr [rdi + r9 + 168], 5
        vmovd   xmm0, r10d
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 296], 1
        movzx   r10d, byte ptr [rdi + r9 + 520]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 328], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 360], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 392], 4
        vpinsrb xmm9, xmm0, byte ptr [rdi + r9 + 424], 5
        vmovd   xmm0, r10d
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 552], 1
        movzx   r10d, byte ptr [rdi + r9 + 9]
        vpinsrb xmm5, xmm0, byte ptr [rdi + r9 + 584], 2
        vpaddd  ymm0, ymm7, ymm2
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 456], 6
        vpinsrb xmm2, xmm5, byte ptr [rdi + r9 + 616], 3
        vpinsrb xmm5, xmm10, byte ptr [rdi + r9 + 200], 6
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 488], 7
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 648], 4
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 232], 7
        vpinsrb xmm7, xmm2, byte ptr [rdi + r9 + 680], 5
        vmovd   xmm2, r10d
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 41], 1
        movzx   r10d, byte ptr [rdi + r9 + 265]
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 73], 2
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 712], 6
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 105], 3
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 744], 7
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 137], 4
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 169], 5
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 201], 6
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 233], 7
        vpunpcklbw      xmm5, xmm2, xmm5
        vmovd   xmm2, r10d
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 297], 1
        movzx   r10d, byte ptr [rdi + r9 + 521]
        vpmovzxwd       ymm5, xmm5
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 329], 2
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 361], 3
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 393], 4
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 425], 5
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 457], 6
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 489], 7
        vpunpcklbw      xmm9, xmm2, xmm9
        vmovd   xmm2, r10d
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 553], 1
        movzx   r10d, byte ptr [rdi + r9 + 776]
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 585], 2
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 617], 3
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 649], 4
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 681], 5
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 713], 6
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 745], 7
        vpunpcklbw      xmm2, xmm2, xmm7
        vmovd   xmm7, r10d
        movzx   r10d, byte ptr [rdi + r9 + 777]
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 808], 1
        vpmovzxwd       ymm2, xmm2
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 840], 2
        vmovd   xmm10, r10d
        vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 809], 1
        movzx   r10d, byte ptr [rdi + r9 + 10]
        vpaddd  ymm2, ymm4, ymm2
        vmovdqu ymmword ptr [rsp - 64], ymm2
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 872], 3
        vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 841], 2
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 904], 4
        vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 873], 3
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 936], 5
        vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 905], 4
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 968], 6
        vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 937], 5
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 1000], 7
        vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 969], 6
        vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 1001], 7
        vpunpcklbw      xmm7, xmm10, xmm7
        vpmovzxwd       ymm10, xmm13
        vpmovzxwd       ymm13, xmm14
        vpmovzxwd       ymm14, xmm9
        vpmovzxwd       ymm15, xmm7
        vpaddd  ymm10, ymm10, ymm5
        vmovd   xmm5, r10d
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 42], 1
        movzx   r10d, byte ptr [rdi + r9 + 266]
        vpaddd  ymm13, ymm13, ymm14
        vpaddd  ymm1, ymm15, ymm1
        vmovdqu ymmword ptr [rsp - 128], ymm1
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 74], 2
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 106], 3
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 138], 4
        vpinsrb xmm9, xmm5, byte ptr [rdi + r9 + 170], 5
        vmovd   xmm5, r10d
        movzx   r10d, byte ptr [rdi + r9 + 522]
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 298], 1
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 330], 2
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 202], 6
        vmovd   xmm2, r10d
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 554], 1
        movzx   r10d, byte ptr [rdi + r9 + 11]
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 362], 3
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 234], 7
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 586], 2
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 394], 4
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 618], 3
        vpinsrb xmm7, xmm5, byte ptr [rdi + r9 + 426], 5
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 650], 4
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 458], 6
        vpinsrb xmm5, xmm2, byte ptr [rdi + r9 + 682], 5
        vmovd   xmm2, r10d
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 43], 1
        movzx   r10d, byte ptr [rdi + r9 + 267]
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 490], 7
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 75], 2
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 714], 6
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 107], 3
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 746], 7
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 139], 4
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 171], 5
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 203], 6
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 235], 7
        vpunpcklbw      xmm14, xmm2, xmm9
        vmovd   xmm2, r10d
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 299], 1
        movzx   r10d, byte ptr [rdi + r9 + 523]
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 331], 2
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 363], 3
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 395], 4
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 427], 5
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 459], 6
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 491], 7
        vpunpcklbw      xmm15, xmm2, xmm7
        vmovd   xmm2, r10d
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 555], 1
        movzx   r10d, byte ptr [rdi + r9 + 778]
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 587], 2
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 619], 3
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 651], 4
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 683], 5
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 715], 6
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 747], 7
        vpunpcklbw      xmm1, xmm2, xmm5
        vmovd   xmm2, r10d
        movzx   r10d, byte ptr [rdi + r9 + 779]
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 810], 1
        vmovdqa xmmword ptr [rsp], xmm1
        vpaddd  ymm1, ymm8, ymmword ptr [rsp - 96]
        vpaddd  ymm8, ymm12, ymmword ptr [rsp + 16]
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 842], 2
        vmovd   xmm5, r10d
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 811], 1
        movzx   r10d, byte ptr [rdi + r9 + 12]
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 874], 3
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 843], 2
        vmovdqu ymmword ptr [rsp - 96], ymm1
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 906], 4
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 875], 3
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 938], 5
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 907], 4
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 970], 6
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 939], 5
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 1002], 7
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 971], 6
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 1003], 7
        vpunpcklbw      xmm9, xmm5, xmm2
        vmovd   xmm5, r10d
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 44], 1
        movzx   r10d, byte ptr [rdi + r9 + 268]
        vpaddd  ymm2, ymm0, ymmword ptr [rsp - 32]
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 76], 2
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 108], 3
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 140], 4
        vpinsrb xmm3, xmm5, byte ptr [rdi + r9 + 172], 5
        vmovd   xmm5, r10d
        movzx   r10d, byte ptr [rdi + r9 + 13]
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 300], 1
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 204], 6
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 332], 2
        vmovd   xmm11, r10d
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 45], 1
        movzx   r10d, byte ptr [rdi + r9 + 269]
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 236], 7
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 364], 3
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 77], 2
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 396], 4
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 109], 3
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 428], 5
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 141], 4
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 460], 6
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 173], 5
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 492], 7
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 205], 6
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 237], 7
        vpunpcklbw      xmm3, xmm11, xmm3
        vmovd   xmm11, r10d
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 301], 1
        movzx   r10d, byte ptr [rdi + r9 + 524]
        vpmovzxwd       ymm3, xmm3
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 333], 2
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 365], 3
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 397], 4
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 429], 5
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 461], 6
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 493], 7
        vpunpcklbw      xmm12, xmm11, xmm5
        vmovd   xmm5, r10d
        movzx   r10d, byte ptr [rdi + r9 + 525]
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 556], 1
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 588], 2
        vmovd   xmm11, r10d
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 557], 1
        movzx   r10d, byte ptr [rdi + r9 + 780]
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 620], 3
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 589], 2
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 652], 4
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 621], 3
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 684], 5
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 653], 4
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 716], 6
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 685], 5
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 748], 7
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 717], 6
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 749], 7
        vpunpcklbw      xmm5, xmm11, xmm5
        vmovd   xmm11, r10d
        movzx   r10d, byte ptr [rdi + r9 + 781]
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 812], 1
        vpmovzxwd       ymm5, xmm5
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 844], 2
        vmovd   xmm1, r10d
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 813], 1
        movzx   r10d, byte ptr [rdi + r9 + 14]
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 876], 3
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 845], 2
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 908], 4
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 877], 3
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 940], 5
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 909], 4
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 972], 6
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 941], 5
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 1004], 7
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 973], 6
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 1005], 7
        vpunpcklbw      xmm11, xmm1, xmm11
        vpmovzxwd       ymm1, xmm14
        vpmovzxwd       ymm14, xmm15
        vpaddd  ymm10, ymm10, ymm1
        vmovd   xmm1, r10d
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 46], 1
        movzx   r10d, byte ptr [rdi + r9 + 15]
        vpaddd  ymm13, ymm13, ymm14
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 78], 2
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 110], 3
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 142], 4
        vpinsrb xmm14, xmm1, byte ptr [rdi + r9 + 174], 5
        vmovd   xmm1, r10d
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 47], 1
        movzx   r10d, byte ptr [rdi + r9 + 270]
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 79], 2
        vpinsrb xmm14, xmm14, byte ptr [rdi + r9 + 206], 6
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 111], 3
        vpinsrb xmm14, xmm14, byte ptr [rdi + r9 + 238], 7
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 143], 4
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 175], 5
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 207], 6
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 239], 7
        vpunpcklbw      xmm14, xmm1, xmm14
        vmovd   xmm1, r10d
        movzx   r10d, byte ptr [rdi + r9 + 271]
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 302], 1
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 334], 2
        vmovd   xmm15, r10d
        vpinsrb xmm15, xmm15, byte ptr [rdi + r9 + 303], 1
        movzx   r10d, byte ptr [rdi + r9 + 526]
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 366], 3
        vpinsrb xmm15, xmm15, byte ptr [rdi + r9 + 335], 2
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 398], 4
        vpinsrb xmm15, xmm15, byte ptr [rdi + r9 + 367], 3
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 430], 5
        vpinsrb xmm15, xmm15, byte ptr [rdi + r9 + 399], 4
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 462], 6
        vpinsrb xmm15, xmm15, byte ptr [rdi + r9 + 431], 5
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 494], 7
        vpinsrb xmm15, xmm15, byte ptr [rdi + r9 + 463], 6
        vpinsrb xmm15, xmm15, byte ptr [rdi + r9 + 495], 7
        vpunpcklbw      xmm15, xmm15, xmm1
        vmovd   xmm1, r10d
        movzx   r10d, byte ptr [rdi + r9 + 527]
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 558], 1
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 590], 2
        vmovd   xmm4, r10d
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 559], 1
        movzx   r10d, byte ptr [rdi + r9 + 782]
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 622], 3
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 591], 2
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 654], 4
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 623], 3
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 686], 5
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 655], 4
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 718], 6
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 687], 5
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 750], 7
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 719], 6
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 751], 7
        vpunpcklbw      xmm1, xmm4, xmm1
        vmovd   xmm4, r10d
        movzx   r10d, byte ptr [rdi + r9 + 783]
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 814], 1
        vpmovzxwd       ymm1, xmm1
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 846], 2
        vmovd   xmm7, r10d
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 815], 1
        movzx   r10d, byte ptr [rdi + r9 + 16]
        vpaddd  ymm5, ymm5, ymm1
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 878], 3
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 847], 2
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 910], 4
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 879], 3
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 942], 5
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 911], 4
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 974], 6
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 943], 5
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 1006], 7
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 975], 6
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 1007], 7
        vpunpcklbw      xmm4, xmm7, xmm4
        vpmovzxwd       ymm7, xmm14
        vpmovzxwd       ymm1, xmm4
        vpaddd  ymm14, ymm3, ymm7
        vpmovzxwd       ymm3, xmmword ptr [rsp]
        vpmovzxwd       ymm7, xmm12
        vpaddd  ymm0, ymm3, ymmword ptr [rsp - 64]
        vpmovzxwd       ymm3, xmm9
        vpmovzxwd       ymm9, xmm11
        vpmovzxwd       ymm11, xmm15
        vpaddd  ymm4, ymm9, ymm1
        vmovd   xmm1, r10d
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 48], 1
        movzx   r10d, byte ptr [rdi + r9 + 272]
        vpaddd  ymm12, ymm11, ymm7
        vpaddd  ymm9, ymm3, ymmword ptr [rsp - 128]
        vpaddd  ymm3, ymm10, ymmword ptr [rsp - 96]
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 80], 2
        vpaddd  ymm0, ymm6, ymm0
        vmovdqu ymmword ptr [rsp - 96], ymm0
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 112], 3
        vpaddd  ymm2, ymm9, ymm2
        vmovdqu ymmword ptr [rsp - 128], ymm3
        vmovdqu ymmword ptr [rsp - 32], ymm2
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 144], 4
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 176], 5
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 208], 6
        vpinsrb xmm7, xmm1, byte ptr [rdi + r9 + 240], 7
        vpaddd  ymm1, ymm8, ymm13
        vmovdqu ymmword ptr [rsp + 16], ymm1
        vmovd   xmm1, r10d
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 304], 1
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 336], 2
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 368], 3
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 400], 4
        movzx   r10d, byte ptr [rdi + r9 + 528]
        vpinsrb xmm0, xmm1, byte ptr [rdi + r9 + 432], 5
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 464], 6
        vpinsrb xmm1, xmm0, byte ptr [rdi + r9 + 496], 7
        vmovd   xmm0, r10d
        movzx   r10d, byte ptr [rdi + r9 + 784]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 560], 1
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 592], 2
        vmovd   xmm2, r10d
        movzx   r10d, byte ptr [rdi + r9 + 17]
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 816], 1
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 624], 3
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 848], 2
        vmovd   xmm6, r10d
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 49], 1
        movzx   r10d, byte ptr [rdi + r9 + 273]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 656], 4
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 880], 3
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 81], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 688], 5
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 912], 4
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 113], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 720], 6
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 944], 5
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 145], 4
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 976], 6
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 177], 5
        vpinsrb xmm8, xmm2, byte ptr [rdi + r9 + 1008], 7
        vpinsrb xmm9, xmm6, byte ptr [rdi + r9 + 209], 6
        vpinsrb xmm6, xmm0, byte ptr [rdi + r9 + 752], 7
        vmovd   xmm0, r10d
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 305], 1
        movzx   r10d, byte ptr [rdi + r9 + 529]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 337], 2
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 241], 7
        vmovd   xmm2, r10d
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 561], 1
        movzx   r10d, byte ptr [rdi + r9 + 785]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 369], 3
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 593], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 401], 4
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 625], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 433], 5
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 657], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 465], 6
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 689], 5
        vpinsrb xmm10, xmm0, byte ptr [rdi + r9 + 497], 7
        vmovd   xmm0, r10d
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 817], 1
        movzx   r10d, byte ptr [rdi + r9 + 18]
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 721], 6
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 849], 2
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 753], 7
        vpinsrb xmm11, xmm0, byte ptr [rdi + r9 + 881], 3
        vpunpcklbw      xmm0, xmm9, xmm7
        vmovd   xmm9, r10d
        movzx   r10d, byte ptr [rdi + r9 + 274]
        vpunpcklbw      xmm1, xmm10, xmm1
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 50], 1
        vpmovzxwd       ymm0, xmm0
        vpmovzxwd       ymm1, xmm1
        vpinsrb xmm7, xmm11, byte ptr [rdi + r9 + 913], 4
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 82], 2
        vmovd   xmm10, r10d
        movzx   r10d, byte ptr [rdi + r9 + 530]
        vpunpcklbw      xmm13, xmm2, xmm6
        vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 306], 1
        vpaddd  ymm12, ymm12, ymm1
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 945], 5
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 114], 3
        vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 338], 2
        vmovd   xmm2, r10d
        movzx   r10d, byte ptr [rdi + r9 + 19]
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 562], 1
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 146], 4
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 977], 6
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 594], 2
        vmovd   xmm6, r10d
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 51], 1
        movzx   r10d, byte ptr [rdi + r9 + 275]
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 178], 5
        vpinsrb xmm11, xmm7, byte ptr [rdi + r9 + 1009], 7
        vpinsrb xmm7, xmm10, byte ptr [rdi + r9 + 370], 3
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 626], 3
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 83], 2
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 210], 6
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 402], 4
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 658], 4
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 115], 3
        vpunpcklbw      xmm10, xmm11, xmm8
        vpinsrb xmm8, xmm9, byte ptr [rdi + r9 + 242], 7
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 434], 5
        vpaddd  ymm11, ymm14, ymm0
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 690], 5
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 147], 4
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 466], 6
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 722], 6
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 179], 5
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 498], 7
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 754], 7
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 211], 6
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 243], 7
        vpunpcklbw      xmm3, xmm6, xmm8
        vmovd   xmm8, r10d
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 307], 1
        movzx   r10d, byte ptr [rdi + r9 + 531]
        vmovdqa xmmword ptr [rsp - 64], xmm3
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 339], 2
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 371], 3
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 403], 4
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 435], 5
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 467], 6
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 499], 7
        vpunpcklbw      xmm7, xmm8, xmm7
        vmovd   xmm8, r10d
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 563], 1
        movzx   r10d, byte ptr [rdi + r9 + 786]
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 595], 2
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 627], 3
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 659], 4
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 691], 5
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 723], 6
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 755], 7
        vpunpcklbw      xmm8, xmm8, xmm2
        vmovd   xmm2, r10d
        movzx   r10d, byte ptr [rdi + r9 + 787]
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 818], 1
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 850], 2
        vmovd   xmm9, r10d
        movzx   r10d, byte ptr [rdi + r9 + 20]
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 819], 1
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 882], 3
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 851], 2
        vmovd   xmm0, r10d
        movzx   r10d, byte ptr [rdi + r9 + 21]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 52], 1
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 914], 4
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 883], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 84], 2
        vmovd   xmm1, r10d
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 53], 1
        movzx   r10d, byte ptr [rdi + r9 + 276]
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 946], 5
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 915], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 116], 3
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 85], 2
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 978], 6
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 947], 5
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 148], 4
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 117], 3
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 1010], 7
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 979], 6
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 180], 5
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 149], 4
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 1011], 7
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 212], 6
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 181], 5
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 244], 7
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 213], 6
        vpunpcklbw      xmm9, xmm9, xmm2
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 245], 7
        vpunpcklbw      xmm14, xmm1, xmm0
        vmovd   xmm0, r10d
        movzx   r10d, byte ptr [rdi + r9 + 277]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 308], 1
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 340], 2
        vmovd   xmm1, r10d
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 309], 1
        movzx   r10d, byte ptr [rdi + r9 + 532]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 372], 3
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 341], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 404], 4
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 373], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 436], 5
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 405], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 468], 6
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 437], 5
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 500], 7
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 469], 6
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 501], 7
        vpunpcklbw      xmm15, xmm1, xmm0
        vmovd   xmm0, r10d
        movzx   r10d, byte ptr [rdi + r9 + 533]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 564], 1
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 596], 2
        vmovd   xmm1, r10d
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 565], 1
        movzx   r10d, byte ptr [rdi + r9 + 788]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 628], 3
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 597], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 660], 4
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 629], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 692], 5
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 661], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 724], 6
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 693], 5
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 756], 7
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 725], 6
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 757], 7
        vpunpcklbw      xmm2, xmm1, xmm0
        vpmovzxwd       ymm0, xmm13
        vpaddd  ymm5, ymm5, ymm0
        vmovd   xmm0, r10d
        movzx   r10d, byte ptr [rdi + r9 + 789]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 820], 1
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 852], 2
        vmovd   xmm1, r10d
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 821], 1
        movzx   r10d, byte ptr [rdi + r9 + 22]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 884], 3
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 853], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 916], 4
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 885], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 948], 5
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 917], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 980], 6
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 949], 5
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 1012], 7
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 981], 6
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 1013], 7
        vpunpcklbw      xmm1, xmm1, xmm0
        vmovd   xmm0, r10d
        movzx   r10d, byte ptr [rdi + r9 + 23]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 54], 1
        vpmovzxwd       ymm1, xmm1
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 86], 2
        vmovd   xmm13, r10d
        vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 55], 1
        movzx   r10d, byte ptr [rdi + r9 + 278]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 118], 3
        vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 87], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 150], 4
        vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 119], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 182], 5
        vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 151], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 214], 6
        vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 183], 5
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 246], 7
        vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 215], 6
        vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 247], 7
        vpunpcklbw      xmm13, xmm13, xmm0
        vmovd   xmm0, r10d
        movzx   r10d, byte ptr [rdi + r9 + 279]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 310], 1
        vpmovzxwd       ymm13, xmm13
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 342], 2
        vmovd   xmm3, r10d
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 311], 1
        movzx   r10d, byte ptr [rdi + r9 + 534]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 374], 3
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 343], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 406], 4
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 375], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 438], 5
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 407], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 470], 6
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 439], 5
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 502], 7
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 471], 6
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 503], 7
        vpunpcklbw      xmm0, xmm3, xmm0
        vpmovzxwd       ymm3, xmm10
        vpmovzxwd       ymm0, xmm0
        vpaddd  ymm10, ymm4, ymm3
        vmovd   xmm3, r10d
        movzx   r10d, byte ptr [rdi + r9 + 535]
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 566], 1
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 598], 2
        vmovd   xmm4, r10d
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 567], 1
        movzx   r10d, byte ptr [rdi + r9 + 790]
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 630], 3
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 599], 2
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 662], 4
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 631], 3
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 694], 5
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 663], 4
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 726], 6
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 695], 5
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 758], 7
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 727], 6
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 759], 7
        vpunpcklbw      xmm3, xmm4, xmm3
        vmovd   xmm4, r10d
        movzx   r10d, byte ptr [rdi + r9 + 791]
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 822], 1
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 854], 2
        vmovd   xmm6, r10d
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 823], 1
        movzx   r10d, byte ptr [rdi + r9 + 24]
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 886], 3
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 855], 2
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 918], 4
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 887], 3
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 950], 5
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 919], 4
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 982], 6
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 951], 5
        vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 1014], 7
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 983], 6
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 1015], 7
        vpunpcklbw      xmm6, xmm6, xmm4
        vpmovzxwd       ymm4, xmm14
        vpaddd  ymm14, ymm13, ymm4
        vpmovzxwd       ymm4, xmm15
        vpaddd  ymm13, ymm4, ymm0
        vpmovzxwd       ymm0, xmm2
        vpmovzxwd       ymm2, xmm3
        vpmovzxwd       ymm3, xmm8
        vpmovzxwd       ymm8, xmm9
        vpaddd  ymm4, ymm0, ymm2
        vpmovzxwd       ymm0, xmm6
        vpmovzxwd       ymm2, xmm7
        vpaddd  ymm7, ymm5, ymm3
        vpaddd  ymm8, ymm10, ymm8
        vpaddd  ymm0, ymm1, ymm0
        vpmovzxwd       ymm1, xmmword ptr [rsp - 64]
        vpaddd  ymm11, ymm11, ymm1
        vmovd   xmm1, r10d
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 56], 1
        movzx   r10d, byte ptr [rdi + r9 + 280]
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 88], 2
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 120], 3
        vpinsrb xmm6, xmm1, byte ptr [rdi + r9 + 152], 4
        vpaddd  ymm1, ymm12, ymm2
        vpinsrb xmm2, xmm6, byte ptr [rdi + r9 + 184], 5
        vmovd   xmm6, r10d
        movzx   r10d, byte ptr [rdi + r9 + 536]
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 312], 1
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 216], 6
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 344], 2
        vmovd   xmm3, r10d
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 568], 1
        movzx   r10d, byte ptr [rdi + r9 + 25]
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 248], 7
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 376], 3
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 600], 2
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 408], 4
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 632], 3
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 440], 5
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 664], 4
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 472], 6
        vpinsrb xmm5, xmm3, byte ptr [rdi + r9 + 696], 5
        vmovd   xmm3, r10d
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 57], 1
        movzx   r10d, byte ptr [rdi + r9 + 281]
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 504], 7
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 89], 2
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 728], 6
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 121], 3
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 760], 7
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 153], 4
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 185], 5
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 217], 6
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 249], 7
        vpunpcklbw      xmm2, xmm3, xmm2
        vmovd   xmm3, r10d
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 313], 1
        movzx   r10d, byte ptr [rdi + r9 + 537]
        vpmovzxwd       ymm2, xmm2
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 345], 2
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 377], 3
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 409], 4
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 441], 5
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 473], 6
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 505], 7
        vpunpcklbw      xmm6, xmm3, xmm6
        vmovd   xmm3, r10d
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 569], 1
        movzx   r10d, byte ptr [rdi + r9 + 792]
        vpmovzxwd       ymm6, xmm6
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 601], 2
        vpaddd  ymm6, ymm13, ymm6
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 633], 3
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 665], 4
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 697], 5
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 729], 6
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 761], 7
        vpunpcklbw      xmm3, xmm3, xmm5
        vmovd   xmm5, r10d
        movzx   r10d, byte ptr [rdi + r9 + 793]
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 824], 1
        vpmovzxwd       ymm3, xmm3
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 856], 2
        vmovd   xmm9, r10d
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 825], 1
        movzx   r10d, byte ptr [rdi + r9 + 26]
        vpaddd  ymm4, ymm4, ymm3
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 888], 3
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 857], 2
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 920], 4
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 889], 3
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 952], 5
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 921], 4
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 984], 6
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 953], 5
        vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 1016], 7
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 985], 6
        vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 1017], 7
        vpunpcklbw      xmm5, xmm9, xmm5
        vpaddd  ymm9, ymm14, ymm2
        vmovd   xmm2, r10d
        movzx   r10d, byte ptr [rdi + r9 + 282]
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 58], 1
        vpmovzxwd       ymm5, xmm5
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 90], 2
        vmovd   xmm10, r10d
        vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 314], 1
        movzx   r10d, byte ptr [rdi + r9 + 538]
        vpaddd  ymm5, ymm0, ymm5
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 122], 3
        vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 346], 2
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 154], 4
        vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 378], 3
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 186], 5
        vpinsrb xmm3, xmm10, byte ptr [rdi + r9 + 410], 4
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 218], 6
        vpinsrb xmm10, xmm3, byte ptr [rdi + r9 + 442], 5
        vmovd   xmm3, r10d
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 570], 1
        movzx   r10d, byte ptr [rdi + r9 + 794]
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 250], 7
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 602], 2
        vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 474], 6
        vmovd   xmm0, r10d
        movzx   r10d, byte ptr [rdi + r9 + 27]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 826], 1
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 634], 3
        vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 506], 7
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 858], 2
        vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 666], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 890], 3
        vpinsrb xmm12, xmm3, byte ptr [rdi + r9 + 698], 5
        vpaddd  ymm3, ymm11, ymmword ptr [rsp - 128]
        vmovd   xmm11, r10d
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 59], 1
        movzx   r10d, byte ptr [rdi + r9 + 283]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 922], 4
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 91], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 954], 5
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 123], 3
        vmovdqu ymmword ptr [rsp - 128], ymm3
        vpaddd  ymm3, ymm1, ymmword ptr [rsp + 16]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 986], 6
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 155], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 1018], 7
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 187], 5
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 219], 6
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 251], 7
        vpunpcklbw      xmm14, xmm11, xmm2
        vmovd   xmm2, r10d
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 315], 1
        movzx   r10d, byte ptr [rdi + r9 + 539]
        vpinsrb xmm11, xmm12, byte ptr [rdi + r9 + 730], 6
        vpmovzxwd       ymm14, xmm14
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 347], 2
        vpaddd  ymm9, ymm9, ymm14
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 379], 3
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 411], 4
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 443], 5
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 475], 6
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 507], 7
        vpunpcklbw      xmm13, xmm2, xmm10
        vmovd   xmm2, r10d
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 571], 1
        vpinsrb xmm10, xmm11, byte ptr [rdi + r9 + 762], 7
        movzx   r10d, byte ptr [rdi + r9 + 795]
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 603], 2
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 635], 3
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 667], 4
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 699], 5
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 731], 6
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 763], 7
        vpunpcklbw      xmm11, xmm2, xmm10
        vmovd   xmm2, r10d
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 827], 1
        movzx   r10d, byte ptr [rdi + r9 + 28]
        vpmovzxwd       ymm11, xmm11
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 859], 2
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 891], 3
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 923], 4
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 955], 5
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 987], 6
        vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 1019], 7
        vpunpcklbw      xmm12, xmm2, xmm0
        vmovd   xmm0, r10d
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 60], 1
        movzx   r10d, byte ptr [rdi + r9 + 284]
        vpaddd  ymm2, ymm7, ymmword ptr [rsp - 96]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 92], 2
        vmovd   xmm1, r10d
        movzx   r10d, byte ptr [rdi + r9 + 540]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 124], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 156], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 188], 5
        vpinsrb xmm15, xmm0, byte ptr [rdi + r9 + 220], 6
        vpinsrb xmm0, xmm1, byte ptr [rdi + r9 + 316], 1
        vmovd   xmm1, r10d
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 572], 1
        movzx   r10d, byte ptr [rdi + r9 + 29]
        vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 604], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 348], 2
        vpinsrb xmm7, xmm1, byte ptr [rdi + r9 + 636], 3
        vpaddd  ymm1, ymm8, ymmword ptr [rsp - 32]
        vpinsrb xmm8, xmm15, byte ptr [rdi + r9 + 252], 7
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 380], 3
        vpmovzxwd       ymm15, xmm13
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 668], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 412], 4
        vpaddd  ymm6, ymm15, ymm6
        vpmovzxwd       ymm15, xmm12
        vpaddd  ymm12, ymm11, ymm4
        vpinsrb xmm10, xmm7, byte ptr [rdi + r9 + 700], 5
        vmovd   xmm7, r10d
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 61], 1
        movzx   r10d, byte ptr [rdi + r9 + 285]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 444], 5
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 93], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 476], 6
        vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 732], 6
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 125], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 508], 7
        vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 764], 7
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 157], 4
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 189], 5
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 221], 6
        vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 253], 7
        vpunpcklbw      xmm7, xmm7, xmm8
        vmovd   xmm8, r10d
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 317], 1
        movzx   r10d, byte ptr [rdi + r9 + 541]
        vpmovzxwd       ymm4, xmm7
        vpaddd  ymm7, ymm15, ymm5
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 349], 2
        vpaddd  ymm9, ymm9, ymm4
        vpaddd  ymm9, ymm9, ymmword ptr [rsp - 128]
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 381], 3
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 413], 4
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 445], 5
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 477], 6
        vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 509], 7
        vpunpcklbw      xmm8, xmm8, xmm0
        vmovd   xmm0, r10d
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 573], 1
        movzx   r10d, byte ptr [rdi + r9 + 796]
        vpmovzxwd       ymm15, xmm8
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 605], 2
        vpaddd  ymm6, ymm15, ymm6
        vpaddd  ymm3, ymm3, ymm6
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 637], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 669], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 701], 5
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 733], 6
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 765], 7
        vpunpcklbw      xmm10, xmm0, xmm10
        vmovd   xmm0, r10d
        movzx   r10d, byte ptr [rdi + r9 + 797]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 828], 1
        vpmovzxwd       ymm10, xmm10
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 860], 2
        vmovd   xmm14, r10d
        vpinsrb xmm14, xmm14, byte ptr [rdi + r9 + 829], 1
        movzx   r10d, byte ptr [rdi + r9 + 30]
        vpaddd  ymm10, ymm12, ymm10
        vpaddd  ymm2, ymm10, ymm2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 892], 3
        vpinsrb xmm14, xmm14, byte ptr [rdi + r9 + 861], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 924], 4
        vpinsrb xmm14, xmm14, byte ptr [rdi + r9 + 893], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 956], 5
        vpinsrb xmm14, xmm14, byte ptr [rdi + r9 + 925], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 988], 6
        vpinsrb xmm14, xmm14, byte ptr [rdi + r9 + 957], 5
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 1020], 7
        vpinsrb xmm14, xmm14, byte ptr [rdi + r9 + 989], 6
        vpinsrb xmm14, xmm14, byte ptr [rdi + r9 + 1021], 7
        vpunpcklbw      xmm13, xmm14, xmm0
        vmovd   xmm0, r10d
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 62], 1
        movzx   r10d, byte ptr [rdi + r9 + 286]
        vpmovzxwd       ymm13, xmm13
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 94], 2
        vpaddd  ymm7, ymm13, ymm7
        vpaddd  ymm7, ymm1, ymm7
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 126], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 158], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 190], 5
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 222], 6
        vpinsrb xmm14, xmm0, byte ptr [rdi + r9 + 254], 7
        vmovd   xmm0, r10d
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 318], 1
        movzx   r10d, byte ptr [rdi + r9 + 542]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 350], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 382], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 414], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 446], 5
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 478], 6
        vpinsrb xmm11, xmm0, byte ptr [rdi + r9 + 510], 7
        vmovd   xmm0, r10d
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 574], 1
        movzx   r10d, byte ptr [rdi + r9 + 798]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 606], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 638], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 670], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 702], 5
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 734], 6
        vpinsrb xmm5, xmm0, byte ptr [rdi + r9 + 766], 7
        vmovd   xmm0, r10d
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 830], 1
        movzx   r10d, byte ptr [rdi + r9 + 31]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 862], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 894], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 926], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 958], 5
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 990], 6
        vpinsrb xmm4, xmm0, byte ptr [rdi + r9 + 1022], 7
        vmovd   xmm0, r10d
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 63], 1
        movzx   r10d, byte ptr [rdi + r9 + 287]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 95], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 127], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 159], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 191], 5
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 223], 6
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 255], 7
        vpunpcklbw      xmm8, xmm0, xmm14
        vmovd   xmm0, r10d
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 319], 1
        movzx   r10d, byte ptr [rdi + r9 + 543]
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 351], 2
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 383], 3
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 415], 4
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 447], 5
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 479], 6
        vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 511], 7
        vpunpcklbw      xmm0, xmm0, xmm11
        vmovd   xmm11, r10d
        movzx   r10d, byte ptr [rdi + r9 + 799]
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 575], 1
        vpmovzxwd       ymm0, xmm0
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 607], 2
        vmovd   xmm12, r10d
        vpinsrb xmm6, xmm12, byte ptr [rdi + r9 + 831], 1
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 639], 3
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 863], 2
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 671], 4
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 895], 3
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 703], 5
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 927], 4
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 735], 6
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 959], 5
        vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 767], 7
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 991], 6
        vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 1023], 7
        vpunpcklbw      xmm5, xmm11, xmm5
        vpmovzxwd       ymm5, xmm5
        vpunpcklbw      xmm4, xmm6, xmm4
        vpmovzxwd       ymm6, xmm8
        vpaddd  ymm1, ymm9, ymm6
        vpaddd  ymm6, ymm3, ymm0
        vpaddd  ymm3, ymm2, ymm5
        vpmovzxwd       ymm2, xmm4
        vpaddd  ymm2, ymm7, ymm2
        cmp     r8, rdx
        jne     .LBB0_5
        vpaddd  ymm0, ymm6, ymm1
        vpaddd  ymm0, ymm3, ymm0
        vpaddd  ymm0, ymm2, ymm0
        vextracti128    xmm1, ymm0, 1
        vpaddd  xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 238
        vpaddd  xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 85
        vpaddd  xmm0, xmm0, xmm1
        vmovd   r8d, xmm0
        cmp     rcx, rdx
        je      .LBB0_9
.LBB0_7:
        vbroadcasti128  ymm0, xmmword ptr [rip + .LCPI0_1]
.LBB0_8:
        vmovdqu ymm1, ymmword ptr [rax]
        add     rsi, -32
        add     rax, 32
        vpshufb ymm1, ymm1, ymm0
        vextracti128    xmm2, ymm1, 1
        vpmovzxwd       ymm1, xmm1
        vpmovzxwd       ymm2, xmm2
        vpaddd  ymm1, ymm1, ymm2
        vextracti128    xmm2, ymm1, 1
        vpaddd  xmm1, xmm1, xmm2
        vpshufd xmm2, xmm1, 238
        vpaddd  xmm1, xmm1, xmm2
        vpshufd xmm2, xmm1, 85
        vpaddd  xmm1, xmm1, xmm2
        vmovd   ecx, xmm1
        add     r8d, ecx
        cmp     rsi, 31
        ja      .LBB0_8
.LBB0_9:
        cmp     rsi, 2
        jb      .LBB0_10
        lea     rdx, [rsi - 2]
        cmp     rdx, 62
        jae     .LBB0_16
        xor     ecx, ecx
        jmp     .LBB0_19
.LBB0_10:
        xor     ecx, ecx
        jmp     .LBB0_11
.LBB0_16:
        vmovdqa xmm2, xmmword ptr [rip + .LCPI0_1]
        shr     rdx
        vmovd   xmm0, r8d
        vpxor   xmm1, xmm1, xmm1
        xor     r8d, r8d
        vpxor   xmm3, xmm3, xmm3
        vpxor   xmm4, xmm4, xmm4
        inc     rdx
        mov     rdi, rdx
        and     rdi, -32
        lea     rcx, [rdi + rdi]
.LBB0_17:
        vmovdqu xmm6, xmmword ptr [rax + 2*r8 + 16]
        vmovdqu xmm5, xmmword ptr [rax + 2*r8]
        vmovdqu xmm7, xmmword ptr [rax + 2*r8 + 32]
        vmovdqu xmm8, xmmword ptr [rax + 2*r8 + 48]
        add     r8, 32
        vpshufb xmm6, xmm6, xmm2
        vpshufb xmm5, xmm5, xmm2
        vpshufb xmm7, xmm7, xmm2
        vpshufb xmm8, xmm8, xmm2
        vpmovzxwd       ymm6, xmm6
        vpmovzxwd       ymm5, xmm5
        vpmovzxwd       ymm7, xmm7
        vpaddd  ymm1, ymm1, ymm6
        vpmovzxwd       ymm6, xmm8
        vpaddd  ymm0, ymm0, ymm5
        vpaddd  ymm3, ymm3, ymm7
        vpaddd  ymm4, ymm4, ymm6
        cmp     rdi, r8
        jne     .LBB0_17
        vpaddd  ymm0, ymm1, ymm0
        vpaddd  ymm0, ymm3, ymm0
        vpaddd  ymm0, ymm4, ymm0
        vextracti128    xmm1, ymm0, 1
        vpaddd  xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 238
        vpaddd  xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 85
        vpaddd  xmm0, xmm0, xmm1
        vmovd   r8d, xmm0
        cmp     rdx, rdi
        je      .LBB0_11
.LBB0_19:
        mov     rdx, rcx
.LBB0_20:
        movbe   cx, word ptr [rax + rdx]
        movzx   ecx, cx
        add     r8d, ecx
        lea     rcx, [rdx + 2]
        add     rdx, 3
        cmp     rdx, rsi
        mov     rdx, rcx
        jb      .LBB0_20
.LBB0_11:
        cmp     rcx, rsi
        jae     .LBB0_13
        movzx   eax, byte ptr [rax + rcx]
        shl     eax, 8
        add     r8d, eax
.LBB0_13:
        mov     eax, r8d
        shr     eax, 16
        movzx   ecx, r8w
        add     ecx, eax
        mov     eax, ecx
        shr     eax, 16
        add     eax, ecx
        add     rsp, 88
        vzeroupper
        ret

@jordens

jordens commented Jun 20, 2025

I would definitely prefer the bench_checksum_chunks_exact_no_bigchunk variant here. Defer to the compiler/libcore to get this right, portable, and maintained.

@caobug
Author

caobug commented Jun 20, 2025

I don't know the exact reason. I think your method is the best.

bench_checksum_chunks_exact_no_bigchunk

I benchmarked bench_checksum_chunks_exact_no_bigchunk and bench_checksum_chunks_exact on macOS — the former was about 3.5% faster.

@caobug
Author

caobug commented Jun 20, 2025

I ran into a surprising result: when the input size is less than 1024 bytes, bench_checksum_original is actually the fastest implementation.

Could you help verify this on your machine? @datdenkikniet

Benchmark results:

For input size < 1024:
bench_checksum_original: 14.38 ns/iter (fastest)
bench_checksum_indexed: 29.10 ns/iter
bench_checksum_chunks_exact: 29.81 ns/iter
bench_checksum_chunks_exact_no_bigchunk: 28.55 ns/iter

fn build_data() -> Vec<u8> {
    (0..1023).map(|x| (x % 256) as u8).collect()
}

For input size ≥ 1024:
bench_checksum_original: 54.46 ns/iter (slowest)
bench_checksum_indexed: 29.54 ns/iter
bench_checksum_chunks_exact: 29.41 ns/iter
bench_checksum_chunks_exact_no_bigchunk: 29.24 ns/iter

fn build_data() -> Vec<u8> {
    (0..1024).map(|x| (x % 256) as u8).collect()
}
Code:
mod checksum {
    extern crate test;

    use byteorder::{ByteOrder, NetworkEndian};

    #[bench]
    fn bench_checksum_original(b: &mut test::Bencher) {
        let data = build_data();

        b.iter(|| {
            test::black_box(checksum_original(&data));
        });
    }

    #[bench]
    fn bench_checksum_indexed(b: &mut test::Bencher) {
        let data = build_data();

        b.iter(|| {
            test::black_box(checksum_indexed(&data));
        });
    }

    #[bench]
    fn bench_checksum_chunks_exact(b: &mut test::Bencher) {
        let data = build_data();

        b.iter(|| {
            test::black_box(checksum_chunks_exact(&data));
        });
    }

    #[bench]
    fn bench_checksum_chunks_exact_no_bigchunk(b: &mut test::Bencher) {
        let data = build_data();

        b.iter(|| {
            test::black_box(checksum_chunks_exact_no_bigchunk(&data));
        });
    }

    fn build_data() -> Vec<u8> {
        (0..1024).map(|x| (x % 256) as u8).collect()
    }

    pub fn checksum_original(mut data: &[u8]) -> u16 {
        let mut accum = 0;

        // For each 32-byte chunk...
        const CHUNK_SIZE: usize = 32;
        while data.len() >= CHUNK_SIZE {
            let mut d = &data[..CHUNK_SIZE];
            // ... take by 2 bytes and sum them.
            while d.len() >= 2 {
                accum += NetworkEndian::read_u16(d) as u32;
                d = &d[2..];
            }

            data = &data[CHUNK_SIZE..];
        }

        // Sum the rest that does not fit the last 32-byte chunk,
        // taking by 2 bytes.
        while data.len() >= 2 {
            accum += NetworkEndian::read_u16(data) as u32;
            data = &data[2..];
        }

        // Add the last remaining odd byte, if any.
        if let Some(&value) = data.first() {
            accum += (value as u32) << 8;
        }

        propagate_carries(accum)
    }

    pub fn checksum_indexed(mut data: &[u8]) -> u16 {
        let mut accum = 0;

        // For each 32-byte chunk...
        const CHUNK_SIZE: usize = 32;
        while data.len() >= CHUNK_SIZE {
            let chunk = &data[..CHUNK_SIZE];
            let mut i = 0;
            // ... take by 2 bytes and sum them.
            while i + 1 < CHUNK_SIZE {
                accum += u16::from_be_bytes([chunk[i], chunk[i + 1]]) as u32;
                i += 2;
            }

            data = &data[CHUNK_SIZE..];
        }

        // Sum the rest that does not fit the last 32-byte chunk,
        // taking by 2 bytes.
        let mut i = 0;
        while i + 1 < data.len() {
            accum += u16::from_be_bytes([data[i], data[i + 1]]) as u32;
            i += 2;
        }

        // Add the last remaining odd byte, if any.
        if i < data.len() {
            accum += (data[i] as u32) << 8;
        }

        propagate_carries(accum)
    }

    pub fn checksum_chunks_exact(data: &[u8]) -> u16 {
        let mut accum = 0;

        // For each 32-byte chunk...
        const CHUNK_SIZE: usize = 32;
        const WORD_SIZE: usize = 2;
        let mut chunks = data.chunks_exact(CHUNK_SIZE);
        for chunk in &mut chunks {
            // ... take by 2 bytes and sum them.
            for pair in chunk.chunks_exact(WORD_SIZE) {
                accum += u16::from_be_bytes([pair[0], pair[1]]) as u32;
            }
        }

        // Sum the rest that does not fit the last 32-byte chunk,
        // taking by 2 bytes.
        let remainder = chunks.remainder();
        let mut word_pairs = remainder.chunks_exact(WORD_SIZE);
        for pair in &mut word_pairs {
            accum += u16::from_be_bytes([pair[0], pair[1]]) as u32;
        }

        // Add the last remaining odd byte, if any.
        if let Some(&byte) = word_pairs.remainder().first() {
            accum += (byte as u32) << 8;
        }

        propagate_carries(accum)
    }

    pub fn checksum_chunks_exact_no_bigchunk(data: &[u8]) -> u16 {
        let mut accum = 0;

        // Take 2 bytes at a time and sum them.
        let mut chunks = data.chunks_exact(2);
        for pair in &mut chunks {
            accum += u16::from_be_bytes([pair[0], pair[1]]) as u32;
        }

        // Add the last remaining odd byte, if any.
        if let Some(&byte) = chunks.remainder().first() {
            accum += (byte as u32) << 8;
        }

        propagate_carries(accum)
    }

    const fn propagate_carries(word: u32) -> u16 {
        let sum = (word >> 16) + (word & 0xffff);
        ((sum >> 16) as u16) + (sum as u16)
    }
}
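Before comparing timings, it may be worth confirming that the variants return the same value for every input length, including odd lengths and lengths that are not a multiple of 32. The following is a minimal sketch of such a check, not part of the PR: it copies two of the variants above, and `first_mismatch` is a hypothetical helper introduced only for illustration.

```rust
/// Fold a 32-bit accumulator into a 16-bit ones' complement sum
/// (same logic as `propagate_carries` in the benchmark module).
const fn propagate_carries(word: u32) -> u16 {
    let sum = (word >> 16) + (word & 0xffff);
    ((sum >> 16) as u16) + (sum as u16)
}

/// Variant from the thread: 32-byte chunks, then 2-byte words inside each chunk.
pub fn checksum_chunks_exact(data: &[u8]) -> u16 {
    let mut accum: u32 = 0;
    let mut chunks = data.chunks_exact(32);
    for chunk in &mut chunks {
        for pair in chunk.chunks_exact(2) {
            accum += u16::from_be_bytes([pair[0], pair[1]]) as u32;
        }
    }
    // Words left over after the last full 32-byte chunk.
    let mut words = chunks.remainder().chunks_exact(2);
    for pair in &mut words {
        accum += u16::from_be_bytes([pair[0], pair[1]]) as u32;
    }
    // A trailing odd byte is treated as the high byte of a final word.
    if let Some(&byte) = words.remainder().first() {
        accum += (byte as u32) << 8;
    }
    propagate_carries(accum)
}

/// Variant from the thread: a single pass over 2-byte words.
pub fn checksum_chunks_exact_no_bigchunk(data: &[u8]) -> u16 {
    let mut accum: u32 = 0;
    let mut words = data.chunks_exact(2);
    for pair in &mut words {
        accum += u16::from_be_bytes([pair[0], pair[1]]) as u32;
    }
    if let Some(&byte) = words.remainder().first() {
        accum += (byte as u32) << 8;
    }
    propagate_carries(accum)
}

/// Hypothetical helper: the first length in `0..=max_len` at which the two
/// variants disagree, or `None` if they agree everywhere.
pub fn first_mismatch(max_len: usize) -> Option<usize> {
    let data: Vec<u8> = (0..max_len).map(|x| (x % 256) as u8).collect();
    (0..=max_len).find(|&len| {
        let input = &data[..len];
        checksum_chunks_exact(input) != checksum_chunks_exact_no_bigchunk(input)
    })
}
```

Calling `first_mismatch(1460)` covers the same size range as the benchmarks in this thread. As a spot check on the arithmetic, the eight bytes `00 01 f2 03 f4 f5 f6 f7` sum to `0x2ddf0` as big-endian words, which folds to `0xddf2`.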

@datdenkikniet
Contributor

datdenkikniet commented Jun 20, 2025

Got a little carried away and made some graphs :P Data for each 10-byte interval, starting at 0 and ending at 1460. Y-axis is time in ns, x-axis is data size. The sawtooth pattern is probably caused by speedups at "SIMD boundaries", so it is not too noteworthy.

image

With target-cpu=native I also observe this weird spike around/at 1024 bytes, but chunks_exact_no_bigchunk is still the overall winner.

image

Details if you'd like to run it locally, too

Required code changes + bash script
#!/bin/bash

for i in $(seq 0 10 1460); do
    while read -r line; do
        data=$(echo "$line" | awk '{print $2 "," $5 "," $8}')
        data="${data::-1}"
        unit=$(echo "$line" | awk '{print $6}' | cut -d '/' -f1)
        echo "$i,$data,$unit"
    done < <(DATA_SIZE=$i cargo +nightly bench -- checksum 2> /dev/null | grep "... bench")
done

    fn build_data() -> Vec<u8> {
        let data_size: usize = std::env::var("DATA_SIZE").unwrap().parse().unwrap();
        (0..data_size).map(|x| (x % 256) as u8).collect()
    }
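The `build_data` change above panics when `DATA_SIZE` is unset or not a number, which is fine for the script but awkward for a plain `cargo bench`. A possible variant with a fallback is sketched here; `DEFAULT_DATA_SIZE` and `parse_data_size` are hypothetical names, and the default of 1024 is only an assumption taken from the earlier snippet.

```rust
use std::env;

/// Hypothetical default, used when `DATA_SIZE` is unset or unparsable.
pub const DEFAULT_DATA_SIZE: usize = 1024;

/// Parse an optional size string, falling back to the default on any failure.
pub fn parse_data_size(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .unwrap_or(DEFAULT_DATA_SIZE)
}

/// Build the benchmark input, with the size taken from `DATA_SIZE` if present.
pub fn build_data() -> Vec<u8> {
    let raw = env::var("DATA_SIZE").ok();
    let data_size = parse_data_size(raw.as_deref());
    (0..data_size).map(|x| (x % 256) as u8).collect()
}
```

Keeping the parsing in a separate function makes the fallback easy to check without touching the process environment.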

Raw data results: results.ods

To double-check, I also ran it with 32 byte intervals and the graph is indeed a lot smoother:

32 byte intervals (without `target-cpu`)

image

@caobug
Author

caobug commented Jun 20, 2025

I ran the benchmarks on my Mac Studio with an M1 chip, and the results are shown in the table below. While there are some minor differences compared to the results from a Mac mini with an M4 chip, the overall trends remain consistent.
Notably, the original implementation consistently performs best when the data size is below 1024 bytes. However, its performance degrades significantly once the size exceeds 1024 bytes.

Benchmark results:
Size Benchmark Time Unit
900 checksum::bench_checksum_original 17.48 ns
900 checksum::bench_checksum_indexed 34.89 ns
900 checksum::bench_checksum_chunks_exact 33.82 ns
900 checksum::bench_checksum_chunks_exact_no_bigchunk 36.77 ns
910 checksum::bench_checksum_original 18.97 ns
910 checksum::bench_checksum_indexed 36.13 ns
910 checksum::bench_checksum_chunks_exact 36.09 ns
910 checksum::bench_checksum_chunks_exact_no_bigchunk 38.00 ns
920 checksum::bench_checksum_original 19.91 ns
920 checksum::bench_checksum_indexed 36.00 ns
920 checksum::bench_checksum_chunks_exact 35.40 ns
920 checksum::bench_checksum_chunks_exact_no_bigchunk 38.05 ns
930 checksum::bench_checksum_original 17.50 ns
930 checksum::bench_checksum_indexed 35.79 ns
930 checksum::bench_checksum_chunks_exact 35.82 ns
930 checksum::bench_checksum_chunks_exact_no_bigchunk 38.02 ns
940 checksum::bench_checksum_original 19.19 ns
940 checksum::bench_checksum_indexed 36.08 ns
940 checksum::bench_checksum_chunks_exact 37.07 ns
940 checksum::bench_checksum_chunks_exact_no_bigchunk 37.86 ns
950 checksum::bench_checksum_original 20.50 ns
950 checksum::bench_checksum_indexed 37.38 ns
950 checksum::bench_checksum_chunks_exact 36.29 ns
950 checksum::bench_checksum_chunks_exact_no_bigchunk 37.91 ns
960 checksum::bench_checksum_original 17.70 ns
960 checksum::bench_checksum_indexed 35.40 ns
960 checksum::bench_checksum_chunks_exact 35.37 ns
960 checksum::bench_checksum_chunks_exact_no_bigchunk 38.49 ns
970 checksum::bench_checksum_original 19.21 ns
970 checksum::bench_checksum_indexed 37.70 ns
970 checksum::bench_checksum_chunks_exact 37.71 ns
970 checksum::bench_checksum_chunks_exact_no_bigchunk 39.95 ns
980 checksum::bench_checksum_original 20.79 ns
980 checksum::bench_checksum_indexed 38.03 ns
980 checksum::bench_checksum_chunks_exact 38.04 ns
980 checksum::bench_checksum_chunks_exact_no_bigchunk 39.97 ns
990 checksum::bench_checksum_original 22.32 ns
990 checksum::bench_checksum_indexed 39.28 ns
990 checksum::bench_checksum_chunks_exact 39.70 ns
990 checksum::bench_checksum_chunks_exact_no_bigchunk 41.22 ns
1000 checksum::bench_checksum_original 19.81 ns
1000 checksum::bench_checksum_indexed 38.95 ns
1000 checksum::bench_checksum_chunks_exact 37.53 ns
1000 checksum::bench_checksum_chunks_exact_no_bigchunk 40.02 ns
1010 checksum::bench_checksum_original 21.24 ns
1010 checksum::bench_checksum_indexed 38.01 ns
1010 checksum::bench_checksum_chunks_exact 39.33 ns
1010 checksum::bench_checksum_chunks_exact_no_bigchunk 40.87 ns
1020 checksum::bench_checksum_original 22.44 ns
1020 checksum::bench_checksum_indexed 39.39 ns
1020 checksum::bench_checksum_chunks_exact 40.57 ns
1020 checksum::bench_checksum_chunks_exact_no_bigchunk 40.93 ns
1030 checksum::bench_checksum_original 71.59 ns
1030 checksum::bench_checksum_indexed 39.51 ns
1030 checksum::bench_checksum_chunks_exact 38.75 ns
1030 checksum::bench_checksum_chunks_exact_no_bigchunk 41.90 ns
1040 checksum::bench_checksum_original 73.03 ns
1040 checksum::bench_checksum_indexed 39.50 ns
1040 checksum::bench_checksum_chunks_exact 39.89 ns
1040 checksum::bench_checksum_chunks_exact_no_bigchunk 41.89 ns
1050 checksum::bench_checksum_original 75.18 ns
1050 checksum::bench_checksum_indexed 41.07 ns
1050 checksum::bench_checksum_chunks_exact 41.54 ns
1050 checksum::bench_checksum_chunks_exact_no_bigchunk 43.13 ns
1060 checksum::bench_checksum_original 72.72 ns
1060 checksum::bench_checksum_indexed 40.93 ns
1060 checksum::bench_checksum_chunks_exact 40.91 ns
1060 checksum::bench_checksum_chunks_exact_no_bigchunk 41.58 ns
1070 checksum::bench_checksum_original 74.00 ns
1070 checksum::bench_checksum_indexed 42.18 ns
1070 checksum::bench_checksum_chunks_exact 42.22 ns
1070 checksum::bench_checksum_chunks_exact_no_bigchunk 43.42 ns
1080 checksum::bench_checksum_original 75.42 ns
1080 checksum::bench_checksum_indexed 42.50 ns
1080 checksum::bench_checksum_chunks_exact 42.53 ns
1080 checksum::bench_checksum_chunks_exact_no_bigchunk 42.83 ns
1090 checksum::bench_checksum_original 72.02 ns
1090 checksum::bench_checksum_indexed 41.54 ns
1090 checksum::bench_checksum_chunks_exact 41.52 ns
1090 checksum::bench_checksum_chunks_exact_no_bigchunk 43.80 ns
1100 checksum::bench_checksum_original 74.48 ns
1100 checksum::bench_checksum_indexed 41.63 ns
1100 checksum::bench_checksum_chunks_exact 42.84 ns
1100 checksum::bench_checksum_chunks_exact_no_bigchunk 43.72 ns

Update: I just realized that the unusual performance of bench_checksum_original might be related to NetworkEndian::read_u16. I replaced it with u16::from_be_bytes, and now its performance is much closer to the other implementations.
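The modified function was not posted, so the following is only a sketch of what the replacement described in the update might look like: the slice-shrinking loop of `checksum_original` is kept, and each `NetworkEndian::read_u16` call is swapped for `u16::from_be_bytes`. The name `checksum_original_be` is hypothetical.

```rust
/// Fold a 32-bit accumulator into a 16-bit ones' complement sum.
const fn propagate_carries(word: u32) -> u16 {
    let sum = (word >> 16) + (word & 0xffff);
    ((sum >> 16) as u16) + (sum as u16)
}

/// Same control flow as `checksum_original`, but without the byteorder crate.
pub fn checksum_original_be(mut data: &[u8]) -> u16 {
    const CHUNK_SIZE: usize = 32;
    let mut accum: u32 = 0;

    // For each 32-byte chunk, sum its big-endian 16-bit words.
    while data.len() >= CHUNK_SIZE {
        let mut d = &data[..CHUNK_SIZE];
        while d.len() >= 2 {
            accum += u16::from_be_bytes([d[0], d[1]]) as u32;
            d = &d[2..];
        }
        data = &data[CHUNK_SIZE..];
    }

    // Sum the words that did not fill a whole 32-byte chunk.
    while data.len() >= 2 {
        accum += u16::from_be_bytes([data[0], data[1]]) as u32;
        data = &data[2..];
    }

    // Add the last remaining odd byte, if any, as the high byte of a word.
    if let Some(&value) = data.first() {
        accum += (value as u32) << 8;
    }

    propagate_carries(accum)
}
```

Both forms read the same two bytes as a big-endian word, so the result should be unchanged; only the code the compiler generates differs.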

Benchmark results:
Size Benchmark Time Unit
900 checksum::bench_checksum_original 34.81 ns
900 checksum::bench_checksum_indexed 34.81 ns
900 checksum::bench_checksum_chunks_exact 34.83 ns
900 checksum::bench_checksum_chunks_exact_no_bigchunk 36.44 ns
910 checksum::bench_checksum_original 36.12 ns
910 checksum::bench_checksum_indexed 36.13 ns
910 checksum::bench_checksum_chunks_exact 36.09 ns
910 checksum::bench_checksum_chunks_exact_no_bigchunk 36.91 ns
920 checksum::bench_checksum_original 37.06 ns
920 checksum::bench_checksum_indexed 36.45 ns
920 checksum::bench_checksum_chunks_exact 36.44 ns
920 checksum::bench_checksum_chunks_exact_no_bigchunk 38.07 ns
930 checksum::bench_checksum_original 34.88 ns
930 checksum::bench_checksum_indexed 35.80 ns
930 checksum::bench_checksum_chunks_exact 35.82 ns
930 checksum::bench_checksum_chunks_exact_no_bigchunk 38.05 ns
940 checksum::bench_checksum_original 37.06 ns
940 checksum::bench_checksum_indexed 36.07 ns
940 checksum::bench_checksum_chunks_exact 36.01 ns
940 checksum::bench_checksum_chunks_exact_no_bigchunk 38.96 ns
950 checksum::bench_checksum_original 38.42 ns
950 checksum::bench_checksum_indexed 36.66 ns
950 checksum::bench_checksum_chunks_exact 37.42 ns
950 checksum::bench_checksum_chunks_exact_no_bigchunk 38.99 ns
960 checksum::bench_checksum_original 35.41 ns
960 checksum::bench_checksum_indexed 36.46 ns
960 checksum::bench_checksum_chunks_exact 36.43 ns
960 checksum::bench_checksum_chunks_exact_no_bigchunk 37.99 ns
970 checksum::bench_checksum_original 36.62 ns
970 checksum::bench_checksum_indexed 36.61 ns
970 checksum::bench_checksum_chunks_exact 36.67 ns
970 checksum::bench_checksum_chunks_exact_no_bigchunk 38.78 ns
980 checksum::bench_checksum_original 38.17 ns
980 checksum::bench_checksum_indexed 36.93 ns
980 checksum::bench_checksum_chunks_exact 38.06 ns
980 checksum::bench_checksum_chunks_exact_no_bigchunk 39.95 ns
990 checksum::bench_checksum_original 42.11 ns
990 checksum::bench_checksum_indexed 40.00 ns
990 checksum::bench_checksum_chunks_exact 39.70 ns
990 checksum::bench_checksum_chunks_exact_no_bigchunk 42.49 ns
1000 checksum::bench_checksum_original 37.53 ns
1000 checksum::bench_checksum_indexed 39.00 ns
1000 checksum::bench_checksum_chunks_exact 38.72 ns
1000 checksum::bench_checksum_chunks_exact_no_bigchunk 41.24 ns
1010 checksum::bench_checksum_original 39.16 ns
1010 checksum::bench_checksum_indexed 38.66 ns
1010 checksum::bench_checksum_chunks_exact 39.31 ns
1010 checksum::bench_checksum_chunks_exact_no_bigchunk 40.90 ns
1020 checksum::bench_checksum_original 41.88 ns
1020 checksum::bench_checksum_indexed 40.60 ns
1020 checksum::bench_checksum_chunks_exact 40.57 ns
1020 checksum::bench_checksum_chunks_exact_no_bigchunk 42.20 ns
1030 checksum::bench_checksum_original 39.02 ns
1030 checksum::bench_checksum_indexed 39.97 ns
1030 checksum::bench_checksum_chunks_exact 38.47 ns
1030 checksum::bench_checksum_chunks_exact_no_bigchunk 40.94 ns
1040 checksum::bench_checksum_original 40.02 ns
1040 checksum::bench_checksum_indexed 38.83 ns
1040 checksum::bench_checksum_chunks_exact 39.08 ns
1040 checksum::bench_checksum_chunks_exact_no_bigchunk 40.90 ns
1050 checksum::bench_checksum_original 41.56 ns
1050 checksum::bench_checksum_indexed 41.50 ns
1050 checksum::bench_checksum_chunks_exact 40.44 ns
1050 checksum::bench_checksum_chunks_exact_no_bigchunk 43.17 ns
1060 checksum::bench_checksum_original 39.70 ns
1060 checksum::bench_checksum_indexed 39.75 ns
1060 checksum::bench_checksum_chunks_exact 39.71 ns
1060 checksum::bench_checksum_chunks_exact_no_bigchunk 42.83 ns
1070 checksum::bench_checksum_original 40.98 ns
1070 checksum::bench_checksum_indexed 42.21 ns
1070 checksum::bench_checksum_chunks_exact 42.19 ns
1070 checksum::bench_checksum_chunks_exact_no_bigchunk 44.13 ns
1080 checksum::bench_checksum_original 42.21 ns
1080 checksum::bench_checksum_indexed 41.27 ns
1080 checksum::bench_checksum_chunks_exact 42.42 ns
1080 checksum::bench_checksum_chunks_exact_no_bigchunk 42.83 ns
1090 checksum::bench_checksum_original 41.51 ns
1090 checksum::bench_checksum_indexed 41.56 ns
1090 checksum::bench_checksum_chunks_exact 41.55 ns
1090 checksum::bench_checksum_chunks_exact_no_bigchunk 43.80 ns
1100 checksum::bench_checksum_original 42.83 ns
1100 checksum::bench_checksum_indexed 42.85 ns
1100 checksum::bench_checksum_chunks_exact 42.59 ns
1100 checksum::bench_checksum_chunks_exact_no_bigchunk 45.09 ns

@datdenkikniet
Contributor

datdenkikniet commented Jun 20, 2025

Well, that's an interesting turn of events...

Graph version for those interested:

[image: graph of the benchmark results]

The tradeoffs here are really not very obvious: I have no clue how much time is actually spent computing this checksum, nor how many people run it on their Macs, and I don't think we even have results for the "most relevant" targets (embedded ARM, IMO). So I will leave judgement to whoever has approval powers :P At least we have some data to show for it now!

@whitequark
Contributor

I think we should go with bench_checksum_chunks_exact_no_bigchunk.
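As a rough illustration of the general shape being discussed, the sketch below sums 16-bit words with `slice::chunks_exact(2)` and has no separate wide-chunk fast path. It is an assumption about what a "no big chunk" variant could look like, not the benchmarked function itself; `sum_chunks_exact` and `fold` are hypothetical names.

```rust
/// Sum the big-endian 16-bit words of `data` using `chunks_exact(2)`.
/// Assumed shape only: a single loop over two-byte chunks, with no
/// unrolled larger-chunk path in front of it.
fn sum_chunks_exact(data: &[u8]) -> u32 {
    let mut chunks = data.chunks_exact(2);
    let mut accum: u32 = 0;
    // Iterate by mutable reference so `remainder()` is still available after.
    for pair in &mut chunks {
        accum += u32::from(u16::from_be_bytes([pair[0], pair[1]]));
    }
    // With a chunk size of 2 the remainder is empty or a single byte,
    // which is treated as the high byte of a final word.
    if let [last] = chunks.remainder() {
        accum += u32::from(*last) << 8;
    }
    accum
}

/// Fold the carries of a 32-bit sum back into 16 bits (one's-complement add).
fn fold(mut accum: u32) -> u16 {
    while accum >> 16 != 0 {
        accum = (accum & 0xffff) + (accum >> 16);
    }
    accum as u16
}
```

Because every chunk handed out by `chunks_exact(2)` has exactly two elements, the `pair[0]` and `pair[1]` accesses are the kind the compiler can usually prove in bounds. Whether that actually happens on a given target is something to confirm by measurement.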
