Optimize checksum calculation #1065
Replace slice iteration with indexed access to reduce overhead and improve performance. CPU usage dropped by 8% on an Apple M1.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1065      +/-   ##
==========================================
- Coverage   81.17%   81.17%   -0.01%
==========================================
  Files          81       81
  Lines       28955    28954       -1
==========================================
- Hits        23503    23502       -1
  Misses       5452     5452
As a question: could you also compare this to using/try this with From what I know
Code for reference:
/// Compute an RFC 1071 compliant checksum (without the final complement).
pub fn data(data: &[u8]) -> u16 {
let mut accum = 0;
let mut chunks = data.chunks_exact(2);
for chunk in &mut chunks {
accum += u16::from_be_bytes([chunk[0], chunk[1]]) as u32;
}
// Add the last remaining odd byte, if any.
if let Some(data) = chunks.remainder().get(0) {
accum += (*data as u32) << 8;
}
propagate_carries(accum)
}
ETA: additional question, why does this use the 32 byte chunk thing? Is it a sort of cache optimization? On my PC the benchmarks run equally quick by just using
ETA2: Yes, it does seem that the 32 byte thing was added to "aid autovectorization" back in 2017. I wonder if that is still valid, since I imagine that a lot has changed in the last 8(!) years :) Heh, turns out that
ETA3: assuming this is measured using the normal benchmarks, it seems that using bare indexing (so removing the chunking logic and letting the compiler do its thing on the optimized, non-chunked version) is actually the fastest. Please try that in your comparison too.
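For readers who want to try the bare-indexing variant mentioned above, here is a minimal sketch. It is not the code from this PR: `checksum_indexed` is an illustrative name, and `propagate_carries` is a hypothetical stand-in for the crate's helper of the same name, written here as a plain one's-complement fold. The sketch assumes packet-sized inputs, so the `u32` accumulator cannot overflow.

```rust
/// Hypothetical stand-in for the helper referenced in the quoted code:
/// folds the upper 16 bits of the accumulator back into the lower 16 bits
/// (one's-complement addition, as described in RFC 1071).
fn propagate_carries(word: u32) -> u16 {
    // After the first fold the value fits in 17 bits.
    let sum = (word >> 16) + (word & 0xffff);
    // The second fold adds the single possible carry bit back in.
    ((sum >> 16) as u16) + (sum as u16)
}

/// Bare-indexing variant without any 32-byte chunking (illustrative only).
/// Assumes `data` is short enough that the u32 accumulator cannot overflow,
/// which holds for anything up to roughly 128 KiB.
pub fn checksum_indexed(data: &[u8]) -> u16 {
    let mut accum: u32 = 0;
    let mut i = 0;
    // Sum the input as big-endian 16-bit words.
    while i + 1 < data.len() {
        accum += u16::from_be_bytes([data[i], data[i + 1]]) as u32;
        i += 2;
    }
    // Add the last remaining odd byte, if any, as the high byte of a word.
    if i < data.len() {
        accum += (data[i] as u32) << 8;
    }
    propagate_carries(accum)
}
```

Like the quoted function, this returns the sum without the final complement. For the byte sequence `00 01 f2 03 f4 f5 f6 f7` used as the worked example in RFC 1071, the folded sum is `0xddf2`.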
I'm not sure how to feel about the fact that there's now a combination of indexed and slice accesses in the checksum function--it's definitely a lot harder to read.
I benchmarked three implementations: slice-based, raw pointer, and chunks_exact. The latter two (indexed and chunks_exact) showed nearly identical performance:
Note: I did not use NetworkEndian::read_u16. Results suggest it's not the optimal choice.
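A comparison like the one described above could be set up roughly as follows. This is only a sketch, not the benchmark actually used in this thread: `time_it`, `checksum_chunks_exact`, `checksum_indexed` and `fold` are illustrative names, and a real measurement would more likely use the project's existing benchmarks or a dedicated harness. `std::hint::black_box` is used so the optimizer does not discard the repeated calls.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical carry fold, standing in for the crate's own helper.
fn fold(word: u32) -> u16 {
    let sum = (word >> 16) + (word & 0xffff);
    ((sum >> 16) as u16) + (sum as u16)
}

/// Candidate 1: iterate over exact 2-byte chunks.
pub fn checksum_chunks_exact(data: &[u8]) -> u16 {
    let mut accum: u32 = 0;
    let mut chunks = data.chunks_exact(2);
    for pair in &mut chunks {
        accum += u16::from_be_bytes([pair[0], pair[1]]) as u32;
    }
    if let Some(&byte) = chunks.remainder().first() {
        accum += (byte as u32) << 8;
    }
    fold(accum)
}

/// Candidate 2: plain indexed access.
pub fn checksum_indexed(data: &[u8]) -> u16 {
    let mut accum: u32 = 0;
    let mut i = 0;
    while i + 1 < data.len() {
        accum += u16::from_be_bytes([data[i], data[i + 1]]) as u32;
        i += 2;
    }
    if i < data.len() {
        accum += (data[i] as u32) << 8;
    }
    fold(accum)
}

/// Run `f` over `data` `iters` times; return the last result and the elapsed time.
/// `black_box` keeps the compiler from hoisting or removing the calls.
pub fn time_it(f: fn(&[u8]) -> u16, data: &[u8], iters: u32) -> (u16, Duration) {
    let start = Instant::now();
    let mut last = 0;
    for _ in 0..iters {
        last = f(black_box(data));
    }
    (black_box(last), start.elapsed())
}
```

Calling `time_it(checksum_chunks_exact, &packet, 1_000_000)` and `time_it(checksum_indexed, &packet, 1_000_000)` on the same buffer, in a release build, gives two durations to compare. Both calls should also return the same checksum, which is a cheap sanity check on the candidates.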
I think CPU optimizations play a big role in how fast this is(n't)... When I run the benchmark on my Ryzen 7 5700X, I get the following, so the new and old impls are practically equally quick:
Adding a no-big-chunk version of
Code:
pub fn checksum_chunks_exact_no_bigchunk(data: &[u8]) -> u16 {
let mut accum = 0;
// ... take by 2 bytes and sum them.
let mut chunks = data.chunks_exact(2);
for pair in &mut chunks {
accum += u16::from_be_bytes([pair[0], pair[1]]) as u32;
}
// Add the last remaining odd byte, if any.
if let Some(&byte) = chunks.remainder().first() {
accum += (byte as u32) << 8;
}
propagate_carries(accum)
}
so it's clearly a little slower. However, re-running with
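For comparison with the no-big-chunk version above, a manually chunked variant might look roughly like this. It is a sketch of the "32 byte chunk" idea as discussed in this thread, not a copy of the crate's code: `checksum_bigchunk`, `checksum_plain` and `fold` are illustrative names, and the use of nested `chunks_exact` calls is an assumption made for brevity.

```rust
/// Hypothetical carry fold, standing in for the crate's own helper.
fn fold(word: u32) -> u16 {
    let sum = (word >> 16) + (word & 0xffff);
    ((sum >> 16) as u16) + (sum as u16)
}

/// Manually chunked variant: walk the input in fixed 32-byte blocks,
/// summing 2 bytes at a time inside each block. The fixed inner trip
/// count is what was meant to help the autovectorizer.
pub fn checksum_bigchunk(data: &[u8]) -> u16 {
    const CHUNK_SIZE: usize = 32;
    let mut accum: u32 = 0;

    let mut blocks = data.chunks_exact(CHUNK_SIZE);
    for block in &mut blocks {
        for pair in block.chunks_exact(2) {
            accum += u16::from_be_bytes([pair[0], pair[1]]) as u32;
        }
    }

    // Sum whatever did not fill a whole 32-byte block, 2 bytes at a time.
    let mut pairs = blocks.remainder().chunks_exact(2);
    for pair in &mut pairs {
        accum += u16::from_be_bytes([pair[0], pair[1]]) as u32;
    }

    // Add the last remaining odd byte, if any.
    if let Some(&byte) = pairs.remainder().first() {
        accum += (byte as u32) << 8;
    }
    fold(accum)
}

/// Non-chunked reference used to check that both variants agree.
pub fn checksum_plain(data: &[u8]) -> u16 {
    let mut accum: u32 = 0;
    let mut pairs = data.chunks_exact(2);
    for pair in &mut pairs {
        accum += u16::from_be_bytes([pair[0], pair[1]]) as u32;
    }
    if let Some(&byte) = pairs.remainder().first() {
        accum += (byte as u32) << 8;
    }
    fold(accum)
}
```

Both functions compute the same value for any input, so the only question is which form the compiler vectorizes better on a given target. That depends on the CPU and compiler flags, as the differing results in this thread suggest.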
Edit: In the x86 assembly of manually chunked code with `target-cpu=native` (warning: long!)
.LCPI0_1:
.byte 1
.byte 0
.byte 3
.byte 2
.byte 5
.byte 4
.byte 7
.byte 6
.byte 9
.byte 8
.byte 11
.byte 10
.byte 13
.byte 12
.byte 15
.byte 14
example::checksum_indexed::h5c5cab1f5c9c8f52:
sub rsp, 88
xor r8d, r8d
cmp rsi, 32
jb .LBB0_1
lea rcx, [rsi - 32]
xor r8d, r8d
cmp rcx, 992
jae .LBB0_4
mov rax, rdi
jmp .LBB0_7
.LBB0_1:
mov rax, rdi
jmp .LBB0_9
.LBB0_4:
shr rcx, 5
vpxor xmm1, xmm1, xmm1
vpxor xmm6, xmm6, xmm6
vpxor xmm3, xmm3, xmm3
vpxor xmm2, xmm2, xmm2
inc rcx
mov rdx, rcx
and rdx, -32
mov r8, rdx
shl r8, 5
lea rax, [rdi + r8]
sub rsi, r8
xor r8d, r8d
.LBB0_5:
mov r9, r8
shl r9, 5
vmovdqu ymmword ptr [rsp - 32], ymm6
vmovdqu ymmword ptr [rsp - 96], ymm1
add r8, 32
movzx r10d, byte ptr [rdi + r9]
vmovd xmm0, r10d
movzx r10d, byte ptr [rdi + r9 + 256]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 32], 1
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 64], 2
vmovd xmm4, r10d
movzx r10d, byte ptr [rdi + r9 + 512]
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 288], 1
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 96], 3
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 320], 2
vmovd xmm5, r10d
movzx r10d, byte ptr [rdi + r9 + 768]
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 544], 1
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 128], 4
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 352], 3
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 576], 2
vmovd xmm6, r10d
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 800], 1
movzx r10d, byte ptr [rdi + r9 + 1]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 160], 5
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 384], 4
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 608], 3
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 832], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 192], 6
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 416], 5
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 640], 4
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 864], 3
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 448], 6
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 672], 5
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 896], 4
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 480], 7
vpinsrb xmm7, xmm5, byte ptr [rdi + r9 + 704], 6
vpinsrb xmm5, xmm6, byte ptr [rdi + r9 + 928], 5
vmovd xmm6, r10d
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 33], 1
movzx r10d, byte ptr [rdi + r9 + 257]
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 736], 7
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 65], 2
vpinsrb xmm8, xmm5, byte ptr [rdi + r9 + 960], 6
vpinsrb xmm5, xmm6, byte ptr [rdi + r9 + 97], 3
vpinsrb xmm6, xmm0, byte ptr [rdi + r9 + 224], 7
vmovd xmm0, r10d
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 289], 1
movzx r10d, byte ptr [rdi + r9 + 513]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 321], 2
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 129], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 353], 3
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 161], 5
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 385], 4
vpinsrb xmm9, xmm5, byte ptr [rdi + r9 + 193], 6
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 417], 5
vpinsrb xmm10, xmm9, byte ptr [rdi + r9 + 225], 7
vpinsrb xmm5, xmm0, byte ptr [rdi + r9 + 449], 6
vmovd xmm0, r10d
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 545], 1
movzx r10d, byte ptr [rdi + r9 + 769]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 577], 2
vpinsrb xmm13, xmm5, byte ptr [rdi + r9 + 481], 7
vpunpcklbw xmm1, xmm10, xmm6
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 609], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 641], 4
vpunpcklbw xmm4, xmm13, xmm4
vpmovzxwd ymm4, xmm4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 673], 5
vpinsrb xmm11, xmm0, byte ptr [rdi + r9 + 705], 6
vpinsrb xmm0, xmm8, byte ptr [rdi + r9 + 992], 7
vmovd xmm8, r10d
movzx r10d, byte ptr [rdi + r9 + 2]
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 801], 1
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 833], 2
vpinsrb xmm14, xmm11, byte ptr [rdi + r9 + 737], 7
vmovd xmm9, r10d
vpinsrb xmm5, xmm9, byte ptr [rdi + r9 + 34], 1
movzx r10d, byte ptr [rdi + r9 + 258]
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 865], 3
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 66], 2
vmovd xmm9, r10d
movzx r10d, byte ptr [rdi + r9 + 514]
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 290], 1
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 897], 4
vpunpcklbw xmm6, xmm14, xmm7
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 98], 3
vmovdqa xmmword ptr [rsp - 128], xmm6
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 322], 2
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 929], 5
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 130], 4
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 961], 6
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 162], 5
vpinsrb xmm12, xmm8, byte ptr [rdi + r9 + 993], 7
vpinsrb xmm8, xmm9, byte ptr [rdi + r9 + 354], 3
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 194], 6
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 386], 4
vpinsrb xmm11, xmm5, byte ptr [rdi + r9 + 226], 7
vmovd xmm5, r10d
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 546], 1
movzx r10d, byte ptr [rdi + r9 + 770]
vpunpcklbw xmm0, xmm12, xmm0
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 418], 5
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 578], 2
vmovdqa xmmword ptr [rsp - 64], xmm0
vmovd xmm9, r10d
movzx r10d, byte ptr [rdi + r9 + 3]
vpinsrb xmm15, xmm9, byte ptr [rdi + r9 + 802], 1
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 450], 6
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 610], 3
vmovd xmm10, r10d
movzx r10d, byte ptr [rdi + r9 + 259]
vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 35], 1
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 482], 7
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 642], 4
vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 67], 2
vmovd xmm13, r10d
vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 291], 1
movzx r10d, byte ptr [rdi + r9 + 515]
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 674], 5
vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 323], 2
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 706], 6
vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 355], 3
vpinsrb xmm9, xmm5, byte ptr [rdi + r9 + 738], 7
vpinsrb xmm5, xmm15, byte ptr [rdi + r9 + 834], 2
vpinsrb xmm15, xmm10, byte ptr [rdi + r9 + 99], 3
vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 387], 4
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 866], 3
vpinsrb xmm7, xmm13, byte ptr [rdi + r9 + 419], 5
vmovd xmm13, r10d
vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 547], 1
movzx r10d, byte ptr [rdi + r9 + 771]
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 898], 4
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 451], 6
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 930], 5
vmovd xmm0, r10d
movzx r10d, byte ptr [rdi + r9 + 4]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 803], 1
vpinsrb xmm14, xmm7, byte ptr [rdi + r9 + 483], 7
vpinsrb xmm7, xmm13, byte ptr [rdi + r9 + 579], 2
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 962], 6
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 835], 2
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 611], 3
vpinsrb xmm10, xmm5, byte ptr [rdi + r9 + 994], 7
vpinsrb xmm5, xmm15, byte ptr [rdi + r9 + 131], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 867], 3
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 643], 4
vpunpcklbw xmm8, xmm14, xmm8
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 163], 5
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 899], 4
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 675], 5
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 195], 6
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 931], 5
vpinsrb xmm13, xmm7, byte ptr [rdi + r9 + 707], 6
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 227], 7
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 963], 6
vpinsrb xmm12, xmm13, byte ptr [rdi + r9 + 739], 7
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 995], 7
vpunpcklbw xmm7, xmm5, xmm11
vmovd xmm5, r10d
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 36], 1
movzx r10d, byte ptr [rdi + r9 + 260]
vpunpcklbw xmm9, xmm12, xmm9
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 68], 2
vpunpcklbw xmm10, xmm0, xmm10
vpmovzxwd ymm9, xmm9
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 100], 3
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 132], 4
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 164], 5
vpinsrb xmm11, xmm5, byte ptr [rdi + r9 + 196], 6
vmovd xmm5, r10d
movzx r10d, byte ptr [rdi + r9 + 516]
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 292], 1
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 228], 7
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 324], 2
vmovd xmm12, r10d
vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 548], 1
movzx r10d, byte ptr [rdi + r9 + 5]
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 356], 3
vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 580], 2
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 388], 4
vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 612], 3
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 420], 5
vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 644], 4
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 452], 6
vpinsrb xmm0, xmm12, byte ptr [rdi + r9 + 676], 5
vmovd xmm12, r10d
vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 37], 1
movzx r10d, byte ptr [rdi + r9 + 261]
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 484], 7
vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 69], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 708], 6
vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 101], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 740], 7
vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 133], 4
vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 165], 5
vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 197], 6
vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 229], 7
vpunpcklbw xmm11, xmm12, xmm11
vmovd xmm12, r10d
vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 293], 1
movzx r10d, byte ptr [rdi + r9 + 517]
vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 325], 2
vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 357], 3
vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 389], 4
vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 421], 5
vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 453], 6
vpinsrb xmm12, xmm12, byte ptr [rdi + r9 + 485], 7
vpunpcklbw xmm12, xmm12, xmm5
vmovd xmm5, r10d
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 549], 1
movzx r10d, byte ptr [rdi + r9 + 772]
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 581], 2
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 613], 3
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 645], 4
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 677], 5
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 709], 6
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 741], 7
vpunpcklbw xmm15, xmm5, xmm0
vmovd xmm0, r10d
movzx r10d, byte ptr [rdi + r9 + 773]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 804], 1
vpmovzxwd ymm15, xmm15
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 836], 2
vmovd xmm5, r10d
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 805], 1
movzx r10d, byte ptr [rdi + r9 + 6]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 868], 3
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 837], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 900], 4
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 869], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 932], 5
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 901], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 964], 6
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 933], 5
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 996], 7
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 965], 6
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 997], 7
vpunpcklbw xmm6, xmm5, xmm0
vpmovzxwd ymm5, xmm1
vpaddd ymm0, ymm5, ymmword ptr [rsp - 96]
vmovd xmm5, r10d
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 38], 1
movzx r10d, byte ptr [rdi + r9 + 7]
vpmovzxwd ymm1, xmmword ptr [rsp - 64]
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 70], 2
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 102], 3
vmovdqu ymmword ptr [rsp - 96], ymm0
vpaddd ymm0, ymm4, ymmword ptr [rsp - 32]
vpaddd ymm1, ymm2, ymm1
vpmovzxwd ymm2, xmm6
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 134], 4
vmovdqu ymmword ptr [rsp - 32], ymm1
vpinsrb xmm4, xmm5, byte ptr [rdi + r9 + 166], 5
vmovd xmm5, r10d
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 39], 1
movzx r10d, byte ptr [rdi + r9 + 262]
vmovdqu ymmword ptr [rsp + 16], ymm0
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 71], 2
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 198], 6
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 103], 3
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 230], 7
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 135], 4
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 167], 5
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 199], 6
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 231], 7
vpunpcklbw xmm13, xmm5, xmm4
vmovd xmm4, r10d
movzx r10d, byte ptr [rdi + r9 + 263]
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 294], 1
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 326], 2
vmovd xmm5, r10d
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 295], 1
movzx r10d, byte ptr [rdi + r9 + 518]
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 358], 3
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 327], 2
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 390], 4
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 359], 3
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 422], 5
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 391], 4
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 454], 6
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 423], 5
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 486], 7
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 455], 6
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 487], 7
vpunpcklbw xmm14, xmm5, xmm4
vmovd xmm4, r10d
movzx r10d, byte ptr [rdi + r9 + 519]
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 550], 1
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 582], 2
vmovd xmm5, r10d
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 551], 1
movzx r10d, byte ptr [rdi + r9 + 774]
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 614], 3
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 583], 2
vmovd xmm1, r10d
movzx r10d, byte ptr [rdi + r9 + 775]
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 806], 1
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 646], 4
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 615], 3
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 838], 2
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 678], 5
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 647], 4
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 870], 3
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 710], 6
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 679], 5
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 902], 4
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 742], 7
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 711], 6
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 934], 5
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 743], 7
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 966], 6
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 998], 7
vpunpcklbw xmm4, xmm5, xmm4
vpmovzxwd ymm5, xmmword ptr [rsp - 128]
vpmovzxwd ymm4, xmm4
vpaddd ymm0, ymm3, ymm5
vpmovzxwd ymm5, xmm8
vpmovzxwd ymm8, xmm11
vpmovzxwd ymm11, xmm12
vmovdqu ymmword ptr [rsp + 48], ymm0
vpmovzxwd ymm0, xmm7
vpmovzxwd ymm7, xmm10
vpaddd ymm12, ymm11, ymm5
vpaddd ymm11, ymm9, ymm15
vpaddd ymm6, ymm11, ymmword ptr [rsp + 48]
vpaddd ymm8, ymm8, ymm0
vmovd xmm0, r10d
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 807], 1
movzx r10d, byte ptr [rdi + r9 + 8]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 839], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 871], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 903], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 935], 5
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 967], 6
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 999], 7
vpunpcklbw xmm1, xmm0, xmm1
vmovd xmm0, r10d
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 40], 1
movzx r10d, byte ptr [rdi + r9 + 264]
vpmovzxwd ymm1, xmm1
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 72], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 104], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 136], 4
vpinsrb xmm10, xmm0, byte ptr [rdi + r9 + 168], 5
vmovd xmm0, r10d
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 296], 1
movzx r10d, byte ptr [rdi + r9 + 520]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 328], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 360], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 392], 4
vpinsrb xmm9, xmm0, byte ptr [rdi + r9 + 424], 5
vmovd xmm0, r10d
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 552], 1
movzx r10d, byte ptr [rdi + r9 + 9]
vpinsrb xmm5, xmm0, byte ptr [rdi + r9 + 584], 2
vpaddd ymm0, ymm7, ymm2
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 456], 6
vpinsrb xmm2, xmm5, byte ptr [rdi + r9 + 616], 3
vpinsrb xmm5, xmm10, byte ptr [rdi + r9 + 200], 6
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 488], 7
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 648], 4
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 232], 7
vpinsrb xmm7, xmm2, byte ptr [rdi + r9 + 680], 5
vmovd xmm2, r10d
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 41], 1
movzx r10d, byte ptr [rdi + r9 + 265]
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 73], 2
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 712], 6
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 105], 3
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 744], 7
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 137], 4
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 169], 5
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 201], 6
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 233], 7
vpunpcklbw xmm5, xmm2, xmm5
vmovd xmm2, r10d
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 297], 1
movzx r10d, byte ptr [rdi + r9 + 521]
vpmovzxwd ymm5, xmm5
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 329], 2
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 361], 3
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 393], 4
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 425], 5
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 457], 6
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 489], 7
vpunpcklbw xmm9, xmm2, xmm9
vmovd xmm2, r10d
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 553], 1
movzx r10d, byte ptr [rdi + r9 + 776]
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 585], 2
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 617], 3
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 649], 4
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 681], 5
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 713], 6
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 745], 7
vpunpcklbw xmm2, xmm2, xmm7
vmovd xmm7, r10d
movzx r10d, byte ptr [rdi + r9 + 777]
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 808], 1
vpmovzxwd ymm2, xmm2
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 840], 2
vmovd xmm10, r10d
vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 809], 1
movzx r10d, byte ptr [rdi + r9 + 10]
vpaddd ymm2, ymm4, ymm2
vmovdqu ymmword ptr [rsp - 64], ymm2
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 872], 3
vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 841], 2
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 904], 4
vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 873], 3
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 936], 5
vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 905], 4
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 968], 6
vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 937], 5
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 1000], 7
vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 969], 6
vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 1001], 7
vpunpcklbw xmm7, xmm10, xmm7
vpmovzxwd ymm10, xmm13
vpmovzxwd ymm13, xmm14
vpmovzxwd ymm14, xmm9
vpmovzxwd ymm15, xmm7
vpaddd ymm10, ymm10, ymm5
vmovd xmm5, r10d
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 42], 1
movzx r10d, byte ptr [rdi + r9 + 266]
vpaddd ymm13, ymm13, ymm14
vpaddd ymm1, ymm15, ymm1
vmovdqu ymmword ptr [rsp - 128], ymm1
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 74], 2
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 106], 3
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 138], 4
vpinsrb xmm9, xmm5, byte ptr [rdi + r9 + 170], 5
vmovd xmm5, r10d
movzx r10d, byte ptr [rdi + r9 + 522]
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 298], 1
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 330], 2
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 202], 6
vmovd xmm2, r10d
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 554], 1
movzx r10d, byte ptr [rdi + r9 + 11]
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 362], 3
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 234], 7
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 586], 2
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 394], 4
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 618], 3
vpinsrb xmm7, xmm5, byte ptr [rdi + r9 + 426], 5
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 650], 4
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 458], 6
vpinsrb xmm5, xmm2, byte ptr [rdi + r9 + 682], 5
vmovd xmm2, r10d
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 43], 1
movzx r10d, byte ptr [rdi + r9 + 267]
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 490], 7
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 75], 2
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 714], 6
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 107], 3
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 746], 7
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 139], 4
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 171], 5
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 203], 6
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 235], 7
vpunpcklbw xmm14, xmm2, xmm9
vmovd xmm2, r10d
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 299], 1
movzx r10d, byte ptr [rdi + r9 + 523]
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 331], 2
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 363], 3
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 395], 4
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 427], 5
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 459], 6
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 491], 7
vpunpcklbw xmm15, xmm2, xmm7
vmovd xmm2, r10d
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 555], 1
movzx r10d, byte ptr [rdi + r9 + 778]
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 587], 2
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 619], 3
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 651], 4
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 683], 5
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 715], 6
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 747], 7
vpunpcklbw xmm1, xmm2, xmm5
vmovd xmm2, r10d
movzx r10d, byte ptr [rdi + r9 + 779]
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 810], 1
vmovdqa xmmword ptr [rsp], xmm1
vpaddd ymm1, ymm8, ymmword ptr [rsp - 96]
vpaddd ymm8, ymm12, ymmword ptr [rsp + 16]
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 842], 2
vmovd xmm5, r10d
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 811], 1
movzx r10d, byte ptr [rdi + r9 + 12]
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 874], 3
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 843], 2
vmovdqu ymmword ptr [rsp - 96], ymm1
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 906], 4
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 875], 3
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 938], 5
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 907], 4
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 970], 6
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 939], 5
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 1002], 7
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 971], 6
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 1003], 7
vpunpcklbw xmm9, xmm5, xmm2
vmovd xmm5, r10d
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 44], 1
movzx r10d, byte ptr [rdi + r9 + 268]
vpaddd ymm2, ymm0, ymmword ptr [rsp - 32]
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 76], 2
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 108], 3
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 140], 4
vpinsrb xmm3, xmm5, byte ptr [rdi + r9 + 172], 5
vmovd xmm5, r10d
movzx r10d, byte ptr [rdi + r9 + 13]
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 300], 1
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 204], 6
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 332], 2
vmovd xmm11, r10d
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 45], 1
movzx r10d, byte ptr [rdi + r9 + 269]
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 236], 7
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 364], 3
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 77], 2
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 396], 4
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 109], 3
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 428], 5
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 141], 4
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 460], 6
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 173], 5
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 492], 7
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 205], 6
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 237], 7
vpunpcklbw xmm3, xmm11, xmm3
vmovd xmm11, r10d
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 301], 1
movzx r10d, byte ptr [rdi + r9 + 524]
vpmovzxwd ymm3, xmm3
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 333], 2
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 365], 3
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 397], 4
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 429], 5
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 461], 6
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 493], 7
vpunpcklbw xmm12, xmm11, xmm5
vmovd xmm5, r10d
movzx r10d, byte ptr [rdi + r9 + 525]
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 556], 1
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 588], 2
vmovd xmm11, r10d
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 557], 1
movzx r10d, byte ptr [rdi + r9 + 780]
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 620], 3
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 589], 2
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 652], 4
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 621], 3
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 684], 5
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 653], 4
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 716], 6
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 685], 5
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 748], 7
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 717], 6
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 749], 7
vpunpcklbw xmm5, xmm11, xmm5
vmovd xmm11, r10d
movzx r10d, byte ptr [rdi + r9 + 781]
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 812], 1
vpmovzxwd ymm5, xmm5
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 844], 2
vmovd xmm1, r10d
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 813], 1
movzx r10d, byte ptr [rdi + r9 + 14]
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 876], 3
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 845], 2
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 908], 4
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 877], 3
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 940], 5
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 909], 4
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 972], 6
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 941], 5
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 1004], 7
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 973], 6
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 1005], 7
vpunpcklbw xmm11, xmm1, xmm11
vpmovzxwd ymm1, xmm14
vpmovzxwd ymm14, xmm15
vpaddd ymm10, ymm10, ymm1
vmovd xmm1, r10d
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 46], 1
movzx r10d, byte ptr [rdi + r9 + 15]
vpaddd ymm13, ymm13, ymm14
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 78], 2
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 110], 3
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 142], 4
vpinsrb xmm14, xmm1, byte ptr [rdi + r9 + 174], 5
vmovd xmm1, r10d
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 47], 1
movzx r10d, byte ptr [rdi + r9 + 270]
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 79], 2
vpinsrb xmm14, xmm14, byte ptr [rdi + r9 + 206], 6
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 111], 3
vpinsrb xmm14, xmm14, byte ptr [rdi + r9 + 238], 7
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 143], 4
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 175], 5
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 207], 6
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 239], 7
vpunpcklbw xmm14, xmm1, xmm14
vmovd xmm1, r10d
movzx r10d, byte ptr [rdi + r9 + 271]
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 302], 1
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 334], 2
vmovd xmm15, r10d
vpinsrb xmm15, xmm15, byte ptr [rdi + r9 + 303], 1
movzx r10d, byte ptr [rdi + r9 + 526]
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 366], 3
vpinsrb xmm15, xmm15, byte ptr [rdi + r9 + 335], 2
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 398], 4
vpinsrb xmm15, xmm15, byte ptr [rdi + r9 + 367], 3
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 430], 5
vpinsrb xmm15, xmm15, byte ptr [rdi + r9 + 399], 4
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 462], 6
vpinsrb xmm15, xmm15, byte ptr [rdi + r9 + 431], 5
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 494], 7
vpinsrb xmm15, xmm15, byte ptr [rdi + r9 + 463], 6
vpinsrb xmm15, xmm15, byte ptr [rdi + r9 + 495], 7
vpunpcklbw xmm15, xmm15, xmm1
vmovd xmm1, r10d
movzx r10d, byte ptr [rdi + r9 + 527]
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 558], 1
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 590], 2
vmovd xmm4, r10d
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 559], 1
movzx r10d, byte ptr [rdi + r9 + 782]
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 622], 3
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 591], 2
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 654], 4
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 623], 3
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 686], 5
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 655], 4
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 718], 6
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 687], 5
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 750], 7
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 719], 6
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 751], 7
vpunpcklbw xmm1, xmm4, xmm1
vmovd xmm4, r10d
movzx r10d, byte ptr [rdi + r9 + 783]
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 814], 1
vpmovzxwd ymm1, xmm1
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 846], 2
vmovd xmm7, r10d
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 815], 1
movzx r10d, byte ptr [rdi + r9 + 16]
vpaddd ymm5, ymm5, ymm1
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 878], 3
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 847], 2
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 910], 4
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 879], 3
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 942], 5
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 911], 4
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 974], 6
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 943], 5
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 1006], 7
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 975], 6
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 1007], 7
vpunpcklbw xmm4, xmm7, xmm4
vpmovzxwd ymm7, xmm14
vpmovzxwd ymm1, xmm4
vpaddd ymm14, ymm3, ymm7
vpmovzxwd ymm3, xmmword ptr [rsp]
vpmovzxwd ymm7, xmm12
vpaddd ymm0, ymm3, ymmword ptr [rsp - 64]
vpmovzxwd ymm3, xmm9
vpmovzxwd ymm9, xmm11
vpmovzxwd ymm11, xmm15
vpaddd ymm4, ymm9, ymm1
vmovd xmm1, r10d
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 48], 1
movzx r10d, byte ptr [rdi + r9 + 272]
vpaddd ymm12, ymm11, ymm7
vpaddd ymm9, ymm3, ymmword ptr [rsp - 128]
vpaddd ymm3, ymm10, ymmword ptr [rsp - 96]
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 80], 2
vpaddd ymm0, ymm6, ymm0
vmovdqu ymmword ptr [rsp - 96], ymm0
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 112], 3
vpaddd ymm2, ymm9, ymm2
vmovdqu ymmword ptr [rsp - 128], ymm3
vmovdqu ymmword ptr [rsp - 32], ymm2
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 144], 4
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 176], 5
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 208], 6
vpinsrb xmm7, xmm1, byte ptr [rdi + r9 + 240], 7
vpaddd ymm1, ymm8, ymm13
vmovdqu ymmword ptr [rsp + 16], ymm1
vmovd xmm1, r10d
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 304], 1
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 336], 2
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 368], 3
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 400], 4
movzx r10d, byte ptr [rdi + r9 + 528]
vpinsrb xmm0, xmm1, byte ptr [rdi + r9 + 432], 5
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 464], 6
vpinsrb xmm1, xmm0, byte ptr [rdi + r9 + 496], 7
vmovd xmm0, r10d
movzx r10d, byte ptr [rdi + r9 + 784]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 560], 1
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 592], 2
vmovd xmm2, r10d
movzx r10d, byte ptr [rdi + r9 + 17]
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 816], 1
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 624], 3
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 848], 2
vmovd xmm6, r10d
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 49], 1
movzx r10d, byte ptr [rdi + r9 + 273]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 656], 4
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 880], 3
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 81], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 688], 5
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 912], 4
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 113], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 720], 6
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 944], 5
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 145], 4
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 976], 6
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 177], 5
vpinsrb xmm8, xmm2, byte ptr [rdi + r9 + 1008], 7
vpinsrb xmm9, xmm6, byte ptr [rdi + r9 + 209], 6
vpinsrb xmm6, xmm0, byte ptr [rdi + r9 + 752], 7
vmovd xmm0, r10d
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 305], 1
movzx r10d, byte ptr [rdi + r9 + 529]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 337], 2
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 241], 7
vmovd xmm2, r10d
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 561], 1
movzx r10d, byte ptr [rdi + r9 + 785]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 369], 3
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 593], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 401], 4
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 625], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 433], 5
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 657], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 465], 6
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 689], 5
vpinsrb xmm10, xmm0, byte ptr [rdi + r9 + 497], 7
vmovd xmm0, r10d
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 817], 1
movzx r10d, byte ptr [rdi + r9 + 18]
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 721], 6
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 849], 2
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 753], 7
vpinsrb xmm11, xmm0, byte ptr [rdi + r9 + 881], 3
vpunpcklbw xmm0, xmm9, xmm7
vmovd xmm9, r10d
movzx r10d, byte ptr [rdi + r9 + 274]
vpunpcklbw xmm1, xmm10, xmm1
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 50], 1
vpmovzxwd ymm0, xmm0
vpmovzxwd ymm1, xmm1
vpinsrb xmm7, xmm11, byte ptr [rdi + r9 + 913], 4
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 82], 2
vmovd xmm10, r10d
movzx r10d, byte ptr [rdi + r9 + 530]
vpunpcklbw xmm13, xmm2, xmm6
vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 306], 1
vpaddd ymm12, ymm12, ymm1
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 945], 5
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 114], 3
vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 338], 2
vmovd xmm2, r10d
movzx r10d, byte ptr [rdi + r9 + 19]
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 562], 1
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 146], 4
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 977], 6
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 594], 2
vmovd xmm6, r10d
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 51], 1
movzx r10d, byte ptr [rdi + r9 + 275]
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 178], 5
vpinsrb xmm11, xmm7, byte ptr [rdi + r9 + 1009], 7
vpinsrb xmm7, xmm10, byte ptr [rdi + r9 + 370], 3
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 626], 3
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 83], 2
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 210], 6
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 402], 4
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 658], 4
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 115], 3
vpunpcklbw xmm10, xmm11, xmm8
vpinsrb xmm8, xmm9, byte ptr [rdi + r9 + 242], 7
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 434], 5
vpaddd ymm11, ymm14, ymm0
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 690], 5
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 147], 4
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 466], 6
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 722], 6
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 179], 5
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 498], 7
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 754], 7
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 211], 6
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 243], 7
vpunpcklbw xmm3, xmm6, xmm8
vmovd xmm8, r10d
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 307], 1
movzx r10d, byte ptr [rdi + r9 + 531]
vmovdqa xmmword ptr [rsp - 64], xmm3
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 339], 2
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 371], 3
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 403], 4
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 435], 5
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 467], 6
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 499], 7
vpunpcklbw xmm7, xmm8, xmm7
vmovd xmm8, r10d
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 563], 1
movzx r10d, byte ptr [rdi + r9 + 786]
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 595], 2
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 627], 3
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 659], 4
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 691], 5
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 723], 6
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 755], 7
vpunpcklbw xmm8, xmm8, xmm2
vmovd xmm2, r10d
movzx r10d, byte ptr [rdi + r9 + 787]
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 818], 1
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 850], 2
vmovd xmm9, r10d
movzx r10d, byte ptr [rdi + r9 + 20]
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 819], 1
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 882], 3
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 851], 2
vmovd xmm0, r10d
movzx r10d, byte ptr [rdi + r9 + 21]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 52], 1
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 914], 4
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 883], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 84], 2
vmovd xmm1, r10d
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 53], 1
movzx r10d, byte ptr [rdi + r9 + 276]
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 946], 5
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 915], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 116], 3
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 85], 2
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 978], 6
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 947], 5
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 148], 4
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 117], 3
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 1010], 7
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 979], 6
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 180], 5
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 149], 4
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 1011], 7
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 212], 6
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 181], 5
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 244], 7
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 213], 6
vpunpcklbw xmm9, xmm9, xmm2
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 245], 7
vpunpcklbw xmm14, xmm1, xmm0
vmovd xmm0, r10d
movzx r10d, byte ptr [rdi + r9 + 277]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 308], 1
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 340], 2
vmovd xmm1, r10d
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 309], 1
movzx r10d, byte ptr [rdi + r9 + 532]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 372], 3
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 341], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 404], 4
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 373], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 436], 5
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 405], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 468], 6
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 437], 5
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 500], 7
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 469], 6
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 501], 7
vpunpcklbw xmm15, xmm1, xmm0
vmovd xmm0, r10d
movzx r10d, byte ptr [rdi + r9 + 533]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 564], 1
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 596], 2
vmovd xmm1, r10d
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 565], 1
movzx r10d, byte ptr [rdi + r9 + 788]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 628], 3
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 597], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 660], 4
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 629], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 692], 5
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 661], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 724], 6
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 693], 5
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 756], 7
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 725], 6
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 757], 7
vpunpcklbw xmm2, xmm1, xmm0
vpmovzxwd ymm0, xmm13
vpaddd ymm5, ymm5, ymm0
vmovd xmm0, r10d
movzx r10d, byte ptr [rdi + r9 + 789]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 820], 1
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 852], 2
vmovd xmm1, r10d
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 821], 1
movzx r10d, byte ptr [rdi + r9 + 22]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 884], 3
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 853], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 916], 4
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 885], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 948], 5
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 917], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 980], 6
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 949], 5
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 1012], 7
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 981], 6
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 1013], 7
vpunpcklbw xmm1, xmm1, xmm0
vmovd xmm0, r10d
movzx r10d, byte ptr [rdi + r9 + 23]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 54], 1
vpmovzxwd ymm1, xmm1
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 86], 2
vmovd xmm13, r10d
vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 55], 1
movzx r10d, byte ptr [rdi + r9 + 278]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 118], 3
vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 87], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 150], 4
vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 119], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 182], 5
vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 151], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 214], 6
vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 183], 5
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 246], 7
vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 215], 6
vpinsrb xmm13, xmm13, byte ptr [rdi + r9 + 247], 7
vpunpcklbw xmm13, xmm13, xmm0
vmovd xmm0, r10d
movzx r10d, byte ptr [rdi + r9 + 279]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 310], 1
vpmovzxwd ymm13, xmm13
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 342], 2
vmovd xmm3, r10d
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 311], 1
movzx r10d, byte ptr [rdi + r9 + 534]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 374], 3
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 343], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 406], 4
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 375], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 438], 5
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 407], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 470], 6
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 439], 5
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 502], 7
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 471], 6
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 503], 7
vpunpcklbw xmm0, xmm3, xmm0
vpmovzxwd ymm3, xmm10
vpmovzxwd ymm0, xmm0
vpaddd ymm10, ymm4, ymm3
vmovd xmm3, r10d
movzx r10d, byte ptr [rdi + r9 + 535]
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 566], 1
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 598], 2
vmovd xmm4, r10d
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 567], 1
movzx r10d, byte ptr [rdi + r9 + 790]
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 630], 3
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 599], 2
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 662], 4
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 631], 3
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 694], 5
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 663], 4
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 726], 6
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 695], 5
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 758], 7
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 727], 6
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 759], 7
vpunpcklbw xmm3, xmm4, xmm3
vmovd xmm4, r10d
movzx r10d, byte ptr [rdi + r9 + 791]
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 822], 1
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 854], 2
vmovd xmm6, r10d
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 823], 1
movzx r10d, byte ptr [rdi + r9 + 24]
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 886], 3
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 855], 2
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 918], 4
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 887], 3
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 950], 5
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 919], 4
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 982], 6
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 951], 5
vpinsrb xmm4, xmm4, byte ptr [rdi + r9 + 1014], 7
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 983], 6
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 1015], 7
vpunpcklbw xmm6, xmm6, xmm4
vpmovzxwd ymm4, xmm14
vpaddd ymm14, ymm13, ymm4
vpmovzxwd ymm4, xmm15
vpaddd ymm13, ymm4, ymm0
vpmovzxwd ymm0, xmm2
vpmovzxwd ymm2, xmm3
vpmovzxwd ymm3, xmm8
vpmovzxwd ymm8, xmm9
vpaddd ymm4, ymm0, ymm2
vpmovzxwd ymm0, xmm6
vpmovzxwd ymm2, xmm7
vpaddd ymm7, ymm5, ymm3
vpaddd ymm8, ymm10, ymm8
vpaddd ymm0, ymm1, ymm0
vpmovzxwd ymm1, xmmword ptr [rsp - 64]
vpaddd ymm11, ymm11, ymm1
vmovd xmm1, r10d
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 56], 1
movzx r10d, byte ptr [rdi + r9 + 280]
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 88], 2
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 120], 3
vpinsrb xmm6, xmm1, byte ptr [rdi + r9 + 152], 4
vpaddd ymm1, ymm12, ymm2
vpinsrb xmm2, xmm6, byte ptr [rdi + r9 + 184], 5
vmovd xmm6, r10d
movzx r10d, byte ptr [rdi + r9 + 536]
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 312], 1
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 216], 6
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 344], 2
vmovd xmm3, r10d
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 568], 1
movzx r10d, byte ptr [rdi + r9 + 25]
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 248], 7
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 376], 3
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 600], 2
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 408], 4
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 632], 3
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 440], 5
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 664], 4
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 472], 6
vpinsrb xmm5, xmm3, byte ptr [rdi + r9 + 696], 5
vmovd xmm3, r10d
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 57], 1
movzx r10d, byte ptr [rdi + r9 + 281]
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 504], 7
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 89], 2
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 728], 6
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 121], 3
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 760], 7
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 153], 4
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 185], 5
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 217], 6
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 249], 7
vpunpcklbw xmm2, xmm3, xmm2
vmovd xmm3, r10d
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 313], 1
movzx r10d, byte ptr [rdi + r9 + 537]
vpmovzxwd ymm2, xmm2
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 345], 2
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 377], 3
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 409], 4
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 441], 5
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 473], 6
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 505], 7
vpunpcklbw xmm6, xmm3, xmm6
vmovd xmm3, r10d
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 569], 1
movzx r10d, byte ptr [rdi + r9 + 792]
vpmovzxwd ymm6, xmm6
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 601], 2
vpaddd ymm6, ymm13, ymm6
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 633], 3
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 665], 4
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 697], 5
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 729], 6
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 761], 7
vpunpcklbw xmm3, xmm3, xmm5
vmovd xmm5, r10d
movzx r10d, byte ptr [rdi + r9 + 793]
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 824], 1
vpmovzxwd ymm3, xmm3
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 856], 2
vmovd xmm9, r10d
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 825], 1
movzx r10d, byte ptr [rdi + r9 + 26]
vpaddd ymm4, ymm4, ymm3
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 888], 3
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 857], 2
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 920], 4
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 889], 3
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 952], 5
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 921], 4
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 984], 6
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 953], 5
vpinsrb xmm5, xmm5, byte ptr [rdi + r9 + 1016], 7
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 985], 6
vpinsrb xmm9, xmm9, byte ptr [rdi + r9 + 1017], 7
vpunpcklbw xmm5, xmm9, xmm5
vpaddd ymm9, ymm14, ymm2
vmovd xmm2, r10d
movzx r10d, byte ptr [rdi + r9 + 282]
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 58], 1
vpmovzxwd ymm5, xmm5
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 90], 2
vmovd xmm10, r10d
vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 314], 1
movzx r10d, byte ptr [rdi + r9 + 538]
vpaddd ymm5, ymm0, ymm5
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 122], 3
vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 346], 2
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 154], 4
vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 378], 3
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 186], 5
vpinsrb xmm3, xmm10, byte ptr [rdi + r9 + 410], 4
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 218], 6
vpinsrb xmm10, xmm3, byte ptr [rdi + r9 + 442], 5
vmovd xmm3, r10d
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 570], 1
movzx r10d, byte ptr [rdi + r9 + 794]
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 250], 7
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 602], 2
vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 474], 6
vmovd xmm0, r10d
movzx r10d, byte ptr [rdi + r9 + 27]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 826], 1
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 634], 3
vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 506], 7
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 858], 2
vpinsrb xmm3, xmm3, byte ptr [rdi + r9 + 666], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 890], 3
vpinsrb xmm12, xmm3, byte ptr [rdi + r9 + 698], 5
vpaddd ymm3, ymm11, ymmword ptr [rsp - 128]
vmovd xmm11, r10d
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 59], 1
movzx r10d, byte ptr [rdi + r9 + 283]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 922], 4
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 91], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 954], 5
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 123], 3
vmovdqu ymmword ptr [rsp - 128], ymm3
vpaddd ymm3, ymm1, ymmword ptr [rsp + 16]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 986], 6
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 155], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 1018], 7
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 187], 5
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 219], 6
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 251], 7
vpunpcklbw xmm14, xmm11, xmm2
vmovd xmm2, r10d
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 315], 1
movzx r10d, byte ptr [rdi + r9 + 539]
vpinsrb xmm11, xmm12, byte ptr [rdi + r9 + 730], 6
vpmovzxwd ymm14, xmm14
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 347], 2
vpaddd ymm9, ymm9, ymm14
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 379], 3
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 411], 4
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 443], 5
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 475], 6
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 507], 7
vpunpcklbw xmm13, xmm2, xmm10
vmovd xmm2, r10d
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 571], 1
vpinsrb xmm10, xmm11, byte ptr [rdi + r9 + 762], 7
movzx r10d, byte ptr [rdi + r9 + 795]
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 603], 2
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 635], 3
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 667], 4
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 699], 5
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 731], 6
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 763], 7
vpunpcklbw xmm11, xmm2, xmm10
vmovd xmm2, r10d
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 827], 1
movzx r10d, byte ptr [rdi + r9 + 28]
vpmovzxwd ymm11, xmm11
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 859], 2
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 891], 3
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 923], 4
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 955], 5
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 987], 6
vpinsrb xmm2, xmm2, byte ptr [rdi + r9 + 1019], 7
vpunpcklbw xmm12, xmm2, xmm0
vmovd xmm0, r10d
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 60], 1
movzx r10d, byte ptr [rdi + r9 + 284]
vpaddd ymm2, ymm7, ymmword ptr [rsp - 96]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 92], 2
vmovd xmm1, r10d
movzx r10d, byte ptr [rdi + r9 + 540]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 124], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 156], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 188], 5
vpinsrb xmm15, xmm0, byte ptr [rdi + r9 + 220], 6
vpinsrb xmm0, xmm1, byte ptr [rdi + r9 + 316], 1
vmovd xmm1, r10d
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 572], 1
movzx r10d, byte ptr [rdi + r9 + 29]
vpinsrb xmm1, xmm1, byte ptr [rdi + r9 + 604], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 348], 2
vpinsrb xmm7, xmm1, byte ptr [rdi + r9 + 636], 3
vpaddd ymm1, ymm8, ymmword ptr [rsp - 32]
vpinsrb xmm8, xmm15, byte ptr [rdi + r9 + 252], 7
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 380], 3
vpmovzxwd ymm15, xmm13
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 668], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 412], 4
vpaddd ymm6, ymm15, ymm6
vpmovzxwd ymm15, xmm12
vpaddd ymm12, ymm11, ymm4
vpinsrb xmm10, xmm7, byte ptr [rdi + r9 + 700], 5
vmovd xmm7, r10d
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 61], 1
movzx r10d, byte ptr [rdi + r9 + 285]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 444], 5
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 93], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 476], 6
vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 732], 6
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 125], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 508], 7
vpinsrb xmm10, xmm10, byte ptr [rdi + r9 + 764], 7
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 157], 4
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 189], 5
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 221], 6
vpinsrb xmm7, xmm7, byte ptr [rdi + r9 + 253], 7
vpunpcklbw xmm7, xmm7, xmm8
vmovd xmm8, r10d
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 317], 1
movzx r10d, byte ptr [rdi + r9 + 541]
vpmovzxwd ymm4, xmm7
vpaddd ymm7, ymm15, ymm5
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 349], 2
vpaddd ymm9, ymm9, ymm4
vpaddd ymm9, ymm9, ymmword ptr [rsp - 128]
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 381], 3
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 413], 4
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 445], 5
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 477], 6
vpinsrb xmm8, xmm8, byte ptr [rdi + r9 + 509], 7
vpunpcklbw xmm8, xmm8, xmm0
vmovd xmm0, r10d
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 573], 1
movzx r10d, byte ptr [rdi + r9 + 796]
vpmovzxwd ymm15, xmm8
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 605], 2
vpaddd ymm6, ymm15, ymm6
vpaddd ymm3, ymm3, ymm6
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 637], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 669], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 701], 5
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 733], 6
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 765], 7
vpunpcklbw xmm10, xmm0, xmm10
vmovd xmm0, r10d
movzx r10d, byte ptr [rdi + r9 + 797]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 828], 1
vpmovzxwd ymm10, xmm10
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 860], 2
vmovd xmm14, r10d
vpinsrb xmm14, xmm14, byte ptr [rdi + r9 + 829], 1
movzx r10d, byte ptr [rdi + r9 + 30]
vpaddd ymm10, ymm12, ymm10
vpaddd ymm2, ymm10, ymm2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 892], 3
vpinsrb xmm14, xmm14, byte ptr [rdi + r9 + 861], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 924], 4
vpinsrb xmm14, xmm14, byte ptr [rdi + r9 + 893], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 956], 5
vpinsrb xmm14, xmm14, byte ptr [rdi + r9 + 925], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 988], 6
vpinsrb xmm14, xmm14, byte ptr [rdi + r9 + 957], 5
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 1020], 7
vpinsrb xmm14, xmm14, byte ptr [rdi + r9 + 989], 6
vpinsrb xmm14, xmm14, byte ptr [rdi + r9 + 1021], 7
vpunpcklbw xmm13, xmm14, xmm0
vmovd xmm0, r10d
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 62], 1
movzx r10d, byte ptr [rdi + r9 + 286]
vpmovzxwd ymm13, xmm13
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 94], 2
vpaddd ymm7, ymm13, ymm7
vpaddd ymm7, ymm1, ymm7
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 126], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 158], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 190], 5
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 222], 6
vpinsrb xmm14, xmm0, byte ptr [rdi + r9 + 254], 7
vmovd xmm0, r10d
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 318], 1
movzx r10d, byte ptr [rdi + r9 + 542]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 350], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 382], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 414], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 446], 5
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 478], 6
vpinsrb xmm11, xmm0, byte ptr [rdi + r9 + 510], 7
vmovd xmm0, r10d
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 574], 1
movzx r10d, byte ptr [rdi + r9 + 798]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 606], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 638], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 670], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 702], 5
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 734], 6
vpinsrb xmm5, xmm0, byte ptr [rdi + r9 + 766], 7
vmovd xmm0, r10d
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 830], 1
movzx r10d, byte ptr [rdi + r9 + 31]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 862], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 894], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 926], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 958], 5
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 990], 6
vpinsrb xmm4, xmm0, byte ptr [rdi + r9 + 1022], 7
vmovd xmm0, r10d
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 63], 1
movzx r10d, byte ptr [rdi + r9 + 287]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 95], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 127], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 159], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 191], 5
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 223], 6
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 255], 7
vpunpcklbw xmm8, xmm0, xmm14
vmovd xmm0, r10d
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 319], 1
movzx r10d, byte ptr [rdi + r9 + 543]
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 351], 2
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 383], 3
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 415], 4
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 447], 5
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 479], 6
vpinsrb xmm0, xmm0, byte ptr [rdi + r9 + 511], 7
vpunpcklbw xmm0, xmm0, xmm11
vmovd xmm11, r10d
movzx r10d, byte ptr [rdi + r9 + 799]
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 575], 1
vpmovzxwd ymm0, xmm0
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 607], 2
vmovd xmm12, r10d
vpinsrb xmm6, xmm12, byte ptr [rdi + r9 + 831], 1
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 639], 3
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 863], 2
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 671], 4
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 895], 3
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 703], 5
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 927], 4
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 735], 6
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 959], 5
vpinsrb xmm11, xmm11, byte ptr [rdi + r9 + 767], 7
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 991], 6
vpinsrb xmm6, xmm6, byte ptr [rdi + r9 + 1023], 7
vpunpcklbw xmm5, xmm11, xmm5
vpmovzxwd ymm5, xmm5
vpunpcklbw xmm4, xmm6, xmm4
vpmovzxwd ymm6, xmm8
vpaddd ymm1, ymm9, ymm6
vpaddd ymm6, ymm3, ymm0
vpaddd ymm3, ymm2, ymm5
vpmovzxwd ymm2, xmm4
vpaddd ymm2, ymm7, ymm2
cmp r8, rdx
jne .LBB0_5
vpaddd ymm0, ymm6, ymm1
vpaddd ymm0, ymm3, ymm0
vpaddd ymm0, ymm2, ymm0
vextracti128 xmm1, ymm0, 1
vpaddd xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 238
vpaddd xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 85
vpaddd xmm0, xmm0, xmm1
vmovd r8d, xmm0
cmp rcx, rdx
je .LBB0_9
.LBB0_7:
vbroadcasti128 ymm0, xmmword ptr [rip + .LCPI0_1]
.LBB0_8:
vmovdqu ymm1, ymmword ptr [rax]
add rsi, -32
add rax, 32
vpshufb ymm1, ymm1, ymm0
vextracti128 xmm2, ymm1, 1
vpmovzxwd ymm1, xmm1
vpmovzxwd ymm2, xmm2
vpaddd ymm1, ymm1, ymm2
vextracti128 xmm2, ymm1, 1
vpaddd xmm1, xmm1, xmm2
vpshufd xmm2, xmm1, 238
vpaddd xmm1, xmm1, xmm2
vpshufd xmm2, xmm1, 85
vpaddd xmm1, xmm1, xmm2
vmovd ecx, xmm1
add r8d, ecx
cmp rsi, 31
ja .LBB0_8
.LBB0_9:
cmp rsi, 2
jb .LBB0_10
lea rdx, [rsi - 2]
cmp rdx, 62
jae .LBB0_16
xor ecx, ecx
jmp .LBB0_19
.LBB0_10:
xor ecx, ecx
jmp .LBB0_11
.LBB0_16:
vmovdqa xmm2, xmmword ptr [rip + .LCPI0_1]
shr rdx
vmovd xmm0, r8d
vpxor xmm1, xmm1, xmm1
xor r8d, r8d
vpxor xmm3, xmm3, xmm3
vpxor xmm4, xmm4, xmm4
inc rdx
mov rdi, rdx
and rdi, -32
lea rcx, [rdi + rdi]
.LBB0_17:
vmovdqu xmm6, xmmword ptr [rax + 2*r8 + 16]
vmovdqu xmm5, xmmword ptr [rax + 2*r8]
vmovdqu xmm7, xmmword ptr [rax + 2*r8 + 32]
vmovdqu xmm8, xmmword ptr [rax + 2*r8 + 48]
add r8, 32
vpshufb xmm6, xmm6, xmm2
vpshufb xmm5, xmm5, xmm2
vpshufb xmm7, xmm7, xmm2
vpshufb xmm8, xmm8, xmm2
vpmovzxwd ymm6, xmm6
vpmovzxwd ymm5, xmm5
vpmovzxwd ymm7, xmm7
vpaddd ymm1, ymm1, ymm6
vpmovzxwd ymm6, xmm8
vpaddd ymm0, ymm0, ymm5
vpaddd ymm3, ymm3, ymm7
vpaddd ymm4, ymm4, ymm6
cmp rdi, r8
jne .LBB0_17
vpaddd ymm0, ymm1, ymm0
vpaddd ymm0, ymm3, ymm0
vpaddd ymm0, ymm4, ymm0
vextracti128 xmm1, ymm0, 1
vpaddd xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 238
vpaddd xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 85
vpaddd xmm0, xmm0, xmm1
vmovd r8d, xmm0
cmp rdx, rdi
je .LBB0_11
.LBB0_19:
mov rdx, rcx
.LBB0_20:
movbe cx, word ptr [rax + rdx]
movzx ecx, cx
add r8d, ecx
lea rcx, [rdx + 2]
add rdx, 3
cmp rdx, rsi
mov rdx, rcx
jb .LBB0_20
.LBB0_11:
cmp rcx, rsi
jae .LBB0_13
movzx eax, byte ptr [rax + rcx]
shl eax, 8
add r8d, eax
.LBB0_13:
mov eax, r8d
shr eax, 16
movzx ecx, r8w
add ecx, eax
mov eax, ecx
shr eax, 16
add eax, ecx
add rsp, 88
vzeroupper
ret |
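For anyone reading the end of the listing: the scalar loop at `.LBB0_20` adds one big-endian 16-bit word per iteration, `.LBB0_11` adds a trailing odd byte shifted left by 8, and `.LBB0_13` folds the 32-bit accumulator down to 16 bits with two shift-and-add steps. A minimal Rust sketch of that tail, assuming the same shape as the `data` function quoted earlier in the thread. `fold_carries` and `checksum_scalar` are hypothetical names; the body of `propagate_carries` is not shown in this thread, so the fold here is inferred from the assembly only.

```rust
/// Fold a 32-bit one's-complement accumulator into 16 bits.
/// Hypothetical stand-in for `propagate_carries`, mirroring the two
/// shift-and-add steps visible at `.LBB0_13`.
fn fold_carries(word: u32) -> u16 {
    // First fold: high half plus low half (at most 0x1_fffe).
    let sum = (word >> 16) + (word & 0xffff);
    // Second fold: absorb the single carry the first fold can produce.
    ((sum >> 16) + (sum & 0xffff)) as u16
}

/// Scalar RFC 1071 sum without the final complement.
/// Assumes inputs small enough that the u32 accumulator cannot overflow
/// (true for packet-sized buffers).
fn checksum_scalar(data: &[u8]) -> u16 {
    let mut accum: u32 = 0;
    let mut chunks = data.chunks_exact(2);
    for chunk in &mut chunks {
        // One big-endian word per iteration, like the `movbe` loop.
        accum += u32::from(u16::from_be_bytes([chunk[0], chunk[1]]));
    }
    // A trailing odd byte is treated as the high byte of a final word.
    if let Some(&last) = chunks.remainder().first() {
        accum += u32::from(last) << 8;
    }
    fold_carries(accum)
}
```

Feeding it the worked example from RFC 1071 (`00 01 f2 03 f4 f5 f6 f7`) gives a partial sum of `0x2ddf0`, which folds to `0xddf2`.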
I would definitely prefer that |
I don't know the exact reason. I think your method is the best.
I benchmarked bench_checksum_chunks_exact_no_bigchunk and bench_checksum_chunks_exact on macOS — the former was about 3.5% faster. |
I ran into a surprising result: when the input size is less than 1024 bytes, bench_checksum_original is actually the fastest implementation. Could you help verify this on your machine? @datdenkikniet Benchmark results: For input size < 1024:
For input size ≥ 1024:
Code (click to expand)
|
Got a little carried away and made some graphs :P Data for each 10-byte interval, starting at 0 and ending at 1460. Y-axis is time in ns, x-axis is data size. The sawtooth pattern is probably caused by speedups at "SIMD boundaries", so not too noteworthy. With
Details if you'd like to run it locally, too
Required code changes + bash script
#!/bin/bash
for i in $(seq 0 10 1460); do
while read -r line; do
data=$(echo "$line" | awk '{print $2 "," $5 "," $8}')
data="${data::-1}"
unit=$(echo "$line" | awk '{print $6}' | cut -d '/' -f1)
echo "$i,$data,$unit"
done < <(DATA_SIZE=$i cargo +nightly bench -- checksum 2> /dev/null | grep "... bench")
done

fn build_data() -> Vec<u8> {
    let data_size: usize = std::env::var("DATA_SIZE").unwrap().parse().unwrap();
    (0..data_size).map(|x| (x % 256) as u8).collect()
}

Raw data results: results.ods
To double-check, I also ran it with 32 byte intervals and the graph is indeed a lot smoother: |
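Since the sweep compares several implementations at every size, it may be worth confirming that they agree before reading too much into the timings. A minimal sketch, assuming the same `x % 256` fill as `build_data` above. `checksum_plain` and `checksum_chunked32` are hypothetical stand-ins for the non-chunked and 32-byte-chunked variants, not the exact benchmark functions, and `first_mismatch` is a helper invented for this example.

```rust
/// Fold a 32-bit accumulator into 16 bits (two shift-and-add steps).
fn fold(word: u32) -> u16 {
    let sum = (word >> 16) + (word & 0xffff);
    ((sum >> 16) + (sum & 0xffff)) as u16
}

/// Sum big-endian 16-bit words; a trailing odd byte counts as a high byte.
fn sum_words(data: &[u8]) -> u32 {
    let mut chunks = data.chunks_exact(2);
    let mut accum = 0u32;
    for c in &mut chunks {
        accum += u32::from(u16::from_be_bytes([c[0], c[1]]));
    }
    if let Some(&b) = chunks.remainder().first() {
        accum += u32::from(b) << 8;
    }
    accum
}

/// Hypothetical non-chunked variant.
fn checksum_plain(data: &[u8]) -> u16 {
    fold(sum_words(data))
}

/// Hypothetical variant that walks 32-byte blocks first, then the tail.
fn checksum_chunked32(data: &[u8]) -> u16 {
    let mut blocks = data.chunks_exact(32);
    let mut accum = 0u32;
    for block in &mut blocks {
        accum += sum_words(block);
    }
    // Blocks are even-sized, so the tail keeps its word alignment.
    accum += sum_words(blocks.remainder());
    fold(accum)
}

/// Same fill pattern as `build_data`, with the size passed directly.
fn pattern(size: usize) -> Vec<u8> {
    (0..size).map(|x| (x % 256) as u8).collect()
}

/// First size in 0..=max where the two variants disagree, if any.
fn first_mismatch(max: usize) -> Option<usize> {
    (0..=max).find(|&n| {
        let d = pattern(n);
        checksum_plain(&d) != checksum_chunked32(&d)
    })
}
```

Calling `first_mismatch(1460)` covers every size in the sweep range, including the odd lengths that the 10-byte step never hits.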
I ran the benchmarks on my Mac Studio with an M1 chip, and the results are shown in the table below. While there are some minor differences compared to the results from a Mac mini with an M4 chip, the overall trends remain consistent. Benchmark results (click to expand)
Update: I just realized that the unusual performance of bench_checksum_original might be related to NetworkEndian::read_u16. I replaced it with u16::from_be_bytes, and now its performance is much closer to the other implementations. Benchmark results (click to expand)
|
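Both calls decode the same big-endian 16-bit word, so the checksum value should not change with this swap; whatever differs is in how the call gets compiled. A small sketch of the equivalence using only the standard library. `read_u16_be_manual` is a hypothetical stand-in for the shift-and-or that a big-endian read performs, not the `byteorder` implementation.

```rust
/// Hypothetical manual big-endian read: high byte first, then low byte.
/// Panics if `buf` has fewer than two bytes.
fn read_u16_be_manual(buf: &[u8]) -> u16 {
    (u16::from(buf[0]) << 8) | u16::from(buf[1])
}

/// The same read expressed with the standard library helper.
/// Panics if `buf` has fewer than two bytes.
fn read_u16_be_std(buf: &[u8]) -> u16 {
    u16::from_be_bytes([buf[0], buf[1]])
}
```

Because the two functions are interchangeable in value, any benchmark difference between them points at code generation (inlining, bounds checks) rather than at the arithmetic.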
Well, that's an interesting turn of events... Graph version for those interested: Given that the tradeoffs here are really not very obvious (because I have no clue how much time is actually spent computing this checksum, nor how many people run it on their Macs), and I don't think we even have results for the "most relevant" targets (embedded ARM, IMO), I will leave judgement to whoever has approval powers :P At least we have some data to show for it now! |
I think we should go with |