Speedup compress #83

Merged: 5 commits into main on Jan 21, 2025

Conversation

folkertdev (Collaborator) commented:

Rewrite the code so that it is easier to vectorize. With these changes we now beat stock bzip2 handsomely on the compression benchmarks that I looked at, e.g.:

Benchmark 1 (6 runs): target/release/examples/compress c 9 /home/folkertdev/rust/zlib-rs/silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           923ms ± 9.69ms     918ms …  943ms          0 ( 0%)        0%
  peak_rss           29.5MB ± 98.7KB    29.4MB … 29.6MB          0 ( 0%)        0%
  cpu_cycles         4.16G  ± 33.1M     4.14G  … 4.23G           0 ( 0%)        0%
  instructions       7.94G  ±  310      7.94G  … 7.94G           0 ( 0%)        0%
  cache_references    150M  ±  504K      150M  …  151M           0 ( 0%)        0%
  cache_misses       37.8M  ±  445K     37.1M  … 38.4M           0 ( 0%)        0%
  branch_misses      61.2M  ± 46.2K     61.2M  … 61.3M           0 ( 0%)        0%
Benchmark 2 (7 runs): target/release/examples/compress rs 9 /home/folkertdev/rust/zlib-rs/silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           818ms ± 3.00ms     814ms …  823ms          0 ( 0%)        ⚡- 11.5% ±  0.9%
  peak_rss           29.6MB ± 49.5KB    29.5MB … 29.6MB          1 (14%)          +  0.3% ±  0.3%
  cpu_cycles         3.67G  ± 10.8M     3.66G  … 3.69G           0 ( 0%)        ⚡- 11.8% ±  0.7%
  instructions       9.27G  ± 40.5K     9.27G  … 9.27G           0 ( 0%)        💩+ 16.7% ±  0.0%
  cache_references    151M  ±  556K      150M  …  152M           0 ( 0%)          +  0.5% ±  0.4%
  cache_misses       38.4M  ±  302K     38.0M  … 38.9M           0 ( 0%)          +  1.6% ±  1.2%
  branch_misses      52.3M  ± 21.3K     52.3M  … 52.4M           0 ( 0%)        ⚡- 14.5% ±  0.1%
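
To make "easier to vectorize" concrete, here is a minimal, hypothetical sketch (not the PR's actual diff): the common case is expressed as whole-chunk slice comparisons, which the compiler lowers to wide loads and compares, while the data-dependent early exit only runs on the rare path where a chunk actually differs.

// Hypothetical illustration of the general rewrite pattern, not code from this PR.

/// Index of the first mismatching byte in the common prefix, scalar version:
/// per-element indexing with an early exit tends to defeat auto-vectorization.
fn first_mismatch_scalar(a: &[u8], b: &[u8]) -> Option<usize> {
    for i in 0..a.len().min(b.len()) {
        if a[i] != b[i] {
            return Some(i);
        }
    }
    None
}

/// The same search, but chunked: the `ca != cb` slice comparison compiles to
/// wide loads/compares, and the scalar scan only runs inside the one chunk
/// that actually differs.
fn first_mismatch_chunked(a: &[u8], b: &[u8]) -> Option<usize> {
    const N: usize = 16;
    let mut offset = 0;
    for (ca, cb) in a.chunks_exact(N).zip(b.chunks_exact(N)) {
        if ca != cb {
            return ca.iter().zip(cb).position(|(x, y)| x != y).map(|i| offset + i);
        }
        offset += N;
    }
    // tails shorter than one full chunk
    a[offset..]
        .iter()
        .zip(&b[offset..])
        .position(|(x, y)| x != y)
        .map(|i| offset + i)
}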

Comment on lines +418 to +425
for (((c1, c2), s1), s2) in b1.iter().zip(b2).zip(q1).zip(q2) {
    if c1 != c2 {
        return c1 > c2;
    }
    if s1 != s2 {
        return s1 > s2;
    }
}
folkertdev (Collaborator, Author) commented:

I haven't yet found a good way to vectorize this part (I tried some things with xor and leading_zeros, but could not get it to be correct so far). At least it's out of the hot path, and the equality check will use full-width loads/compares (even an AVX one for the quadrant compare).
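
For illustration, a minimal sketch of that shape (a hypothetical helper using the names from the snippet above, not the crate's actual code): the equality fast path is a whole-slice comparison that compiles to full-width loads, and the element-wise loop only runs once the windows are known to differ.

// Hypothetical sketch: `b1`/`b2` are the block byte windows and `q1`/`q2`
// the quadrant windows, as in the snippet above. `None` means "equal so far".
fn window_order(b1: &[u8], b2: &[u8], q1: &[u16], q2: &[u16]) -> Option<bool> {
    // fast path: whole-slice equality uses wide loads/compares
    if b1 == b2 && q1 == q2 {
        return None;
    }
    // slow path: find the first differing element to decide the order
    for (((c1, c2), s1), s2) in b1.iter().zip(b2).zip(q1).zip(q2) {
        if c1 != c2 {
            return Some(c1 > c2);
        }
        if s1 != s2 {
            return Some(s1 > s2);
        }
    }
    None
}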

folkertdev (Collaborator, Author) commented:

Just documenting: I came up with this

        // load the first 8 block bytes of each window as big-endian u64s,
        // so integer comparison matches byte-wise lexicographic order
        let lc1 = u64::from_be_bytes(*b1.first_chunk().unwrap());
        let lc2 = u64::from_be_bytes(*b2.first_chunk().unwrap());

        // pack 8 quadrant u16s into one u128 with element 0 most significant,
        // keeping each element's native value (assumes a little-endian host)
        #[inline(always)]
        fn transform(slice: &[u16]) -> u128 {
            let raw = unsafe { slice.as_ptr().cast::<u128>().read_unaligned().to_be() };
            let mask = 0xFF00ff00_FF00ff00_FF00ff00_FF00ff00u128;

            let upper = raw & mask;
            let lower = raw & !mask;

            (upper >> 8) | (lower << 8)
        }

        if b1 != b2 || q1 != q2 {
            let lq1 = transform(q1);
            let lq2 = transform(q2);

            // index of the first differing byte resp. quadrant element
            let first_bad_c = (lc1 ^ lc2).leading_zeros() / 8;
            let first_bad_q = (lq1 ^ lq2).leading_zeros() / 16;

            // whichever position differs first decides the comparison
            if first_bad_c <= first_bad_q {
                return lc1 > lc2;
            } else {
                return lq1 > lq2;
            }
        }

which is OK, but for some reason it won't use xmm registers, so overall it's just too many instructions to be profitable.
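
As a tiny standalone demo of the xor/leading_zeros idea (hypothetical code, not part of the PR): loading a byte window big-endian means the leading-zero count of the xor gives the index of the first differing byte, and comparing the big-endian integers gives the lexicographic order.

// Hypothetical demo of the xor + leading_zeros trick on a single u64 window.
fn first_diff_and_order(a: [u8; 8], b: [u8; 8]) -> Option<(u32, bool)> {
    let la = u64::from_be_bytes(a); // byte 0 becomes the most significant byte
    let lb = u64::from_be_bytes(b);
    if la == lb {
        return None;
    }
    let first_diff = (la ^ lb).leading_zeros() / 8; // index of first differing byte
    Some((first_diff, la > lb)) // `la > lb` is the lexicographic order of a vs b
}

fn main() {
    let a = *b"abcdefgh";
    let b = *b"abcdefxh";
    // the windows differ at byte 6, and 'g' < 'x', so `a` sorts before `b`
    assert_eq!(first_diff_and_order(a, b), Some((6, false)));
}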

folkertdev requested a review from bjorn3 on Jan 20, 2025 at 17:44.

codecov bot commented Jan 20, 2025

Codecov Report

Attention: Patch coverage is 96.42857% with 1 line in your changes missing coverage. Please review.

Files with missing lines          Patch %   Lines
libbz2-rs-sys/src/blocksort.rs    96.42%    1 Missing ⚠️

Files with missing lines          Coverage           Δ
libbz2-rs-sys/src/blocksort.rs    99.32% <96.42%>    -0.11% ⬇️

bjorn3 (Collaborator) left a comment:

LGTM with the profile change removed.

folkertdev merged commit 34257dd into main on Jan 21, 2025; 20 checks passed.
folkertdev deleted the speedup-compress branch on Jan 21, 2025 at 09:40.