Speedup compress #83

Merged: 5 commits into main on Jan 21, 2025

Conversation

folkertdev (Collaborator) commented:

Rewrite the code so that it is easier to vectorize. With these changes we now beat stock bzip2 handsomely on the compression benchmarks that I looked at, e.g.:

Benchmark 1 (6 runs): target/release/examples/compress c 9 /home/folkertdev/rust/zlib-rs/silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           923ms ± 9.69ms     918ms …  943ms          0 ( 0%)        0%
  peak_rss           29.5MB ± 98.7KB    29.4MB … 29.6MB          0 ( 0%)        0%
  cpu_cycles         4.16G  ± 33.1M     4.14G  … 4.23G           0 ( 0%)        0%
  instructions       7.94G  ±  310      7.94G  … 7.94G           0 ( 0%)        0%
  cache_references    150M  ±  504K      150M  …  151M           0 ( 0%)        0%
  cache_misses       37.8M  ±  445K     37.1M  … 38.4M           0 ( 0%)        0%
  branch_misses      61.2M  ± 46.2K     61.2M  … 61.3M           0 ( 0%)        0%
Benchmark 2 (7 runs): target/release/examples/compress rs 9 /home/folkertdev/rust/zlib-rs/silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           818ms ± 3.00ms     814ms …  823ms          0 ( 0%)        ⚡- 11.5% ±  0.9%
  peak_rss           29.6MB ± 49.5KB    29.5MB … 29.6MB          1 (14%)          +  0.3% ±  0.3%
  cpu_cycles         3.67G  ± 10.8M     3.66G  … 3.69G           0 ( 0%)        ⚡- 11.8% ±  0.7%
  instructions       9.27G  ± 40.5K     9.27G  … 9.27G           0 ( 0%)        💩+ 16.7% ±  0.0%
  cache_references    151M  ±  556K      150M  …  152M           0 ( 0%)          +  0.5% ±  0.4%
  cache_misses       38.4M  ±  302K     38.0M  … 38.9M           0 ( 0%)          +  1.6% ±  1.2%
  branch_misses      52.3M  ± 21.3K     52.3M  … 52.4M           0 ( 0%)        ⚡- 14.5% ±  0.1%
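
To make "easier to vectorize" concrete, here is a minimal, hypothetical sketch (not the PR's actual diff): the common case is expressed as whole-chunk slice comparisons, which the compiler lowers to wide loads and compares, while the data-dependent early exit only runs on the rare path where a chunk actually differs.

// Hypothetical illustration of the general rewrite pattern, not code from this PR.

/// Index of the first mismatching byte in the common prefix, scalar version:
/// per-element indexing with an early exit tends to defeat auto-vectorization.
fn first_mismatch_scalar(a: &[u8], b: &[u8]) -> Option<usize> {
    for i in 0..a.len().min(b.len()) {
        if a[i] != b[i] {
            return Some(i);
        }
    }
    None
}

/// The same search, but chunked: the `ca != cb` slice comparison compiles to
/// wide loads/compares, and the scalar scan only runs inside the one chunk
/// that actually differs.
fn first_mismatch_chunked(a: &[u8], b: &[u8]) -> Option<usize> {
    const N: usize = 16;
    let mut offset = 0;
    for (ca, cb) in a.chunks_exact(N).zip(b.chunks_exact(N)) {
        if ca != cb {
            return ca.iter().zip(cb).position(|(x, y)| x != y).map(|i| offset + i);
        }
        offset += N;
    }
    // tails shorter than one full chunk
    a[offset..]
        .iter()
        .zip(&b[offset..])
        .position(|(x, y)| x != y)
        .map(|i| offset + i)
}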

Comment on lines +418 to +425
for (((c1, c2), s1), s2) in b1.iter().zip(b2).zip(q1).zip(q2) {
    if c1 != c2 {
        return c1 > c2;
    }
    if s1 != s2 {
        return s1 > s2;
    }
}
folkertdev (Collaborator, Author) commented:

I haven't yet found a good way to vectorize this part (I tried some things with xor and leading_zeros, but could not get it to be correct so far). At least it's out of the hot path, and the equality check will use full-width loads/compares (even an AVX one for the quadrant compare).
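
For illustration, a minimal sketch of that shape (a hypothetical helper using the names from the snippet above, not the crate's actual code): the equality fast path is a whole-slice comparison that compiles to full-width loads, and the element-wise loop only runs once the windows are known to differ.

// Hypothetical sketch: `b1`/`b2` are the block byte windows and `q1`/`q2`
// the quadrant windows, as in the snippet above. `None` means "equal so far".
fn window_order(b1: &[u8], b2: &[u8], q1: &[u16], q2: &[u16]) -> Option<bool> {
    // fast path: whole-slice equality uses wide loads/compares
    if b1 == b2 && q1 == q2 {
        return None;
    }
    // slow path: find the first differing element to decide the order
    for (((c1, c2), s1), s2) in b1.iter().zip(b2).zip(q1).zip(q2) {
        if c1 != c2 {
            return Some(c1 > c2);
        }
        if s1 != s2 {
            return Some(s1 > s2);
        }
    }
    None
}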

folkertdev (Collaborator, Author) commented:

Just documenting: I came up with this

        // load the first 8 block bytes of each window as big-endian u64s,
        // so integer comparison matches byte-wise lexicographic order
        let lc1 = u64::from_be_bytes(*b1.first_chunk().unwrap());
        let lc2 = u64::from_be_bytes(*b2.first_chunk().unwrap());

        // pack 8 quadrant u16s into one u128 with element 0 most significant,
        // keeping each element's native value (assumes a little-endian host)
        #[inline(always)]
        fn transform(slice: &[u16]) -> u128 {
            let raw = unsafe { slice.as_ptr().cast::<u128>().read_unaligned().to_be() };
            let mask = 0xFF00ff00_FF00ff00_FF00ff00_FF00ff00u128;

            let upper = raw & mask;
            let lower = raw & !mask;

            (upper >> 8) | (lower << 8)
        }

        if b1 != b2 || q1 != q2 {
            let lq1 = transform(q1);
            let lq2 = transform(q2);

            // index of the first differing byte resp. quadrant element
            let first_bad_c = (lc1 ^ lc2).leading_zeros() / 8;
            let first_bad_q = (lq1 ^ lq2).leading_zeros() / 16;

            // whichever position differs first decides the comparison
            if first_bad_c <= first_bad_q {
                return lc1 > lc2;
            } else {
                return lq1 > lq2;
            }
        }

which is OK, but for some reason it won't use xmm registers, so overall it's just too many instructions to be profitable.
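
As a tiny standalone demo of the xor/leading_zeros idea (hypothetical code, not part of the PR): loading a byte window big-endian means the leading-zero count of the xor gives the index of the first differing byte, and comparing the big-endian integers gives the lexicographic order.

// Hypothetical demo of the xor + leading_zeros trick on a single u64 window.
fn first_diff_and_order(a: [u8; 8], b: [u8; 8]) -> Option<(u32, bool)> {
    let la = u64::from_be_bytes(a); // byte 0 becomes the most significant byte
    let lb = u64::from_be_bytes(b);
    if la == lb {
        return None;
    }
    let first_diff = (la ^ lb).leading_zeros() / 8; // index of first differing byte
    Some((first_diff, la > lb)) // `la > lb` is the lexicographic order of a vs b
}

fn main() {
    let a = *b"abcdefgh";
    let b = *b"abcdefxh";
    // the windows differ at byte 6, and 'g' < 'x', so `a` sorts before `b`
    assert_eq!(first_diff_and_order(a, b), Some((6, false)));
}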

folkertdev requested a review from bjorn3 on Jan 20, 2025 at 17:44.

codecov bot commented Jan 20, 2025

Codecov Report

Attention: Patch coverage is 96.42857% with 1 line in your changes missing coverage. Please review.

Files with missing lines          Patch %   Lines
libbz2-rs-sys/src/blocksort.rs    96.42%    1 Missing ⚠️

Files with missing lines          Coverage           Δ
libbz2-rs-sys/src/blocksort.rs    99.32% <96.42%>    -0.11% ⬇️

bjorn3 (Collaborator) left a comment:

LGTM with the profile change removed.

folkertdev merged commit 34257dd into main on Jan 21, 2025; 20 checks passed.
folkertdev deleted the speedup-compress branch on Jan 21, 2025 at 09:40.