Skip to content

Conversation

@folkertdev
Copy link
Member

@folkertdev folkertdev commented May 9, 2025

closes #361

Locally this is very effective, CI shows a lot more noise. But it does make sense that this would be faster right?

Benchmark 2 (73 runs): target/release/examples/blogpost-uncompress rs-chunked 4 silesia-small.tar.gz
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          68.9ms ±  522us    68.3ms … 70.7ms          2 ( 3%)          -  1.2% ±  0.3%
  peak_rss           24.1MB ± 75.4KB    23.9MB … 24.1MB          0 ( 0%)          -  0.1% ±  0.1%
  cpu_cycles          270M  ± 1.59M      268M  …  278M           4 ( 5%)        ⚡-  1.5% ±  0.2%
  instructions        781M  ±  400       781M  …  781M           2 ( 3%)          -  0.1% ±  0.0%
  cache_references   3.14M  ± 97.1K     2.85M  … 3.36M           0 ( 0%)          -  2.1% ±  1.1%
  cache_misses       83.8K  ± 14.4K     51.4K  …  129K           1 ( 1%)        💩+  7.0% ±  5.4%
  branch_misses      4.09M  ± 11.0K     4.08M  … 4.12M           0 ( 0%)          +  0.2% ±  0.1%
Benchmark 2 (152 runs): target/release/examples/blogpost-uncompress rs-chunked 12 silesia-small.tar.gz
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          32.9ms ±  842us    32.3ms … 41.8ms         10 ( 7%)        ⚡-  2.8% ±  0.5%
  peak_rss           24.1MB ± 58.4KB    24.0MB … 24.1MB          0 ( 0%)          +  0.0% ±  0.1%
  cpu_cycles          111M  ± 2.73M      110M  …  141M          13 ( 9%)        ⚡-  3.7% ±  0.4%
  instructions        265M  ±  276       265M  …  265M           0 ( 0%)        ⚡-  6.4% ±  0.0%
  cache_references   3.38M  ±  153K     3.13M  … 4.62M           5 ( 3%)        ⚡-  2.3% ±  0.9%
  cache_misses       40.3K  ± 12.2K     30.6K  …  167K          10 ( 7%)        💩+  8.7% ±  6.0%
  branch_misses      1.79M  ± 1.95K     1.79M  … 1.81M           8 ( 5%)          -  0.1% ±  0.0%
Benchmark 2 (62 runs): target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          81.8ms ±  950us    80.1ms … 86.5ms          3 ( 5%)          -  0.1% ±  0.4%
  peak_rss           26.7MB ± 68.6KB    26.5MB … 26.7MB          0 ( 0%)          -  0.0% ±  0.1%
  cpu_cycles          288M  ± 2.76M      284M  …  303M           1 ( 2%)          +  0.3% ±  0.3%
  instructions        582M  ±  296       582M  …  582M           0 ( 0%)          -  1.0% ±  0.0%
  cache_references   19.9M  ±  211K     19.4M  … 20.5M           5 ( 8%)          +  0.0% ±  0.3%
  cache_misses        407K  ± 83.5K      293K  …  644K           5 ( 8%)        ⚡-  9.6% ±  5.7%
  branch_misses      2.97M  ± 2.46K     2.97M  … 2.98M           1 ( 2%)          +  0.2% ±  0.0%
Benchmark 2 (12 runs): target/release/examples/blogpost-compress 9 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           429ms ± 2.82ms     426ms …  434ms          0 ( 0%)        ⚡-  2.5% ±  0.8%
  peak_rss           24.4MB ±  102KB    24.2MB … 24.5MB          0 ( 0%)          +  0.0% ±  0.3%
  cpu_cycles         1.84G  ± 9.16M     1.83G  … 1.86G           2 (17%)        ⚡-  2.5% ±  0.8%
  instructions       3.43G  ±  270      3.43G  … 3.43G           0 ( 0%)          -  0.6% ±  0.0%
  cache_references    197M  ±  759K      196M  …  198M           0 ( 0%)          -  0.9% ±  0.8%
  cache_misses       2.36M  ±  438K     1.69M  … 3.00M           0 ( 0%)          - 34.7% ± 40.4%
  branch_misses      15.8M  ± 41.1K     15.7M  … 15.8M           0 ( 0%)          +  0.1% ±  0.2%

@codecov
Copy link

codecov bot commented May 9, 2025

Codecov Report

Attention: Patch coverage is 52.50000% with 19 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
zlib-rs/src/cpu_features.rs 34.48% 19 Missing ⚠️
Flag Coverage Δ
fuzz-compress 41.41% <76.47%> (-0.45%) ⬇️
fuzz-decompress 31.09% <70.58%> (+0.08%) ⬆️
test-aarch64-apple-darwin 88.91% <22.22%> (-0.24%) ⬇️
test-x86_64-apple-darwin 85.63% <80.00%> (-0.04%) ⬇️
test-x86_64-unknown-linux-gnu 87.07% <95.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
zlib-rs/src/adler32.rs 98.82% <100.00%> (ø)
zlib-rs/src/adler32/avx2.rs 92.50% <100.00%> (ø)
zlib-rs/src/deflate/compare256.rs 98.02% <100.00%> (ø)
zlib-rs/src/deflate/slide_hash.rs 98.96% <100.00%> (ø)
zlib-rs/src/inflate.rs 95.97% <100.00%> (-0.17%) ⬇️
zlib-rs/src/inflate/writer.rs 91.54% <100.00%> (ø)
zlib-rs/src/cpu_features.rs 46.25% <34.48%> (-13.40%) ⬇️

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

return std::is_x86_feature_detected!("pclmulqdq")
&& std::is_x86_feature_detected!("sse2")
&& std::is_x86_feature_detected!("sse4.1");
return std::is_x86_feature_detected!("pclmulqdq") && std::is_x86_feature_detected!("sse4.1");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@folkertdev folkertdev requested a review from bjorn3 May 9, 2025 11:59
@folkertdev folkertdev changed the title check for and enable the bmi2 target feature check for and enable the bmi1 and bmi2 target feature May 9, 2025
@bjorn3
Copy link
Collaborator

bjorn3 commented May 9, 2025

How much does disabling bmi2 hurt? It contains the pdep and pext instructions which are slow on Zen 2 and lower.

@folkertdev
Copy link
Member Author

Pretty significant, at least on my CPU

Benchmark 2 (74 runs): target/release/examples/blogpost-uncompress rs-chunked 4 silesia-small.tar.gz
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          68.0ms ±  376us    67.4ms … 69.8ms          1 ( 1%)        💩+  1.7% ±  0.2%
  peak_rss           24.1MB ± 70.5KB    23.9MB … 24.1MB          0 ( 0%)          +  0.0% ±  0.1%
  cpu_cycles          275M  ± 1.37M      274M  …  285M           6 ( 8%)        💩+  1.9% ±  0.2%
  instructions        786M  ±  383       786M  …  786M           2 ( 3%)          +  0.7% ±  0.0%
  cache_references   3.19M  ±  387K     2.97M  … 5.72M           3 ( 4%)          +  0.2% ±  3.4%
  cache_misses       64.8K  ± 11.4K     43.4K  …  123K           1 ( 1%)        ⚡- 52.3% ±  2.7%
  branch_misses      4.08M  ± 3.61K     4.08M  … 4.10M           2 ( 3%)          -  0.2% ±  0.1%
Benchmark 2 (153 runs): target/release/examples/blogpost-uncompress rs-chunked 12 silesia-small.tar.gz
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          32.8ms ±  803us    32.1ms … 41.1ms          9 ( 6%)        💩+  2.4% ±  0.4%
  peak_rss           24.1MB ± 64.3KB    23.9MB … 24.1MB          0 ( 0%)          +  0.0% ±  0.1%
  cpu_cycles          114M  ± 2.26M      113M  …  139M          17 (11%)        💩+  2.9% ±  0.3%
  instructions        285M  ±  257       285M  …  285M           0 ( 0%)        💩+  7.8% ±  0.0%
  cache_references   3.61M  ± 99.4K     3.35M  … 4.06M           4 ( 3%)        💩+  8.4% ±  0.7%
  cache_misses       36.6K  ± 6.04K     31.4K  … 86.8K           8 ( 5%)          -  1.1% ±  2.9%
  branch_misses      1.79M  ± 1.28K     1.79M  … 1.80M           5 ( 3%)          -  0.0% ±  0.0%

We already don't go you of our way to support older hardware, instead preferring the best performance on modern hardware, so I'm in favor of enabling it.

@folkertdev folkertdev merged commit 3983883 into main May 9, 2025
24 checks passed
@folkertdev folkertdev deleted the enable-bmi2 branch May 9, 2025 14:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Consider detecting and enabling bmi1 and bmi2 target features together with avx2

3 participants