check for and enable the `bmi1` and `bmi2` target feature #362

folkertdev · 2025-05-09T10:20:02Z

closes #361

Locally this is very effective, CI shows a lot more noise. But it does make sense that this would be faster right?

Benchmark 2 (73 runs): target/release/examples/blogpost-uncompress rs-chunked 4 silesia-small.tar.gz
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          68.9ms ±  522us    68.3ms … 70.7ms          2 ( 3%)          -  1.2% ±  0.3%
  peak_rss           24.1MB ± 75.4KB    23.9MB … 24.1MB          0 ( 0%)          -  0.1% ±  0.1%
  cpu_cycles          270M  ± 1.59M      268M  …  278M           4 ( 5%)        ⚡-  1.5% ±  0.2%
  instructions        781M  ±  400       781M  …  781M           2 ( 3%)          -  0.1% ±  0.0%
  cache_references   3.14M  ± 97.1K     2.85M  … 3.36M           0 ( 0%)          -  2.1% ±  1.1%
  cache_misses       83.8K  ± 14.4K     51.4K  …  129K           1 ( 1%)        💩+  7.0% ±  5.4%
  branch_misses      4.09M  ± 11.0K     4.08M  … 4.12M           0 ( 0%)          +  0.2% ±  0.1%
Benchmark 2 (152 runs): target/release/examples/blogpost-uncompress rs-chunked 12 silesia-small.tar.gz
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          32.9ms ±  842us    32.3ms … 41.8ms         10 ( 7%)        ⚡-  2.8% ±  0.5%
  peak_rss           24.1MB ± 58.4KB    24.0MB … 24.1MB          0 ( 0%)          +  0.0% ±  0.1%
  cpu_cycles          111M  ± 2.73M      110M  …  141M          13 ( 9%)        ⚡-  3.7% ±  0.4%
  instructions        265M  ±  276       265M  …  265M           0 ( 0%)        ⚡-  6.4% ±  0.0%
  cache_references   3.38M  ±  153K     3.13M  … 4.62M           5 ( 3%)        ⚡-  2.3% ±  0.9%
  cache_misses       40.3K  ± 12.2K     30.6K  …  167K          10 ( 7%)        💩+  8.7% ±  6.0%
  branch_misses      1.79M  ± 1.95K     1.79M  … 1.81M           8 ( 5%)          -  0.1% ±  0.0%
Benchmark 2 (62 runs): target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          81.8ms ±  950us    80.1ms … 86.5ms          3 ( 5%)          -  0.1% ±  0.4%
  peak_rss           26.7MB ± 68.6KB    26.5MB … 26.7MB          0 ( 0%)          -  0.0% ±  0.1%
  cpu_cycles          288M  ± 2.76M      284M  …  303M           1 ( 2%)          +  0.3% ±  0.3%
  instructions        582M  ±  296       582M  …  582M           0 ( 0%)          -  1.0% ±  0.0%
  cache_references   19.9M  ±  211K     19.4M  … 20.5M           5 ( 8%)          +  0.0% ±  0.3%
  cache_misses        407K  ± 83.5K      293K  …  644K           5 ( 8%)        ⚡-  9.6% ±  5.7%
  branch_misses      2.97M  ± 2.46K     2.97M  … 2.98M           1 ( 2%)          +  0.2% ±  0.0%
Benchmark 2 (12 runs): target/release/examples/blogpost-compress 9 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           429ms ± 2.82ms     426ms …  434ms          0 ( 0%)        ⚡-  2.5% ±  0.8%
  peak_rss           24.4MB ±  102KB    24.2MB … 24.5MB          0 ( 0%)          +  0.0% ±  0.3%
  cpu_cycles         1.84G  ± 9.16M     1.83G  … 1.86G           2 (17%)        ⚡-  2.5% ±  0.8%
  instructions       3.43G  ±  270      3.43G  … 3.43G           0 ( 0%)          -  0.6% ±  0.0%
  cache_references    197M  ±  759K      196M  …  198M           0 ( 0%)          -  0.9% ±  0.8%
  cache_misses       2.36M  ±  438K     1.69M  … 3.00M           0 ( 0%)          - 34.7% ± 40.4%
  branch_misses      15.8M  ± 41.1K     15.7M  … 15.8M           0 ( 0%)          +  0.1% ±  0.2%

codecov · 2025-05-09T10:21:42Z

Codecov Report

Attention: Patch coverage is 52.50000% with 19 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
zlib-rs/src/cpu_features.rs	34.48%	19 Missing ⚠️

Flag	Coverage Δ
fuzz-compress	`41.41% <76.47%> (-0.45%)`	⬇️
fuzz-decompress	`31.09% <70.58%> (+0.08%)`	⬆️
test-aarch64-apple-darwin	`88.91% <22.22%> (-0.24%)`	⬇️
test-x86_64-apple-darwin	`85.63% <80.00%> (-0.04%)`	⬇️
test-x86_64-unknown-linux-gnu	`87.07% <95.00%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
zlib-rs/src/adler32.rs	`98.82% <100.00%> (ø)`
zlib-rs/src/adler32/avx2.rs	`92.50% <100.00%> (ø)`
zlib-rs/src/deflate/compare256.rs	`98.02% <100.00%> (ø)`
zlib-rs/src/deflate/slide_hash.rs	`98.96% <100.00%> (ø)`
zlib-rs/src/inflate.rs	`95.97% <100.00%> (-0.17%)`	⬇️
zlib-rs/src/inflate/writer.rs	`91.54% <100.00%> (ø)`
zlib-rs/src/cpu_features.rs	`46.25% <34.48%> (-13.40%)`	⬇️

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

zlib-rs/src/cpu_features.rs

folkertdev · 2025-05-09T11:58:24Z

zlib-rs/src/cpu_features.rs

-    return std::is_x86_feature_detected!("pclmulqdq")
-        && std::is_x86_feature_detected!("sse2")
-        && std::is_x86_feature_detected!("sse4.1");
+    return std::is_x86_feature_detected!("pclmulqdq") && std::is_x86_feature_detected!("sse4.1");


pclmulqdq implies sse2 https://github.com/rust-lang/rust/blob/master/compiler/rustc_target/src/target_features.rs#L451

bjorn3 · 2025-05-09T12:09:43Z

How much does disabling bmi2 hurt? It contains the pdep and pext instructions which are slow on Zen 2 and lower.

folkertdev · 2025-05-09T12:59:27Z

Pretty significant, at least on my CPU

Benchmark 2 (74 runs): target/release/examples/blogpost-uncompress rs-chunked 4 silesia-small.tar.gz
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          68.0ms ±  376us    67.4ms … 69.8ms          1 ( 1%)        💩+  1.7% ±  0.2%
  peak_rss           24.1MB ± 70.5KB    23.9MB … 24.1MB          0 ( 0%)          +  0.0% ±  0.1%
  cpu_cycles          275M  ± 1.37M      274M  …  285M           6 ( 8%)        💩+  1.9% ±  0.2%
  instructions        786M  ±  383       786M  …  786M           2 ( 3%)          +  0.7% ±  0.0%
  cache_references   3.19M  ±  387K     2.97M  … 5.72M           3 ( 4%)          +  0.2% ±  3.4%
  cache_misses       64.8K  ± 11.4K     43.4K  …  123K           1 ( 1%)        ⚡- 52.3% ±  2.7%
  branch_misses      4.08M  ± 3.61K     4.08M  … 4.10M           2 ( 3%)          -  0.2% ±  0.1%
Benchmark 2 (153 runs): target/release/examples/blogpost-uncompress rs-chunked 12 silesia-small.tar.gz
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          32.8ms ±  803us    32.1ms … 41.1ms          9 ( 6%)        💩+  2.4% ±  0.4%
  peak_rss           24.1MB ± 64.3KB    23.9MB … 24.1MB          0 ( 0%)          +  0.0% ±  0.1%
  cpu_cycles          114M  ± 2.26M      113M  …  139M          17 (11%)        💩+  2.9% ±  0.3%
  instructions        285M  ±  257       285M  …  285M           0 ( 0%)        💩+  7.8% ±  0.0%
  cache_references   3.61M  ± 99.4K     3.35M  … 4.06M           4 ( 3%)        💩+  8.4% ±  0.7%
  cache_misses       36.6K  ± 6.04K     31.4K  … 86.8K           8 ( 5%)          -  1.1% ±  2.9%
  branch_misses      1.79M  ± 1.28K     1.79M  … 1.80M           5 ( 3%)          -  0.0% ±  0.0%

We already don't go you of our way to support older hardware, instead preferring the best performance on modern hardware, so I'm in favor of enabling it.

folkertdev force-pushed the enable-bmi2 branch from dd67290 to 59034e9 Compare May 9, 2025 10:31

folkertdev added 2 commits May 9, 2025 13:31

check for and enable the bmi2 target feature

2781219

add bmi1

62fcc88

folkertdev force-pushed the enable-bmi2 branch from 59034e9 to 62fcc88 Compare May 9, 2025 11:37

folkertdev commented May 9, 2025

View reviewed changes

zlib-rs/src/cpu_features.rs Show resolved Hide resolved

folkertdev commented May 9, 2025

View reviewed changes

folkertdev requested a review from bjorn3 May 9, 2025 11:59

folkertdev changed the title ~~check for and enable the bmi2 target feature~~ check for and enable the bmi1 and bmi2 target feature May 9, 2025

skip the cache when features are statically enabled

d7a8fc8

bjorn3 approved these changes May 9, 2025

View reviewed changes

folkertdev merged commit 3983883 into main May 9, 2025
24 checks passed

folkertdev deleted the enable-bmi2 branch May 9, 2025 14:05

BrewTestBot mentioned this pull request Jun 6, 2025

zlib-rs 0.5.1 Homebrew/homebrew-core#225936

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

check for and enable the `bmi1` and `bmi2` target feature #362

check for and enable the `bmi1` and `bmi2` target feature #362

Uh oh!

folkertdev commented May 9, 2025 •

edited

Loading

Uh oh!

codecov bot commented May 9, 2025 •

edited

Loading

Uh oh!

Uh oh!

folkertdev May 9, 2025

Uh oh!

bjorn3 commented May 9, 2025

Uh oh!

folkertdev commented May 9, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

check for and enable the bmi1 and bmi2 target feature #362

check for and enable the bmi1 and bmi2 target feature #362

Uh oh!

Conversation

folkertdev commented May 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented May 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

folkertdev May 9, 2025

Choose a reason for hiding this comment

Uh oh!

bjorn3 commented May 9, 2025

Uh oh!

folkertdev commented May 9, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

check for and enable the `bmi1` and `bmi2` target feature #362

check for and enable the `bmi1` and `bmi2` target feature #362

folkertdev commented May 9, 2025 •

edited

Loading

codecov bot commented May 9, 2025 •

edited

Loading