Conversation
@mat-1 mat-1 commented Mar 22, 2025

The BitReader::refill function flips the last 6 bits of bits_used by subtracting it from 63, but in this case it's actually faster to do that with xor.

asm before (assuming it's not inlined):

mov rax, qword ptr [rdi]
mov rdx, qword ptr [rax]
movzx ecx, byte ptr [rdi + 24]
shl rdx, cl
or qword ptr [rdi + 16], rdx
mov dl, 63
sub dl, cl
shr dl, 3
movzx edx, dl
add rdx, rax
mov qword ptr [rdi], rdx
or cl, 56
mov byte ptr [rdi + 24], cl
ret

asm after:

mov rax, qword ptr [rdi]
mov rdx, qword ptr [rax]
movzx ecx, byte ptr [rdi + 24]
shl rdx, cl
or qword ptr [rdi + 16], rdx
mov edx, ecx
shr edx, 3
xor rdx, 7
add rdx, rax
mov qword ptr [rdi], rdx
or cl, 56
mov byte ptr [rdi + 24], cl
ret

Usually this change wouldn't matter, but because we extend the increment to a usize afterwards, the compiler was adding an unnecessary movzx. With xor, the compiler is smart enough to realize that there's no need to zero-extend.
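As a rough sketch (hypothetical struct and field names, not the exact zlib-rs code), the refill step looks roughly like this; the `63 ^ bits_used` form computes the same byte increment as `63 - bits_used` whenever bits_used is at most 63, but lets the compiler see the upper bits are already zero:

```rust
// Illustrative stand-in for BitReader::refill; field names are made up.
struct BitReader {
    data: Vec<u8>,
    pos: usize,       // byte position of the next refill read
    bit_buffer: u64,
    bits_used: u8,    // invariant: at most 63
}

impl BitReader {
    fn refill(&mut self) {
        // Safe stand-in for the unaligned u64 little-endian read.
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(&self.data[self.pos..self.pos + 8]);
        let read = u64::from_le_bytes(bytes);
        self.bit_buffer |= read << self.bits_used;
        // `63 ^ bits_used` equals `63 - bits_used` for bits_used <= 63,
        // but avoids a zero-extension (movzx) before the pointer add.
        let increment = ((63 ^ self.bits_used) >> 3) as usize;
        self.pos += increment;
        self.bits_used |= 56; // matches the `or cl, 56` in the asm above
    }
}

fn main() {
    let mut r = BitReader { data: vec![0xAB; 16], pos: 0, bit_buffer: 0, bits_used: 3 };
    r.refill();
    assert_eq!(r.pos, 7); // (63 ^ 3) >> 3 == 7 bytes consumed
    assert_eq!(r.bits_used, 59); // 3 | 56
}
```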

In my benchmarks, this change makes inflate about 1-2% faster (and according to Callgrind it executes 0.9% fewer instructions).

@folkertdev (Member)

Hmm, this is a neat change, but I don't see any improvement (not in runtime, not in instructions executed) with our setup. That might be an artifact of the test file we use. This is what I'm running:

#!/bin/bash

# Check if a commit hash is provided as an argument
if [ -z "$1" ]; then
  echo "Usage: $0 <commit_hash> [<branch_or_commit>]"
  exit 1
fi

COMMIT=$1
BRANCH_OR_COMMIT=${2:-HEAD}

git checkout $COMMIT

# FLAGS="-Ctarget-cpu=native -Cllvm-args=-enable-dfa-jump-thread"
FLAGS="-Cllvm-args=-enable-dfa-jump-thread"
# FLAGS=""

RUSTFLAGS="$FLAGS" cargo build --release --example blogpost-uncompress
cp target/release/examples/blogpost-uncompress target/release/examples/uncompress-baseline

RUSTFLAGS="$FLAGS" cargo build --release --example blogpost-compress
cp target/release/examples/blogpost-compress target/release/examples/compress-baseline

git checkout $BRANCH_OR_COMMIT

RUSTFLAGS="$FLAGS" cargo build --release --example blogpost-uncompress
RUSTFLAGS="$FLAGS" cargo build --release --example blogpost-compress

poop "target/release/examples/uncompress-baseline rs-chunked 4 silesia-small.tar.gz"  "target/release/examples/blogpost-uncompress rs-chunked 4 silesia-small.tar.gz"
poop "target/release/examples/uncompress-baseline rs-chunked 12 silesia-small.tar.gz"  "target/release/examples/blogpost-uncompress rs-chunked 12 silesia-small.tar.gz"

this uses https://github.com/andrewrk/poop

With that I see no significant change, neither in runtime nor in instructions executed.

> sh compare.sh main mat1/main
<snip>
Benchmark 1 (73 runs): target/release/examples/uncompress-baseline rs-chunked 4 silesia-small.tar.gz
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          68.8ms ± 1.08ms    68.1ms … 76.1ms          8 (11%)        0%
  peak_rss           24.1MB ± 57.6KB    23.9MB … 24.1MB         14 (19%)        0%
  cpu_cycles          280M  ± 4.02M      277M  …  304M          10 (14%)        0%
  instructions        781M  ±  261       781M  …  781M           1 ( 1%)        0%
  cache_references   3.24M  ±  557K     2.94M  … 7.47M           5 ( 7%)        0%
  cache_misses        143K  ± 20.2K      100K  …  209K           1 ( 1%)        0%
  branch_misses      4.08M  ± 9.74K     4.08M  … 4.14M           8 (11%)        0%
Benchmark 2 (73 runs): target/release/examples/blogpost-uncompress rs-chunked 4 silesia-small.tar.gz
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          68.6ms ±  535us    68.0ms … 71.5ms          5 ( 7%)          -  0.3% ±  0.4%
  peak_rss           24.1MB ± 50.5KB    24.0MB … 24.1MB         13 (18%)          +  0.0% ±  0.1%
  cpu_cycles          279M  ± 1.92M      278M  …  290M           3 ( 4%)          -  0.3% ±  0.4%
  instructions        781M  ±  264       781M  …  781M           0 ( 0%)          -  0.0% ±  0.0%
  cache_references   3.21M  ±  525K     2.92M  … 6.44M           4 ( 5%)          -  1.0% ±  5.4%
  cache_misses        115K  ± 10.0K     86.2K  …  160K           4 ( 5%)        ⚡- 19.6% ±  3.6%
  branch_misses      4.08M  ± 9.54K     4.08M  … 4.13M           7 (10%)          +  0.0% ±  0.1%
Benchmark 1 (153 runs): target/release/examples/uncompress-baseline rs-chunked 12 silesia-small.tar.gz
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          32.7ms ±  614us    31.9ms … 37.3ms          6 ( 4%)        0%
  peak_rss           24.1MB ± 57.6KB    23.9MB … 24.1MB         30 (20%)        0%
  cpu_cycles          114M  ± 1.70M      114M  …  128M          17 (11%)        0%
  instructions        284M  ±  289       284M  …  284M           1 ( 1%)        0%
  cache_references   3.57M  ±  335K     3.33M  … 7.52M           3 ( 2%)        0%
  cache_misses       31.4K  ± 4.26K     27.3K  … 69.3K           7 ( 5%)        0%
  branch_misses      1.79M  ±  984      1.79M  … 1.80M           2 ( 1%)        0%
Benchmark 2 (153 runs): target/release/examples/blogpost-uncompress rs-chunked 12 silesia-small.tar.gz
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          32.7ms ±  518us    31.9ms … 37.9ms          8 ( 5%)          +  0.1% ±  0.4%
  peak_rss           24.1MB ± 61.9KB    23.9MB … 24.1MB         29 (19%)          -  0.0% ±  0.1%
  cpu_cycles          114M  ± 1.44M      114M  …  131M          16 (10%)          -  0.0% ±  0.3%
  instructions        284M  ±  252       284M  …  284M           1 ( 1%)          +  0.0% ±  0.0%
  cache_references   3.53M  ± 90.4K     3.34M  … 3.89M           1 ( 1%)          -  1.0% ±  1.5%
  cache_misses       33.2K  ± 4.28K     28.7K  … 69.3K           7 ( 5%)        💩+  5.8% ±  3.0%
  branch_misses      1.79M  ±  931      1.79M  … 1.80M           9 ( 6%)          +  0.0% ±  0.0%

How would I replicate your numbers?

  let read = unsafe { core::ptr::read_unaligned(self.ptr.cast::<u64>()) }.to_le();
  self.bit_buffer |= read << self.bits_used;
- let increment = (63 - self.bits_used) >> 3;
+ let increment = (63 ^ self.bits_used) >> 3;
@folkertdev (Member)

for future reference, this is correct assuming that self.bits_used is at most 63. We'll need to make sure that this is actually the case (I think it is, and running this function is wasteful anyway when the bit_buffer is already full. Just noting that; it's late over here right now, so this needs a careful look later).

https://alive2.llvm.org/ce/z/boN8R8
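That bound is the whole story: a quick check (a sketch, not project code) confirms that subtraction from the all-ones 6-bit mask never borrows while bits_used stays within the mask, so it agrees bit-for-bit with xor, and diverges as soon as a bit above the mask is set:

```rust
fn main() {
    // Within the 6-bit mask, 63 - x never borrows, so it equals 63 ^ x.
    for x in 0u32..=63 {
        assert_eq!((63 - x) >> 3, (63 ^ x) >> 3);
    }
    // Outside the mask the identity breaks: xor leaves bit 6 set,
    // while the subtraction wraps.
    let x = 64u32;
    assert_ne!(63u32.wrapping_sub(x) >> 3, (63 ^ x) >> 3);
}
```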

@folkertdev (Member)

Alright, I've looked at this: the only increment to bits_used is the bitwise or below, and the result can never exceed 63.
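That invariant is easy to spot-check (again a sketch, not project code): or-ing in 56 only sets bits that are already inside the 6-bit mask, so the value stays at most 63, and a refill always leaves at least 56 bits available:

```rust
fn main() {
    // 56 is 0b111000: or-ing it in cannot set bit 6 or above.
    for bits_used in 0u8..=63 {
        assert!((bits_used | 56) <= 63); // invariant preserved
        assert!((bits_used | 56) >= 56); // buffer is (nearly) full after refill
    }
}
```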


mat-1 commented Mar 23, 2025

Sure. I couldn't figure out the benchmark in the repo, so I just made my own Criterion program to benchmark wall time:

use criterion::{Bencher, Criterion, black_box, criterion_group, criterion_main};

const PAPER_100K: &[u8] =
    include_bytes!("../../zlib-rs/test-libz-rs-sys/src/test-data/paper-100k.pdf");

fn bench_compress(b: &mut Bencher, input: &[u8]) {
    b.iter(|| {
        let mut config = zlib_rs::deflate::DeflateConfig::default();
        config.level = 6;
        let mut output = [0; 1 << 17];
        let (output, err) = zlib_rs::deflate::compress_slice(&mut output, input, config);
        black_box((output, err));
    });
}
fn bench_decompress(b: &mut Bencher, input: &[u8]) {
    // compress it first
    let mut deflate_config = zlib_rs::deflate::DeflateConfig::default();
    deflate_config.level = 6;
    let mut compressed = [0; 1 << 17];
    let (compressed, _err) =
        zlib_rs::deflate::compress_slice(&mut compressed, input, deflate_config);
    // now decompress it
    b.iter(|| {
        let inflate_config = zlib_rs::inflate::InflateConfig::default();
        let mut output = [0; 1 << 17];
        let (output, err) =
            zlib_rs::inflate::uncompress_slice(&mut output, compressed, inflate_config);
        black_box((output, err));
    });
}

pub fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("compress paper-100k.pdf", |b| {
        bench_compress(b, PAPER_100K);
    });

    c.bench_function("decompress paper-100k.pdf", |b| {
        bench_decompress(b, PAPER_100K);
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

You may have to change that include_bytes depending on where you put the code. To run it, follow the Criterion quickstart and put that in benches/my_benchmark.rs, then run with cargo bench.
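For completeness, Criterion also needs the default test harness disabled for the bench target; something like this in Cargo.toml (the bench name and criterion version here are just examples matching the file name above):

```toml
[dev-dependencies]
criterion = "0.5"

[[bench]]
name = "my_benchmark"
harness = false
```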

And for the instructions benchmark I just put in my main.rs:

use std::hint::black_box;

fn main() {
    // const PAPER_100K: &[u8] =
    //     include_bytes!("../../zlib-rs/test-libz-rs-sys/src/test-data/paper-100k.pdf");
    // let mut compressed = [0; 1 << 17];
    // let mut deflate_config = zlib_rs::deflate::DeflateConfig::default();
    // deflate_config.level = 6;
    // let (compressed, _) =
    //     zlib_rs::deflate::compress_slice(&mut compressed, PAPER_100K, deflate_config);
    // std::fs::write("paper-100k.pdf.gz", compressed).unwrap();

    let compressed = include_bytes!("../paper-100k.pdf.gz");
    let inflate_config = zlib_rs::inflate::InflateConfig::default();
    let mut output = [0; 1 << 17];
    let (output, err) = zlib_rs::inflate::uncompress_slice(&mut output, compressed, inflate_config);
    assert_eq!(err, zlib_rs::ReturnCode::Ok);
    black_box(output);
}

I ran the commented code first to generate the paper-100k.pdf.gz, then I ran it with cargo b -r && valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes --simulate-cache=yes target/release/zlib-rs-bench (and looked at the "I refs" value).

I did also try it with RUSTFLAGS="-Cllvm-args=-enable-dfa-jump-thread" (but not target-cpu=native, since Valgrind segfaulted on it). I also tried your silesia-small.tar.gz (using my program, after increasing the output size to 1 << 27), and I'm still seeing a speedup (267,107,619 vs 264,931,868 instructions).

Here's what I get if I run your compare.sh:

Benchmark 1 (105 runs): target/release/examples/uncompress-baseline rs-chunked 4 silesia-small.tar.gz
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          47.8ms ± 1.39ms    46.9ms … 60.3ms          7 ( 7%)        0%
  peak_rss           25.3MB ±  672KB    24.2MB … 26.4MB          0 ( 0%)        0%
  cpu_cycles          244M  ± 3.13M      241M  …  271M          12 (11%)        0%
  instructions        777M  ±  326       777M  …  777M           3 ( 3%)        0%
  cache_references   2.67M  ±  219K     2.51M  … 4.70M           5 ( 5%)        0%
  cache_misses       24.0K  ± 1.87K     20.6K  … 30.7K           3 ( 3%)        0%
  branch_misses      4.12M  ± 10.2K     4.10M  … 4.15M           0 ( 0%)        0%
Benchmark 2 (105 runs): target/release/examples/blogpost-uncompress rs-chunked 4 silesia-small.tar.gz
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          47.8ms ±  512us    46.9ms … 49.7ms          2 ( 2%)          +  0.0% ±  0.6%
  peak_rss           25.2MB ±  607KB    24.0MB … 26.2MB          0 ( 0%)          -  0.3% ±  0.7%
  cpu_cycles          244M  ± 1.75M      241M  …  251M           6 ( 6%)          +  0.1% ±  0.3%
  instructions        776M  ±  405       776M  …  776M           1 ( 1%)          -  0.0% ±  0.0%
  cache_references   2.66M  ± 85.9K     2.51M  … 3.07M           4 ( 4%)          -  0.5% ±  1.7%
  cache_misses       23.9K  ± 2.12K     20.4K  … 31.3K           2 ( 2%)          -  0.4% ±  2.3%
  branch_misses      4.12M  ± 15.1K     4.06M  … 4.15M           3 ( 3%)          +  0.0% ±  0.1%
Benchmark 1 (234 runs): target/release/examples/uncompress-baseline rs-chunked 12 silesia-small.tar.gz
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          21.4ms ±  320us    20.7ms … 22.7ms          5 ( 2%)        0%
  peak_rss           25.3MB ±  587KB    24.0MB … 26.4MB          0 ( 0%)        0%
  cpu_cycles          106M  ±  821K      105M  …  113M          14 ( 6%)        0%
  instructions        284M  ±  349       284M  …  284M           1 ( 0%)        0%
  cache_references   3.70M  ±  101K     3.42M  … 4.38M           5 ( 2%)        0%
  cache_misses       26.6K  ± 1.87K     22.5K  … 33.0K           4 ( 2%)        0%
  branch_misses      1.81M  ± 2.26K     1.80M  … 1.82M           6 ( 3%)        0%
Benchmark 2 (234 runs): target/release/examples/blogpost-uncompress rs-chunked 12 silesia-small.tar.gz
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          21.4ms ±  228us    20.8ms … 22.2ms          2 ( 1%)          -  0.0% ±  0.2%
  peak_rss           25.3MB ±  656KB    24.1MB … 26.5MB          0 ( 0%)          +  0.2% ±  0.4%
  cpu_cycles          106M  ±  391K      105M  …  108M          11 ( 5%)          +  0.1% ±  0.1%
  instructions        283M  ±  345       283M  …  283M           0 ( 0%)          -  0.5% ±  0.0%
  cache_references   3.65M  ± 78.9K     3.44M  … 3.91M           1 ( 0%)          -  1.2% ±  0.4%
  cache_misses       26.4K  ± 1.76K     22.0K  … 33.3K          13 ( 6%)          -  0.8% ±  1.2%
  branch_misses      1.81M  ± 2.17K     1.80M  … 1.82M          14 ( 6%)          -  0.1% ±  0.0%

The differences here are sadly a little less significant than what I observed in my own benchmarks; I'm guessing that's because these benchmarks are "chunked", plus something related to how poop works? I also have no idea what's up with the difference in cache_references and cache_misses.

... Anyways, I notice your script doesn't error if the checkout fails, and you put the branch as mat1/main (as opposed to mat-1/main). That might be why you're not noticing a difference, lol.


@folkertdev folkertdev left a comment


Yeah, poop shows emoji for significant results, so the difference really isn't significant on your CPU. That's probably just instruction-level parallelism doing its job. Still, it could be relevant for other CPUs (and maybe e.g. wasm), so I'm happy to merge this with the comment added.


@folkertdev (Member)

Neat, thanks!

@folkertdev folkertdev merged commit c75f868 into trifectatechfoundation:main Apr 13, 2025
22 checks passed