Micro-optimization for decompression #320
Conversation
Hmm, this is a neat change, but I don't see any improvement (not in runtime, nor in instructions executed) with our setup. That might be an artifact of the test file we use. This is what I'm running:

```bash
#!/bin/bash

# Check if a commit hash is provided as an argument
if [ -z "$1" ]; then
    echo "Usage: $0 <commit_hash> [<branch_or_commit>]"
    exit 1
fi

COMMIT=$1
BRANCH_OR_COMMIT=${2:-HEAD}

git checkout "$COMMIT"

# FLAGS="-Ctarget-cpu=native -Cllvm-args=-enable-dfa-jump-thread"
FLAGS="-Cllvm-args=-enable-dfa-jump-thread"
# FLAGS=""

RUSTFLAGS="$FLAGS" cargo build --release --example blogpost-uncompress
cp target/release/examples/blogpost-uncompress target/release/examples/uncompress-baseline
RUSTFLAGS="$FLAGS" cargo build --release --example blogpost-compress
cp target/release/examples/blogpost-compress target/release/examples/compress-baseline

git checkout "$BRANCH_OR_COMMIT"

RUSTFLAGS="$FLAGS" cargo build --release --example blogpost-uncompress
RUSTFLAGS="$FLAGS" cargo build --release --example blogpost-compress

poop "target/release/examples/uncompress-baseline rs-chunked 4 silesia-small.tar.gz" "target/release/examples/blogpost-uncompress rs-chunked 4 silesia-small.tar.gz"
poop "target/release/examples/uncompress-baseline rs-chunked 12 silesia-small.tar.gz" "target/release/examples/blogpost-uncompress rs-chunked 12 silesia-small.tar.gz"
```

This uses https://github.com/andrewrk/poop. With that I see no significant change, neither in runtime nor in instructions executed. How would I replicate your numbers?
```diff
         let read = unsafe { core::ptr::read_unaligned(self.ptr.cast::<u64>()) }.to_le();
         self.bit_buffer |= read << self.bits_used;
-        let increment = (63 - self.bits_used) >> 3;
+        let increment = (63 ^ self.bits_used) >> 3;
```
For future reference, this is correct assuming that `self.bits_used` is at most 63. We'll need to make sure that this is actually the case. (I think it should be, and running this function is wasteful when the bit buffer is already full. Just noting that; it's late over here right now, so this needs a careful look later.)
Alright, I've looked at this: the only increment to `bits_used` is the bitwise or below, which would never exceed 63.
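As a standalone sanity check of that invariant (a sketch, not the actual zlib-rs code): for every starting `bits_used` in `0..=63`, refilling whole bytes via `(63 ^ bits_used) >> 3` lands back in `56..=63`, so `bits_used` never exceeds 63 and `63 ^ bits_used` stays interchangeable with `63 - bits_used`.

```rust
fn main() {
    for bits_used in 0u64..=63 {
        // Number of whole bytes that still fit in the 64-bit bit buffer.
        // `63 ^ bits_used` equals `63 - bits_used` because bits_used <= 63,
        // so the xor flips the low six bits with no borrow.
        let increment = (63 ^ bits_used) >> 3;
        assert_eq!(increment, (63 - bits_used) >> 3);

        // After refilling, the buffer holds between 56 and 63 valid bits,
        // so the invariant `bits_used <= 63` is preserved on the next call.
        let refilled = bits_used + increment * 8;
        assert!((56..=63).contains(&refilled));
    }
    println!("invariant holds for all bits_used in 0..=63");
}
```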
Sure, I couldn't figure out the benchmark in the repo, so I made my own Criterion program to benchmark wall time:

```rust
use criterion::{Bencher, Criterion, black_box, criterion_group, criterion_main};

const PAPER_100K: &[u8] =
    include_bytes!("../../zlib-rs/test-libz-rs-sys/src/test-data/paper-100k.pdf");

fn bench_compress(b: &mut Bencher, input: &[u8]) {
    b.iter(|| {
        let mut config = zlib_rs::deflate::DeflateConfig::default();
        config.level = 6;
        let mut output = [0; 1 << 17];
        let (output, err) = zlib_rs::deflate::compress_slice(&mut output, input, config);
        black_box((output, err));
    });
}

fn bench_decompress(b: &mut Bencher, input: &[u8]) {
    // compress it first
    let mut deflate_config = zlib_rs::deflate::DeflateConfig::default();
    deflate_config.level = 6;
    let mut compressed = [0; 1 << 17];
    let (compressed, _err) =
        zlib_rs::deflate::compress_slice(&mut compressed, input, deflate_config);

    // now decompress it
    b.iter(|| {
        let inflate_config = zlib_rs::inflate::InflateConfig::default();
        let mut output = [0; 1 << 17];
        let (output, err) =
            zlib_rs::inflate::uncompress_slice(&mut output, compressed, inflate_config);
        black_box((output, err));
    });
}

pub fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("compress paper-100k.pdf", |b| {
        bench_compress(b, PAPER_100K);
    });
    c.bench_function("decompress paper-100k.pdf", |b| {
        bench_decompress(b, PAPER_100K);
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
```

You may have to change that path. And for the instructions benchmark I just put this in my `main`:

```rust
use std::hint::black_box;

fn main() {
    // const PAPER_100K: &[u8] =
    //     include_bytes!("../../zlib-rs/test-libz-rs-sys/src/test-data/paper-100k.pdf");
    // let mut compressed = [0; 1 << 17];
    // let mut deflate_config = zlib_rs::deflate::DeflateConfig::default();
    // deflate_config.level = 6;
    // let (compressed, _) =
    //     zlib_rs::deflate::compress_slice(&mut compressed, PAPER_100K, deflate_config);
    // std::fs::write("paper-100k.pdf.gz", compressed).unwrap();

    let compressed = include_bytes!("../paper-100k.pdf.gz");
    let inflate_config = zlib_rs::inflate::InflateConfig::default();
    let mut output = [0; 1 << 17];
    let (output, err) = zlib_rs::inflate::uncompress_slice(&mut output, compressed, inflate_config);
    assert_eq!(err, zlib_rs::ReturnCode::Ok);
    black_box(output);
}
```

I ran the commented code first to generate `paper-100k.pdf.gz`, then I ran it under Callgrind; I did also try a few variations.

Here's what I get if I run your compare.sh: the differences here are sadly a little less significant than what I observed in my own benchmarks. I'm guessing the differences are because these benchmarks are "chunked" and something related to how …

Anyways, I notice your script doesn't error if the checkout fails, and you put the branch as …
folkertdev left a comment:
Yeah, poop shows an emoji for significant results, so the difference really isn't significant on your CPU. That's probably just instruction-level parallelism doing its job. Still, it could be relevant for other CPUs (e.g. wasm, maybe), so I'm happy to merge this with the comment added.
Neat, thanks!
The `BitReader::refill` function flips the last 6 bits by subtracting a number from 63, but in this case it's actually faster to do it with xor.
asm before (assuming it's not inlined):
asm after:
Usually this change wouldn't matter, but because we extend the `increment` into a `usize` afterwards, the compiler was adding an unnecessary `movzx`. However, if we use xor, the compiler is smart enough to realize that there's no need to zero-extend. In my benchmarks, this change makes inflate about 1–2% faster (and according to Callgrind it executes 0.9% fewer instructions).
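The identity the change relies on can be shown in isolation (a hypothetical standalone sketch, not the zlib-rs source; the function names are made up for illustration): because `bits_used` is at most 63, only its low six bits can be set, and xoring with `0b111111` flips exactly those bits with no borrow, which is the same as subtracting from 63.

```rust
// Hypothetical helpers mirroring the old and new expressions from the diff.
fn increment_sub(bits_used: u64) -> usize {
    ((63 - bits_used) >> 3) as usize
}

fn increment_xor(bits_used: u64) -> usize {
    // 63 is 0b111111; for bits_used <= 63 the xor equals the subtraction,
    // and the compiler can see the result already fits without zero-extension.
    ((63 ^ bits_used) >> 3) as usize
}

fn main() {
    // The two expressions agree over the full valid range of bits_used.
    for bits_used in 0u64..=63 {
        assert_eq!(increment_sub(bits_used), increment_xor(bits_used));
    }
    println!("63 - x == 63 ^ x for all x in 0..=63");
}
```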