Description
Hi!
Recently I ran many Profile-Guided Optimization (PGO) benchmarks on multiple projects - the results are available here. Based on those results, I think it's worth trying to apply PGO to Ouch. I have already run some benchmarks and want to share my results here.
Test environment
- Fedora 38
- Linux kernel 6.5.5
- AMD Ryzen 9 5900X
- 48 GiB RAM
- SSD Samsung 980 Pro 2 TiB
- Compiler - Rustc 1.73
- Ouch version: the latest for now from the main branch, on commit dc21932102011da61a85a98f43d9d8d9ab6bd917
- Disabled Turbo boost
Benchmark setup
For benchmarking purposes, I use these benchmarks - https://github.com/ouch-org/ouch/blob/main/benchmarks/run-benchmarks.sh . The Release build is done with cargo build --release, and the PGO-optimized build is done with cargo-pgo. PGO profiles are collected from the benchmark workload itself.
All benchmarks are run multiple times, on the same hardware/software setup, with the same background "noise" (as much as I can guarantee, of course).
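For anyone who wants to reproduce the PGO build, the steps look roughly like the sketch below. It is not the exact script I ran. The target triple, the path of the instrumented binary, and the workload inputs (rust, output.tar) are assumptions that mirror the benchmark commands, and the pgo_workflow and run helpers are hypothetical names. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/usr/bin/env bash
set -eu

# Dry run by default: print each command instead of executing it.
# Set DRY_RUN=0 to actually run the commands.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

pgo_workflow() {
    # Assumption: x86_64 Linux host. cargo-pgo places its build output
    # under target/<target-triple>/release.
    triple="${TARGET_TRIPLE:-x86_64-unknown-linux-gnu}"
    instrumented="target/${triple}/release/ouch"

    # One-time setup: cargo-pgo plus the LLVM tools used to merge profiles.
    run cargo install cargo-pgo
    run rustup component add llvm-tools-preview

    # 1. Build an instrumented binary.
    run cargo pgo build

    # 2. Run a representative workload so the binary writes profile data.
    #    Hypothetical inputs, modelled on the benchmark commands.
    run "$instrumented" compress rust output.tar
    run "$instrumented" decompress output.tar --dir output

    # 3. Rebuild, this time using the collected profiles.
    run cargo pgo optimize
}

pgo_workflow
```

The profile is only as good as the workload in step 2, so it should resemble real usage as closely as possible.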
Results
ouch_release - Release build, ouch_optimized - Release + PGO build.
I got the following results:
./run-benchmarks.sh
Benchmark 1: ./ouch_release compress rust output.tar
Time (mean ± σ): 781.0 ms ± 3.9 ms [User: 119.2 ms, System: 649.2 ms]
Range (min … max): 772.3 ms … 789.9 ms 50 runs
Benchmark 2: ./ouch_optimized compress rust output.tar
Time (mean ± σ): 759.7 ms ± 7.0 ms [User: 104.1 ms, System: 643.2 ms]
Range (min … max): 732.5 ms … 784.5 ms 50 runs
Summary
./ouch_optimized compress rust output.tar ran
1.03 ± 0.01 times faster than ./ouch_release compress rust output.tar
Creating tar archive to benchmark decompression...
Benchmark 1: ./ouch_release decompress input.tar --dir output
Time (mean ± σ): 3.138 s ± 0.022 s [User: 0.339 s, System: 2.725 s]
Range (min … max): 3.103 s … 3.239 s 50 runs
Benchmark 2: ./ouch_optimized decompress input.tar --dir output
Time (mean ± σ): 3.091 s ± 0.014 s [User: 0.312 s, System: 2.704 s]
Range (min … max): 3.063 s … 3.134 s 50 runs
Summary
./ouch_optimized decompress input.tar --dir output ran
1.02 ± 0.01 times faster than ./ouch_release decompress input.tar --dir output
Benchmark 1: ./ouch_release compress compiler output.tar.gz
Time (mean ± σ): 70.5 ms ± 2.6 ms [User: 729.9 ms, System: 62.0 ms]
Range (min … max): 66.5 ms … 79.9 ms 50 runs
Benchmark 2: ./ouch_optimized compress compiler output.tar.gz
Time (mean ± σ): 68.8 ms ± 2.3 ms [User: 727.0 ms, System: 62.3 ms]
Range (min … max): 64.6 ms … 76.3 ms 50 runs
Summary
./ouch_optimized compress compiler output.tar.gz ran
1.02 ± 0.05 times faster than ./ouch_release compress compiler output.tar.gz
Creating tar.gz archive to benchmark decompression...
Benchmark 1: ./ouch_release decompress input.tar.gz --dir output
Time (mean ± σ): 255.9 ms ± 4.0 ms [User: 82.4 ms, System: 173.9 ms]
Range (min … max): 251.7 ms … 273.4 ms 50 runs
Benchmark 2: ./ouch_optimized decompress input.tar.gz --dir output
Time (mean ± σ): 254.8 ms ± 2.9 ms [User: 79.2 ms, System: 175.4 ms]
Range (min … max): 250.6 ms … 263.6 ms 50 runs
Summary
./ouch_optimized decompress input.tar.gz --dir output ran
1.00 ± 0.02 times faster than ./ouch_release decompress input.tar.gz --dir output
Benchmark 1: ./ouch_optimized compress compiler output.zip
Time (mean ± σ): 523.7 ms ± 1.4 ms [User: 474.3 ms, System: 46.8 ms]
Range (min … max): 521.4 ms … 530.8 ms 50 runs
Benchmark 2: ./ouch_release compress compiler output.zip
Time (mean ± σ): 527.0 ms ± 2.5 ms [User: 479.2 ms, System: 45.1 ms]
Range (min … max): 524.2 ms … 535.9 ms 50 runs
Summary
./ouch_optimized compress compiler output.zip ran
1.01 ± 0.01 times faster than ./ouch_release compress compiler output.zip
Creating zip archive to benchmark decompression...
Benchmark 1: ./ouch_release decompress input.zip --dir output
Time (mean ± σ): 241.0 ms ± 2.0 ms [User: 84.2 ms, System: 157.6 ms]
Range (min … max): 238.7 ms … 249.3 ms 50 runs
Benchmark 2: ./ouch_optimized decompress input.zip --dir output
Time (mean ± σ): 243.5 ms ± 3.1 ms [User: 84.6 ms, System: 158.6 ms]
Range (min … max): 236.7 ms … 253.0 ms 50 runs
Summary
./ouch_release decompress input.zip --dir output ran
1.01 ± 0.02 times faster than ./ouch_optimized decompress input.zip --dir output
check results at results.md
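According to the tests, PGO gives improvements of up to about 3% in these benchmarks: tar compression and decompression are about 2-3% faster, while the tar.gz and zip results are within or close to the measurement uncertainty.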
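To make it easier to judge which differences are meaningful, here is a small sketch that recomputes the "N ± M times faster" figures from the reported means and standard deviations. It assumes the uncertainty of the ratio is obtained by combining the two relative standard deviations in quadrature, which is consistent with the Summary lines above. The speedup helper is a hypothetical name, not part of any tool.

```shell
#!/usr/bin/env bash
set -eu

# speedup MEAN_BASE SD_BASE MEAN_NEW SD_NEW
# Prints "ratio ± uncertainty", where ratio = MEAN_BASE / MEAN_NEW and the
# uncertainty combines both relative standard deviations in quadrature.
speedup() {
    awk -v mb="$1" -v sb="$2" -v mn="$3" -v sn="$4" 'BEGIN {
        ratio = mb / mn
        rb = sb / mb
        rn = sn / mn
        err = ratio * sqrt(rb * rb + rn * rn)
        printf "%.2f ± %.2f\n", ratio, err
    }'
}

# tar compression: release vs. optimized (times in ms)
speedup 781.0 3.9 759.7 7.0      # prints "1.03 ± 0.01"

# tar decompression: release vs. optimized (times in s)
speedup 3.138 0.022 3.091 0.014  # prints "1.02 ± 0.01"

# zip decompression: optimized vs. release (times in ms)
speedup 243.5 3.1 241.0 2.0      # prints "1.01 ± 0.02"
```

When the uncertainty is as large as the distance of the ratio from 1.00, as in the zip decompression case, the two builds cannot really be told apart.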
Further steps
I can suggest the following things to do:
- Evaluate PGO's applicability to Ouch in more scenarios.
- If PGO helps to achieve better performance - add a note to Ouch's documentation about that (probably somewhere in the README file). In this case, users and maintainers will be aware of another optimization opportunity for Ouch.
- Provide PGO integration into the build scripts. It can help users and maintainers easily apply PGO for their own workloads.
- Optimize prebuilt binaries with PGO.
Here are some examples of how PGO is already integrated into other projects' build scripts:
- Rustc: a CI script for the multi-stage build
- GCC:
- Clang: Docs
- Python:
- Go: Bash script
- V8: Bazel flag
- ChakraCore: Scripts
- Chromium: Script
- Firefox: Docs
- Thunderbird has PGO support too
- PHP - Makefile command and old Centminmod scripts
- MySQL: CMake script
- YugabyteDB: GitHub commit
- FoundationDB: Script
- Zstd: Makefile
- Foot: Scripts
- Windows Terminal: GitHub PR
- Pydantic-core: GitHub PR
I can also suggest evaluating LLVM BOLT as an additional optimization step after PGO.