Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT

Hi!

I recently ran Profile-Guided Optimization (PGO) benchmarks on many projects; the results are available [here](https://github.com/zamazan4ik/awesome-pgo). Based on those results, I think it is worth trying PGO on Ouch. I have already run some benchmarks and want to share the results here.

## Test environment

* Fedora 38
* Linux kernel 6.5.5
* AMD Ryzen 9 5900X
* 48 GiB RAM
* SSD Samsung 980 Pro 2 TiB
* Compiler: rustc 1.73
* Ouch version: latest `main` branch at the time of testing, commit `dc21932102011da61a85a98f43d9d8d9ab6bd917`
* Turbo Boost disabled

## Benchmark setup

For benchmarking, I use the project's own benchmark script: https://github.com/ouch-org/ouch/blob/main/benchmarks/run-benchmarks.sh . The release build is produced with `cargo build --release`, and the PGO-optimized build with [cargo-pgo](https://github.com/Kobzol/cargo-pgo). PGO profiles are collected from the benchmark workload itself.

All benchmarks are run multiple times on the same hardware/software setup, with the same background "noise" (as far as I can guarantee, of course).
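For anyone who wants to reproduce the PGO build, the cargo-pgo steps can be sketched roughly as below. This is a minimal sketch, not the exact commands I ran: the `run` helper, the `pgo_workflow` function and the `DRY_RUN` switch are hypothetical names added for illustration, and it assumes the benchmark script is pointed at the instrumented binary during the training step.

```shell
#!/usr/bin/env bash
# Sketch of a cargo-pgo workflow. DRY_RUN=1 (the default here) only prints
# each step; set DRY_RUN=0 to actually execute the commands.
set -euo pipefail

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

pgo_workflow() {
    # One-time prerequisites: the cargo-pgo subcommand and llvm-profdata.
    run cargo install cargo-pgo
    run rustup component add llvm-tools-preview
    # 1. Build an instrumented binary.
    run cargo pgo build
    # 2. Run the training workload. Assumption: the script has been
    #    adjusted to invoke the instrumented binary built in step 1.
    run ./benchmarks/run-benchmarks.sh
    # 3. Rebuild using the collected profiles.
    run cargo pgo optimize
}

pgo_workflow
```

With the default `DRY_RUN=1` the script only lists the five steps, which makes it safe to read through before running anything.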

## Results

`ouch_release` - Release build, `ouch_optimized` - Release + PGO build.

I got the following results:
```
./run-benchmarks.sh
Benchmark 1: ./ouch_release compress rust output.tar
  Time (mean ± σ):     781.0 ms ±   3.9 ms    [User: 119.2 ms, System: 649.2 ms]
  Range (min … max):   772.3 ms … 789.9 ms    50 runs

Benchmark 2: ./ouch_optimized compress rust output.tar
  Time (mean ± σ):     759.7 ms ±   7.0 ms    [User: 104.1 ms, System: 643.2 ms]
  Range (min … max):   732.5 ms … 784.5 ms    50 runs

Summary
  ./ouch_optimized compress rust output.tar ran
    1.03 ± 0.01 times faster than ./ouch_release compress rust output.tar
Creating tar archive to benchmark decompression...
Benchmark 1: ./ouch_release decompress input.tar --dir output
  Time (mean ± σ):      3.138 s ±  0.022 s    [User: 0.339 s, System: 2.725 s]
  Range (min … max):    3.103 s …  3.239 s    50 runs

Benchmark 2: ./ouch_optimized decompress input.tar --dir output
  Time (mean ± σ):      3.091 s ±  0.014 s    [User: 0.312 s, System: 2.704 s]
  Range (min … max):    3.063 s …  3.134 s    50 runs

Summary
  ./ouch_optimized decompress input.tar --dir output ran
    1.02 ± 0.01 times faster than ./ouch_release decompress input.tar --dir output
Benchmark 1: ./ouch_release compress compiler output.tar.gz
  Time (mean ± σ):      70.5 ms ±   2.6 ms    [User: 729.9 ms, System: 62.0 ms]
  Range (min … max):    66.5 ms …  79.9 ms    50 runs

Benchmark 2: ./ouch_optimized compress compiler output.tar.gz
  Time (mean ± σ):      68.8 ms ±   2.3 ms    [User: 727.0 ms, System: 62.3 ms]
  Range (min … max):    64.6 ms …  76.3 ms    50 runs

Summary
  ./ouch_optimized compress compiler output.tar.gz ran
    1.02 ± 0.05 times faster than ./ouch_release compress compiler output.tar.gz
Creating tar.gz archive to benchmark decompression...
Benchmark 1: ./ouch_release decompress input.tar.gz --dir output
  Time (mean ± σ):     255.9 ms ±   4.0 ms    [User: 82.4 ms, System: 173.9 ms]
  Range (min … max):   251.7 ms … 273.4 ms    50 runs

Benchmark 2: ./ouch_optimized decompress input.tar.gz --dir output
  Time (mean ± σ):     254.8 ms ±   2.9 ms    [User: 79.2 ms, System: 175.4 ms]
  Range (min … max):   250.6 ms … 263.6 ms    50 runs

Summary
  ./ouch_optimized decompress input.tar.gz --dir output ran
    1.00 ± 0.02 times faster than ./ouch_release decompress input.tar.gz --dir output
Benchmark 1: ./ouch_optimized compress compiler output.zip
  Time (mean ± σ):     523.7 ms ±   1.4 ms    [User: 474.3 ms, System: 46.8 ms]
  Range (min … max):   521.4 ms … 530.8 ms    50 runs

Benchmark 2: ./ouch_release compress compiler output.zip
  Time (mean ± σ):     527.0 ms ±   2.5 ms    [User: 479.2 ms, System: 45.1 ms]
  Range (min … max):   524.2 ms … 535.9 ms    50 runs

Summary
  ./ouch_optimized compress compiler output.zip ran
    1.01 ± 0.01 times faster than ./ouch_release compress compiler output.zip
Creating zip archive to benchmark decompression...
Benchmark 1: ./ouch_release decompress input.zip --dir output
  Time (mean ± σ):     241.0 ms ±   2.0 ms    [User: 84.2 ms, System: 157.6 ms]
  Range (min … max):   238.7 ms … 249.3 ms    50 runs

Benchmark 2: ./ouch_optimized decompress input.zip --dir output
  Time (mean ± σ):     243.5 ms ±   3.1 ms    [User: 84.6 ms, System: 158.6 ms]
  Range (min … max):   236.7 ms … 253.0 ms    50 runs

Summary
  ./ouch_release decompress input.zip --dir output ran
    1.01 ± 0.02 times faster than ./ouch_optimized decompress input.zip --dir output

check results at results.md
```

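According to these tests, PGO gives a 1-3% wall-clock improvement in four of the six benchmarks (tar compression and decompression, tar.gz compression, zip compression). The tar.gz decompression result is within noise (1.00 ± 0.02), and zip decompression is about 1% slower with PGO (1.01 ± 0.02 in favor of the release build). The gains are small, most likely because several of these workloads are dominated by system time rather than user time.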
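To show where the "N ± M times faster" figures in the Summary lines come from, here is a small sketch that recomputes them from the reported mean and σ values. It assumes the uncertainty is obtained by first-order error propagation for a quotient, which appears consistent with the numbers above; the `speedup` function name is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# speedup SLOW_MEAN SLOW_SD FAST_MEAN FAST_SD
# Prints "ratio ± uncertainty", where
#   ratio       = slow_mean / fast_mean
#   uncertainty = ratio * sqrt((slow_sd/slow_mean)^2 + (fast_sd/fast_mean)^2)
speedup() {
    awk -v m1="$1" -v s1="$2" -v m2="$3" -v s2="$4" 'BEGIN {
        r = m1 / m2
        e = r * sqrt((s1 / m1) ^ 2 + (s2 / m2) ^ 2)
        printf "%.2f ± %.2f\n", r, e
    }'
}

# tar compression: release 781.0 ± 3.9 ms, PGO 759.7 ± 7.0 ms
speedup 781.0 3.9 759.7 7.0       # prints 1.03 ± 0.01

# tar decompression: release 3.138 ± 0.022 s, PGO 3.091 ± 0.014 s
speedup 3.138 0.022 3.091 0.014   # prints 1.02 ± 0.01
```

Both results match the corresponding Summary lines. When the uncertainty is as large as the difference from 1.00 (as in the tar.gz and zip decompression runs), the benchmark cannot distinguish the two builds.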

## Further steps

I can suggest the following things to do:
* Evaluate PGO's applicability to Ouch in more scenarios.
* If PGO helps to achieve better performance, add a note about it to Ouch's documentation (probably somewhere in the README file). That way, users and maintainers will be aware of another optimization opportunity for Ouch.
* Provide PGO integration into the build scripts. It can help users and maintainers easily apply PGO for their own workloads.
* Optimize prebuilt binaries with PGO.

Here are some examples of how PGO is already integrated into other projects' build scripts:
* Rustc: a CI [script](https://github.com/rust-lang/rust/blob/master/src/ci/stage-build.py) for the multi-stage build
* GCC:
  - Official [docs](https://gcc.gnu.org/install/build.html), section "Building with profile feedback" (even AutoFDO build is supported)
  - A [part](https://github.com/gcc-mirror/gcc/blob/4832767db7897be6fb5cbc44f079482c90cb95a6/configure#L7818) in a "wonderful" `configure` script 
* Clang: [Docs](https://llvm.org/docs/HowToBuildWithPGO.html) 
* Python: 
  - CPython: [README](https://github.com/python/cpython#profile-guided-optimization)
  - Pyston: [README](https://github.com/pyston/pyston#building)
* Go: [Bash script](https://github.com/golang/go/blob/master/src/cmd/compile/profile.sh)
* V8: [GN flag](https://github.com/v8/v8/blob/main/BUILD.gn#L184)
* ChakraCore: [Scripts](https://github.com/chakra-core/ChakraCore/tree/master/Build/scripts/pgo)
* Chromium: [Script](https://chromium.googlesource.com/chromium/src/build/config/+/refs/heads/main/compiler/pgo/BUILD.gn)
* Firefox: [Docs](https://firefox-source-docs.mozilla.org/build/buildsystem/pgo.html)
   - Thunderbird has PGO support too
* PHP - [Makefile command](https://github.com/php/php-src/blob/master/build/Makefile.global#L138) and old Centminmod [scripts](https://github.com/centminmod/php_pgo_training_scripts)
* MySQL: [CMake script](https://github.com/mysql/mysql-server/blob/8.0/cmake/fprofile.cmake)
* YugabyteDB: [GitHub commit](https://github.com/yugabyte/yugabyte-db/commit/34cb791ed9d3d5f8ae9a9b9e9181a46485e1981d)
* FoundationDB: [Script](https://github.com/apple/foundationdb/blob/1a6114a66f3de508c0cf0a45f72f3687ba05750c/contrib/generate_profile.sh)
* Zstd: [Makefile](https://github.com/facebook/zstd/blob/dev/programs/Makefile#L232)
* [Foot](https://codeberg.org/dnkl/foot): [Scripts](https://codeberg.org/dnkl/foot/src/branch/master/pgo)
* Windows Terminal: [GitHub PR](https://github.com/microsoft/terminal/pull/10071)
* Pydantic-core: [GitHub PR](https://github.com/pydantic/pydantic-core/pull/741)

Once PGO is in place, I suggest evaluating [LLVM BOLT](https://github.com/llvm/llvm-project/blob/main/bolt/README.md) as an additional post-link optimization step.
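cargo-pgo can also drive BOLT, so a combined PGO + BOLT evaluation could look roughly like the sketch below. I have not run this for Ouch; the `run` helper, the `bolt_workflow` function and the `DRY_RUN` switch are hypothetical names for illustration, it assumes `llvm-bolt` and `merge-fdata` are available on `PATH`, and it assumes the benchmark script is pointed at the instrumented binary in each training step.

```shell
#!/usr/bin/env bash
# Sketch of a combined PGO + BOLT workflow with cargo-pgo.
# DRY_RUN=1 (the default here) only prints each step.
set -euo pipefail

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

bolt_workflow() {
    # 1. Build a PGO-instrumented binary and collect PGO profiles.
    run cargo pgo build
    run ./benchmarks/run-benchmarks.sh
    # 2. Build a PGO-optimized binary that is instrumented for BOLT.
    run cargo pgo bolt build --with-pgo
    # 3. Collect BOLT profiles with the BOLT-instrumented binary.
    run ./benchmarks/run-benchmarks.sh
    # 4. Produce the final binary optimized with both PGO and BOLT.
    run cargo pgo bolt optimize --with-pgo
}

bolt_workflow
```

The resulting binary can then be added to the same benchmark script as a third candidate next to `ouch_release` and `ouch_optimized`.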
