Description
Hi!
Recently I ran Profile-Guided Optimization (PGO) benchmarks on multiple projects - the results are available here. That's why I think it's worth trying to apply PGO to jql. I have already performed some benchmarks and want to share my results here.
Test environment
- Fedora 38
- Linux kernel 6.5.5
- AMD Ryzen 9 5900X
- 48 GiB RAM
- SSD Samsung 980 Pro 2 TiB
- Compiler - Rustc 1.73
- jql version: the latest from the main branch at commit 7729cbafaea0367c2f86227234fb3f8e9a8fd905
- Disabled Turbo boost
Benchmark setup
For benchmarking purposes, I use the scenario from https://github.com/yamafaktory/jql/blob/main/performance.sh , slightly tweaked (removed the vanilla jq invocations and added multiple jql versions to the script) - the edited version is available here. The Release build is done with cargo build --release, and the PGO-optimized build is done with cargo-pgo.
I tested 3 configurations:
- Default Release, binary called jql_z
- Tweaked Release (uses opt-level = 3), binary called jql_opt_3
- PGO-optimized, binary called jql_optimized
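For anyone who wants to rebuild the first two configurations, here is a minimal sketch. It is not the exact procedure used for the numbers in this issue: it assumes the binary lands at target/release/jql, and it overrides the optimization level through Cargo's CARGO_PROFILE_RELEASE_OPT_LEVEL environment variable instead of editing the release profile in Cargo.toml. The run and build_variants helpers are hypothetical names introduced only for this example.

```shell
# Echo a command, then execute it unless DRY_RUN=1 is set.
run() {
  printf '+ %s\n' "$*"
  [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

build_variants() {
  # Default Release build, using the profile from the project's Cargo.toml.
  run cargo build --release || return 1
  # Assumption: the binary is written to target/release/jql.
  run cp target/release/jql jql_z || return 1

  # Same build with the release opt-level overridden to 3 via Cargo's
  # environment-variable form of profile.release.opt-level.
  run env CARGO_PROFILE_RELEASE_OPT_LEVEL=3 cargo build --release || return 1
  run cp target/release/jql jql_opt_3
}
```

Calling `DRY_RUN=1 build_variants` only prints the four commands, which is a cheap way to check them before running the real builds with `build_variants`.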
PGO profiles were collected from the same workload in performance.sh and merged via llvm-profdata during the PGO optimization phase.
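The three-step flow described above can be sketched with cargo-pgo as follows. This is only an outline under stated assumptions: the instrumented binary path (including the x86_64-unknown-linux-gnu target triple and the jql binary name) is a guess to adjust for your setup, and a single query from performance.sh stands in for the full training workload. The run and pgo_build helpers are hypothetical names.

```shell
# Echo a command, then execute it unless DRY_RUN=1 is set.
run() {
  printf '+ %s\n' "$*"
  [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

pgo_build() {
  # Assumption: location of the instrumented binary; override with JQL_BIN.
  bin="${JQL_BIN:-./target/x86_64-unknown-linux-gnu/release/jql}"

  # 1. Build an instrumented binary that writes profiles when it runs.
  run cargo pgo build || return 1

  # 2. Run the instrumented binary on a representative workload.
  #    Stand-in: one query taken from performance.sh.
  run sh -c "echo '[1, 2, 3]' | $bin '[0]'" || return 1

  # 3. Rebuild using the collected profiles; this is the phase in which
  #    the raw profiles are merged with llvm-profdata.
  run cargo pgo optimize
}
```

`DRY_RUN=1 pgo_build` prints the commands without executing them; `pgo_build` runs the whole cycle.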
All benchmarks were run multiple times, on the same hardware/software setup, with the same background "noise" (as far as I can guarantee, of course).
Results
I got the following results from running performance.sh:
./performance.sh
Benchmark 1: echo '{ "foo": "bar" }' | ./jql_z '"foo"'
Time (mean ± σ): 4.5 ms ± 0.2 ms [User: 1.5 ms, System: 7.6 ms]
Range (min … max): 4.0 ms … 6.1 ms 1000 runs
Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
Warning: The first benchmarking run for this command was significantly slower than the rest (6.1 ms). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.
Benchmark 2: echo '{ "foo": "bar" }' | ./jql_opt_3 '"foo"'
Time (mean ± σ): 4.5 ms ± 0.4 ms [User: 1.4 ms, System: 7.8 ms]
Range (min … max): 4.1 ms … 14.4 ms 1000 runs
Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
Warning: The first benchmarking run for this command was significantly slower than the rest (14.4 ms). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.
Benchmark 3: echo '{ "foo": "bar" }' | ./jql_optimized '"foo"'
Time (mean ± σ): 4.4 ms ± 0.2 ms [User: 1.2 ms, System: 7.7 ms]
Range (min … max): 4.1 ms … 6.4 ms 1000 runs
Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
Warning: The first benchmarking run for this command was significantly slower than the rest (6.4 ms). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.
Summary
echo '{ "foo": "bar" }' | ./jql_optimized '"foo"' ran
1.00 ± 0.05 times faster than echo '{ "foo": "bar" }' | ./jql_z '"foo"'
1.01 ± 0.09 times faster than echo '{ "foo": "bar" }' | ./jql_opt_3 '"foo"'
Benchmark 1: echo '[1, 2, 3]' | ./jql_z '[0]'
Time (mean ± σ): 4.6 ms ± 0.2 ms [User: 1.4 ms, System: 7.9 ms]
Range (min … max): 4.2 ms … 5.9 ms 1000 runs
Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: echo '[1, 2, 3]' | ./jql_opt_3 '[0]'
Time (mean ± σ): 4.6 ms ± 0.1 ms [User: 1.3 ms, System: 8.1 ms]
Range (min … max): 4.2 ms … 5.2 ms 1000 runs
Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
Benchmark 3: echo '[1, 2, 3]' | ./jql_optimized '[0]'
Time (mean ± σ): 4.6 ms ± 0.1 ms [User: 1.2 ms, System: 7.9 ms]
Range (min … max): 4.2 ms … 5.6 ms 1000 runs
Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
Summary
echo '[1, 2, 3]' | ./jql_z '[0]' ran
1.00 ± 0.05 times faster than echo '[1, 2, 3]' | ./jql_optimized '[0]'
1.01 ± 0.05 times faster than echo '[1, 2, 3]' | ./jql_opt_3 '[0]'
Benchmark 1: echo '[1, [2], [[3]]]' | ./jql_z '..'
Time (mean ± σ): 4.7 ms ± 0.2 ms [User: 1.8 ms, System: 8.0 ms]
Range (min … max): 4.2 ms … 6.2 ms 1000 runs
Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: echo '[1, [2], [[3]]]' | ./jql_opt_3 '..'
Time (mean ± σ): 4.7 ms ± 0.2 ms [User: 1.6 ms, System: 8.1 ms]
Range (min … max): 4.3 ms … 5.7 ms 1000 runs
Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
Benchmark 3: echo '[1, [2], [[3]]]' | ./jql_optimized '..'
Time (mean ± σ): 4.7 ms ± 0.1 ms [User: 1.5 ms, System: 8.1 ms]
Range (min … max): 4.3 ms … 5.4 ms 1000 runs
Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
Summary
echo '[1, [2], [[3]]]' | ./jql_optimized '..' ran
1.00 ± 0.04 times faster than echo '[1, [2], [[3]]]' | ./jql_opt_3 '..'
1.00 ± 0.04 times faster than echo '[1, [2], [[3]]]' | ./jql_z '..'
Benchmark 1: cat /home/zamazan4ik/open_source/bench_jql/github-repositories.json | ./jql_z '|>{"name", "url", "language", "stargazers_count", "watchers_count"}' > /dev/null
Time (mean ± σ): 20.6 ms ± 1.3 ms [User: 14.7 ms, System: 26.0 ms]
Range (min … max): 18.6 ms … 31.5 ms 1000 runs
Benchmark 2: cat /home/zamazan4ik/open_source/bench_jql/github-repositories.json | ./jql_opt_3 '|>{"name", "url", "language", "stargazers_count", "watchers_count"}' > /dev/null
Time (mean ± σ): 19.9 ms ± 1.2 ms [User: 11.8 ms, System: 26.0 ms]
Range (min … max): 17.8 ms … 26.3 ms 1000 runs
Benchmark 3: cat /home/zamazan4ik/open_source/bench_jql/github-repositories.json | ./jql_optimized '|>{"name", "url", "language", "stargazers_count", "watchers_count"}' > /dev/null
Time (mean ± σ): 19.3 ms ± 1.3 ms [User: 10.6 ms, System: 26.6 ms]
Range (min … max): 17.1 ms … 26.2 ms 1000 runs
Summary
cat /home/zamazan4ik/open_source/bench_jql/github-repositories.json | ./jql_optimized '|>{"name", "url", "language", "stargazers_count", "watchers_count"}' > /dev/null ran
1.04 ± 0.09 times faster than cat /home/zamazan4ik/open_source/bench_jql/github-repositories.json | ./jql_opt_3 '|>{"name", "url", "language", "stargazers_count", "watchers_count"}' > /dev/null
1.07 ± 0.10 times faster than cat /home/zamazan4ik/open_source/bench_jql/github-repositories.json | ./jql_z '|>{"name", "url", "language", "stargazers_count", "watchers_count"}' > /dev/null
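As a sanity check on how to read the "times faster" lines, the last figure can be reproduced from the reported means and standard deviations. The sketch below assumes the ratio uncertainty follows standard first-order error propagation for a quotient, which matches the printed value here; relative_speed is a hypothetical helper name.

```shell
# Ratio of two means with first-order error propagation:
#   r = m_slow / m_fast
#   sigma_r = r * sqrt((s_slow / m_slow)^2 + (s_fast / m_fast)^2)
relative_speed() {
  awk -v ms="$1" -v ss="$2" -v mf="$3" -v sf="$4" 'BEGIN {
    r = ms / mf
    a = ss / ms
    b = sf / mf
    printf "%.2f ± %.2f\n", r, r * sqrt(a * a + b * b)
  }'
}

# jql_z (20.6 ± 1.3 ms) against jql_optimized (19.3 ± 1.3 ms)
relative_speed 20.6 1.3 19.3 1.3   # prints 1.07 ± 0.10
```

Because the inputs are the rounded values from the output above, the last digit can differ from what hyperfine prints for other pairs. The uncertainty is about as large as the speedup itself, so the wall-clock difference on this workload should be read with care.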
The same results in PERFORMANCE.md format:
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `echo '[1, [2], [[3]]]' \| ./jql_z '..'` | 4.7 ± 0.2 | 4.2 | 6.2 | 1.00 ± 0.04 |
| `echo '[1, [2], [[3]]]' \| ./jql_opt_3 '..'` | 4.7 ± 0.2 | 4.3 | 5.7 | 1.00 ± 0.04 |
| `echo '[1, [2], [[3]]]' \| ./jql_optimized '..'` | 4.7 ± 0.1 | 4.3 | 5.4 | 1.00 |

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `echo '[1, 2, 3]' \| ./jql_z '[0]'` | 4.6 ± 0.2 | 4.2 | 5.9 | 1.00 |
| `echo '[1, 2, 3]' \| ./jql_opt_3 '[0]'` | 4.6 ± 0.1 | 4.2 | 5.2 | 1.01 ± 0.05 |
| `echo '[1, 2, 3]' \| ./jql_optimized '[0]'` | 4.6 ± 0.1 | 4.2 | 5.6 | 1.00 ± 0.05 |

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `echo '{ "foo": "bar" }' \| ./jql_z '"foo"'` | 4.5 ± 0.2 | 4.0 | 6.1 | 1.00 ± 0.05 |
| `echo '{ "foo": "bar" }' \| ./jql_opt_3 '"foo"'` | 4.5 ± 0.4 | 4.1 | 14.4 | 1.01 ± 0.09 |
| `echo '{ "foo": "bar" }' \| ./jql_optimized '"foo"'` | 4.4 ± 0.2 | 4.1 | 6.4 | 1.00 |

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `cat /home/zamazan4ik/open_source/bench_jql/github-repositories.json \| ./jql_z '\|>{"name", "url", "language", "stargazers_count", "watchers_count"}' > /dev/null` | 20.6 ± 1.3 | 18.6 | 31.5 | 1.07 ± 0.10 |
| `cat /home/zamazan4ik/open_source/bench_jql/github-repositories.json \| ./jql_opt_3 '\|>{"name", "url", "language", "stargazers_count", "watchers_count"}' > /dev/null` | 19.9 ± 1.2 | 17.8 | 26.3 | 1.04 ± 0.09 |
| `cat /home/zamazan4ik/open_source/bench_jql/github-repositories.json \| ./jql_optimized '\|>{"name", "url", "language", "stargazers_count", "watchers_count"}' > /dev/null` | 19.3 ± 1.3 | 17.1 | 26.2 | 1.00 |
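According to the tests, user time is lowest for the PGO build in all four benchmarks, and the decrease is significant on the github-repositories.json workload: 14.7 ms for the default Release build, 11.8 ms with opt-level = 3, and 10.6 ms with PGO. The wall-clock improvement there is smaller (20.6 ms vs. 19.3 ms), and the three small-input benchmarks are within noise of each other.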
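To put a number on that decrease, here is a small calculation using the figures from the github-repositories.json benchmark (default Release build against the PGO build). pct_reduction is a hypothetical helper name, not part of any tool mentioned here.

```shell
# Percentage reduction from a "before" value to an "after" value.
pct_reduction() {
  awk -v before="$1" -v after="$2" 'BEGIN {
    printf "%.1f%%\n", (before - after) / before * 100
  }'
}

pct_reduction 14.7 10.6   # user time:            prints 27.9%
pct_reduction 20.6 19.3   # mean wall-clock time: prints 6.3%
```

The gap between the two percentages is consistent with the System times reported above: they stay around 26 ms for all three builds on this workload, so only the user-space part of the run gets faster.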
Further steps
I can suggest the following things to do:
- Evaluate PGO's applicability to jql in more scenarios.
- If PGO helps to achieve better performance, add a note about it to jql's documentation (probably somewhere in the README file). That way, users and maintainers will be aware of another optimization opportunity for jql.
- Provide PGO integration into the build scripts. It can help users and maintainers easily apply PGO for their own workloads.
- Optimize prebuilt binaries with PGO.
Here are some examples of how PGO is already integrated into other projects' build scripts:
- Rustc: a CI script for the multi-stage build
- GCC:
- Clang: Docs
- Python:
- Go: Bash script
- V8: Bazel flag
- ChakraCore: Scripts
- Chromium: Script
- Firefox: Docs
- Thunderbird has PGO support too
- PHP: Makefile command and old Centminmod scripts
- MySQL: CMake script
- YugabyteDB: GitHub commit
- FoundationDB: Script
- Zstd: Makefile
- Foot: Scripts
- Windows Terminal: GitHub PR
- Pydantic-core: GitHub PR
Once PGO is in place, I can suggest evaluating LLVM BOLT as an additional post-link optimization step.
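Since cargo-pgo also wraps BOLT, a combined PGO + BOLT build could look roughly like the sketch below. I have not benchmarked this for jql. It assumes PGO profiles were already gathered with cargo pgo build plus a training run, that llvm-bolt is installed, and that the BOLT-instrumented binary sits at the path shown (the target triple and the jql-bolt-instrumented name are assumptions to adjust). The run and pgo_bolt helpers are hypothetical names.

```shell
# Echo a command, then execute it unless DRY_RUN=1 is set.
run() {
  printf '+ %s\n' "$*"
  [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

pgo_bolt() {
  # Assumption: location of the BOLT-instrumented binary; override with JQL_BOLT_BIN.
  bin="${JQL_BOLT_BIN:-./target/x86_64-unknown-linux-gnu/release/jql-bolt-instrumented}"

  # 1. Build a BOLT-instrumented binary on top of the existing PGO profiles.
  run cargo pgo bolt build --with-pgo || return 1

  # 2. Run it on a representative workload to gather BOLT profiles.
  #    Stand-in: one query taken from performance.sh.
  run sh -c "echo '[1, 2, 3]' | $bin '[0]'" || return 1

  # 3. Produce the final binary, optimized with both PGO and BOLT.
  run cargo pgo bolt optimize --with-pgo
}
```

As with the other sketches, `DRY_RUN=1 pgo_bolt` only prints the commands.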