With BOLT built from LLVM trunk, I am seeing a smaller improvement on clang than I previously saw with the incubator repo (https://github.com/facebookincubator/BOLT).
Here are the logs for perf2bolt and llvm-bolt:
> perf2bolt -o pgo-labels.fdata -w pgo-labels-compiler.yaml -p pgo-labels.perfdata clang-15
BOLT-INFO: shared object or position-independent executable detected
PERF2BOLT: Starting data aggregation job for pgo-labels.perfdata
PERF2BOLT: spawning perf job to read branch events
PERF2BOLT: spawning perf job to read mem events
PERF2BOLT: spawning perf job to read process events
PERF2BOLT: spawning perf job to read task events
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 3f028c02ba6a24b7230fd5907a2b7ba076664a8b
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x5400000, offset 0x5400000
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling strict relocation mode for aggregation purposes
BOLT-WARNING: Failed to analyze 2529 relocations
BOLT-INFO: pre-processing profile using perf data aggregator
BOLT-WARNING: build-id will not be checked because we could not read one from input binary
PERF2BOLT: waiting for perf mmap events collection to finish...
PERF2BOLT: parsing perf-script mmap events output
PERF2BOLT: waiting for perf task events collection to finish...
PERF2BOLT: parsing perf-script task events output
PERF2BOLT: input binary is associated with 100 PID(s)
PERF2BOLT: waiting for perf events collection to finish...
PERF2BOLT: parse branch events...
PERF2BOLT: read 492075 samples and 15682980 LBR entries
PERF2BOLT: 216 samples (0.0%) were ignored
PERF2BOLT: traces mismatching disassembled function contents: 5324 (0.0%)
PERF2BOLT: out of range traces involving unknown regions: 1618631 (10.7%)
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN4llvm10BasicBlock28replaceSuccessorsPhiUsesWithEPS0_S1_
BOLT-WARNING: 4 collisions detected while hashing binary objects. Use -v=1 to see the list.
PERF2BOLT: processing branch events..
> llvm-bolt clang-15 -o clang-15-bolt -b pgo-relocs-compiler.yaml -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions -split-all-cold -dyno-stats -icf=1 -use-gnu-stack -inline-small-functions -simplify-rodata-loads -plt=hot
BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 3f028c02ba6a24b7230fd5907a2b7ba076664a8b
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 2529 relocations
BOLT-INFO: pre-processing profile using YAML profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN4llvm10BasicBlock28replaceSuccessorsPhiUsesWithEPS0_S1_
BOLT-INFO: 6042 out of 136908 functions in the binary (4.4%) have non-empty execution profile
BOLT-INFO: 347 functions with profile could not be optimized
BOLT-INFO: the input contains 4354 (dynamic count : 268784) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 371417 instructions were shortened
BOLT-INFO: removed 344 empty blocks
BOLT-INFO: ICF folded 413 out of 137214 functions in 3 passes. 0 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 59.75 KB of code space. Folded functions were called 113460 times based on profile.
BOLT-INFO: simplified 102 out of 3594 loads from a statically computed address.
BOLT-INFO: dynamic loads simplified: 4317
BOLT-INFO: dynamic loads found: 61577
BOLT-INFO: inlined 1227 calls at 18 call sites in 2 iteration(s). Change in binary size: 4 bytes.
BOLT-INFO: 4879 PLT calls in the binary were optimized.
BOLT-INFO: basic block reordering modified layout of 3729 (2.73%) functions
BOLT-INFO: UCE removed 1 blocks and 7 bytes of code.
BOLT-INFO: splitting separates 3226174 hot bytes from 7737417 cold bytes (29.43% of split functions is hot).
BOLT-INFO: 106 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 5975 to 650
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:
17279782 : executed forward branches
1942886 : taken forward branches
2900344 : executed backward branches
1779625 : taken backward branches
855760 : executed unconditional branches
1686232 : all function calls
571541 : indirect calls
243850 : PLT calls
163314338 : executed instructions
38492046 : executed load instructions
20762991 : executed store instructions
224132 : taken jump table branches
0 : taken unknown indirect branches
21035886 : total branches
4578271 : taken branches
16457615 : non-taken conditional branches
3722511 : taken conditional branches
20180126 : all conditional branches
16810312 : executed forward branches (-2.7%)
824937 : taken forward branches (-57.5%)
3369814 : executed backward branches (+16.2%)
1647148 : taken backward branches (-7.4%)
599903 : executed unconditional branches (-29.9%)
1441570 : all function calls (-14.5%)
571541 : indirect calls (=)
0 : PLT calls (-100.0%)
162404688 : executed instructions (-0.6%)
38488076 : executed load instructions (-0.0%)
20762991 : executed store instructions (=)
224132 : taken jump table branches (=)
0 : taken unknown indirect branches (=)
20780029 : total branches (-1.2%)
3071988 : taken branches (-32.9%)
17708041 : non-taken conditional branches (+7.6%)
2472085 : taken conditional branches (-33.6%)
20180126 : all conditional branches (=)
BOLT-INFO: SCTC: patched 8 tail calls (8 forward) tail calls (0 backward) from a total of 8 while removing 0 double jumps and removing 8 basic blocks totalling 40 bytes of code. CTCs total execution count is 1207 and the number of times CTCs are taken is 1164.
BOLT-INFO: setting __hot_start to 0x5400000
BOLT-INFO: setting __hot_end to 0x59d53e5
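For reading the dynostats above: the percentages in the second block appear to be relative changes against the counters in the first block. A minimal Python sketch, assuming that interpretation and using counts copied from the log (the helper name `relative_change` is hypothetical, not part of BOLT), reproduces a few of them:

```python
def relative_change(before: int, after: int) -> float:
    """Percent change of a counter relative to its value in the first block."""
    return (after - before) / before * 100.0

# (first block, second block) pairs copied from the dynostats in the log.
stats = {
    "taken forward branches": (1942886, 824937),
    "taken branches": (4578271, 3071988),
    "all function calls": (1686232, 1441570),
    "executed instructions": (163314338, 162404688),
}

for name, (before, after) in stats.items():
    print(f"{name}: {relative_change(before, after):+.1f}%")
```

The largest relative drops are in taken branches and PLT calls, while executed instructions barely change.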
I am measuring a 5.5% improvement on top of the PGO binary (compared to the roughly 9-10% I was seeing before):
pgo-labels-bolt-compiler -> average(507.406)
pgo-labels-compiler -> average(537.33)
Metric: time
Group 1 mean = 537.330005 ± 1.036598
Group 2 mean = 507.406000 ± 3.630159
P value = 2.01e-05
Diff mean (95% CI) = -29.9240 ± 3.5663
Percent (95% CI) = -5.5690% (± 0.6637%)
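As a sanity check on the summary above, the percentage can be recomputed from the two group means. This is a small Python sketch using only the numbers reported here; treating Group 1 (pgo-labels-compiler) as the baseline is an assumption based on the matching mean.

```python
# Means and CI half-width copied from the benchmark summary.
baseline = 537.330005   # Group 1 mean (pgo-labels-compiler), assumed baseline
bolted = 507.406000     # Group 2 mean (pgo-labels-bolt-compiler)
ci_half_width = 3.5663  # 95% CI half-width of the mean difference

diff = bolted - baseline
percent = diff / baseline * 100.0
ci_percent = ci_half_width / baseline * 100.0

print(f"diff = {diff:.4f}")
print(f"percent = {percent:.4f}% (± {ci_percent:.4f}%)")
```

The result agrees with the reported improvement of about 5.57%, so the figure is expressed relative to the PGO-only binary.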