BOLT gives lower improvement on clang-bootstrap than before #56274

Open
@rlavaee

My recent experience with LLVM trunk shows a smaller improvement on clang than my prior experience with the incubator repo (https://github.com/facebookincubator/BOLT).

Here is the log for perf2bolt and llvm-bolt:

> perf2bolt -o pgo-labels.fdata -w pgo-labels-compiler.yaml -p pgo-labels.perfdata clang-15
BOLT-INFO: shared object or position-independent executable detected
PERF2BOLT: Starting data aggregation job for pgo-labels.perfdata
PERF2BOLT: spawning perf job to read branch events
PERF2BOLT: spawning perf job to read mem events
PERF2BOLT: spawning perf job to read process events
PERF2BOLT: spawning perf job to read task events
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 3f028c02ba6a24b7230fd5907a2b7ba076664a8b
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x5400000, offset 0x5400000
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling strict relocation mode for aggregation purposes
BOLT-WARNING: Failed to analyze 2529 relocations
BOLT-INFO: pre-processing profile using perf data aggregator
BOLT-WARNING: build-id will not be checked because we could not read one from input binary
PERF2BOLT: waiting for perf mmap events collection to finish...
PERF2BOLT: parsing perf-script mmap events output
PERF2BOLT: waiting for perf task events collection to finish...
PERF2BOLT: parsing perf-script task events output
PERF2BOLT: input binary is associated with 100 PID(s)
PERF2BOLT: waiting for perf events collection to finish...
PERF2BOLT: parse branch events...
PERF2BOLT: read 492075 samples and 15682980 LBR entries
PERF2BOLT: 216 samples (0.0%) were ignored
PERF2BOLT: traces mismatching disassembled function contents: 5324 (0.0%)
PERF2BOLT: out of range traces involving unknown regions: 1618631 (10.7%)
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN4llvm10BasicBlock28replaceSuccessorsPhiUsesWithEPS0_S1_
BOLT-WARNING: 4 collisions detected while hashing binary objects. Use -v=1 to see the list.
PERF2BOLT: processing branch events..

> llvm-bolt clang-15 -o clang-15-bolt -b pgo-relocs-compiler.yaml -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions -split-all-cold -dyno-stats -icf=1 -use-gnu-stack -inline-small-functions -simplify-rodata-loads -plt=hot

BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 3f028c02ba6a24b7230fd5907a2b7ba076664a8b
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 2529 relocations
BOLT-INFO: pre-processing profile using YAML profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN4llvm10BasicBlock28replaceSuccessorsPhiUsesWithEPS0_S1_
BOLT-INFO: 6042 out of 136908 functions in the binary (4.4%) have non-empty execution profile
BOLT-INFO: 347 functions with profile could not be optimized
BOLT-INFO: the input contains 4354 (dynamic count : 268784) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 371417 instructions were shortened
BOLT-INFO: removed 344 empty blocks
BOLT-INFO: ICF folded 413 out of 137214 functions in 3 passes. 0 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 59.75 KB of code space. Folded functions were called 113460 times based on profile.
BOLT-INFO: simplified 102 out of 3594 loads from a statically computed address.
BOLT-INFO: dynamic loads simplified: 4317
BOLT-INFO: dynamic loads found: 61577
BOLT-INFO: inlined 1227 calls at 18 call sites in 2 iteration(s). Change in binary size: 4 bytes.
BOLT-INFO: 4879 PLT calls in the binary were optimized.
BOLT-INFO: basic block reordering modified layout of 3729 (2.73%) functions
BOLT-INFO: UCE removed 1 blocks and 7 bytes of code.
BOLT-INFO: splitting separates 3226174 hot bytes from 7737417 cold bytes (29.43% of split functions is hot).
BOLT-INFO: 106 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 5975 to 650
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

            17279782 : executed forward branches
             1942886 : taken forward branches
             2900344 : executed backward branches
             1779625 : taken backward branches
              855760 : executed unconditional branches
             1686232 : all function calls
              571541 : indirect calls
              243850 : PLT calls
           163314338 : executed instructions
            38492046 : executed load instructions
            20762991 : executed store instructions
              224132 : taken jump table branches
                   0 : taken unknown indirect branches
            21035886 : total branches
             4578271 : taken branches
            16457615 : non-taken conditional branches
             3722511 : taken conditional branches
            20180126 : all conditional branches

            16810312 : executed forward branches (-2.7%)
              824937 : taken forward branches (-57.5%)
             3369814 : executed backward branches (+16.2%)
             1647148 : taken backward branches (-7.4%)
              599903 : executed unconditional branches (-29.9%)
             1441570 : all function calls (-14.5%)
              571541 : indirect calls (=)
                   0 : PLT calls (-100.0%)
           162404688 : executed instructions (-0.6%)
            38488076 : executed load instructions (-0.0%)
            20762991 : executed store instructions (=)
              224132 : taken jump table branches (=)
                   0 : taken unknown indirect branches (=)
            20780029 : total branches (-1.2%)
             3071988 : taken branches (-32.9%)
            17708041 : non-taken conditional branches (+7.6%)
             2472085 : taken conditional branches (-33.6%)
            20180126 : all conditional branches (=)

BOLT-INFO: SCTC: patched 8 tail calls (8 forward) tail calls (0 backward) from a total of 8 while removing 0 double jumps and removing 8 basic blocks totalling 40 bytes of code. CTCs total execution count is 1207 and the number of times CTCs are taken is 1164.
BOLT-INFO: setting __hot_start to 0x5400000
BOLT-INFO: setting __hot_end to 0x59d53e5
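
For anyone comparing this run against an older one, the percentages in the log can be recomputed from the raw counts. The following is only a small sketch: the helper names `relative_change` and `share` are made up for illustration and are not part of BOLT, and the numbers are copied by hand from the output above.

```python
# Recompute a few percentages from the llvm-bolt log above.
# Helper names are hypothetical; the counts are copied from the log.

def relative_change(before: int, after: int) -> float:
    """Percent change from `before` to `after`."""
    return (after - before) / before * 100.0

def share(part: int, whole: int) -> float:
    """`part` as a percentage of `whole`."""
    return part / whole * 100.0

# (before, after) pairs taken from the two dyno-stats tables.
DYNO_STATS = {
    "taken forward branches": (1942886, 824937),
    "executed backward branches": (2900344, 3369814),
    "executed unconditional branches": (855760, 599903),
    "all function calls": (1686232, 1441570),
    "executed instructions": (163314338, 162404688),
    "taken branches": (4578271, 3071988),
}

if __name__ == "__main__":
    for name, (before, after) in DYNO_STATS.items():
        print(f"{name:32s} {relative_change(before, after):+6.1f}%")

    # Functions with a non-empty profile, out of all functions in the binary.
    print(f"profiled functions: {share(6042, 136908):.1f}%")

    # Hot bytes as a share of all bytes in split functions.
    hot, cold = 3226174, 7737417
    print(f"hot share of split functions: {share(hot, hot + cold):.2f}%")
```

Running the same script on the counts from an older log would make it easy to see which metric regressed, assuming the dyno-stats fields have the same meaning across versions.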

I am measuring a 5.5% improvement on top of the PGO binary (compared to around 9-10% I was seeing before):

pgo-labels-bolt-compiler -> average(507.406)
pgo-labels-compiler -> average(537.33)
Metric: time
Group 1 mean = 537.330005 ± 1.036598
Group 2 mean = 507.406000 ± 3.630159
P value      = 2.01e-05
Diff mean (95% CI)  = -29.9240 ± 3.5663
Percent   (95% CI) = -5.5690% (± 0.6637%)
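
As a sanity check, the percentage above can be reproduced from the two group means. This sketch assumes the benchmarking tool divides both the mean difference and the CI half-width by the baseline (Group 1) mean; that assumption matches the reported figures, but the tool's actual method is not shown here. The function name `percent_diff` is made up for illustration.

```python
# Reproduce the reported percent difference from the group means.
# Assumption: the percentage and its CI are both relative to the baseline mean.

def percent_diff(baseline_mean: float, new_mean: float, ci_half_width: float):
    """Return (absolute diff, percent diff, percent CI half-width)."""
    diff = new_mean - baseline_mean
    pct = diff / baseline_mean * 100.0
    ci_pct = ci_half_width / baseline_mean * 100.0
    return diff, pct, ci_pct

if __name__ == "__main__":
    # Group 1 = pgo-labels-compiler, Group 2 = pgo-labels-bolt-compiler.
    diff, pct, ci_pct = percent_diff(537.330005, 507.406000, 3.5663)
    print(f"diff = {diff:.4f}, percent = {pct:.4f}% (± {ci_pct:.4f}%)")
```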
