Description
After discussing it on the forums I bit I decided to look into what the optimizer spends time on, and inlining seems to be a significant part of it. There seems to be a lot of room for improvement here.
Tested by compiling syntex in rustc-benchmarks (though the inlining time seem quite significant in other crates as well.).
I counted which functions were inlined the most, and fmt::debug_tuple
and to a lesser extent fmt::debug_struct
, which are emitted when automatically deriving Debug
, and which simply wrap two other functions inside fmt::builder show up a lot. Maybe derive(debug) could output the wrapped functions instead, or the equivalent code, so llvm doesn't have to spend time inlining them (unless this causes problems for debug info). fmt::arguments::new_v1
is also inlined a lot, and also plays a role in derive(Debug). Maybe something could be done about that one as well, as it's simply a function that creates a struct but doesn't do anything else.
Output from time-llvm-passes
Using `RUSTFLAGS="-Z time-llvm-passes" cargo +nightly build --release 2> llvm_time.txt` in the syntex-0.42.2 directory in rustc-benchmarks.It's a bit hard to say which crate is which, as the output from llvm and cargo isn't synced, though these are the bottom three (excluding the benchmark dummy) in reverse order:
===-------------------------------------------------------------------------===
... Pass execution timing report ...
===-------------------------------------------------------------------------===
Total Execution Time: 6.3800 seconds (6.2624 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
0.7920 ( 14.1%) 0.0960 ( 12.8%) 0.8880 ( 13.9%) 0.8489 ( 13.6%) Function Integration/Inlining
0.5400 ( 9.6%) 0.0760 ( 10.2%) 0.6160 ( 9.7%) 0.5365 ( 8.6%) X86 DAG->DAG Instruction Selection
0.3080 ( 5.5%) 0.0280 ( 3.7%) 0.3360 ( 5.3%) 0.3031 ( 4.8%) Combine redundant instructions
0.2400 ( 4.3%) 0.0080 ( 1.1%) 0.2480 ( 3.9%) 0.2227 ( 3.6%) Dead Store Elimination
0.2040 ( 3.6%) 0.0120 ( 1.6%) 0.2160 ( 3.4%) 0.2147 ( 3.4%) SROA
...
Total Execution Time: 96.6840 seconds (96.6925 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
9.5480 ( 10.4%) 0.5440 ( 10.3%) 10.0920 ( 10.4%) 10.1854 ( 10.5%) Function Integration/Inlining
6.9000 ( 7.5%) 0.7120 ( 13.5%) 7.6120 ( 7.9%) 7.7547 ( 8.0%) X86 DAG->DAG Instruction Selection
5.8280 ( 6.4%) 0.1120 ( 2.1%) 5.9400 ( 6.1%) 6.0239 ( 6.2%) Loop Invariant Code Motion
4.5640 ( 5.0%) 0.1680 ( 3.2%) 4.7320 ( 4.9%) 4.8229 ( 5.0%) Global Value Numbering
3.8240 ( 4.2%) 0.1920 ( 3.6%) 4.0160 ( 4.2%) 4.1534 ( 4.3%) Combine redundant instructions
...
Total Execution Time: 5.4880 seconds (5.5137 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
0.4040 ( 8.3%) 0.0680 ( 10.9%) 0.4720 ( 8.6%) 0.4675 ( 8.5%) Function Integration/Inlining
0.3000 ( 6.2%) 0.0720 ( 11.5%) 0.3720 ( 6.8%) 0.3340 ( 6.1%) X86 DAG->DAG Instruction Selection
0.2720 ( 5.6%) 0.0320 ( 5.1%) 0.3040 ( 5.5%) 0.2744 ( 5.0%) Global Value Numbering
0.2440 ( 5.0%) 0.0120 ( 1.9%) 0.2560 ( 4.7%) 0.2738 ( 5.0%) Loop Invariant Code Motion
0.2280 ( 4.7%) 0.0120 ( 1.9%) 0.2400 ( 4.4%) 0.2527 ( 4.6%) Combine redundant instructions
...
Most inlined functions:
Running `RUSTFLAGS="-C remark=inline -g" cargo +nightly build --release 2> inlined.txt`, and filtering the output using this: ``` grep '.*[[:alnum:]]\{5\} inlined into' inlined.txt | sed 's/.*[0-9]: //g; s/d into.*/d into/' | sort | uniq -c | sort -gr > unique_inlines.txt ``` (Yeah it's a bit of a monstrosity, and takes a long time, maybe I could look into making some tool analyze this if it turns out to be useful.) gives:1865 _ZN4core3ptr13drop_in_place17hffb7ed55089717f8E inlined into 917 _ZN50_$LT$T$u20$as$u20$core..convert..From$LT$T$GT$$GT$4from17h99c56b7d2123957eE inlined into 677 _ZN4core3fmt9Arguments6new_v117h0a32116e8ade096aE inlined into 549 _ZN4core3ptr13drop_in_place17h8905708a4bea3261E inlined into 517 _ZN4core3fmt9Formatter11debug_tuple17h312acfb009988478E inlined into 484 _ZN50_$LT$T$u20$as$u20$core..convert..From$LT$T$GT$$GT$4from17hf6536c6dc214691bE inlined into 438 _ZN5alloc9allocator6Layout4size17h278602cb6317171dE inlined into 402 _ZN63_$LT$alloc..allocator..Layout$u20$as$u20$core..clone..Clone$GT$5clone17hba86f6db001903cdE inlined into 381 _ZN61_$LT$alloc..heap..Heap$u20$as$u20$alloc..allocator..Alloc$GT$7dealloc17h3db9762b4fdfd21eE inlined into 344 _ZN5alloc4heap15exchange_malloc17he9800f033a6330e4E inlined into 341 _ZN4core3ptr13drop_in_place17hb5325e51b98a161aE inlined into ... 82 _ZN4core3fmt9Formatter12debug_struct17h5cfb920e90696799E inlined into ...