Description
This issue summarizes problems which Inliner doesn't handle well currently.
Better understanding of what will be removed if we inline a callee
TODO:
Bottom-up inlining for AOT
TODO:
"Agile" time budget
TODO:
stfld is expensive for ref types
Currently inliner doesn't realize stfld
for ref types will end up as ASSIGN_REF calls (gc write barriers) and is cheap for value-types. There is a benchmark that shows better results with NoInlining: https://discord.com/channels/143867839282020352/312132327348240384/886606988287569950
class MyTest
{
private object _a;
private object _b;
public MyTest(object a, object b)
{
_a = a;
_b = b;
}
static MyTest Create(object a, object b) => new MyTest(a, b);
}
Codegen for Create
:
; Method MyTest:Test(System.Object,System.Object):MyTest
G_M58625_IG01: ;; offset=0000H
57 push rdi
56 push rsi
53 push rbx
4883EC20 sub rsp, 32
488BF1 mov rsi, rcx
488BFA mov rdi, rdx
;; bbWeight=1 PerfScore 3.75
G_M58625_IG02: ;; offset=000DH
48B9D00288CDFB7F0000 mov rcx, 0x7FFBCD8802D0
E8B4C7A15F call CORINFO_HELP_NEWSFAST
488BD8 mov rbx, rax
488D4B08 lea rcx, bword ptr [rbx+8]
488BD6 mov rdx, rsi
E825C3A15F call CORINFO_HELP_ASSIGN_REF
488D4B10 lea rcx, bword ptr [rbx+16]
488BD7 mov rdx, rdi
E819C3A15F call CORINFO_HELP_ASSIGN_REF
488BC3 mov rax, rbx
;; bbWeight=1 PerfScore 5.25
G_M58625_IG03: ;; offset=003AH
4883C420 add rsp, 32
5B pop rbx
5E pop rsi
5F pop rdi
C3 ret
;; bbWeight=1 PerfScore 2.75
; Total bytes of code: 66
ldsfld over a static readonly field returns constant
static readonly
fields over primitive types most likely are converted to constants, e.g.:
private static readonly int s_Field = Environment.ProcessorCount;
void Validate()
{
if (s_Field < 16)
throw new Exception("");
}
void Test() => Validate();
Inliner's log in Test
:
Inline candidate has 1 binary expressions with constants. Multiplier increased to 0.5.
Inline candidate callsite is boring. Multiplier increased to 1.8.
while it is expected from it to recognize a foldable branch here.
Callees with Pinvokes are expensive
(or basically any callee that injects code into callers' prologues/epilogues)
class MyTest
{
[DllImport("user32.dll")]
static extern int MessageBeep(uint uType);
static int MessageBeepWrapper(uint uType) => MessageBeep(uType);
static void Beep(bool beep)
{
if (beep)
MessageBeepWrapper(0x20);
}
}
Codegen for Beep
:
G_M29584_IG01: ;; offset=0000H
55 push rbp
4157 push r15
4156 push r14
4155 push r13
4154 push r12
57 push rdi
56 push rsi
53 push rbx
4883EC68 sub rsp, 104
488DAC24A0000000 lea rbp, [rsp+A0H]
8BF1 mov esi, ecx
;; bbWeight=1 PerfScore 9.00
G_M29584_IG02: ;; offset=001AH
488D4D88 lea rcx, [rbp-78H]
498BD2 mov rdx, r10
E8FA4E705F call CORINFO_HELP_INIT_PINVOKE_FRAME
488BF8 mov rdi, rax
488BCC mov rcx, rsp
48894DA8 mov qword ptr [rbp-58H], rcx
488BCD mov rcx, rbp
48894DB8 mov qword ptr [rbp-48H], rcx
4084F6 test sil, sil
744B je SHORT G_M29584_IG07
;; bbWeight=1 PerfScore 5.75
G_M29584_IG03: ;; offset=003CH
B920000000 mov ecx, 32
48B8C8018ACDFB7F0000 mov rax, 0x7FFBCD8A01C8
48894598 mov qword ptr [rbp-68H], rax
488D0516000000 lea rax, G_M29584_IG05
488945B0 mov qword ptr [rbp-50H], rax
488D4588 lea rax, bword ptr [rbp-78H]
48894710 mov qword ptr [rdi+16], rax
C6470C00 mov byte ptr [rdi+12], 0
;; bbWeight=0.50 PerfScore 3.00
G_M29584_IG04: ;; offset=0066H
FF15FCD54600 call [MyTest:MessageBeep(int):int]
;; bbWeight=0.50 PerfScore 1.50
G_M29584_IG05: ;; offset=006CH
C6470C01 mov byte ptr [rdi+12], 1
833D11C7F55F00 cmp dword ptr [(reloc 0x7ffc2d38f428)], 0
7406 je SHORT G_M29584_IG06
FF15310AF45F call [CORINFO_HELP_STOP_FOR_GC]
;; bbWeight=0.50 PerfScore 4.00
G_M29584_IG06: ;; offset=007FH
488B4590 mov rax, bword ptr [rbp-70H]
48894710 mov qword ptr [rdi+16], rax
;; bbWeight=0.50 PerfScore 1.00
G_M29584_IG07: ;; offset=0087H
4883C468 add rsp, 104
5B pop rbx
5E pop rsi
5F pop rdi
415C pop r12
415D pop r13
415E pop r14
415F pop r15
5D pop rbp
C3 ret
;; bbWeight=1 PerfScore 5.25
; Total bytes of code: 152
NOTE if caller already have a pinvoke - then it should make sense to inline callees with pinvokes too.
typeof(T) should be recognized as a constant argument (and propagated by value)
void Validate(Type type)
{
if (type.IsValueType)
throw new InvalidOperationException("");
}
void CallTest() => Validate(typeof(string));
Inliner's log for CallTest
:
Inline has 1 intrinsics. Multiplier increased to 1.3.
Inline candidate callsite is boring. Multiplier increased to 2.6.
calleeNativeSizeEstimate=642
callsiteNativeSizeEstimate=85
benefit multiplier=2.6
threshold=221
Native estimate for function size exceeds threshold for inlining 64.2 > 22.1 (multiplier = 2.6)
So it recognized get_IsValueType
intrinsic, but didn't realize we pass a constant there. #55875 is a good example of how to propagate complex trees during inlining so the same should be done with typeof
.
call (cnsArg, ...) heuristic is too optimistic
When inliner sees a pattern like if (CallSomething(arg0, ...)
it might recognize it as a foldable branch currently if arg0 is a constant arg: it's too optimistic and doesn't check call's signature.
runtime/src/coreclr/jit/fgbasic.cpp
Lines 1230 to 1235 in f29484a
Methods with string literals from external assemblies are never inlined
See #53726 (comment)
JIT should not give up completely on large methods
It should be able to analyze a small part of such method to find potential foldable branches in large methods or when it runs out of "time budget", related issues:
- AggressiveInlining not respected when Method grow too large #41692
- JIT code optimization inconsistency #51587
- System.Numerics.Vector: Recognize division by a constant and inline #43761 (comment)
- Weird inlining of methods optimized by JIT #47434
JIT should be able to resolve 'call' instructions in methods with loops.
We either should allow it to resolve 'call' in TC=0/methods with loops or enable QJFL. Enabling it for TC=0 will allow users to see how it works with intrinsics on sharplab. Potential impact on TP. So currently we'll never see a foldable branch here:
void Test<T>()
{
if (typeof(T) == typeof(int))
return;
for (int i = 0; i < 100; i++) {}
}
Cache machine code size
If JIT compiles a method with IL size, say, 100 bytes and it ends up with only 20 bytes of machine code we can keep that info for future compilations that this method is not as big as it looks (it can be just a single bit instead of machine code size).
Misc
OS/Platform specifics: Callee-saved registers and CSE, ABI differences
Explicit tail-calls are never inlined
Save some tips for inliner in tier0/crossgen
etc.
category:planning
theme:inlining
skill-level:expert
cost:large
impact:large