Optimize multicast stubs #130207
@EgorBot -arm -amd --envvars DOTNET_JitDisasm:IL_STUB_MulticastDelegate_Invoke
using System;
using BenchmarkDotNet.Attributes;

public class MyBenchmarks
{
    public Action a;

    [GlobalSetup]
    public void Setup()
    {
        for (int i = 0; i < 10000; i++)
            a += new Action(() => {});
    }

    [Benchmark]
    public void Bench() => a();
}
Hmm, the asm seems better, but the Arm perf is much worse. I assume it's because the JIT is ordering the blocks wrong, which causes branch mispredictions. EDIT: the branch order is also different from what I get locally; is the bot using any unusual settings? @EgorBo
We have seen a number of cases where unsafe code "optimizations" result in worse performance. It sounds like this is another one of those.
What are the covariant helpers that this is eliminating?
The typical multicast delegate has very few targets, and the targets are typically different. This is not a very representative microbenchmark.
#endif // DEBUGGING_SUPPORTED

ILCodeLabel *realLoopStart = pCode->NewCodeLabel();
pCode->EmitBEQ(realLoopStart);
This doesn't look optimal for branch predicting (forward conditional jump as hot path). The original logic was constructed for optimizing this.
The original IL was unusual and confused the JIT, causing even worse performance. Additionally, it was treated as the hot path there too. cc @tannergooding since we discussed this on Discord.
It seems the only way to solve this here would be an internal Assume.Likely/Unlikely.
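As a rough C++ analogy (not the stub's IL, and not the proposed internal `Assume` API), this is the kind of hint being discussed: the rare debugger path is marked unlikely so the compiler can place it out of line and let the hot loop body fall through, instead of making the hot path a forward conditional jump. It assumes GCC/Clang's `__builtin_expect` (a no-op fallback is used elsewhere); `Target`, `notify_debugger`, and `invoke_all` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Wraps GCC/Clang's __builtin_expect; expands to the plain condition elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

// Hypothetical stand-in for one entry of a multicast invocation list.
using Target = void (*)(int&);

// Hypothetical stand-in for the debugger notification hook.
static void notify_debugger(std::size_t /*index*/) {}

// Invokes every target. The debugger check is marked unlikely so the
// compiler is free to move the notification off the fall-through path.
int invoke_all(const Target* list, std::size_t count, bool debugger_attached) {
    assert(list != nullptr || count == 0);
    int calls = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (UNLIKELY(debugger_attached)) {
            notify_debugger(i);
        }
        list[i](calls);
    }
    return calls;
}
```

The hint only affects code layout and static prediction; it does not change behavior, which is why it is a low-risk way to express "this branch is cold".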
As I suggested on Discord, if we don't actually need multicast to light up the debugger support at any point in the loop, then cloning is likely your best bet here.
You can rewrite the IL so that it is better, does not rely on the asm defaulting to "likely", and is still understandable as normal control flow, but it's quite a bit harder.
I'm not sure the extra churn is worth it though; the current stuff seems plenty good enough.
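A minimal C++ sketch of the cloning idea, under the assumption stated in the comment that the debugger state cannot change while the loop runs: the check is read once, and the loop is duplicated into a fast copy with no check and a slow copy that keeps the per-iteration hook. `Target`, `notify_debugger`, `g_notifications`, and `invoke_all_cloned` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>

using Target = void (*)(int&);

static int g_notifications = 0;  // counts debugger notifications (illustration only)
static void notify_debugger(std::size_t /*index*/) { ++g_notifications; }

// Loop cloning: the loop-invariant check is hoisted and the loop body is
// duplicated, so the common (no debugger) path has no per-iteration branch.
int invoke_all_cloned(const Target* list, std::size_t count, bool debugger_attached) {
    assert(list != nullptr || count == 0);
    int calls = 0;
    if (debugger_attached) {
        // Slow clone: keeps the per-iteration debugger hook.
        for (std::size_t i = 0; i < count; ++i) {
            notify_debugger(i);
            list[i](calls);
        }
    } else {
        // Fast clone: straight-line loop with no debugger check.
        for (std::size_t i = 0; i < count; ++i) {
            list[i](calls);
        }
    }
    return calls;
}
```

The cost is code size (two loop bodies); the benefit is that the hot loop needs no layout hint at all, which is why it avoids the branch-ordering question entirely.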

Rewrites multicast stubs to byref loops to remove bounds checks and covariance helpers from every iteration.
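A rough C++ analogy of the two loop shapes, not the stub's actual IL. It assumes the loop bound is a separately stored invocation count rather than the array length, so an indexed access cannot be proven in range and needs a check on every iteration; the byref form validates the count once and then walks a pointer to one past the last entry. The explicit check in `invoke_indexed` stands in for the JIT's bounds check; covariance checks have no direct C++ equivalent and are not modeled. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

using Target = void (*)(int&);

// Indexed loop: every element access is range-checked, because the loop
// bound (count) is not known to be related to the array length.
int invoke_indexed(const Target* list, std::size_t length, std::size_t count) {
    int calls = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (i >= length) throw std::out_of_range("index");  // per-iteration check
        list[i](calls);
    }
    return calls;
}

// Byref loop: the count is validated once, then a pointer walks from the
// first entry to one past the last; no per-element range check remains.
int invoke_byref(const Target* list, std::size_t length, std::size_t count) {
    assert(list != nullptr || count == 0);
    if (count > length) throw std::out_of_range("count");  // single up-front check
    int calls = 0;
    for (const Target *p = list, *end = list + count; p != end; ++p) {
        (*p)(calls);
    }
    return calls;
}
```

Both functions invoke the same targets in the same order; only the number of checks executed per call differs.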
This assumes the wrapper struct is the same size as a ref; is such an assumption fine for the VM? @jkotas @MichalStrehovsky
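The layout assumption in the question can be written down as compile-time checks. This is a C++ analogy only, modeling the reference as a pointer; `Object`, `RefWrapper`, and `element_offset` are hypothetical, and it says nothing about how the VM lays out the actual managed struct. If these checks hold, an array of references and an array of wrappers have identical element strides, which is what treating one as the other relies on.

```cpp
#include <cstddef>
#include <type_traits>

struct Object;  // opaque stand-in for a managed object

// Hypothetical wrapper around a single reference.
struct RefWrapper {
    Object* value;
};

// The assumption under discussion: the wrapper adds no padding and matches
// the size and alignment of the reference it wraps.
static_assert(sizeof(RefWrapper) == sizeof(Object*), "wrapper must be reference-sized");
static_assert(alignof(RefWrapper) == alignof(Object*), "wrapper must be reference-aligned");
static_assert(std::is_standard_layout<RefWrapper>::value, "wrapper must be standard-layout");
static_assert(offsetof(RefWrapper, value) == 0, "reference must sit at offset 0");

// Byte offset of element i; the same for Object*[] and RefWrapper[] when the
// checks above hold.
constexpr std::size_t element_offset(std::size_t i) { return i * sizeof(RefWrapper); }
```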