Remove mono specific SpanHelpers #78015

BrzVlad · 2022-11-08T10:54:12Z

#73768 did various changes that hurt span performance on mono. The change was reverted afterwards on mono by restoring old code in #75917.

This PR removes mono specific code and solves performance problems via small tweaks within non vectorized code inside SpanHelpers, as well as by adding a couple of optimizations to mono interpreter.

ghost · 2022-11-08T10:54:27Z

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

#73768 did various changes that hurt span performance on mono. The change was reverted afterwards on mono by restoring old code in #75917.

This PR removes mono specific code and solves performance problems via small tweaks within non vectorized code inside SpanHelpers, as well as by adding a couple of optimizations to mono interpreter.

Author:	BrzVlad
Assignees:	BrzVlad
Labels:	`area-System.Memory`
Milestone:	-

srxqds · 2022-11-08T11:54:44Z

can this backport to release/7.0?

vargaz · 2022-11-08T12:12:50Z

There is no need to backport, net7 has a workaround for the perf problems.

kotlarmilos · 2022-11-08T12:50:29Z

Hey @BrzVlad would this PR fix the regressions in #77900?

stephentoub · 2022-11-08T13:46:57Z

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs

                    goto Found6;
-                if (uValue == Unsafe.AddByteOffset(ref searchSpace, offset + 7))
+                if (uValue == Unsafe.AddByteOffset(ref current, 7))


Do these changes actually help? On coreclr at least it's not clear to me just from the diff that it's a net win.
https://www.diffchecker.com/du4Cevvb
Can you share throughput numbers?

There shouldn't be any improvement on coreclr. I wouldn't be surprised if the generated code is more or less identical, since the JIT might make use of common subexpression elimination optimization. Also the code paths tweaked in this PR are for non vectorized cases, which are mostly not hit at all with coreclr.

The BCL change is intended for less optimized compilers to be able to generate good quality code.

There shouldn't be any improvement on coreclr.

I'm more concerned it might regress. At least according to the diff I linked above, the new asm is a bit longer and appears to use a different addressing mode. I don't have a good sense for what that might mean in terms of throughput.

Also the code paths tweaked in this PR are for non vectorized cases, which are mostly not hit at all with coreclr.

Aren't they used for shorter inputs that don't have enough data to fill a vector?

The local perf numbers on some benchmarks are unclear, I'm waiting for a full performance run.

adamsitnik

The SpanHelper changes look OK, but I would like to see benchmark numbers that prove that we really need these gotos.

Thank you for fixing the main codegen issue @BrzVlad ! (Please get another reviewer for this part).

adamsitnik · 2022-11-08T13:48:47Z

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.T.cs


                    offset += 1;
                }
+                return -1;
+            Found7:
+                return (int)(offset + 7);


do we really need these gotos? One of the primary goals of #73768 was to simplify the code and get rid of the gotos (I am writing guidelines for how to use Vector12/256 and my job was to verify that users can write simple but performant code using the new APIs). What difference do the benchmarks report when using goto? If the difference is worth it, could you please log a codegen issue that would allow us to get rid of them?

@adamsitnik Consider for example this very simple benchmark: https://gist.github.com/BrzVlad/59ab5dea7b8dbbd632877f66882c1c6e

On both mono jit and mono interp the goto version is significantly faster (about 35%). Here is the diff on mono interp https://www.diffchecker.com/hsvgFTPt. As described in the commit, the generated code for the hot loop with the goto is more compact and with less branches so it is faster.

On coreclr there is no diference in performance, which I expected. Looking at the generated code is a little bit creepy since the the codegen for both methods is absolutely identical https://www.diffchecker.com/GhauFh1f.

adamsitnik · 2022-11-08T13:50:30Z

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.T.cs

                        goto Found1;
-                    if (value.Equals(Unsafe.Add(ref searchSpace, length)))
+                    if (value.Equals(current))


I assume there is a difference between CLR and MonoVM which required you to apply this pattern to all the helper methods here. Do we have an issue that tracks it?

There is no reported issue for some of these changes. I applied the same pattern simply for consistency with the rest of the code. And this would also result in some perf gains on mono in some other ares as well.

vargaz · 2022-11-11T12:32:51Z

The interpreter changes look ok to me.
As for the managed changes, they help less optimizing compilers/interpreters generate better code.

…ssions (dotnet#75917)" This reverts commit 254844a.

Before, for every single element, the address `address + offset + ct` had to be computed. In theory, a smart compiler would be able to reuse `address + offset` value and offset it only by a constant to obtain every single element. Do this explicitly to avoid reliance on advanced optimizations.

This would replace code like ``` load b.neq add ret load b.neq add ret load .... ``` with ``` load b.eq load b.eq load ... ``` This makes the code more compact in the hot loop, reduces overall code size and thus improves performance. This pattern is widely used and it was also used before with Span lookups.

…al branch

Before we were marking bblocks as dead if they had their in_count 0. This is not enough however, since it doesn't account for loops. We now do a full traversal of the bblock graph to detect unreachable bblocks.

Consider for example the following pattern used commonly with conditional branches: ``` br.s [nil <- nil], BB0 ... ceq0.i4 [32 <- 40], br.s [nil <- nil], BB1 BB0: ldc.i4.0 [32 <- nil], BB1: brfalse.i4.s [nil <- 32], BB_EXIT BB2: ldstr [56 <- nil], 2 ``` This commit reorders this code to look like: ``` br.s [nil <- nil], BB0 ... ceq0.i4 [32 <- 40], brfalse.i4.s [nil <- 32], BB_EXIT br.s [nil <- nil], BB2 BB0 ldc.i4.0 [32 <- nil], BB1: brfalse.i4.s [nil <- 32], BB_EXIT BB2: ldstr [56 <- nil], 2 ``` This means we will have duplicated brfalse instructions, but every basic block reaching the conditional branch will have information about the condition. For example ceq0.i4 + brfalse is equivalent to brtrue, ldc.i4.0 + brfalse is equivalent to unconditional branch. After other future optimizations applied on the bblocks graph, like removal, merging and propagation of target, the resulting code in this example would look like: ``` br.s [nil <- nil], BB_EXIT ... brtrue.i4.s [nil <- 40], BB_EXIT BB2: ldstr [56 <- nil], 2 ``` Which is a great simplification over the original code.

… targets Even though they can be become unreachable in the current method, they can still be called when the unoptimized method gets tiered at this point. Add assert to prevent such issues in the future.

If we are unlikely to gain anything from propagating the condition (if we don't have information about any of the condition operand vars), simply avoid the optimization.

If we store in a var and this var is not used and redefined by the end of the basic block, then we can clear the original store.

We detect if a var's value never escapes the definition of a bblock. We mark such vars and clear unused definitions of that var from other bblocks.

If a bblock contains only an unconditional br, then all bblocks branching into it can just call the target directly instead.

src/mono/mono/mini/interp/transform.c

BrzVlad · 2022-12-05T09:09:53Z

The change to the indexing approach in span helpers hot loops has pretty random perf consequences on CoreCLR. In order to reduce side effects of this PR, which has the main purpose of fixing performance of these helpers with mono interpreter, I removed the indexing change and added a few more powerful instructions to the interpreter to compensate for this. Closing this in favor of #79215

BrzVlad requested a review from vargaz as a code owner November 8, 2022 10:54

ghost assigned BrzVlad Nov 8, 2022

ghost added the area-System.Memory label Nov 8, 2022

BrzVlad requested a review from adamsitnik November 8, 2022 10:54

stephentoub reviewed Nov 8, 2022

View reviewed changes

adamsitnik reviewed Nov 8, 2022

View reviewed changes

adamsitnik added tenet-performance Performance related issue area-Codegen-JIT-mono labels Nov 8, 2022

adamsitnik added this to the 8.0.0 milestone Nov 8, 2022

build-analysis bot mentioned this pull request Nov 8, 2022

[wasm][debugger] DebuggerTests.AssignmentTests.InspectVariableBeforeAndAfterAssignment failing #64188

Closed

MihaZupan mentioned this pull request Nov 9, 2022

Add IndexOfAnyValues #78093

Merged

BrzVlad force-pushed the fix-interp-index-of-regression branch 2 times, most recently from d254eac to 9e17559 Compare November 18, 2022 10:27

BrzVlad added 12 commits November 18, 2022 19:50

Revert "[Mono] Restore old code to solve the recent SpanHelpers regre…

d946c3b

…ssions (dotnet#75917)" This reverts commit 254844a.

[mono][interp] Replace compare + brfalse/brtrue with single condition…

fe06a72

…al branch

[mono][interp] Dump in/out links for bblocks during verbose logging

ff36071

[mono][interp] Improve detection of dead bblocks

fb2bcde

Before we were marking bblocks as dead if they had their in_count 0. This is not enough however, since it doesn't account for loops. We now do a full traversal of the bblock graph to detect unreachable bblocks.

[mono][interp] Don't optimize out bblocks that are tiering patchpoint…

c8edcde

… targets Even though they can be become unreachable in the current method, they can still be called when the unoptimized method gets tiered at this point. Add assert to prevent such issues in the future.

[mono][interp] Make bblock reordering more conservative

cffdc02

If we are unlikely to gain anything from propagating the condition (if we don't have information about any of the condition operand vars), simply avoid the optimization.

[mono][interp] Add basic removal of unused defines

5e8571f

If we store in a var and this var is not used and redefined by the end of the basic block, then we can clear the original store.

[mono][interp] Clear unused defines of local only vars

e4abb67

We detect if a var's value never escapes the definition of a bblock. We mark such vars and clear unused definitions of that var from other bblocks.

[mono][interp] Propagate target branches

954483a

If a bblock contains only an unconditional br, then all bblocks branching into it can just call the target directly instead.

BrzVlad force-pushed the fix-interp-index-of-regression branch from 9e17559 to 954483a Compare November 18, 2022 17:51

kotlarmilos reviewed Nov 30, 2022

View reviewed changes

src/mono/mono/mini/interp/transform.c Show resolved Hide resolved

BrzVlad closed this Dec 5, 2022

ghost locked as resolved and limited conversation to collaborators Jan 4, 2023

Remove mono specific SpanHelpers #78015

Remove mono specific SpanHelpers #78015

Uh oh!

Conversation

BrzVlad commented Nov 8, 2022

Uh oh!

ghost commented Nov 8, 2022

Uh oh!

srxqds commented Nov 8, 2022

Uh oh!

vargaz commented Nov 8, 2022

Uh oh!

kotlarmilos commented Nov 8, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adamsitnik left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vargaz commented Nov 11, 2022

Uh oh!

Uh oh!

BrzVlad commented Dec 5, 2022

Uh oh!

Uh oh!