Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove mono specific SpanHelpers #78015

Closed
wants to merge 12 commits into from

Conversation

BrzVlad
Copy link
Member

@BrzVlad BrzVlad commented Nov 8, 2022

#73768 did various changes that hurt span performance on mono. The change was reverted afterwards on mono by restoring old code in #75917.

This PR removes mono specific code and solves performance problems via small tweaks within non vectorized code inside SpanHelpers, as well as by adding a couple of optimizations to mono interpreter.

@ghost
Copy link

ghost commented Nov 8, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

#73768 did various changes that hurt span performance on mono. The change was reverted afterwards on mono by restoring old code in #75917.

This PR removes mono specific code and solves performance problems via small tweaks within non vectorized code inside SpanHelpers, as well as by adding a couple of optimizations to mono interpreter.

Author: BrzVlad
Assignees: BrzVlad
Labels:

area-System.Memory

Milestone: -

@srxqds
Copy link
Contributor

srxqds commented Nov 8, 2022

can this backport to release/7.0?

@vargaz
Copy link
Contributor

vargaz commented Nov 8, 2022

There is no need to backport, net7 has a workaround for the perf problems.

@kotlarmilos
Copy link
Member

Hey @BrzVlad would this PR fix the regressions in #77900?

goto Found6;
if (uValue == Unsafe.AddByteOffset(ref searchSpace, offset + 7))
if (uValue == Unsafe.AddByteOffset(ref current, 7))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these changes actually help? On coreclr at least it's not clear to me just from the diff that it's a net win.
https://www.diffchecker.com/du4Cevvb
Can you share throughput numbers?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There shouldn't be any improvement on coreclr. I wouldn't be surprised if the generated code is more or less identical, since the JIT might make use of common subexpression elimination optimization. Also the code paths tweaked in this PR are for non vectorized cases, which are mostly not hit at all with coreclr.

The BCL change is intended for less optimized compilers to be able to generate good quality code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There shouldn't be any improvement on coreclr.

I'm more concerned it might regress. At least according to the diff I linked above, the new asm is a bit longer and appears to use a different addressing mode. I don't have a good sense for what that might mean in terms of throughput.

Also the code paths tweaked in this PR are for non vectorized cases, which are mostly not hit at all with coreclr.

Aren't they used for shorter inputs that don't have enough data to fill a vector?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The local perf numbers on some benchmarks are unclear, I'm waiting for a full performance run.

Copy link
Member

@adamsitnik adamsitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SpanHelper changes look OK, but I would like to see benchmark numbers that prove that we really need these gotos.

Thank you for fixing the main codegen issue @BrzVlad ! (Please get another reviewer for this part).


offset += 1;
}
return -1;
Found7:
return (int)(offset + 7);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need these gotos? One of the primary goals of #73768 was to simplify the code and get rid of the gotos (I am writing guidelines for how to use Vector12/256 and my job was to verify that users can write simple but performant code using the new APIs). What difference do the benchmarks report when using goto? If the difference is worth it, could you please log a codegen issue that would allow us to get rid of them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adamsitnik Consider for example this very simple benchmark: https://gist.github.com/BrzVlad/59ab5dea7b8dbbd632877f66882c1c6e

On both mono jit and mono interp the goto version is significantly faster (about 35%). Here is the diff on mono interp https://www.diffchecker.com/hsvgFTPt. As described in the commit, the generated code for the hot loop with the goto is more compact and with less branches so it is faster.

On coreclr there is no diference in performance, which I expected. Looking at the generated code is a little bit creepy since the the codegen for both methods is absolutely identical https://www.diffchecker.com/GhauFh1f.

goto Found1;
if (value.Equals(Unsafe.Add(ref searchSpace, length)))
if (value.Equals(current))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume there is a difference between CLR and MonoVM which required you to apply this pattern to all the helper methods here. Do we have an issue that tracks it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no reported issue for some of these changes. I applied the same pattern simply for consistency with the rest of the code. And this would also result in some perf gains on mono in some other ares as well.

@vargaz
Copy link
Contributor

vargaz commented Nov 11, 2022

The interpreter changes look ok to me.
As for the managed changes, they help less optimizing compilers/interpreters generate better code.

@BrzVlad BrzVlad force-pushed the fix-interp-index-of-regression branch 2 times, most recently from d254eac to 9e17559 Compare November 18, 2022 10:27
Before, for every single element, the address `address + offset + ct` had to be computed. In theory, a smart compiler would be able to reuse `address + offset` value and offset it only by a constant to obtain every single element. Do this explicitly to avoid reliance on advanced optimizations.
This would replace code like

```
load
b.neq
  add
  ret
load
b.neq
  add
  ret
load
....
```
with

```
load
b.eq
load
b.eq
load
...
```

This makes the code more compact in the hot loop, reduces overall code size and thus improves performance. This pattern is widely used and it was also used before with Span lookups.
Before we were marking bblocks as dead if they had their in_count 0. This is not enough however, since it doesn't account for loops. We now do a full traversal of the bblock graph to detect unreachable bblocks.
Consider for example the following pattern used commonly with conditional branches:
```
     br.s           [nil <- nil], BB0
     ...
     ceq0.i4        [32 <- 40],
     br.s           [nil <- nil], BB1

BB0: ldc.i4.0       [32 <- nil],
BB1: brfalse.i4.s   [nil <- 32], BB_EXIT

BB2: ldstr          [56 <- nil], 2

```
This commit reorders this code to look like:
```
     br.s           [nil <- nil], BB0
     ...
     ceq0.i4        [32 <- 40],
     brfalse.i4.s   [nil <- 32], BB_EXIT
     br.s           [nil <- nil], BB2

BB0  ldc.i4.0       [32 <- nil],
BB1: brfalse.i4.s   [nil <- 32], BB_EXIT

BB2: ldstr          [56 <- nil], 2

```

This means we will have duplicated brfalse instructions, but every basic block reaching the conditional branch will have information about the condition. For example ceq0.i4 + brfalse is equivalent to brtrue, ldc.i4.0 + brfalse is equivalent to unconditional branch. After other future optimizations applied on the bblocks graph, like removal, merging and propagation of target, the resulting code in this example would look like:
```
     br.s           [nil <- nil], BB_EXIT
     ...
     brtrue.i4.s    [nil <- 40], BB_EXIT
BB2: ldstr          [56 <- nil], 2

```
Which is a great simplification over the original code.
… targets

Even though they can be become unreachable in the current method, they can still be called when the unoptimized method gets tiered at this point.

Add assert to prevent such issues in the future.
If we are unlikely to gain anything from propagating the condition (if we don't have information about any of the condition operand vars), simply avoid the optimization.
If we store in a var and this var is not used and redefined by the end of the basic block, then we can clear the original store.
We detect if a var's value never escapes the definition of a bblock. We mark such vars and clear unused definitions of that var from other bblocks.
If a bblock contains only an unconditional br, then all bblocks branching into it can just call the target directly instead.
@BrzVlad BrzVlad force-pushed the fix-interp-index-of-regression branch from 9e17559 to 954483a Compare November 18, 2022 17:51
@BrzVlad
Copy link
Member Author

BrzVlad commented Dec 5, 2022

The change to the indexing approach in span helpers hot loops has pretty random perf consequences on CoreCLR. In order to reduce side effects of this PR, which has the main purpose of fixing performance of these helpers with mono interpreter, I removed the indexing change and added a few more powerful instructions to the interpreter to compensate for this. Closing this in favor of #79215

@BrzVlad BrzVlad closed this Dec 5, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jan 4, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants