Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove mono specific SpanHelpers #79215

Merged
merged 13 commits into from
Dec 18, 2022

Conversation

BrzVlad
Copy link
Member

@BrzVlad BrzVlad commented Dec 5, 2022

#73768 did various changes that hurt span performance on mono. The change was reverted afterwards on mono by restoring old code in #75917.

This PR removes mono specific code and solves performance problems via small tweaks within non vectorized code inside SpanHelpers, as well as by adding a couple of optimizations to mono interpreter.

@ghost
Copy link

ghost commented Dec 5, 2022

Tagging subscribers to this area: @BrzVlad
See info in area-owners.md if you want to be subscribed.

Issue Details

#73768 did various changes that hurt span performance on mono. The change was reverted afterwards on mono by restoring old code in #75917.

This PR removes mono specific code and solves performance problems via small tweaks within non vectorized code inside SpanHelpers, as well as by adding a couple of optimizations to mono interpreter.

Author: BrzVlad
Assignees: -
Labels:

area-Codegen-Interpreter-mono

Milestone: -

@BrzVlad
Copy link
Member Author

BrzVlad commented Dec 8, 2022

@vargaz Could you take another look at this ?

This would replace code like

```
load
b.neq
  add
  ret
load
b.neq
  add
  ret
load
....
```
with

```
load
b.eq
load
b.eq
load
...
```

This makes the code more compact in the hot loop, reduces overall code size and thus improves performance. This pattern is widely used and it was also used before with Span lookups.
Before we were marking bblocks as dead if they had their in_count 0. This is not enough however, since it doesn't account for loops. We now do a full traversal of the bblock graph to detect unreachable bblocks.
Consider for example the following pattern used commonly with conditional branches:
```
     br.s           [nil <- nil], BB0
     ...
     ceq0.i4        [32 <- 40],
     br.s           [nil <- nil], BB1

BB0: ldc.i4.0       [32 <- nil],
BB1: brfalse.i4.s   [nil <- 32], BB_EXIT

BB2: ldstr          [56 <- nil], 2

```
This commit reorders this code to look like:
```
     br.s           [nil <- nil], BB0
     ...
     ceq0.i4        [32 <- 40],
     brfalse.i4.s   [nil <- 32], BB_EXIT
     br.s           [nil <- nil], BB2

BB0  ldc.i4.0       [32 <- nil],
BB1: brfalse.i4.s   [nil <- 32], BB_EXIT

BB2: ldstr          [56 <- nil], 2

```

This means we will have duplicated brfalse instructions, but every basic block reaching the conditional branch will have information about the condition. For example ceq0.i4 + brfalse is equivalent to brtrue, ldc.i4.0 + brfalse is equivalent to unconditional branch. After other future optimizations applied on the bblocks graph, like removal, merging and propagation of target, the resulting code in this example would look like:
```
     br.s           [nil <- nil], BB_EXIT
     ...
     brtrue.i4.s    [nil <- 40], BB_EXIT
BB2: ldstr          [56 <- nil], 2

```
Which is a great simplification over the original code.
… targets

Even though they can be become unreachable in the current method, they can still be called when the unoptimized method gets tiered at this point.

Add assert to prevent such issues in the future.
If we are unlikely to gain anything from propagating the condition (if we don't have information about any of the condition operand vars), simply avoid the optimization.
If we store in a var and this var is not used and redefined by the end of the basic block, then we can clear the original store.
We detect if a var's value never escapes the definition of a bblock. We mark such vars and clear unused definitions of that var from other bblocks.
If a bblock contains only an unconditional br, then all bblocks branching into it can just call the target directly instead.
This pattern is used in low level unsafe code when using (var + ct1) as an index into an array, where ct2 is the sizeof of array element.

Also fix diplay of two shorts when dumping instructions.
These new instructions can apply addition and multiplication with constant to the offset var.
@BrzVlad BrzVlad merged commit 200a90a into dotnet:main Dec 18, 2022
@radekdoulik
Copy link
Member

Nice! I see interpreter improvements in the SpanHelper bytes measurements. The chars are mixed, improvements in firefox, some slower in chrome. Overall it is great improvement.
image
https://radekdoulik.github.io/WasmPerformanceMeasurements/?startDate=2022-12-05T10%3A43%3A18.830Z&endDate=2022-12-19T10%3A43%3A18.830Z&tasks=&flavors=2%2C3

It also improved Index of chars with the aot/simd. I think that's because the current code uses S.R.I.Vector128, where we have better coverage, compared to the older, which used S.N.Vector.
image
https://radekdoulik.github.io/WasmPerformanceMeasurements/?startDate=2022-12-05T10%3A43%3A18.830Z&endDate=2022-12-19T10%3A43%3A18.830Z&tasks=&flavors=4%2C8

/cc @lewing

@MihaZupan
Copy link
Member

@BrzVlad can you please take a look at the changes in #78861 (they only affect platforms with Sse2/AdvSimd support). Should those be updated to better match what was done in this PR? I'm trying to avoid accidentally undoing the improvements you made here.

@BrzVlad
Copy link
Member Author

BrzVlad commented Dec 22, 2022

@MihaZupan I don't think there are any problems between these 2 PRs. As long as your code is guarded everywhere with IsSupported / IsHardwareAccelerated then there should be changed for mono interpreter.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants