Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

JIT: refactor how we do late devirtualization #20553

Merged

Conversation

AndyAyersMS
Copy link
Member

Change late devirtualization to run in a postorder callback during
the tree search to update GT_RET_EXPRs, instead of shoehorning it into
the preorder callback.

This allows the jit to reconsider all calls for late devirtualization,
not just calls that are parents of particular GT_RET_EXPRs. The jit will
take advantage of this in subsequent work that does more aggressive
bottup-up type sharpening.

Reconsidering all calls for late devirt instead of just a select subset
incurs around a 0.5% throughput impact.

To mitigate this, short-circult the tree walk when we no longer see a
GTF_CALL flag. This prevents us from walking through CALL and RET_EXPR
free subtrees.

To make this work we have to make sure all node types propagate flags from
their children. This required some updates to a few node constructors.

Also take advantage of gtRetExprVal to tunnel through GT_RET_EXPR chains
to the final tree to avoid needing multiple overwrites when propagating
the return expression back into the IR.

With these mitigations this change has minimal throughput impact.

Change late devirtualization to run in a postorder callback during
the tree search to update GT_RET_EXPRs, instead of shoehorning it into
the preorder callback.

This allows the jit to reconsider all calls for late devirtualization,
not just calls that are parents of particular GT_RET_EXPRs. The jit will
take advantage of this in subsequent work that does more aggressive
bottup-up type sharpening.

Reconsidering all calls for late devirt instead of just a select subset
incurs around a 0.5% throughput impact.

To mitigate this, short-circult the tree walk when we no longer see a
GTF_CALL flag. This prevents us from walking through CALL and RET_EXPR
free subtrees.

To make this work we have to make sure all node types propagate flags from
their children. This required some updates to a few node constructors.

Also take advantage of `gtRetExprVal` to tunnel through GT_RET_EXPR chains
to the final tree to avoid needing multiple overwrites when propagating
the return expression back into the IR.

With these mitigations this change has minimal throughput impact.
@AndyAyersMS
Copy link
Member Author

@erozen PTAL
cc @dotnet/jit-contrib

No diffs

@mikedn
Copy link

mikedn commented Oct 23, 2018

Reconsidering all calls for late devirt instead of just a select subset
incurs around a 0.5% throughput impact.

With these mitigations this change has minimal throughput impact.

Note that using GenTreeVisitor rather than fgWalkTree tends to offer slightly better performance and it also has the advantage that it encapsulates the logic in class rather than adding more stuff to Compiler. That said, currently there's not a lot of logic so it's probably not worth the trouble. But you could consider doing that if you add more stuff in this area.

@AndyAyersMS
Copy link
Member Author

Looks like there's an issue with SysV conventions. Investigating...

// replacement may have enabled optimizations by providing more
// specific types for trees or variables.
fgWalkTree(&stmt->gtStmtExpr, fgUpdateInlineReturnExpressionPlaceHolder, fgLateDevirtualization,
(void*)this);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice comment.

@AndyAyersMS
Copy link
Member Author

Ubuntu arm failure here is one we've seen a lot lately:

14:19:08   System.IO.FileLoadException: Could not load file or assembly 'baseservices.threading.XUnitWrapper, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null'. An operation is not legal in the current state. (Exception from HRESULT: 0x80131509 (COR_E_INVALIDOPERATION))

Other failures look like issues with my change. Perhaps another missing flag case.

@AndyAyersMS
Copy link
Member Author

Issue seems to be that the preorder callback mutates the parent node, which screws up invocation of the postorder callback.

                    comp->fgAttachStructInlineeToAsg(parent, tree, retClsHnd);

So need to figure out if / how the walker can recover, or how to defer this update.

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Oct 23, 2018

We have (tree = 21, parent = 24)

[000021] --C---------              |  /--*  RET_EXPR  struct(inl return from call [000016])
[000024] -AC---------              \--*  ASG       struct (copy)
[000022] D-----------                 \--*  LCL_VAR   struct V00 loc0

and the ret expr is:

[000160] n--X--------              *  OBJ(16)   struct
[000158] L-----------              \--*  ADDR      byref
[000159] ------------                 \--*  LCL_VAR   struct V00 loc0

and the new parent tree gets detected as a self-assignment and 24 is updated in place with

[000167] ------------              *  NOP       void

so the postorder callback is invoked with pTree pointing at the op2 slot of 167, which is null. We only do this sort of update for (some cases of) multireg returns so it only shows up on some architectures.

If it turned out not to be a self-assignment there would be a GT_BLK here instead, and it appears the preorder and postorder would end up visiting all the right operands the postorder would still end up visiting both operands (since the parent was a GT_ASG and so also a binop).

So likely things would "work" if we just checked for nullptr trees in the postorder callback, but relying on this seems awfully dodgy.

Am going to look at deferring this update to the postorder traversal where it can be done cleanly.

@echesakov
Copy link

@dotnet-bot test Ubuntu arm Cross Checked Innerloop Build and Test

@AndyAyersMS
Copy link
Member Author

Or maybe those clauses for returning structs by value aren't needed anymore....?

@AndyAyersMS
Copy link
Member Author

Hmm. Looks like the GT_ASG to GT_BLK swap is tolerated and even somewhat expected by the current code. And we have access to the parent in the postorder so can just see if it got changed to a GT_NOP. So will go with that for now.

@AndyAyersMS
Copy link
Member Author

3 remaining failures are seemingly unrelated

  • ubuntu x64 .... xunit error I've seen elsewhere:
18:09:17   Unhandled Exception: System.InvalidOperationException: Collection was modified after the enumerator was instantiated.
18:09:17      at System.Collections.Generic.Stack`1.Enumerator.MoveNext()
18:09:17      at Xunit.Sdk.DisposalTracker.Dispose() in C:\projects\xunit\src\xunit.execution\Sdk\DisposalTracker.cs:line 26
  • ubuntu arm64 failed to load a test assembly
18:05:57   xUnit.net Console Runner v2.4.1-pre.build.4059 (64-bit .NET Core 4.6.27024.0)
18:05:58   System.IO.FileLoadException: Could not load file or assembly 'JIT.CheckProjects.XUnitWrapper, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null'. An operation is not legal in the current state. (Exception from HRESULT: 0x80131509 (COR_E_INVALIDOPERATION))
  • windows arm CI issue
17:18:13  > git fetch --tags --progress https://github.com/dotnet/coreclr +refs/heads/*:refs/remotes/origin/* # timeout=90
17:18:15 FATAL: java.nio.channels.ClosedChannelException
17:18:15 java.nio.channels.ClosedChannelException
17:18:15 Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from 168.61.21.95/168.61.21.95:1984
17:18:15 		at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1693)

@AndyAyersMS
Copy link
Member Author

@dotnet-bot test Ubuntu x64 Checked Innerloop Build and Test
@dotnet-bot test Ubuntu16.04 arm64 Cross Checked Innerloop Build and Test
@dotnet-bot test Windows_NT arm64 Cross Checked Innerloop Build and Test

@AndyAyersMS
Copy link
Member Author

@dotnet/jit-contrib think this is ready.

Just to sum up the above series of notes: the current fgUpdateInlineReturnExpressionPlaceHolder is doing some unexpected and frankly questionable edits to the parent node for certain tree patterns involving multi-reg struct returns. We've gotten away with so far it because we either trigger this visiting the second operand and either overwrite the parent with a node with the same arity (GT_ASG -> GT_BLK) where thew new op1 is the same op1 we've already visited, OR we overwrite with a null arity GT_NOP. AND we also don't have a postorder callback. So everything gets traversed and nothing gets confused.

It does not look that easy to move this logic to the postorder callback where it would be a more natural fit: we'd have to leave ourselves a "todo" flag or something similar. So for now I'm detecting the GT_NOP bashing in the postorder callback and bailing out.

I am hoping that progress on first class structs will remove some of the context-sensitivity from struct typed nodes and allow us to do this sort of update without having to also do edits to parents. So I believe the current workaround is tolerable.

@AndyAyersMS
Copy link
Member Author

windows arm64 failure. Will investigate as it could be related to the change.

09:18:20 
      Interop_COM._NETClients_IDispatch_NETClientIDispatch_NETClientIDispatch_._NETClients_IDispatch_NETClientIDispatch_NETClientIDispatch_cmd [FAIL]
09:18:20         'c:\users\helixa~1\appdata\local\temp\tmp2l86mq' is not recognized as an internal or external command,
09:18:20         operable program or batch file.
09:18:20         
09:18:20 
        Assert failure(PID 9344 [0x00002480], Thread: 4224 [0x1080]): fFixedUp || (pExceptionRecord->ExceptionCode == STATUS_STACK_OVERFLOW)
09:18:20         
09:18:20         CORECLR! GetCLRRuntimeHost + 0x1C3090 (0x00007ffa`834b6480)
09:18:20         NTDLL! chkstk + 0x24C (0x00007ffa`ad4e506c)
09:18:20         NTDLL! RtlUnwindEx + 0x24C (0x00007ffa`ad51609c)
09:18:20         CORECLR! GetCLRRuntimeHost + 0x1B38A8 (0x00007ffa`834a6c98)
09:18:20         CORECLR! GetCLRRuntimeHost + 0x1C32A4 (0x00007ffa`834b6694)
09:18:20         NTDLL! chkstk + 0x1CC (0x00007ffa`ad4e4fec)
09:18:20         NTDLL! RtlPcToFileHeader + 0x418 (0x00007ffa`ad519e18)
09:18:20         NTDLL! RtlRaiseException + 0xC0 (0x00007ffa`ad550b80)
09:18:20         KERNELBASE! RaiseException + 0x60 (0x00007ffa`a9be8d90)
09:18:20         CORECLR! CreateCLRProfiling + 0x2578C8 (0x00007ffa`83c72778)
09:18:20             File: d:\j\workspace\arm64_cross_c---f1ec3c8e\src\vm\exceptionhandling.cpp Line: 1205
09:18:20             Image: C:\j\workspace\arm64_cross_c---5cf2e9f1\bin\tests\Windows_NT.arm64.Checked\Tests\Core_Root\CoreRun.exe
09:18:20         
09:18:20         
09:18:20   
09:18:20   Return code:      1
09:18:20   Raw output file:      C:\j\workspace\arm64_cross_c---5cf2e9f1\bin\tests\Windows_NT.arm64.Checked\Reports\Interop.COM\NETClients\IDispatch\NETClientIDispatch\NETClientIDispatch.output.txt
09:18:20   Raw output:
09:18:20   BEGIN EXECUTION
09:18:20          "C:\j\workspace\arm64_cross_c---5cf2e9f1\bin\tests\Windows_NT.arm64.Checked\Tests\Core_Root\corerun.exe" NETClientIDispatch.exe 
09:18:20         Calling DoubleNumeric_ReturnByRef ...
09:18:20         Test Failure: System.Runtime.InteropServices.SEHException (0x80004005): External component has thrown an exception.
09:18:20            at NetClient.Program.Validate_Numeric_In_ReturnByRef()
09:18:20            at NetClient.Program.Main(String[] doNotUse)

@AndyAyersMS
Copy link
Member Author

I don't see any diffs for this test (either main assembly or helper dll) using the arm64 altjit. So am going to retry...

@dotnet-bot retest Windows_NT arm64 Cross Checked Innerloop Build and Test

@AndyAyersMS
Copy link
Member Author

Same test failed again and is failing for other PRs. It was just introduced yesterday. Tracking via #20580.

@AndyAyersMS
Copy link
Member Author

@dotnet/jit-contrib ping .. ?

I have some follow on changes that need this.

@AndyAyersMS AndyAyersMS merged commit dfd0f4f into dotnet:master Oct 26, 2018
@AndyAyersMS AndyAyersMS deleted the RunReturnValuePlaceholderPostorder branch October 26, 2018 17:55
AndyAyersMS added a commit to AndyAyersMS/coreclr that referenced this pull request Oct 26, 2018
If the jit decides it needs a return spill temp, and the return value
has already been spilled to a single-def temp, re-use the existing
for the return temp rather than creating a new one.

In conjunction with dotnet#20553 this allows late devirtualization for calls where
the object in the virtual call is the result of an inline that provides
a better type.

In particular we see this pattern for `ArrayPool<T>.Shared.Rent/Release`.

Closes dotnet#15873.
AndyAyersMS added a commit to AndyAyersMS/coreclr that referenced this pull request Oct 26, 2018
If the jit decides it needs a return spill temp, and the return value
has already been spilled to a single-def temp, re-use the existing
for the return temp rather than creating a new one.

In conjunction with dotnet#20553 this allows late devirtualization for calls where
the object in the virtual call is the result of an inline that provides
a better type, and the objected formerly reached the call via one or more
intermediate temps.

Closes dotnet#15873.
AndyAyersMS added a commit that referenced this pull request Oct 29, 2018
If the jit decides it needs a return spill temp, and the return value
has already been spilled to a single-def temp, re-use the existing
for the return temp rather than creating a new one.

In conjunction with #20553 this allows late devirtualization for calls where
the object in the virtual call is the result of an inline that provides
a better type, and the objected formerly reached the call via one or more
intermediate temps.

Closes #15873.
A-And pushed a commit to A-And/coreclr that referenced this pull request Nov 20, 2018
Change late devirtualization to run in a postorder callback during
the tree search to update GT_RET_EXPRs, instead of shoehorning it into
the preorder callback.

This allows the jit to reconsider all calls for late devirtualization,
not just calls that are parents of particular GT_RET_EXPRs. The jit will
take advantage of this in subsequent work that does more aggressive
bottup-up type sharpening.

Reconsidering all calls for late devirt instead of just a select subset
incurs around a 0.5% throughput impact.

To mitigate this, short-circult the tree walk when we no longer see a
GTF_CALL flag. This prevents us from walking through CALL and RET_EXPR
free subtrees.

To make this work we have to make sure all node types propagate flags from
their children. This required some updates to a few node constructors.

There is also an odd quirk in the preorder callback where it may overwrite
the parent node (from GT_ASG to GT_BLK or GT_NOP). If the overwrite creates
a GT_NOP, the postorder callback may see a null tree. Tolerate this for now
by checking for that case in the postorder callback.

Also take advantage of `gtRetExprVal` to tunnel through GT_RET_EXPR chains
to the final tree to avoid needing multiple overwrites when propagating
the return expression back into the IR.

With these mitigations this change has minimal throughput impact.
A-And pushed a commit to A-And/coreclr that referenced this pull request Nov 20, 2018
If the jit decides it needs a return spill temp, and the return value
has already been spilled to a single-def temp, re-use the existing
for the return temp rather than creating a new one.

In conjunction with dotnet#20553 this allows late devirtualization for calls where
the object in the virtual call is the result of an inline that provides
a better type, and the objected formerly reached the call via one or more
intermediate temps.

Closes dotnet#15873.
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Change late devirtualization to run in a postorder callback during
the tree search to update GT_RET_EXPRs, instead of shoehorning it into
the preorder callback.

This allows the jit to reconsider all calls for late devirtualization,
not just calls that are parents of particular GT_RET_EXPRs. The jit will
take advantage of this in subsequent work that does more aggressive
bottup-up type sharpening.

Reconsidering all calls for late devirt instead of just a select subset
incurs around a 0.5% throughput impact.

To mitigate this, short-circult the tree walk when we no longer see a
GTF_CALL flag. This prevents us from walking through CALL and RET_EXPR
free subtrees.

To make this work we have to make sure all node types propagate flags from
their children. This required some updates to a few node constructors.

There is also an odd quirk in the preorder callback where it may overwrite
the parent node (from GT_ASG to GT_BLK or GT_NOP). If the overwrite creates
a GT_NOP, the postorder callback may see a null tree. Tolerate this for now
by checking for that case in the postorder callback.

Also take advantage of `gtRetExprVal` to tunnel through GT_RET_EXPR chains
to the final tree to avoid needing multiple overwrites when propagating
the return expression back into the IR.

With these mitigations this change has minimal throughput impact.




Commit migrated from dotnet/coreclr@dfd0f4f
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
If the jit decides it needs a return spill temp, and the return value
has already been spilled to a single-def temp, re-use the existing
for the return temp rather than creating a new one.

In conjunction with dotnet/coreclr#20553 this allows late devirtualization for calls where
the object in the virtual call is the result of an inline that provides
a better type, and the objected formerly reached the call via one or more
intermediate temps.

Closes dotnet/coreclr#15873.

Commit migrated from dotnet/coreclr@ccc18a6
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants