-
Notifications
You must be signed in to change notification settings - Fork 2.6k
JIT: refactor how we do late devirtualization #20553
JIT: refactor how we do late devirtualization #20553
Conversation
Change late devirtualization to run in a postorder callback during the tree search to update GT_RET_EXPRs, instead of shoehorning it into the preorder callback. This allows the jit to reconsider all calls for late devirtualization, not just calls that are parents of particular GT_RET_EXPRs. The jit will take advantage of this in subsequent work that does more aggressive bottup-up type sharpening. Reconsidering all calls for late devirt instead of just a select subset incurs around a 0.5% throughput impact. To mitigate this, short-circult the tree walk when we no longer see a GTF_CALL flag. This prevents us from walking through CALL and RET_EXPR free subtrees. To make this work we have to make sure all node types propagate flags from their children. This required some updates to a few node constructors. Also take advantage of `gtRetExprVal` to tunnel through GT_RET_EXPR chains to the final tree to avoid needing multiple overwrites when propagating the return expression back into the IR. With these mitigations this change has minimal throughput impact.
@erozen PTAL No diffs |
Note that using |
Looks like there's an issue with SysV conventions. Investigating... |
// replacement may have enabled optimizations by providing more | ||
// specific types for trees or variables. | ||
fgWalkTree(&stmt->gtStmtExpr, fgUpdateInlineReturnExpressionPlaceHolder, fgLateDevirtualization, | ||
(void*)this); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice comment.
Ubuntu arm failure here is one we've seen a lot lately:
Other failures look like issues with my change. Perhaps another missing flag case. |
Issue seems to be that the preorder callback mutates the parent node, which screws up invocation of the postorder callback.
So need to figure out if / how the walker can recover, or how to defer this update. |
We have (tree = 21, parent = 24)
and the ret expr is:
and the new parent tree gets detected as a self-assignment and 24 is updated in place with
so the postorder callback is invoked with If it turned out not to be a self-assignment there would be a GT_BLK here instead, and it appears the preorder and postorder would end up visiting all the right operands the postorder would still end up visiting both operands (since the parent was a GT_ASG and so also a binop). So likely things would "work" if we just checked for nullptr trees in the postorder callback, but relying on this seems awfully dodgy. Am going to look at deferring this update to the postorder traversal where it can be done cleanly. |
@dotnet-bot test Ubuntu arm Cross Checked Innerloop Build and Test |
Or maybe those clauses for returning structs by value aren't needed anymore....? |
Hmm. Looks like the GT_ASG to GT_BLK swap is tolerated and even somewhat expected by the current code. And we have access to the parent in the postorder so can just see if it got changed to a GT_NOP. So will go with that for now. |
3 remaining failures are seemingly unrelated
|
@dotnet-bot test Ubuntu x64 Checked Innerloop Build and Test |
@dotnet/jit-contrib think this is ready. Just to sum up the above series of notes: the current It does not look that easy to move this logic to the postorder callback where it would be a more natural fit: we'd have to leave ourselves a "todo" flag or something similar. So for now I'm detecting the GT_NOP bashing in the postorder callback and bailing out. I am hoping that progress on first class structs will remove some of the context-sensitivity from struct typed nodes and allow us to do this sort of update without having to also do edits to parents. So I believe the current workaround is tolerable. |
windows arm64 failure. Will investigate as it could be related to the change.
|
I don't see any diffs for this test (either main assembly or helper dll) using the arm64 altjit. So am going to retry... @dotnet-bot retest Windows_NT arm64 Cross Checked Innerloop Build and Test |
Same test failed again and is failing for other PRs. It was just introduced yesterday. Tracking via #20580. |
@dotnet/jit-contrib ping .. ? I have some follow on changes that need this. |
If the jit decides it needs a return spill temp, and the return value has already been spilled to a single-def temp, re-use the existing for the return temp rather than creating a new one. In conjunction with dotnet#20553 this allows late devirtualization for calls where the object in the virtual call is the result of an inline that provides a better type. In particular we see this pattern for `ArrayPool<T>.Shared.Rent/Release`. Closes dotnet#15873.
If the jit decides it needs a return spill temp, and the return value has already been spilled to a single-def temp, re-use the existing for the return temp rather than creating a new one. In conjunction with dotnet#20553 this allows late devirtualization for calls where the object in the virtual call is the result of an inline that provides a better type, and the objected formerly reached the call via one or more intermediate temps. Closes dotnet#15873.
If the jit decides it needs a return spill temp, and the return value has already been spilled to a single-def temp, re-use the existing for the return temp rather than creating a new one. In conjunction with #20553 this allows late devirtualization for calls where the object in the virtual call is the result of an inline that provides a better type, and the objected formerly reached the call via one or more intermediate temps. Closes #15873.
Change late devirtualization to run in a postorder callback during the tree search to update GT_RET_EXPRs, instead of shoehorning it into the preorder callback. This allows the jit to reconsider all calls for late devirtualization, not just calls that are parents of particular GT_RET_EXPRs. The jit will take advantage of this in subsequent work that does more aggressive bottup-up type sharpening. Reconsidering all calls for late devirt instead of just a select subset incurs around a 0.5% throughput impact. To mitigate this, short-circult the tree walk when we no longer see a GTF_CALL flag. This prevents us from walking through CALL and RET_EXPR free subtrees. To make this work we have to make sure all node types propagate flags from their children. This required some updates to a few node constructors. There is also an odd quirk in the preorder callback where it may overwrite the parent node (from GT_ASG to GT_BLK or GT_NOP). If the overwrite creates a GT_NOP, the postorder callback may see a null tree. Tolerate this for now by checking for that case in the postorder callback. Also take advantage of `gtRetExprVal` to tunnel through GT_RET_EXPR chains to the final tree to avoid needing multiple overwrites when propagating the return expression back into the IR. With these mitigations this change has minimal throughput impact.
If the jit decides it needs a return spill temp, and the return value has already been spilled to a single-def temp, re-use the existing for the return temp rather than creating a new one. In conjunction with dotnet#20553 this allows late devirtualization for calls where the object in the virtual call is the result of an inline that provides a better type, and the objected formerly reached the call via one or more intermediate temps. Closes dotnet#15873.
Change late devirtualization to run in a postorder callback during the tree search to update GT_RET_EXPRs, instead of shoehorning it into the preorder callback. This allows the jit to reconsider all calls for late devirtualization, not just calls that are parents of particular GT_RET_EXPRs. The jit will take advantage of this in subsequent work that does more aggressive bottup-up type sharpening. Reconsidering all calls for late devirt instead of just a select subset incurs around a 0.5% throughput impact. To mitigate this, short-circult the tree walk when we no longer see a GTF_CALL flag. This prevents us from walking through CALL and RET_EXPR free subtrees. To make this work we have to make sure all node types propagate flags from their children. This required some updates to a few node constructors. There is also an odd quirk in the preorder callback where it may overwrite the parent node (from GT_ASG to GT_BLK or GT_NOP). If the overwrite creates a GT_NOP, the postorder callback may see a null tree. Tolerate this for now by checking for that case in the postorder callback. Also take advantage of `gtRetExprVal` to tunnel through GT_RET_EXPR chains to the final tree to avoid needing multiple overwrites when propagating the return expression back into the IR. With these mitigations this change has minimal throughput impact. Commit migrated from dotnet/coreclr@dfd0f4f
If the jit decides it needs a return spill temp, and the return value has already been spilled to a single-def temp, re-use the existing for the return temp rather than creating a new one. In conjunction with dotnet/coreclr#20553 this allows late devirtualization for calls where the object in the virtual call is the result of an inline that provides a better type, and the objected formerly reached the call via one or more intermediate temps. Closes dotnet/coreclr#15873. Commit migrated from dotnet/coreclr@ccc18a6
Change late devirtualization to run in a postorder callback during
the tree search to update GT_RET_EXPRs, instead of shoehorning it into
the preorder callback.
This allows the jit to reconsider all calls for late devirtualization,
not just calls that are parents of particular GT_RET_EXPRs. The jit will
take advantage of this in subsequent work that does more aggressive
bottup-up type sharpening.
Reconsidering all calls for late devirt instead of just a select subset
incurs around a 0.5% throughput impact.
To mitigate this, short-circult the tree walk when we no longer see a
GTF_CALL flag. This prevents us from walking through CALL and RET_EXPR
free subtrees.
To make this work we have to make sure all node types propagate flags from
their children. This required some updates to a few node constructors.
Also take advantage of
gtRetExprVal
to tunnel through GT_RET_EXPR chainsto the final tree to avoid needing multiple overwrites when propagating
the return expression back into the IR.
With these mitigations this change has minimal throughput impact.