JIT: Add runtime async transformation #114861
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
// true if this is an Await that we can optimize
//
bool Compiler::impMatchAwaitPattern(const BYTE* codeAddr, const BYTE* codeEndp, int* configVal)
{
Here we detect the pattern where one Async method awaits another and turn it into an async call.
T result = AsyncHelpers.Await<T>(AsyncMethod());
becomes
// call Async variant, skip wrapping/unwrapping the result through `Task<T>`
T result = async_call RtAsyncMethod();
(added a comment to reference from another PR)
cc @dotnet/jit-contrib No diffs beyond some minor TP diffs. All JIT-EE changes are stubbed out in this PR to validate no diffs. The implementations will come with #114675, but if this PR merges before that one I will submit a follow-up that updates the JIT-EE interface and moves the stubs to the EE side. There is a high-level overview of the transformation written above and I have tried hard to break the transformation itself into smaller pieces that are hopefully easier to understand.
Overall this looks good, I didn't see anything that looked like it needed addressing before merging.
For OSR methods you might make it clearer that the IL offset is the offset that inspires the OSR method, not the IL offset of any suspension point; I was confused by this for a while.
You mentioned somewhere that you needed to disable some or all of escape analysis to prevent creation of byrefs...? I didn't see that here.
@@ -1358,6 +1363,11 @@ int LinearScan::BuildCall(GenTreeCall* call)
    buildInternalRegisterUses();

    // Now generate defs and kills.
    if (call->IsAsync() && compiler->compIsAsync() && !call->IsFastTailCall())
Is there something written up about how async tail calls work (or don't)?
Nothing explicit, but the support mostly just falls out with the non-standard calling convention.
I assume only Async->Async kind of call can be tail-called?
Yes, there is an added check for that somewhere in morph.cpp.
        return;
    }

    for (BasicBlock* block : Blocks())
If we are killing off all CSEs at each async call why do we need this per-block scanning? Shouldn't the right availability just fall out from the async call kills?
I think that the two computations are both needed:
This part is computing the correct CseGen/CseOut set for each basic block. A block with an async call may generate a byref CSE, but only if that byref CSE comes after the async call in execution order. With those per-block sets a data flow afterwards computes the correct in/out CSE sets for every basic block.
Later we also iterate the IR and want to track all available CSEs at all points in the IR. For that we start with the CseIn sets computed by this data flow at the beginning of each basic block, and then track kills/defs manually while going through the IR. That part once again needs the correct kill/generation logic (at a higher level of detail).
src/coreclr/jit/lower.cpp (Outdated)

GenTree* next = asyncCont->gtNext;

// When the ASYNC_CONTINUATION was created as a result of the
Can you explain what this is doing a bit more?
Async resumption stubs are generated in the same way as instantiating stubs by the VM: they are normal IL stubs that use a calli instruction to call the target, and the target has a non-standard calling convention. For runtime async functions the target not only has an extra parameter, it also has an extra GC return value.
The extra GC return can be accessed by the IL code with the AsyncHelpers.AsyncCallContinuation() intrinsic mentioned here. It's used by the async infrastructure to know whether the resumed function suspended again, or whether it finished running and its result needs to be propagated.
The problem without the logic here is that the JIT does not know that the function has the additional GC return value, so it does not set up GC tracking properly. The backend knows about the extra return value based on GTF_CALL_M_ASYNC set on the call node. So the logic here uses the presence of AsyncHelpers.AsyncCallContinuation to make the determination.
This is quite hacky, once again boiling down to our representational difficulties around nodes defining multiple values. My long term ambition in this area is to unify it with the representation that physical promotion eventually will use to define structs returned by calls in multiple registers, and then replace the intrinsic by an intrinsic type instead. So that for instance the IL code would be represented as something like:
ValueWithReturnedContinuation<T> result = calli();
Continuation continuation = result.Continuation;
T value = result.Value;
instead of the current
T result = calli();
Continuation continuation = AsyncHelpers.AsyncCallContinuation();
The former should be less fragile than the latter (which we can accidentally split up or handle incorrectly), but today will spill to stack in a number of cases.
I will expand on the comment.
src/coreclr/jit/block.h (Outdated)
@@ -480,15 +481,15 @@ enum BasicBlockFlags : uint64_t
    // For example, the top block might or might not have BBF_GC_SAFE_POINT,
    // but we assume it does not have BBF_GC_SAFE_POINT any more.

-   BBF_SPLIT_LOST = BBF_GC_SAFE_POINT | BBF_NEEDS_GCPOLL | BBF_HAS_JMP | BBF_KEEP_BBJ_ALWAYS | BBF_CLONED_FINALLY_END | BBF_RECURSIVE_TAILCALL,
+   BBF_SPLIT_LOST = BBF_GC_SAFE_POINT | BBF_NEEDS_GCPOLL | BBF_HAS_JMP | BBF_KEEP_BBJ_ALWAYS | BBF_CLONED_FINALLY_END | BBF_RECURSIVE_TAILCALL | BBF_ASYNC_RESUMPTION,
Seems a bit odd to have BBF_ASYNC_RESUMPTION in split/lost like this? Is there some convention you're following where a block must have just one async call and it's at the end?
Yes, this seems wrong... This should only be in the split gained flags (the flag is an overapproximation, just to silence the "jump into try region" assert).
    }
    else
    {
        inf.DataSize = layout->GetSize();
So for mixed gc/non-gc structs we'll still have space for the gc parts (presumably zeroed) in the nongc-data?
Correct. That's probably something we could try optimizing, but likely in the long term we'll generate a data type that stores structs in their natural format anyway.
    }
    else
    {
        inf.Alignment = m_comp->info.compCompHnd->getClassAlignmentRequirement(layout->GetClassHandle());
Does alignment actually matter here?
Only for arm32 doubles/floats, probably. We could mark the loads in the restore path as unaligned instead.
    JITDUMP(" Creating resumption " FMT_BB " for state %u\n", resumeBB->bbNum, stateNum);

    unsigned resumeByteArrLclNum = BAD_VAR_NUM;
    if (layout.DataSize > 0)
Maybe leave a note that we need to restore non-GC data first and GC data afterwards, so the GC data doesn't get overwritten?
Will add
It's disabled here:
@@ -686,7 +686,7 @@ bool LinearScan::isContainableMemoryOp(GenTree* node)
    // mask - the mask (set) of registers.
    // currentLoc - the location at which they should be added
    //
Nit: add a comment about returning the RefTypeKill refposition.
Will add that as part of a follow-up.
@@ -594,6 +594,8 @@ OPT_CONFIG_INTEGER(JitDoIfConversion, "JitDoIfConversion", 1)
OPT_CONFIG_INTEGER(JitDoOptimizeMaskConversions, "JitDoOptimizeMaskConversions", 1) // Perform optimization of mask
                                                                                    // conversions

RELEASE_CONFIG_INTEGER(JitOptimizeAwait, "JitOptimizeAwait", 1) // Perform optimization of Await intrinsics
I am assuming this will be OFF by default?
This is ON by default. It is what allows bypassing the materialization of a Task when one Async method awaits another. In production it is supposed to be always on.
The optimization is in theory optional though, and disabling it has revealed bugs in the past. It can also be useful for measuring the performance impact while developing, or narrowing issues specific to this optimization.
The plan is to eventually remove the knob or make it Checked.
Tracked in dotnet/runtimelab#3012
(need to port the issue to the main repo once code moves, applies to other active/tracking Async issues in runtimelab as well)
If potentially regressing existing scenarios is a concern: this knob/optimization only has an effect inside Async methods. If there are no Async methods, there is no effect, so it is safe to leave it ON.
if (compIsAsync() && JitConfig.JitOptimizeAwait())
{
isAwait = impMatchAwaitPattern(codeAddr, codeEndp, &configVal);
}
. . .
I went through lsra, codegen and emitter changes and they look good to me.
Add the JIT parts of runtime async. This introduces a new transformation that runs before lowering and that transforms the function into a state machine, with states set up to allow suspension and resumption at async calls (the suspension points).
Suspension is indicated by the callee returning a continuation object using a new calling convention. When suspension happens, the caller similarly suspends by capturing all live state and saving it in a continuation object. These continuation objects are then linked together and continue to be propagated back to the callers until we finally get to a caller that is not async anymore (i.e. one that expects to see a `Task`/`Task<T>`). The VM synthesizes a `Task`/`Task<T>` wrapper for the async functions that hides away the management of the continuation objects. See #114675 for more details around this and around the calling convention.
The continuation objects can later be used to resume the function at the point where suspension happened. This is accomplished by async functions also taking a continuation object as a parameter. When such a parameter is passed (i.e. it is non-null), the JIT will restore all live state from the continuation and resume from the correct location. The continuations store a state number so that resumption knows where to resume.
The full resumption process also involves an IL stub called the async resumption stub. This stub is responsible for calling the async function with the right stack frame setup for arguments and simultaneously passing a non-null continuation. The stack frame setup done by the IL async resumption stub is important as the JIT uses this space to restore live parameters from the continuation.
Continuations similarly support propagation of the return values from the callee and of potential exceptions thrown by the callee. Return values are stored in a known location in the continuation object, and the async resumption stubs are responsible for propagating these values into the next continuation when suspension/resumption has happened. The JIT's resumption code will fetch the return value from the known location and copy it to the right place in the caller. Similarly, exceptions are kept in a known place and are handled by being rethrown from the right location when present.
OSR functions come with complications. Since they rely on frame setup done by the corresponding tier-0 method, the resumption of these methods needs to happen through the tier-0 method. When OSR methods suspend they store their IL offset in the continuation object, while tier-0 methods with patchpoints store -1. The VM then always resumes these methods in the tier-0 method, which uses the IL offset to determine whether resumption should happen in the tier-0 method or whether control needs to continue into an OSR method.