Description
Starting a new thread for this, because the previous threads have become long and confused by rollbacks and cherrypicks. Previous history here:
-
[bug] clang miscompiles coroutine awaiter, moving write across a critical section #56301: this issue was fixed for a reproducer I filed, but not for all programs/optimization levels (see below).
-
[bug] clang incorrectly uses coroutine frame for scratch space after suspending #65018: a regression caused by the first try at fixing [bug] clang miscompiles coroutine awaiter, moving write across a critical section #56301. It was fixed by reverting.
I've found another case where clang incorrectly writes to the coroutine frame after suspending, introducing a data race. To recap, clang must not write to the coroutine frame after suspending, because another thread may have already resumed/destroyed the coroutine by the time the write happens.
Here is the program, a slight variant of the one in #65018 (with a fix for the "caller always leaks" issue discussed in #65030):
#include <coroutine>
// A simple awaiter type with an await_suspend method that can't be
// inlined.
struct Awaiter {
const int& x;
bool await_ready() { return false; }
std::coroutine_handle<> await_suspend(const std::coroutine_handle<> h);
void await_resume() {}
};
struct MyTask {
// A lazy promise with an await_transform method that supports awaiting
// integer references using the Awaiter struct above.
struct promise_type {
MyTask get_return_object() {
return MyTask{
std::coroutine_handle<promise_type>::from_promise(*this),
};
}
std::suspend_always initial_suspend() { return {}; }
std::suspend_always final_suspend() noexcept { return {}; }
void unhandled_exception();
auto await_transform(const int& x) { return Awaiter{x}; }
};
std::coroutine_handle<> h;
};
// A global array of integers.
int g_array[32];
// A coroutine that awaits each integer in the global array.
MyTask FooBar() {
for (const int& x : g_array) {
co_await x;
}
}
Compiler Explorer isn't yet up to date with recent commits, but I built clang manually at b32aa72 and then ran ./bin/clang -std=c++20 -O0 -S -masm=intel foo.cc
. This gives me the following output:
; (In the preamble before modifying rdi)
;
; Stash the address 64 bytes into the coroutine frame in the local thread's
; stack.
mov rax, rdi
add rax, 64
mov qword ptr [rbp - 96], rax # 8-byte Spill
[...]
; Call await_suspend, then store its result on the local thread's stack.
call _ZN7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEE@PLT
.Ltmp16:
mov qword ptr [rbp - 200], rax # 8-byte Spill
jmp .LBB19_17
.LBB19_17:
; Reload `&coroutine_frame + 64` into rdi.
mov rdi, qword ptr [rbp - 96] # 8-byte Reload
; Reload the result of await_suspend into rax.
mov rax, qword ptr [rbp - 200] # 8-byte Reload
; Write the result of await_suspend to `&coroutine_frame + 64`.
; ------> This is the bug!
mov qword ptr [rdi], rax
; Call std::coroutine_handle::address, handing it a `this` pointer that points
; at the coroutine frame. Use the result as an indirect jump target.
call _ZNKSt7__n486116coroutine_handleIvE7addressEv
mov rdi, rax
mov rax, qword ptr [rax]
add rsp, 256
pop rbp
.cfi_def_cfa rsp, 8
jmp rax # TAILCALL