[Clang] Forward incoming Indirect parameters across musttail calls#199351
[Clang] Forward incoming Indirect parameters across musttail calls#199351xroche wants to merge 26 commits into
Conversation
When a C function passes a struct by value to a musttail callee, Clang's frontend implements "by value" with a local alloca + memcpy + pass-pointer pattern (the byval-temp). The alloca lives in the caller's frame, which the tail call deallocates before the callee dereferences the pointer. The callee reads freed stack memory. Reproduces on RV64, AArch64, ARM, LoongArch64, and SystemZ at every optimization level. This is the argument-side analog of the SRet forwarding fix in a96c14e ("[Clang] Always forward sret parameters to musttail calls", Kiran 2024-08-19). For musttail calls in the ABIArgInfo::Indirect case, when the call argument's source LValue resolves to a forwarded incoming Indirect parameter of the current function with a matching ABI shape, forward the incoming llvm::Argument directly instead of creating a byval-temp. Falls through to the existing byval-temp path for any other source. Three safety guards in the helper: - ABI-attribute match (Verifier V7). Refuse to forward when the incoming parameter and the call slot disagree on byval. - noalias deduplication. If the user writes `musttail callee(a, a)` with `a` a noalias Indirect parameter, do not forward the same Argument to both slots; pre-fix gave two distinct allocas and aliasing must not regress. - AddrSpaceCast peek-through. EmitParmDecl wraps incoming Indirect parameters in addrspacecast on NVPTX/AMDGPU/SPIR. Peek through one cast; do NOT unwrap loads (a load through a local alloca means the source is a local and the fix must not engage). Scope: this PR fixes the C source case. C++ source for the same construct routes through CXXConstructExpr + EmitAnyExprToTemp which materializes an agg.tmp before EmitCall runs. The fix correctly falls through in that case (the source is the local alloca, not the Argument) but does not eliminate the dangle. A follow-up PR will plumb IsMustTail through EmitCallArg to cover the C++ case. Test: clang/test/CodeGen/musttail-indirect-arg.c covers plain forward, two-arg forward, swapped args, mixed direct+indirect, modify-then- forward, and negative cases (local source, computed copy, non- musttail). Runs on riscv64, aarch64, loongarch64, s390x. Fixes llvm#56908. Helps llvm#116568 llvm#157814 llvm#46402 llvm#190429 llvm#56435 llvm#72555. Complement of the backend fix in llvm#185094 (RISC-V) and 0be65ba (LoongArch). Assisted-by: Claude (Anthropic) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reduce comment volume on getForwardableIncomingMustTailArg and the call site to match the SRet precedent (a96c14e). Same code, shorter comments. Assisted-by: Claude (Anthropic) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Restrict the helper to ABIArgInfo::Indirect (not IndirectAliased).
CallSlotInfo.getIndirectByVal() asserts on IndirectAliased; falling
through to the byval-temp path is the safe behavior for that case.
- Tighten CHECK patterns: anchor the forwarded operand by IR name
(e.g. `@C1({{.*}} %a)`) and exclude `alloca [32 x i8]` in addition
to `alloca %struct.Big` so a future ABI change that picks a
different temp type cannot mask a regression.
- Add P5 cross-BB musttail (musttail behind a branch) and P7 same-
Argument-to-two-slots (pins the noalias dedup behavior under the
Linux C ABI where incoming Indirect params are not noalias).
- Reword P6 (caller modifies the parameter then musttails): this is
a positive case, not a negative one. The fix correctly engages and
eliminates the byval-temp.
No functional change to the fix shape; reviewer-flagged hardening.
Assisted-by: Claude (Anthropic)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@llvm/pr-subscribers-clang @llvm/pr-subscribers-clang-codegen Author: Xavier Roche (xroche) ChangesWhen a C function passes a struct by value to a This is the argument-side analog of the SRet musttail fix in Three guards: ScopeC source only. C++ for the same construct lowers through Tests
Local sanity check on the minimal C reproducer at Fixes #56908. Partially addresses #46402 (RV64/ARM64 manifestation; X86 was addressed by #190540). Related: #56435, #72555, #84090, #116568, #157814, #190429. Backend complement of #185094 (RISC-V) and Assisted-by: Claude (Anthropic) Full diff: https://github.com/llvm/llvm-project/pull/199351.diff 2 Files Affected:
diff --git a/clang/lib/CodeGen/CGCall.cpp b/clang/lib/CodeGen/CGCall.cpp
index 40cc275d40273..ad402c459b892 100644
--- a/clang/lib/CodeGen/CGCall.cpp
+++ b/clang/lib/CodeGen/CGCall.cpp
@@ -5448,6 +5448,45 @@ static unsigned getMaxVectorWidth(const llvm::Type *Ty) {
return MaxVectorWidth;
}
+/// Returns the incoming llvm::Argument of the current function if this
+/// musttail call argument forwards an incoming Indirect parameter with a
+/// matching ABI shape; nullptr to fall through to the byval-temp path.
+/// Argument-side analog of a96c14eeb8fc (SRet musttail forwarding).
+static llvm::Argument *getForwardableIncomingMustTailArg(
+ CodeGenFunction &CGF, const CallArg &CallArgument,
+ const ABIArgInfo &CallSlotInfo,
+ llvm::SmallPtrSetImpl<llvm::Argument *> &AlreadyForwarded) {
+ Address SrcAddr = Address::invalid();
+ if (CallArgument.hasLValue())
+ SrcAddr = CallArgument.getKnownLValue().getAddress();
+ else if (CallArgument.getKnownRValue().isAggregate())
+ SrcAddr = CallArgument.getKnownRValue().getAggregateAddress();
+ else
+ return nullptr;
+ llvm::Value *SrcPtr = SrcAddr.emitRawPointer(CGF);
+
+ // Peek through one AddrSpaceCastInst (NVPTX / AMDGPU / SPIR wrap incoming
+ // Indirect params via EmitParmDecl). Do not unwrap loads: a load through
+ // a local alloca means the source is a local.
+ if (auto *ASC = llvm::dyn_cast<llvm::AddrSpaceCastInst>(SrcPtr))
+ SrcPtr = ASC->getOperand(0);
+
+ auto *IncomingArg = llvm::dyn_cast<llvm::Argument>(SrcPtr);
+ if (!IncomingArg || IncomingArg->getParent() != CGF.CurFn)
+ return nullptr;
+
+ // Verifier V7 requires matching ABI attributes across musttail.
+ if (IncomingArg->hasByValAttr() != CallSlotInfo.getIndirectByVal())
+ return nullptr;
+
+ // Do not forward the same noalias Argument to two slots in one call.
+ if (IncomingArg->hasNoAliasAttr() &&
+ !AlreadyForwarded.insert(IncomingArg).second)
+ return nullptr;
+
+ return IncomingArg;
+}
+
RValue CodeGenFunction::EmitCall(const CGFunctionInfo &CallInfo,
const CGCallee &Callee,
ReturnValueSlot ReturnValue,
@@ -5571,6 +5610,10 @@ RValue CodeGenFunction::EmitCall(const CGFunctionInfo &CallInfo,
// markers that need to be ended right after the call.
SmallVector<CallLifetimeEnd, 2> CallLifetimeEndAfterCall;
+ // Tracks incoming Arguments already forwarded by a musttail Indirect arg,
+ // for noalias deduplication in getForwardableIncomingMustTailArg.
+ llvm::SmallPtrSet<llvm::Argument *, 4> ForwardedMustTailArgs;
+
// Translate all of the arguments as necessary to match the IR lowering.
assert(CallInfo.arg_size() == CallArgs.size() &&
"Mismatch between function signature & arguments.");
@@ -5643,6 +5686,23 @@ RValue CodeGenFunction::EmitCall(const CGFunctionInfo &CallInfo,
case ABIArgInfo::Indirect:
case ABIArgInfo::IndirectAliased: {
assert(NumIRArgs == 1);
+
+ // For musttail, forward an incoming Indirect parameter directly. A
+ // local alloca would dangle after the tail call. Mirrors the SRet
+ // forwarding above (a96c14eeb8fc). Limited to Indirect (not
+ // IndirectAliased) so the byval-ness check in the helper does not
+ // assert on getIndirectByVal().
+ if (IsMustTail && ArgInfo.isIndirect()) {
+ if (llvm::Argument *FwdArg = getForwardableIncomingMustTailArg(
+ *this, *I, ArgInfo, ForwardedMustTailArgs)) {
+ llvm::Value *Val = FwdArg;
+ if (ArgHasMaybeUndefAttr)
+ Val = Builder.CreateFreeze(Val);
+ IRCallArgs[FirstIRArg] = Val;
+ break;
+ }
+ }
+
if (I->isAggregate()) {
// We want to avoid creating an unnecessary temporary+copy here;
// however, we need one in three cases:
diff --git a/clang/test/CodeGen/musttail-indirect-arg.c b/clang/test/CodeGen/musttail-indirect-arg.c
new file mode 100644
index 0000000000000..0e7493f9e38b7
--- /dev/null
+++ b/clang/test/CodeGen/musttail-indirect-arg.c
@@ -0,0 +1,118 @@
+// Test that Clang forwards incoming Indirect parameters across musttail calls
+// instead of creating a byval-temp alloca that would dangle after the tail call
+// deallocates the caller's frame.
+//
+// Companion to musttail-sret.cpp (commit a96c14eeb8fc): same idea, applied to
+// incoming arguments rather than the sret return slot.
+
+// RUN: %clang_cc1 -triple=riscv64-linux-gnu %s -emit-llvm -O1 -o - | FileCheck %s --check-prefix=COMMON
+// RUN: %clang_cc1 -triple=aarch64-linux-gnu %s -emit-llvm -O1 -o - | FileCheck %s --check-prefix=COMMON
+// RUN: %clang_cc1 -triple=loongarch64-linux-gnu %s -emit-llvm -O1 -o - | FileCheck %s --check-prefix=COMMON
+// RUN: %clang_cc1 -triple=s390x-linux-gnu %s -emit-llvm -O1 -o - | FileCheck %s --check-prefix=COMMON
+
+// A struct large enough to land on the indirect-arg path on RV64 (>2*XLEN=16
+// bytes), AArch64 (>16 bytes), LoongArch64, SystemZ.
+struct Big {
+ unsigned long long a, b, c, d;
+};
+
+// Plain forward: caller(B) musttails callee(B). The fix should emit no
+// alloca for the forwarded arg; the call should forward the incoming
+// parameter %a.
+struct Big C1(struct Big a);
+struct Big P1(struct Big a) {
+ __attribute__((musttail)) return C1(a);
+}
+// COMMON-LABEL: define {{.*}} @P1(
+// COMMON-NOT: = alloca {{.*}}struct.Big
+// COMMON-NOT: = alloca [32 x i8]
+// COMMON: musttail call {{.*}} @C1({{.*}} %a)
+
+// Two indirect args, same forwarding: each forwards its own incoming param.
+struct Big C2(struct Big a, struct Big b);
+struct Big P2(struct Big a, struct Big b) {
+ __attribute__((musttail)) return C2(a, b);
+}
+// COMMON-LABEL: define {{.*}} @P2(
+// COMMON-NOT: = alloca {{.*}}struct.Big
+// COMMON: musttail call {{.*}} @C2({{.*}} %a, {{.*}} %b)
+
+// Swapped args: caller(a, b) musttails callee(b, a). Each forwarded slot
+// must resolve to the correct incoming Argument, not by position.
+struct Big C3(struct Big x, struct Big y);
+struct Big P3(struct Big a, struct Big b) {
+ __attribute__((musttail)) return C3(b, a);
+}
+// COMMON-LABEL: define {{.*}} @P3(
+// COMMON-NOT: = alloca {{.*}}struct.Big
+// COMMON: musttail call {{.*}} @C3({{.*}} %b, {{.*}} %a)
+
+// Mixed direct + indirect: only the indirect arg is affected by the fix.
+struct Big C4(int n, struct Big a);
+struct Big P4(int n, struct Big a) {
+ __attribute__((musttail)) return C4(n, a);
+}
+// COMMON-LABEL: define {{.*}} @P4(
+// COMMON-NOT: = alloca {{.*}}struct.Big
+// COMMON: musttail call {{.*}} @C4({{.*}} %n, {{.*}} %a)
+
+// Caller modifies the parameter before the musttail. Clang lowers the
+// write through the incoming pointer, and the fix forwards the same
+// pointer to the callee. No byval-temp.
+struct Big C5(struct Big a);
+struct Big P5(struct Big a) {
+ a.a += 1;
+ __attribute__((musttail)) return C5(a);
+}
+// COMMON-LABEL: define {{.*}} @P5(
+// COMMON-NOT: = alloca {{.*}}struct.Big
+// COMMON: musttail call {{.*}} @C5({{.*}} %a)
+
+// musttail behind a branch: the forwarded pointer must remain live across
+// the basic block transition. Tests that the helper does not assume the
+// musttail is in the entry block.
+struct Big C6(struct Big a, int cond);
+struct Big P6(struct Big a, int cond) {
+ if (cond)
+ __attribute__((musttail)) return C6(a, cond);
+ return a;
+}
+// COMMON-LABEL: define {{.*}} @P6(
+// COMMON-NOT: = alloca {{.*}}struct.Big
+// COMMON: musttail call {{.*}} @C6({{.*}} %a,
+
+// Same Argument forwarded to two slots: the helper engages for both. The
+// noalias deduplication, if it ever fired, would force the second slot
+// back to a byval-temp; but incoming Indirect params under the Linux C
+// ABI are not noalias, so both slots forward %a directly. This pins the
+// behavior so a future change introducing noalias on Indirect params
+// would surface here. (musttail requires matching prototypes, so caller
+// and callee both take two Big args.)
+struct Big C7(struct Big x, struct Big y);
+struct Big P7(struct Big a, struct Big b) {
+ __attribute__((musttail)) return C7(a, a);
+}
+// COMMON-LABEL: define {{.*}} @P7(
+// COMMON-NOT: = alloca {{.*}}struct.Big
+// COMMON: musttail call {{.*}} @C7({{.*}} %a, {{.*}} %a)
+
+// Negative: local source. Caller takes Big a, but musttails with a LOCAL
+// Big initialized in caller's frame. The byval-temp must remain because the
+// source lives in caller's frame and would dangle if forwarded.
+struct Big C8(struct Big a);
+struct Big P8(struct Big a) {
+ struct Big local = {1, 2, 3, 4};
+ __attribute__((musttail)) return C8(local);
+}
+// COMMON-LABEL: define {{.*}} @P8(
+// COMMON: = alloca
+// COMMON: musttail call {{.*}} @C8(
+
+// Non-musttail tail call: the fix must NOT engage. Existing path emits
+// the byval-temp as before, no musttail in the IR.
+struct Big C9(struct Big a);
+struct Big P9(struct Big a) {
+ return C9(a);
+}
+// COMMON-LABEL: define {{.*}} @P9(
+// COMMON-NOT: musttail
|
|
@kiran-isaac, @efriedma-quic: this is the arg-side equivalent of your SRet musttail fix (a96c14e / #104795). Same shape in EmitCall, applied to ABIArgInfo::Indirect call args. Tests modeled on musttail-sret.cpp. Reviewers welcome! |
For C++ struct-by-value arguments, the call argument is a CXXConstructExpr invoking the implicit copy constructor, not the LValueToRValue cast that EmitCallArg uses for C. The trivial copy materializes an agg.tmp before EmitCall runs, so the helper in EmitCall (getForwardableIncomingMustTailArg) sees a local alloca and falls through. Detect the trivial-copy CXXConstructExpr whose source is a ParmVarDecl of the current function inside a musttail call (signaled by CodeGenFunction::MustTailCall being non-null) and take the addUncopiedAggregate path so the LValue of the original parameter reaches EmitCall. The forwarding helper then forwards the incoming Argument, no agg.tmp. Limited to trivially-copyable types so we never elide observable copy constructors or skip destruction. Non-trivial copy ctors / dtors still take the existing materialize-then-copy path. Test: clang/test/CodeGen/musttail-indirect-arg.cpp (companion to the C test) covers plain forward, two-arg, swapped, and verifies the trivial-copy elision does NOT engage for non-trivial copy ctors or non-musttail tail calls. clang/test/CodeGen + CodeGenCXX (7307 tests) clean. Runtime cross-arch sweep on a C++ minimal repro passes on x86_64, riscv64, aarch64, arm, loongarch64, s390x; the same repro fails on RV64/AArch64 without this commit, confirming the new path catches the C++ bug shape. Fixes the C++ scope hole called out in the prior commit (11cdbbf). Assisted-by: Claude (Anthropic) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add the C++ analogs of P5/P6/P7/P8 from the C test that the C++ test was missing per code review: - P5: modify-then-musttail (trivial-copy elision still engages). - P6: cross-BB musttail (elision works across basic blocks). - P7: same Argument forwarded to two slots (pins noalias dedup behavior; incoming Indirect params are not noalias under Linux C++). - P8: source is a non-parameter local (elision must NOT engage, byval-temp pattern remains). Existing non-musttail test renumbered P5 -> P9. Closes the test-coverage gap flagged by the second review on commit 4359083. Assisted-by: Claude (Anthropic) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses review on commit 9b1aa8b (efriedma-quic): the prior code optimized only the "source is incoming Indirect param" case, and test P7 asserted a miscompile (forwarding the same incoming pointer to two call slots aliases two by-value parameters, which violates the C ABI rule of distinct storage per slot). For each musttail Indirect slot i, the i-th incoming Indirect parameter's pointer (CurFn->arg_begin()+FirstIRArg, distinct and well-defined by musttail prototype-match) is the destination. If the source LValue already equals that pointer, pass it. Otherwise the value is copied into the incoming-param storage and that pointer is passed. A single-phase implementation is unsound for permutations like C(b, a) where slot-by-slot in-place writes clobber sources later slots need. Two phases: - Phase 1, in the per-arg loop: when source != destination, capture the source value into a scratch alloca in this frame. - Phase 2, after the per-arg loop: write each scratch into its matching incoming-param destination. Scratches live in this frame and die at tail-call teardown; they are consumed before the tail jump so this is safe. Destinations live in the caller's caller's frame and survive the tail call. EmitCallArg broadens the C++ trivial-copy gate from "ParmVarDecl of this function" to "param or local of this function" so the source LValue reaches EmitCall in both cases. Tests rewritten in spec-driven form: forward, distinct, swap, modify-then-forward, cross-BB, same-arg-twice/thrice, local source, mixed direct+indirect, many-args with stack spill, over-aligned struct, C++ member function, and non-trivial copy ctor as a negative. Local validation: check-clang 50997/52167. Runtime probe under QEMU on aarch64, riscv64, arm-linux-gnueabihf, and s390x passes; the same probe fails on aarch64 and riscv64 with pre-patch clang, exposing the alias and swap miscompiles. Assisted-by: Claude (Anthropic) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
The previous P3 and P17 CHECK patterns were too permissive: they would pass for the older single-phase emit that wrote in place during the per-arg loop and would have miscompiled C(b, a) (both slots end up with orig_b) and C(a, a, a) (callee sees aliased addresses). Asserts the `%musttail.copy = alloca` scratch buffer in both tests. The two-phase emit allocates at least one scratch when any non-identity copy is needed; the single-phase emit allocates none. Verified the scratch survives at -O0 through -O3 on RV64/AArch64/LoongArch64/s390x. Runtime coverage of these cases lives in llvm-test-suite (llvm#410); the IR-level assertion catches the regression without depending on that project. Assisted-by: Claude (Anthropic) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🐧 Linux x64 Test Results
✅ The build succeeded and all tests passed. |
… local sources Five C runtime tests under SingleSource/Regression/C/musttail/ that expose the by-value-parameter-storage class of bug for musttail calls passing struct args indirect. Each test fails on fork-clang on aarch64, riscv64, and s390x at -O2/-O3 (the test-suite's typical opt level), and most fail at -O0 as well; -O1 is generally hidden by optimizer ABI assumptions. - alias-distinct-addresses.c -- C(a, a). Write to x, read y via volatile. If aliased, the write changes y's contents. - triple-alias.c -- C(a, a, a). Same probe for two-of-three aliasing. - local-source.c -- C(local). Local computed from a volatile seed to defeat constant folding. Catches the byval-temp dangle. - mixed-sources.c -- C(local, a). Exercises both phases of the two-phase emit in one call. - swap-alias.c -- C(b, a, a). Combines reorder with aliasing; would catch a fix that handles each separately but mishandles their interaction. The probes use write-then-read-via-volatile so the optimizer cannot reuse prior register values or assume noalias by ABI. Each test returns a distinct rc per failure mode for triage. Companion to llvm/llvm-project#199351.
The previous fall-through under musttail + Indirect silently re-entered the byval-temp path when the source had no LValue and no aggregate RValue address (rare; e.g. a scalar RValue the ABI classifies as Indirect). That re-introduced the dangling-alloca bug this PR fixes. Refuse the case cleanly with an error. Codegen still completes so the diagnostic surfaces normally; the queued error prevents the broken IR from reaching the backend, matching the err_musttail_noexcept_mismatch pattern earlier in the function. Assisted-by: Claude (Anthropic) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🪟 Windows x64 Test Results
✅ The build succeeded and all tests passed. |
The CHECK lines that pinned the scratch as 'alloca struct.Big' fail when SROA promotes the alloca into an SSA load/store of a vector type (observed in CI builds against current main). The scratch variable name %musttail.copy<N> survives in either shape, so match the name with any suffix and drop the per-slot memcpy/memmove asserts for P17 that SROA also collapses. The scratch presence is the load-bearing check; the one-pass-no-scratch regression still fails it. Assisted-by: Claude (Anthropic) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ct-arg
Replace each per-operand `{{.*}}` with `{{[^,]*}}` so a wildcard cannot
span a comma and hide an aliased slot or a smuggled scratch pointer.
Pin the P3 swap and P17 triplicate data-flow with a FileCheck variable:
the value loaded from %a is the one stored into the destination slot, and
the memmove direction is fixed. This catches an in-place clobber the prior
"a scratch exists" check accepted. IR is identical across all four triples;
each new directive was confirmed to fail on the corresponding bug shape.
Assisted-by: Claude (Anthropic)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The musttail Indirect-arg path forwards the i-th incoming parameter pointer (CurFn->arg_begin()+FirstIRArg) as the call slot. That pointer is an SSA Argument, never poison, so the maybe_undef freeze on it was dead defensiveness. Pass it directly. Assisted-by: Claude (Anthropic) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The musttail Indirect-arg fix forwarded a trivially-copyable record argument's source LValue to EmitCall instead of materializing an agg.tmp, but gated it on MustTailCall. The C path right above it (the CK_LValueToRValue case) already forwards unconditionally, so the gate was the only thing keeping the C++ trivial copy/move case from doing the same. Drop the MustTailCall and storage gates. A trivial copy/move constructor whose source is a variable now forwards that variable's LValue for any call; EmitCall makes the real copy at the Indirect/byval boundary, dropping the redundant agg.tmp. A same-type guard rejects a stripped derived-to-base cast that would otherwise slice at a wrong offset. The source is kept to a direct variable reference (the common case); broader lvalue sources stay on the existing path. Test updates reflect the elided agg.tmp/copy. Assisted-by: Claude (Anthropic) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
A musttail Indirect argument is forwarded through the matching incoming parameter, which needs an in-memory source. A wide _BitInt has scalar evaluation kind, so the argument is a scalar value with no source storage to forward, and the call is rejected. Document the diagnostic. Assisted-by: Claude (Anthropic) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
A CUDA device_builtin_surface_type/texture_type is trivially copyable, so the generalized forwarding would hand its global LValue to the call. On device these types are coerced to an i64 handle, and the global is lowered to that handle in EmitAggregateCopy via the NVPTX target hook. Forwarding bypasses that copy, so the direct-coerce read loads the raw global bytes instead of the handle, dropping the texsurf.handle.internal intrinsic. Exclude both types, mirroring the EmitAggregateCopy special case, so they keep the temp-materialization path. Also update CodeGenSYCL/kernel-caller-entry-point.cpp: the kernel-launch wrappers forward their trivially-copyable functor argument on the Linux host ABI, dropping the agg.tmp copy. The lambdas with a non-trivial destructor are not trivially copyable and keep the copy. Assisted-by: Claude (Anthropic) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rding Under -fobjc-gc, a trivially-copyable struct with an object member needs the objc_memmove_collectable write barrier when copied by value, emitted in EmitAggregateCopy. Such a record is still isTriviallyCopyableType, so the generalized forwarding handed its source LValue to the call and the byval boundary passed it without a copy, dropping the barrier. Exclude records with an object member under GC, mirroring the EmitAggregateCopy special case alongside CUDA surface/texture. Regression test in CodeGenObjCXX/gc.mm: removing the guard drops the barrier on the forwarded path. Assisted-by: Claude (Anthropic) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts: # clang/test/CodeGen/aapcs64-align.cpp
| !type->isCUDADeviceBuiltinSurfaceType() && | ||
| !type->isCUDADeviceBuiltinTextureType() && | ||
| !(getLangOpts().getGC() != LangOptions::NonGC && | ||
| type->getAsRecordDecl()->hasObjectMember())) { |
There was a problem hiding this comment.
I'm a little worried this is going to fall out of date... does this list just come from digging through CodeGenFunction::EmitAggregateCopy? Is there anywhere else that uses a similar check?
There was a problem hiding this comment.
Yes, I copied EmitAggregateCopy's special cases, and I found no existing shared check for "this copy is a plain memcpy". I will factorize this, the risks of drift is concerning.
A "cleaner" alternative would be to route the argument copy through EmitAggregateCopy itself so no list is needed, but might be more complex ?
| } | ||
| } | ||
|
|
||
| // C++ analog of the CK_LValueToRValue case above: a trivial copy/move |
There was a problem hiding this comment.
Maybe we should split this off into a followup patch, so it's easier to bisect if there are any issues.
There was a problem hiding this comment.
Humm yes. I could keep the musttail-gated forwarding here (the musttail fix needs it for C++ arguments) and move the "forward for all calls" generalization, with its test churn and the exclusion-list question, to a follow-up PR. WHat do you think ?
Reverts the generalization of the EmitCallArg trivial-copy LValue forwarding to all calls (cb7efcb) and the ObjC GC object-member exclusion it required (5be3d32), restoring the MustTailCall gate. Requested in review: the all-calls behavior change is easier to bisect as its own patch, and the exclusion-list question belongs with it. Under the musttail gate the GC exclusion is not load-bearing: the Indirect-path copies go through EmitAggregateCopy, which emits the write barrier, and a Direct-classified load never writes a collectable slot. The CUDA surface/texture exclusion (2a770e5) is kept, narrowed to device compilation: on NVPTX those types classify as Direct, so a forwarded lvalue would load raw record bytes instead of the handle that EmitAggregateCopy materializes. The generalization moves to a follow-up PR together with its test updates and trivial-copy-arg-forward.cpp. Assisted-by: Claude (Anthropic) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The trivial-copy forwarding in EmitCallArg required the source to be a DeclRefExpr to a local or parameter, after stripping implicit casts; the stripping is what forced that narrow match, since it could drop a derived-to-base adjustment. Per review, match any same-type glvalue without stripping casts: member accesses, dereferences, and globals now forward too, and a derived-to-base source keeps its adjusted base subobject address. Assisted-by: Claude (Anthropic) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Per review, make clear the no-addressable-source rejection is an implementation limit that could be supported, not a fundamental one. Assisted-by: Claude (Anthropic) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Forwarding defers the source byte read to the call boundary, after the other arguments have been evaluated. For a source whose address computation has side effects or reads mutable state (a dereference or a call), that splits the argument's evaluation around the other arguments', an interleaving C++17 [expr.call]/8 forbids. For a pure chain (a variable, dot-member access, derived-to-base adjustment) the deferral is equivalent to initializing that argument last, which is a legal ordering. Restrict the match to pure chains; impure sources fall back to the temp path, which reads at argument position and still routes through the incoming parameter without dangling. This also keeps the sanitizer checks the temp path emits for dereferenced sources. Assisted-by: Claude (Anthropic) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a function passes a struct by value to a
musttailcallee, Clang implements "by value" with a localalloca + memcpy + pass-pointer(the byval-temp). That alloca lives in the caller's frame, which the tail call deallocates before the callee dereferences the pointer. Reproduces on RV64, AArch64, ARM, LoongArch64, and SystemZ at every optimization level.This is the argument-side analog of the SRet musttail fix in
a96c14eeb8fc. For each Indirect call sloti, route the value through the i-th incoming Indirect parameter's pointer (CurFn->arg_begin() + FirstIRArg), which musttail's prototype-match invariant makes distinct per slot and well-defined in the caller's caller's frame. Identity sources pass through directly; others copy into that storage first. The copy runs in two phases so permutations likeC(b, a)don't clobber sources that later slots still read: phase 1 captures each non-identity source into a scratch alloca, phase 2 writes the scratches into their destinations before the tail jump.In C++ the by-value argument is a trivial copy construction rather than an lvalue-to-rvalue cast, so under musttail EmitCallArg forwards any same-type glvalue source instead of materializing the temp. Arguments with no addressable source (a wide
_BitIntscalar, for example) are diagnosed instead of silently dangling; this is an implementation limit a later change could lift. Generalizing the forwarding beyond musttail was split out to a follow-up PR at reviewer request.Fixes #56908. Partially addresses #46402 (RV64/ARM64; X86 was addressed by #190540). Related: #56435, #72555, #84090, #116568, #157814, #190429.
Assisted-by: Claude (Anthropic)