[BREAKING][GPU] New QIPC ops for subgroups by hughperkins · Pull Request #676 · Genesis-Embodied-AI/quadrants

hughperkins · 2026-05-10T09:11:52Z

Summary

Note: this is BREAKING because we rename the tiled reduce operations to have a _tiled suffix, and put back the reductions without the log2_size parameter, without a suffix.

Adds a single, consistent set of new SIMT-subgroup query / inter-processor-communication ops to qd.simt.subgroup, all working portably across CUDA, AMDGPU, and SPIR-V (Vulkan / Metal):

Reductions (lane 0 / broadcast) — reduce_min, reduce_max (lane 0 of each 2**log2_size group), reduce_all_min, reduce_all_max (broadcast to every lane).
- Built on the existing shuffle_down tree / shuffle_xor butterfly patterns; same shape as reduce_add / reduce_all_add.
- NaN handling for floats is implementation-defined and is documented.
Ballot — subgroup.ballot(predicate) returns a u32 bitmask (bit i set iff lane i's predicate is non-zero).
- Single hardware instruction on every backend: __ballot_sync (CUDA), v_ballot_b32 (AMDGPU), OpGroupNonUniformBallot (SPIR-V).
- Cherry-picked from [Perf] Add cross-GPU subgroup.ballot(predicate) primitive #600 (which is now superseded by this PR).
Segmented reduce — subgroup.segmented_reduce_add / segmented_reduce_min / segmented_reduce_max (value, head_flag, log2_size) run a per-lane inclusive scan that resets at every non-zero head_flag, scoped to 2**log2_size consecutive lanes.
- One ballot to materialise the head bitmask, one clz to find each lane's segment head, then a Hillis-Steele inclusive scan bounded by distance >= offset. Cost: 1 ballot + 1 clz + log2_size shuffles + log2_size ops.
- No identity argument is required (unlike exclusive_min / exclusive_max): the per-lane distance >= offset guard ensures the scan never crosses a segment boundary, so a partner from another segment is never combined with the local value.
- The group's first lane is treated as an implicit head, so a segment that starts mid-group is still reduced correctly.
Lane masks — subgroup.lanemask_lt(lane_id) / _le / _eq / _gt / _ge: closed-form u32 masks parametrised by a lane id, mirroring CUDA's __lanemask_{lt,le,eq,gt,ge} but generalised to take any lane_id (pass invocation_id() for the CUDA built-in form).
- Pure @qd.func arithmetic — no backend intrinsic, no shuffle, no ballot — so per-lane-varying lane_id works the same as a uniform one.
- Caller contract: lane_id in [0, 31]; on AMDGPU CDNA wave64 the mask covers only the low 32 lanes (build a 64-bit mask from two u32 ballots if needed).
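The lane-mask family described above can be modelled in plain Python. This is a minimal sketch, not the PR's implementation: the function names mirror the PR, but the bodies are one assumed closed form, and `compaction_rank` is a hypothetical helper added only to show a typical use. Python integers are unbounded, so `1 << 32` is harmless here; a real 32-bit implementation has to avoid that shift for `lane_id == 31`.

```python
MASK32 = 0xFFFFFFFF  # masks are u32, so only lanes 0..31 are representable


def lanemask_lt(lane_id: int) -> int:
    """Bits for all lanes strictly below lane_id."""
    return (1 << lane_id) - 1


def lanemask_le(lane_id: int) -> int:
    """Bits for all lanes up to and including lane_id."""
    return (1 << (lane_id + 1)) - 1


def lanemask_eq(lane_id: int) -> int:
    """Single bit for lane_id."""
    return 1 << lane_id


def lanemask_gt(lane_id: int) -> int:
    """Bits for all lanes strictly above lane_id (within 32 lanes)."""
    return MASK32 & ~lanemask_le(lane_id)


def lanemask_ge(lane_id: int) -> int:
    """Bits for lane_id and all lanes above it (within 32 lanes)."""
    return MASK32 & ~lanemask_lt(lane_id)


def compaction_rank(ballot_mask: int, lane_id: int) -> int:
    """Hypothetical helper: number of set ballot bits below lane_id,
    i.e. the output slot of this lane in a stream compaction."""
    return bin(ballot_mask & lanemask_lt(lane_id)).count("1")
```

For any lane in [0, 31], `lt | eq | gt` covers all 32 bits with no overlap, which is a convenient sanity check for any alternative closed form.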

Drive-by fixes (required by `segmented_reduce_*`, but useful in their own right)

qd.clz is the first user of clz in the codebase, and exposed bugs on every backend:

CUDA — __nv_clz / __nv_clzll are declared on signed types but operate on the underlying bit pattern; route u32 / u64 through them so qd.clz(u32(...)) no longer hits QD_NOT_IMPLEMENTED.
AMDGPU — emit_extra_unary had no clz case; map to LLVM's Intrinsic::ctlz with is_zero_undef=0.
SPIR-V — split GLSL.std.450 FindSMsb (#74, signed) and FindUMsb (#75, unsigned). The unsigned form is required for u32 / u64 inputs whose top bit may be set; FindSMsb is undefined for those (treats them as negative; "most-significant 0-bit" doesn't exist for 0xFFFFFFFF). Cast the result back to i32 before the 32 - msb - 1 subtraction so SPIR-V's strict-type sub is happy.
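The `32 - msb - 1` identity used on SPIR-V can be checked with a small host-side model. This is a sketch under assumptions: `find_umsb` and `find_smsb` are Python stand-ins written from my reading of the GLSL.std.450 semantics (most-significant 1-bit for unsigned or non-negative input, most-significant 0-bit for negative input, -1 when no such bit exists), not calls into any real library.

```python
def find_umsb(x: int) -> int:
    """Model of FindUMsb on a 32-bit value: index of the highest set bit, -1 for 0."""
    return (x & 0xFFFFFFFF).bit_length() - 1


def find_smsb(x: int) -> int:
    """Model of FindSMsb: reinterpret the 32-bit pattern as signed first.
    Non-negative: highest 1-bit. Negative: highest 0-bit."""
    x &= 0xFFFFFFFF
    signed = x - (1 << 32) if x & 0x80000000 else x
    if signed >= 0:
        return signed.bit_length() - 1
    return (~signed).bit_length() - 1


def clz32(x: int) -> int:
    """Count leading zeros of a u32 via the unsigned MSB search."""
    return 32 - find_umsb(x) - 1


def clz32_via_signed(x: int) -> int:
    """Same formula through the signed search; wrong whenever bit 31 is set."""
    return 32 - find_smsb(x) - 1
```

With this model `clz32(0x80000000)` is 0, while the signed route yields 1 for the same bit pattern, which is the kind of mismatch the FindSMsb / FindUMsb split avoids.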

Stacking

This PR is stacked on top of #665 (hp/cross-gpu-subgroup).

It supersedes #600 (hp/cross-gpu-ballot), whose three commits are cherry-picked here unchanged. #600 can be closed once this lands (or once #665 lands, whichever is convenient).

Test plan

Local pre-commit (black, clang-format, ruff, pylint) passes.
pyright (project config) clean for new code; pre-existing errors in untouched files unchanged.
CUDA (cluster, RTX PRO 6000 Blackwell): all 103 new lanemask + segmented_reduce_min/max tests pass; full test_simt.py (586 passed, 1 skipped) green.
Vulkan (same cluster node): all 103 new tests pass; full test_simt.py (567 passed, 20 skipped) green.
AMDGPU (amddesktop, Radeon RX 7900 XTX, gfx1100, ROCm): all 103 new tests pass; full test_subgroup_* (566 passed, 21 deselected) green.
Comment / docstring widths audited against the project's 120c target via find_underwrapped.py.


Aligns the subgroup scope with `block.sync()` and the planned `block.mem_fence()` / `grid.mem_fence()` naming. The old names remain as thin aliases that forward to the new ones and emit a DeprecationWarning on first use (per-alias one-shot guard, plus the existing `warnings.filterwarnings("once", DeprecationWarning, ...)` in `quadrants.lang.misc`). Updates `docs/source/user_guide/subgroup.md` to describe the renames as done (with deprecation aliases) rather than planned.
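The alias-with-one-shot-guard pattern described above might look roughly like the following. It is an illustrative sketch only: `deprecated_alias` is a hypothetical factory, not a function in quadrants, and the real aliases may be written out by hand per op.

```python
import warnings
from typing import Any, Callable


def deprecated_alias(target: Callable[..., Any], old_name: str, new_name: str) -> Callable[..., Any]:
    """Return a thin wrapper that forwards to `target` and warns once per alias."""
    warned = False  # per-alias one-shot guard

    def alias(*args: Any, **kwargs: Any) -> Any:
        nonlocal warned
        if not warned:
            warned = True
            warnings.warn(
                f"{old_name} is deprecated; use {new_name} instead",
                DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller, not the alias
            )
        return target(*args, **kwargs)

    return alias
```

The explicit guard makes the warn-once behaviour independent of whatever warning filters the host application has installed, which is also what makes it testable with a monkeypatched no-op target.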

Brings the four previously partial / TODO data-movement ops up to full CUDA + AMDGPU + SPIR-V coverage: * shuffle_up: add CUDA + AMDGPU lowerings. - CUDA: new `cuda_shuffle_up_{i32,f32,i64,f64}` runtime helpers in runtime_module/runtime.cpp (mirroring `cuda_shuffle_down_*`), built on the already-patched `cuda_shfl_up_sync_{i32,f32}` NVVM intrinsics. Codegen branch + `emit_cuda_shuffle_up` in codegen/cuda/codegen_cuda.cpp. - AMDGPU: new `amdgpu_shuffle_up_{i32,f32,i64,f64}` runtime helpers using the existing `ds_bpermute` path (same FIXME re: DPP fast-path as `shuffle_down`). Codegen branch + `emit_amdgpu_shuffle_up`. * shuffle_xor and broadcast_first: replace TODO `pass` stubs with portable `@qd.func` wrappers that inline into the calling kernel: - `shuffle_xor(value, mask)` ≡ `shuffle(value, u32(lane) ^ mask)` - `broadcast_first(value)` ≡ `broadcast(value, u32(0))` No backend codegen / runtime changes required: every backend that lowers `shuffle` / `broadcast` now lowers these too. Tests: * test_subgroup_shuffle_up (mirror of test_subgroup_shuffle_down) * test_subgroup_shuffle_xor (uses the new wrapper directly; the existing `_pattern` test continues to verify the manual emulation) * test_subgroup_broadcast_first Doc: refresh `docs/source/user_guide/subgroup.md` data-movement support matrix + per-op semantics + performance notes to reflect universal coverage. Drop the now-stale "fail to link on CUDA / AMDGPU" paragraph from the `shuffle_up` section.

Adds the missing test coverage for the rename half of this PR: * test_subgroup_sync (vulkan): smoke that subgroup.sync() — the renamed subgroup.barrier() — traces and runs. * test_subgroup_mem_fence (vulkan): same for subgroup.mem_fence(). * test_subgroup_barrier_deprecation_warn_once: pure-Python unit test asserting subgroup.barrier() emits exactly one DeprecationWarning across multiple calls and forwards to sync(); monkeypatches sync to a no-op so no kernel context is required and the test runs on every arch. * test_subgroup_memory_barrier_deprecation_warn_once: mirror for subgroup.memory_barrier() / subgroup.mem_fence().

… + SPIR-V The data-movement ops in qd.simt.subgroup require uniform control flow with all lanes active (already documented in subgroup.md). Under that contract subgroups (warps / waves) execute in lockstep on CUDA and AMDGPU, so an intra-subgroup control barrier or memory fence is a no-op on those backends. The SPIR-V backend keeps the real OpControlBarrier / OpMemoryBarrier emission because Vulkan / Metal subgroups can diverge. Lower subgroupBarrier / subgroupMemoryBarrier to a placeholder i32 0 (matching the SPIR-V codegen's return convention) on the CUDA and AMDGPU codegen, so calling subgroup.sync() / subgroup.mem_fence() from a kernel succeeds on every GPU backend. The smoke tests for sync()/mem_fence() are now arch=qd.gpu rather than arch=qd.vulkan and confirm tracing + running on each backend. Doc: matrix updated to yes/yes/yes (with a footnote explaining the no-op-on-CUDA/AMDGPU semantics) and the per-op section rewritten to describe the universal lowering.

…+ AMDGPU + SPIR-V" This reverts commit 233b08c. The "no-op on CUDA / AMDGPU" lowering conflated control-flow lockstep with memory ordering. The two are not equivalent: * `sync()` (control barrier) under our uniform-CF + all-lanes-active contract really is a no-op on CUDA / AMDGPU, because warps / waves are already at the same program point. That part was defensible. * `mem_fence()` (memory fence) is NOT a no-op. Lockstep execution does not order memory operations: the compiler may reorder loads / stores across the call, and the SM may buffer writes. A correct CUDA lowering would need at minimum an LLVM `fence` intrinsic with the appropriate scope (or `__threadfence_block()` as an over-strict fallback). That was not done. Rather than ship a half-correct lowering, restore the previous status: both ops remain SPIR-V only, the doc keeps its original "warps are lockstep, these are typically unnecessary; use __syncwarp under divergent control flow" guidance, and the smoke tests stay on arch=qd.vulkan. Implementing real CUDA / AMDGPU lowerings can be a separate, properly-thought-through change.

…GPU + SPIR-V Replaces the earlier (reverted) attempt that lowered these to no-ops on CUDA / AMDGPU "because warps are lockstep", which was wrong about what the user contract guarantees: sync() must reconverge lanes that have been split by independent thread scheduling (Volta+) and mem_fence() must actually order memory. This change wires real backend primitives into the lowering and fixes a long-standing SPIR-V mem_fence() bug. Per-backend lowerings --------------------- sync() (subgroupBarrier): * SPIR-V : already correct - OpControlBarrier(Subgroup, Subgroup, 0). * CUDA : warp_barrier(0xFFFFFFFF), reusing the existing runtime helper that is patched to llvm.nvvm.bar.warp.sync (i.e. __syncwarp). This is the precise warp-scope reconvergence primitive Volta+ needs and is a no-op under uniform CF on Pascal. * AMDGPU : llvm.amdgcn.wave.barrier - LLVM's wave-scope sync primitive. Acts as a compiler reordering barrier on GCN (lockstep) and emits a real wave barrier on RDNA where waves can span multiple SIMDs. mem_fence() (subgroupMemoryBarrier): * SPIR-V : was emitting OpMemoryBarrier(Subgroup, 0). The Memory Semantics operand must have an ordering bit AND at least one storage class, so 0 is invalid; drivers that accept it treat the instruction as a no-op. Now emits AcquireRelease | UniformMemory | WorkgroupMemory, matching what workgroupMemoryBarrier does (just at Subgroup scope). * CUDA : block_memfence(), patched to llvm.nvvm.membar.cta (__threadfence_block). Workgroup-scope, hence over-strict for the subgroup-scope ask but correct - a CTA-scope fence orders memory across the whole CTA, of which the subgroup is a strict subset. * AMDGPU : LLVM 'fence syncscope("workgroup") seq_cst' - lowers to the appropriate s_waitcnt / cache-flush sequence. Same workgroup-scope over-strictness note. Tests ----- test_subgroup_sync and test_subgroup_mem_fence flip from arch=qd.vulkan to arch=qd.gpu and now run on every GPU backend. 
They are smoke tests: they verify the kernel traces, codegens, and runs without error. We do not attempt to construct a producer/consumer race that only the fence makes legal - that kind of test is hard to write portably and easy to make flaky. Doc updates ----------- The Identification-and-control table now shows yes for sync() / mem_fence() across all backends, with a footnote on mem_fence() pointing out the workgroup-scope over-strictness on CUDA / AMDGPU. The semantics section spells out the per-backend lowering and the uniform-CF caller contract.

…s CUDA + AMDGPU + SPIR-V

Closes the last two `no` cells in the Identification-and-control matrix in subgroup.md. Both ops now lower correctly on every GPU backend.

group_size()
------------
* CUDA: returns the static constant 32 (warp size on every supported NVIDIA arch).
* AMDGPU: emits llvm.amdgcn.wavefrontsize; the AMDGPU backend folds it to 32 or 64 based on the function's +wavefrontsize32/+wavefrontsize64 target feature.
* SPIR-V: unchanged - was already querying OpSubgroupSize.

elect()
-------
Reimplemented as a @qd.func wrapper:

    @func
    def elect():
        return i32(invocation_id() == 0)

Inlines at trace time into compare + zext on every backend. Replaces the SPIR-V-only OpGroupNonUniformElect path with a portable definition.

Semantic change worth flagging
------------------------------
OpGroupNonUniformElect is allowed to elect any *active* lane and may pick a different lane on different invocations. The new wrapper deterministically elects lane 0. Under qd.simt.subgroup's documented uniform-CF + all-lanes-active contract this is strictly compatible (lane 0 is always active and is a legal SPIR-V choice), and it makes the behaviour identical across backends. Grepped the codebase before changing - no internal caller depends on the broader OpGroupNonUniformElect semantics.

Tests
-----
* test_subgroup_group_size: every lane writes group_size() into a buffer; the result must be uniform across lanes and in {32, 64}.
* test_subgroup_elect: writes elect(), invocation_id(), and group_size() into per-lane slots, then asserts (a) elect() is in {0, 1}, (b) elected lanes are exactly the invocation_id == 0 lanes, and (c) the elected count equals N / group_size.

Both parametrized over arch=qd.gpu so they run on every available GPU backend.

Doc
---
subgroup.md matrix flips both rows to yes-on-all. Semantics sections describe each backend lowering and call out the elect() lane-0-pinning narrowing of SPIR-V.

… + AMDGPU + SPIR-V

Replaces the SPIR-V-only `subgroup.inclusive_add(v)` with a portable sized variant implemented as a `@qd.func` Hillis-Steele scan over `shuffle_up`. This is the first slice of the planned migration of the inclusive_* / exclusive_* ops to a universal sized API; the other 6 inclusive_* ops still take `(value)` and lower via OpGroupNonUniformInclusiveScan on SPIR-V only.

Implementation
--------------

    @func
    def inclusive_add(value, log2_size: template()):
        lane_in_group = invocation_id() & ((1 << log2_size) - 1)
        for i in static(range(log2_size)):
            offset = static(1 << i)
            partner = shuffle_up(value, u32(offset))
            if lane_in_group >= offset:
                value = value + partner
        return value

* `shuffle_up` is in uniform CF (every lane participates) - matches its documented contract on every backend.
* The `if lane_in_group >= offset` is per-lane arithmetic - no subgroup op inside the conditional.
* Cross-group `shuffle_up` partners are masked off by the lane_in_group guard, so groups smaller than the full subgroup compose correctly when log2_size < log2(group_size).

Backend cleanup
---------------
* Dropped `subgroupInclusiveAdd` from the SPIR-V codegen `inclusive_scan_ops` set in `quadrants/codegen/spirv/spirv_codegen.cpp` - that path is now unreachable for `inclusive_add`. The other 6 inclusive ops still go through that branch.
* Dropped `PER_INTERNAL_OP(subgroupInclusiveAdd)` from internal_ops.inc.h and `POLY_OP(subgroupInclusiveAdd, ...)` from type_system.cpp. No SPIR-V fast path left to keep alive.

Internal caller fix
-------------------
`quadrants.algorithms.PrefixSumExecutor` was passing `subgroup.inclusive_add` as a template-callable to `scan_add_inclusive`, which invokes it as `inclusive_add(val)` with one argument. After the API change this would TypeError. Added a single-arg adapter `subgroup_inclusive_add_warp_i32` next to `warp_shfl_up_i32` in `_kernels.py` that calls `subgroup.inclusive_add(val, 5)` (log2_size=5 -> 32-lane warp/wave scan, matching WARP_SZ in the kernel), and routed the Vulkan branch to the adapter. The CUDA branch still uses `warp_shfl_up_i32` for now.

Tests
-----
`test_subgroup_inclusive_add` (arch=qd.gpu, parametrized over `log2_size in 1..5` and `dtype in {i32, i64, u64, f32, f64}`): runs the scan and verifies each lane's result against a Python running sum.

Doc
---
* Matrix flips `inclusive_add` row to yes-on-all (with the same `*` AMDGPU perf-asterisk as `reduce_add`).
* Top-of-section text and "Performance notes" updated to reflect that `inclusive_add` now has a portable sized form, while the other inclusive_* ops are still mid-migration.
* The "Inclusive scan on SPIR-V" example now uses `inclusive_add(v, 5)` and works on every GPU backend.

… AMDGPU + SPIR-V

Slice 2 of the inclusive_* / exclusive_* migration: extends the same portable @qd.func Hillis-Steele pattern from `inclusive_add` (slice 1) to the other six inclusive ops, sharing a single `_inclusive_scan` helper.

Implementation
--------------

    @func
    def _inclusive_scan(value, op: template(), log2_size: template()):
        lane_in_group = invocation_id() & ((1 << log2_size) - 1)
        for i in static(range(log2_size)):
            offset = static(1 << i)
            partner = shuffle_up(value, u32(offset))
            if lane_in_group >= offset:
                value = op(value, partner)
        return value

    @func
    def inclusive_add(v, log2_size):
        return _inclusive_scan(v, _bin_add, log2_size)

    @func
    def inclusive_mul(v, log2_size):
        return _inclusive_scan(v, _bin_mul, log2_size)

    ... (min / max / and / or / xor follow the same one-line pattern)

The seven `_bin_*` are tiny @func wrappers around `+`, `*`, `min(a,b)`, `max(a,b)`, `a & b`, `a | b`, `a ^ b`. Each is passed as a template-callable to `_inclusive_scan` and gets inlined at trace time, so the public API has the same cost as the slice 1 inline scan: log2_size shuffle+op pairs, no runtime indirection.

This refactors the existing `inclusive_add` (which lived inline in slice 1) onto the shared helper at the same time, so all seven scans live in one place. The externally-observable behaviour of `inclusive_add` is unchanged.

Backend cleanup
---------------
* Removed the entire `inclusive_scan_ops` / `OpGroupNonUniformInclusiveScan` branch from `quadrants/codegen/spirv/spirv_codegen.cpp` - all seven ops now go through the portable Python path on every backend, including SPIR-V.
* Removed the six remaining `subgroupInclusive{Mul,Min,Max,And,Or,Xor}` entries from `internal_ops.inc.h` and `type_system.cpp`.

Tests
-----
* Added `test_subgroup_inclusive_{mul,min,max,and,or,xor}` (arch=qd.gpu), each parametrized over `log2_size in 1..5` and a per-op dtype list:
  - `_mul`: i32, f32, f64 (inputs clamped to [1, 4] so 32-way product fits i32).
  - `_min` / `_max`: i32, f32, f64 (varied non-monotonic inputs).
  - `_and` / `_or` / `_xor`: i32, i64, u64 (bit-varied inputs).
* Refactored the existing `test_subgroup_inclusive_add` to share a small `_check_inclusive_scan` helper with the new tests; the dtype matrix is unchanged (i32, i64, u64, f32, f64).

Doc
---
* Matrix flips all six remaining `inclusive_*` rows to yes-on-all (with `*` for AMDGPU - same ds_bpermute perf note as `inclusive_add`).
* Section header collapses the seven ops into a single block: same shape, only the operator differs.
* Performance notes call out that `OpGroupNonUniformInclusiveScan` is no longer used on SPIR-V even though it was supported - the trade-off is uniform cost across backends.

The `exclusive_*` ops are still TODO stubs - that's slice 3.
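To make the shared Hillis-Steele helper easier to reason about, here is a host-side simulation that treats a subgroup as a Python list with one entry per lane. It is a sketch under assumptions, not quadrants code: `shuffle_up`, `inclusive_scan` and `reference_scan` are local stand-ins, and lanes without a shuffle source simply keep their own value (real hardware leaves that case implementation-defined).

```python
import operator
from itertools import accumulate


def shuffle_up(values, offset):
    """Lane i reads lane i - offset; lanes with no source keep their own value."""
    return [values[i - offset] if i >= offset else values[i] for i in range(len(values))]


def inclusive_scan(values, op, log2_size):
    """Hillis-Steele inclusive scan over independent groups of 2**log2_size lanes."""
    vals = list(values)
    group_mask = (1 << log2_size) - 1
    for i in range(log2_size):
        offset = 1 << i
        partners = shuffle_up(vals, offset)  # every lane shuffles (uniform control flow)
        vals = [
            # The lane_in_group guard masks off partners from a neighbouring group.
            op(v, p) if (lane & group_mask) >= offset else v
            for lane, (v, p) in enumerate(zip(vals, partners))
        ]
    return vals


def reference_scan(values, op, log2_size):
    """Sequential reference: restart the running op-reduce at each group boundary."""
    group = 1 << log2_size
    out = []
    for base in range(0, len(values), group):
        out.extend(accumulate(values[base:base + group], op))
    return out


if __name__ == "__main__":
    src = [(i * 7) % 11 + 1 for i in range(64)]
    for op in (operator.add, operator.mul, min, max, operator.xor):
        assert inclusive_scan(src, op, 3) == reference_scan(src, op, 3)
```

Checking every group of a 64-lane launch, rather than only the first, is what exercises the `lane_in_group >= offset` guard.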

…s i32 The previous `(i % 4) + 1` pattern produced cycles of 1*2*3*4 = 24 per group of 4; over 28 lanes that's 24^7 ≈ 4.6e9, which overflows i32 (and was the only failure in the cuda-side slice 2 run). Replace with `2 if i % 4 == 0 else 1`: max 8 twos in 32 lanes → product ≤ 2**8 == 256, well within i32 and exact in f32.
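The overflow arithmetic above is easy to confirm on the host. The snippet below is only a check of the numbers quoted in this message; the helper names are made up for the illustration.

```python
from math import prod

I32_MAX = 2**31 - 1


def old_inputs(n):
    """Previous test pattern: 1, 2, 3, 4 repeating."""
    return [(i % 4) + 1 for i in range(n)]


def new_inputs(n):
    """Replacement pattern: a 2 on every fourth lane, 1 elsewhere."""
    return [2 if i % 4 == 0 else 1 for i in range(n)]


# Seven full cycles of 1*2*3*4 already exceed the i32 range.
assert prod(old_inputs(28)) == 24**7 == 4_586_471_424
assert prod(old_inputs(28)) > I32_MAX

# The replacement has 8 twos across 32 lanes, so the full product is 2**8.
assert prod(new_inputs(32)) == 2**8 == 256
assert float(prod(new_inputs(32))) == 256.0
```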

Slice 3 (final) of the inclusive_* / exclusive_* migration: replaces the seven TODO-stub `exclusive_*` functions with portable @qd.func implementations layered on top of the inclusive scans from slice 2.

Implementation
--------------

    @func
    def _exclusive_scan(value, op: template(), identity, log2_size: template()):
        inc = _inclusive_scan(value, op, log2_size)
        shifted = shuffle_up(inc, u32(1))
        lane_in_group = invocation_id() & ((1 << log2_size) - 1)
        result = shifted
        if lane_in_group == 0:
            result = identity
        return result

The lane-0 substitution is required: `shuffle_up` with offset 1 is implementation-defined at lane 0 (and `OpGroupNonUniformShuffleUp` calls it undefined outright), so we cannot rely on whatever the hardware happens to produce there.

Identity per op is supplied as a runtime expression in `value`'s dtype, derived from `value` itself so the wrapper does not need to inspect the dtype:

    add: value - value       (zero)
    mul: value - value + 1   (one - the literal +1 takes value's dtype)
    or:  value ^ value       (zero)
    xor: value ^ value       (zero)
    and: ~(value ^ value)    (all bits set)

For `min` and `max` there is no portable type-extreme that can be derived from `value` alone, so those two ops take an explicit `identity` argument:

    exclusive_min(v, log2_size, identity)  # pass +inf or dtype max
    exclusive_max(v, log2_size, identity)  # pass -inf or dtype min

Cost per call: one inclusive scan (`log2_size` shuffle+op pairs) plus one extra `shuffle_up` and a per-lane select.

Tests
-----
* Added `test_subgroup_exclusive_{add,mul,min,max,and,or,xor}` (arch=qd.gpu), each parametrized over `log2_size in 1..5` and a per-op dtype list:
  - `_add`: i32, i64, u64, f32, f64
  - `_mul`: i32, f32, f64 (inputs bounded so 32-way product fits i32)
  - `_min` / `_max`: i32, f32, f64 (caller passes explicit identity)
  - `_and` / `_or` / `_xor`: i32, i64, u64
* Shared `_check_exclusive_scan` helper drives the kernel launch, dtype skip, and per-lane verification: lane 0 must equal the supplied identity, lane k>0 must equal the op-reduce of `src[0..k]`.

Doc
---
* Matrix gains all seven `exclusive_*` rows, all yes-on-all (with `*` for AMDGPU same as inclusive_*).
* New section describes the shared shuffle_up + select pattern, the per-op identity expressions, and why min/max take explicit identities.
* The old "exclusive_*, all_true, any_true, all_equal" TODO-stub section is trimmed down to just the three remaining stubs.
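The inclusive-scan, shift-by-one, substitute-identity pattern from this slice can be simulated on the host as follows. This is a sketch under assumptions: lanes are list entries, the helper names are local stand-ins rather than quadrants APIs, and `IDENTITY_FROM_VALUE` only illustrates the value-derived identity expressions on Python integers.

```python
import operator


def shuffle_up(values, offset):
    """Lane i reads lane i - offset; lanes with no source keep their own value."""
    return [values[i - offset] if i >= offset else values[i] for i in range(len(values))]


def inclusive_scan(values, op, log2_size):
    """Hillis-Steele inclusive scan per group of 2**log2_size lanes."""
    vals = list(values)
    group_mask = (1 << log2_size) - 1
    for i in range(log2_size):
        offset = 1 << i
        partners = shuffle_up(vals, offset)
        vals = [
            op(v, p) if (lane & group_mask) >= offset else v
            for lane, (v, p) in enumerate(zip(vals, partners))
        ]
    return vals


def exclusive_scan(values, op, identity, log2_size):
    """Inclusive scan, shifted up by one lane, with identity at each group's lane 0."""
    inc = inclusive_scan(values, op, log2_size)
    shifted = shuffle_up(inc, 1)
    group_mask = (1 << log2_size) - 1
    return [identity if (lane & group_mask) == 0 else s for lane, s in enumerate(shifted)]


# Identity derived from the value itself, so no dtype inspection is needed.
IDENTITY_FROM_VALUE = {
    "add": lambda v: v - v,      # zero
    "mul": lambda v: v - v + 1,  # one
    "or": lambda v: v ^ v,       # zero
    "xor": lambda v: v ^ v,      # zero
    "and": lambda v: ~(v ^ v),   # all bits set (-1 in two's complement)
}
```

In this model lane 0 of every group holds the identity and lane k holds the reduction of the k lanes before it in the same group, so `exclusive_scan([1, 2, 3, 4], operator.add, 0, 2)` gives `[0, 1, 3, 6]`.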

… scans Both `_check_inclusive_scan` and `_check_exclusive_scan` previously verified only the first group's worth of lanes (lanes 0..group_size-1). Two coverage gaps: 1. For log2_size < 5, multiple independent groups of 2**log2_size lanes share a single 32-lane subgroup. The `lane_in_group >= offset` mask is what isolates them from each other - and that mask was completely untested. A bug there would have silently passed. 2. The 64-lane launch produces two independent 32-lane subgroups (lanes 0-31 and 32-63) running the same scan side by side. Cross-subgroup leakage in the underlying shuffle_up (e.g. an AMDGPU ds_bpermute with the wrong mask) would not have been caught. Both helpers now iterate over every (group, in-group-lane) pair across the full 64-lane launch and verify the expected per-lane value, recomputing the running op-reduce from `src[group_base..]` at each group boundary. Coverage delta: with log2_size=1 the old test verified 2 of 64 lanes; the new test verifies all 64. At log2_size=3, 8 of 64 -> 64 of 64. At log2_size=5, 32 of 64 -> 64 of 64 (still the same group_size, but the second subgroup is now exercised). Validated on the cluster: all 230 scan tests (115 inclusive + 115 exclusive) pass with the extended verification on CUDA and on Vulkan; the slice 1/2/3 implementations were already correct, this just closes the test gap.

…al fix)

`exclusive_*` scans all fail on the Metal backend (via MoltenVK), with the `got` value at lane 1 of each group being whatever the inclusive scan would produce *if the lane-0 conditional update had been applied unconditionally* (e.g. `inc[0] = src[0] op src[0]` instead of `inc[0] = src[0]`). For non-idempotent ops this is visibly wrong; for `and`/`or` it accidentally matches at group 0 because `x op x = x`. Inclusive scans pass because nothing downstream re-reads `inc[0]` across lanes.

Root cause is reconvergence in MoltenVK's SPIR-V → MSL lowering of the pattern `if lane_in_group >= offset: value = op(value, partner)` followed by another subgroup op (the next loop iteration's `shuffle_up`, or the `shuffle_up(inc, 1)` inside `_exclusive_scan`): lanes that took the false branch end up reading stale register state from the subsequent shuffle.

Fix: replace both conditional updates (`if`-then-assignment) with `qd.select`, which lowers to `OpSelect` and keeps every lane in straight-line code. `op(value, partner)` is pure so unconditional evaluation is safe. Adds a comment explaining the choice.

Validated:
- CUDA simt scans: 280/280 pass
- Vulkan simt scans: 280/280 pass
- CUDA scan+sort: 65/65 pass
- Vulkan scan+sort: 65/65 pass

Replaces the long-standing TODO stubs with portable @qd.func implementations plus a CUDA fast path at full-warp size. API: - `subgroup.all_true(predicate, log2_size)` -- AND-reduce `predicate != 0` across each `2**log2_size` group, returns `i32(0|1)` broadcast to every lane. - `subgroup.any_true(predicate, log2_size)` -- OR-reduce, same shape. - `subgroup.all_equal(value, log2_size)` -- broadcast group-lane-0's value, AND-reduce per-lane equality bit. Equality is the backend's native `==` (NaN != NaN, +0.0 == -0.0), matching SPIR-V `OpGroupNonUniformAllEqual`. CUDA shortcut: at trace time, `qd.static()` on `current_cfg().arch` plus the compile-time `log2_size` selects `cuda_all_sync_i32` / `cuda_any_sync_i32` when `log2_size == 5`, so full-warp uses lower to a single `vote.all` / `vote.any` instruction with no branch in the IR. `all_equal` inherits the shortcut transitively via `all_true`. We deliberately do not wire `__match_all_sync` because it requires sm_70+ and uses bit-equality on floats, contradicting the documented `OpGroupNonUniformAllEqual` semantics. Every other backend (Vulkan, Metal, AMDGPU), and CUDA at `log2_size < 5`, falls back to a portable `shuffle_xor` butterfly: `log2_size` shuffles plus `log2_size` ANDs / ORs, fully unrolled into the calling kernel's IR (same shape as `reduce_all_add`). No C++ codegen changes. Tests cover all-true / all-false / one-odd-lane-in-one-group / sparse-pattern scenarios for `all_true` and `any_true`, and all-same / all-distinct / same-per-group / one-outlier-per-group for `all_equal`. Each scenario verifies every group across the full 64-lane launch (so the launch spans two CUDA / Metal / RDNA subgroups, exercising both partial-subgroup multi-group and cross-subgroup behaviour). 
Validated: - CUDA simt: 369/370 (+ 1 expected skip) - Vulkan simt: 350/370 (+20 expected MoltenVK skips) - CUDA scan+sort: 65/65 - Vulkan scan+sort: 65/65 Doc: `docs/source/user_guide/subgroup.md` updated -- support matrix, dedicated section per op, and CUDA-shortcut rationale.
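The portable `shuffle_xor` butterfly fallback for the voting ops can be modelled on the host like this. It is a sketch, not the quadrants implementation: lanes are list entries, `shuffle_xor` is a local stand-in, and the CUDA `vote.all` / `vote.any` shortcut is not modelled.

```python
def shuffle_xor(values, mask):
    """Lane i reads lane i ^ mask (stays inside an aligned power-of-two group)."""
    return [values[i ^ mask] for i in range(len(values))]


def all_true(predicates, log2_size):
    """AND-reduce `predicate != 0` per group, result broadcast to every lane."""
    p = [int(x != 0) for x in predicates]
    for i in range(log2_size):
        partners = shuffle_xor(p, 1 << i)
        p = [a & b for a, b in zip(p, partners)]
    return p


def any_true(predicates, log2_size):
    """OR-reduce `predicate != 0` per group, result broadcast to every lane."""
    p = [int(x != 0) for x in predicates]
    for i in range(log2_size):
        partners = shuffle_xor(p, 1 << i)
        p = [a | b for a, b in zip(p, partners)]
    return p


def all_equal(values, log2_size):
    """Compare every lane against its group's lane 0 with native ==, then AND-reduce."""
    group = 1 << log2_size
    first = [values[lane & ~(group - 1)] for lane in range(len(values))]
    return all_true([int(v == f) for v, f in zip(values, first)], log2_size)
```

Because the comparison is Python's native `==`, the model reproduces the float contract described above: a group containing a NaN is never all-equal, while mixed `+0.0` / `-0.0` is.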

The previous commit replaced `if` with `qd.select` in the scan helpers, but `OpSelect` on MoltenVK/Metal silently returns the false-branch value when an operand is an f32 produced by a shuffle intrinsic. Revert `_inclusive_scan` back to `if`, which works correctly on its own. For `_exclusive_scan`, restructure to shift the input before the inclusive scan (shuffle_up → fill lane 0 with identity → inclusive scan) instead of running the inclusive scan then shuffling the result. The old pattern triggered a separate Metal SPIR-V misoptimization where the register holding the inclusive result was clobbered when only consumed by a shuffle intrinsic. Co-authored-by: Cursor <cursoragent@cursor.com>

Two coverage gaps surfaced during a post-merge audit: * `all_true` / `any_true` were only ever exercised with predicate values 0 or 1, so the `i32(predicate != 0)` cast was untested. Adds a `nonbinary-mixed` scenario (`[((i*17) % 13) - 6 for i in range(N)]` -- mixes 0, positives, and negatives) to both tests. * `all_equal` on floats was documented as "NaN != NaN, +0.0 == -0.0" (matching `OpGroupNonUniformAllEqual`) but no test pinned the contract down. Adds `test_subgroup_all_equal_float_contract` (f32 + f64 x log2_size 1..5) covering: ±0 mixed in every group -> 1; NaN at every group start -> 0; NaN at a single lane -> only that group is 0; all NaN -> every group 0. These also lock the door against a future refactor swapping in `__match_all_sync` on CUDA (which would silently regress to bit-equality on floats). Validated: 45/45 voting tests on CUDA and Vulkan (was 35/35 + 10 new from the float contract scenarios).

* black auto-reformats in `subgroup.py` and `test_simt.py` (line-length=120 per `.pre-commit-config.yaml`). * clang-format auto-reformats in `codegen_amdgpu.cpp` and `spirv_codegen.cpp`. * Drop unused `from quadrants.lang.simt import subgroup` from `_algorithms.py` (left over after the switch to `subgroup_inclusive_add_warp_i32`); ruff re-sorts the remaining import block. * Extend the file-level pyright comment in `subgroup.py` from `reportInvalidTypeForm=false` to also disable `reportOperatorIssue` so that `p & shuffle_xor(...)` / `p | shuffle_xor(...)` in the new voting ops don't trip pyright on `Expr` operator overloads — same false-positive class the existing suppression already covers. Pre-commit (black, clang-format, ruff, pylint, trailing-whitespace, end-of-file) clean. Pyright is down to 6 pre-existing errors in files this branch does not touch (`_tensor_wrapper.py`, `_func_base.py`, `_metal_interop.py`, all from PR #618 / streams work) — net 0 new errors attributable to this branch.

The voting / scan / data-movement work landed with prose wrapped at the AI-default ~80-95c instead of the project's 120c (per `pre-commit` black config `-l 120`). Reflow the affected runs in: * `python/quadrants/lang/simt/subgroup.py` — module-level voting / inclusive / exclusive backend-strategy comments, plus `elect`, `all_true`, `any_true`, `all_equal`, `broadcast_first`, `_inclusive_scan`, all `inclusive_*` / `exclusive_*` op docstrings, and `_exclusive_scan` / `shuffle_xor`. * `tests/python/test_simt.py` — voting / scan section comments, scan verification rationale, voting predicate-truthy / float-contract notes, `test_subgroup_sync` / `_mem_fence` / `_group_size` / `_elect` / `_barrier_deprecation_warn_once` / `_memory_barrier_deprecation_warn_once` docstrings. * `python/quadrants/_kernels.py` — `subgroup_inclusive_add_warp_i32` adapter docstring. * `python/quadrants/algorithms/_algorithms.py` — comment explaining the warp-i32 adapter usage in `PrefixSumExecutor`. No semantic changes; black / pre-commit / pyright still clean. Audited via `find_underwrapped --diff origin/main`: remaining flagged runs are all at ~110-120c (only minor packing imbalance, max ≤ 123c) — no AI-default 80c under-wrapping in this branch's diff.

The CI wrap-checker flagged three C++ comment blocks in PR #665 still wrapped near ~80c (`runtime.cpp:1033`, `runtime.cpp:1136`, `codegen_amdgpu.cpp:507`). While in there I audited the rest of the new C++ subgroup commentary and the per-op intrinsic notes, and reflowed them to the project's 120c target. Also tightened a couple of Python lines that crept past 120c (one f-string docstring, one explanatory comment in test_simt.py). No semantic changes.

CI wrap-checker on PR #665 flagged three more docstring blocks wrapping at 83-87c instead of 120c (`exclusive_add`, `test_subgroup_sync`, `test_subgroup_mem_fence`). Reflow those. No semantic changes.

Stale carry-over from the days when several ops were one-backend stubs; no longer applies now that everything in the doc is universal.

Stacked on hp/cross-gpu-subgroup; same shape as the existing `reduce_add` / `reduce_all_add` pair: * `reduce_min(v, log2_size)` / `reduce_max(v, log2_size)` — `shuffle_down` tree, result valid in lane 0 of each `2**log2_size` group. * `reduce_all_min(v, log2_size)` / `reduce_all_max(v, log2_size)` — `shuffle_xor` butterfly, result broadcast to every lane. Both forms unroll into exactly `log2_size` shuffle+min (or `+max`) pairs in the calling kernel's IR — no kernel-launch overhead, no separate runtime symbol. Lowers to backend-specific min/max intrinsics (`fminnm` / `fmaxnm` on PTX, `llvm.minnum` / `llvm.maxnum` on AMDGPU, `OpFMin` / `OpFMax` on SPIR-V); float-NaN handling is documented as implementation-defined. Tests: parametrized as `qd.gpu` over `i32` / `i64` / `u64` / `f32` / `f64` and `log2_size` in `[1..5]`, verifying every group across the full 64-lane launch. Doc: new rows in the `subgroup.md` Reductions/scans table; new per-op sections; the "removed" note is updated to drop `reduce_min` / `reduce_max` (now portable).
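The two reduction shapes named above (a `shuffle_down` tree with the result in lane 0, and a `shuffle_xor` butterfly with the result in every lane) can be compared in a small host-side model. This is a sketch under assumptions: the helpers are local stand-ins, out-of-range shuffle sources fall back to the lane's own value, and the tree walks offsets from largest to smallest, which is one common ordering and not necessarily the one used in the PR.

```python
def shuffle_down(values, offset):
    """Lane i reads lane i + offset; lanes with no source keep their own value."""
    n = len(values)
    return [values[i + offset] if i + offset < n else values[i] for i in range(n)]


def shuffle_xor(values, mask):
    """Lane i reads lane i ^ mask."""
    return [values[i ^ mask] for i in range(len(values))]


def reduce_min(values, log2_size):
    """Tree reduction: only lane 0 of each 2**log2_size group holds a valid result."""
    vals = list(values)
    for i in reversed(range(log2_size)):
        partners = shuffle_down(vals, 1 << i)
        vals = [min(v, p) for v, p in zip(vals, partners)]
    return vals


def reduce_all_min(values, log2_size):
    """Butterfly reduction: every lane of the group ends up with the group minimum."""
    vals = list(values)
    for i in range(log2_size):
        partners = shuffle_xor(vals, 1 << i)
        vals = [min(v, p) for v, p in zip(vals, partners)]
    return vals
```

Both loops run exactly `log2_size` shuffle+min steps; swapping `min` for `max` gives the `reduce_max` / `reduce_all_max` shapes.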

Implement a portable ballot operation that returns a u32 bitmask where bit i is set if lane i's predicate is non-zero. Works across CUDA (__ballot_sync), AMDGPU (amdgcn_ballot.i32), and SPIR-V/Vulkan (OpGroupNonUniformBallot). Follows the same cross-backend pattern as subgroup.shuffle: a single Python API (subgroup.ballot) dispatches to the appropriate backend intrinsic at codegen time. On AMDGPU CDNA with 64-wide wavefronts only the low 32 bits are returned, consistent with the u32 return type.
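As a host-side illustration of the bitmask contract described above (bit `i` set iff lane `i`'s predicate is non-zero, low 32 bits only), the following sketch uses a hypothetical `model_ballot` helper. It does not call any backend intrinsic.

```python
def model_ballot(predicates):
    # predicates[i] is lane i's predicate; any non-zero value counts as true.
    mask = 0
    for lane, predicate in enumerate(predicates):
        if predicate != 0:
            mask |= 1 << lane
    # Model the u32 return type: lanes 32..63 of a 64-wide wavefront are dropped.
    return mask & 0xFFFFFFFF


small = model_ballot([1, 0, 0, 7])   # lanes 0 and 3 are set
wave64 = model_ballot([1] * 64)      # only the low 32 bits survive
```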

Mac OS X build was failing because spirv_codegen.cpp was accessing IRBuilder::t_v4_uint_ directly, which is a private member. Add a public v4_u32_type() accessor following the existing pattern (u32_type(), bool_type(), etc.) and use it from the ballot lowering.

Per-lane inclusive sum scoped to 2**log2_size lanes, where every lane with head_flag != 0 resets the running sum. Useful for stream compaction and sparse / variable-length records. Implementation: one subgroup.ballot(head_flag != 0) to materialise a u32 of head positions, then a Hillis-Steele inclusive sum bounded by `distance >= offset` (distance = lane - segment_head, segment_head from 31 - clz(effective_mask & ((1 << (lane + 1)) - 1)) with a virtual head OR-injected at group_base so lower is always non-zero). Cost: 1 ballot + 1 clz + log2_size shuffles + log2_size adds, fully unrolled. Same shape as inclusive_add with a single-instruction setup. Tests: parametrized over the standard dtypes (i32 / i64 / u64 / f32 / f64) and log2_size in [0..5], plus three contract tests (no head flags -> equivalent to inclusive_add; every lane is a head -> output equals input; truthy non-binary head_flag values). Doc: new row in the Reductions/scans table; new per-op section after reduce_all_min / reduce_all_max.
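The head lookup and the guarded Hillis-Steele scan can be sketched on the host as follows. `model_segmented_reduce_add` is a hypothetical stand-in that operates on Python lists (one element per lane); `int.bit_length() - 1` plays the role of `31 - clz(...)` for a 32-bit mask, and the shuffle is modelled by list indexing.

```python
def model_segmented_reduce_add(values, head_flags, log2_size):
    n = len(values)
    group = 1 << log2_size

    # One "ballot": bit i is set iff lane i is a segment head.
    ballot = 0
    for lane, flag in enumerate(head_flags):
        if flag != 0:
            ballot |= 1 << lane

    # Per-lane segment head: highest set bit at or below the lane, with a
    # virtual head OR-ed in at the group's first lane so the mask is never zero.
    heads = []
    for lane in range(n):
        group_base = lane - (lane % group)
        effective_mask = ballot | (1 << group_base)
        lower = effective_mask & ((1 << (lane + 1)) - 1)
        heads.append(lower.bit_length() - 1)  # same as 31 - clz(lower) for u32

    # Hillis-Steele inclusive scan, guarded by distance >= offset so a lane
    # never combines with a partner from an earlier segment.
    out = list(values)
    offset = 1
    for _ in range(log2_size):
        partner = [out[i - offset] if i >= offset else out[i] for i in range(n)]
        out = [out[i] + partner[i] if i - heads[i] >= offset else out[i] for i in range(n)]
        offset <<= 1
    return out


ones = [1] * 8
flags = [0, 0, 1, 0, 0, 1, 0, 0]
full_group = model_segmented_reduce_add(ones, flags, 3)   # one group of 8 lanes
half_groups = model_segmented_reduce_add(ones, flags, 2)  # two groups of 4 lanes
```

In `half_groups`, lane 4 restarts at 1 even though its `head_flag` is zero, because the group's first lane acts as an implicit head.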

`qd.clz(u32_value)` was emitting QD_NOT_IMPLEMENTED on CUDA and produced undefined results on SPIR-V (GLSL.std.450 FindSMsb is undefined for the all-bits-set case). The new `subgroup.segmented_reduce_add` is the first user of `clz` in the codebase and exposed both bugs.

* CUDA: route u32 / u64 inputs through the same `__nv_clz` / `__nv_clzll` intrinsics used for i32 / i64. The underlying bit pattern is what matters; the C declaration on signed types is a header-level convention.
* SPIR-V: dispatch to FindUMsb (#75) for unsigned inputs and FindSMsb (#74) for signed. The two GLSL.std.450 instructions return a value of the same type as their operand, so add an explicit OpBitcast back to i32 before the `32 - msb - 1` subtraction (otherwise SPIR-V's strict-type `sub` asserts on mixed i32 / u32).
* Python: in `segmented_reduce_add`, wrap `clz`'s result in `i32(...)` so the subsequent arithmetic is uniformly signed-32-bit (the trace-time tracer would otherwise propagate u32 from the input through to the subtraction, hitting SPIR-V's same-type assertion).

Tests: `subgroup.segmented_reduce_add` tests now pass on CUDA + Vulkan across i32 / i64 / u64 / f32 / f64 and `log2_size` in [0..5], including the all-heads, no-heads, and truthy-predicate edge cases.
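The `32 - msb - 1` relation used on the SPIR-V path can be checked with a small host model. `model_find_umsb` and `model_clz_u32` are hypothetical names; the model follows the GLSL.std.450 definition of FindUMsb (index of the most significant 1 bit, or -1 for a zero input) and is not the codegen itself.

```python
def model_find_umsb(x):
    # Index of the most significant set bit of a 32-bit unsigned value, -1 if x == 0.
    return (x & 0xFFFFFFFF).bit_length() - 1


def model_clz_u32(x):
    # Count of leading zeros expressed through the msb index: 32 - msb - 1.
    return 32 - model_find_umsb(x) - 1


all_set = model_clz_u32(0xFFFFFFFF)   # the case that needs the unsigned instruction
lowest_bit = model_clz_u32(1)
zero_input = model_clz_u32(0)
```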

``v_permlane64_b32`` is gfx940+ (CDNA3) and gfx11+ (RDNA3+) only. On older wave64-capable targets -- gfx9xx (CDNA1/2 Vega/MI100/MI200) and gfx10.x (RDNA1/2, including the gfx1011 V520 used by our AMD GPU CI runner) -- the AMDGPU LLVM backend hits "Cannot select" while lowering ``llvm.amdgcn.permlane64`` and segfaults the JIT. This is what caused the cluster of ``test_subgroup_shuffle_*`` segfaults on the AMD GPU runner since this PR's wave64 cross-half shuffle fix went in. Detect the target ``mcpu_`` at JIT time and patch ``amdgpu_permlane64`` to either the real intrinsic (on supported hardware) or the identity function (on everything else). With the identity patch the cross-half helper degrades to a plain ``ds_bpermute`` -- correct for same-SIMD reads, wrong for cross-SIMD on RDNA1/2 wave64 but matching the pre-cross-half-fix behavior and not crashing. Every existing shuffle test pattern only checks same-half lanes, so they go from crashing back to passing.

Move the entire "Reductions and scans" subsystem out of ``quadrants.lang.simt.subgroup`` into a new ``quadrants.lang.simt.reductions`` module, addressing the "Check feature factorization" CI advisory. The split follows the user-guide structure exactly (one file per docs section), and groups together every op that uses a shuffle-based reduction over ``2**log2_size`` consecutive lanes: ``reduce_*_tiled``, ``segmented_reduce_*_tiled``, ``inclusive_*_tiled``, ``exclusive_*_tiled``, and their full-subgroup wrappers. Public API is unchanged -- the new module is wildcard-re-imported at the bottom of ``subgroup.py``, so ``qd.simt.subgroup.reduce_add(v)`` etc. continue to work. The back-import sits below every primitive that ``reductions`` depends on (``ballot`` / ``invocation_id`` / ``shuffle`` / ``shuffle_up`` / ``shuffle_down`` / ``log2_group_size``), so module load order is well-defined. ``subgroup.py`` drops from 1099 to 474 lines; ``reductions.py`` is 727 lines.

github-actions · 2026-05-14T14:19:57Z

Total: 20 file(s) changed, +1990 -502 code lines.

github-actions · 2026-05-14T15:06:51Z

Diff coverage: 75% · 1659 lines, 412 missing

github-actions · 2026-05-14T16:00:49Z

Total: 20 file(s) changed, +1990 -502 code lines.

github-actions · 2026-05-14T16:51:25Z

Diff coverage: 75% · 1659 lines, 412 missing

The native ``v_permlane64_b32`` instruction is only available on gfx940+ (CDNA3) and gfx11+ (RDNA3+). The previous fallback on gfx9xx CDNA1/2 and gfx10.x RDNA1/2 patched ``amdgpu_permlane64`` to an identity function, which avoided the JIT crash but silently produced wrong results for any cross-half wave64 shuffle (lanes 0-31 reading from lanes 32-63 or vice versa). Replace the identity stub with a wave-local LDS roundtrip: store value to ``lds[wave_base + lane]``, wavefront-scope acquire-release fence (lowers to ``s_waitcnt lgkmcnt(0)``), then read from ``lds[wave_base + (lane ^ 32)]``. Higher latency than the native swap but produces correct results on every wave64-capable AMDGPU target. Wave-scoped slotting (``wave_base = (workitem.id.x >> 6) << 6``) avoids cross-wave collisions in multi-wave workgroups; the 1024-entry buffer is sized for the AMDGPU 1024-thread workgroup max at wave64. Also expose the active AMDGPU mcpu string to Python as ``_qd_core.amdgpu_mcpu()`` for diagnostics and target-specific tests.
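The slotting arithmetic of the LDS roundtrip can be modelled on the host like this. It is a sketch only: `model_permlane64_via_lds` is a hypothetical name, a Python list stands in for the 1024-entry LDS buffer, and the two separate loops stand in for the store / fence / load ordering.

```python
def model_permlane64_via_lds(values):
    # values[tid] is the value held by workitem tid; len(values) is a multiple of 64
    # and at most 1024 (the workgroup size the buffer is sized for).
    lds = [None] * 1024

    # Phase 1: every lane stores its value at lds[wave_base + lane].
    for tid, value in enumerate(values):
        wave_base = (tid >> 6) << 6
        lane = tid & 63
        lds[wave_base + lane] = value

    # (The wavefront-scope fence sits here on the device.)

    # Phase 2: every lane reads the slot of its partner in the other half-wave.
    out = []
    for tid in range(len(values)):
        wave_base = (tid >> 6) << 6
        lane = tid & 63
        out.append(lds[wave_base + (lane ^ 32)])
    return out


swapped = model_permlane64_via_lds(list(range(128)))  # two waves of 64 lanes
```

Because each wave only touches `lds[wave_base : wave_base + 64]`, the second wave's swap never reads the first wave's slots.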

…ting Adds an escape hatch to force the LDS-based ``amdgpu_permlane64`` software emulation on hardware that natively supports the instruction (gfx940+ CDNA3, gfx11+ RDNA3+). Setting ``QD_AMDGPU_FORCE_PERMLANE64_FALLBACK=1`` makes the JIT take the LDS path unconditionally, so the fallback can be validated on a working AMD box without needing a gfx10.x runner.

Switch from ``Intrinsic::getDeclaration`` to ``getOrInsertDeclaration`` (the former is removed in newer LLVM trunk and rejected by the gcc toolchain) and replace brace-enclosed ``ArrayRef<Value*>`` constructor arguments with explicit array temporaries (gcc can't always deduce ``ArrayRef<Value*>`` from a ``{a, b}`` initializer list).

github-actions · 2026-05-14T20:23:18Z

Total: 21 file(s) changed, +2040 -502 code lines.

github-actions · 2026-05-14T21:15:40Z

Diff coverage: 75% · 1659 lines, 412 missing

alanray-tech

Code Review Findings

Overall this is a high-quality, well-tested PR. The algorithms (tree reduce, XOR butterfly, Hillis-Steele scan, ballot-based segmented reduce) are correct and the three-backend coverage is thorough. A few items worth addressing or acknowledging:

SPIR-V `visit(InternalFuncStmt)` missing default error path

spirv_codegen.cpp line ~1481-1611: if stmt->func_name matches none of the if/else if branches, val is never assigned but ir_->register_value(stmt->raw_name(), val) still executes at line 1611, registering an uninitialised spirv::Value into the IR. This is UB unless the front-end guarantees exhaustiveness. Suggest adding a QD_ERROR before the final register_value call (same pattern as the shuffle branch's QD_ERROR("Unsupported operation: ...") at line 1594).

Documentation: `log2_size` valid range inconsistency

docs/source/user_guide/subgroup.md line ~200 says log2_size is in [0, 5] for reductions/scans/votes, and [0, 6] for segmented_reduce_*_tiled only. But the codebase and tests use log2_size == 6 for non-segmented _tiled ops on AMDGPU wave64 (test_subgroup_*_log2_size_6). The [0, 5] claim is incorrect for wave64 non-segmented ops. Suggest changing to [0, log2_group_size()] (i.e. up to 6 on AMDGPU) to match reality.

`subgroupBallotU64` on CUDA: high-zero semantics

codegen_cuda.cpp line ~769-774: subgroupBallotU64 is implemented as ZExt(ballot_i32), so high 32 bits are always zero on CUDA. This is correct and intentional, but the user-facing docs in subgroup.md could be more prominent about this -- a user calling ballot(predicate) on CUDA might expect lanes 32-63 to be populated if they're thinking in terms of the u64 return type. The code comment is good; consider a brief callout in the ballot(predicate) semantic section as well.

`subgroup.py` / `reductions.py` circular import fragility

subgroup.py line ~398: the from quadrants.lang.simt.reductions import * at the bottom of subgroup.py creates a circular dependency that works today only because every symbol reductions.py imports from subgroup is defined before that line. If someone reorders subgroup.py in the future, the import will silently break. The existing comment explains the ordering requirement, which is good -- but consider adding a short note like # WARNING: reordering definitions above this line may break the circular import with reductions.py to make this more visible to future contributors.

AArch64 host: VGPR asm fence absent for AMDGPU bitcode

runtime.cpp line ~1088-1090: the +v asm constraint is gated on __x86_64__ || __i386__ || __amdgcn__, so AArch64-hosted builds produce AMDGPU bitcode without the VGPR hint. The comment documents this well. If ARM-built wheels that run on gfx11+ RDNA3 are a shipping combination, it may be worth a CI regression test covering constant-index shuffle patterns on that build matrix to confirm LLVM's uniformity analysis still emits ds_bpermute rather than folding to v_readlane.

Vulkan `PrefixSumExecutor` assumes subgroup width == 32

_kernels.py line ~288-295: subgroup_inclusive_add_warp_i32 hardcodes log2_size=5, and _algorithms.py routes the Vulkan path through it. This matches the existing CUDA assumption (WARP_SZ = 32), but on a Vulkan driver that reports subgroup_size != 32 (e.g. 16 on some Intel iGPUs, or 64 on AMDGPU via MoltenVK), the prefix-sum would silently compute wrong results. This is a pre-existing limitation (the kernel design assumes 32-wide warps), but now that subgroup intrinsics are wired in for Vulkan it becomes more relevant. Worth a brief comment or a runtime assert that group_size() == 32 in the adapter.

``TaskCodegen::visit(InternalFuncStmt *)`` previously had two sequential ``if / else if`` chains separated by a comment; if ``stmt->func_name`` matched neither, the local ``spirv::Value val`` stayed default-constructed and was still passed to ``ir_->register_value(...)`` at the bottom, producing invalid SPIR-V at run time. Merge the two chains and add a final ``else { QD_ERROR(...); }`` so an unrecognised name surfaces as a compile-time error -- matters now that this PR removed the ``subgroupAdd`` / ``subgroupMul`` / ``subgroupMin`` / ``subgroupMax`` / ``subgroupAnd`` / ``subgroupOr`` / ``subgroupXor`` / ``subgroupInclusive*`` branches, so a stray statement carrying one of those names would otherwise silently corrupt the IR.

The "Reductions and scans" section claimed ``log2_size`` is in ``[0, 5]`` for non-segmented reductions / scans / votes, with the ``[0, 6]`` cap reserved for ``segmented_reduce_*_tiled``. That contradicts both the codebase (every non-segmented ``_tiled`` op is a shuffle butterfly that spans the full wave at ``log2_size == log2_group_size()``, so wave64 backends reach ``log2_size == 6``) and the test suite (``test_subgroup_*_log2_size_6``), and contradicts the very next paragraph which explains that ``reduce_add(v)`` resolves to ``reduce_add_tiled(v, log2_group_size())`` -- i.e. ``log2_size = 6`` on AMDGPU. Rewrite the range as ``[0, log2_group_size()]`` and spell out the wave32 / wave64 expansions explicitly.
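To make the corrected range concrete, here is a minimal sketch of the wave32 / wave64 expansion. `valid_log2_sizes` and `tile_size` are hypothetical helpers that take the subgroup width as a plain integer; they are not part of the library.

```python
def valid_log2_sizes(group_size):
    # Inclusive range [0, log2_group_size] for a power-of-two subgroup width.
    log2_group_size = group_size.bit_length() - 1
    return list(range(log2_group_size + 1))


def tile_size(log2_size, group_size):
    # Number of consecutive lanes covered by one tile, after a range check.
    if log2_size not in valid_log2_sizes(group_size):
        raise ValueError(f"log2_size={log2_size} is out of range for a {group_size}-lane subgroup")
    return 1 << log2_size


wave32 = valid_log2_sizes(32)   # 0 through 5
wave64 = valid_log2_sizes(64)   # 0 through 6
```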

Add a WARNING comment above the bottom-of-module ``from quadrants.lang.simt.reductions import *`` to call out that moving any of ``ballot`` / ``invocation_id`` / ``shuffle`` / ``shuffle_up`` / ``shuffle_down`` / ``log2_group_size`` below this line would silently break the circular import (``reductions`` would observe a partially-populated ``subgroup`` module). Pure documentation -- the import itself already works thanks to those names being defined above this point.

``scan_add_inclusive`` (the kernel ``PrefixSumExecutor`` dispatches to) hard-codes ``WARP_SZ = 32`` for the inter-warp accumulation and the shared-memory layout, and the Vulkan scan primitive is ``subgroup_inclusive_add_warp_i32`` which pre-binds ``log2_size=5``. That whole call graph is correct exactly when the device subgroup width is 32; on a Vulkan driver that reports ``subgroupSize != 32`` (16 on some Intel iGPUs, 64 on AMDGPU Vulkan in wave64 mode) the result is silently wrong. Raise ``RuntimeError`` up front with the actual probed width, so the failure is loud and diagnostic. CUDA always reports a 32-wide warp so the assertion only fires on the Vulkan branch. This is a pre-existing limitation surfaced by PR review feedback; left as ``raise RuntimeError`` rather than as a deeper fix because supporting non-32 subgroup widths in the prefix sum requires rewriting ``scan_add_inclusive`` itself.
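A fail-fast guard of the kind described might look like the sketch below. The function name, its parameter, and the message text are assumptions made for illustration; the probed width is passed in as a plain integer rather than queried from a device.

```python
def require_subgroup_width_32(probed_width):
    # Hypothetical guard: the prefix-sum call graph assumes 32-wide subgroups,
    # so any other probed width is rejected up front with the actual value.
    if probed_width != 32:
        raise RuntimeError(
            f"prefix sum requires a subgroup width of 32, but the device reports {probed_width}"
        )


require_subgroup_width_32(32)  # passes silently

try:
    require_subgroup_width_32(64)
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)
```

Including the probed width in the message is what makes the failure diagnostic rather than just loud.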

The Hillis-Steele inclusive scan was folding the wrong way around: ``partner = shuffle_up(value, offset)`` is lane ``self - offset``'s value (the predecessor in scan order), but the step then did ``value = op(value, partner)``, i.e. ``op(current, predecessor)``. After ``log2(N)`` steps lane ``k`` ended up holding ``op(a[k], op(a[k-1], ..., op(a[1], a[0])))`` -- a right-fold over the lane range -- instead of the documented left-fold ``op(a[0], op(a[1], ..., op(a[k-1], a[k])))``. Invisible for the seven typed wrappers (``_bin_add`` / ``_mul`` / ``_min`` / ``_max`` / ``_and`` / ``_or`` / ``_xor``) because all of them are commutative, but wrong for any non-commutative associative monoid plugged into the generic primitive (matrix multiply, function composition, etc.). ``_exclusive_scan_tiled`` delegates to the inclusive helper and inherits the fix. Flip the argument order to ``op(partner, value)`` and add a regression test that scans affine functions ``f(x) = m*x + c`` packed as ``(m << 16) | c`` in an ``i32`` under composition -- associative, explicitly non-commutative, with deliberately mixed per-lane ``(m, c)`` so the buggy and correct folds diverge after a single step. The new test reproduces the bug at ``log2_size in {1, 2, 5}``; all 120 existing scan tests still pass (commutative ops are direction-invariant). Reported by the PR #684 author while exposing the generic op as ``block.inclusive_scan`` / ``block.exclusive_scan``.
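The effect of the argument order can be reproduced on the host with a small model. This sketch represents each affine function `f(x) = m*x + c` as an `(m, c)` tuple instead of the packed `i32` used by the regression test, and `model_inclusive_scan` / `compose` are hypothetical names; the scan covers a single group of `2**log2_size` lanes.

```python
def compose(f, g):
    # Apply f first, then g: g(f(x)) = g_m * (f_m * x + f_c) + g_c.
    # Associative, but not commutative.
    return (g[0] * f[0], g[0] * f[1] + g[1])


def model_inclusive_scan(values, log2_size, op, partner_first):
    # Hillis-Steele scan over one group; partner is the value `offset` lanes below.
    out = list(values)
    n = len(out)
    offset = 1
    for _ in range(log2_size):
        prev = out
        out = []
        for i in range(n):
            if i < offset:
                out.append(prev[i])
            elif partner_first:
                out.append(op(prev[i - offset], prev[i]))   # fixed order
            else:
                out.append(op(prev[i], prev[i - offset]))   # pre-fix order
        offset <<= 1
    return out


def sequential_scan(values, op):
    # Reference: plain in-order fold, lane by lane.
    out = [values[0]]
    for v in values[1:]:
        out.append(op(out[-1], v))
    return out


funcs = [(2, 1), (3, 0), (1, 5), (2, 2)]
reference = sequential_scan(funcs, compose)
fixed = model_inclusive_scan(funcs, 2, compose, partner_first=True)
buggy = model_inclusive_scan(funcs, 2, compose, partner_first=False)
```

With these inputs `fixed` matches the in-order reference, while `buggy` already differs at lane 1, which is the single-step divergence the commit message relies on.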

alanray-tech

LGTM

github-actions · 2026-05-15T14:25:54Z

Total: 21 file(s) changed, +2049 -505 code lines.

# Conflicts: # docs/source/user_guide/decompositions.md

…de files PR #683 (just merged into main) deleted ``decompositions.md`` and split it into ``linalg_per_thread.md`` + ``matrix_vector_per_thread.md``. The spiritual successor (``linalg_per_thread.md``) reintroduced the "raises an exception at trace time" wording that I'd previously converted to "compile time" in ``decompositions.md``. ``atomics.md`` similarly carried three "trace time" occurrences left over from earlier PRs. Convert all four to "compile time" for consistency with the rest of the user guide (everything is resolved during AST -> IR compilation, not at trace time).

github-actions · 2026-05-15T14:49:52Z

Diff coverage: 26% · 1688 lines, 1241 missing

github-actions · 2026-05-15T15:25:50Z

Total: 21 file(s) changed, +2095 -506 code lines.

github-actions · 2026-05-15T16:52:43Z

Diff coverage: 75% · 1706 lines, 424 missing

…m on retarget

Two AMDGPU JIT-compile crashes surfaced after the v1.0.0 merge pulled in the QIPC subgroup ops (Genesis-Embodied-AI#676), which made the rigid constraint solver's wave-cooperative reductions route through `amdgpu_cross_half_shuffle_i32`. Both manifested as a SIGSEGV inside `llvm::SIInstrInfo::getInstSizeInBytes` during `JITSessionAMDGPU::compile_module_to_hsaco` (i.e. at first kernel launch), and reproduce on gfx942 / MI300X. Baseline 0.4.6 never emitted these constructs, which is why it was unaffected.

1. Native `llvm.amdgcn.permlane64` lowering crashes the bundled LLVM 22.1.0 AMDGPU backend. Default `amdgpu_permlane64` to the existing LDS-roundtrip software emulation on every target (it produces identical results). Add `QD_AMDGPU_USE_NATIVE_PERMLANE64=1` to opt back into the native instruction once the backend bug is fixed; the old `QD_AMDGPU_FORCE_PERMLANE64_FALLBACK` is now the default and still honored. This is the actual crash fix.
2. The runtime module is compiled by the host x86_64 clang and only retargeted to amdgcn here, so `amdgpu_cross_half_shuffle_i32`'s `__asm__ volatile("" : "+v"(byte))` optimization barrier carries x86 flag clobbers (`~{dirflag},~{fpsr},~{flags}`) that are meaningless on AMDGPU. The IR verifies but the empty-body INLINEASM is invalid on the amdgcn target. Neutralize empty-body barrier asm during retarget (forward the tied value, then erase) so no stale host asm reaches codegen. On the wave64 targets we ship `ds_bpermute` already addresses the full wave, so the hint is a no-op.

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Misc] Warn user to disable caching when print_ir/QD_DUMP_IR enabled (Genesis-Embodied-AI#425) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> * [Build] Pin torch version to CUDA 12.8 for CUDA tests (Genesis-Embodied-AI#428) * [Misc] Fixing up taichi-dev urls (Genesis-Embodied-AI#429) * [Perf] Rename cuda_graph to gpu_graph across the codebase (Genesis-Embodied-AI#430) * Misc: fix typo integeral -> integral (Genesis-Embodied-AI#434) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> * [Perf] CUDA graph 4: call from multiple locations (Genesis-Embodied-AI#420) * [Bug] Fix fastcache not restoring graph_do_while_arg (Genesis-Embodied-AI#435) * [Perf] Cache last-call result in perf_dispatch for single-compatible case (Genesis-Embodied-AI#438) * Fix gpu_graph fallback on old Nvidia GPU. (Genesis-Embodied-AI#443) * Fix shared memory offset not reset between CUDA kernels. (Genesis-Embodied-AI#442) * [Misc] Allow disabling GPU graph via QD_GPU_GRAPH=0 env var (Genesis-Embodied-AI#439) * [Misc] Add named top-level loops (Genesis-Embodied-AI#440) * [Misc] Rename gpu_graph to graph (Genesis-Embodied-AI#446) * [Misc] Add cross-platform shuffle (Genesis-Embodied-AI#447) * [Bug] Fix graph_do_while on Windows: search for cudadevrt.lib (Genesis-Embodied-AI#456) * [Bug] Also search default CUDA toolkit install location on Windows (Genesis-Embodied-AI#461) * [SPIRV] Feature Parity Atomics & Shared Array (Genesis-Embodied-AI#432) * [Misc] Change clang format to 120 characters (Genesis-Embodied-AI#463) * [Misc] CUDA graph 5 Add fatbin (Genesis-Embodied-AI#464) * [Bug] Reuse VkInstance across init/reset cycles (Genesis-Embodied-AI#465) * [Perf] Tiles 1: _load, _store, _eye_ (Genesis-Embodied-AI#466) * [Misc] Remove dead InternalFuncStmt type_check override (Genesis-Embodied-AI#471) * [Perf] Tiles 2: add cholesky and ger (Genesis-Embodied-AI#472) * [Perf] Tiles 2b: add triangular solve (Genesis-Embodied-AI#474) * [Misc] Refactor: use _get_col/_set_col in tiles load/store/init 
(Genesis-Embodied-AI#475) * [Build] Fix flaky test_clock_accuracy (Genesis-Embodied-AI#436) * Fix AARCH64 emitting invalid asm in CUDA kernels. (Genesis-Embodied-AI#473) Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [AMDGPU] Enable HIP memory pool and surface pool-exhaustion errors. (Genesis-Embodied-AI#485) * [AMDGPU] Scope hsaco tmp dir per-user to avoid collisions. (Genesis-Embodied-AI#484) * [Perf] Tiles 3: Add slice syntax, qd.outer() and initial doc (Genesis-Embodied-AI#477) * [AMDGPU] Fix gradient computation. (Genesis-Embodied-AI#486) * Enable all backends that are supported in unit tests. (Genesis-Embodied-AI#488) * Fix SPIRV ID overflow for large kernels due to autodiff. (Genesis-Embodied-AI#489) * [Misc] Fix purity checker to allow accessing constants from quadrants modules (Genesis-Embodied-AI#487) * [Misc] Increase tolerance for clock monotonic test (Genesis-Embodied-AI#492) * [CI] Serialize api doc workflow (Genesis-Embodied-AI#494) * [CI] Increase tolerance for clock test (Genesis-Embodied-AI#506) * [CI] Increase clock test tolerance to 20% (Genesis-Embodied-AI#509) * [Perf] Add tensor_type parametrization to tile16 tests (Genesis-Embodied-AI#504) * [Perf] Tiles 4b: Migrate tiles16 tests to enable fastcache (Genesis-Embodied-AI#505) * [Perf] Tiles 4c: add Tiles16x16 proxy (Genesis-Embodied-AI#507) * [Perf] Tiles 4d: Consolidate slice error tests using parametrize (Genesis-Embodied-AI#508) * [Perf] Tiles 4: add SharedArray slice support (Genesis-Embodied-AI#482) * [Perf] Tiles 5: add Cholesky benchmark demo (Genesis-Embodied-AI#483) * [Doc] Add user guide page for subgroup shuffle (Genesis-Embodied-AI#512) * [Perf] Implement cross-platform shuffle_down (Genesis-Embodied-AI#510) * [Perf] Add portable subgroup reduce_add and reduce_all_add (Genesis-Embodied-AI#511) * [Perf] Add first warmup config to perf dispatch (Genesis-Embodied-AI#422) * [AutoDiff] Autodiff 1: Add baseline adstack regression test for unary_collections 
(Genesis-Embodied-AI#500) * [AutoDiff] Autodiff 2: Implement derivative for tan (Genesis-Embodied-AI#501) * [AutoDiff] Autodiff 3: Recompute tanh/exp on the operand in the reverse pass (Genesis-Embodied-AI#502) * [AutoDiff] Autodiff 4: Mark rsqrt as non-linear for adstack promotion (Genesis-Embodied-AI#503) * [AutoDiff] Autodiff 5: Fix adjoint-alloca placement for GlobalLoads outside the current range-for (Genesis-Embodied-AI#496) * [AutoDiff] Autodiff 6: Adstack regression tests (Genesis-Embodied-AI#491) * [AutoDiff] Autodiff 7: Fix header size in AdStackAllocaStmt to match u64 runtime layout (Genesis-Embodied-AI#534) * [AutoDiff] Autodiff 8: Surface LLVM adstack push/pop overflow as a Python exception (Genesis-Embodied-AI#535) * [AutoDiff] Autodiff 9: Guard against LLVM worker-thread stack overflow from large per-task adstack budget (Genesis-Embodied-AI#495) * [AutoDiff] Autodiff 10: Implement adstack for SPIR-V (Genesis-Embodied-AI#490) * [AutoDiff] Autodiff 11: Latent adstack-adjacent fixes (AMDGPU hipFree, flush() keeps ctx_buffers_, always-preallocate) (Genesis-Embodied-AI#536) * [Doc] Add AGENTS.md with instructions for AI agents (Genesis-Embodied-AI#541) * [Bug] Abort kernel execution on assertion failure instead of segfaulting (Genesis-Embodied-AI#419) * [Type] ndarray typing 1: Add eval_str=True to inspect.signature() calls (Genesis-Embodied-AI#411) * [CI] Suppress reportPrivateImportUsage in torch-using files (Genesis-Embodied-AI#552) * [Misc] QD_DUMP_IR dumps to files with the task_id added to the filename (Genesis-Embodied-AI#441) * [Type] ndarray typing 2: Fix NDArray single-arg subscript crash (Genesis-Embodied-AI#412) * [Test] Flush xdist channel before worker exit so test failure reports are visible (Genesis-Embodied-AI#555) * [CI] Reduce test retries on CI from 3 to 1. 
(Genesis-Embodied-AI#554) * [AutoDiff] Autodiff 12: Heap-backed adstack on LLVM backends (CPU/CUDA/AMDGPU) (Genesis-Embodied-AI#537) * [AutoDiff] Autodiff 13: Heap-backed adstack on SPIR-V backends (Metal, Vulkan) (Genesis-Embodied-AI#493) * [AutoDiff] Autodiff 14: Resolve bounded-inner-loop adstacks without default_ad_stack_size fallback (Genesis-Embodied-AI#539) * [SPIRV] Vulkan SPIR-V correctness: atomic-view aliasing, PSB stride, narrow storage caps, u1 cast, per-init layer recheck (Genesis-Embodied-AI#513) * [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch and sanitise MoltenVK cap advertisement (Genesis-Embodied-AI#551) * [Test] Suppress stock pytest-timeout to avoid conflict with pytest_hardtle (Genesis-Embodied-AI#557) * [Vulkan] Use SDK validation layer for debugPrintf instead of apt package (Genesis-Embodied-AI#562) * [Test] Fix flaky perf_dispatch tests by increasing work amounts (Genesis-Embodied-AI#559) * [Test] Add --maxfail CLI option to run_tests.py (default 20) (Genesis-Embodied-AI#558) * [CI] Vulkan debug printf fix to address flaky tests (Genesis-Embodied-AI#563) * [Docs] Add a new page to help for first time contributors (Genesis-Embodied-AI#426) Authored-by: v01dxyz <v01dxyz@v01d.xyz> * [AutoDiff] Autodiff 16: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr (Genesis-Embodied-AI#543) * Fix: raise error if device memory allocation fails (Genesis-Embodied-AI#451) (Genesis-Embodied-AI#453) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [CI] Add CI job to check line wrapping of comments and docs (Genesis-Embodied-AI#564) * [Misc] Add coverage report to PRs, including kernels (Genesis-Embodied-AI#470) * [CI] CI wrap check feeds only diffs to agent (Genesis-Embodied-AI#567) * Skip 'flaky' test on MacOS CI. 
(Genesis-Embodied-AI#573) * [Test] Fix missing `import sys` in test_fail_device_memory_allocation (Genesis-Embodied-AI#574) * [CI] Fix Vulkan debugPrintf flake with session-scoped warmup (Genesis-Embodied-AI#571) * [AutoDiff] determine_ad_stack_size: replace whole-CFG Bellman-Ford with SCC + DAG DP (Genesis-Embodied-AI#575) * [Test] Fix macOS OOM skip reason to describe actual root cause (Genesis-Embodied-AI#576) * [Lang] whole_kernel_cse: 2.5x compile time speedup on large kernels (Genesis-Embodied-AI#577) * [CI] Add CI check for unnecessarily deleted comments (Genesis-Embodied-AI#570) * [CI] Migrate coverage report to github Check page (Genesis-Embodied-AI#566) * [Lang] Skip IR verifier between passes unless debug=true (Genesis-Embodied-AI#579) * [Lang] Inline AdStack ops on release LLVM codegen: dramatically reduces compile time for adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#584) * [CUDA] Honor offline_cache=False end-to-end so QD_OFFLINE_CACHE=0 actually gives a cold compile (Genesis-Embodied-AI#580) * [Type] Tensor 24 (Genesis-Embodied-AI#561) Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> * [Lang] auto_diff host-walk reductions: dramatically faster front-end compile time on adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#587) * [AutoDiff] Speed up reverse-mode kernel launches on GPU backends (Genesis-Embodied-AI#578) * [Vulkan] Move adstack-sizer scratch out of Function-scope memory to fix SPIR-V pipeline build failures (Genesis-Embodied-AI#588) * [AutoDiff] Improve diagnosis of unsupported reverse-mode AD patterns (Genesis-Embodied-AI#590) * [Bug] Fix: promote Ndarray to AnyArray in build_Name for flattened struct fields (Genesis-Embodied-AI#592) * [SPIR-V] Shrink reverse-grad kernel MSL by ~50% (Genesis-Embodied-AI#591) * [CI] Add CI check that PR changes have test coverage (Genesis-Embodied-AI#596) * [Perf] Enable zero-copy in to_torch() and to_numpy() (Genesis-Embodied-AI#450) * Add 
BufferView: safe sub-range ndarray access for kernels (Genesis-Embodied-AI#585) Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [Doc] Add user-facing fastcache documentation (Genesis-Embodied-AI#597) Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> * [Misc] Upgrade to enable v1 dlpack so to_numpy(copy=False) writable (Genesis-Embodied-AI#598) Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local> * [AutoDiff] Cut reverse-mode adstack memory usage 10x on all backends (Genesis-Embodied-AI#599) * [Misc] Add CI check for feature file factorization (Genesis-Embodied-AI#606) * [Perf] Skip _recursive_set_args for all-Field frozen dataclass structs (Genesis-Embodied-AI#607) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] SNode-arm bound-expr capture rejects fold-attack gate indices (Genesis-Embodied-AI#610) * [Misc] Suppress field fastcache warning for qd.Tensor (Genesis-Embodied-AI#615) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] Adstack heap: clip reducer count by per-task loop trip count (compile-time and SizeExpr-evaluated) (Genesis-Embodied-AI#611) * [Misc] Forward copy= through qd.Tensor, add copy=None option (Genesis-Embodied-AI#616) Co-authored-by: Cursor <cursoragent@cursor.com> * [Doc] Update README (Genesis-Embodied-AI#617) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Fix coverage report showing def lines as uncovered (Genesis-Embodied-AI#623) Co-authored-by: Cursor <cursoragent@cursor.com> * [Perf] Generic launcher: persistent context, JIT-pointer reuse, Metal compute encoder, LLVM-GPU async memory ops (Part 1/2) (Genesis-Embodied-AI#619) * [CI] Encode Python-first testing policy in coverage-check prompt (Genesis-Embodied-AI#622) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Add PR Line change report (Genesis-Embodied-AI#624) Co-authored-by: Cursor <cursoragent@cursor.com> 
* [CI] Disable quadrants pytest plugin during quadrants internal coverage runs (Genesis-Embodied-AI#629) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] Adstack load+store eliminations: EliminateRecomputableAdStackPushes pass + leaf extensions (Genesis-Embodied-AI#621) * [CI] Simplify coverage PR comment to a single linked line (Genesis-Embodied-AI#630) * [CUDA] Add AGX Thor, SM_110 (Genesis-Embodied-AI#631) Co-authored-by: Johnny Nunez and Hugh Perkins * [CI] Lines changed report: collapse PR comment to a single linked totals line (Genesis-Embodied-AI#632) * [FEATURE] Support external Metal command queue via qd.init (Genesis-Embodied-AI#618) Co-authored-by: Cursor <cursoragent@cursor.com> * [Perf] Cache adstack-sizer metadata per task across SPIR-V + LLVM-GPU; per-snode / DeviceAllocation invalidation (Part 2/2) (Genesis-Embodied-AI#620) * [AutoDiff] Disable EliminateRecomputableAdStackPushes pending mutated-SNode chain-leaf fix (Genesis-Embodied-AI#633) * [AutoDiff] Adstack chain-clone safety: mutated-SNode leaf reject + load_top consumer-aware guard (Genesis-Embodied-AI#634) * [Docs] Add user-guide page for qd.simt.block.* primitives (Genesis-Embodied-AI#638) * [Docs] Expand qd.simt.subgroup user-guide page to cover every op (Genesis-Embodied-AI#639) * [Perf] Streams 1-4 (Genesis-Embodied-AI#410) * [Docs] Add user-guide page for matrix decompositions and solvers (Genesis-Embodied-AI#643) * [Bug] Revert "[Perf] Streams 1-4 (Genesis-Embodied-AI#410)" (Genesis-Embodied-AI#650) * [Docs] Add user-guide page for atomics and bit operations (Genesis-Embodied-AI#640) * [Docs] Add user-guide page for qd.simt.grid.* primitives (Genesis-Embodied-AI#641) * [AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch (Genesis-Embodied-AI#635) * [AMDGPU] Fix amdgpu parallel rand init (Genesis-Embodied-AI#658) * [Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap (Genesis-Embodied-AI#655) * [Perf] Re-land Streams 1-4 with bug fixes 
(Genesis-Embodied-AI#653)
* [AMDGPU] Apply device_memory_GB=0.3 cap to AMDGPU tests (Genesis-Embodied-AI#659)
* [Perf] Per-launch host sync: drop wait_idle on SPIR-V, pin stream and drop stream_synchronize on CUDA/AMDGPU (Genesis-Embodied-AI#654)
* [AMDGPU] Unload hipModule_t in JITModuleAMDGPU destructor (Genesis-Embodied-AI#660)
* [AMDGPU] Trim default mempool on qd.reset() (Genesis-Embodied-AI#669)
* [AMDGPU] Hoist rand-state buffer to process lifetime (Genesis-Embodied-AI#668)
* [Streams] Use events for streams serialization on AMDGPU and CUDA (Genesis-Embodied-AI#667)
* [Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id (Genesis-Embodied-AI#671)
* [SPIR-V] dispatch_max_reducers: register each task with the real kernel name (Genesis-Embodied-AI#675)
* [AutoDiff] Debug-mode field/grad/dual: dtype, layout, and access-time invariants (Genesis-Embodied-AI#677)
* [Docs] Add user-guide page for qd.algorithms.* device-wide algorithms (Genesis-Embodied-AI#642)
  Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>
* [Docs] Doc for existing atomics: switch support table to per-backend columns (Genesis-Embodied-AI#657)
  Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>
* [GPU] Cross gpu atomics (Genesis-Embodied-AI#666)
  Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>
* [GPU] Make block operations portable cross-gpu (Genesis-Embodied-AI#664)
* [Perf] CPU LLVM adstack-cache: skip per-launch bump-writes + ndarray_shapes capture on forward-only handles (Genesis-Embodied-AI#685)
* [GPU] Cross-GPU for grid ops (Genesis-Embodied-AI#670)
* [Math] Make bitop operations portable cross-gpu (Genesis-Embodied-AI#662)
* [AMDGPU] Always use wave64, on both RDNA and CDNA (Genesis-Embodied-AI#687)
* [AMDGPU] Use syncscope("agent") for atomix xor to avoid CAS livelock (Genesis-Embodied-AI#672)
* [GPU] New bit ops for QIPC (Genesis-Embodied-AI#679)
* [GPU] Subgroup ops cross-gpu (Genesis-Embodied-AI#665)
* [Graph] Rename CUDA Graph to Graph in docs (Genesis-Embodied-AI#691)
* [SPIR-V] Fix FIFO-queue ordering when sharing command queue. (Genesis-Embodied-AI#694)
* [Atomics] New QIPC ops for atomics (Genesis-Embodied-AI#690)
* Pass dataclass sub-structs into qd.func (Genesis-Embodied-AI#698)
* [AMDGPU] HIP graph runtime support for @qd.kernel(graph=True) (Genesis-Embodied-AI#692)
* [CI] Add per-file timing report to Mac Metal test job (Genesis-Embodied-AI#695)
  Co-authored-by: Cursor <cursoragent@cursor.com>
* [CI] Enable kernel disk cache during tests (Genesis-Embodied-AI#696)
* [Math] New QIPC ops for single-threaded linalg (Genesis-Embodied-AI#683)
* [BREAKING][GPU] New QIPC ops for subgroups (Genesis-Embodied-AI#676)
* [GPU] New QIPC ops for block (Genesis-Embodied-AI#684)
* [GPU] New device-level ops for QIPC (Genesis-Embodied-AI#693)
* [algorithms] PrefixSumExecutor: drop unused GRID_SZ local (Genesis-Embodied-AI#701)
* [block] sync(): fix unsupported-arch error message (Genesis-Embodied-AI#700)
* [volatile_load] add qd.volatile_load primitive (closes Genesis-Embodied-AI#648) (Genesis-Embodied-AI#702)
* [AutoDiff] Reject recycled identity_key in AdStackCache::register_adstack_sizing_info (Genesis-Embodied-AI#708)
* [Vulkan] Declare GroupNonUniform SPIR-V caps and enable shaderSubgroupExtendedTypes (Genesis-Embodied-AI#707)
* Fix duplicate HIP graph driver-function declarations after v1.0.0 merge

  The amd-integration fork had cherry-picked the HIP graph driver functions (graph_create / graph_destroy / graph_add_kernel_node / graph_instantiate / graph_exec_destroy / graph_launch), and upstream v1.0.0 added the same set. The per-file 3-way merge appended both copies into amdgpu_driver_functions.inc.h, producing redeclaration errors that broke the AMDGPU RHI/runtime compile. Drop the upstream duplicate block; the signatures are identical to the fork's existing declarations.

  Co-authored-by: Cursor <cursoragent@cursor.com>
* Fix AMDGPU launcher coherence and num_instructions visibility after v1.0.0 merge

  - kernel_launcher.cpp: the 3-way merge spliced upstream v1.0.0's launch_llvm_kernel rewrite (ephemeral arg/context buffers, explicit-stream path, AmdgpuDefaultStream PinGuard) onto the AMD fork's kernarg-by-value + persistent-scratch design, leaving references to undefined `ephemeral_context_ptr`. Restore the fork's coherent launch_llvm_kernel verbatim; it calls the (already merged) enhanced launch_offloaded_tasks, which keeps the max-reducer dispatch and stream-parallel groups adapted onto the AMD launch path.
  - llvm_context.h: both the fork and upstream added `num_instructions`; the merge kept upstream's private placement, but the AMDGPU codegen force-inline heuristic calls it statically from outside the class. Move it back to the public section.

  Co-authored-by: Cursor <cursoragent@cursor.com>
* Restore async result D2H and hoist kernarg vectors in AMDGPU launcher

  The v1.0.0 merge resolution regressed two amd-integration baseline optimizations in launch_llvm_kernel / launch_offloaded_tasks:
  - The per-launch result-buffer copy was a blocking memcpy_device_to_host, forcing a host stall on every value-returning launch and serializing the GPU pipeline. Restore the async D2H (the caller synchronizes lazily when it needs the value); external-array transfers still stream_synchronize once before reading back.
  - launch_task constructed the kernarg std::vectors from initializer lists ({kernarg_payload} / {kernarg_size}) on every dispatch (heap alloc + free per launch). Hoist arg_ptrs/arg_sizes out of the per-task launch and reuse.

  Co-authored-by: Cursor <cursoragent@cursor.com>
* amdgpu: default to LDS permlane64 emulation; drop host-x86 barrier asm on retarget

  Two AMDGPU JIT-compile crashes surfaced after the v1.0.0 merge pulled in the QIPC subgroup ops (Genesis-Embodied-AI#676), which made the rigid constraint solver's wave-cooperative reductions route through `amdgpu_cross_half_shuffle_i32`. Both manifested as a SIGSEGV inside `llvm::SIInstrInfo::getInstSizeInBytes` during `JITSessionAMDGPU::compile_module_to_hsaco` (i.e. at first kernel launch), and reproduce on gfx942 / MI300X. Baseline 0.4.6 never emitted these constructs, which is why it was unaffected.

  1. Native `llvm.amdgcn.permlane64` lowering crashes the bundled LLVM 22.1.0 AMDGPU backend. Default `amdgpu_permlane64` to the existing LDS-roundtrip software emulation on every target (it produces identical results). Add `QD_AMDGPU_USE_NATIVE_PERMLANE64=1` to opt back into the native instruction once the backend bug is fixed; the old `QD_AMDGPU_FORCE_PERMLANE64_FALLBACK` is now the default and still honored. This is the actual crash fix.
  2. The runtime module is compiled by the host x86_64 clang and only retargeted to amdgcn here, so `amdgpu_cross_half_shuffle_i32`'s `__asm__ volatile("" : "+v"(byte))` optimization barrier carries x86 flag clobbers (`~{dirflag},~{fpsr},~{flags}`) that are meaningless on AMDGPU. The IR verifies but the empty-body INLINEASM is invalid on the amdgcn target. Neutralize empty-body barrier asm during retarget (forward the tied value, then erase) so no stale host asm reaches codegen. On the wave64 targets we ship `ds_bpermute` already addresses the full wave, so the hint is a no-op.

  Co-authored-by: Cursor <cursoragent@cursor.com>
* style: apply clang-format (v19.1.7) to AMDGPU fn_attrs and launcher sources

  CI pre-commit's clang-format hook reformatted these files (long declarations/lambda signatures collapsed onto single lines per the repo's clang-format config). Apply the same formatting so the hook passes. No functional changes.

  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(amdgpu): use CreateNeg for branchless i32 sgn instead of CreateSub(0, input)

  clang-tidy (modernize-use-nullptr, -warnings-as-errors) flagged `builder->CreateSub(0, input)` in the i32 sgn path: the literal `0` binds to the `llvm::Value*` LHS parameter as a null pointer, not an integer zero. Replace with `builder->CreateNeg(input)`, which emits `0 - input` with a proper zero constant -- identical intended semantics, and clang-tidy clean.

  Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Robert Dazi <14996868+v01dXYZ@users.noreply.github.com>
Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>
Co-authored-by: Alexis DUBURCQ <alexis.duburcq@gmail.com>
Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>
Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>
Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com>
Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Johnny <johnnynuca14@gmail.com>
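
The commit message above says the LDS-roundtrip emulation of `amdgpu_permlane64` produces results identical to the native instruction. The following host-side Python model is a minimal sketch of why that can hold. It assumes that permlane64 swaps the lower and upper 32-lane halves of a wave64 (lane `i` reads from lane `i ^ 32`); the function names and the list standing in for LDS are hypothetical and are not taken from the quadrants runtime.

```python
WAVE_SIZE = 64          # wave64, as stated in the commit message
HALF = WAVE_SIZE // 2   # 32 lanes per half


def permlane64_native(values):
    """Assumed model of the native op: lane i reads the value of lane i ^ 32."""
    assert len(values) == WAVE_SIZE
    return [values[lane ^ HALF] for lane in range(WAVE_SIZE)]


def permlane64_lds_emulation(values):
    """Hypothetical LDS round trip: store to own slot, then load the partner's slot."""
    assert len(values) == WAVE_SIZE
    lds = [None] * WAVE_SIZE  # stand-in for a per-wave scratch region in LDS
    # Phase 1: every lane writes its value to the slot indexed by its lane id.
    for lane in range(WAVE_SIZE):
        lds[lane] = values[lane]
    # On real hardware the stores must be visible before any lane loads.
    # Phase 2: every lane reads the slot of its partner in the other half.
    return [lds[lane ^ HALF] for lane in range(WAVE_SIZE)]


values = [lane * 10 for lane in range(WAVE_SIZE)]
assert permlane64_lds_emulation(values) == permlane64_native(values)
```

In this model the two paths differ only in cost (a store and a load instead of one cross-lane instruction), which is consistent with the commit making the emulation the default and keeping `QD_AMDGPU_USE_NATIVE_PERMLANE64=1` as an opt-in.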

hughperkins and others added 30 commits May 7, 2026 15:16

[doc] Drop qd.simt.warp pointer from subgroup voting ops section

9638acf

Merge branch 'main' into hp/cross-gpu-subgroup

2f94298

[doc] Reflow more Python subgroup docstrings to 120c

b0fb964

CI wrap-checker on PR #665 flagged three more docstring blocks wrapping at 83-87c instead of 120c (`exclusive_add`, `test_subgroup_sync`, `test_subgroup_mem_fence`). Reflow those. No semantic changes.

[doc] Drop 'Cells marked no' sentence from subgroup.md

1525626

Stale carry-over from the days when several ops were one-backend stubs; no longer applies now that everything in the doc is universal.

Merge branch 'main' into hp/cross-gpu-subgroup

c2aa6dd

Apply pre-commit formatting (black, clang-format)

119b3fa

hughperkins added 2 commits May 14, 2026 06:16

docs/subgroup: replace en-dashes in numeric ranges with ASCII hyphens

3b9b9ad

hughperkins added 3 commits May 14, 2026 12:43

alanray-tech reviewed May 15, 2026

hughperkins added 5 commits May 15, 2026 06:47

alanray-tech approved these changes May 15, 2026

hughperkins added 2 commits May 15, 2026 07:46

Merge remote-tracking branch 'origin/main' into hp/new-qipc-ops-subgroup

b407520

# Conflicts: # docs/source/user_guide/decompositions.md

hughperkins merged commit 26f0c72 into main May 15, 2026
55 of 56 checks passed

hughperkins deleted the hp/new-qipc-ops-subgroup branch May 15, 2026 17:44

hughperkins mentioned this pull request May 18, 2026

[Vulkan] Declare GroupNonUniform SPIR-V caps and enable shaderSubgroupExtendedTypes #707

Merged

hughperkins commented May 10, 2026 • edited

Summary

Drive-by fixes (required by segmented_reduce_*, but useful in their own right)

Stacking

Test plan
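
For the `segmented_reduce_*` ops named in the headings above, the PR summary describes one ballot to build the head bitmask, one clz to find each lane's segment head, and a Hillis-Steele inclusive scan guarded by `distance >= offset`. The following host-side Python model is a sketch of those steps under stated assumptions: a 32-lane subgroup, plain Python lists in place of per-lane registers, and hypothetical function names. It is not the quadrants implementation.

```python
import operator

SUBGROUP_SIZE = 32  # assumed subgroup width for this model


def clz32(x):
    """Count leading zeros of a non-negative value that fits in 32 bits."""
    return 32 - x.bit_length()


def segmented_reduce(values, head_flags, log2_size, op):
    """Per-lane inclusive scan that resets at each head, within 2**log2_size lanes."""
    n = len(values)
    group_size = 1 << log2_size
    assert n == SUBGROUP_SIZE and group_size <= n

    # Step 1: one ballot, bit i is set iff lane i has a non-zero head flag.
    heads = 0
    for lane, flag in enumerate(head_flags):
        if flag != 0:
            heads |= 1 << lane

    # Step 2: one clz per lane finds the nearest head at or below the lane.
    distance = []
    for lane in range(n):
        base = lane & ~(group_size - 1)   # first lane of this lane's group
        mask = heads | (1 << base)        # the group's first lane is an implicit head
        mask &= (2 << lane) - 1           # keep only bits 0..lane
        head = 31 - clz32(mask)
        distance.append(lane - head)

    # Step 3: Hillis-Steele scan; a lane combines with its partner only when
    # the partner (lane - offset) lies inside the same segment.
    vals = list(values)
    offset = 1
    for _ in range(log2_size):
        prev = list(vals)                 # models all lanes shuffling at once
        for lane in range(n):
            if distance[lane] >= offset:
                vals[lane] = op(prev[lane], prev[lane - offset])
        offset <<= 1
    return vals


flags = [0] * SUBGROUP_SIZE
flags[3] = 1
out = segmented_reduce([1] * SUBGROUP_SIZE, flags, 3, operator.add)
assert out[:8] == [1, 2, 3, 1, 2, 3, 4, 5]
```

Because a lane whose partner would fall in another segment is skipped by the guard, no identity value is needed, which matches the summary's remark about `exclusive_min` / `exclusive_max`.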

github-actions Bot commented May 14, 2026

alanray-tech left a comment

Code Review Findings

SPIR-V visit(InternalFuncStmt) missing default error path

Documentation: log2_size valid range inconsistency

subgroupBallotU64 on CUDA: high-zero semantics

subgroup.py / reductions.py circular import fragility

AArch64 host: VGPR asm fence absent for AMDGPU bitcode

Vulkan PrefixSumExecutor assumes subgroup width == 32
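
The finding about `subgroupBallotU64` on CUDA can be pictured with a small host-side model. The PR summary defines ballot as a bitmask in which bit `i` is set iff lane `i`'s predicate is non-zero. The sketch below assumes that "high-zero semantics" means a 64-bit ballot on a 32-lane subgroup leaves bits 32..63 clear; the review body is not shown on this page, so that reading and the function names are assumptions.

```python
def ballot(predicates):
    """Bit i of the result is set iff predicates[i] is non-zero."""
    mask = 0
    for lane, predicate in enumerate(predicates):
        if predicate != 0:
            mask |= 1 << lane
    return mask


def ballot_u64(predicates):
    """Hypothetical 64-bit ballot: lanes that do not exist contribute zero bits."""
    assert len(predicates) <= 64
    return ballot(predicates) & 0xFFFF_FFFF_FFFF_FFFF


# A 32-lane subgroup (the CUDA warp width) where every third lane votes true.
predicates = [1 if lane % 3 == 0 else 0 for lane in range(32)]
mask = ballot_u64(predicates)

assert mask >> 32 == 0               # upper half is all zeros
assert bin(mask).count("1") == 11    # lanes 0, 3, ..., 30
```

Code that counts or scans the mask behaves the same for 32-lane and 64-lane subgroups under this assumption, since the missing lanes look like lanes whose predicate was zero.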

alanray-tech left a comment

github-actions Bot commented May 15, 2026

2 participants
