[Math] Make bitop operations portable cross-gpu by hughperkins · Pull Request #662 · Genesis-Embodied-AI/quadrants

hughperkins · 2026-05-08T21:34:05Z

Issue: #

Brief Summary


Adds popcnt and clz support on the AMDGPU backend, matching CUDA's dtype set: popcnt on i32, u32, i64, u64; clz on i32, i64. The lowerings emit the portable llvm.ctpop and llvm.ctlz intrinsics, which the AMDGPU LLVM backend further lowers to native bit-count instructions. The 64-bit results are truncated to i32 to match the documented i32 return type and CUDA's libdevice helpers (__nv_popcll / __nv_clzll). u32 / u64 inputs to clz remain rejected, matching the CUDA support matrix. Removes the AMDGPU pytest.xfail branches in test_popcnt and test_clz. Verified on MI300X (gfx942): tests/python/test_unary_ops.py passes (42 passed, 1 skipped, 1 xfail unrelated to this change).

Updates docs/source/user_guide/math.md so the support matrix and lowering notes reflect that AMDGPU now supports popcnt (i32, u32, i64, u64) and clz (i32, i64), with the same u32 / u64 clz restriction as CUDA. Drops the FIXME (AMDGPU) paragraph.

hughperkins · 2026-05-08T21:35:11Z

@@ -131,9 +128,6 @@ def test_u64(x: qd.uint64) -> qd.int32:

 @test_utils.test(arch=[qd.cpu, qd.metal, qd.cuda, qd.amdgpu, qd.vulkan])


why filter at all?

clz() counts leading zeros over the unsigned bit pattern: clz(0xFFFFFFFF) must be 0, clz(0) must be 32, etc. The previous lowering used FindSMsb, which treats its operand as signed: for negative inputs it locates the most significant 0 bit, and it returns -1 for an input of -1 (as it does for 0), so clz(-1) yielded 32 on Vulkan / Metal. FindUMsb gives the matching semantics; CUDA __nv_clz and the LLVM ctlz intrinsic already operate on the unsigned bit pattern.
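To make the difference concrete, here is a minimal pure-Python model of the two MSB searches. The helper names (`find_umsb`, `find_smsb`, `clz32_unsigned`, `clz32_signed`) are hypothetical and only mimic the behaviour described above; this is not the SPIR-V codegen itself.

```python
MASK32 = 0xFFFFFFFF


def find_umsb(x: int) -> int:
    """Index of the most significant 1 bit in the 32-bit pattern of x, or -1 if the pattern is zero."""
    return (x & MASK32).bit_length() - 1


def find_smsb(x: int) -> int:
    """Signed search; x is assumed to lie in [-2**31, 2**31).

    For negative x the most significant 0 bit is located, which is the most significant 1 bit of ~x.
    Both 0 and -1 give -1.
    """
    if x < 0:
        x = ~x
    return x.bit_length() - 1


def clz32_unsigned(x: int) -> int:
    """clz built on the unsigned search: counts over the raw bit pattern."""
    return 31 - find_umsb(x)


def clz32_signed(x: int) -> int:
    """clz built on the signed search: reproduces the pre-fix result for -1."""
    return 31 - find_smsb(x)
```

In this model `clz32_signed(-1)` evaluates to 32, while `clz32_unsigned(-1)` evaluates to 0, which is the contract the commit message describes.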

GLSL.std.450 FindUMsb is defined for 32-bit integers only, so 64-bit inputs cannot be lowered to a single ext-inst. Decompose into hi/lo i32 halves, run FindUMsb on each, and select the right half:

    if hi != 0: clz = 31 - FindUMsb(hi)
    else:       clz = 32 + (31 - FindUMsb(lo))

FindUMsb returns -1 on a zero input, so the all-zero case naturally gives clz(0) == 64. The 32-bit width result is then cast back to the operand's width (i64) so the value registered for this stmt matches the type the rest of the pipeline expects.

Vulkan device support for 64-bit ints is already gated by shaderInt64 in vulkan_device_creator.cpp; this change does not introduce a new runtime precondition.
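The hi/lo decomposition can be checked with a small pure-Python sketch. `find_umsb` and `clz64_via_halves` are hypothetical names that model the described lowering on plain integers; they are not taken from spirv_codegen.cpp.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def find_umsb(x: int) -> int:
    """Index of the most significant 1 bit in the 32-bit pattern of x, or -1 if the pattern is zero."""
    return (x & MASK32).bit_length() - 1


def clz64_via_halves(x: int) -> int:
    """Count leading zeros of the 64-bit pattern of x using only 32-bit MSB searches."""
    bits = x & MASK64
    hi = bits >> 32
    lo = bits & MASK32
    if hi != 0:
        return 31 - find_umsb(hi)
    # hi is all zeros, so its 32 bits are leading zeros; find_umsb(0) == -1 gives 64 for an all-zero input.
    return 32 + (31 - find_umsb(lo))
```

The cases mirror the ones listed for test_clz_i64: zero, one, top-half-only, low-half-only, both halves non-zero, and all bits set.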

Adds clz(0), clz(-1), clz(0x7FFFFFFF) to test_clz on i32, and a new test_clz_i64 function covering 0, 1, top-half-only, low-half-only, both-halves-non-zero, and -1 (all bits set). Pairs with the SPIR-V FindUMsb fix and the SPIR-V / AMDGPU i64 lowerings: pre-fix, the SPIR-V path returned 32 for clz(-1) and crashed (or silently truncated) on i64 input. Verified on CPU + AMDGPU (gfx942 / MI300X). Vulkan / Metal coverage to be exercised separately.

Updates the math.md support matrix and lowering descriptions to reflect:

- SPIR-V row for clz changes from "32-bit only; 64-bit input silently truncated" to "i32, i64 (FindUMsb-based)", with a footnote describing the hi/lo decomposition and the shaderInt64 precondition.
- 32-bit SPIR-V lowering rephrased to FindUMsb (matches the corrected unsigned-bit-pattern semantics).
- Adds the explicit clz(-1) == 0 and clz(0x7FFFFFFF) == 1 examples to the qd.math.clz section.
- Adds a portability note about preferring 32-bit reductions before clz on the SPIR-V hot path, since the i64 lowering costs ~two FindUMsb calls plus an OpSelect.

github-actions · 2026-05-08T22:23:55Z

Total: 3 file(s) changed, +68 -9 code lines.

github-actions · 2026-05-08T23:13:47Z

Diff coverage: 100% · 17 lines, 0 missing

Bare @test_utils.test() expands to every backend that is_arch_supported on the host, instead of an enumerated list. Net effects:

- test_popcnt picks up qd.metal (was previously missing it, the only reason a separate test_popcnt_metal existed).
- test_clz / test_clz_i64 stay at the same coverage but lose the enumerated-list maintenance burden as backends are added.

Removes test_popcnt_metal: its i32 / u32 assertions are a strict subset of the (now metal-inclusive) test_popcnt body, so keeping it would only duplicate the kernel compile on metal hosts.

github-actions · 2026-05-09T10:10:54Z

Total: 3 file(s) changed, +69 -24 code lines.

github-actions · 2026-05-09T10:54:25Z

Diff coverage: 100% · 18 lines, 0 missing

Three multi-line comment runs were wrapped at ~80c (the AI default) instead of the project's 120c target (`.clang-format ColumnLimit: 120`):

- spirv_codegen.cpp: 5-line FindUMsb-vs-FindSMsb rationale -> 4 lines.
- test_unary_ops.py: 3-line sign-bit / all-bits-set comment -> 2 lines.
- test_unary_ops.py: 3-line test_clz_i64 header comment -> 2 lines.

No semantic change. Found via find_underwrapped.py against the hp/amdgpu-popcnt-clz diff.

github-actions · 2026-05-09T12:51:13Z

Total: 3 file(s) changed, +69 -24 code lines.

github-actions · 2026-05-09T13:37:29Z

Diff coverage: 100% · 18 lines, 0 missing

hughperkins · 2026-05-09T15:51:49Z

 ```

-For `qd.u32` input on CUDA, cast first: `qd.math.clz(qd.bit_cast(x, qd.i32))`.
+For `qd.u32` input on CUDA or AMDGPU, cast first: `qd.math.clz(qd.bit_cast(x, qd.i32))`.


seems like this is not in fact general yet...

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f5ca838341

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-09T15:52:41Z

+        llvm_val[stmt] = builder->CreateTrunc(pop64, llvm::Type::getInt32Ty(*llvm_context));
+        stmt->ret_type = PrimitiveType::i32;


Fix 64-bit bit-count typing before AMDGPU codegen

For AMDGPU kernels where a 64-bit popcnt/clz result is used by another expression (for example qd.math.popcnt(x_i64) + 1), the frontend has already typed the unary result and any promoted constants as 64-bit before this codegen path runs. Emitting an i32 value here and only mutating stmt->ret_type locally leaves downstream BinaryOpStmts with mismatched LLVM operands (i32 from this node vs i64 from the promoted peer), so these new AMDGPU lowerings can fail to compile outside the direct-return cases covered by the tests; the return type needs to be established during type checking/lowering or the produced value needs to match the existing IR type consistently.

The 32-bit clz path used to call FindUMsb returning dst_type and then mix it with i32 immediates (`bitcnt - msb - 1`), which only typechecked when dst_type happened to be i32. For u32 inputs that path would assert in OpISub on the bitcnt / msb type mismatch. Run all clz arithmetic in i32 and only cast to dst_type at the end. This puts u32 / i32 / u64 / i64 on a single dispatch (by bit width); the same pattern the 64-bit path already used. No semantic change for the previously-supported i32 / i64 cases.

The bullets duplicated content already in the support matrix and the per-op descriptions; the SPIR-V 64-bit cost note in particular re-states what the hi/lo decomposition paragraph spells out. Removing the section to keep math.md focused on a single canonical description of each op.

hughperkins · 2026-05-09T16:14:25Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 83a0e9ad2b

chatgpt-codex-connector · 2026-05-09T16:18:11Z

+        llvm_val[stmt] = builder->CreateTrunc(pop64, llvm::Type::getInt32Ty(*llvm_context));
+        stmt->ret_type = PrimitiveType::i32;


Set bit-count result types before AMDGPU codegen

When a 64-bit qd.math.popcnt() result is used by another expression on AMDGPU, the frontend has already typed the unary expression as the 64-bit operand type, but this branch only later replaces the generated LLVM value with an i32. A case like qd.math.popcnt(x) + qd.u64(1) will therefore feed an i32 LLVM value into a binary op that was type-checked as 64-bit, producing invalid IR/type mismatches; the same pattern exists in the new 64-bit clz branch below. The result type needs to be inferred as i32 before downstream statements are type-checked, or codegen must continue to produce a value matching the existing 64-bit ret_type.

alanray-tech · 2026-05-10T18:11:28Z

Potential type mismatch when 64-bit popcnt/clz result feeds into a compound expression

The CUDA and AMDGPU codegen paths for 64-bit popcnt/clz truncate the LLVM value to i32 and mutate stmt->ret_type to PrimitiveType::i32 at codegen time. However, the type-checking pass (type_check.cpp) has already set ret_type to the operand type (i64) and used that to make type-promotion decisions for downstream statements.

This means a compound expression like:

```
@qd.kernel
def broken(x: qd.int64) -> qd.int64:
    return qd.math.popcnt(x) + qd.i64(1)
```

will fail at LLVM IR generation: the type checker saw popcnt as i64, decided no promotion was needed for the + i64(1), and emitted no cast_value. At codegen time, CreateAdd(i32, i64) is called, which triggers an LLVM assertion because add requires both operands to have the same type. The same issue applies to clz.

The existing tests only cover direct-return cases (return qd.math.popcnt(x)), which work by accident because the codegen-time ret_type mutation makes the downstream cast_value a no-op.

Note: this is not new to this PR — the CUDA popcnt i64 path has the same pattern pre-existing. This PR replicates it for AMDGPU and extends it to clz.

Proposed fix: set ret_type = PrimitiveType::i32 for popcnt/clz in the type-checking pass (type_check.cpp), so all downstream type promotion and codegen operate on the correct type from the start. The per-file changes would be:

quadrants/transforms/type_check.cpp — In visit(UnaryOpStmt*), add a special case after the existing logic_not handling:
```
} else if (stmt->op_type == UnaryOpType::popcnt || stmt->op_type == UnaryOpType::clz) {
    stmt->ret_type = PrimitiveType::i32;
}
```
This mirrors how logic_not already forces ret_type = u1 regardless of operand width.

quadrants/codegen/llvm/codegen_llvm.cpp (CPU backend) — The llvm.ctpop.i64 / llvm.ctlz.i64 intrinsics return i64, but ret_type is now i32. Add a trunc for 64-bit inputs:

```
else if (op == UnaryOpType::popcnt) {
    llvm_val[stmt] = builder->CreateIntrinsic(llvm::Intrinsic::ctpop, {input_type}, {input});
    if (input_type->isIntegerTy(64)) {
        llvm_val[stmt] = builder->CreateTrunc(llvm_val[stmt], llvm::Type::getInt32Ty(*llvm_context));
    }
}
```

Same pattern for clz.

quadrants/codegen/cuda/codegen_cuda.cpp — Remove the stmt->ret_type = PrimitiveType::i32 lines (type_check already handles this). __nv_popcll / __nv_clzll already return i32 from libdevice, so no other change needed.
quadrants/codegen/amdgpu/codegen_amdgpu.cpp — Remove the stmt->ret_type = PrimitiveType::i32 lines. Keep the trunc (since llvm.ctpop.i64 returns i64 but ret_type is now i32).
quadrants/codegen/spirv/spirv_codegen.cpp — For popcnt, OpBitCount on an i64 operand returns i64; add a cast to i32 after it (the clz path already computes in i32 and casts to dst_type at the end, which will now be i32 — no change needed there).

tests/python/test_unary_ops.py — Add compound-expression tests to cover the previously-broken case:

```
@qd.kernel
def test_popcnt_compound_i64(x: qd.int64) -> qd.int64:
    return qd.math.popcnt(x) + qd.i64(1)

@qd.kernel
def test_clz_compound_i64(x: qd.int64) -> qd.int64:
    return qd.math.clz(x) + qd.i64(1)
```

This approach is consistent with how logic_not already forces ret_type = u1 in type_check.cpp, and ensures all backends agree on the return type from the earliest stage of the pipeline.

Regression sentinel for the i32 return-type normalisation in e23c9f8: if the inferred ret_type for popcnt / clz / ffs were the operand's type instead of i32, promotion of `op(x: i64) + i64(1)` to i64 would skip the i32 -> i64 cast, and CUDA / AMDGPU codegen (which truncates the libdevice / llvm.ctpop result to i32) would emit `Add(i32, i64)` and trip an LLVM "operand type mismatch" assertion. The direct-return tests above don't compose the result with another operand, so they hide this class of bug; this test exercises the path that would actually crash pre-fix. Surfaces a real failure mode flagged in PR #662 review.

Previously the inferred ret_type for popcnt / clz was the operand's type, and the CUDA / AMDGPU codegens overrode it to i32 mid-codegen for the 64-bit cases. SPIR-V and x64 didn't override at all, so qd.math.popcnt(qd.u64(x)) returned u64 on Vulkan / Metal / CPU but i32 on CUDA / AMDGPU: same kernel source, different return types per backend.

Worse, because the CUDA / AMDGPU override happens at codegen time (after the type-checking pass that would normally insert promotion casts), a compound expression like `popcnt(x: i64) + i64(1)` skips the i32 -> i64 cast that type promotion would have emitted, and codegen then issues `Add(i32, i64)` and trips an LLVM "operand types must match" assertion at IR construction time. The direct-return tests in test_unary_ops.py mask this because they don't compose the result with any other operand.

Centralise the decision in transforms/type_check.cpp: popcnt / clz now have ret_type == i32 (or tensor-of-i32) regardless of operand width on every backend. This matches CUDA libdevice (__nv_popc / __nv_clz both return int) and the AMDGPU SALU bit-count / leading-zero ops (s_bcnt1_i32_b64, s_flbit_i32_b64) which already produce i32 in hardware. The unified result is always non-negative and fits in 7 bits (counts <= 64), so truncation on the wider backends is free, and downstream type promotion now sees the correct i32 type and can emit the right cast_value before codegen runs.

* type_check.cpp: set ret_type = i32 for popcnt / clz (incl. tensor case).
* codegen_llvm.cpp (x64): trunc llvm.ctpop / llvm.ctlz to i32 to match.
* spirv_codegen.cpp: cast OpBitCount result down to i32; the existing i32 result inside the clz path now binds to dst_type == i32 directly (cast becomes a no-op).
* math.md: document the unified i32 return; expand the support matrix to include the x64 column for completeness.

Issue surfaced in PR #662 review.
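The ordering problem can be illustrated with a toy model in Python. Everything here (`infer_unary_ret_type`, `emit_add`, `compile_popcnt_plus_one`, the string type names) is hypothetical and stands in for the type-check pass and LLVM codegen only schematically; it is a sketch of the failure mode, not the project's implementation.

```python
def infer_unary_ret_type(op: str, operand_type: str, normalise_to_i32: bool) -> str:
    """Toy type-check rule: optionally force bit-count ops to i32, else inherit the operand type."""
    if normalise_to_i32 and op in ("popcnt", "clz"):
        return "i32"
    return operand_type


def emit_add(lhs_type: str, rhs_type: str) -> str:
    """Toy codegen for add: like LLVM, both operands must already have the same type."""
    if lhs_type != rhs_type:
        raise TypeError(f"operand type mismatch: Add({lhs_type}, {rhs_type})")
    return lhs_type


def compile_popcnt_plus_one(normalise_to_i32: bool) -> str:
    """Model `popcnt(x: i64) + i64(1)` with a backend that always produces an i32 bit count."""
    inferred = infer_unary_ret_type("popcnt", "i64", normalise_to_i32)
    # Type promotion runs before codegen and only inserts a cast if the inferred types differ.
    cast_inserted = inferred != "i64"
    produced = "i32"  # what the truncating backend actually emits for the bit count
    lhs = "i64" if cast_inserted else produced
    return emit_add(lhs, "i64")
```

With `normalise_to_i32=True` the promotion step sees i32, inserts the widening cast, and the add is well-typed; with `False` the inferred and produced types disagree and the toy `emit_add` raises, mirroring the assertion described above.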

Regression sentinel for the i32 return-type normalisation in the previous commit: if the inferred ret_type for popcnt / clz were the operand's type instead of i32, promotion of `op(x: i64) + i64(1)` to i64 would skip the i32 -> i64 cast, and CUDA / AMDGPU codegen (which truncates the libdevice / llvm.ctpop result to i32) would emit `Add(i32, i64)` and trip an LLVM "operand type mismatch" assertion. The direct-return tests above don't compose the result with another operand, so they hide this class of bug; this test exercises the path that would actually crash pre-fix. Surfaces a real failure mode flagged in PR #662 review.

github-actions · 2026-05-10T20:19:33Z

Total: 7 file(s) changed, +123 -29 code lines.

CUDA libdevice already encodes the ffs(0) == 0 contract for both 32-bit (__nv_ffs) and 64-bit (__nv_ffsll), so each path is a one-line call(). LLVM IR is signless for integers so u32 / u64 route to the same intrinsic as their signed counterparts (matching the clz pattern landed in #662). ret_type is forced to i32 since both __nv_ffs and __nv_ffsll return int.
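For reference, the ffs contract mentioned above (1-based index of the least significant set bit, 0 for a zero input) can be modelled in a few lines of Python. The `ffs` helper and its explicit `bits` parameter are assumptions made for illustration; they are not the libdevice functions.

```python
def ffs(x: int, bits: int) -> int:
    """1-based position of the least significant set bit in the `bits`-wide pattern of x; 0 if none."""
    pattern = x & ((1 << bits) - 1)
    # pattern & -pattern isolates the lowest set bit; its bit_length is the 1-based position.
    return (pattern & -pattern).bit_length()
```

Because the input is masked to its bit pattern first, signed and unsigned values of the same width give the same answer, which is the signless behaviour the commit message relies on.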

hughperkins · 2026-05-10T20:39:29Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 417369ffbf

chatgpt-codex-connector · 2026-05-10T20:42:52Z

+    Accepts ``i32``, ``u32``, ``i64`` and ``u64`` on every supported backend; the result is always returned as an
+    ``i32`` in ``[0, bitwidth(a)]`` (``clz(0) == bitwidth(a)``). The count is over the unsigned bit pattern, so
+    ``clz(-1) == 0`` regardless of input signedness.


Fix qd.python clz semantics before documenting support

This new contract also covers users running kernels under qd.init(qd.python): Func.__call__ directly invokes the Python function on that backend, so qd.math.clz still executes the _clz helper below. That helper is hard-coded to 32 bits and treats negative values as having 32 leading zeros, so the newly documented cases like clz(-1) == 0 and 64-bit inputs still produce wrong results on the Python backend (e.g. clz(-1) returns 32). Please update the Python fallback or exclude qd.python from the promised support.

Previously the Python fallback in `_unary_operation` ran a hard-coded 32-bit helper for both ops:

    _clz(-1) == 32       # should be 0 (sign bit is set)
    _clz(0) == 32        # only right for i32; for i64 should be 64
    _clz(1 << 40) == 0   # silently truncated to 32-bit
    _popcnt(-1) == 1     # because bin(-1) is '-0b1'; should be 32 / 64

This path is reached on the `qd.python` backend (kernels run as plain Python, no C++ runtime) and on ad-hoc module-level calls; native backends compile via IR + codegen and never see it. The bugs were silent (wrong results, no exception), which is the worst failure mode for a function whose new contract explicitly covers `clz(-1) == 0` and i64 inputs.

These ops fundamentally can't have a correct pure-Python fallback: bit-count results depend on the operand's bitwidth, and once the value reaches here it's a plain Python int with no width attached. So replace both helpers with a NotImplementedError pointing the user at a real backend, rather than continue to silently return wrong answers.

Surfaced in PR #662 review (the codex bot comment on `clz`); popcnt has the same shape of bug and gets the same treatment for consistency.
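The width dependence is easy to demonstrate in plain Python. The helpers below are hypothetical and take the bit width as an explicit argument, which is exactly the information the fallback path does not have; they illustrate the reasoning and are not what the commit adds.

```python
def naive_popcnt(x: int) -> int:
    """Width-less popcount on a Python int; negative values are rendered as '-0b...'."""
    return bin(x).count("1")


def popcnt(x: int, bits: int) -> int:
    """Popcount over the `bits`-wide two's-complement pattern of x."""
    return bin(x & ((1 << bits) - 1)).count("1")


def clz(x: int, bits: int) -> int:
    """Leading zeros over the `bits`-wide two's-complement pattern of x."""
    return bits - (x & ((1 << bits) - 1)).bit_length()
```

The same Python value -1 has a popcount of 32 or 64 depending on the width supplied, and `clz(0, bits)` likewise depends on `bits`, so no single width-less answer can be right.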

github-actions · 2026-05-10T20:54:15Z

Diff coverage: 100% · 46 lines, 0 missing

github-actions · 2026-05-10T21:25:01Z

Total: 7 file(s) changed, +129 -36 code lines.

github-actions · 2026-05-10T22:17:19Z

Diff coverage: 96% · 50 lines, 2 missing

alanray-tech · 2026-05-11T16:19:28Z

LGTM, ok to merge

github-actions · 2026-05-11T16:56:10Z

Total: 7 file(s) changed, +129 -36 code lines.

CI's wrap check (claude-4.6-opus running ask-mode) flagged three comment lines at 99-102c that should have packed to 120c, plus a few more in the same diff with similar slack. Reflow the affected runs greedily; the remaining flagged runs after this commit are all >=113c (already at-target).

github-actions · 2026-05-11T17:42:52Z

Total: 7 file(s) changed, +129 -36 code lines.

github-actions · 2026-05-11T18:46:10Z

Diff coverage: 96% · 50 lines, 2 missing

…p merge

Merging hp/cross-gpu-subgroup (which also brought in the clz/popcnt generalisation from #662) into hp/new-qipc-ops-subgroup left conflicts in amdgpu / cuda / spirv codegen for the popcnt/clz path. I resolved those by taking `--theirs` on each file (the bit-op refactor strictly supersedes the HEAD version), but the three files *also* contained the ballot codegen added in 89772e8, and that change wasn't on `hp/cross-gpu-subgroup`, so taking `--theirs` silently dropped the ``subgroupBallotU32`` / ``subgroupBallotU64`` visitor branches on all three backends.

Symptom: every ``test_subgroup_ballot_*`` test crashed the Metal / Vulkan (MoltenVK) workers with

    [W spirv_codegen.cpp:spriv_message_consumer@3106] input [23:0:0] Id is 0
    [W spirv_codegen.cpp:run@3256] SPIRV optimization failed
    [E spirv_codegen.cpp:run@3297] SPIR-V optimization failed for 'foo_c94_0_0'

With the missing visitor branch, the InternalFuncStmt visitor falls through without assigning `val`, so a default-constructed `spirv::Value` (id=0) is registered for the ballot result, and spirv-opt rightly rejects the resulting module on validation. Linux CUDA / Vulkan / AMDGPU passed because they go through the LLVM codegen, where the bug manifests as `QD_NOT_IMPLEMENTED` only if the test reaches it, and the cluster run that gated the merge happened to schedule the ballot tests on CUDA first, where the corresponding LLVM branch had *also* been dropped but the test crash mode is different (kernel build failure, which the parametric expansion would have eventually flagged once the SPIR-V failures stopped masking everything; see PR #676 CI).

Fix: re-add the exact same branches (cherry-picked from 89772e8) for all three backends:

* AMDGPU: ``subgroupBallotU32`` → ``call("amdgpu_ballot_i32", ...)``; ``subgroupBallotU64`` → ``call("amdgpu_ballot_u64", ...)`` (runtime stub + ``llvm.amdgcn.ballot.i64`` intrinsic patch were already preserved in `runtime/llvm/runtime_module/runtime.cpp` and `runtime/llvm/llvm_context.cpp`).
* CUDA: ``subgroupBallotU32`` → ``call("cuda_ballot_i32", ...)``; ``subgroupBallotU64`` → same call zext'd to i64 (CUDA warps are 32 lanes; high half of the u64 result is always zero by construction).
* SPIR-V: ``subgroupBallotU32`` → ``OpGroupNonUniformBallot`` + extract component 0 of the uvec4; ``subgroupBallotU64`` → extract components 0 and 1 and pack as ``u64(hi) << 32 | u64(lo)`` (wave32 hosts naturally get hi=0 since lanes 32+ do not contribute).

No functional change vs. 89772e8; this is purely undoing the accidental delete from the merge resolution.
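The u64 packing used for the SPIR-V ballot, and the zero-extension used on CUDA, reduce to simple integer arithmetic. The sketch below models both on plain Python ints; `pack_ballot_u64` and `zext_u32_to_u64` are hypothetical names, not functions from the codebase.

```python
MASK32 = 0xFFFFFFFF


def pack_ballot_u64(lo: int, hi: int) -> int:
    """Combine two 32-bit ballot components as u64(hi) << 32 | u64(lo)."""
    return ((hi & MASK32) << 32) | (lo & MASK32)


def zext_u32_to_u64(value: int) -> int:
    """Zero-extend a 32-bit ballot mask to 64 bits; the high half is always zero."""
    return value & MASK32
```

On a 32-lane subgroup the high component is zero, so `pack_ballot_u64(lo, 0)` and `zext_u32_to_u64(lo)` agree, which is consistent with the note that wave32 hosts get hi=0.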

* [Misc] Warn user to disable caching when print_ir/QD_DUMP_IR enabled (Genesis-Embodied-AI#425)
  Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
* [Build] Pin torch version to CUDA 12.8 for CUDA tests (Genesis-Embodied-AI#428)
* [Misc] Fixing up taichi-dev urls (Genesis-Embodied-AI#429)
* [Perf] Rename cuda_graph to gpu_graph across the codebase (Genesis-Embodied-AI#430)
* Misc: fix typo integeral -> integral (Genesis-Embodied-AI#434)
  Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
* [Perf] CUDA graph 4: call from multiple locations (Genesis-Embodied-AI#420)
* [Bug] Fix fastcache not restoring graph_do_while_arg (Genesis-Embodied-AI#435)
* [Perf] Cache last-call result in perf_dispatch for single-compatible case (Genesis-Embodied-AI#438)
* Fix gpu_graph fallback on old Nvidia GPU. (Genesis-Embodied-AI#443)
* Fix shared memory offset not reset between CUDA kernels. (Genesis-Embodied-AI#442)
* [Misc] Allow disabling GPU graph via QD_GPU_GRAPH=0 env var (Genesis-Embodied-AI#439)
* [Misc] Add named top-level loops (Genesis-Embodied-AI#440)
* [Misc] Rename gpu_graph to graph (Genesis-Embodied-AI#446)
* [Misc] Add cross-platform shuffle (Genesis-Embodied-AI#447)
* [Bug] Fix graph_do_while on Windows: search for cudadevrt.lib (Genesis-Embodied-AI#456)
* [Bug] Also search default CUDA toolkit install location on Windows (Genesis-Embodied-AI#461)
* [SPIRV] Feature Parity Atomics & Shared Array (Genesis-Embodied-AI#432)
* [Misc] Change clang format to 120 characters (Genesis-Embodied-AI#463)
* [Misc] CUDA graph 5 Add fatbin (Genesis-Embodied-AI#464)
* [Bug] Reuse VkInstance across init/reset cycles (Genesis-Embodied-AI#465)
* [Perf] Tiles 1: _load, _store, _eye_ (Genesis-Embodied-AI#466)
* [Misc] Remove dead InternalFuncStmt type_check override (Genesis-Embodied-AI#471)
* [Perf] Tiles 2: add cholesky and ger (Genesis-Embodied-AI#472)
* [Perf] Tiles 2b: add triangular solve (Genesis-Embodied-AI#474)
* [Misc] Refactor: use _get_col/_set_col in tiles load/store/init (Genesis-Embodied-AI#475)
* [Build] Fix flaky test_clock_accuracy (Genesis-Embodied-AI#436)
* Fix AARCH64 emitting invalid asm in CUDA kernels. (Genesis-Embodied-AI#473)
  Co-authored-by: Hugh Perkins <hughperkins@gmail.com>
* [AMDGPU] Enable HIP memory pool and surface pool-exhaustion errors. (Genesis-Embodied-AI#485)
* [AMDGPU] Scope hsaco tmp dir per-user to avoid collisions. (Genesis-Embodied-AI#484)
* [Perf] Tiles 3: Add slice syntax, qd.outer() and initial doc (Genesis-Embodied-AI#477)
* [AMDGPU] Fix gradient computation. (Genesis-Embodied-AI#486)
* Enable all backends that are supported in unit tests. (Genesis-Embodied-AI#488)
* Fix SPIRV ID overflow for large kernels due to autodiff. (Genesis-Embodied-AI#489)
* [Misc] Fix purity checker to allow accessing constants from quadrants modules (Genesis-Embodied-AI#487)
* [Misc] Increase tolerance for clock monotonic test (Genesis-Embodied-AI#492)
* [CI] Serialize api doc workflow (Genesis-Embodied-AI#494)
* [CI] Increase tolerance for clock test (Genesis-Embodied-AI#506)
* [CI] Increase clock test tolerance to 20% (Genesis-Embodied-AI#509)
* [Perf] Add tensor_type parametrization to tile16 tests (Genesis-Embodied-AI#504)
* [Perf] Tiles 4b: Migrate tiles16 tests to enable fastcache (Genesis-Embodied-AI#505)
* [Perf] Tiles 4c: add Tiles16x16 proxy (Genesis-Embodied-AI#507)
* [Perf] Tiles 4d: Consolidate slice error tests using parametrize (Genesis-Embodied-AI#508)
* [Perf] Tiles 4: add SharedArray slice support (Genesis-Embodied-AI#482)
* [Perf] Tiles 5: add Cholesky benchmark demo (Genesis-Embodied-AI#483)
* [Doc] Add user guide page for subgroup shuffle (Genesis-Embodied-AI#512)
* [Perf] Implement cross-platform shuffle_down (Genesis-Embodied-AI#510)
* [Perf] Add portable subgroup reduce_add and reduce_all_add (Genesis-Embodied-AI#511)
* [Perf] Add first warmup config to perf dispatch (Genesis-Embodied-AI#422)
* [AutoDiff] Autodiff 1: Add baseline adstack regression test for unary_collections (Genesis-Embodied-AI#500)
* [AutoDiff] Autodiff 2: Implement derivative for tan (Genesis-Embodied-AI#501)
* [AutoDiff] Autodiff 3: Recompute tanh/exp on the operand in the reverse pass (Genesis-Embodied-AI#502)
* [AutoDiff] Autodiff 4: Mark rsqrt as non-linear for adstack promotion (Genesis-Embodied-AI#503)
* [AutoDiff] Autodiff 5: Fix adjoint-alloca placement for GlobalLoads outside the current range-for (Genesis-Embodied-AI#496)
* [AutoDiff] Autodiff 6: Adstack regression tests (Genesis-Embodied-AI#491)
* [AutoDiff] Autodiff 7: Fix header size in AdStackAllocaStmt to match u64 runtime layout (Genesis-Embodied-AI#534)
* [AutoDiff] Autodiff 8: Surface LLVM adstack push/pop overflow as a Python exception (Genesis-Embodied-AI#535)
* [AutoDiff] Autodiff 9: Guard against LLVM worker-thread stack overflow from large per-task adstack budget (Genesis-Embodied-AI#495)
* [AutoDiff] Autodiff 10: Implement adstack for SPIR-V (Genesis-Embodied-AI#490)
* [AutoDiff] Autodiff 11: Latent adstack-adjacent fixes (AMDGPU hipFree, flush() keeps ctx_buffers_, always-preallocate) (Genesis-Embodied-AI#536)
* [Doc] Add AGENTS.md with instructions for AI agents (Genesis-Embodied-AI#541)
* [Bug] Abort kernel execution on assertion failure instead of segfaulting (Genesis-Embodied-AI#419)
* [Type] ndarray typing 1: Add eval_str=True to inspect.signature() calls (Genesis-Embodied-AI#411)
* [CI] Suppress reportPrivateImportUsage in torch-using files (Genesis-Embodied-AI#552)
* [Misc] QD_DUMP_IR dumps to files with the task_id added to the filename (Genesis-Embodied-AI#441)
* [Type] ndarray typing 2: Fix NDArray single-arg subscript crash (Genesis-Embodied-AI#412)
* [Test] Flush xdist channel before worker exit so test failure reports are visible (Genesis-Embodied-AI#555)
* [CI] Reduce test retries on CI from 3 to 1. (Genesis-Embodied-AI#554)
* [AutoDiff] Autodiff 12: Heap-backed adstack on LLVM backends (CPU/CUDA/AMDGPU) (Genesis-Embodied-AI#537)
* [AutoDiff] Autodiff 13: Heap-backed adstack on SPIR-V backends (Metal, Vulkan) (Genesis-Embodied-AI#493)
* [AutoDiff] Autodiff 14: Resolve bounded-inner-loop adstacks without default_ad_stack_size fallback (Genesis-Embodied-AI#539)
* [SPIRV] Vulkan SPIR-V correctness: atomic-view aliasing, PSB stride, narrow storage caps, u1 cast, per-init layer recheck (Genesis-Embodied-AI#513)
* [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch and sanitise MoltenVK cap advertisement (Genesis-Embodied-AI#551)
* [Test] Suppress stock pytest-timeout to avoid conflict with pytest_hardtle (Genesis-Embodied-AI#557)
* [Vulkan] Use SDK validation layer for debugPrintf instead of apt package (Genesis-Embodied-AI#562)
* [Test] Fix flaky perf_dispatch tests by increasing work amounts (Genesis-Embodied-AI#559)
* [Test] Add --maxfail CLI option to run_tests.py (default 20) (Genesis-Embodied-AI#558)
* [CI] Vulkan debug printf fix to address flaky tests (Genesis-Embodied-AI#563)
* [Docs] Add a new page to help for first time contributors (Genesis-Embodied-AI#426)
  Authored-by: v01dxyz <v01dxyz@v01d.xyz>
* [AutoDiff] Autodiff 16: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr (Genesis-Embodied-AI#543)
* Fix: raise error if device memory allocation fails (Genesis-Embodied-AI#451) (Genesis-Embodied-AI#453)
  Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
  Co-authored-by: Hugh Perkins <hughperkins@gmail.com>
* [CI] Add CI job to check line wrapping of comments and docs (Genesis-Embodied-AI#564)
* [Misc] Add coverage report to PRs, including kernels (Genesis-Embodied-AI#470)
* [CI] CI wrap check feeds only diffs to agent (Genesis-Embodied-AI#567)
* Skip 'flaky' test on MacOS CI. (Genesis-Embodied-AI#573)
* [Test] Fix missing `import sys` in test_fail_device_memory_allocation (Genesis-Embodied-AI#574)
* [CI] Fix Vulkan debugPrintf flake with session-scoped warmup (Genesis-Embodied-AI#571)
* [AutoDiff] determine_ad_stack_size: replace whole-CFG Bellman-Ford with SCC + DAG DP (Genesis-Embodied-AI#575)
* [Test] Fix macOS OOM skip reason to describe actual root cause (Genesis-Embodied-AI#576)
* [Lang] whole_kernel_cse: 2.5x compile time speedup on large kernels (Genesis-Embodied-AI#577)
* [CI] Add CI check for unnecessarily deleted comments (Genesis-Embodied-AI#570)
* [CI] Migrate coverage report to github Check page (Genesis-Embodied-AI#566)
* [Lang] Skip IR verifier between passes unless debug=true (Genesis-Embodied-AI#579)
* [Lang] Inline AdStack ops on release LLVM codegen: dramatically reduces compile time for adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#584)
* [CUDA] Honor offline_cache=False end-to-end so QD_OFFLINE_CACHE=0 actually gives a cold compile (Genesis-Embodied-AI#580)
* [Type] Tensor 24 (Genesis-Embodied-AI#561)
  Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>
* [Lang] auto_diff host-walk reductions: dramatically faster front-end compile time on adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#587)
* [AutoDiff] Speed up reverse-mode kernel launches on GPU backends (Genesis-Embodied-AI#578)
* [Vulkan] Move adstack-sizer scratch out of Function-scope memory to fix SPIR-V pipeline build failures (Genesis-Embodied-AI#588)
* [AutoDiff] Improve diagnosis of unsupported reverse-mode AD patterns (Genesis-Embodied-AI#590)
* [Bug] Fix: promote Ndarray to AnyArray in build_Name for flattened struct fields (Genesis-Embodied-AI#592)
* [SPIR-V] Shrink reverse-grad kernel MSL by ~50% (Genesis-Embodied-AI#591)
* [CI] Add CI check that PR changes have test coverage (Genesis-Embodied-AI#596)
* [Perf] Enable zero-copy in to_torch() and to_numpy() (Genesis-Embodied-AI#450)
* Add BufferView: safe sub-range ndarray access for kernels (Genesis-Embodied-AI#585)
  Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com>
  Co-authored-by: Hugh Perkins <hughperkins@gmail.com>
* [Doc] Add user-facing fastcache documentation (Genesis-Embodied-AI#597)
  Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>
* [Misc] Upgrade to enable v1 dlpack so to_numpy(copy=False) writable (Genesis-Embodied-AI#598)
  Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local>
* [AutoDiff] Cut reverse-mode adstack memory usage 10x on all backends (Genesis-Embodied-AI#599)
* [Misc] Add CI check for feature file factorization (Genesis-Embodied-AI#606)
* [Perf] Skip _recursive_set_args for all-Field frozen dataclass structs (Genesis-Embodied-AI#607)
  Co-authored-by: Cursor <cursoragent@cursor.com>
* [AutoDiff] SNode-arm bound-expr capture rejects fold-attack gate indices (Genesis-Embodied-AI#610)
* [Misc] Suppress field fastcache warning for qd.Tensor (Genesis-Embodied-AI#615)
  Co-authored-by: Cursor <cursoragent@cursor.com>
* [AutoDiff] Adstack heap: clip reducer count by per-task loop trip count (compile-time and SizeExpr-evaluated) (Genesis-Embodied-AI#611)
* [Misc] Forward copy= through qd.Tensor, add copy=None option (Genesis-Embodied-AI#616)
  Co-authored-by: Cursor <cursoragent@cursor.com>
* [Doc] Update README (Genesis-Embodied-AI#617)
  Co-authored-by: Cursor <cursoragent@cursor.com>
* [CI] Fix coverage report showing def lines as uncovered (Genesis-Embodied-AI#623)
  Co-authored-by: Cursor <cursoragent@cursor.com>
* [Perf] Generic launcher: persistent context, JIT-pointer reuse, Metal compute encoder, LLVM-GPU async memory ops (Part 1/2) (Genesis-Embodied-AI#619)
* [CI] Encode Python-first testing policy in coverage-check prompt (Genesis-Embodied-AI#622)
  Co-authored-by: Cursor <cursoragent@cursor.com>
* [CI] Add PR Line change report (Genesis-Embodied-AI#624)
  Co-authored-by: Cursor <cursoragent@cursor.com>
* [CI] Disable quadrants pytest plugin during quadrants internal coverage runs (Genesis-Embodied-AI#629) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] Adstack load+store eliminations: EliminateRecomputableAdStackPushes pass + leaf extensions (Genesis-Embodied-AI#621) * [CI] Simplify coverage PR comment to a single linked line (Genesis-Embodied-AI#630) * [CUDA] Add AGX Thor, SM_110 (Genesis-Embodied-AI#631) Co-authored-by: Johnny Nunez and Hugh Perkins * [CI] Lines changed report: collapse PR comment to a single linked totals line (Genesis-Embodied-AI#632) * [FEATURE] Support external Metal command queue via qd.init (Genesis-Embodied-AI#618) Co-authored-by: Cursor <cursoragent@cursor.com> * [Perf] Cache adstack-sizer metadata per task across SPIR-V + LLVM-GPU; per-snode / DeviceAllocation invalidation (Part 2/2) (Genesis-Embodied-AI#620) * [AutoDiff] Disable EliminateRecomputableAdStackPushes pending mutated-SNode chain-leaf fix (Genesis-Embodied-AI#633) * [AutoDiff] Adstack chain-clone safety: mutated-SNode leaf reject + load_top consumer-aware guard (Genesis-Embodied-AI#634) * [Docs] Add user-guide page for qd.simt.block.* primitives (Genesis-Embodied-AI#638) * [Docs] Expand qd.simt.subgroup user-guide page to cover every op (Genesis-Embodied-AI#639) * [Perf] Streams 1-4 (Genesis-Embodied-AI#410) * [Docs] Add user-guide page for matrix decompositions and solvers (Genesis-Embodied-AI#643) * [Bug] Revert "[Perf] Streams 1-4 (Genesis-Embodied-AI#410)" (Genesis-Embodied-AI#650) * [Docs] Add user-guide page for atomics and bit operations (Genesis-Embodied-AI#640) * [Docs] Add user-guide page for qd.simt.grid.* primitives (Genesis-Embodied-AI#641) * [AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch (Genesis-Embodied-AI#635) * [AMDGPU] Fix amdgpu parallel rand init (Genesis-Embodied-AI#658) * [Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap (Genesis-Embodied-AI#655) * [Perf] Re-land Streams 1-4 with bug fixes 
(Genesis-Embodied-AI#653) * [AMDGPU] Apply device_memory_GB=0.3 cap to AMDGPU tests (Genesis-Embodied-AI#659) * [Perf] Per-launch host sync: drop wait_idle on SPIR-V, pin stream and drop stream_synchronize on CUDA/AMDGPU (Genesis-Embodied-AI#654) * [AMDGPU] Unload hipModule_t in JITModuleAMDGPU destructor (Genesis-Embodied-AI#660) * [AMDGPU] Trim default mempool on qd.reset() (Genesis-Embodied-AI#669) * [AMDGPU] Hoist rand-state buffer to process lifetime (Genesis-Embodied-AI#668) * [Streams] Use events for streams serialization on AMDGPU and CUDA (Genesis-Embodied-AI#667) * [Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id (Genesis-Embodied-AI#671) * [SPIR-V] dispatch_max_reducers: register each task with the real kernel name (Genesis-Embodied-AI#675) * [AutoDiff] Debug-mode field/grad/dual: dtype, layout, and access-time invariants (Genesis-Embodied-AI#677) * [Docs] Add user-guide page for qd.algorithms.* device-wide algorithms (Genesis-Embodied-AI#642) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [Docs] Doc for existing atomics: switch support table to per-backend columns (Genesis-Embodied-AI#657) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [GPU] Cross gpu atomics (Genesis-Embodied-AI#666) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [GPU] Make block operations portable cross-gpu (Genesis-Embodied-AI#664) * [Perf] CPU LLVM adstack-cache: skip per-launch bump-writes + ndarray_shapes capture on forward-only handles (Genesis-Embodied-AI#685) * [GPU] Cross-GPU for grid ops (Genesis-Embodied-AI#670) * [Math] Make bitop operations portable cross-gpu (Genesis-Embodied-AI#662) * [AMDGPU] Always use wave64, on both RDNA and CDNA (Genesis-Embodied-AI#687) * [AMDGPU] Use syncscope("agent") for atomix xor to avoid CAS livelock (Genesis-Embodied-AI#672) * [GPU] New bit ops for QIPC (Genesis-Embodied-AI#679) * [GPU] Subgroup ops cross-gpu (Genesis-Embodied-AI#665) * [Graph] Rename CUDA 
Graph to Graph in docs (Genesis-Embodied-AI#691)
* [SPIR-V] Fix FIFO-queue ordering when sharing command queue. (Genesis-Embodied-AI#694)
* [Atomics] New QIPC ops for atomics (Genesis-Embodied-AI#690)
* Pass dataclass sub-structs into qd.func (Genesis-Embodied-AI#698)
* [AMDGPU] HIP graph runtime support for @qd.kernel(graph=True) (Genesis-Embodied-AI#692)
* [CI] Add per-file timing report to Mac Metal test job (Genesis-Embodied-AI#695)
  Co-authored-by: Cursor <cursoragent@cursor.com>
* [CI] Enable kernel disk cache during tests (Genesis-Embodied-AI#696)
* [Math] New QIPC ops for single-threaded linalg (Genesis-Embodied-AI#683)
* [BREAKING][GPU] New QIPC ops for subgroups (Genesis-Embodied-AI#676)
* [GPU] New QIPC ops for block (Genesis-Embodied-AI#684)
* [GPU] New device-level ops for QIPC (Genesis-Embodied-AI#693)
* [algorithms] PrefixSumExecutor: drop unused GRID_SZ local (Genesis-Embodied-AI#701)
* [block] sync(): fix unsupported-arch error message (Genesis-Embodied-AI#700)
* [volatile_load] add qd.volatile_load primitive (closes Genesis-Embodied-AI#648) (Genesis-Embodied-AI#702)
* [AutoDiff] Reject recycled identity_key in AdStackCache::register_adstack_sizing_info (Genesis-Embodied-AI#708)
* [Vulkan] Declare GroupNonUniform SPIR-V caps and enable shaderSubgroupExtendedTypes (Genesis-Embodied-AI#707)
* Fix duplicate HIP graph driver-function declarations after v1.0.0 merge

  The amd-integration fork had cherry-picked the HIP graph driver functions (graph_create / graph_destroy / graph_add_kernel_node / graph_instantiate / graph_exec_destroy / graph_launch), and upstream v1.0.0 added the same set. The per-file 3-way merge appended both copies into amdgpu_driver_functions.inc.h, producing redeclaration errors that broke the AMDGPU RHI/runtime compile. Drop the upstream duplicate block; the signatures are identical to the fork's existing declarations.

  Co-authored-by: Cursor <cursoragent@cursor.com>
* Fix AMDGPU launcher coherence and num_instructions visibility after v1.0.0 merge

  - kernel_launcher.cpp: the 3-way merge spliced upstream v1.0.0's launch_llvm_kernel rewrite (ephemeral arg/context buffers, explicit-stream path, AmdgpuDefaultStream PinGuard) onto the AMD fork's kernarg-by-value + persistent-scratch design, leaving references to undefined `ephemeral_context_ptr`. Restore the fork's coherent launch_llvm_kernel verbatim; it calls the (already merged) enhanced launch_offloaded_tasks, which keeps the max-reducer dispatch and stream-parallel groups adapted onto the AMD launch path.
  - llvm_context.h: both the fork and upstream added `num_instructions`; the merge kept upstream's private placement, but the AMDGPU codegen force-inline heuristic calls it statically from outside the class. Move it back to the public section.

  Co-authored-by: Cursor <cursoragent@cursor.com>
* Restore async result D2H and hoist kernarg vectors in AMDGPU launcher

  The v1.0.0 merge resolution regressed two amd-integration baseline optimizations in launch_llvm_kernel / launch_offloaded_tasks:

  - The per-launch result-buffer copy was a blocking memcpy_device_to_host, forcing a host stall on every value-returning launch and serializing the GPU pipeline. Restore the async D2H (the caller synchronizes lazily when it needs the value); external-array transfers still stream_synchronize once before reading back.
  - launch_task constructed the kernarg std::vectors from initializer lists ({kernarg_payload} / {kernarg_size}) on every dispatch (heap alloc + free per launch). Hoist arg_ptrs/arg_sizes out of the per-task launch and reuse.

  Co-authored-by: Cursor <cursoragent@cursor.com>
* amdgpu: default to LDS permlane64 emulation; drop host-x86 barrier asm on retarget

  Two AMDGPU JIT-compile crashes surfaced after the v1.0.0 merge pulled in the QIPC subgroup ops (Genesis-Embodied-AI#676), which made the rigid constraint solver's wave-cooperative reductions route through `amdgpu_cross_half_shuffle_i32`. Both manifested as a SIGSEGV inside `llvm::SIInstrInfo::getInstSizeInBytes` during `JITSessionAMDGPU::compile_module_to_hsaco` (i.e. at first kernel launch), and reproduce on gfx942 / MI300X. Baseline 0.4.6 never emitted these constructs, which is why it was unaffected.

  1. Native `llvm.amdgcn.permlane64` lowering crashes the bundled LLVM 22.1.0 AMDGPU backend. Default `amdgpu_permlane64` to the existing LDS-roundtrip software emulation on every target (it produces identical results). Add `QD_AMDGPU_USE_NATIVE_PERMLANE64=1` to opt back into the native instruction once the backend bug is fixed; the old `QD_AMDGPU_FORCE_PERMLANE64_FALLBACK` is now the default and still honored. This is the actual crash fix.
  2. The runtime module is compiled by the host x86_64 clang and only retargeted to amdgcn here, so `amdgpu_cross_half_shuffle_i32`'s `__asm__ volatile("" : "+v"(byte))` optimization barrier carries x86 flag clobbers (`~{dirflag},~{fpsr},~{flags}`) that are meaningless on AMDGPU. The IR verifies but the empty-body INLINEASM is invalid on the amdgcn target. Neutralize empty-body barrier asm during retarget (forward the tied value, then erase) so no stale host asm reaches codegen. On the wave64 targets we ship `ds_bpermute` already addresses the full wave, so the hint is a no-op.

  Co-authored-by: Cursor <cursoragent@cursor.com>
* style: apply clang-format (v19.1.7) to AMDGPU fn_attrs and launcher sources

  CI pre-commit's clang-format hook reformatted these files (long declarations/lambda signatures collapsed onto single lines per the repo's clang-format config). Apply the same formatting so the hook passes. No functional changes.

  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(amdgpu): use CreateNeg for branchless i32 sgn instead of CreateSub(0, input)

  clang-tidy (modernize-use-nullptr, -warnings-as-errors) flagged `builder->CreateSub(0, input)` in the i32 sgn path: the literal `0` binds to the `llvm::Value*` LHS parameter as a null pointer, not an integer zero. Replace with `builder->CreateNeg(input)`, which emits `0 - input` with a proper zero constant -- identical intended semantics, and clang-tidy clean.

  Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Robert Dazi <14996868+v01dXYZ@users.noreply.github.com>
Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>
Co-authored-by: Alexis DUBURCQ <alexis.duburcq@gmail.com>
Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>
Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>
Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com>
Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Johnny <johnnynuca14@gmail.com>
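
For readers unfamiliar with the "branchless i32 sgn" mentioned in the last commit message, here is a minimal sketch in plain Python of one common way to build a sign function from a wrapped 32-bit negation (`0 - input`) and two shifts. This is an assumption for illustration only: the commit message does not show the actual instruction sequence the AMDGPU codegen emits, and the helper names `to_i32` and `sgn_i32` are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def to_i32(value: int) -> int:
    """Wrap a Python int to a signed 32-bit value (two's complement)."""
    low = value & MASK32
    return low - (1 << 32) if low & 0x80000000 else low


def sgn_i32(x: int) -> int:
    """Branchless sign of a 32-bit integer: -1, 0, or 1.

    Illustrative idiom only, not necessarily what the codegen emits.
    """
    x = to_i32(x)
    arithmetic = x >> 31          # -1 if x < 0, else 0 (sign-extending shift)
    negated = (0 - x) & MASK32    # wrapped 32-bit negation, i.e. `0 - input`
    logical = negated >> 31       # 1 if the negation has its sign bit set
    return to_i32(arithmetic | logical)
```

Note that for the minimum i32 value the negation wraps back to itself, and the arithmetic-shift term still forces the result to -1.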

hughperkins added 2 commits May 8, 2026 21:32

hughperkins commented May 8, 2026

hughperkins added 4 commits May 8, 2026 21:47

hughperkins marked this pull request as ready for review May 9, 2026 15:48

hughperkins marked this pull request as draft May 9, 2026 15:48

hughperkins commented May 9, 2026

chatgpt-codex-connector Bot reviewed May 9, 2026

Hugh Perkins (deskai7) added 6 commits May 9, 2026 16:05

[CUDA] Accept u32 / u64 for qd.math.clz

38a814a

[AMDGPU] Accept u32 / u64 for qd.math.clz

e5fa910

[Tests] Add u32 / u64 cases for qd.math.clz

2310a0e

[Docs] qd.math.clz now accepts u32 / u64 on every backend

78e65af
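
The commits above extend `qd.math.clz` to unsigned inputs. As a rough reference model of the semantics described in this PR (leading zeros counted over the unsigned bit pattern, and a 64-bit count assembled from two 32-bit halves), the following plain-Python sketch may help. It does not use the Quadrants API or any backend lowering; the function names `clz` and `clz64_via_halves` are hypothetical.

```python
def clz(value: int, bits: int = 32) -> int:
    """Count leading zeros of `value` viewed as an unsigned `bits`-wide pattern."""
    if bits not in (32, 64):
        raise ValueError("bits must be 32 or 64")
    pattern = value & ((1 << bits) - 1)  # reinterpret a signed input as unsigned
    return bits - pattern.bit_length()


def clz64_via_halves(value: int) -> int:
    """64-bit clz built from 32-bit clz calls on the high and low halves."""
    pattern = value & 0xFFFFFFFFFFFFFFFF
    hi = pattern >> 32
    lo = pattern & 0xFFFFFFFF
    if hi != 0:
        return clz(hi, 32)
    # High half is all zeros: 32 leading zeros plus those of the low half.
    return 32 + clz(lo, 32)
```

Under this model a signed -1 and an unsigned all-ones value give the same answer, which is why accepting u32 / u64 needs no change in the counting itself.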

hughperkins marked this pull request as ready for review May 9, 2026 16:14

Merge branch 'main' into hp/amdgpu-popcnt-clz

83a0e9a

chatgpt-codex-connector Bot reviewed May 9, 2026

Hugh Perkins and others added 3 commits May 10, 2026 12:32

Merge branch 'main' into hp/amdgpu-popcnt-clz

417369f

chatgpt-codex-connector Bot reviewed May 10, 2026

alanray-tech approved these changes May 11, 2026

Merge branch 'main' into hp/amdgpu-popcnt-clz

550983d

hughperkins merged commit cfaa700 into main May 11, 2026
55 checks passed

hughperkins deleted the hp/amdgpu-popcnt-clz branch May 11, 2026 19:01

		@@ -131,9 +128,6 @@ def test_u64(x: qd.uint64) -> qd.int32:

		@test_utils.test(arch=[qd.cpu, qd.metal, qd.cuda, qd.amdgpu, qd.vulkan])

		llvm_val[stmt] = builder->CreateTrunc(pop64, llvm::Type::getInt32Ty(*llvm_context));
		stmt->ret_type = PrimitiveType::i32;
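
The snippet above truncates the 64-bit population count to i32. A small plain-Python model shows why that narrowing is lossless: a 64-bit input has at most 64 set bits, which fits comfortably in a signed 32-bit result. This is a sketch of the arithmetic only, not the codegen; `trunc_to_i32` and `popcnt64_as_i32` are hypothetical names.

```python
def trunc_to_i32(value: int) -> int:
    """Keep the low 32 bits and reinterpret them as a signed 32-bit integer."""
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low & 0x80000000 else low


def popcnt64_as_i32(value: int) -> int:
    """Population count of a 64-bit pattern, narrowed to i32."""
    pattern = value & 0xFFFFFFFFFFFFFFFF  # signed and unsigned inputs share one bit pattern
    count = bin(pattern).count("1")       # always in the range 0..64
    return trunc_to_i32(count)            # no information lost for counts this small
```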

hughperkins commented May 8, 2026

hughperkins May 8, 2026

hughperkins May 8, 2026

github-actions Bot commented May 8, 2026

github-actions Bot commented May 8, 2026

github-actions Bot commented May 9, 2026

github-actions Bot commented May 9, 2026

github-actions Bot commented May 9, 2026

github-actions Bot commented May 9, 2026

hughperkins May 9, 2026

chatgpt-codex-connector Bot left a comment

💡 Codex Review

chatgpt-codex-connector Bot May 9, 2026

hughperkins commented May 9, 2026

chatgpt-codex-connector Bot left a comment

💡 Codex Review

chatgpt-codex-connector Bot May 9, 2026

alanray-tech commented May 10, 2026

Potential type mismatch when 64-bit popcnt/clz result feeds into a compound expression

github-actions Bot commented May 10, 2026

hughperkins commented May 10, 2026

chatgpt-codex-connector Bot left a comment

💡 Codex Review

chatgpt-codex-connector Bot May 10, 2026

github-actions Bot commented May 10, 2026

github-actions Bot commented May 10, 2026

github-actions Bot commented May 10, 2026

alanray-tech commented May 11, 2026

github-actions Bot commented May 11, 2026

github-actions Bot commented May 11, 2026

github-actions Bot commented May 11, 2026

2 participants