[Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch and sanitise MoltenVK cap advertisement #551
💡 Codex Review
quadrants/quadrants/codegen/llvm/codegen_llvm.cpp
Lines 2600 to 2601 in 453d0a8
In TaskCodeGenLLVM::visit(FuncCallStmt), the generated caller invokes the callee with a fresh RuntimeContext and then immediately continues (call(llvm_func, new_ctx)) without checking or forwarding new_ctx->cpu_assert_failed. On CPU, assertions now rely on this flag to abort execution, so an out-of-bounds/assert failure inside @qd.real_func is swallowed at the call boundary and the caller keeps running with invalid state. This makes debug assertions inside real functions ineffective and can reintroduce post-assert memory faults instead of cleanly terminating the kernel.
Additional findings (outside current diff — PR may have been updated during review):
- 🔴 `quadrants/codegen/spirv/spirv_codegen.cpp:2320-2347`: SPIR-V codegen caches a single per-task `invoc_id * stride` SSA id in `ad_stack_heap_thread_base_{float,int}_` and emits the underlying `OpIMul` via `ir_->mul(...)` into the current insertion block at the first AdStackAllocaStmt visit site (spirv_codegen.cpp:2324-2351). When a task contains multiple independent blocks (IBs), e.g. sibling inner range-fors that are each their own IB, each carrying its own f32 loop-carried variable, `auto_diff.cpp`'s per-IB pipeline runs `BackupSSA::run(ib)` independently for each IB, so each AdStackAllocaStmt is hoisted (at most) to its own IB root. The first visit emits the OpIMul inside IB1's body block; the second visit reuses the cached SSA id from a block that does not dominate IB2, violating SPIR-V §2.16. Fix mirrors the LLVM backend's `ensure_ad_stack_heap_base_llvm()` (codegen_llvm.cpp:2166-2186): emit the OpIMul at the task function's entry/dispatch-entry via an insertion-point save/restore, not at the first alloca visit site.
What the bug is
`get_ad_stack_heap_thread_base_float()` / `get_ad_stack_heap_thread_base_int()` cache a single SSA id per task and emit the backing `invoc_id * stride` `OpIMul` via `ir_->mul(...)`, which commits the instruction to the IR builder's current insertion block. Emission is triggered eagerly from `visit(AdStackAllocaStmt)` at the first alloca visit; every subsequent Push/Pop/LoadTop/AccAdjoint re-reads the cached id via `ad_stack_heap_{float,int}_ptr()`. The cached id therefore dominates all downstream uses only if the first visit happens inside a block that structurally dominates every other AdStackAllocaStmt of the same heap kind.
The code comment in `spirv_codegen.cpp:2326-2335` claims this holds because the first visit "lives in the dispatch body that dominates all inner loop bodies". That premise is what the bug contradicts.
How the premise breaks: multi-IB kernels
Reverse-mode AD's pipeline (`quadrants/transforms/auto_diff.cpp:2726-2755`) identifies multiple independent blocks and runs `PromoteSSA2LocalVar`/`ReplaceLocalVarWithStacks`/`MakeAdjoint`/`BackupSSA` per-IB. For a kernel shaped like:
    for i in outer:                # struct-for (outer)
        for j in range(n):         # inner range-for #1 -> IB1 = its body
            v = x[i, j]            # AllocaStmt at IB1 root
            for _ in range(k):     # dynamic inner
                v = qd.sin(v)
            out_a[i] += v
        for j in range(n):         # inner range-for #2 -> IB2 = its body
            w = y[i, j]            # AllocaStmt at IB2 root
            for _ in range(k):
                w = qd.cos(w)
            out_b[i] += w
`IdentifyIndependentBlocks` gives IB1 = inner-loop-1's body and IB2 = inner-loop-2's body (each has its own global atomic on a different output, so each qualifies as a smallest IB). `BackupSSA::run(ib)` uses `independent_block = ib`, so the hoisted backup AdStackAllocaStmt is inserted at that IB's position 0, not at a task-wide root that dominates both IBs.
In IR order, SPIR-V codegen then visits:
1. `start_label(inner1_body_label)` at the inner1 RangeForStmt header.
2. `visit(AdStackAllocaStmt_v)` at IB1 root. Calls `get_ad_stack_heap_thread_base_float()`, which routes `ir_->mul(...)` through `DEFINE_BUILDER_BINARY_USIGN_OP(mul, Mul)` -> `make_value(OpIMul, ...)` -> `make_inst`, committing the `OpIMul` to `curr_label_ == inner1_body_label`. Caches the result SSA id.
3. Exit inner1.
4. `start_label(inner2_body_label)`. `visit(AdStackAllocaStmt_w)` at IB2 root. Cache hit: returns the SSA id defined in step 2.
5. `visit(AdStackPushStmt)` for `w` inside the inner dynamic loop of inner2 calls `ad_stack_heap_float_ptr(...)`, which does `ir_->add(base, ...)` in inner2's body. The `OpIAdd` has an operand (the cached base) whose defining instruction lives in `inner1_body_label`.
`inner1_body_label` and `inner2_body_label` are sibling children of the outer for-loop's merge/header, and neither dominates the other. SPIR-V §2.16.2 rejects this; `spirv-val` prints a non-dominating-use error and drivers can TDR silently.
Why the refutations don't cover this
Both refutations correctly identify that `BackupSSA::generic_visit` hoists AdStackAllocaStmts to `independent_block` when a cross-block reference is detected, and this is sufficient for the narrow shape of mutually exclusive if-branches within a single IB: MakeAdjoint creates a reverse `new_if` sibling to the forward `if_stmt` at the IB root, references from `new_if`'s branches fall outside the forward if-branch's `leaf_to_root` chain, and the backup is inserted at IB root via `independent_block->insert(std::move(backup_stack_alloca), 0)` (auto_diff.cpp:2595). For that shape the bug report's claim is indeed partially wrong.
But the hoist is scoped to one IB at a time. When the kernel has sibling inner loops whose bodies are each IBs, each invocation of `BackupSSA::run(ib)` hoists its allocas to its own root, not to a task-wide block. The two resulting AdStackAllocaStmts live in sibling, mutually-non-dominating blocks. That is exactly the shape where the cached `invoc_id * stride` SSA id violates dominance.
The refutation about `test_adstack_if_cond_snapshot_adaptive_sizing` doesn't disprove this shape either: that test uses an `if/elif/elif/else` on a single carried variable (`outputs[i_inner, i_batch]`), so there is only one adstack kind and one alloca site. It produces no sibling-alloca pair and does not stress the cache.
Why the existing comment doesn't save this
The implementation's own inline comment at `spirv_codegen.h:219-225` defends eager-at-alloca-site emission with:
> Emitted eagerly from `visit(AdStackAllocaStmt)` so the `OpIMul` lives in the alloca's enclosing block, which strictly dominates every sibling inner loop that later references the cached SSA id.
That invariant relies on the alloca's enclosing block being task-global, i.e. the dispatch-body/function-entry block. With per-IB BackupSSA, the enclosing block is the IB root, which is task-global only when the kernel happens to have exactly one IB. The comment's invariant is therefore an accidental property of the test corpus, not a pipeline guarantee.
The LLVM backend already diagnosed the exact same concern and solved it explicitly: `TaskCodeGenLLVM::ensure_ad_stack_heap_base_llvm()` in `codegen_llvm.cpp:2166-2186` emits the base load at `entry_block` via an `llvm::IRBuilderBase::InsertPointGuard`, with a comment calling out "two sibling adstacks under different branches of an `if` would trip `verifyFunction` with a non-dominating use". The SPIR-V side should mirror this.
Impact
- `spirv-val` rejects the produced SPIR-V with a non-dominating-operand error (SPIR-V §2.16.2).
- Native Metal / Vulkan drivers vary: some refuse to compile the shader, others miscompile silently.
- This is triggered by a natural reverse-mode AD shape (two accumulators with their own dynamic loops in the same kernel) and is not exercised by any of the PR's new SPIR-V heap-adstack tests.
Step-by-step proof
Consider the kernel above, with `n = 4, k = 3`, compiled with `ad_stack_experimental_enabled=True`.
1. `IdentifyIndependentBlocks::run(root)` returns `{inner1_body_block, inner2_body_block}` because each inner body is the smallest IB with a qualifying global atomic.
2. For `ib = inner1_body_block`: `ReplaceLocalVarWithStacks` replaces `AllocaStmt_v` in place with `AdStackAllocaStmt_v` (at inner1_body position 0, since it was the first user stmt). `MakeAdjoint` emits reverse code (`new_for` with body referencing `AdStackAllocaStmt_v`) appended to `inner1_body_block`. `BackupSSA` examines reverse ops whose `op->parent` is inner1_body_block. Here `inner1_body_block` is in each reverse stmt's `leaf_to_root`, so no hoist fires. `AdStackAllocaStmt_v` stays at inner1_body position 0.
3. For `ib = inner2_body_block`: symmetric. `AdStackAllocaStmt_w` ends up at inner2_body position 0.
4. SPIR-V codegen's `run()` pre-scans IR (spirv_codegen.cpp:131-168) to size `ad_stack_heap_per_thread_stride_float_`. Both allocas are f32 with `max_size` bounded by the bounded-loop analyzer (k = 3 each), so stride ends up at ~12 f32 elements.
5. Code emission walks outer struct-for, enters inner1. `visit(RangeForStmt)` calls `start_label(body_label_inner1)`. Now `curr_label_ = body_label_inner1`.
6. `visit(AdStackAllocaStmt_v)` at `spirv_codegen.cpp:2420` calls `get_ad_stack_heap_thread_base_float()`, which emits `OpIMul %u32 %invoc_id %stride` under `body_label_inner1` and caches the SSA id as `%base_ssa`.
7. Exits inner1. `visit(RangeForStmt)` for inner2 calls `start_label(body_label_inner2)`. `curr_label_ = body_label_inner2`.
8. `visit(AdStackAllocaStmt_w)` at the same line. Cache hit: returns `%base_ssa` (defined in `body_label_inner1`).
9. Any later `visit(AdStackPushStmt)` / `visit(AdStackLoadTopStmt)` on `w` calls `ad_stack_heap_float_ptr(offset, count)`, which executes `ir_->add(%base_ssa, offset_val)` under `body_label_inner2`.
10. The `OpIAdd` references `%base_ssa`, whose defining `OpIMul` is in `body_label_inner1`. In the CFG, `body_label_inner1` is not on every path to `body_label_inner2` (they are sibling loop bodies under the outer struct-for header), so it does not dominate the use.
11. `spirv-val`'s structured-dominance pass rejects the module.
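The dominance argument in the proof can be checked on a toy control-flow graph. This is a minimal sketch, not code from the PR: the block names and the way two back-to-back range-fors are lowered (header, body, back edge, exit to the next header) are assumptions made for illustration, and `dominators` is a hypothetical helper implementing the textbook iterative algorithm.

```python
def dominators(cfg, entry):
    """Iterative dominator sets: dom[b] = {b} | intersection of dom[p] over preds p."""
    blocks = list(cfg)
    preds = {b: [p for p in blocks if b in cfg[p]] for b in blocks}
    dom = {b: set(blocks) for b in blocks}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if b == entry:
                continue
            new = set(blocks)
            for p in preds[b]:
                new &= dom[p]
            new |= {b}
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom


# Assumed lowering of: outer loop { inner range-for #1 ; inner range-for #2 }.
cfg = {
    "entry": ["outer_header"],
    "outer_header": ["inner1_header", "outer_merge"],
    "inner1_header": ["inner1_body", "inner2_header"],  # loop or skip to inner2
    "inner1_body": ["inner1_header"],                   # back edge
    "inner2_header": ["inner2_body", "outer_continue"],
    "inner2_body": ["inner2_header"],                   # back edge
    "outer_continue": ["outer_header"],
    "outer_merge": [],
}

dom = dominators(cfg, "entry")
# A value defined in inner1_body cannot be used in inner2_body ...
inner1_dominates_inner2 = "inner1_body" in dom["inner2_body"]
# ... but a value defined in entry can be used anywhere.
entry_dominates_all = all("entry" in d for d in dom.values())
```

In this toy graph `inner1_dominates_inner2` is false and `entry_dominates_all` is true, which is the property the proposed fix relies on.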
Fix
Mirror the LLVM backend. Add a one-shot `ensure_ad_stack_heap_thread_base_{float,int}()` that:
1. Saves the current insertion point (e.g. `ir_->save_insert_point()` or an equivalent).
2. Switches to the function's entry/dispatch-body block (the block right after the offloaded task's function header; equivalent to LLVM's `entry_block`).
3. Emits the `OpUConvert`/`OpIMul`.
4. Restores the original insertion point.
5. Caches the result.
Call it from `visit(AdStackAllocaStmt)` and both `ad_stack_heap_{float,int}_ptr` lazily. This guarantees the `OpIMul` lives in a block that dominates every other block in the function regardless of how many IBs the task contains.
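The save/switch/emit/restore/cache sequence can be sketched with a toy builder. Everything here is hypothetical (`ToyBuilder`, `ToyCodegen`, `insert_point_guard`, the instruction strings); it only models the control flow of the proposed fix and says nothing about the real SPIR-V IR builder API.

```python
from contextlib import contextmanager


class ToyBuilder:
    """Hypothetical IR builder: each block is a list of instruction strings."""

    def __init__(self):
        self.blocks = {"entry": []}
        self.current = "entry"
        self._next_id = 0

    def start_block(self, name):
        self.blocks.setdefault(name, [])
        self.current = name

    def emit(self, op):
        ssa = f"%{self._next_id}"
        self._next_id += 1
        self.blocks[self.current].append(f"{ssa} = {op}")
        return ssa

    @contextmanager
    def insert_point_guard(self, block):
        saved = self.current          # 1. save the current insertion point
        self.current = block          # 2. switch to the target block
        try:
            yield
        finally:
            self.current = saved      # 4. restore the original insertion point


class ToyCodegen:
    def __init__(self):
        self.builder = ToyBuilder()
        self._heap_base = None

    def ensure_heap_base(self):
        if self._heap_base is None:
            with self.builder.insert_point_guard("entry"):
                # 3. emit the multiply once, in the entry block
                self._heap_base = self.builder.emit("OpIMul %invoc_id %stride")
        return self._heap_base        # 5. cached for every later caller

    def visit_alloca(self):
        base = self.ensure_heap_base()
        return self.builder.emit(f"OpIAdd {base} %offset")


cg = ToyCodegen()
cg.builder.start_block("inner1_body")
cg.visit_alloca()
cg.builder.start_block("inner2_body")
cg.visit_alloca()
```

After the two visits, the single `OpIMul` sits in `entry` and both sibling bodies only contain uses of it. A real builder would additionally have to place the instruction before the entry block's terminator.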
- 🔴 `quadrants/codegen/llvm/codegen_llvm.cpp:2579-2585`: The PR adds a CPU assertion-propagation mechanism (`cpu_assert_failed`) but explicitly acknowledges via FIXME (codegen_llvm.cpp:2579-2584) that it is not propagated out of `@qd.real_func` callees. An OOB/assertion inside a real_func on CPU sets the flag on the callee's `new_ctx` alloca, but the caller never reads it, so subsequent tasks continue running on possibly-corrupted data, which is exactly the silent-segfault class the PR is meant to prevent. Fix by zero-initializing `new_ctx.cpu_assert_failed` before the call, checking it after, propagating to `get_context()->cpu_assert_failed`, and emitting an early `ret void` on failure; all three steps are enumerated in the FIXME.
What the bug is
The PR's central mechanism (setting `cpu_assert_failed=1` inside `quadrants_assert_format_ctx` and having the kernel launcher break out of the task loop) fails when the assertion fires inside a `@qd.real_func` callee on CPU. The callee correctly writes to its context, but the caller's context is never updated.
The specific code path
At `quadrants/codegen/llvm/codegen_llvm.cpp:2585`, `visit(FuncCallStmt)` allocates the callee's context via `create_entry_block_alloca(RuntimeContext)` and only initializes the `runtime` field on line 2586. The call is then emitted on line 2600 via `call(llvm_func, new_ctx)`, with no post-call propagation.
Inside the real_func body compilation (`stmt->func->ir->accept(this)` on line 2575), any `AssertStmt` routes through `use_ctx_variant=true` (since `arch_is_cpu`) and calls `quadrants_assert_format_ctx` with `get_context() == get_arg(0)`, which is the caller's `new_ctx` pointer. When the assertion fires, `runtime.cpp:845` writes `new_ctx->cpu_assert_failed = 1` and `codegen_llvm.cpp:1182` emits an early `CreateRetVoid`.
Why existing code doesn't prevent it
Back in the caller's task body, the flag on `new_ctx` is never copied into the caller's context. The outer `launch_offloaded_tasks` loop in `quadrants/runtime/cpu/kernel_launcher.cpp:13-22` only checks `ctx.get_context().cpu_assert_failed`, but that context belongs to the task-level scope, not the real_func call. Regular `@qd.func` is AST-inlined so it does not hit this path; only `@qd.real_func` callees do.
Additionally, `new_ctx` is raw `create_entry_block_alloca` storage. The C++ in-class initializer `int32_t cpu_assert_failed{0}` in `program/context.h` only applies to C++ constructions, not LLVM allocas, so the slot starts with uninitialized stack bytes. This is currently latent (nothing reads it back), but it means step (1) of the fix is load-bearing once post-call propagation is added.
Impact
An OOB/assertion inside a reverse-mode or any other `@qd.real_func` on CPU silently fails to terminate the kernel. Subsequent tasks in the same `launch_offloaded_tasks` loop continue running on possibly-corrupted data, exactly the `test_ndarray_oob_cpu_*` / `test_do_while_oob_does_not_loop_forever` regression the new mechanism is meant to eliminate. None of the tests added in this PR exercise a real_func callee (all use `@qd.kernel` or `@qd.func`), so CI does not catch the gap.
How to fix
The FIXME itself enumerates the three steps:
1. Zero-init `new_ctx->cpu_assert_failed` after the `RuntimeContext_set_runtime` call (LLVM `CreateStore` of a constant zero to the `cpu_assert_failed` field of `new_ctx`).
2. After `call(llvm_func, new_ctx)`, load `new_ctx->cpu_assert_failed` and compare against zero.
3. If non-zero, propagate via `get_context()->cpu_assert_failed = 1` and emit `CreateRetVoid` on the caller side, matching the pattern `visit(AssertStmt)` already uses at lines 1175-1183.
Proof via a concrete example
Consider a kernel that calls a `@qd.real_func` which reads an ndarray out of bounds, then the kernel body writes to an unrelated field afterward:
    @qd.real_func
    def oob_reader(a: qd.types.ndarray(dtype=qd.f32, ndim=1)) -> qd.f32:
        return a[100]  # a.shape == (4,), fires OOB assert

    @qd.kernel
    def k(a: qd.types.ndarray(dtype=qd.f32, ndim=1), b: qd.types.ndarray(dtype=qd.f32, ndim=1)):
        for i in range(4):
            v = oob_reader(a)
            b[i] = v  # executes even after the assert in oob_reader fires
Step-by-step at runtime with `debug=True, check_out_of_bound=True`:
1. `k` enters its task function; `ctx` is the outer task's `RuntimeContext` with `cpu_assert_failed=0` (cleared by `launch_offloaded_tasks` line 9).
2. `visit(FuncCallStmt)` emitted: `new_ctx = alloca RuntimeContext` (line 2585); `RuntimeContext_set_runtime(new_ctx, runtime)` (line 2586). `new_ctx->cpu_assert_failed` is stack garbage but unread.
3. `call(oob_reader, new_ctx)` jumps into the callee.
4. Inside `oob_reader`, the OOB `AssertStmt` fires. `use_ctx_variant` is true. `get_context()` returns `get_arg(0) == new_ctx`. `quadrants_assert_format_ctx(new_ctx, false, ...)` sets `new_ctx->cpu_assert_failed = 1` (`runtime.cpp:845`) and returns 1.
5. The callee's `visit(AssertStmt)` epilogue (lines 1175-1183) sees the non-zero return, emits `ret void`. Control returns to the caller.
6. The caller does not check `new_ctx->cpu_assert_failed`, so execution continues. `b[i] = v` is written (with whatever garbage `v` holds from the early-returned callee).
7. The for-loop in `k` iterates; the next iteration calls `oob_reader` again (same behaviour).
8. Eventually the task returns. `launch_offloaded_tasks` checks `ctx.get_context().cpu_assert_failed`, which is still 0 because nothing touched the outer context. The loop does not break. If there are more offloaded tasks, they also run.
9. The `debug=True` post-task `check_runtime_error` does eventually surface the assertion (via `runtime->error_code` set by `quadrants_assert_format` itself), but only after every subsequent task has already executed on corrupted state.
With the three-step fix, step 6 becomes: load `new_ctx->cpu_assert_failed` (reading the zero-init-then-maybe-set-to-1 slot), branch to a propagate block that stores 1 into `get_context()->cpu_assert_failed` and emits `ret void`, matching the invariant that every other assert-propagation site already upholds.
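The difference between the current call boundary and the three-step fix can be modelled in a few lines. This is a sketch under stated assumptions: `ToyContext` and the two `call_real_func_*` helpers are hypothetical Python stand-ins for the generated LLVM IR, and the boolean return value stands for "the caller keeps executing".

```python
from dataclasses import dataclass


@dataclass
class ToyContext:
    """Hypothetical stand-in for RuntimeContext; only the flag is modelled."""
    cpu_assert_failed: int = 0


def callee_with_oob(ctx):
    # Models a real_func whose assert fires: set the flag, then return early.
    ctx.cpu_assert_failed = 1


def call_real_func_unfixed(caller_ctx, callee):
    new_ctx = ToyContext()
    callee(new_ctx)
    return True  # flag on new_ctx is never read; the caller keeps running


def call_real_func_fixed(caller_ctx, callee):
    new_ctx = ToyContext()
    new_ctx.cpu_assert_failed = 0          # step 1: explicit zero-init
    callee(new_ctx)
    if new_ctx.cpu_assert_failed != 0:     # step 2: check after the call
        caller_ctx.cpu_assert_failed = 1   # step 3: propagate to the caller
        return False                       #         and return early
    return True


outer_before = ToyContext()
continued_unfixed = call_real_func_unfixed(outer_before, callee_with_oob)

outer_after = ToyContext()
continued_fixed = call_real_func_fixed(outer_after, callee_with_oob)
```

In the unfixed model the outer flag stays 0 and the caller continues; in the fixed model the outer flag becomes 1 and the caller stops, so a launcher loop that checks the outer context can break.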
|
Question (I don't have a strong opinion on this point, but just posing the question)
|
I'm not a huge fan of allowing system dependencies. By forcing our own version of MoltenVK, we can guarantee that it works. We do not support any version other than the one shipping with Quadrants, and I don't think we want to explore that option. If a dev wants to try some specific version, it is very easy to change it in the CMake file and delete the cache to force re-downloading the updated version. So I don't see any limitation in not supporting a system-wide Homebrew version. Except that, well, it forces downloading something, so it takes more time to bootstrap the build env for the first time. |
Question: is what we are downloading a binary, or source code? If source code, what is the impact on build time of using downloaded source code vs brew? (As an example of something we'd want to avoid: building the LLVM SDK takes a looonnnnggg time, and a lot of effort, so best pre-built I feel. I don't know about the MoltenVK SDK, hence posing the question.) |
We are installing the entire SDK, exactly as we do on Windows. This provides sources and SOME precompiled binaries. Actually all we use is the pre-compiled binary of MoltenVK in this case, and we do not build it. So no impact on build time. |
Ok. And what about Mac SIP? How are we avoiding triggering that? Have you tested this build on your own Mac locally, and you confirm no SIP issues? |
I don't know what SIP means, but I can confirm it builds locally and runs without issue. We are bundling the dylib of MoltenVK directly in the wheels (which was the pre-existing behaviour before this PR), so what we are distributing is reasonably standalone. It would be completely standalone if we properly ran the delocate audit tool on wheels before pushing to PyPI to ensure proper name mangling. I think it does not prevent all possible cases of symbol collisions, but it is fine in the vast majority of cases. |
|
@claude review |
Inline comments cover two nit-level doc issues; I'm deferring to a human reviewer given the scope here — the PR touches Vulkan device creation, BDA / PSB capability gating, descriptor-set lifecycle, SPIR-V codegen, and a heuristic 32-launch drain for MoltenVK, which is more than I'd want to shadow-approve.
Overview
Six-commit PR that swaps the 2022 Taichi-pinned MoltenVK for a LunarG Vulkan SDK fetch driven by build.py --shell, unblocks physical-storage-buffer (BDA) on Apple, and lands four follow-on MoltenVK quirk fixes: a latent alloc_info.usage → buffer_info.usage typo in VulkanDevice::allocate_memory, NonSemantic.DebugPrintf + shaderSharedFloat*AtomicAdd cap sanitisation on Apple with a companion lazy-import in the SPIR-V IR builder and format-string sanitisation in the overflow-diagnostic path, and a pending_launches_since_sync_ counter in GfxRuntime that forces a drain every 32 launches to bound VulkanStream::submitted_cmdbuffers_ growth.
Security risks
Low for the Apple-guarded paths (cap sanitisation is #if !defined(__APPLE__) and behind the validation-layer gate). The new SDK fetch invokes a LunarG installer binary via subprocess.check_call; the URL and binary name are hard-coded (no shell interpolation), but the LunarG download is a third-party supply-chain dependency the build hadn't had before. The buffer_info.usage typo fix tightens a pre-existing dormant bug rather than introducing new surface.
Level of scrutiny
High. This is cross-platform RHI / build / codegen infrastructure. The PSB capability flip on Apple changes behaviour for every Vulkan-on-Apple user, the vkGetBufferDeviceAddressKHR gating fix also affects Linux, and the kMaxPendingLaunches = 32 drain is a heuristic workaround for a MoltenVK SIGSEGV whose exact threshold ("somewhere around a few hundred") isn't precisely characterised. A human with MoltenVK / Vulkan RHI context should sanity-check the cap-sanitisation choices and the drain threshold.
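For readers weighing the drain threshold, the counter-based drain described above can be sketched as follows. Only the names `pending_launches_since_sync_` and `kMaxPendingLaunches = 32` come from the review; the class, the `>=` comparison, and the choice to reset the counter inside `synchronize` are assumptions made for illustration.

```python
K_MAX_PENDING_LAUNCHES = 32  # threshold quoted in the review


class ToyRuntime:
    """Hypothetical model of a runtime that bounds outstanding command buffers."""

    def __init__(self):
        self.pending_launches_since_sync = 0
        self.submitted_cmdbuffers = []
        self.max_outstanding = 0
        self.drains = 0

    def synchronize(self):
        # Draining releases every submitted command buffer and resets the counter.
        self.submitted_cmdbuffers.clear()
        self.pending_launches_since_sync = 0
        self.drains += 1

    def launch(self, kernel_name):
        self.submitted_cmdbuffers.append(kernel_name)
        self.pending_launches_since_sync += 1
        self.max_outstanding = max(self.max_outstanding, len(self.submitted_cmdbuffers))
        if self.pending_launches_since_sync >= K_MAX_PENDING_LAUNCHES:
            self.synchronize()  # forced drain keeps the list bounded


rt = ToyRuntime()
for step in range(100):
    rt.launch(f"kernel_{step}")
```

With 100 launches the model drains three times and never holds more than 32 outstanding command buffers, which is the bound the heuristic is meant to provide.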
Other factors
- Open discussion in the timeline about Mac SIP that the author answered with "I don't know what SIP means" — worth a human confirming the distributed wheel actually loads cleanly on a fresh Apple Silicon Mac.
- No unit tests added; regression coverage relies on the existing Vulkan-backend CI matrix (Mac 15/26, Linux Vulkan).
- Comments are dense and the two inline nits (stale path in `entry.py`, duplicated safety-valve paragraph in `runtime.cpp`) suggest at least one more editorial pass on the doc before merge.
|
Doc: I wonder if we should have a table in our doc somewhere with OSes as the rows (Windows, Mac, Ubuntu), and columns as supported arch type (CUDA, Vulkan, arm, x86 etc) ? |
|
Good suggestion. Added a Backend / OS matrix to |
|
checklist:
=> ok to merge |
…ild.py --shell and enable spirv_has_physical_storage_buffer on Apple
…ture bit, not on CHECK_VERSION(1,3) alone
…p.py develop` works after `build.py --shell` exits
…bug_printf so MoltenVK stops rejecting debug-capable kernels
…ueue every 32 launches so MoltenVK stops SIGSEGVing on atomic-float kernels and long simulation loops
…ime::submit_current_cmdlist_if_timeout
…d_systems.md, fix entry.py MoltenVK path comment to match vulkan.py, escape '%' in debug-printf overflow traceback so SPIRV-Cross -> MSL on MoltenVK does not interpret it as a format specifier
* [Misc] Warn user to disable caching when print_ir/QD_DUMP_IR enabled (Genesis-Embodied-AI#425) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> * [Build] Pin torch version to CUDA 12.8 for CUDA tests (Genesis-Embodied-AI#428) * [Misc] Fixing up taichi-dev urls (Genesis-Embodied-AI#429) * [Perf] Rename cuda_graph to gpu_graph across the codebase (Genesis-Embodied-AI#430) * Misc: fix typo integeral -> integral (Genesis-Embodied-AI#434) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> * [Perf] CUDA graph 4: call from multiple locations (Genesis-Embodied-AI#420) * [Bug] Fix fastcache not restoring graph_do_while_arg (Genesis-Embodied-AI#435) * [Perf] Cache last-call result in perf_dispatch for single-compatible case (Genesis-Embodied-AI#438) * Fix gpu_graph fallback on old Nvidia GPU. (Genesis-Embodied-AI#443) * Fix shared memory offset not reset between CUDA kernels. (Genesis-Embodied-AI#442) * [Misc] Allow disabling GPU graph via QD_GPU_GRAPH=0 env var (Genesis-Embodied-AI#439) * [Misc] Add named top-level loops (Genesis-Embodied-AI#440) * [Misc] Rename gpu_graph to graph (Genesis-Embodied-AI#446) * [Misc] Add cross-platform shuffle (Genesis-Embodied-AI#447) * [Bug] Fix graph_do_while on Windows: search for cudadevrt.lib (Genesis-Embodied-AI#456) * [Bug] Also search default CUDA toolkit install location on Windows (Genesis-Embodied-AI#461) * [SPIRV] Feature Parity Atomics & Shared Array (Genesis-Embodied-AI#432) * [Misc] Change clang format to 120 characters (Genesis-Embodied-AI#463) * [Misc] CUDA graph 5 Add fatbin (Genesis-Embodied-AI#464) * [Bug] Reuse VkInstance across init/reset cycles (Genesis-Embodied-AI#465) * [Perf] Tiles 1: _load, _store, _eye_ (Genesis-Embodied-AI#466) * [Misc] Remove dead InternalFuncStmt type_check override (Genesis-Embodied-AI#471) * [Perf] Tiles 2: add cholesky and ger (Genesis-Embodied-AI#472) * [Perf] Tiles 2b: add triangular solve (Genesis-Embodied-AI#474) * [Misc] Refactor: use _get_col/_set_col in tiles load/store/init 
(Genesis-Embodied-AI#475) * [Build] Fix flaky test_clock_accuracy (Genesis-Embodied-AI#436) * Fix AARCH64 emitting invalid asm in CUDA kernels. (Genesis-Embodied-AI#473) Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [AMDGPU] Enable HIP memory pool and surface pool-exhaustion errors. (Genesis-Embodied-AI#485) * [AMDGPU] Scope hsaco tmp dir per-user to avoid collisions. (Genesis-Embodied-AI#484) * [Perf] Tiles 3: Add slice syntax, qd.outer() and initial doc (Genesis-Embodied-AI#477) * [AMDGPU] Fix gradient computation. (Genesis-Embodied-AI#486) * Enable all backends that are supported in unit tests. (Genesis-Embodied-AI#488) * Fix SPIRV ID overflow for large kernels due to autodiff. (Genesis-Embodied-AI#489) * [Misc] Fix purity checker to allow accessing constants from quadrants modules (Genesis-Embodied-AI#487) * [Misc] Increase tolerance for clock monotonic test (Genesis-Embodied-AI#492) * [CI] Serialize api doc workflow (Genesis-Embodied-AI#494) * [CI] Increase tolerance for clock test (Genesis-Embodied-AI#506) * [CI] Increase clock test tolerance to 20% (Genesis-Embodied-AI#509) * [Perf] Add tensor_type parametrization to tile16 tests (Genesis-Embodied-AI#504) * [Perf] Tiles 4b: Migrate tiles16 tests to enable fastcache (Genesis-Embodied-AI#505) * [Perf] Tiles 4c: add Tiles16x16 proxy (Genesis-Embodied-AI#507) * [Perf] Tiles 4d: Consolidate slice error tests using parametrize (Genesis-Embodied-AI#508) * [Perf] Tiles 4: add SharedArray slice support (Genesis-Embodied-AI#482) * [Perf] Tiles 5: add Cholesky benchmark demo (Genesis-Embodied-AI#483) * [Doc] Add user guide page for subgroup shuffle (Genesis-Embodied-AI#512) * [Perf] Implement cross-platform shuffle_down (Genesis-Embodied-AI#510) * [Perf] Add portable subgroup reduce_add and reduce_all_add (Genesis-Embodied-AI#511) * [Perf] Add first warmup config to perf dispatch (Genesis-Embodied-AI#422) * [AutoDiff] Autodiff 1: Add baseline adstack regression test for unary_collections 
(Genesis-Embodied-AI#500) * [AutoDiff] Autodiff 2: Implement derivative for tan (Genesis-Embodied-AI#501) * [AutoDiff] Autodiff 3: Recompute tanh/exp on the operand in the reverse pass (Genesis-Embodied-AI#502) * [AutoDiff] Autodiff 4: Mark rsqrt as non-linear for adstack promotion (Genesis-Embodied-AI#503) * [AutoDiff] Autodiff 5: Fix adjoint-alloca placement for GlobalLoads outside the current range-for (Genesis-Embodied-AI#496) * [AutoDiff] Autodiff 6: Adstack regression tests (Genesis-Embodied-AI#491) * [AutoDiff] Autodiff 7: Fix header size in AdStackAllocaStmt to match u64 runtime layout (Genesis-Embodied-AI#534) * [AutoDiff] Autodiff 8: Surface LLVM adstack push/pop overflow as a Python exception (Genesis-Embodied-AI#535) * [AutoDiff] Autodiff 9: Guard against LLVM worker-thread stack overflow from large per-task adstack budget (Genesis-Embodied-AI#495) * [AutoDiff] Autodiff 10: Implement adstack for SPIR-V (Genesis-Embodied-AI#490) * [AutoDiff] Autodiff 11: Latent adstack-adjacent fixes (AMDGPU hipFree, flush() keeps ctx_buffers_, always-preallocate) (Genesis-Embodied-AI#536) * [Doc] Add AGENTS.md with instructions for AI agents (Genesis-Embodied-AI#541) * [Bug] Abort kernel execution on assertion failure instead of segfaulting (Genesis-Embodied-AI#419) * [Type] ndarray typing 1: Add eval_str=True to inspect.signature() calls (Genesis-Embodied-AI#411) * [CI] Suppress reportPrivateImportUsage in torch-using files (Genesis-Embodied-AI#552) * [Misc] QD_DUMP_IR dumps to files with the task_id added to the filename (Genesis-Embodied-AI#441) * [Type] ndarray typing 2: Fix NDArray single-arg subscript crash (Genesis-Embodied-AI#412) * [Test] Flush xdist channel before worker exit so test failure reports are visible (Genesis-Embodied-AI#555) * [CI] Reduce test retries on CI from 3 to 1. 
(Genesis-Embodied-AI#554) * [AutoDiff] Autodiff 12: Heap-backed adstack on LLVM backends (CPU/CUDA/AMDGPU) (Genesis-Embodied-AI#537) * [AutoDiff] Autodiff 13: Heap-backed adstack on SPIR-V backends (Metal, Vulkan) (Genesis-Embodied-AI#493) * [AutoDiff] Autodiff 14: Resolve bounded-inner-loop adstacks without default_ad_stack_size fallback (Genesis-Embodied-AI#539) * [SPIRV] Vulkan SPIR-V correctness: atomic-view aliasing, PSB stride, narrow storage caps, u1 cast, per-init layer recheck (Genesis-Embodied-AI#513) * [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch and sanitise MoltenVK cap advertisement (Genesis-Embodied-AI#551) * [Test] Suppress stock pytest-timeout to avoid conflict with pytest_hardtle (Genesis-Embodied-AI#557) * [Vulkan] Use SDK validation layer for debugPrintf instead of apt package (Genesis-Embodied-AI#562) * [Test] Fix flaky perf_dispatch tests by increasing work amounts (Genesis-Embodied-AI#559) * [Test] Add --maxfail CLI option to run_tests.py (default 20) (Genesis-Embodied-AI#558) * [CI] Vulkan debug printf fix to address flaky tests (Genesis-Embodied-AI#563) * [Docs] Add a new page to help for first time contributors (Genesis-Embodied-AI#426) Authored-by: v01dxyz <v01dxyz@v01d.xyz> * [AutoDiff] Autodiff 16: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr (Genesis-Embodied-AI#543) * Fix: raise error if device memory allocation fails (Genesis-Embodied-AI#451) (Genesis-Embodied-AI#453) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [CI] Add CI job to check line wrapping of comments and docs (Genesis-Embodied-AI#564) * [Misc] Add coverage report to PRs, including kernels (Genesis-Embodied-AI#470) * [CI] CI wrap check feeds only diffs to agent (Genesis-Embodied-AI#567) * Skip 'flaky' test on MacOS CI. 
(Genesis-Embodied-AI#573) * [Test] Fix missing `import sys` in test_fail_device_memory_allocation (Genesis-Embodied-AI#574) * [CI] Fix Vulkan debugPrintf flake with session-scoped warmup (Genesis-Embodied-AI#571) * [AutoDiff] determine_ad_stack_size: replace whole-CFG Bellman-Ford with SCC + DAG DP (Genesis-Embodied-AI#575) * [Test] Fix macOS OOM skip reason to describe actual root cause (Genesis-Embodied-AI#576) * [Lang] whole_kernel_cse: 2.5x compile time speedup on large kernels (Genesis-Embodied-AI#577) * [CI] Add CI check for unnecessarily deleted comments (Genesis-Embodied-AI#570) * [CI] Migrate coverage report to github Check page (Genesis-Embodied-AI#566) * [Lang] Skip IR verifier between passes unless debug=true (Genesis-Embodied-AI#579) * [Lang] Inline AdStack ops on release LLVM codegen: dramatically reduces compile time for adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#584) * [CUDA] Honor offline_cache=False end-to-end so QD_OFFLINE_CACHE=0 actually gives a cold compile (Genesis-Embodied-AI#580) * [Type] Tensor 24 (Genesis-Embodied-AI#561) Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> * [Lang] auto_diff host-walk reductions: dramatically faster front-end compile time on adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#587) * [AutoDiff] Speed up reverse-mode kernel launches on GPU backends (Genesis-Embodied-AI#578) * [Vulkan] Move adstack-sizer scratch out of Function-scope memory to fix SPIR-V pipeline build failures (Genesis-Embodied-AI#588) * [AutoDiff] Improve diagnosis of unsupported reverse-mode AD patterns (Genesis-Embodied-AI#590) * [Bug] Fix: promote Ndarray to AnyArray in build_Name for flattened struct fields (Genesis-Embodied-AI#592) * [SPIR-V] Shrink reverse-grad kernel MSL by ~50% (Genesis-Embodied-AI#591) * [CI] Add CI check that PR changes have test coverage (Genesis-Embodied-AI#596) * [Perf] Enable zero-copy in to_torch() and to_numpy() (Genesis-Embodied-AI#450) * Add 
BufferView: safe sub-range ndarray access for kernels (Genesis-Embodied-AI#585) Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [Doc] Add user-facing fastcache documentation (Genesis-Embodied-AI#597) Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> * [Misc] Upgrade to enable v1 dlpack so to_numpy(copy=False) writable (Genesis-Embodied-AI#598) Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local> * [AutoDiff] Cut reverse-mode adstack memory usage 10x on all backends (Genesis-Embodied-AI#599) * [Misc] Add CI check for feature file factorization (Genesis-Embodied-AI#606) * [Perf] Skip _recursive_set_args for all-Field frozen dataclass structs (Genesis-Embodied-AI#607) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] SNode-arm bound-expr capture rejects fold-attack gate indices (Genesis-Embodied-AI#610) * [Misc] Suppress field fastcache warning for qd.Tensor (Genesis-Embodied-AI#615) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] Adstack heap: clip reducer count by per-task loop trip count (compile-time and SizeExpr-evaluated) (Genesis-Embodied-AI#611) * [Misc] Forward copy= through qd.Tensor, add copy=None option (Genesis-Embodied-AI#616) Co-authored-by: Cursor <cursoragent@cursor.com> * [Doc] Update README (Genesis-Embodied-AI#617) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Fix coverage report showing def lines as uncovered (Genesis-Embodied-AI#623) Co-authored-by: Cursor <cursoragent@cursor.com> * [Perf] Generic launcher: persistent context, JIT-pointer reuse, Metal compute encoder, LLVM-GPU async memory ops (Part 1/2) (Genesis-Embodied-AI#619) * [CI] Encode Python-first testing policy in coverage-check prompt (Genesis-Embodied-AI#622) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Add PR Line change report (Genesis-Embodied-AI#624) Co-authored-by: Cursor <cursoragent@cursor.com> 
* [CI] Disable quadrants pytest plugin during quadrants internal coverage runs (Genesis-Embodied-AI#629) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] Adstack load+store eliminations: EliminateRecomputableAdStackPushes pass + leaf extensions (Genesis-Embodied-AI#621) * [CI] Simplify coverage PR comment to a single linked line (Genesis-Embodied-AI#630) * [CUDA] Add AGX Thor, SM_110 (Genesis-Embodied-AI#631) Co-authored-by: Johnny Nunez and Hugh Perkins * [CI] Lines changed report: collapse PR comment to a single linked totals line (Genesis-Embodied-AI#632) * [FEATURE] Support external Metal command queue via qd.init (Genesis-Embodied-AI#618) Co-authored-by: Cursor <cursoragent@cursor.com> * [Perf] Cache adstack-sizer metadata per task across SPIR-V + LLVM-GPU; per-snode / DeviceAllocation invalidation (Part 2/2) (Genesis-Embodied-AI#620) * [AutoDiff] Disable EliminateRecomputableAdStackPushes pending mutated-SNode chain-leaf fix (Genesis-Embodied-AI#633) * [AutoDiff] Adstack chain-clone safety: mutated-SNode leaf reject + load_top consumer-aware guard (Genesis-Embodied-AI#634) * [Docs] Add user-guide page for qd.simt.block.* primitives (Genesis-Embodied-AI#638) * [Docs] Expand qd.simt.subgroup user-guide page to cover every op (Genesis-Embodied-AI#639) * [Perf] Streams 1-4 (Genesis-Embodied-AI#410) * [Docs] Add user-guide page for matrix decompositions and solvers (Genesis-Embodied-AI#643) * [Bug] Revert "[Perf] Streams 1-4 (Genesis-Embodied-AI#410)" (Genesis-Embodied-AI#650) * [Docs] Add user-guide page for atomics and bit operations (Genesis-Embodied-AI#640) * [Docs] Add user-guide page for qd.simt.grid.* primitives (Genesis-Embodied-AI#641) * [AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch (Genesis-Embodied-AI#635) * [AMDGPU] Fix amdgpu parallel rand init (Genesis-Embodied-AI#658) * [Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap (Genesis-Embodied-AI#655) * [Perf] Re-land Streams 1-4 with bug fixes 
(Genesis-Embodied-AI#653) * [AMDGPU] Apply device_memory_GB=0.3 cap to AMDGPU tests (Genesis-Embodied-AI#659) * [Perf] Per-launch host sync: drop wait_idle on SPIR-V, pin stream and drop stream_synchronize on CUDA/AMDGPU (Genesis-Embodied-AI#654) * [AMDGPU] Unload hipModule_t in JITModuleAMDGPU destructor (Genesis-Embodied-AI#660) * [AMDGPU] Trim default mempool on qd.reset() (Genesis-Embodied-AI#669) * [AMDGPU] Hoist rand-state buffer to process lifetime (Genesis-Embodied-AI#668) * [Streams] Use events for streams serialization on AMDGPU and CUDA (Genesis-Embodied-AI#667) * [Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id (Genesis-Embodied-AI#671) * [SPIR-V] dispatch_max_reducers: register each task with the real kernel name (Genesis-Embodied-AI#675) * [AutoDiff] Debug-mode field/grad/dual: dtype, layout, and access-time invariants (Genesis-Embodied-AI#677) * [Docs] Add user-guide page for qd.algorithms.* device-wide algorithms (Genesis-Embodied-AI#642) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [Docs] Doc for existing atomics: switch support table to per-backend columns (Genesis-Embodied-AI#657) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [GPU] Cross gpu atomics (Genesis-Embodied-AI#666) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [GPU] Make block operations portable cross-gpu (Genesis-Embodied-AI#664) * [Perf] CPU LLVM adstack-cache: skip per-launch bump-writes + ndarray_shapes capture on forward-only handles (Genesis-Embodied-AI#685) * [GPU] Cross-GPU for grid ops (Genesis-Embodied-AI#670) * [Math] Make bitop operations portable cross-gpu (Genesis-Embodied-AI#662) * [AMDGPU] Always use wave64, on both RDNA and CDNA (Genesis-Embodied-AI#687) * [AMDGPU] Use syncscope("agent") for atomix xor to avoid CAS livelock (Genesis-Embodied-AI#672) * [GPU] New bit ops for QIPC (Genesis-Embodied-AI#679) * [GPU] Subgroup ops cross-gpu (Genesis-Embodied-AI#665) * [Graph] Rename CUDA 
Graph to Graph in docs (Genesis-Embodied-AI#691) * [SPIR-V] Fix FIFO-queue ordering when sharing command queue. (Genesis-Embodied-AI#694) * [Atomics] New QIPC ops for atomics (Genesis-Embodied-AI#690) * Pass dataclass sub-structs into qd.func (Genesis-Embodied-AI#698) * [AMDGPU] HIP graph runtime support for @qd.kernel(graph=True) (Genesis-Embodied-AI#692) * [CI] Add per-file timing report to Mac Metal test job (Genesis-Embodied-AI#695) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Enable kernel disk cache during tests (Genesis-Embodied-AI#696) * [Math] New QIPC ops for single-threaded linalg (Genesis-Embodied-AI#683) * [BREAKING][GPU] New QIPC ops for subgroups (Genesis-Embodied-AI#676) * [GPU] New QIPC ops for block (Genesis-Embodied-AI#684) * [GPU] New device-level ops for QIPC (Genesis-Embodied-AI#693) * [algorithms] PrefixSumExecutor: drop unused GRID_SZ local (Genesis-Embodied-AI#701) * [block] sync(): fix unsupported-arch error message (Genesis-Embodied-AI#700) * [volatile_load] add qd.volatile_load primitive (closes Genesis-Embodied-AI#648) (Genesis-Embodied-AI#702) * [AutoDiff] Reject recycled identity_key in AdStackCache::register_adstack_sizing_info (Genesis-Embodied-AI#708) * [Vulkan] Declare GroupNonUniform SPIR-V caps and enable shaderSubgroupExtendedTypes (Genesis-Embodied-AI#707) * Fix duplicate HIP graph driver-function declarations after v1.0.0 merge The amd-integration fork had cherry-picked the HIP graph driver functions (graph_create / graph_destroy / graph_add_kernel_node / graph_instantiate / graph_exec_destroy / graph_launch), and upstream v1.0.0 added the same set. The per-file 3-way merge appended both copies into amdgpu_driver_functions.inc.h, producing redeclaration errors that broke the AMDGPU RHI/runtime compile. Drop the upstream duplicate block; the signatures are identical to the fork's existing declarations. 
Co-authored-by: Cursor <cursoragent@cursor.com> * Fix AMDGPU launcher coherence and num_instructions visibility after v1.0.0 merge - kernel_launcher.cpp: the 3-way merge spliced upstream v1.0.0's launch_llvm_kernel rewrite (ephemeral arg/context buffers, explicit-stream path, AmdgpuDefaultStream PinGuard) onto the AMD fork's kernarg-by-value + persistent-scratch design, leaving references to undefined `ephemeral_context_ptr`. Restore the fork's coherent launch_llvm_kernel verbatim; it calls the (already merged) enhanced launch_offloaded_tasks, which keeps the max-reducer dispatch and stream-parallel groups adapted onto the AMD launch path. - llvm_context.h: both the fork and upstream added `num_instructions`; the merge kept upstream's private placement, but the AMDGPU codegen force-inline heuristic calls it statically from outside the class. Move it back to the public section. Co-authored-by: Cursor <cursoragent@cursor.com> * Restore async result D2H and hoist kernarg vectors in AMDGPU launcher The v1.0.0 merge resolution regressed two amd-integration baseline optimizations in launch_llvm_kernel / launch_offloaded_tasks: - The per-launch result-buffer copy was a blocking memcpy_device_to_host, forcing a host stall on every value-returning launch and serializing the GPU pipeline. Restore the async D2H (the caller synchronizes lazily when it needs the value); external-array transfers still stream_synchronize once before reading back. - launch_task constructed the kernarg std::vectors from initializer lists ({kernarg_payload} / {kernarg_size}) on every dispatch (heap alloc + free per launch). Hoist arg_ptrs/arg_sizes out of the per-task launch and reuse. 
Co-authored-by: Cursor <cursoragent@cursor.com> * amdgpu: default to LDS permlane64 emulation; drop host-x86 barrier asm on retarget Two AMDGPU JIT-compile crashes surfaced after the v1.0.0 merge pulled in the QIPC subgroup ops (Genesis-Embodied-AI#676), which made the rigid constraint solver's wave-cooperative reductions route through `amdgpu_cross_half_shuffle_i32`. Both manifested as a SIGSEGV inside `llvm::SIInstrInfo::getInstSizeInBytes` during `JITSessionAMDGPU::compile_module_to_hsaco` (i.e. at first kernel launch), and reproduce on gfx942 / MI300X. Baseline 0.4.6 never emitted these constructs, which is why it was unaffected. 1. Native `llvm.amdgcn.permlane64` lowering crashes the bundled LLVM 22.1.0 AMDGPU backend. Default `amdgpu_permlane64` to the existing LDS-roundtrip software emulation on every target (it produces identical results). Add `QD_AMDGPU_USE_NATIVE_PERMLANE64=1` to opt back into the native instruction once the backend bug is fixed; the old `QD_AMDGPU_FORCE_PERMLANE64_FALLBACK` is now the default and still honored. This is the actual crash fix. 2. The runtime module is compiled by the host x86_64 clang and only retargeted to amdgcn here, so `amdgpu_cross_half_shuffle_i32`'s `__asm__ volatile("" : "+v"(byte))` optimization barrier carries x86 flag clobbers (`~{dirflag},~{fpsr},~{flags}`) that are meaningless on AMDGPU. The IR verifies but the empty-body INLINEASM is invalid on the amdgcn target. Neutralize empty-body barrier asm during retarget (forward the tied value, then erase) so no stale host asm reaches codegen. On the wave64 targets we ship `ds_bpermute` already addresses the full wave, so the hint is a no-op. Co-authored-by: Cursor <cursoragent@cursor.com> * style: apply clang-format (v19.1.7) to AMDGPU fn_attrs and launcher sources CI pre-commit's clang-format hook reformatted these files (long declarations/lambda signatures collapsed onto single lines per the repo's clang-format config). Apply the same formatting so the hook passes. 
No functional changes. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(amdgpu): use CreateNeg for branchless i32 sgn instead of CreateSub(0, input) clang-tidy (modernize-use-nullptr, -warnings-as-errors) flagged `builder->CreateSub(0, input)` in the i32 sgn path: the literal `0` binds to the `llvm::Value*` LHS parameter as a null pointer, not an integer zero. Replace with `builder->CreateNeg(input)`, which emits `0 - input` with a proper zero constant -- identical intended semantics, and clang-tidy clean. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Robert Dazi <14996868+v01dXYZ@users.noreply.github.com> Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> Co-authored-by: Alexis DUBURCQ <alexis.duburcq@gmail.com> Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com> Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Johnny <johnnynuca14@gmail.com>
LunarG-sourced MoltenVK on Apple unblocks PSB (BDA) for the runtime adstack sizer, plus the three MoltenVK quirks that surface once PSB, validation, and long kernel-loop workloads are live
TL;DR
The shell hook fetches LunarG's macOS installer once, extracts the SDK into `~/.cache/quadrants/vulkan-macos-1.4.321.0/`, and exports `VULKAN_SDK` / `MOLTENVK_DIR`. CMake then picks up `libMoltenVK.dylib` from the SDK instead of Taichi's legacy pinned dylib. Downstream, the adstack sizer compute shader (Autodiff 17) relies on BDA to walk `SizeExpr` trees on device, so enabling PSB on Apple is the gating change that makes that shader legal to dispatch. The four add-on commits then cover what the freshly-enabled PSB path exposes: a latent buffer-usage-bit typo, two MoltenVK caps that are advertised-but-broken, and a cmdbuffer-queue drain that repeated kernel launches need.

Why
The previous Apple Vulkan path pinned a 2022 MoltenVK dylib hosted on `taichi_assets`, predating the `VK_KHR_buffer_device_address` / physical-storage-buffer capability. Three concrete consequences:

- `vulkan_device_creator.cpp` hard-guarded `spirv_has_physical_storage_buffer` off on Apple behind `#if !defined(__APPLE__) && false`, citing taichi-dev/taichi#6295.
- The adstack sizer shader issues `OpLoad` through a `PhysicalStorageBuffer`-class pointer to read `SizeExpr` leaves; with PSB disabled on Apple, every reverse-mode kernel would hard-error at launch time on Metal.
- A typo in `VulkanDevice::allocate_memory` (`alloc_info.usage & VK_BUFFER_USAGE_STORAGE_BUFFER_BIT` instead of `buffer_info.usage & ...`) made the "attach `SHADER_DEVICE_ADDRESS_BIT`" branch dead for every buffer. It was unreachable while the PSB cap was off, but the moment PSB is enabled every buffer becomes a validation-layer violation (Linux) or a garbage-address read (MoltenVK).

The less-targeted workarounds are insufficient: keeping the Taichi pin and papering over the capability check would ship a MoltenVK that cannot serve BDA loads; asking every Quadrants developer to install LunarG's SDK globally breaks hermetic CI. Fetching through `build.py --shell` gives us a single, reproducible SDK path that the rest of the build consumes.

Surface API
No Python-surface API changes. The whole diff is build-system (`vulkan.py`, `quadrants/rhi/CMakeLists.txt`), Vulkan RHI internals (`vulkan_device_creator.cpp`, `vulkan_device.cpp`, `vulkan_api.cpp`), and SPIR-V codegen internals (`spirv_ir_builder.cpp`, `spirv_codegen.cpp`, `runtime/gfx/runtime.cpp` + `runtime.h`). Behaviour deltas visible to users of `qd.init(arch=qd.vulkan)`:

- `qd.lang.impl.current_cfg().spirv_has_physical_storage_buffer` flips to `True` on Apple.
- `qd.init(arch=qd.vulkan, debug=True)` no longer fails pipeline creation on MoltenVK for kernels that emit `debugPrintfEXT` traffic (lazy-import + Apple cap drop).
- Kernels that use `qd.simt.block.SharedArray` with an atomic-f32 `add` / `sub` no longer fail MoltenVK's MSL compile with `atomic_fetch_add_explicit(threadgroup atomic_float*, ...)`; they route through the CAS-emulated fallback instead.
- Tight kernel-launch loops no longer crash inside `MVKCommandEncoder` after a few hundred launches without a `qd.sync()`.

Entry points
| File | Change |
| --- | --- |
| `.github/workflows/scripts/ti_build/vulkan.py` | `setup_vulkan()` gains a Darwin / arm64 branch that fetches, extracts, and installs LunarG's macOS bundle. |
| `quadrants/rhi/CMakeLists.txt` | Locates `libMoltenVK.dylib` via `$MOLTENVK_DIR` / `$VULKAN_SDK`; `configure_file` stages it into `${CMAKE_BINARY_DIR}/libMoltenVK.dylib`. `FATAL_ERROR` on a missing SDK. |
| `quadrants/rhi/vulkan/vulkan_device_creator.cpp` | Removes the Apple kill-switch on `spirv_has_physical_storage_buffer`. Gates the overall PSB cap on the queried `bufferDeviceAddress` feature bit. Skips `VK_KHR_shader_non_semantic_info` on Apple (advertised, but the MSL translator can't emit `debugPrintfEXT`). Skips `shaderSharedFloat{16,32,64}AtomicAdd` on Apple (same reason: MSL rejects `atomic_fetch_add_explicit` on `threadgroup atomic_float*`). |
| `quadrants/rhi/vulkan/vulkan_device.cpp` | Fixes the `alloc_info.usage` → `buffer_info.usage` typo that made the "attach `SHADER_DEVICE_ADDRESS_BIT`" branch dead. Gates `vkGetBufferDeviceAddressKHR` on whether the bit is actually set, so uniform / vertex / transfer-only staging buffers no longer trip `VUID-VkBufferDeviceAddressInfo-buffer-02601`. |
| `quadrants/rhi/vulkan/vulkan_api.cpp` | Frees each descriptor set back to its pool before the `shared_ptr` release so MoltenVK's pool churn does not null-pool-deref after ~32 two-set kernel launches. |
| `quadrants/runtime/gfx/runtime.{h,cpp}` | Adds a `pending_launches_since_sync_` counter; `submit_current_cmdlist_if_timeout` forces a `synchronize()` every `kMaxPendingLaunches = 32` launches to bound `VulkanStream::submitted_cmdbuffers_` growth on MPM-style tight kernel-launch loops. |
| `quadrants/codegen/spirv/spirv_ir_builder.{cpp,h}` | Imports `NonSemantic.DebugPrintf` only when a `call_debugprintf` site actually needs it, so kernels with no `print` / debug-assert traffic stay MoltenVK-compatible. |
| `quadrants/codegen/spirv/spirv_codegen.cpp` | Sanitises the format string passed to `call_debugprintf`: un-escaped quotes / newlines in the traceback string survive MoltenVK's MSL translation into the output and previously produced `use of undeclared identifier 'Users'`-class errors from the path prefix. |

Mechanism end-to-end
1. SDK acquisition (`vulkan.py`)

| Platform | Asset | Cache directory |
| --- | --- | --- |
| Linux | `vulkansdk-linux-x86_64-1.4.321.1.tar.xz` (tarball, unchanged) | `~/.cache/quadrants/vulkan-1.4.321.1/x86_64/` |
| macOS | `vulkansdk-macos-1.4.321.0.zip` (installer bundle) | `~/.cache/quadrants/vulkan-macos-1.4.321.0/` |
| Windows | | `~/.cache/quadrants/vulkan-win-1.4.321.1/` |

The macOS branch is the only new one. LunarG didn't publish a `1.4.321.1` macOS asset, so the patch-level is inlined to `1.4.321.0`. `zipfile` extracts the installer bundle without preserving the Unix execute bit, so the script `chmod 0755`s the installer binary before running it (idempotent, scoped to the single file). The CLI `install` command writes the SDK into the `--root` prefix.

2. CMake pickup (`quadrants/rhi/CMakeLists.txt`)

| Variable | Role | Lookup |
| --- | --- | --- |
| `MOLTENVK_DIR` | Directory holding `libMoltenVK.dylib` | `find_file(MOLTEN_VK libMoltenVK.dylib NO_DEFAULT_PATH PATHS ${MOLTENVK_DIR})` |
| `VULKAN_SDK` | `${VULKAN_SDK}/lib` is tried if `MOLTENVK_DIR` is unset | Same `find_file` call, fallback path |

`configure_file` stages the located dylib into `${CMAKE_BINARY_DIR}/libMoltenVK.dylib` (copy, not symlink, so the install step can re-digest it) and `install(FILES ... DESTINATION ${INSTALL_LIB_DIR}/runtime)` ships it alongside the runtime. A missing SDK is a `FATAL_ERROR` pointing at `./build.py --shell`; there is deliberately no silent fallback to the legacy pin.

3. PSB capability unblocked (`vulkan_device_creator.cpp`)

Removes the `#if !defined(__APPLE__) && false` kill-switch gate around `caps.set(DeviceCapability::spirv_has_physical_storage_buffer, true)`. The surrounding gate is tightened from `CHECK_VERSION(1, 3) || buffer_device_address_feature.bufferDeviceAddress` to a plain feature-bit check: Vulkan 1.3 promotes `VK_KHR_buffer_device_address` into core but still lets implementations expose `bufferDeviceAddress = VK_FALSE`, so the version-OR gate was treating 1.3 devices as PSB-capable even when they weren't. Devices that genuinely don't advertise BDA (ancient drivers, headless CI without Vulkan) remain safe.

4. `vkGetBufferDeviceAddressKHR` now sees the right usage bit (`vulkan_device.cpp`)

Before this PR the branch that ORs `VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT_KHR` into `buffer_info.usage` was gated on `alloc_info.usage & VK_BUFFER_USAGE_STORAGE_BUFFER_BIT`, but `alloc_info.usage` is VMA's `VmaMemoryUsage` enum (small integers), not the Vulkan `VkBufferUsageFlags` bitfield. The `&` always yielded 0, the branch was dead, and every buffer reached the `vkGetBufferDeviceAddressKHR` call below without the required bit. The bug was latent while PSB was off on Apple (no one called `vkGetBufferDeviceAddressKHR`). Once PSB is on it fires `VUID-VkBufferDeviceAddressInfo-buffer-02601` under validation (Linux CI's `test_print` stderr-assertion failures) and returns a garbage address under MoltenVK (Mac CI's `test_tile16_*` / `test_mpm88_numpy_and_ndarray` wrong-output failures).

The fix reads `buffer_info.usage` instead, and additionally gates the `vkGetBufferDeviceAddressKHR` call on the bit actually being set, so uniform / vertex / transfer-only staging buffers skip the BDA query and keep `alloc.addr == 0`.

5. MoltenVK cap sanitisation (`vulkan_device_creator.cpp`)

MoltenVK advertises two Vulkan capabilities whose SPIR-V → MSL translation is broken:

- `VK_KHR_shader_non_semantic_info`: the extension enumerates fine, `OpExtInstImport "NonSemantic.DebugPrintf"` validates, and the `OpExtInst` call sites pass SPIR-V validation, but SPIRV-Cross emits an unconditional `debugPrintfEXT(...)` call stub whose identifier Metal's MSL compiler rejects (`use of undeclared identifier 'debugPrintfEXT'`). Every reverse-mode kernel that happens to compile with a `debug=True` `debugPrintfEXT` site fails pipeline creation on MoltenVK. Skipped on Apple.
- `shaderShared{Float32,Float16,Float64}AtomicAdd`: the feature bit is set, but MoltenVK's MSL translator emits `atomic_fetch_add_explicit((threadgroup atomic_float*) &x, ...)`, which Metal rejects with `cannot pass pointer to address space 'threadgroup' as a pointer to address space 'device'`. Skipped on Apple, routing shared-memory float atomics through the existing CAS-emulated fallback in `atomic_operation_widened`.

The skips are `#if !defined(__APPLE__)` guards, with the MoltenVK issue links in the comment at each site.

6. Companion lazy-import + format-string sanitisation (`spirv_ir_builder.{cpp,h}`, `spirv_codegen.cpp`)

Even with `spirv_has_non_semantic_info` turned off on Apple, kernels with `debug=True` can still enter the arithmetic-overflow check path in `spirv_codegen.cpp::generate_overflow_branch`, which calls `ir_->call_debugprintf(...)`. Left untreated, the traceback string passed to that call contains un-escaped `"` and `\n` characters (Python source file paths, newlines) that survive the MSL translation and blow up the output with errors like `missing terminating '"' character`. Two mitigations:

- `spirv_ir_builder::init_pre_defs` no longer eagerly imports `NonSemantic.DebugPrintf`; the import now fires lazily from the first `call_debugprintf` site. Kernels with no debug traffic emit no `OpExtInstImport`, so MoltenVK's unused-import stub never runs.
- `TaskCodegen::generate_overflow_branch` escapes `"` and replaces `\n` / `\r` with spaces before feeding the traceback into the format string. Native Vulkan drivers get the traceback byte-for-byte; Metal / MSL round-trips cleanly.

7. Descriptor-set lifecycle fix (`vulkan_api.cpp`)

`DeviceObjVkDescriptorSet::~DeviceObjVkDescriptorSet` now returns the `VkDescriptorSet` to its source pool via `vkFreeDescriptorSets`. Without this, each launch accumulates consumed-but-never-reclaimed slots, `VulkanDevice::alloc_desc_set` spins up fresh pools at the 64-set boundary, and MoltenVK's `MVKDescriptorSet::_pool` can deref a pool the driver has torn down (null-pool deref inside `MVKResourcesCommandEncoderState::bindDescriptorSet`). The pool is created with `VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT`, so the free call is legal; the `ref_pool` `shared_ptr` keeps the pool and its `VkDevice` alive past the destructor.

8. Periodic `submitted_cmdbuffers_` drain (`runtime/gfx/runtime.{h,cpp}`)

`VulkanStream::submit` appends one `TrackedCmdbuf{fence, cmd_buffer}` per submit. The vector is only cleared in `command_sync()` / `wait_idle()`. Workloads that push hundreds of kernels before any host-side observable (MPM, iterative field solves) accumulate hundreds of live fences + cmdbuffers + descriptor sets; MoltenVK's encoder-state tracker SIGSEGVs somewhere around that size. `GfxRuntime::submit_current_cmdlist_if_timeout` now also drains the queue every `kMaxPendingLaunches = 32` launches via a bounded synchronize; workloads that already touch a Python observable per iteration are unaffected (`ctx_buffers_` clears earlier via the normal synchronize path).

Per-backend coverage matrix
- … `Test on Mac (15, 3.*)` / `(26, 3.*)`.
- macOS (MoltenVK): `non_semantic_info` / `shared_atomic_float` caps off to match what MoltenVK's MSL translator actually supports; descriptor-set + cmdbuffer-queue lifecycle fixes in place. Covered end-to-end by `Test on Mac (15, 3.*)` / `(26, 3.*)`.
- Linux: the `buffer_info.usage` fix and the `vkGetBufferDeviceAddressKHR` bit-gate fix also apply here; they resolve the `test_print` stderr validation-layer failures that `test_gpu / Test Linux Vulkan` was reporting.
- Linux: the `.tar.xz` branch in `vulkan.py` and the Linux PSB / non-semantic-info paths are untouched.
- Windows: `vulkan.py` untouched; the `buffer_info.usage` fix applies but is a no-op relative to the pre-PR state because Windows was already validation-clean.

Tests
CI
- `Test on Mac (15, 3.10-3.13)` and `Test on Mac (26, 3.10-3.13)` exercise the new fetch end-to-end and run the full Vulkan-backend test matrix. Pre-PR: `test_tile16_*[arch=vulkan-*]` / `test_mpm88_numpy_and_ndarray[arch=vulkan-0]` / `test_shared_array_float_atomics[arch=vulkan-*-dtype1-{add,sub}]` fail. Post-PR: those pass; any new regressions surface here.
- `test_gpu / Test Linux Vulkan` exercises the `vkGetBufferDeviceAddressKHR` bit-gate fix by running with validation enabled. Pre-PR: `test_print_*[arch=vulkan]` fail because `VUID-VkBufferDeviceAddressInfo-buffer-02601` warnings pollute stderr; post-PR those go quiet.
- `Manylinux wheel Build/Test (ubuntu-22.04 / ubuntu-22.04-arm)` validates that the Linux branch of `vulkan.py` is unchanged.
- `Windows 2025 Build/Test (3.10-3.13)` validates that the Windows branch of `vulkan.py` is unchanged.

Local smoke
- `./build.py --shell -- cmake -S . -B build -DQD_WITH_VULKAN=ON && ./build.py` on macOS-26 / arm64 succeeds and stages `libMoltenVK.dylib` into `build/`.
- `python -c "import quadrants as qd; qd.init(arch=qd.vulkan); print(qd.lang.impl.current_cfg().spirv_has_physical_storage_buffer)"` reports `True` after this PR; reports `False` before it.
- `CMAKE_BUILD_TYPE=Debug cmake --log-level=DEBUG` shows the `MoltenVK: using LunarG Vulkan SDK copy at ...` status line.

No unit tests are added by this PR itself: the SDK and RHI changes surface via the existing Vulkan-backend test matrix, which is the regression harness. The atomic-fetch-add and debug-printf MoltenVK quirks are already covered by `test_shared_array_float_atomics` and the existing `debug=True`-using `test_matrix` / `test_tile16` cases respectively.

Side-effect audit
- `vulkan.py`: `case ("Linux", "x86_64")` / `case ("Windows", "AMD64")` branches untouched.
- `VULKAN_SDK` env var semantics: `quadrants/rhi/CMakeLists.txt` (BSD find path), `quadrants/rhi/vulkan/vulkan_device_creator.cpp` (runtime loader), shader compiler `glslang` lookup; `$VULKAN_SDK`; the new macOS prefix looks identical in layout.
- `MOLTENVK_DIR` env var: `quadrants/rhi/CMakeLists.txt`; no runtime lookup; falls back to `$VULKAN_SDK/lib`.
- `find_file(MOLTEN_VK ...)` is cached; after the SDK is installed the first configure populates it and subsequent configures skip; removing `~/.cache/quadrants/vulkan-macos-.../` + `rm -rf build` regenerates from scratch.
- `spirv_has_physical_storage_buffer`
- `alloc_info.usage` → `buffer_info.usage` fix: the `allocate_memory` branch attaches `SHADER_DEVICE_ADDRESS_BIT` to the Vulkan buffer usage only; VMA allocation usage is unchanged.
- `vkGetBufferDeviceAddressKHR` bit-gate
- `non_semantic_info` skipped on Apple: `#if !defined(__APPLE__)`; other platforms unaffected.
- `shared_atomic_float*` skipped on Apple: falls back to the CAS-emulated path in `atomic_operation_widened`.
- … (`if (APPLE)`) in `CMakeLists.txt`; Apple-guarded caps in `vulkan_device_creator.cpp`.
- `download_dep(url, installer_dir, strip=1)` uses the existing cache primitive; re-runs short-circuit on cached unzip + on the existence of `$prefix/macOS/installer_bin`.
- `.chmod(0o755)` before `subprocess.check_call`; idempotent; `zipfile` dropped mode `0644`, handled here.
- `$VK_LAYER_PATH`
- `CMakeLists.txt` previously `curl`-ed `libMoltenVK.dylib.zip` from `taichi_assets` - removed. `FATAL_ERROR` replaces the silent fallback.
- `NonSemantic.DebugPrintf` import: no `call_debugprintf` -> no `OpExtInstImport`; every previously-working Vulkan driver still sees the import when a kernel actually needs it.
- `vkFreeDescriptorSets` per-set on destruction; pool retains `VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT`.
- `pending_launches_since_sync_` threshold: `synchronize()`; only fires when no Python-side observable has intervened.
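
The installer-permission entry in the audit above (the execute bit lost on extraction, restored with `chmod(0o755)` before the installer is run) can be illustrated with a minimal sketch. This is not the code in `vulkan.py`: the archive name, the `installer_bin` member, and both helper functions are hypothetical stand-ins, and it assumes a POSIX filesystem.

```python
import os
import stat
import tempfile
import zipfile


def extract_and_mark_executable(archive_path, dest_dir, member):
    """Extract the archive, then make `member` executable.

    zipfile.extractall() does not restore Unix permission bits, so the
    extracted file ends up with the process default mode (typically 0o644).
    """
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest_dir)
    extracted = os.path.join(dest_dir, member)
    # Idempotent: repeating the chmod on a cached extraction is harmless.
    os.chmod(extracted, 0o755)
    return extracted


def demo():
    """Build a throwaway archive with a stand-in installer and extract it."""
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "bundle.zip")  # hypothetical archive name
        with zipfile.ZipFile(archive, "w") as zf:
            info = zipfile.ZipInfo("installer_bin")  # hypothetical member name
            # Record rwxr-xr-x in the archive metadata; extractall() ignores it.
            info.external_attr = (stat.S_IFREG | 0o755) << 16
            zf.writestr(info, "#!/bin/sh\nexit 0\n")
        out_dir = os.path.join(tmp, "out")
        path = extract_and_mark_executable(archive, out_dir, "installer_bin")
        # Return only the permission bits of the extracted file.
        return stat.S_IMODE(os.stat(path).st_mode)
```

Calling `demo()` returns the permission bits of the extracted stand-in, which include the execute bits only because of the explicit `chmod`. Scoping the `chmod` to the single installer file, as the PR describes, avoids widening permissions on the rest of the bundle.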