[Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch and sanitise MoltenVK cap advertisement by duburcqa · Pull Request #551 · Genesis-Embodied-AI/quadrants

duburcqa · 2026-04-23T05:38:31Z

LunarG-sourced MoltenVK on Apple unblocks PSB (BDA) for the runtime adstack sizer, plus the three MoltenVK quirks that surface once PSB, validation, and long kernel-loop workloads are live

Six commits. Originally three: swap the 2022 Taichi-pinned MoltenVK for a LunarG Vulkan SDK fetch driven by ./build.py --shell, gate the PSB capability on the queried bufferDeviceAddress feature bit, and locate the staged dylib from the SDK extract so python setup.py develop keeps working after the shell exits. Three more were added once the new MoltenVK was exercised against the reverse-mode / MPM / tile16 test matrix: a vkGetBufferDeviceAddressKHR buffer-usage-bit fix that was latent while PSB was off on Apple, a NonSemantic.DebugPrintf / shaderSharedFloat32AtomicAdd cap-sanitisation pair that MoltenVK advertises but cannot actually service, and a GfxRuntime safety valve that drains the Vulkan stream's submitted_cmdbuffers_ queue on long kernel-launch loops so MoltenVK's encoder-state tracker stops SIGSEGVing. The PR now ships a MoltenVK that is both functional for BDA-backed reverse-mode workloads and doesn't regress the rest of the SPIR-V test suite on CI's Mac runners.

TL;DR

# macOS / arm64, from a clean checkout:
./build.py --shell -- cmake -S . -B build -DQD_WITH_VULKAN=ON
# $VULKAN_SDK / $MOLTENVK_DIR are exported by the shell hook;
# CMake's find_file locates libMoltenVK.dylib under the SDK and stages it into build/.
./build.py
python -c "import quadrants as qd; qd.init(arch=qd.vulkan)"
# Now reports spirv_has_physical_storage_buffer=True on Apple, and the full
# adstack / tile16 / MPM test matrix is green on Mac CI (15 & 26) and Linux Vulkan.

The shell hook fetches LunarG's macOS installer once, extracts the SDK into ~/.cache/quadrants/vulkan-macos-1.4.321.0/, and exports VULKAN_SDK / MOLTENVK_DIR. CMake then picks up libMoltenVK.dylib from the SDK instead of Taichi's legacy pinned dylib. Downstream, the adstack sizer compute shader (Autodiff 17) relies on BDA to walk SizeExpr trees on device, so enabling PSB on Apple is the gating change that makes that shader legal to dispatch. The three add-on commits then cover what the freshly-enabled PSB path exposes - a latent buffer-usage-bit typo, two MoltenVK caps that are advertised-but-broken, and a cmdbuffer-queue drain that repeated kernel launches need.

Why

The previous Apple Vulkan path pinned a 2022 MoltenVK dylib hosted on taichi_assets, predating the VK_KHR_buffer_device_address / physical-storage-buffer capability. Three concrete consequences:

- vulkan_device_creator.cpp hard-guarded spirv_has_physical_storage_buffer off on Apple behind #if !defined(__APPLE__) && false, citing taichi-dev/taichi#6295.
- The adstack sizer shader that lands in Autodiff 17 needs OpLoad through a PhysicalStorageBuffer-class pointer to read SizeExpr leaves; with PSB disabled on Apple, every reverse-mode kernel would hard-error at launch time on Metal.
- A dormant typo in VulkanDevice::allocate_memory (alloc_info.usage & VK_BUFFER_USAGE_STORAGE_BUFFER_BIT instead of buffer_info.usage & ...) made the "attach SHADER_DEVICE_ADDRESS_BIT" branch dead for every buffer; unreachable while the PSB cap was off, but the moment PSB is enabled every buffer becomes a validation-layer violation (Linux) or garbage-address read (MoltenVK).

The less-targeted workarounds are insufficient: keeping the Taichi pin and papering over the capability check would ship a MoltenVK that cannot serve BDA loads; asking every Quadrants developer to install LunarG's SDK globally breaks hermetic CI. Fetching through build.py --shell gives us a single, reproducible SDK path the rest of the build consumes.

Surface API

No Python-surface API changes. The diff is confined to the build system (vulkan.py, quadrants/rhi/CMakeLists.txt), Vulkan RHI internals (vulkan_device_creator.cpp, vulkan_device.cpp, vulkan_api.cpp), and SPIR-V codegen and runtime internals (spirv_ir_builder.cpp, spirv_codegen.cpp, runtime/gfx/runtime.cpp + runtime.h). Behaviour deltas visible to users of qd.init(arch=qd.vulkan):

- qd.lang.impl.current_cfg().spirv_has_physical_storage_buffer flips to True on Apple.
- qd.init(arch=qd.vulkan, debug=True) no longer fails pipeline creation on MoltenVK for kernels that emit debugPrintfEXT traffic (lazy-import + Apple cap drop).
- Reverse-mode kernels using qd.simt.block.SharedArray with an atomic-f32 add / sub no longer fail MoltenVK's MSL compile with atomic_fetch_add_explicit(threadgroup atomic_float*, ...) - they route through the CAS-emulated fallback instead.
- Long kernel-launch loops (MPM-style simulations, iterative field updates) no longer SIGSEGV inside MVKCommandEncoder after a few hundred launches without a qd.sync().

Entry points

| File | What changes |
| --- | --- |
| `.github/workflows/scripts/ti_build/vulkan.py` | `setup_vulkan()` gains a Darwin / arm64 branch that fetches + extracts + installs LunarG's macOS bundle. |
| `quadrants/rhi/CMakeLists.txt` | Apple branch locates `libMoltenVK.dylib` via `$MOLTENVK_DIR` / `$VULKAN_SDK`; `configure_file` stages it into `${CMAKE_BINARY_DIR}/libMoltenVK.dylib`. `FATAL_ERROR` on a missing SDK. |
| `quadrants/rhi/vulkan/vulkan_device_creator.cpp` | Drops the Apple kill-switch around `spirv_has_physical_storage_buffer`. Gates the overall PSB cap on the queried `bufferDeviceAddress` feature bit. Skips `VK_KHR_shader_non_semantic_info` on Apple (advertised but the MSL translator can't emit `debugPrintfEXT`). Skips `shaderSharedFloat{16,32,64}AtomicAdd` on Apple (same reason: MSL rejects `atomic_fetch_add_explicit` on `threadgroup atomic_float*`). |
| `quadrants/rhi/vulkan/vulkan_device.cpp` | Fixes the `alloc_info.usage` → `buffer_info.usage` typo that made the "attach `SHADER_DEVICE_ADDRESS_BIT`" branch dead. Gates `vkGetBufferDeviceAddressKHR` on whether the bit is actually set, so uniform / vertex / transfer-only staging buffers no longer trip VUID-VkBufferDeviceAddressInfo-buffer-02601. |
| `quadrants/rhi/vulkan/vulkan_api.cpp` | Frees descriptor sets on `shared_ptr` release so MoltenVK's pool churn does not null-pool-deref after ~32 two-set kernel launches. |
| `quadrants/runtime/gfx/runtime.{h,cpp}` | Adds a `pending_launches_since_sync_` counter; `submit_current_cmdlist_if_timeout` forces a `synchronize()` every `kMaxPendingLaunches = 32` launches to bound `VulkanStream::submitted_cmdbuffers_` growth on MPM-style tight kernel-launch loops. |
| `quadrants/codegen/spirv/spirv_ir_builder.{cpp,h}` | Lazy-imports `NonSemantic.DebugPrintf` only when a `call_debugprintf` site actually needs it, so kernels with no `print` / debug-assert traffic stay MoltenVK-compatible. |
| `quadrants/codegen/spirv/spirv_codegen.cpp` | Sanitises the overflow-diagnostic traceback before feeding it to `call_debugprintf`: un-escaped quotes / newlines in the traceback string survive MoltenVK's MSL translation into the output and previously produced `use of undeclared identifier 'Users'`-class errors from the path prefix. |

Mechanism end-to-end

1. SDK acquisition (`vulkan.py`)

| Platform | Source | Prefix |
| --- | --- | --- |
| Linux | `vulkansdk-linux-x86_64-1.4.321.1.tar.xz` (tarball, unchanged) | `~/.cache/quadrants/vulkan-1.4.321.1/x86_64/` |
| Darwin / arm64 | `vulkansdk-macos-1.4.321.0.zip` (installer bundle) | `~/.cache/quadrants/vulkan-macos-1.4.321.0/` |
| Windows | MSI (unchanged) | `~/.cache/quadrants/vulkan-win-1.4.321.1/` |

The macOS branch is the only new one. LunarG didn't publish a 1.4.321.1 macOS asset, so the macOS patch level is hard-coded to 1.4.321.0. zipfile extracts the installer bundle without preserving the Unix execute bit, so the script sets mode 0755 on the installer binary before running it (idempotent, scoped to the single file). The CLI install command writes the SDK into the --root prefix.
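
The execute-bit handling can be illustrated with a minimal sketch. The helper name `extract_and_mark_executable` and its arguments are hypothetical and not taken from `vulkan.py`; only the standard-library `zipfile` and `os.chmod` calls are real.

```python
import os
import zipfile


def extract_and_mark_executable(zip_path: str, dest_dir: str, member: str) -> str:
    """Extract one archive member and make it executable (hypothetical helper).

    ZipFile.extract() does not restore Unix permission bits, so the mode is
    set explicitly afterwards. Re-running chmod on a file that is already
    0755 changes nothing, which keeps the step idempotent.
    """
    with zipfile.ZipFile(zip_path) as archive:
        extracted = archive.extract(member, dest_dir)
    # Scoped to the single installer file, not the whole tree.
    os.chmod(extracted, 0o755)
    return extracted
```

Setting the mode on one known file, rather than walking the extracted tree, keeps the fix narrow and avoids marking data files executable.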

2. CMake pickup (`quadrants/rhi/CMakeLists.txt`)

| Env var | Meaning | Consumer |
| --- | --- | --- |
| `MOLTENVK_DIR` | path that directly contains `libMoltenVK.dylib` | `find_file(MOLTEN_VK libMoltenVK.dylib NO_DEFAULT_PATH PATHS ${MOLTENVK_DIR})` |
| `VULKAN_SDK` | SDK prefix; `${VULKAN_SDK}/lib` is tried if `MOLTENVK_DIR` is unset | same `find_file` call, fallback path |

configure_file stages the located dylib into ${CMAKE_BINARY_DIR}/libMoltenVK.dylib (copy, not symlink, so the install step can re-digest it) and install(FILES ... DESTINATION ${INSTALL_LIB_DIR}/runtime) ships it alongside the runtime. A missing SDK is a FATAL_ERROR pointing at ./build.py --shell; there is no silent fallback to the legacy pin on purpose.
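
The lookup order and the hard failure can be modelled outside CMake. The following is a rough Python analogue, assuming the precedence described in the table above; `locate_moltenvk` and `require_moltenvk` are hypothetical names, and the real logic lives in `find_file` inside `quadrants/rhi/CMakeLists.txt`.

```python
import os
from typing import Mapping, Optional

DYLIB_NAME = "libMoltenVK.dylib"


def locate_moltenvk(env: Mapping[str, str]) -> Optional[str]:
    """Return the dylib path, or None when no SDK location is usable.

    MOLTENVK_DIR wins when set; ${VULKAN_SDK}/lib is only tried when
    MOLTENVK_DIR is unset (assumed reading of the table above).
    """
    if env.get("MOLTENVK_DIR"):
        directory = env["MOLTENVK_DIR"]
    elif env.get("VULKAN_SDK"):
        directory = os.path.join(env["VULKAN_SDK"], "lib")
    else:
        return None
    path = os.path.join(directory, DYLIB_NAME)
    return path if os.path.isfile(path) else None


def require_moltenvk(env: Mapping[str, str]) -> str:
    """Fail loudly instead of silently falling back to a legacy dylib."""
    path = locate_moltenvk(env)
    if path is None:
        raise FileNotFoundError(
            f"{DYLIB_NAME} not found; run ./build.py --shell to fetch the SDK"
        )
    return path
```

Raising instead of returning a default mirrors the `FATAL_ERROR` choice: a missing SDK is reported at configure time rather than surfacing later as a stale dylib.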

3. PSB capability unblocked (`vulkan_device_creator.cpp`)

Removes the #if !defined(__APPLE__) && false kill-switch gate around caps.set(DeviceCapability::spirv_has_physical_storage_buffer, true). The surrounding gate is tightened from CHECK_VERSION(1, 3) || buffer_device_address_feature.bufferDeviceAddress to a plain feature-bit check: Vulkan 1.3 promotes VK_KHR_buffer_device_address into core but still lets implementations expose bufferDeviceAddress = VK_FALSE, so the version-OR gate was treating 1.3 devices as PSB-capable even when they weren't. Devices that genuinely don't advertise BDA (ancient drivers, headless CI without Vulkan) remain safe.
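
The difference between the two gates is easiest to see on a Vulkan 1.3 device that reports `bufferDeviceAddress = VK_FALSE`. This is a small Python model of the predicate only; the function names are hypothetical and the real check is C++ in `vulkan_device_creator.cpp`.

```python
from typing import Tuple


def psb_supported_old(api_version: Tuple[int, int], bda_feature: bool) -> bool:
    """Pre-PR gate: version-OR-feature (Apple kill-switch not modelled)."""
    return api_version >= (1, 3) or bda_feature


def psb_supported_new(bda_feature: bool) -> bool:
    """Post-PR gate: only the queried feature bit decides."""
    return bool(bda_feature)


# A 1.3 device that exposes bufferDeviceAddress = VK_FALSE:
device_version, device_bda = (1, 3), False
old_verdict = psb_supported_old(device_version, device_bda)  # wrongly capable
new_verdict = psb_supported_new(device_bda)                  # correctly off
```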

4. `vkGetBufferDeviceAddressKHR` now sees the right usage bit (`vulkan_device.cpp`)

Before this PR the branch that ORs VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT_KHR into buffer_info.usage was gated on alloc_info.usage & VK_BUFFER_USAGE_STORAGE_BUFFER_BIT - but alloc_info.usage is VMA's VmaMemoryUsage enum (small integers), not the Vulkan VkBufferUsageFlags bitfield. The & always yielded 0; the branch was dead; every buffer reached the vkGetBufferDeviceAddressKHR call below without the required bit. Latent while PSB was off on Apple (no one called vkGetBufferDeviceAddressKHR). Once PSB is on it fires VUID-VkBufferDeviceAddressInfo-buffer-02601 under validation (Linux CI's test_print stderr-assertion failures) and returns a garbage address under MoltenVK (Mac CI's test_tile16_* / test_mpm88_numpy_and_ndarray wrong-output failures). Fix reads buffer_info.usage instead, and additionally gates the vkGetBufferDeviceAddressKHR call on the bit actually being set, so uniform / vertex / transfer-only staging buffers skip the BDA query and keep alloc.addr == 0.
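
The dead branch follows from the numeric ranges involved. The sketch below uses the flag values from `vulkan_core.h` and assumes the `VmaMemoryUsage` values run from 0 to 9, as in VMA 3.x; the function names are hypothetical stand-ins for the logic in `VulkanDevice::allocate_memory`.

```python
# VkBufferUsageFlagBits values from vulkan_core.h.
VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT = 0x00000010
VK_BUFFER_USAGE_STORAGE_BUFFER_BIT = 0x00000020
VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT = 0x00020000

# VmaMemoryUsage is a plain enum (assumed 0..9 here), not a bitfield.
VMA_MEMORY_USAGE_VALUES = range(0, 10)


def buggy_usage(buffer_usage: int, vma_usage: int) -> int:
    """Pre-PR: tests the VMA enum, so the branch never fires."""
    if vma_usage & VK_BUFFER_USAGE_STORAGE_BUFFER_BIT:
        buffer_usage |= VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT
    return buffer_usage


def fixed_usage(buffer_usage: int) -> int:
    """Post-PR: tests the Vulkan usage bitfield itself."""
    if buffer_usage & VK_BUFFER_USAGE_STORAGE_BUFFER_BIT:
        buffer_usage |= VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT
    return buffer_usage


def query_address(buffer_usage: int, get_address) -> int:
    """Only query BDA when the bit is set; other buffers keep address 0."""
    if buffer_usage & VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT:
        return get_address()
    return 0
```

Every enum value below 32 has bit 5 clear, so `vma_usage & 0x20` is zero for the whole enum, which is why the bug stayed invisible until something actually called `vkGetBufferDeviceAddressKHR`.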

5. MoltenVK cap sanitisation (`vulkan_device_creator.cpp`)

MoltenVK advertises two Vulkan capabilities whose SPIR-V → MSL translation is broken:

- VK_KHR_shader_non_semantic_info: the extension enumerates fine, OpExtInstImport "NonSemantic.DebugPrintf" validates, the OpExtInst call sites pass SPIR-V validation, but SPIRV-Cross emits an unconditional debugPrintfEXT(...) call stub whose identifier Metal's MSL compiler rejects (use of undeclared identifier 'debugPrintfEXT'). Every reverse-mode kernel that happens to compile with a debug=True debugPrintfEXT site fails pipeline creation on MoltenVK. Skipped on Apple.
- shaderShared{Float32,Float16,Float64}AtomicAdd: the feature bit is set, but MoltenVK's MSL translator emits atomic_fetch_add_explicit((threadgroup atomic_float*) &x, ...) which Metal rejects with cannot pass pointer to address space 'threadgroup' as a pointer to address space 'device'. Skipped on Apple, routing shared-memory float atomics through the existing CAS-emulated fallback in atomic_operation_widened.

The skips are #if !defined(__APPLE__) guards, with the MoltenVK issue links in the comment at each site.
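
Conceptually the sanitisation is a platform-specific denylist applied on top of what the driver advertises. The sketch below models that idea in Python; the capability names and the `sanitise_caps` helper are hypothetical, and the actual PR uses compile-time `#if` guards rather than a runtime filter.

```python
from typing import FrozenSet, Set

# Hypothetical capability names standing in for the two skipped groups.
APPLE_BROKEN_CAPS: FrozenSet[str] = frozenset(
    {
        "non_semantic_info",
        "shared_atomic_float16_add",
        "shared_atomic_float32_add",
        "shared_atomic_float64_add",
    }
)


def sanitise_caps(advertised: Set[str], is_apple: bool) -> Set[str]:
    """Drop caps that are advertised but not serviceable on Apple."""
    if not is_apple:
        return set(advertised)
    return set(advertised) - APPLE_BROKEN_CAPS


advertised = {"physical_storage_buffer", "non_semantic_info", "shared_atomic_float32_add"}
apple_caps = sanitise_caps(advertised, is_apple=True)
linux_caps = sanitise_caps(advertised, is_apple=False)
```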

6. Companion lazy-import + format-string sanitisation (`spirv_ir_builder.{cpp,h}`, `spirv_codegen.cpp`)

Even with spirv_has_non_semantic_info turned off on Apple, kernels with debug=True can still enter the arithmetic-overflow check path in spirv_codegen.cpp::generate_overflow_branch, which calls ir_->call_debugprintf(...). Left untreated, the traceback string passed to that call contains un-escaped " and \n characters (Python source file paths, newlines) that survive the MSL translation and blow up the output with errors like missing terminating '"' character. Two mitigations:

- spirv_ir_builder::init_pre_defs no longer eagerly imports NonSemantic.DebugPrintf; the import now fires lazily from the first call_debugprintf site. Kernels with no debug traffic emit no OpExtInstImport, so MoltenVK's unused-import stub never runs.
- TaskCodegen::generate_overflow_branch escapes " and replaces \n / \r with spaces before feeding the traceback into the format string. Native Vulkan drivers get the traceback byte-for-byte; Metal / MSL round-trips cleanly.
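
Both mitigations are small enough to model directly. The sketch below is a toy Python analogue: `ToyIRBuilder` and `sanitise_traceback` are hypothetical names, the emitted strings are only SPIR-V-like text, and the escaping rules are exactly the two stated above (escape `"`, turn `\n` and `\r` into spaces).

```python
def sanitise_traceback(text: str) -> str:
    """Escape double quotes and flatten line breaks into spaces."""
    return text.replace('"', '\\"').replace("\n", " ").replace("\r", " ")


class ToyIRBuilder:
    """Emits the DebugPrintf import only when a call site needs it."""

    def __init__(self) -> None:
        self.instructions = []
        self._debug_printf_id = None  # not imported yet

    def _debug_printf_import(self) -> int:
        # Lazy import: the first caller pays for the OpExtInstImport.
        if self._debug_printf_id is None:
            self.instructions.append('OpExtInstImport "NonSemantic.DebugPrintf"')
            self._debug_printf_id = len(self.instructions) - 1
        return self._debug_printf_id

    def call_debugprintf(self, fmt: str) -> None:
        ext = self._debug_printf_import()
        self.instructions.append(f'OpExtInst %{ext} DebugPrintf "{fmt}"')


builder = ToyIRBuilder()
quiet_module = list(builder.instructions)  # no debug traffic, no import
builder.call_debugprintf(sanitise_traceback('File "k.py"\nline 3'))
builder.call_debugprintf("overflow")
```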

7. Descriptor-set lifecycle fix (`vulkan_api.cpp`)

DeviceObjVkDescriptorSet::~DeviceObjVkDescriptorSet now returns the VkDescriptorSet to its source pool via vkFreeDescriptorSets. Without this, each launch accumulates consumed-but-never-reclaimed slots, VulkanDevice::alloc_desc_set spins up fresh pools at the 64-set boundary, and MoltenVK's MVKDescriptorSet::_pool can deref a pool the driver has torn down (null-pool deref inside MVKResourcesCommandEncoderState::bindDescriptorSet). The pool is created with VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT, so the free call is legal; the ref_pool shared_ptr keeps the pool and its VkDevice alive past the destructor.
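
The pool growth is simple arithmetic, shown here with a toy allocator. This is a simplified Python model, assuming a fixed capacity of 64 sets per pool as described above; `ToyDescriptorPools` and `pools_after` are hypothetical and do not reflect the real `VulkanDevice::alloc_desc_set` internals.

```python
POOL_CAPACITY = 64  # sets per pool, per the 64-set boundary described above


class ToyDescriptorPools:
    """Tracks the number of live sets in each pool."""

    def __init__(self) -> None:
        self.live_per_pool = []

    def alloc(self) -> int:
        # Reuse the first pool with a free slot, else spin up a new pool.
        for index, live in enumerate(self.live_per_pool):
            if live < POOL_CAPACITY:
                self.live_per_pool[index] += 1
                return index
        self.live_per_pool.append(1)
        return len(self.live_per_pool) - 1

    def free(self, pool_index: int) -> None:
        self.live_per_pool[pool_index] -= 1


def pools_after(launches: int, sets_per_launch: int, free_on_release: bool) -> int:
    """Number of pools created after a run of kernel launches."""
    pools = ToyDescriptorPools()
    for _ in range(launches):
        handles = [pools.alloc() for _ in range(sets_per_launch)]
        if free_on_release:
            for handle in handles:
                pools.free(handle)
    return len(pools.live_per_pool)
```

With two sets per launch, the leaking variant needs a second pool on launch 33, which lines up with the "~32 two-set kernel launches" threshold quoted in the entry-point table.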

8. Periodic `submitted_cmdbuffers_` drain (`runtime/gfx/runtime.{h,cpp}`)

VulkanStream::submit appends one TrackedCmdbuf{fence, cmd_buffer} per submit. The vector is only cleared in command_sync() / wait_idle(). Workloads that push hundreds of kernels before any host-side observable (MPM, iterative field solves) accumulate hundreds of live fences + cmdbuffers + descriptor sets; MoltenVK's encoder-state tracker SIGSEGVs somewhere around that size. GfxRuntime::submit_current_cmdlist_if_timeout now also drains the queue every kMaxPendingLaunches = 32 launches via a bounded synchronize; workloads that already touch a Python observable per iteration are unaffected (ctx_buffers_ clears earlier via the normal synchronize path).
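
The safety valve amounts to a launch counter that forces a drain at a fixed threshold. The following toy model assumes the threshold of 32 stated above; `ToyRuntime` is a hypothetical stand-in for `GfxRuntime` and tracks only queue length, not real fences or command buffers.

```python
K_MAX_PENDING_LAUNCHES = 32  # mirrors kMaxPendingLaunches from the text


class ToyRuntime:
    def __init__(self) -> None:
        self.submitted = []  # stands in for submitted_cmdbuffers_
        self.pending_launches_since_sync = 0
        self.sync_count = 0
        self.peak_queue = 0

    def synchronize(self) -> None:
        # Draining the queue also resets the counter.
        self.submitted.clear()
        self.pending_launches_since_sync = 0
        self.sync_count += 1

    def launch(self, kernel_id: int) -> None:
        self.submitted.append(kernel_id)
        self.peak_queue = max(self.peak_queue, len(self.submitted))
        self.pending_launches_since_sync += 1
        if self.pending_launches_since_sync >= K_MAX_PENDING_LAUNCHES:
            self.synchronize()


runtime = ToyRuntime()
for kernel in range(100):
    runtime.launch(kernel)
```

Because any explicit `synchronize()` resets the counter, a loop that already syncs every iteration never reaches the threshold, which matches the claim that such workloads are unaffected.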

Per-backend coverage matrix

| Backend | Affected by this PR? | Verdict |
| --- | --- | --- |
| CPU (LLVM) | No | N/A - does not compile Vulkan RHI. |
| CUDA (LLVM) | No | N/A - does not compile Vulkan RHI. |
| AMDGPU (LLVM) | No | N/A - does not compile Vulkan RHI. |
| Metal (SPIR-V) | Indirectly | None of the Apple-guarded caps / cap-sanitisation code reaches the Metal RHI. Validated via CI `Test on Mac (15, 3.)` / `(26, 3.)`. |
| Vulkan on Apple / MoltenVK | Yes | MoltenVK is now LunarG-sourced; PSB + BDA enabled; `non_semantic_info` / `shared_atomic_float` caps off to match what MoltenVK's MSL translator actually supports; descriptor-set + cmdbuffer-queue lifecycle fixes in place. Covered end-to-end by `Test on Mac (15, 3.)` / `(26, 3.)`. |
| Vulkan on Linux | Yes | The `buffer_info.usage` fix and the `vkGetBufferDeviceAddressKHR` bit-gate fix also apply here; they resolve the `test_print` stderr validation-layer failures that `test_gpu / Test Linux Vulkan` was reporting. `.tar.xz` branch in `vulkan.py` and the Linux PSB / non-semantic-info paths are untouched. |
| Vulkan on Windows (SPIR-V) | No | MSI branch in `vulkan.py` untouched; the `buffer_info.usage` fix applies but is a no-op relative to the pre-PR state because Windows was already validation-clean. |

Tests

CI

Test on Mac (15, 3.10-3.13) and Test on Mac (26, 3.10-3.13) exercise the new fetch end-to-end and run the full Vulkan-backend test matrix. Pre-PR: test_tile16_*[arch=vulkan-*] / test_mpm88_numpy_and_ndarray[arch=vulkan-0] / test_shared_array_float_atomics[arch=vulkan-*-dtype1-{add,sub}] fail. Post-PR: those pass; any new regressions surface here.
test_gpu / Test Linux Vulkan exercises the vkGetBufferDeviceAddressKHR bit-gate fix by running with validation enabled. Pre-PR: test_print_*[arch=vulkan] fail because VUID-VkBufferDeviceAddressInfo-buffer-02601 warnings pollute stderr; post-PR those go quiet.
Manylinux wheel Build/Test (ubuntu-22.04 / ubuntu-22.04-arm) validates that the Linux branch of vulkan.py is unchanged.
Windows 2025 Build/Test (3.10-3.13) validates that the Windows branch of vulkan.py is unchanged.

Local smoke

./build.py --shell -- cmake -S . -B build -DQD_WITH_VULKAN=ON && ./build.py on macOS-26 / arm64 succeeds and stages libMoltenVK.dylib into build/.
python -c "import quadrants as qd; qd.init(arch=qd.vulkan); print(qd.lang.impl.current_cfg().spirv_has_physical_storage_buffer)" reports True after this PR; reports False before it.
CMAKE_BUILD_TYPE=Debug cmake --log-level=DEBUG shows the MoltenVK: using LunarG Vulkan SDK copy at ... status line.

No unit tests are added by this PR itself: the SDK and RHI changes surface via the existing Vulkan-backend test matrix, which is the regression harness. The atomic-fetch-add and debug-printf MoltenVK quirks are already covered by test_shared_array_float_atomics and the existing debug=True-using test_matrix / test_tile16 cases respectively.

Side-effect audit

| Concern | Where checked | Verdict |
| --- | --- | --- |
| Linux / Windows Vulkan SDK fetch | `vulkan.py` `case ("Linux", "x86_64")` / `case ("Windows", "AMD64")` branches untouched | ok - no behaviour change outside Apple |
| `VULKAN_SDK` env var semantics | exported from the shell hook; consumed by `quadrants/rhi/CMakeLists.txt` (BSD find path), `quadrants/rhi/vulkan/vulkan_device_creator.cpp` (runtime loader), shader compiler `glslang` lookup | ok - existing consumers keep using `$VULKAN_SDK`; the new macOS prefix looks identical in layout |
| `MOLTENVK_DIR` env var | new. Only read by `quadrants/rhi/CMakeLists.txt`; no runtime lookup | ok - opt-in; falls back to `$VULKAN_SDK/lib` |
| CMake cache | `find_file(MOLTEN_VK ...)` is cached; after the SDK is installed the first configure populates it and subsequent configures skip | ok - deleting `~/.cache/quadrants/vulkan-macos-.../` + `rm -rf build` regenerates from scratch |
| BDA feature-bit gate | now the sole gate on `spirv_has_physical_storage_buffer` | intentional - Vulkan 1.3 devices without BDA (present on some headless CI drivers) no longer get PSB set |
| `alloc_info.usage` → `buffer_info.usage` fix | `allocate_memory` branch attaches `SHADER_DEVICE_ADDRESS_BIT` to the Vulkan buffer usage only; VMA allocation usage is unchanged | ok - VMA allocation strategy untouched; only the Vulkan buffer carries the new usage bit |
| `vkGetBufferDeviceAddressKHR` bit-gate | added alongside the fix; buffers without the bit skip the BDA query cleanly | ok - uniform / vertex / transfer-only buffers no longer spam validation or return garbage addresses |
| `non_semantic_info` skipped on Apple | gate is `#if !defined(__APPLE__)`; other platforms unaffected | ok - covered by per-backend matrix above |
| `shared_atomic_float*` skipped on Apple | same gate; CAS-emulated path already exists in `atomic_operation_widened` | ok - pure correctness fix for Apple; other platforms retain native support |
| Non-Apple behaviour | Apple-guarded block (`if (APPLE)`) in `CMakeLists.txt`; Apple-guarded caps in `vulkan_device_creator.cpp` | ok - Linux / Windows Vulkan paths untouched |
| Download cache | `download_dep(url, installer_dir, strip=1)` uses the existing cache primitive; re-runs short-circuit on cached unzip + on the existence of `$prefix/macOS/` | ok - no network on re-run |
| Installer execute bit | `installer_bin.chmod(0o755)` before `subprocess.check_call`; idempotent | ok - `zipfile` dropped mode `0644`, handled here |
| `$VK_LAYER_PATH` | set on every platform branch including the new macOS one | ok - layer validation remains wired |
| Legacy Taichi pin removal | `CMakeLists.txt` previously `curl`-ed `libMoltenVK.dylib.zip` from `taichi_assets` - removed. `FATAL_ERROR` replaces the silent fallback. | intentional; no legacy fallback is shipped |
| Lazy `NonSemantic.DebugPrintf` import | no call to `call_debugprintf` -> no `OpExtInstImport`; every previously-working Vulkan driver still sees the import when a kernel actually needs it | ok - no effect on platforms that don't advertise the cap |
| Descriptor-set reuse | `vkFreeDescriptorSets` per-set on destruction, pool retains `VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT` | ok - the call is legal by construction; verified by CI on Mac and Linux |
| `pending_launches_since_sync_` threshold | 32 launches; reset on every `synchronize()`; only fires when no Python-side observable has intervened | ok - pathological-loop safety valve; normal workloads are unaffected |

chatgpt-codex-connector

💡 Codex Review

quadrants/quadrants/codegen/llvm/codegen_llvm.cpp

Lines 2600 to 2601 in 453d0a8

    call(llvm_func, new_ctx);
    llvm_val[stmt] = result_buffer;

Propagate cpu_assert_failed after real_func calls

In TaskCodeGenLLVM::visit(FuncCallStmt), the generated caller invokes the callee with a fresh RuntimeContext and then immediately continues (call(llvm_func, new_ctx)) without checking or forwarding new_ctx->cpu_assert_failed. On CPU, assertions now rely on this flag to abort execution, so an out-of-bounds/assert failure inside @qd.real_func is swallowed at the call boundary and the caller keeps running with invalid state. This makes debug assertions inside real functions ineffective and can reintroduce post-assert memory faults instead of cleanly terminating the kernel.

claude

Additional findings (outside current diff — PR may have been updated during review):

🔴 quadrants/codegen/spirv/spirv_codegen.cpp:2320-2347 — SPIR-V codegen caches a single per-task invoc_id * stride SSA id in ad_stack_heap_thread_base_{float,int}_ and emits the underlying OpIMul via ir_->mul(...) into the current insertion block at the first AdStackAllocaStmt visit site (spirv_codegen.cpp:2324-2351). When a task contains multiple independent blocks — e.g. sibling inner range-fors that are each their own IB, each carrying its own f32 loop-carried variable — auto_diff.cpp's per-IB pipeline runs BackupSSA::run(ib) independently for each IB, so each AdStackAllocaStmt is hoisted (at most) to its own IB root. The first visit emits the OpIMul inside IB1's body block; the second visit reuses the cached SSA id from a block that does not dominate IB2, violating SPIR-V §2.16. Fix mirrors the LLVM backend's ensure_ad_stack_heap_base_llvm() (codegen_llvm.cpp:2166-2186): emit the OpIMul at the task function's entry/dispatch-entry via an insertion-point save/restore, not at the first alloca visit site.

What the bug is

get_ad_stack_heap_thread_base_float() / get_ad_stack_heap_thread_base_int() cache a single SSA id per task and emit the backing invoc_id * stride OpIMul via ir_->mul(...), which commits the instruction to the IR builder's current insertion block. Emission is triggered eagerly from visit(AdStackAllocaStmt) at the first alloca visit; every subsequent Push/Pop/LoadTop/AccAdjoint re-reads the cached id via ad_stack_heap_{float,int}_ptr(). The cached id therefore dominates all downstream uses only if the first visit happens inside a block that structurally dominates every other AdStackAllocaStmt of the same heap kind.

The code comment in spirv_codegen.cpp:2326-2335 claims this holds because the first visit "lives in the dispatch body that dominates all inner loop bodies". That premise is what the bug contradicts.

How the premise breaks: multi-IB kernels

Reverse-mode AD's pipeline (quadrants/transforms/auto_diff.cpp:2726-2755) identifies multiple independent blocks and runs PromoteSSA2LocalVar / ReplaceLocalVarWithStacks / MakeAdjoint / BackupSSA per-IB. For a kernel shaped like:
```
for i in outer:                 # struct-for (outer)
    for j in range(n):          # inner range-for #1  -> IB1 = its body
        v = x[i, j]             # AllocaStmt at IB1 root
        for _ in range(k):      # dynamic inner
            v = qd.sin(v)
        out_a[i] += v
    for j in range(n):          # inner range-for #2  -> IB2 = its body
        w = y[i, j]             # AllocaStmt at IB2 root
        for _ in range(k):
            w = qd.cos(w)
        out_b[i] += w
```
IdentifyIndependentBlocks gives IB1 = inner-loop-1's body and IB2 = inner-loop-2's body (each has its own global atomic on a different output, so each qualifies as a smallest IB). BackupSSA::run(ib) uses independent_block = ib, so the hoisted backup AdStackAllocaStmt is inserted at that IB's position 0 — not at a task-wide root that dominates both IBs.

In IR order, SPIR-V codegen then visits:
1. start_label(inner1_body_label) at the inner1 RangeForStmt header.
2. visit(AdStackAllocaStmt_v) at IB1 root. Calls get_ad_stack_heap_thread_base_float(), which routes ir_->mul(...) through DEFINE_BUILDER_BINARY_USIGN_OP(mul, Mul) -> make_value(OpIMul, ...) -> make_inst, committing the OpIMul to curr_label_ == inner1_body_label. Caches the result SSA id.
3. Exit inner1. start_label(inner2_body_label).
4. visit(AdStackAllocaStmt_w) at IB2 root. Cache hit — returns the SSA id defined in step 2.
5. visit(AdStackPushStmt) for w inside the inner dynamic loop of inner2 calls ad_stack_heap_float_ptr(...), which does ir_->add(base, ...) in inner2's body. The OpIAdd has an operand (the cached base) whose defining instruction lives in inner1_body_label.

inner1_body_label and inner2_body_label are sibling children of the outer for-loop's merge/header; neither dominates the other. SPIR-V §2.16.2 rejects this; spirv-val prints a non-dominating-use error and drivers can TDR silently.
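
The non-dominance claim can be checked mechanically on a small control-flow graph. The sketch below uses the textbook iterative dominator algorithm on a hypothetical, simplified CFG for two sibling inner loops under one outer loop; the block names are illustrative and are not taken from the generated SPIR-V.

```python
from typing import Dict, List, Set


def dominators(cfg: Dict[str, List[str]], entry: str) -> Dict[str, Set[str]]:
    """Iterative data-flow computation of each block's dominator set."""
    nodes = set(cfg)
    preds: Dict[str, Set[str]] = {n: set() for n in nodes}
    for src, succs in cfg.items():
        for dst in succs:
            preds[dst].add(src)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


# Assumed shape: an outer loop containing two sibling inner loops.
CFG = {
    "entry": ["outer_header"],
    "outer_header": ["inner1_header", "outer_merge"],
    "inner1_header": ["inner1_body", "inner1_merge"],
    "inner1_body": ["inner1_header"],
    "inner1_merge": ["inner2_header"],
    "inner2_header": ["inner2_body", "inner2_merge"],
    "inner2_body": ["inner2_header"],
    "inner2_merge": ["outer_header"],
    "outer_merge": [],
}
DOM = dominators(CFG, "entry")
```

In this graph `inner1_body` can be skipped entirely when the first inner loop runs zero iterations, so it is absent from `DOM["inner2_body"]`, while `entry` appears in every dominator set. A value defined in `entry` is therefore safe to use in both bodies; one defined in `inner1_body` is not.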

Why the refutations don't cover this

Both refutations correctly identify that BackupSSA::generic_visit hoists AdStackAllocaStmts to independent_block when a cross-block reference is detected — and this is sufficient for the narrow mutually-exclusive-if-branches within a single IB shape: MakeAdjoint creates a reverse new_if sibling to the forward if_stmt at the IB root, references from new_if's branches fall outside the forward if-branch's leaf_to_root chain, and the backup is inserted at IB root via independent_block->insert(std::move(backup_stack_alloca), 0) (auto_diff.cpp:2595). For that shape the bug report's claim is indeed partially wrong.

But the hoist is scoped to one IB at a time. When the kernel has sibling inner loops whose bodies are each IBs, each invocation of BackupSSA::run(ib) hoists its allocas to its own root — not to a task-wide block. The two resulting AdStackAllocaStmts live in sibling, mutually-non-dominating blocks. That is exactly the shape where the cached invoc_id * stride SSA id violates dominance.

The refutation about test_adstack_if_cond_snapshot_adaptive_sizing doesn't disprove this shape either: that test uses an if/elif/elif/else on a single carried variable (outputs[i_inner, i_batch]), so there is only one adstack kind and one alloca site. It produces no sibling-alloca pair and does not stress the cache.

Why the existing comment doesn't save this

The implementation's own inline comment at spirv_codegen.h:219-225 defends eager-at-alloca-site emission with:

Emitted eagerly from visit(AdStackAllocaStmt) so the OpIMul lives in the alloca's enclosing block, which strictly dominates every sibling inner loop that later references the cached SSA id.

That invariant relies on the alloca's enclosing block being task-global — i.e. the dispatch-body/function-entry block. With per-IB BackupSSA, the enclosing block is the IB root, which is task-global only when the kernel happens to have exactly one IB. The comment's invariant is therefore an accidental property of the test corpus, not a pipeline guarantee.

The LLVM backend already diagnosed the exact same concern and solved it explicitly: TaskCodeGenLLVM::ensure_ad_stack_heap_base_llvm() in codegen_llvm.cpp:2166-2186 emits the base load at entry_block via an llvm::IRBuilderBase::InsertPointGuard, with a comment calling out "two sibling adstacks under different branches of an if would trip verifyFunction with a non-dominating use". The SPIR-V side should mirror this.

Impact
- spirv-val rejects the produced SPIR-V with a non-dominating-operand error (SPIR-V §2.16.2).
- Native Metal / Vulkan drivers vary: some refuse to compile the shader, others miscompile silently.
- This is triggered by a natural reverse-mode AD shape — two accumulators with their own dynamic loops in the same kernel — and is not exercised by any of the PR's new SPIR-V heap-adstack tests.

Step-by-step proof

Consider the kernel above, with n = 4, k = 3, compiled with ad_stack_experimental_enabled=True.
1. IdentifyIndependentBlocks::run(root) returns {inner1_body_block, inner2_body_block} because each inner body is the smallest IB with a qualifying global atomic.
2. For ib = inner1_body_block:
  - ReplaceLocalVarWithStacks replaces AllocaStmt_v in place with AdStackAllocaStmt_v (at inner1_body position 0, since it was the first user stmt).
  - MakeAdjoint emits reverse code (new_for with body referencing AdStackAllocaStmt_v) appended to inner1_body_block.
  - BackupSSA examines reverse ops whose op->parent is inner1_body_block. Here inner1_body_block is in each reverse stmt's leaf_to_root, so no hoist fires. AdStackAllocaStmt_v stays at inner1_body position 0.
3. For ib = inner2_body_block: symmetric. AdStackAllocaStmt_w ends up at inner2_body position 0.
4. SPIR-V codegen's run() pre-scans IR (spirv_codegen.cpp:131-168) to size ad_stack_heap_per_thread_stride_float_. Both allocas are f32 with max_size bounded by the bounded-loop analyzer (k = 3 each), so stride ends up at ~12 f32 elements.
5. Code emission walks outer struct-for, enters inner1. visit(RangeForStmt) calls start_label(body_label_inner1). Now curr_label_ = body_label_inner1.
6. visit(AdStackAllocaStmt_v) at spirv_codegen.cpp:2420 calls get_ad_stack_heap_thread_base_float() which emits OpIMul %u32 %invoc_id %stride under body_label_inner1 and caches the SSA id as %base_ssa.
7. Exits inner1. visit(RangeForStmt) for inner2 calls start_label(body_label_inner2). curr_label_ = body_label_inner2.
8. visit(AdStackAllocaStmt_w) at the same line. Cache hit: returns %base_ssa (defined in body_label_inner1).
9. Any later visit(AdStackPushStmt) / visit(AdStackLoadTopStmt) on w calls ad_stack_heap_float_ptr(offset, count) which executes ir_->add(%base_ssa, offset_val) under body_label_inner2.
10. The OpIAdd references %base_ssa whose defining OpIMul is in body_label_inner1. In the CFG, body_label_inner1 is not on every path to body_label_inner2 (they are sibling loop bodies under the outer struct-for header), so it does not dominate the use. spirv-val's structured-dominance pass rejects the module.

Fix

Mirror the LLVM backend. Add a one-shot ensure_ad_stack_heap_thread_base_{float,int}() that:
1. Saves the current insertion point (e.g. ir_->save_insert_point() or an equivalent).
2. Switches to the function's entry/dispatch-body block (the block right after the offloaded task's function header; equivalent to LLVM's entry_block).
3. Emits the OpUConvert/OpIMul.
4. Restores the original insertion point.
5. Caches the result.
Call it from visit(AdStackAllocaStmt) and both ad_stack_heap_{float,int}_ptr lazily. This guarantees the OpIMul lives in a block that dominates every other block in the function regardless of how many IBs the task contains.
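
The save/switch/emit/restore/cache sequence can be sketched with a toy builder. `ToyBuilder`, `insert_at`, and `ensure_thread_base` are hypothetical names for illustration only; the real SPIR-V IR builder API may differ, and this model ignores block terminators.

```python
from contextlib import contextmanager


class ToyBuilder:
    def __init__(self) -> None:
        self.blocks = {"entry": []}
        self.current = "entry"
        self._next_id = 0
        self._cached_base = None

    def start_label(self, name: str) -> None:
        self.blocks.setdefault(name, [])
        self.current = name

    def emit(self, text: str) -> str:
        # Appends to whichever block is the current insertion point.
        self._next_id += 1
        ssa = f"%{self._next_id}"
        self.blocks[self.current].append(f"{ssa} = {text}")
        return ssa

    @contextmanager
    def insert_at(self, block: str):
        saved = self.current      # step 1: save the insertion point
        self.current = block      # step 2: switch to the target block
        try:
            yield
        finally:
            self.current = saved  # step 4: restore

    def ensure_thread_base(self) -> str:
        if self._cached_base is None:
            with self.insert_at("entry"):
                # step 3: emit in the block that dominates everything
                self._cached_base = self.emit("OpIMul %u32 %invoc_id %stride")
        return self._cached_base  # step 5: cached for later callers


b = ToyBuilder()
b.start_label("inner1_body")
base_first = b.ensure_thread_base()
where_after_first = b.current
b.start_label("inner2_body")
base_second = b.ensure_thread_base()
```

The multiply lands in `entry` regardless of which loop body triggers it first, so the order in which alloca sites are visited no longer affects dominance.
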
🔴 quadrants/codegen/llvm/codegen_llvm.cpp:2579-2585 — The PR adds a CPU assertion-propagation mechanism (cpu_assert_failed) but explicitly acknowledges via FIXME (codegen_llvm.cpp:2579-2584) that it is not propagated out of @qd.real_func callees. An OOB/assertion inside a real_func on CPU sets the flag on the callee's new_ctx alloca, but the caller never reads it — subsequent tasks continue running on possibly-corrupted data, which is exactly the silent-segfault class the PR is meant to prevent. Fix by zero-initializing new_ctx.cpu_assert_failed before the call, checking it after, propagating to get_context()->cpu_assert_failed, and emitting an early ret void on failure — all three steps are enumerated in the FIXME.

What the bug is

The PR's central mechanism — setting cpu_assert_failed=1 inside quadrants_assert_format_ctx and having the kernel launcher break out of the task loop — fails when the assertion fires inside a @qd.real_func callee on CPU. The callee correctly writes to its context, but the caller's context is never updated.

The specific code path

At quadrants/codegen/llvm/codegen_llvm.cpp:2585, visit(FuncCallStmt) allocates the callee's context via create_entry_block_alloca(RuntimeContext) and only initializes the runtime field on line 2586. The call is then emitted on line 2600 via call(llvm_func, new_ctx), with no post-call propagation.

Inside the real_func body compilation (stmt->func->ir->accept(this) on line 2575), any AssertStmt routes through use_ctx_variant=true (since arch_is_cpu) and calls quadrants_assert_format_ctx with get_context() == get_arg(0), which is the caller's new_ctx pointer. When the assertion fires, runtime.cpp:845 writes new_ctx->cpu_assert_failed = 1 and codegen_llvm.cpp:1182 emits an early CreateRetVoid.

Why existing code doesn't prevent it

Back in the caller's task body, the flag on new_ctx is never copied into the caller's context. The outer launch_offloaded_tasks loop in quadrants/runtime/cpu/kernel_launcher.cpp:13-22 only checks ctx.get_context().cpu_assert_failed — but that context belongs to the task-level scope, not the real_func call. Regular @qd.func is AST-inlined so it does not hit this path; only @qd.real_func callees do.

Additionally, new_ctx is raw create_entry_block_alloca storage. The C++ in-class initializer int32_t cpu_assert_failed{0} in program/context.h only applies to C++ constructions, not LLVM allocas — so the slot starts with uninitialized stack bytes. This is currently latent (nothing reads it back), but it means step (1) of the fix is load-bearing once post-call propagation is added.

Impact

An OOB/assertion inside a reverse-mode or any other @qd.real_func on CPU silently fails to terminate the kernel. Subsequent tasks in the same launch_offloaded_tasks loop continue running on possibly-corrupted data — exactly the test_ndarray_oob_cpu_* / test_do_while_oob_does_not_loop_forever regression the new mechanism is meant to eliminate. None of the tests added in this PR exercise a real_func callee (all use @qd.kernel or @qd.func), so CI does not catch the gap.

How to fix

The FIXME itself enumerates the three steps:
1. Zero-init new_ctx->cpu_assert_failed after the RuntimeContext_set_runtime call (LLVM CreateStore of a constant zero to the cpu_assert_failed field of new_ctx).
2. After call(llvm_func, new_ctx), load new_ctx->cpu_assert_failed and compare against zero.
3. If non-zero, propagate via get_context()->cpu_assert_failed = 1 and emit CreateRetVoid on the caller side, matching the pattern visit(AssertStmt) already uses at lines 1175-1183.
Proof via a concrete example

Consider a kernel that calls a @qd.real_func which reads an ndarray out of bounds, then the kernel body writes to an unrelated field afterward:
```
import quadrants as qd

@qd.real_func
def oob_reader(a: qd.types.ndarray(dtype=qd.f32, ndim=1)) -> qd.f32:
    return a[100]  # a.shape == (4,), fires OOB assert

@qd.kernel
def k(a: qd.types.ndarray(dtype=qd.f32, ndim=1), b: qd.types.ndarray(dtype=qd.f32, ndim=1)):
    for i in range(4):
        v = oob_reader(a)
        b[i] = v  # executes even after the assert in oob_reader fires
```
Step-by-step at runtime with debug=True, check_out_of_bound=True:
1. k enters its task function; ctx is the outer task's RuntimeContext with cpu_assert_failed=0 (cleared by launch_offloaded_tasks line 9).
2. visit(FuncCallStmt) emitted: new_ctx = alloca RuntimeContext (line 2585); RuntimeContext_set_runtime(new_ctx, runtime) (line 2586). new_ctx->cpu_assert_failed is stack garbage but unread.
3. call(oob_reader, new_ctx) jumps into the callee.
4. Inside oob_reader, the OOB AssertStmt fires. use_ctx_variant is true. get_context() returns get_arg(0) == new_ctx. quadrants_assert_format_ctx(new_ctx, false, ...) sets new_ctx->cpu_assert_failed = 1 (runtime.cpp:845) and returns 1.
5. The callee's visit(AssertStmt) epilogue (lines 1175-1183) sees the non-zero return, emits ret void. Control returns to the caller.
6. The caller does not check new_ctx->cpu_assert_failed — execution continues. b[i] = v is written (with whatever garbage v holds from the early-returned callee).
7. The for-loop in k iterates; the next iteration calls oob_reader again (same behaviour).
8. Eventually the task returns. launch_offloaded_tasks checks ctx.get_context().cpu_assert_failed — still 0, because nothing touched the outer context. The loop does not break. If there are more offloaded tasks, they also run.
9. The debug=True post-task check_runtime_error does eventually surface the assertion (via runtime->error_code set by quadrants_assert_format itself), but only after every subsequent task has already executed on corrupted state.
With the three-step fix, step 6 becomes: load new_ctx->cpu_assert_failed (reading the zero-init-then-maybe-set-to-1 slot), branch to a propagate block that stores 1 into get_context()->cpu_assert_failed and emits ret void, matching the invariant that every other assert-propagation site already upholds.
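The control flow before and after the fix can be modelled in plain Python. This is a sketch under stated assumptions, not the generated LLVM IR. Ctx, call_real_func and run_task are hypothetical stand-ins, the propagate switch exists only to compare the two behaviours, and Python cannot reproduce uninitialized stack bytes, so step 1 appears as an explicit store.

```python
from dataclasses import dataclass


@dataclass
class Ctx:
    """Stand-in for RuntimeContext; only the flag discussed above is modelled."""
    cpu_assert_failed: int = 0


def oob_reader(ctx, a, index):
    """Models the callee: on OOB, set the flag on its own context and return early."""
    if not (0 <= index < len(a)):
        ctx.cpu_assert_failed = 1
        return None
    return a[index]


def call_real_func(caller_ctx, a, index, propagate):
    """Models visit(FuncCallStmt); propagate=True applies steps 2 and 3."""
    new_ctx = Ctx()
    new_ctx.cpu_assert_failed = 0                      # step 1: zero-init the slot
    value = oob_reader(new_ctx, a, index)
    if propagate and new_ctx.cpu_assert_failed != 0:   # step 2: load and compare
        caller_ctx.cpu_assert_failed = 1               # step 3: propagate ...
        return False, None                             # ... and bail out
    return True, value


def run_task(a, b, propagate):
    ctx = Ctx()
    for i in range(len(b)):
        ok, v = call_real_func(ctx, a, 100, propagate)
        if not ok:
            break  # models the caller-side early `ret void`
        b[i] = v
    return ctx


a = [1.0, 2.0, 3.0, 4.0]

b_before = [0.0] * 4
ctx_before = run_task(a, b_before, propagate=False)

b_after = [0.0] * 4
ctx_after = run_task(a, b_after, propagate=True)
```

Without propagation the outer flag stays 0 and every b[i] is overwritten with the callee's garbage result. With propagation the outer flag is 1 and b is left untouched, so a launcher loop that checks the flag could stop before the next task.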

hughperkins · 2026-04-23T07:14:01Z

Question (I don't have a strong opinion on this point, just posing the question):

The earlier comment in the code in question suggests installing the MoltenVK SDK using brew.
What do you see as the good and bad points of the two possible approaches (i.e. downloading from LunarG vs using the Homebrew version)?

duburcqa · 2026-04-23T07:14:15Z

I'm not a huge fan of allowing system dependencies. By forcing our own version of MoltenVK, we can guarantee that it works. We do not support any version other than the one shipping with Quadrants, and I don't think we want to explore that option. If a dev wants to try a specific version, it is very easy to change it in the CMake file and delete the cache to force re-downloading the updated version. So I don't see any real limitation in not supporting a system-wide Homebrew version, except that it forces a download, so bootstrapping the build env takes more time the first time.

hughperkins · 2026-04-23T07:16:56Z

Question: is what we are downloading a binary, or source code? If source code, what is the impact on build time of using downloaded source code vs brew? (As an example of something we'd want to avoid: building the LLVM SDK takes a looong time and a lot of effort, so pre-built is best, I feel. I don't know about the MoltenVK SDK, hence posing the question.)

duburcqa · 2026-04-23T07:18:47Z

We are installing the entire SDK, exactly as we do on Windows. This provides sources and SOME precompiled binaries. In this case all we actually use is the pre-compiled MoltenVK binary, and we do not build it. So there is no impact on build time.

hughperkins · 2026-04-23T07:20:16Z

Ok. And what about Mac SIP? How are we avoiding triggering that? Have you tested this build on your own Mac locally, and you confirm no SIP issues?

duburcqa · 2026-04-23T07:24:00Z

I don't know what SIP means, but I can confirm it builds locally and runs without issue. We are bundling the MoltenVK dylib directly in the wheels (which was the pre-existing behaviour before this PR), so what we are distributing is reasonably standalone. It would be completely standalone if we properly ran the delocate audit tool on the wheels before pushing to PyPI to ensure proper name mangling. I don't think that prevents every possible symbol collision, but it is fine in the vast majority of cases.

duburcqa · 2026-04-24T10:37:56Z

@claude review

claude

Inline comments cover two nit-level doc issues; I'm deferring to a human reviewer given the scope here — the PR touches Vulkan device creation, BDA / PSB capability gating, descriptor-set lifecycle, SPIR-V codegen, and a heuristic 32-launch drain for MoltenVK, which is more than I'd want to shadow-approve.

Extended reasoning...

Overview

Six-commit PR that swaps the 2022 Taichi-pinned MoltenVK for a LunarG Vulkan SDK fetch driven by build.py --shell, unblocks physical-storage-buffer (BDA) on Apple, and lands four follow-on MoltenVK quirk fixes: a latent alloc_info.usage → buffer_info.usage typo in VulkanDevice::allocate_memory, NonSemantic.DebugPrintf + shaderSharedFloat*AtomicAdd cap sanitisation on Apple with a companion lazy-import in the SPIR-V IR builder and format-string sanitisation in the overflow-diagnostic path, and a pending_launches_since_sync_ counter in GfxRuntime that forces a drain every 32 launches to bound VulkanStream::submitted_cmdbuffers_ growth.
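The drain-every-32-launches valve described in the overview can be sketched as a simple counter. This is a minimal Python model under stated assumptions, not the GfxRuntime C++ code. ToyStream and ToyRuntime are hypothetical, and where exactly the real runtime checks the counter is not shown in this thread; only the counter name and the value 32 come from the review.

```python
K_MAX_PENDING_LAUNCHES = 32  # mirrors kMaxPendingLaunches = 32 from the review


class ToyStream:
    """Stand-in for the Vulkan stream: the queue grows until it is drained."""

    def __init__(self):
        self.submitted_cmdbuffers = []
        self.drains = 0

    def submit(self, cmdbuf):
        self.submitted_cmdbuffers.append(cmdbuf)

    def synchronize(self):
        # Waiting for completion lets every queued command buffer be released.
        self.submitted_cmdbuffers.clear()
        self.drains += 1


class ToyRuntime:
    def __init__(self, stream):
        self.stream = stream
        self.pending_launches_since_sync = 0
        self.peak_queue_len = 0

    def launch_kernel(self, name):
        self.stream.submit(name)
        self.pending_launches_since_sync += 1
        self.peak_queue_len = max(self.peak_queue_len,
                                  len(self.stream.submitted_cmdbuffers))
        # Safety valve: force a drain once the bound is reached.
        if self.pending_launches_since_sync >= K_MAX_PENDING_LAUNCHES:
            self.stream.synchronize()
            self.pending_launches_since_sync = 0


stream = ToyStream()
runtime = ToyRuntime(stream)
for step in range(100):
    runtime.launch_kernel(f"kernel_{step}")
```

In this model the queue never holds more than 32 entries regardless of how long the launch loop runs, at the cost of one forced synchronization per 32 launches.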

Security risks

Low for the Apple-guarded paths (cap sanitisation is #if !defined(__APPLE__) and behind the validation-layer gate). The new SDK fetch invokes a LunarG installer binary via subprocess.check_call; the URL and binary name are hard-coded (no shell interpolation), but the LunarG download is a third-party supply-chain dependency the build hadn't had before. The buffer_info.usage typo fix tightens a pre-existing dormant bug rather than introducing new surface.

Level of scrutiny

High. This is cross-platform RHI / build / codegen infrastructure. The PSB capability flip on Apple changes behaviour for every Vulkan-on-Apple user, the vkGetBufferDeviceAddressKHR gating fix also affects Linux, and the kMaxPendingLaunches = 32 drain is a heuristic workaround for a MoltenVK SIGSEGV whose exact threshold ("somewhere around a few hundred") isn't precisely characterised. A human with MoltenVK / Vulkan RHI context should sanity-check the cap-sanitisation choices and the drain threshold.

Other factors

Open discussion in the timeline about Mac SIP that the author answered with "I don't know what SIP means" — worth a human confirming the distributed wheel actually loads cleanly on a fresh Apple Silicon Mac.
No unit tests added; regression coverage relies on the existing Vulkan-backend CI matrix (Mac 15/26, Linux Vulkan).
Comments are dense and the two inline nits (stale path in entry.py, duplicated safety-valve paragraph in runtime.cpp) suggest at least one more editorial pass on the doc before merge.

hughperkins · 2026-04-24T12:37:36Z

Doc: I wonder if we should have a table somewhere in our docs with OSes as the rows (Windows, Mac, Ubuntu) and supported arch types as the columns (CUDA, Vulkan, ARM, x86, etc.)?

duburcqa · 2026-04-24T13:31:19Z

Good suggestion. Added a Backend / OS matrix to docs/source/user_guide/supported_systems.md in 9c0e306 - rows are macOS (Apple Silicon), Linux x64, Linux ARM64, Windows x86, Windows ARM64; columns are qd.cpu / qd.cuda / qd.amdgpu / qd.metal / qd.vulkan; n/a marks combinations where the vendor platform itself is unavailable (no NVIDIA driver on macOS / Linux ARM64 / Windows ARM64, no ROCm outside Linux x64, no Metal outside Apple). Short notes under the table capture the driver-runtime prerequisites and the bundled MoltenVK on macOS.

hughperkins · 2026-04-24T13:37:58Z

checklist:

doc updated appropriately (added support matrix of supported backends)

=> ok to merge

…ild.py --shell and enable spirv_has_physical_storage_buffer on Apple

…ture bit, not on CHECK_VERSION(1,3) alone

…p.py develop` works after `build.py --shell` exits

…bug_printf so MoltenVK stops rejecting debug-capable kernels

…ueue every 32 launches so MoltenVK stops SIGSEGVing on atomic-float kernels and long simulation loops

…ime::submit_current_cmdlist_if_timeout

…d_systems.md, fix entry.py MoltenVK path comment to match vulkan.py, escape '%' in debug-printf overflow traceback so SPIRV-Cross -> MSL on MoltenVK does not interpret it as a format specifier

No functional changes. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(amdgpu): use CreateNeg for branchless i32 sgn instead of CreateSub(0, input) clang-tidy (modernize-use-nullptr, -warnings-as-errors) flagged `builder->CreateSub(0, input)` in the i32 sgn path: the literal `0` binds to the `llvm::Value*` LHS parameter as a null pointer, not an integer zero. Replace with `builder->CreateNeg(input)`, which emits `0 - input` with a proper zero constant -- identical intended semantics, and clang-tidy clean. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Robert Dazi <14996868+v01dXYZ@users.noreply.github.com> Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> Co-authored-by: Alexis DUBURCQ <alexis.duburcq@gmail.com> Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com> Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Johnny <johnnynuca14@gmail.com>

duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 453d0a8 to 416d869 Compare April 23, 2026 05:43

chatgpt-codex-connector Bot reviewed Apr 23, 2026

duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 416d869 to f213cd6 Compare April 23, 2026 05:54

duburcqa force-pushed the duburcqa/adstack_bounded_loop_sizing branch from c8f36e6 to e76a5a0 Compare April 23, 2026 06:00

claude Bot reviewed Apr 23, 2026

duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 6c6d25d to 14ae74f Compare April 23, 2026 07:57

duburcqa changed the title ~~[Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch~~ [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch + free VkDescriptorSet on shared_ptr release Apr 23, 2026

duburcqa changed the title ~~[Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch + free VkDescriptorSet on shared_ptr release~~ [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch Apr 23, 2026

duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch 2 times, most recently from 44679dd to 8123ff1 Compare April 23, 2026 11:45

duburcqa force-pushed the duburcqa/adstack_bounded_loop_sizing branch from ef0b228 to 938b32a Compare April 23, 2026 12:56

duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 8123ff1 to 526292d Compare April 23, 2026 12:56

duburcqa force-pushed the duburcqa/adstack_bounded_loop_sizing branch from 938b32a to 29b4c1a Compare April 23, 2026 13:15

duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 526292d to 860a914 Compare April 23, 2026 13:15

duburcqa force-pushed the duburcqa/adstack_bounded_loop_sizing branch from 29b4c1a to 9c959ff Compare April 23, 2026 13:17

duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 860a914 to e9091bc Compare April 23, 2026 13:17

duburcqa force-pushed the duburcqa/adstack_bounded_loop_sizing branch from 9c959ff to b5f776f Compare April 23, 2026 13:19

duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from e9091bc to 63ebc09 Compare April 23, 2026 13:19

duburcqa force-pushed the duburcqa/adstack_bounded_loop_sizing branch from b5f776f to 505af49 Compare April 23, 2026 15:01

duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 63ebc09 to 8fdb352 Compare April 23, 2026 15:01

duburcqa force-pushed the duburcqa/adstack_bounded_loop_sizing branch from 505af49 to d8fe6ae Compare April 23, 2026 15:35

duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 8fdb352 to 69fffa2 Compare April 23, 2026 15:35

duburcqa force-pushed the duburcqa/adstack_bounded_loop_sizing branch from d8fe6ae to 18351d7 Compare April 23, 2026 18:24

duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 69fffa2 to cf9a6e6 Compare April 23, 2026 18:24

duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 1b25db2 to 9edcfe7 Compare April 24, 2026 07:29

duburcqa changed the title ~~[Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch~~ [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch and sanitise MoltenVK cap advertisement Apr 24, 2026

duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 9edcfe7 to 05440e1 Compare April 24, 2026 10:11

claude Bot reviewed Apr 24, 2026

Comment thread .github/workflows/scripts/ti_build/entry.py Outdated

Comment thread quadrants/runtime/gfx/runtime.cpp Outdated

duburcqa force-pushed the duburcqa/adstack_bounded_loop_sizing branch from 54d7dba to d074c0e Compare April 24, 2026 11:38

duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 05440e1 to 39c5a98 Compare April 24, 2026 11:38

claude Bot reviewed Apr 24, 2026

Comment thread quadrants/codegen/spirv/spirv_codegen.cpp

Comment thread .github/workflows/scripts/ti_build/vulkan.py

Comment thread quadrants/rhi/CMakeLists.txt

duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 9c0e306 to 649f1a0 Compare April 24, 2026 13:35

hughperkins reviewed Apr 24, 2026

Comment thread docs/source/user_guide/supported_systems.md

duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 649f1a0 to 5b6aec9 Compare April 24, 2026 13:38

hughperkins added the ok-to-merge label Apr 24, 2026

claude Bot reviewed Apr 24, 2026

Comment thread docs/source/user_guide/supported_systems.md Outdated

Comment thread .github/workflows/scripts/ti_build/entry.py

Base automatically changed from duburcqa/adstack_bounded_loop_sizing to main April 24, 2026 13:52

duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 5b6aec9 to 59101da Compare April 24, 2026 14:00

duburcqa added 7 commits April 24, 2026 17:15

[Build] Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch via build.py --shell and enable spirv_has_physical_storage_buffer on Apple

91ac68c

[Build] Gate Vulkan PSB capability on queried bufferDeviceAddress feature bit, not on CHECK_VERSION(1,3) alone

ff3d3b1

[Build] Find MoltenVK from ti-build-cache SDK extract so `python setup.py develop` works after `build.py --shell` exits

74d33d5

[Vulkan] Drop spirv_has_non_semantic_info on Apple and lazy-import debug_printf so MoltenVK stops rejecting debug-capable kernels

9f59bb0

[Vulkan] Drop shared_atomic_float caps on Apple and drain cmdbuffer queue every 32 launches so MoltenVK stops SIGSEGVing on atomic-float kernels and long simulation loops

3110aa1

[Vulkan] Consolidate duplicate safety-valve comment blocks in GfxRuntime::submit_current_cmdlist_if_timeout

064cb03

[Docs/Vulkan] Document backend-vs-OS compatibility matrix in supported_systems.md, fix entry.py MoltenVK path comment to match vulkan.py, escape '%' in debug-printf overflow traceback so SPIRV-Cross -> MSL on MoltenVK does not interpret it as a format specifier

2859a1d
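
The `%` escaping named in the last commit title (2859a1d) can be illustrated with a minimal Python sketch. It is not the project's code: `escape_percent` is a hypothetical helper, and Python's printf-style `%` operator merely stands in for the downstream format-string consumer. The idea is that doubling each `%` makes a printf-style formatter emit it literally instead of parsing it as a specifier.

```python
def escape_percent(text: str) -> str:
    """Double every '%' so a printf-style consumer treats it as a literal.

    Hypothetical helper for illustration; the real fix lives in the
    SPIR-V codegen and is not reproduced here.
    """
    return text.replace("%", "%%")


# Hypothetical traceback text containing a bare '%'.
traceback_line = "adstack overflow at 100% capacity"
escaped = escape_percent(traceback_line)

# Printf-style formatting with no arguments collapses "%%" back to "%",
# so the escaped string round-trips to the original message.
assert escaped % () == traceback_line
```

Strings without any `%` pass through unchanged, so the escape can be applied unconditionally to user-visible message text before it is embedded in a format string.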

duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 59101da to 2859a1d Compare April 24, 2026 15:17

claude Bot reviewed Apr 24, 2026

Comment thread .github/workflows/scripts/ti_build/vulkan.py

duburcqa merged commit ae2d1c0 into main Apr 24, 2026
57 of 58 checks passed

duburcqa deleted the duburcqa/moltenvk_sdk_source branch April 24, 2026 17:24

duburcqa commented Apr 23, 2026 • edited

LunarG-sourced MoltenVK on Apple unblocks PSB (BDA) for the runtime adstack sizer, plus the three MoltenVK quirks that surface once PSB, validation, and long kernel-loop workloads are live

TL;DR

Why

Surface API

Entry points

Mechanism end-to-end

1. SDK acquisition (vulkan.py)

2. CMake pickup (quadrants/rhi/CMakeLists.txt)

3. PSB capability unblocked (vulkan_device_creator.cpp)
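
As a rough illustration of this step, the sketch below models the gating rule from the commit title ("on queried bufferDeviceAddress feature bit, not on CHECK_VERSION(1,3) alone") in Python. `DeviceInfo` and `psb_enabled` are hypothetical names, and requiring both the version check and the feature bit is an assumption about how the two conditions combine; the actual logic is C++ in `vulkan_device_creator.cpp`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DeviceInfo:
    """Hypothetical stand-in for what the device creator queries."""
    api_version: Tuple[int, int]    # (major, minor) reported by the driver
    buffer_device_address: bool     # queried bufferDeviceAddress feature bit


def psb_enabled(info: DeviceInfo) -> bool:
    # Assumption: the capability needs BOTH a new-enough API version and the
    # queried feature bit. A version check alone would wrongly enable PSB on
    # a driver that reports 1.3 but does not expose the feature.
    return info.api_version >= (1, 3) and info.buffer_device_address


# A 1.3 driver that lacks the feature bit must not advertise PSB.
assert not psb_enabled(DeviceInfo(api_version=(1, 3), buffer_device_address=False))
# A 1.3 driver that reports the feature bit may.
assert psb_enabled(DeviceInfo(api_version=(1, 3), buffer_device_address=True))
```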

4. vkGetBufferDeviceAddressKHR now sees the right usage bit (vulkan_device.cpp)

5. MoltenVK cap sanitisation (vulkan_device_creator.cpp)

6. Companion lazy-import + format-string sanitisation (spirv_ir_builder.{cpp,h}, spirv_codegen.cpp)

7. Descriptor-set lifecycle fix (vulkan_api.cpp)

8. Periodic submitted_cmdbuffers_ drain (runtime/gfx/runtime.{h,cpp})
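
A minimal Python model of the periodic drain, assuming the behaviour described in the commit title ("drain cmdbuffer queue every 32 launches"). `StreamSketch`, `launch`, and `drain` are hypothetical names for illustration; the real runtime is C++, and its drain presumably waits on the GPU before releasing command buffers, which this sketch reduces to clearing a queue.

```python
from collections import deque

DRAIN_INTERVAL = 32  # taken from the commit title: "every 32 launches"


class StreamSketch:
    """Toy model of a stream whose submitted-command-buffer queue is bounded."""

    def __init__(self) -> None:
        self.submitted_cmdbuffers = deque()
        self.launches_since_drain = 0
        self.drains = 0

    def launch(self, cmdbuffer) -> None:
        # Each kernel launch appends one submitted command buffer.
        self.submitted_cmdbuffers.append(cmdbuffer)
        self.launches_since_drain += 1
        if self.launches_since_drain >= DRAIN_INTERVAL:
            self.drain()

    def drain(self) -> None:
        # Assumption: the real drain synchronises with the device first;
        # here we only model that the queue is emptied and the counter reset.
        self.submitted_cmdbuffers.clear()
        self.launches_since_drain = 0
        self.drains += 1


stream = StreamSketch()
for i in range(100):
    stream.launch(i)

# 100 launches trigger drains at launches 32, 64 and 96, leaving 4 queued.
assert stream.drains == 3
assert len(stream.submitted_cmdbuffers) == 4
```

The point of the safety valve is that the queue length stays bounded by the interval no matter how long the kernel-launch loop runs.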

Per-backend coverage matrix

Tests

CI

Local smoke

Side-effect audit

chatgpt-codex-connector Bot left a comment

💡 Codex Review

claude Bot left a comment

What the bug is

How the premise breaks: multi-IB kernels

Why the refutations don't cover this

Why the existing comment doesn't save this

Impact

Step-by-step proof

Fix

What the bug is

The specific code path

Why existing code doesn't prevent it

Impact

How to fix

Proof via a concrete example

hughperkins commented Apr 23, 2026

duburcqa commented Apr 23, 2026

hughperkins commented Apr 23, 2026

duburcqa commented Apr 23, 2026

hughperkins commented Apr 23, 2026

duburcqa commented Apr 23, 2026

duburcqa commented Apr 24, 2026

claude Bot left a comment

Overview

Security risks

Level of scrutiny

Other factors

hughperkins commented Apr 24, 2026

duburcqa commented Apr 24, 2026

hughperkins commented Apr 24, 2026
