[Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch and sanitise MoltenVK cap advertisement#551

Merged: duburcqa merged 7 commits into main from duburcqa/moltenvk_sdk_source on Apr 24, 2026

@duburcqa duburcqa commented Apr 23, 2026

LunarG-sourced MoltenVK on Apple unblocks PSB (BDA) for the runtime adstack sizer, plus the three MoltenVK quirks that surface once PSB, validation, and long kernel-loop workloads are live

Six commits. Originally three: swap the 2022 Taichi-pinned MoltenVK for a LunarG Vulkan SDK fetch driven by ./build.py --shell, gate the PSB capability on the queried bufferDeviceAddress feature bit, and locate the staged dylib from the SDK extract so python setup.py develop keeps working after the shell exits. Three more were added once the new MoltenVK was exercised against the reverse-mode / MPM / tile16 test matrix: a vkGetBufferDeviceAddressKHR buffer-usage-bit fix that was latent while PSB was off on Apple, a NonSemantic.DebugPrintf / shaderSharedFloat32AtomicAdd cap-sanitisation pair covering capabilities that MoltenVK advertises but cannot actually service, and a GfxRuntime safety valve that drains the Vulkan stream's submitted_cmdbuffers_ queue on long kernel-launch loops so MoltenVK's encoder-state tracker stops SIGSEGVing. The PR now ships a MoltenVK that is functional for BDA-backed reverse-mode workloads and does not regress the rest of the SPIR-V test suite on CI's Mac runners.

TL;DR

# macOS / arm64, from a clean checkout:
./build.py --shell -- cmake -S . -B build -DQD_WITH_VULKAN=ON
# $VULKAN_SDK / $MOLTENVK_DIR are exported by the shell hook;
# CMake's find_file locates libMoltenVK.dylib under the SDK and stages it into build/.
./build.py
python -c "import quadrants as qd; qd.init(arch=qd.vulkan)"
# Now reports spirv_has_physical_storage_buffer=True on Apple, and the full
# adstack / tile16 / MPM test matrix is green on Mac CI (15 & 26) and Linux Vulkan.

The shell hook fetches LunarG's macOS installer once, extracts the SDK into ~/.cache/quadrants/vulkan-macos-1.4.321.0/, and exports VULKAN_SDK / MOLTENVK_DIR. CMake then picks up libMoltenVK.dylib from the SDK instead of Taichi's legacy pinned dylib. Downstream, the adstack sizer compute shader (Autodiff 17) relies on BDA to walk SizeExpr trees on device, so enabling PSB on Apple is the gating change that makes that shader legal to dispatch. The three add-on commits then cover what the freshly enabled PSB path exposes: a latent buffer-usage-bit typo, two MoltenVK caps that are advertised but broken, and a cmdbuffer-queue drain that repeated kernel launches need.

Why

The previous Apple Vulkan path pinned a 2022 MoltenVK dylib hosted on taichi_assets, predating the VK_KHR_buffer_device_address / physical-storage-buffer capability. Three concrete consequences:

  • vulkan_device_creator.cpp hard-guarded spirv_has_physical_storage_buffer off on Apple behind #if !defined(__APPLE__) && false, citing taichi-dev/taichi#6295.
  • The adstack sizer shader that lands in Autodiff 17 needs OpLoad through a PhysicalStorageBuffer-class pointer to read SizeExpr leaves; with PSB disabled on Apple, every reverse-mode kernel would hard-error at launch time on Metal.
  • A dormant typo in VulkanDevice::allocate_memory (alloc_info.usage & VK_BUFFER_USAGE_STORAGE_BUFFER_BIT instead of buffer_info.usage & ...) made the "attach SHADER_DEVICE_ADDRESS_BIT" branch dead for every buffer; unreachable while the PSB cap was off, but the moment PSB is enabled every buffer becomes a validation-layer violation (Linux) or garbage-address read (MoltenVK).

The less-targeted workarounds are insufficient: keeping the Taichi pin and papering over the capability check would ship a MoltenVK that cannot serve BDA loads; asking every Quadrants developer to install LunarG's SDK globally breaks hermetic CI. Fetching through build.py --shell gives us a single, reproducible SDK path the rest of the build consumes.

Surface API

No Python-surface API changes. All diff is build-system (vulkan.py, quadrants/rhi/CMakeLists.txt), Vulkan RHI internals (vulkan_device_creator.cpp, vulkan_device.cpp, vulkan_api.cpp), and SPIR-V codegen internals (spirv_ir_builder.cpp, spirv_codegen.cpp, runtime/gfx/runtime.cpp + runtime.h). Behaviour deltas visible to users of qd.init(arch=qd.vulkan):

  • qd.lang.impl.current_cfg().spirv_has_physical_storage_buffer flips to True on Apple.
  • qd.init(arch=qd.vulkan, debug=True) no longer fails pipeline creation on MoltenVK for kernels that emit debugPrintfEXT traffic (lazy-import + Apple cap drop).
  • Reverse-mode kernels using qd.simt.block.SharedArray with an atomic-f32 add / sub no longer fail MoltenVK's MSL compile with atomic_fetch_add_explicit(threadgroup atomic_float*, ...) - they route through the CAS-emulated fallback instead.
  • Long kernel-launch loops (MPM-style simulations, iterative field updates) no longer SIGSEGV inside MVKCommandEncoder after a few hundred launches without a qd.sync().

Entry points

| File | What changes |
| --- | --- |
| .github/workflows/scripts/ti_build/vulkan.py | setup_vulkan() gains a Darwin / arm64 branch that fetches + extracts + installs LunarG's macOS bundle. |
| quadrants/rhi/CMakeLists.txt | Apple branch locates libMoltenVK.dylib via $MOLTENVK_DIR / $VULKAN_SDK; configure_file stages it into ${CMAKE_BINARY_DIR}/libMoltenVK.dylib. FATAL_ERROR on a missing SDK. |
| quadrants/rhi/vulkan/vulkan_device_creator.cpp | Drops the Apple kill-switch around spirv_has_physical_storage_buffer. Gates the overall PSB cap on the queried bufferDeviceAddress feature bit. Skips VK_KHR_shader_non_semantic_info on Apple (advertised but the MSL translator can't emit debugPrintfEXT). Skips shaderSharedFloat{16,32,64}AtomicAdd on Apple (same reason: MSL rejects atomic_fetch_add_explicit on threadgroup atomic_float*). |
| quadrants/rhi/vulkan/vulkan_device.cpp | Fixes the alloc_info.usage → buffer_info.usage typo that made the "attach SHADER_DEVICE_ADDRESS_BIT" branch dead. Gates vkGetBufferDeviceAddressKHR on whether the bit is actually set, so uniform / vertex / transfer-only staging buffers no longer trip VUID-VkBufferDeviceAddressInfo-buffer-02601. |
| quadrants/rhi/vulkan/vulkan_api.cpp | Frees descriptor sets on shared_ptr release so MoltenVK's pool churn does not null-pool-deref after ~32 two-set kernel launches. |
| quadrants/runtime/gfx/runtime.{h,cpp} | Adds a pending_launches_since_sync_ counter; submit_current_cmdlist_if_timeout forces a synchronize() every kMaxPendingLaunches = 32 launches to bound VulkanStream::submitted_cmdbuffers_ growth on MPM-style tight kernel-launch loops. |
| quadrants/codegen/spirv/spirv_ir_builder.{cpp,h} | Lazy-imports NonSemantic.DebugPrintf only when a call_debugprintf site actually needs it, so kernels with no print / debug-assert traffic stay MoltenVK-compatible. |
| quadrants/codegen/spirv/spirv_codegen.cpp | Sanitises the overflow-diagnostic traceback before feeding it to call_debugprintf: un-escaped quotes / newlines in the traceback string survive MoltenVK's MSL translation into the output and previously produced use of undeclared identifier 'Users'-class errors from the path prefix. |

Mechanism end-to-end

1. SDK acquisition (vulkan.py)

| Platform | Source | Prefix |
| --- | --- | --- |
| Linux | vulkansdk-linux-x86_64-1.4.321.1.tar.xz (tarball, unchanged) | ~/.cache/quadrants/vulkan-1.4.321.1/x86_64/ |
| Darwin / arm64 | vulkansdk-macos-1.4.321.0.zip (installer bundle) | ~/.cache/quadrants/vulkan-macos-1.4.321.0/ |
| Windows | MSI (unchanged) | ~/.cache/quadrants/vulkan-win-1.4.321.1/ |

The macOS branch is the only new one. LunarG didn't publish a 1.4.321.1 macOS asset, so the macOS version is pinned to 1.4.321.0. zipfile extracts the installer bundle without preserving the Unix execute bit, so the script applies chmod 0755 to the installer binary before running it (idempotent, scoped to the single file). The CLI install command writes the SDK into the --root prefix.
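
The execute-bit handling can be sketched in isolation. This is a minimal illustration rather than the code in vulkan.py: the function name `extract_and_mark_executable` and the archive layout are hypothetical, and only standard-library calls are used.

```python
import zipfile
from pathlib import Path


def extract_and_mark_executable(zip_path: Path, dest: Path, member: str) -> Path:
    """Extract `zip_path` into `dest` and make `member` executable.

    ZipFile.extractall does not restore Unix permission bits, so the
    extracted file must be chmod-ed before it can be run.
    """
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    binary = dest / member  # hypothetical installer path inside the bundle
    # Idempotent and scoped to the single file.
    binary.chmod(0o755)
    return binary
```

Calling it twice on the same archive is harmless: the second chmod leaves the mode at 0755.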

2. CMake pickup (quadrants/rhi/CMakeLists.txt)

| Env var | Meaning | Consumer |
| --- | --- | --- |
| MOLTENVK_DIR | path that directly contains libMoltenVK.dylib | find_file(MOLTEN_VK libMoltenVK.dylib NO_DEFAULT_PATH PATHS ${MOLTENVK_DIR}) |
| VULKAN_SDK | SDK prefix; ${VULKAN_SDK}/lib is tried if MOLTENVK_DIR is unset | same find_file call, fallback path |

configure_file stages the located dylib into ${CMAKE_BINARY_DIR}/libMoltenVK.dylib (copy, not symlink, so the install step can re-digest it) and install(FILES ... DESTINATION ${INSTALL_LIB_DIR}/runtime) ships it alongside the runtime. A missing SDK is a FATAL_ERROR pointing at ./build.py --shell; there is no silent fallback to the legacy pin on purpose.
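
The search order can be modelled outside CMake. The sketch below is a Python stand-in for the lookup described above, not the CMake code itself; `locate_moltenvk` is a hypothetical name, and the error message is illustrative.

```python
from pathlib import Path
from typing import Mapping

DYLIB_NAME = "libMoltenVK.dylib"


def locate_moltenvk(env: Mapping[str, str]) -> Path:
    """Resolve the dylib: $MOLTENVK_DIR if set, else $VULKAN_SDK/lib."""
    candidates = []
    if env.get("MOLTENVK_DIR"):
        candidates.append(Path(env["MOLTENVK_DIR"]))
    elif env.get("VULKAN_SDK"):
        candidates.append(Path(env["VULKAN_SDK"]) / "lib")
    for directory in candidates:
        dylib = directory / DYLIB_NAME
        if dylib.is_file():
            return dylib
    # Stands in for the FATAL_ERROR: no silent fallback to a legacy pin.
    raise FileNotFoundError(
        f"{DYLIB_NAME} not found; run ./build.py --shell to fetch the SDK"
    )
```

Raising instead of returning a default keeps the "no silent fallback" property explicit in the model.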

3. PSB capability unblocked (vulkan_device_creator.cpp)

Removes the #if !defined(__APPLE__) && false kill-switch gate around caps.set(DeviceCapability::spirv_has_physical_storage_buffer, true). The surrounding gate is tightened from CHECK_VERSION(1, 3) || buffer_device_address_feature.bufferDeviceAddress to a plain feature-bit check: Vulkan 1.3 promotes VK_KHR_buffer_device_address into core but still lets implementations expose bufferDeviceAddress = VK_FALSE, so the version-OR gate was treating 1.3 devices as PSB-capable even when they weren't. Devices that genuinely don't advertise BDA (ancient drivers, headless CI without Vulkan) remain safe.
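
The difference between the two gates is easiest to see as a truth table. The following is a Python model only; `old_psb_gate` and `new_psb_gate` are hypothetical names, and CHECK_VERSION(1, 3) is assumed to mean "API version is at least 1.3".

```python
from typing import Tuple


def old_psb_gate(api_version: Tuple[int, int], bda_feature: bool) -> bool:
    """Pre-PR gate: version OR feature bit."""
    return api_version >= (1, 3) or bda_feature


def new_psb_gate(bda_feature: bool) -> bool:
    """Post-PR gate: the queried feature bit alone decides."""
    return bda_feature


# A Vulkan 1.3 device that reports bufferDeviceAddress = VK_FALSE is the
# case where the two gates disagree.
disagreement = old_psb_gate((1, 3), False) and not new_psb_gate(False)
```

For every other combination of version and feature bit in this model, the new gate is either identical to the old one or stricter.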

4. vkGetBufferDeviceAddressKHR now sees the right usage bit (vulkan_device.cpp)

Before this PR the branch that ORs VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT_KHR into buffer_info.usage was gated on alloc_info.usage & VK_BUFFER_USAGE_STORAGE_BUFFER_BIT - but alloc_info.usage is VMA's VmaMemoryUsage enum (small integers), not the Vulkan VkBufferUsageFlags bitfield. The & always yielded 0; the branch was dead; every buffer reached the vkGetBufferDeviceAddressKHR call below without the required bit. Latent while PSB was off on Apple (no one called vkGetBufferDeviceAddressKHR). Once PSB is on it fires VUID-VkBufferDeviceAddressInfo-buffer-02601 under validation (Linux CI's test_print stderr-assertion failures) and returns a garbage address under MoltenVK (Mac CI's test_tile16_* / test_mpm88_numpy_and_ndarray wrong-output failures). Fix reads buffer_info.usage instead, and additionally gates the vkGetBufferDeviceAddressKHR call on the bit actually being set, so uniform / vertex / transfer-only staging buffers skip the BDA query and keep alloc.addr == 0.
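
A small Python model makes the enum-versus-bitfield mix-up concrete. The numeric values mirror the public Vulkan and VMA headers as best recalled and should be treated as illustrative; the function names are hypothetical and do not correspond to code in vulkan_device.cpp.

```python
from enum import IntEnum, IntFlag


class VmaMemoryUsage(IntEnum):  # plain enumeration: small integers
    UNKNOWN = 0
    GPU_ONLY = 1
    CPU_ONLY = 2
    CPU_TO_GPU = 3
    GPU_TO_CPU = 4


class VkBufferUsage(IntFlag):  # bitfield
    UNIFORM_BUFFER = 0x00000010
    STORAGE_BUFFER = 0x00000020
    SHADER_DEVICE_ADDRESS = 0x00020000


def buggy_usage(alloc_usage: VmaMemoryUsage, usage: VkBufferUsage) -> VkBufferUsage:
    # Bug: masks the VMA enum with a Vulkan flag bit, which is always 0 here.
    if int(alloc_usage) & int(VkBufferUsage.STORAGE_BUFFER):
        usage |= VkBufferUsage.SHADER_DEVICE_ADDRESS
    return usage


def fixed_usage(usage: VkBufferUsage) -> VkBufferUsage:
    # Fix: test the Vulkan usage bitfield itself.
    if usage & VkBufferUsage.STORAGE_BUFFER:
        usage |= VkBufferUsage.SHADER_DEVICE_ADDRESS
    return usage


def should_query_address(usage: VkBufferUsage) -> bool:
    # Second half of the fix: only query the address if the bit is set.
    return bool(usage & VkBufferUsage.SHADER_DEVICE_ADDRESS)
```

In this model a uniform-only buffer never gains the address bit, so `should_query_address` returns False for it and the address would stay at 0.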

5. MoltenVK cap sanitisation (vulkan_device_creator.cpp)

MoltenVK advertises two Vulkan capabilities whose SPIR-V → MSL translation is broken:

  • VK_KHR_shader_non_semantic_info: the extension enumerates fine, OpExtInstImport "NonSemantic.DebugPrintf" validates, the OpExtInst call sites pass SPIR-V validation, but SPIRV-Cross emits an unconditional debugPrintfEXT(...) call stub whose identifier Metal's MSL compiler rejects (use of undeclared identifier 'debugPrintfEXT'). Every reverse-mode kernel that happens to compile with a debug=True debugPrintfEXT site fails pipeline creation on MoltenVK. Skipped on Apple.
  • shaderShared{Float32,Float16,Float64}AtomicAdd: the feature bit is set, but MoltenVK's MSL translator emits atomic_fetch_add_explicit((threadgroup atomic_float*) &x, ...) which Metal rejects with cannot pass pointer to address space 'threadgroup' as a pointer to address space 'device'. Skipped on Apple, routing shared-memory float atomics through the existing CAS-emulated fallback in atomic_operation_widened.

The skips are #if !defined(__APPLE__) guards, with the MoltenVK issue links in the comment at each site.

6. Companion lazy-import + format-string sanitisation (spirv_ir_builder.{cpp,h}, spirv_codegen.cpp)

Even with spirv_has_non_semantic_info turned off on Apple, kernels with debug=True can still enter the arithmetic-overflow check path in spirv_codegen.cpp::generate_overflow_branch, which calls ir_->call_debugprintf(...). Left untreated, the traceback string passed to that call contains un-escaped " and \n characters (Python source file paths, newlines) that survive the MSL translation and blow up the output with errors like missing terminating '"' character. Two mitigations:

  • spirv_ir_builder::init_pre_defs no longer eagerly imports NonSemantic.DebugPrintf; the import now fires lazily from the first call_debugprintf site. Kernels with no debug traffic emit no OpExtInstImport, so MoltenVK's unused-import stub never runs.
  • TaskCodegen::generate_overflow_branch escapes " and replaces \n / \r with spaces before feeding the traceback into the format string. Native Vulkan drivers get the traceback byte-for-byte; Metal / MSL round-trips cleanly.
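
The escaping rule in the second mitigation can be written down in a few lines. This is a Python sketch of the rule as described, not the C++ in spirv_codegen.cpp; `sanitise_traceback` is a hypothetical name.

```python
def sanitise_traceback(text: str) -> str:
    """Escape double quotes and flatten line breaks into spaces."""
    escaped = text.replace('"', '\\"')
    return escaped.replace("\r", " ").replace("\n", " ")


sample = 'File "/Users/a/k.py", line 3\nassert'
cleaned = sanitise_traceback(sample)
```

After sanitisation the string contains no raw line breaks and every double quote is preceded by a backslash, which is what lets it sit inside a quoted format string.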

7. Descriptor-set lifecycle fix (vulkan_api.cpp)

DeviceObjVkDescriptorSet::~DeviceObjVkDescriptorSet now returns the VkDescriptorSet to its source pool via vkFreeDescriptorSets. Without this, each launch accumulates consumed-but-never-reclaimed slots, VulkanDevice::alloc_desc_set spins up fresh pools at the 64-set boundary, and MoltenVK's MVKDescriptorSet::_pool can deref a pool the driver has torn down (null-pool deref inside MVKResourcesCommandEncoderState::bindDescriptorSet). The pool is created with VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT, so the free call is legal; the ref_pool shared_ptr keeps the pool and its VkDevice alive past the destructor.
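
The pool-growth arithmetic can be checked with a toy allocator. This is a simplified Python model under stated assumptions (fixed 64-set pools, two sets per launch, sets released right after each launch); the class and function names are hypothetical and the model ignores everything else the driver does.

```python
class ToyDescriptorPools:
    """Fixed-capacity pools; a new pool is created when all are full."""

    def __init__(self, capacity: int = 64) -> None:
        self.capacity = capacity
        self.in_use = [0]  # live-set count per pool

    def allocate(self) -> int:
        # Return the index of a pool with a free slot, creating one if needed.
        for index, used in enumerate(self.in_use):
            if used < self.capacity:
                self.in_use[index] += 1
                return index
        self.in_use.append(1)
        return len(self.in_use) - 1

    def free(self, pool_index: int) -> None:
        self.in_use[pool_index] -= 1


def pools_after(launches: int, sets_per_launch: int, free_on_release: bool) -> int:
    """Count the pools that exist after a run of launches."""
    pools = ToyDescriptorPools()
    for _ in range(launches):
        held = [pools.allocate() for _ in range(sets_per_launch)]
        if free_on_release:
            for pool_index in held:
                pools.free(pool_index)
    return len(pools.in_use)
```

Without the free, the 33rd two-set launch is the first to need a second pool, which lines up with the "~32 two-set kernel launches" figure; with the free, the model never leaves the first pool.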

8. Periodic submitted_cmdbuffers_ drain (runtime/gfx/runtime.{h,cpp})

VulkanStream::submit appends one TrackedCmdbuf{fence, cmd_buffer} per submit. The vector is only cleared in command_sync() / wait_idle(). Workloads that push hundreds of kernels before any host-side observable (MPM, iterative field solves) accumulate hundreds of live fences + cmdbuffers + descriptor sets; MoltenVK's encoder-state tracker SIGSEGVs somewhere around that size. GfxRuntime::submit_current_cmdlist_if_timeout now also drains the queue every kMaxPendingLaunches = 32 launches via a bounded synchronize; workloads that already touch a Python observable per iteration are unaffected (ctx_buffers_ clears earlier via the normal synchronize path).
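
The drain policy can be sketched as a counter that forces a synchronize at a fixed threshold. The Python below is a toy model, not the GfxRuntime code: `ToyRuntime` and its fields are hypothetical, and only the threshold of 32 is taken from the description above.

```python
K_MAX_PENDING_LAUNCHES = 32


class ToyRuntime:
    def __init__(self) -> None:
        self.submitted = []  # stands in for submitted_cmdbuffers_
        self.pending_launches_since_sync = 0
        self.sync_count = 0
        self.peak_queue = 0

    def synchronize(self) -> None:
        # Draining the queue also resets the launch counter.
        self.submitted.clear()
        self.pending_launches_since_sync = 0
        self.sync_count += 1

    def launch(self, kernel_id: int) -> None:
        self.submitted.append(kernel_id)
        self.pending_launches_since_sync += 1
        self.peak_queue = max(self.peak_queue, len(self.submitted))
        if self.pending_launches_since_sync >= K_MAX_PENDING_LAUNCHES:
            self.synchronize()


runtime = ToyRuntime()
for i in range(100):
    runtime.launch(i)
```

In this model the queue never holds more than 32 entries, and a caller that synchronizes on its own more often than every 32 launches never triggers the forced drain.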

Per-backend coverage matrix

| Backend | Affected by this PR? | Verdict |
| --- | --- | --- |
| CPU (LLVM) | No | N/A - does not compile Vulkan RHI. |
| CUDA (LLVM) | No | N/A - does not compile Vulkan RHI. |
| AMDGPU (LLVM) | No | N/A - does not compile Vulkan RHI. |
| Metal (SPIR-V) | Indirectly | None of the Apple-guarded caps / cap-sanitisation code reaches the Metal RHI. Validated via CI Test on Mac (15, 3.*) / (26, 3.*). |
| Vulkan on Apple / MoltenVK | Yes | MoltenVK is now LunarG-sourced; PSB + BDA enabled; non_semantic_info / shared_atomic_float caps off to match what MoltenVK's MSL translator actually supports; descriptor-set + cmdbuffer-queue lifecycle fixes in place. Covered end-to-end by Test on Mac (15, 3.*) / (26, 3.*). |
| Vulkan on Linux | Yes | The buffer_info.usage fix and the vkGetBufferDeviceAddressKHR bit-gate fix also apply here; they resolve the test_print stderr validation-layer failures that test_gpu / Test Linux Vulkan was reporting. .tar.xz branch in vulkan.py and the Linux PSB / non-semantic-info paths are untouched. |
| Vulkan on Windows (SPIR-V) | No | MSI branch in vulkan.py untouched; the buffer_info.usage fix applies but is a no-op relative to the pre-PR state because Windows was already validation-clean. |

Tests

CI

  • Test on Mac (15, 3.10-3.13) and Test on Mac (26, 3.10-3.13) exercise the new fetch end-to-end and run the full Vulkan-backend test matrix. Pre-PR: test_tile16_*[arch=vulkan-*] / test_mpm88_numpy_and_ndarray[arch=vulkan-0] / test_shared_array_float_atomics[arch=vulkan-*-dtype1-{add,sub}] fail. Post-PR: those pass; any new regressions surface here.
  • test_gpu / Test Linux Vulkan exercises the vkGetBufferDeviceAddressKHR bit-gate fix by running with validation enabled. Pre-PR: test_print_*[arch=vulkan] fail because VUID-VkBufferDeviceAddressInfo-buffer-02601 warnings pollute stderr; post-PR those go quiet.
  • Manylinux wheel Build/Test (ubuntu-22.04 / ubuntu-22.04-arm) validates that the Linux branch of vulkan.py is unchanged.
  • Windows 2025 Build/Test (3.10-3.13) validates that the Windows branch of vulkan.py is unchanged.

Local smoke

  • ./build.py --shell -- cmake -S . -B build -DQD_WITH_VULKAN=ON && ./build.py on macOS-26 / arm64 succeeds and stages libMoltenVK.dylib into build/.
  • python -c "import quadrants as qd; qd.init(arch=qd.vulkan); print(qd.lang.impl.current_cfg().spirv_has_physical_storage_buffer)" reports True after this PR; reports False before it.
  • CMAKE_BUILD_TYPE=Debug cmake --log-level=DEBUG shows the MoltenVK: using LunarG Vulkan SDK copy at ... status line.

No unit tests are added by this PR itself: the SDK and RHI changes surface via the existing Vulkan-backend test matrix, which is the regression harness. The atomic-fetch-add and debug-printf MoltenVK quirks are already covered by test_shared_array_float_atomics and the existing debug=True-using test_matrix / test_tile16 cases respectively.

Side-effect audit

| Concern | Where checked | Verdict |
| --- | --- | --- |
| Linux / Windows Vulkan SDK fetch | vulkan.py case ("Linux", "x86_64") / case ("Windows", "AMD64") branches untouched | ok - no behaviour change outside Apple |
| VULKAN_SDK env var semantics | exported from the shell hook; consumed by quadrants/rhi/CMakeLists.txt (BSD find path), quadrants/rhi/vulkan/vulkan_device_creator.cpp (runtime loader), shader compiler glslang lookup | ok - existing consumers keep using $VULKAN_SDK; the new macOS prefix looks identical in layout |
| MOLTENVK_DIR env var | new. Only read by quadrants/rhi/CMakeLists.txt; no runtime lookup | ok - opt-in; falls back to $VULKAN_SDK/lib |
| CMake cache | find_file(MOLTEN_VK ...) is cached; after the SDK is installed the first configure populates it and subsequent configures skip | ok - deleting ~/.cache/quadrants/vulkan-macos-.../ + rm -rf build regenerates from scratch |
| BDA feature-bit gate | now the sole gate on spirv_has_physical_storage_buffer | intentional - Vulkan 1.3 devices without BDA (present on some headless CI drivers) no longer get PSB set |
| alloc_info.usage → buffer_info.usage fix | allocate_memory branch attaches SHADER_DEVICE_ADDRESS_BIT to the Vulkan buffer usage only; VMA allocation usage is unchanged | ok - VMA allocation strategy untouched; only the Vulkan buffer carries the new usage bit |
| vkGetBufferDeviceAddressKHR bit-gate | added alongside the fix; buffers without the bit skip the BDA query cleanly | ok - uniform / vertex / transfer-only buffers no longer spam validation or return garbage addresses |
| non_semantic_info skipped on Apple | gate is #if !defined(__APPLE__); other platforms unaffected | ok - covered by per-backend matrix above |
| shared_atomic_float* skipped on Apple | same gate; CAS-emulated path already exists in atomic_operation_widened | ok - pure correctness fix for Apple; other platforms retain native support |
| Non-Apple behaviour | Apple-guarded block (if (APPLE)) in CMakeLists.txt; Apple-guarded caps in vulkan_device_creator.cpp | ok - Linux / Windows Vulkan paths untouched |
| Download cache | download_dep(url, installer_dir, strip=1) uses the existing cache primitive; re-runs short-circuit on cached unzip + on the existence of $prefix/macOS/ | ok - no network on re-run |
| Installer execute bit | installer_bin.chmod(0o755) before subprocess.check_call; idempotent | ok - zipfile dropped mode 0644, handled here |
| $VK_LAYER_PATH | set on every platform branch including the new macOS one | ok - layer validation remains wired |
| Legacy Taichi pin removal | CMakeLists.txt previously curl-ed libMoltenVK.dylib.zip from taichi_assets - removed. FATAL_ERROR replaces the silent fallback. | intentional; no legacy fallback is shipped |
| Lazy NonSemantic.DebugPrintf import | no call to call_debugprintf -> no OpExtInstImport; every previously-working Vulkan driver still sees the import when a kernel actually needs it | ok - no effect on platforms that don't advertise the cap |
| Descriptor-set reuse | vkFreeDescriptorSets per-set on destruction, pool retains VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT | ok - the call is legal by construction; verified by CI on Mac and Linux |
| pending_launches_since_sync_ threshold | 32 launches; reset on every synchronize(); only fires when no Python-side observable has intervened | ok - pathological-loop safety valve; normal workloads are unaffected |

@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 453d0a8 to 416d869 Compare April 23, 2026 05:43

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

call(llvm_func, new_ctx);
llvm_val[stmt] = result_buffer;

P1: Propagate cpu_assert_failed after real_func calls

In TaskCodeGenLLVM::visit(FuncCallStmt), the generated caller invokes the callee with a fresh RuntimeContext and then immediately continues (call(llvm_func, new_ctx)) without checking or forwarding new_ctx->cpu_assert_failed. On CPU, assertions now rely on this flag to abort execution, so an out-of-bounds/assert failure inside @qd.real_func is swallowed at the call boundary and the caller keeps running with invalid state. This makes debug assertions inside real functions ineffective and can reintroduce post-assert memory faults instead of cleanly terminating the kernel.
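
The call-boundary issue can be illustrated with a toy model. The Python below is not the generated LLVM IR: `ToyContext`, `callee`, and the two caller functions are hypothetical stand-ins, and the boolean return value simply records whether the caller kept running after the call.

```python
from dataclasses import dataclass


@dataclass
class ToyContext:
    cpu_assert_failed: int = 0


def callee(ctx: ToyContext, index: int, size: int) -> None:
    # A failed bounds assertion sets the flag on the callee's own context.
    if not 0 <= index < size:
        ctx.cpu_assert_failed = 1


def caller_without_propagation(ctx: ToyContext, index: int, size: int) -> bool:
    new_ctx = ToyContext()  # fresh context per call
    callee(new_ctx, index, size)
    return True  # keeps running: the flag on new_ctx is never read


def caller_with_propagation(ctx: ToyContext, index: int, size: int) -> bool:
    new_ctx = ToyContext()
    callee(new_ctx, index, size)
    if new_ctx.cpu_assert_failed:
        ctx.cpu_assert_failed = 1  # forward the failure to the caller
        return False  # early return instead of continuing
    return True
```

With an out-of-bounds index, only the second caller leaves the outer context flagged and stops.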


@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 416d869 to f213cd6 Compare April 23, 2026 05:54
@duburcqa duburcqa force-pushed the duburcqa/adstack_bounded_loop_sizing branch from c8f36e6 to e76a5a0 Compare April 23, 2026 06:00

@claude claude Bot left a comment

Additional findings (outside current diff — PR may have been updated during review):

  • 🔴 quadrants/codegen/spirv/spirv_codegen.cpp:2320-2347 — SPIR-V codegen caches a single per-task invoc_id * stride SSA id in ad_stack_heap_thread_base_{float,int}_ and emits the underlying OpIMul via ir_->mul(...) into the current insertion block at the first AdStackAllocaStmt visit site (spirv_codegen.cpp:2324-2351). When a task contains multiple independent blocks — e.g. sibling inner range-fors that are each their own IB, each carrying its own f32 loop-carried variable — auto_diff.cpp's per-IB pipeline runs BackupSSA::run(ib) independently for each IB, so each AdStackAllocaStmt is hoisted (at most) to its own IB root. The first visit emits the OpIMul inside IB1's body block; the second visit reuses the cached SSA id from a block that does not dominate IB2, violating SPIR-V §2.16. Fix mirrors the LLVM backend's ensure_ad_stack_heap_base_llvm() (codegen_llvm.cpp:2166-2186): emit the OpIMul at the task function's entry/dispatch-entry via an insertion-point save/restore, not at the first alloca visit site.

    Extended reasoning...

    What the bug is

    get_ad_stack_heap_thread_base_float() / get_ad_stack_heap_thread_base_int() cache a single SSA id per task and emit the backing invoc_id * stride OpIMul via ir_->mul(...), which commits the instruction to the IR builder's current insertion block. Emission is triggered eagerly from visit(AdStackAllocaStmt) at the first alloca visit; every subsequent Push/Pop/LoadTop/AccAdjoint re-reads the cached id via ad_stack_heap_{float,int}_ptr(). The cached id therefore dominates all downstream uses only if the first visit happens inside a block that structurally dominates every other AdStackAllocaStmt of the same heap kind.

    The code comment in spirv_codegen.cpp:2326-2335 claims this holds because the first visit "lives in the dispatch body that dominates all inner loop bodies". That premise is what the bug contradicts.

    How the premise breaks: multi-IB kernels

    Reverse-mode AD's pipeline (quadrants/transforms/auto_diff.cpp:2726-2755) identifies multiple independent blocks and runs PromoteSSA2LocalVar / ReplaceLocalVarWithStacks / MakeAdjoint / BackupSSA per-IB. For a kernel shaped like:

    for i in outer:                 # struct-for (outer)
        for j in range(n):          # inner range-for #1  -> IB1 = its body
            v = x[i, j]             # AllocaStmt at IB1 root
            for _ in range(k):      # dynamic inner
                v = qd.sin(v)
            out_a[i] += v
        for j in range(n):          # inner range-for #2  -> IB2 = its body
            w = y[i, j]             # AllocaStmt at IB2 root
            for _ in range(k):
                w = qd.cos(w)
            out_b[i] += w

    IdentifyIndependentBlocks gives IB1 = inner-loop-1's body and IB2 = inner-loop-2's body (each has its own global atomic on a different output, so each qualifies as a smallest IB). BackupSSA::run(ib) uses independent_block = ib, so the hoisted backup AdStackAllocaStmt is inserted at that IB's position 0 — not at a task-wide root that dominates both IBs.

    In IR order, SPIR-V codegen then visits:

    1. start_label(inner1_body_label) at the inner1 RangeForStmt header.
    2. visit(AdStackAllocaStmt_v) at IB1 root. Calls get_ad_stack_heap_thread_base_float(), which routes ir_->mul(...) through DEFINE_BUILDER_BINARY_USIGN_OP(mul, Mul) -> make_value(OpIMul, ...) -> make_inst, committing the OpIMul to curr_label_ == inner1_body_label. Caches the result SSA id.
    3. Exit inner1. start_label(inner2_body_label).
    4. visit(AdStackAllocaStmt_w) at IB2 root. Cache hit — returns the SSA id defined in step 2.
    5. visit(AdStackPushStmt) for w inside the inner dynamic loop of inner2 calls ad_stack_heap_float_ptr(...), which does ir_->add(base, ...) in inner2's body. The OpIAdd has an operand (the cached base) whose defining instruction lives in inner1_body_label.

    inner1_body_label and inner2_body_label are sibling children of the outer for-loop's merge/header — neither dominates the other. SPIR-V §2.16.2 rejects this; spirv-val prints a non-dominating-use error and drivers can TDR silently.

    Why the refutations don't cover this

    Both refutations correctly identify that BackupSSA::generic_visit hoists AdStackAllocaStmts to independent_block when a cross-block reference is detected — and this is sufficient for the narrow mutually-exclusive-if-branches within a single IB shape: MakeAdjoint creates a reverse new_if sibling to the forward if_stmt at the IB root, references from new_if's branches fall outside the forward if-branch's leaf_to_root chain, and the backup is inserted at IB root via independent_block->insert(std::move(backup_stack_alloca), 0) (auto_diff.cpp:2595). For that shape the bug report's claim is indeed partially wrong.

    But the hoist is scoped to one IB at a time. When the kernel has sibling inner loops whose bodies are each IBs, each invocation of BackupSSA::run(ib) hoists its allocas to its own root — not to a task-wide block. The two resulting AdStackAllocaStmts live in sibling, mutually-non-dominating blocks. That is exactly the shape where the cached invoc_id * stride SSA id violates dominance.

    The refutation about test_adstack_if_cond_snapshot_adaptive_sizing doesn't disprove this shape either: that test uses an if/elif/elif/else on a single carried variable (outputs[i_inner, i_batch]), so there is only one adstack kind and one alloca site. It produces no sibling-alloca pair and does not stress the cache.

    Why the existing comment doesn't save this

    The implementation's own inline comment at spirv_codegen.h:219-225 defends eager-at-alloca-site emission with:

    Emitted eagerly from visit(AdStackAllocaStmt) so the OpIMul lives in the alloca's enclosing block, which strictly dominates every sibling inner loop that later references the cached SSA id.

    That invariant relies on the alloca's enclosing block being task-global — i.e. the dispatch-body/function-entry block. With per-IB BackupSSA, the enclosing block is the IB root, which is task-global only when the kernel happens to have exactly one IB. The comment's invariant is therefore an accidental property of the test corpus, not a pipeline guarantee.

    The LLVM backend already diagnosed the exact same concern and solved it explicitly: TaskCodeGenLLVM::ensure_ad_stack_heap_base_llvm() in codegen_llvm.cpp:2166-2186 emits the base load at entry_block via an llvm::IRBuilderBase::InsertPointGuard, with a comment calling out "two sibling adstacks under different branches of an if would trip verifyFunction with a non-dominating use". The SPIR-V side should mirror this.

    Impact

    • spirv-val rejects the produced SPIR-V with a non-dominating-operand error (SPIR-V §2.16.2).
    • Native Metal / Vulkan drivers vary: some refuse to compile the shader, others miscompile silently.
    • This is triggered by a natural reverse-mode AD shape — two accumulators with their own dynamic loops in the same kernel — and is not exercised by any of the PR's new SPIR-V heap-adstack tests.

    Step-by-step proof

    Consider the kernel above, with n = 4, k = 3, compiled with ad_stack_experimental_enabled=True.

    1. IdentifyIndependentBlocks::run(root) returns {inner1_body_block, inner2_body_block} because each inner body is the smallest IB with a qualifying global atomic.
    2. For ib = inner1_body_block:
      • ReplaceLocalVarWithStacks replaces AllocaStmt_v in place with AdStackAllocaStmt_v (at inner1_body position 0, since it was the first user stmt).
      • MakeAdjoint emits reverse code (new_for with body referencing AdStackAllocaStmt_v) appended to inner1_body_block.
      • BackupSSA examines reverse ops whose op->parent is inner1_body_block. Here inner1_body_block is in each reverse stmt's leaf_to_root, so no hoist fires. AdStackAllocaStmt_v stays at inner1_body position 0.
    3. For ib = inner2_body_block: symmetric. AdStackAllocaStmt_w ends up at inner2_body position 0.
    4. SPIR-V codegen's run() pre-scans IR (spirv_codegen.cpp:131-168) to size ad_stack_heap_per_thread_stride_float_. Both allocas are f32 with max_size bounded by the bounded-loop analyzer (k = 3 each), so stride ends up at ~12 f32 elements.
    5. Code emission walks outer struct-for, enters inner1. visit(RangeForStmt) calls start_label(body_label_inner1). Now curr_label_ = body_label_inner1.
    6. visit(AdStackAllocaStmt_v) at spirv_codegen.cpp:2420 calls get_ad_stack_heap_thread_base_float() which emits OpIMul %u32 %invoc_id %stride under body_label_inner1 and caches the SSA id as %base_ssa.
    7. Exits inner1. visit(RangeForStmt) for inner2 calls start_label(body_label_inner2). curr_label_ = body_label_inner2.
    8. visit(AdStackAllocaStmt_w) at the same line. Cache hit: returns %base_ssa (defined in body_label_inner1).
    9. Any later visit(AdStackPushStmt) / visit(AdStackLoadTopStmt) on w calls ad_stack_heap_float_ptr(offset, count) which executes ir_->add(%base_ssa, offset_val) under body_label_inner2.
    10. The OpIAdd references %base_ssa whose defining OpIMul is in body_label_inner1. In the CFG, body_label_inner1 is not on every path to body_label_inner2 (they are sibling loop bodies under the outer struct-for header), so it does not dominate the use. spirv-val's structured-dominance pass rejects the module.
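
The dominance claim in step 10 can be checked mechanically on a simplified control-flow graph. The CFG below is a hypothetical reduction of the two sibling loops (block names are invented for illustration, not taken from emitted SPIR-V), and the function is the textbook iterative dominator computation.

```python
from typing import Dict, List, Set


def dominators(cfg: Dict[str, List[str]], entry: str) -> Dict[str, Set[str]]:
    """Iterative dominator sets; every block must appear as a key of cfg."""
    nodes = set(cfg)
    preds: Dict[str, Set[str]] = {n: set() for n in nodes}
    for src, dsts in cfg.items():
        for dst in dsts:
            preds[dst].add(src)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


# Hypothetical CFG: an outer loop containing two sequential inner loops.
CFG = {
    "entry": ["outer_header"],
    "outer_header": ["inner1_header", "exit"],
    "inner1_header": ["inner1_body", "inner1_merge"],
    "inner1_body": ["inner1_header"],
    "inner1_merge": ["inner2_header"],
    "inner2_header": ["inner2_body", "inner2_merge"],
    "inner2_body": ["inner2_header"],
    "inner2_merge": ["outer_header"],
    "exit": [],
}
DOM = dominators(CFG, "entry")
```

In this model `inner1_body` is absent from `DOM["inner2_body"]` because the first loop can run zero iterations, while `entry` dominates both bodies.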

    Fix

    Mirror the LLVM backend. Add a one-shot ensure_ad_stack_heap_thread_base_{float,int}() that:

    1. Saves the current insertion point (e.g. ir_->save_insert_point() or an equivalent).
    2. Switches to the function's entry/dispatch-body block (the block right after the offloaded task's function header; equivalent to LLVM's entry_block).
    3. Emits the OpUConvert/OpIMul.
    4. Restores the original insertion point.
    5. Caches the result.

    Call it from visit(AdStackAllocaStmt) and both ad_stack_heap_{float,int}_ptr lazily. This guarantees the OpIMul lives in a block that dominates every other block in the function regardless of how many IBs the task contains.
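The five steps above can be sketched with a toy builder. This is a minimal model under stated assumptions: `ToyBuilder`, `ToyCodegen`, and `ensure_thread_base` are hypothetical names, and the real SPIR-V IR builder would additionally have to insert before the entry block's terminator rather than simply appending.

```python
class ToyBuilder:
    """Minimal stand-in for an IR builder with a movable insertion point (hypothetical API)."""

    def __init__(self):
        self.blocks = {"entry": []}
        self.current = "entry"
        self._next_id = 0

    def start_label(self, name):
        self.blocks.setdefault(name, [])
        self.current = name

    def emit(self, op):
        ssa = f"%{self._next_id}"
        self._next_id += 1
        self.blocks[self.current].append((ssa, op))
        return ssa


class ToyCodegen:
    def __init__(self):
        self.ir = ToyBuilder()
        self._thread_base = None  # cached SSA id of the per-thread base

    def ensure_thread_base(self):
        if self._thread_base is None:
            saved = self.ir.current      # 1. save the insertion point
            self.ir.current = "entry"    # 2. switch to the entry block
            self._thread_base = self.ir.emit("OpIMul invoc_id stride")  # 3. emit
            self.ir.current = saved      # 4. restore the insertion point
        return self._thread_base         # 5. later calls hit the cache

    def visit_alloca(self):
        base = self.ensure_thread_base()
        return self.ir.emit(f"OpIAdd {base} offset")


cg = ToyCodegen()
cg.ir.start_label("body_label_inner1")
cg.visit_alloca()
cg.ir.start_label("body_label_inner2")
cg.visit_alloca()
```

Both loop bodies end up referencing the same SSA id, but its defining instruction now sits in `entry`, so the definition dominates every use regardless of which body is visited first.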

  • 🔴 quadrants/codegen/llvm/codegen_llvm.cpp:2579-2585 — The PR adds a CPU assertion-propagation mechanism (cpu_assert_failed) but explicitly acknowledges via FIXME (codegen_llvm.cpp:2579-2584) that it is not propagated out of @qd.real_func callees. An OOB/assertion inside a real_func on CPU sets the flag on the callee's new_ctx alloca, but the caller never reads it — subsequent tasks continue running on possibly-corrupted data, which is exactly the silent-segfault class the PR is meant to prevent. Fix by zero-initializing new_ctx.cpu_assert_failed before the call, checking it after, propagating to get_context()->cpu_assert_failed, and emitting an early ret void on failure — all three steps are enumerated in the FIXME.

    Extended reasoning...

    What the bug is

    The PR's central mechanism — setting cpu_assert_failed=1 inside quadrants_assert_format_ctx and having the kernel launcher break out of the task loop — fails when the assertion fires inside a @qd.real_func callee on CPU. The callee correctly writes to its context, but the caller's context is never updated.

    The specific code path

    At quadrants/codegen/llvm/codegen_llvm.cpp:2585, visit(FuncCallStmt) allocates the callee's context via create_entry_block_alloca(RuntimeContext) and only initializes the runtime field on line 2586. The call is then emitted on line 2600 via call(llvm_func, new_ctx), with no post-call propagation.

    Inside the real_func body compilation (stmt->func->ir->accept(this) on line 2575), any AssertStmt routes through use_ctx_variant=true (since arch_is_cpu) and calls quadrants_assert_format_ctx with get_context() == get_arg(0), which is the caller's new_ctx pointer. When the assertion fires, runtime.cpp:845 writes new_ctx->cpu_assert_failed = 1 and codegen_llvm.cpp:1182 emits an early CreateRetVoid.

    Why existing code doesn't prevent it

    Back in the caller's task body, the flag on new_ctx is never copied into the caller's context. The outer launch_offloaded_tasks loop in quadrants/runtime/cpu/kernel_launcher.cpp:13-22 only checks ctx.get_context().cpu_assert_failed — but that context belongs to the task-level scope, not the real_func call. Regular @qd.func is AST-inlined so it does not hit this path; only @qd.real_func callees do.

    Additionally, new_ctx is raw create_entry_block_alloca storage. The C++ in-class initializer int32_t cpu_assert_failed{0} in program/context.h only applies to C++ constructions, not LLVM allocas — so the slot starts with uninitialized stack bytes. This is currently latent (nothing reads it back), but it means step (1) of the fix is load-bearing once post-call propagation is added.

    Impact

    An OOB/assertion inside a reverse-mode or any other @qd.real_func on CPU silently fails to terminate the kernel. Subsequent tasks in the same launch_offloaded_tasks loop continue running on possibly-corrupted data — exactly the test_ndarray_oob_cpu_* / test_do_while_oob_does_not_loop_forever regression the new mechanism is meant to eliminate. None of the tests added in this PR exercise a real_func callee (all use @qd.kernel or @qd.func), so CI does not catch the gap.

    How to fix

    The FIXME itself enumerates the three steps:

    1. Zero-init new_ctx->cpu_assert_failed after the RuntimeContext_set_runtime call (LLVM CreateStore of a constant zero to the cpu_assert_failed field of new_ctx).
    2. After call(llvm_func, new_ctx), load new_ctx->cpu_assert_failed and compare against zero.
    3. If non-zero, propagate via get_context()->cpu_assert_failed = 1 and emit CreateRetVoid on the caller side, matching the pattern visit(AssertStmt) already uses at lines 1175-1183.
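The effect of the three steps can be modelled in plain Python. This is a behavioural sketch only: `Ctx`, `task`, and `launch_tasks` are hypothetical stand-ins for RuntimeContext, the generated task body, and launch_offloaded_tasks, and a Python object starts zeroed whereas the real alloca needs the explicit zero-init from step 1.

```python
class Ctx:
    """Stand-in for RuntimeContext; only the flag discussed here is modelled."""

    def __init__(self):
        self.cpu_assert_failed = 0


def callee(ctx):
    # Models the OOB assert firing inside the real_func body:
    # the flag is set on the context the callee was handed.
    ctx.cpu_assert_failed = 1


def task(ctx, log, propagate):
    new_ctx = Ctx()                 # step 1: fresh, zeroed callee context
    callee(new_ctx)
    if propagate and new_ctx.cpu_assert_failed != 0:  # step 2: check after the call
        ctx.cpu_assert_failed = 1   # step 3: copy the flag to the caller's context
        return                      # ... and return early
    log.append("write b[i]")        # work that should not run after a failed assert


def launch_tasks(n_tasks, propagate):
    ctx = Ctx()
    log = []
    for _ in range(n_tasks):
        task(ctx, log, propagate)
        if ctx.cpu_assert_failed:   # the launcher only ever sees the outer context
            break
    return log, ctx.cpu_assert_failed


log_without, flag_without = launch_tasks(3, propagate=False)
log_with, flag_with = launch_tasks(3, propagate=True)
```

Without propagation all three tasks run and the outer flag stays clear; with it, the first task returns early and the launcher loop stops.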

    Proof via a concrete example

    Consider a kernel that calls a @qd.real_func which reads an ndarray out of bounds, then the kernel body writes to an unrelated field afterward:

    import quadrants as qd

    @qd.real_func
    def oob_reader(a: qd.types.ndarray(dtype=qd.f32, ndim=1)) -> qd.f32:
        return a[100]  # a.shape == (4,), fires OOB assert

    @qd.kernel
    def k(a: qd.types.ndarray(dtype=qd.f32, ndim=1), b: qd.types.ndarray(dtype=qd.f32, ndim=1)):
        for i in range(4):
            v = oob_reader(a)
            b[i] = v  # executes even after the assert in oob_reader fires

    Step-by-step at runtime with debug=True, check_out_of_bound=True:

    1. k enters its task function; ctx is the outer task's RuntimeContext with cpu_assert_failed=0 (cleared by launch_offloaded_tasks line 9).
    2. visit(FuncCallStmt) emitted: new_ctx = alloca RuntimeContext (line 2585); RuntimeContext_set_runtime(new_ctx, runtime) (line 2586). new_ctx->cpu_assert_failed is stack garbage but unread.
    3. call(oob_reader, new_ctx) jumps into the callee.
    4. Inside oob_reader, the OOB AssertStmt fires. use_ctx_variant is true. get_context() returns get_arg(0) == new_ctx. quadrants_assert_format_ctx(new_ctx, false, ...) sets new_ctx->cpu_assert_failed = 1 (runtime.cpp:845) and returns 1.
    5. The callee's visit(AssertStmt) epilogue (lines 1175-1183) sees the non-zero return, emits ret void. Control returns to the caller.
    6. The caller does not check new_ctx->cpu_assert_failed — execution continues. b[i] = v is written (with whatever garbage v holds from the early-returned callee).
    7. The for-loop in k iterates; the next iteration calls oob_reader again (same behaviour).
    8. Eventually the task returns. launch_offloaded_tasks checks ctx.get_context().cpu_assert_failed — still 0, because nothing touched the outer context. The loop does not break. If there are more offloaded tasks, they also run.
    9. The debug=True post-task check_runtime_error does eventually surface the assertion (via runtime->error_code set by quadrants_assert_format itself), but only after every subsequent task has already executed on corrupted state.

    With the three-step fix, step 6 becomes: load new_ctx->cpu_assert_failed (reading the zero-init-then-maybe-set-to-1 slot), branch to a propagate block that stores 1 into get_context()->cpu_assert_failed and emits ret void, matching the invariant that every other assert-propagation site already upholds.

@hughperkins

Collaborator

Question (I don't have a strong opinion on this point, just posing the question):

  • the earlier comment in the code in question suggests installing the MoltenVK SDK using brew
  • what do you see as the good and bad points of the two possible approaches (i.e. downloading from LunarG vs using the Homebrew version)?

@duburcqa

Contributor Author

the earlier comment in the code in question suggests installing the MoltenVK SDK using brew. What do you see as the good and bad points of the two possible approaches (i.e. downloading from LunarG vs using the Homebrew version)?

I'm not a huge fan of allowing system dependencies. By forcing our own version of MoltenVK, we can guarantee that it works. We do not support any version other than the one shipping with Quadrants, and I don't think we want to go down that road. If a dev wants to try a specific version, it is very easy to change it in the CMake file and delete the cache to force re-downloading the updated version. So I don't see any drawback in not supporting a system-wide Homebrew version, except that it forces a download, so bootstrapping the build environment takes longer the first time.

@hughperkins

Collaborator

Question: is what we are downloading a binary, or source code? If source code, what is the impact on build time of using downloaded source code vs brew? (As an example of something we'd want to avoid: building the LLVM SDK takes a looong time and a lot of effort, so pre-built is best, I feel. I don't know about the MoltenVK SDK, hence posing the question.)

@duburcqa

Contributor Author

Question: is what we are downloading a binary, or source code? If source code, what is the impact on build time of using downloaded source code vs brew? (As an example of something we'd want to avoid: building the LLVM SDK takes a looong time and a lot of effort, so pre-built is best, I feel. I don't know about the MoltenVK SDK, hence posing the question.)

We are installing the entire SDK, exactly as we do on Windows. This provides sources and SOME precompiled binaries. All we actually use in this case is the precompiled MoltenVK binary, and we do not build it. So no impact on build time.

@hughperkins

Collaborator

Ok. And what about Mac SIP? How are we avoiding triggering that? Have you tested this build on your own Mac locally, and you confirm no SIP issues?

@duburcqa

Contributor Author

Ok. And what about Mac SIP? How are we avoiding triggering that? Have you tested this build on your own Mac locally, and you confirm no SIP issues?

I don't know what SIP means, but I can confirm it builds locally and runs without issue. We are bundling the MoltenVK dylib directly in the wheels (which was the pre-existing behaviour before this PR), so what we are distributing is reasonably standalone. It would be completely standalone if we properly ran the delocate audit tool on wheels before pushing to PyPI to ensure proper name mangling. I don't think that prevents all possible cases of symbol collisions, but it is fine in the vast majority of cases.

@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 6c6d25d to 14ae74f Compare April 23, 2026 07:57
@duburcqa duburcqa changed the title [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch + free VkDescriptorSet on shared_ptr release Apr 23, 2026
@duburcqa duburcqa changed the title [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch + free VkDescriptorSet on shared_ptr release [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch Apr 23, 2026
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch 2 times, most recently from 44679dd to 8123ff1 Compare April 23, 2026 11:45
@duburcqa duburcqa force-pushed the duburcqa/adstack_bounded_loop_sizing branch from ef0b228 to 938b32a Compare April 23, 2026 12:56
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 8123ff1 to 526292d Compare April 23, 2026 12:56
@duburcqa duburcqa force-pushed the duburcqa/adstack_bounded_loop_sizing branch from 938b32a to 29b4c1a Compare April 23, 2026 13:15
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 526292d to 860a914 Compare April 23, 2026 13:15
@duburcqa duburcqa force-pushed the duburcqa/adstack_bounded_loop_sizing branch from 29b4c1a to 9c959ff Compare April 23, 2026 13:17
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 860a914 to e9091bc Compare April 23, 2026 13:17
@duburcqa duburcqa force-pushed the duburcqa/adstack_bounded_loop_sizing branch from 9c959ff to b5f776f Compare April 23, 2026 13:19
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from e9091bc to 63ebc09 Compare April 23, 2026 13:19
@duburcqa duburcqa force-pushed the duburcqa/adstack_bounded_loop_sizing branch from b5f776f to 505af49 Compare April 23, 2026 15:01
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 63ebc09 to 8fdb352 Compare April 23, 2026 15:01
@duburcqa duburcqa force-pushed the duburcqa/adstack_bounded_loop_sizing branch from 505af49 to d8fe6ae Compare April 23, 2026 15:35
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 8fdb352 to 69fffa2 Compare April 23, 2026 15:35
@duburcqa duburcqa force-pushed the duburcqa/adstack_bounded_loop_sizing branch from d8fe6ae to 18351d7 Compare April 23, 2026 18:24
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 69fffa2 to cf9a6e6 Compare April 23, 2026 18:24
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 1b25db2 to 9edcfe7 Compare April 24, 2026 07:29
@duburcqa duburcqa changed the title [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch and sanitise MoltenVK cap advertisement Apr 24, 2026
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 9edcfe7 to 05440e1 Compare April 24, 2026 10:11
@duburcqa

Contributor Author

@claude review

@claude claude Bot left a comment

Inline comments cover two nit-level doc issues; I'm deferring to a human reviewer given the scope here — the PR touches Vulkan device creation, BDA / PSB capability gating, descriptor-set lifecycle, SPIR-V codegen, and a heuristic 32-launch drain for MoltenVK, which is more than I'd want to shadow-approve.

Extended reasoning...

Overview

Six-commit PR that swaps the 2022 Taichi-pinned MoltenVK for a LunarG Vulkan SDK fetch driven by build.py --shell, unblocks physical-storage-buffer (BDA) on Apple, and lands four follow-on MoltenVK quirk fixes: a latent alloc_info.usage -> buffer_info.usage typo in VulkanDevice::allocate_memory, NonSemantic.DebugPrintf + shaderSharedFloat*AtomicAdd cap sanitisation on Apple with a companion lazy-import in the SPIR-V IR builder and format-string sanitisation in the overflow-diagnostic path, and a pending_launches_since_sync_ counter in GfxRuntime that forces a drain every 32 launches to bound VulkanStream::submitted_cmdbuffers_ growth.
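The drain-every-32-launches valve can be pictured with a small model. This is a sketch under assumptions, not the GfxRuntime code: `ToyRuntime`, `launch_kernel`, and `synchronize` are hypothetical names, and only the counter name and the threshold of 32 come from the description above.

```python
K_MAX_PENDING_LAUNCHES = 32  # threshold quoted in this review; the real constant lives in C++


class ToyRuntime:
    """Toy model of a runtime that bounds its submitted-command-buffer queue."""

    def __init__(self):
        self.submitted_cmdbuffers = []
        self.pending_launches_since_sync = 0
        self.drains = 0
        self.peak = 0

    def synchronize(self):
        # Models waiting for the GPU and releasing every tracked command buffer.
        self.submitted_cmdbuffers.clear()
        self.pending_launches_since_sync = 0
        self.drains += 1

    def launch_kernel(self, name):
        self.submitted_cmdbuffers.append(name)
        self.pending_launches_since_sync += 1
        self.peak = max(self.peak, len(self.submitted_cmdbuffers))
        if self.pending_launches_since_sync >= K_MAX_PENDING_LAUNCHES:
            self.synchronize()  # forced drain keeps the queue bounded


rt = ToyRuntime()
for step in range(100):
    rt.launch_kernel(f"k{step}")
```

In this model a loop of 100 launches never holds more than 32 entries at once, at the cost of a forced synchronisation after every 32nd launch.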

Security risks

Low for the Apple-guarded paths (cap sanitisation is #if !defined(__APPLE__) and behind the validation-layer gate). The new SDK fetch invokes a LunarG installer binary via subprocess.check_call; the URL and binary name are hard-coded (no shell interpolation), but the LunarG download is a third-party supply-chain dependency the build hadn't had before. The buffer_info.usage typo fix tightens a pre-existing dormant bug rather than introducing new surface.

Level of scrutiny

High. This is cross-platform RHI / build / codegen infrastructure. The PSB capability flip on Apple changes behaviour for every Vulkan-on-Apple user, the vkGetBufferDeviceAddressKHR gating fix also affects Linux, and the kMaxPendingLaunches = 32 drain is a heuristic workaround for a MoltenVK SIGSEGV whose exact threshold ("somewhere around a few hundred") isn't precisely characterised. A human with MoltenVK / Vulkan RHI context should sanity-check the cap-sanitisation choices and the drain threshold.

Other factors

  • Open discussion in the timeline about Mac SIP that the author answered with "I don't know what SIP means" — worth a human confirming the distributed wheel actually loads cleanly on a fresh Apple Silicon Mac.
  • No unit tests added; regression coverage relies on the existing Vulkan-backend CI matrix (Mac 15/26, Linux Vulkan).
  • Comments are dense and the two inline nits (stale path in entry.py, duplicated safety-valve paragraph in runtime.cpp) suggest at least one more editorial pass on the doc before merge.

Comment thread .github/workflows/scripts/ti_build/entry.py Outdated
Comment thread quadrants/runtime/gfx/runtime.cpp Outdated
@duburcqa duburcqa force-pushed the duburcqa/adstack_bounded_loop_sizing branch from 54d7dba to d074c0e Compare April 24, 2026 11:38
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 05440e1 to 39c5a98 Compare April 24, 2026 11:38
Comment thread quadrants/codegen/spirv/spirv_codegen.cpp
Comment thread .github/workflows/scripts/ti_build/vulkan.py
Comment thread quadrants/rhi/CMakeLists.txt
@hughperkins

Collaborator

Doc: I wonder if we should have a table in our doc somewhere with OSes as the rows (Windows, Mac, Ubuntu) and supported arch types as the columns (CUDA, Vulkan, ARM, x86, etc.)?

@duburcqa

Contributor Author

Good suggestion. Added a Backend / OS matrix to docs/source/user_guide/supported_systems.md in 9c0e306 - rows are macOS (Apple Silicon), Linux x64, Linux ARM64, Windows x86, Windows ARM64; columns are qd.cpu / qd.cuda / qd.amdgpu / qd.metal / qd.vulkan; n/a marks combinations where the vendor platform itself is unavailable (no NVIDIA driver on macOS / Linux ARM64 / Windows ARM64, no ROCm outside Linux x64, no Metal outside Apple). Short notes under the table capture the driver-runtime prerequisites and the bundled MoltenVK on macOS.
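As a rough illustration of the n/a rules described above, the predicate below encodes only the three stated exclusions. The row and column labels are copied from the comment; `is_na` is a hypothetical helper, and a `False` result does not by itself mean the combination is supported, since the actual cell values are in supported_systems.md.

```python
ROWS = ["macOS (Apple Silicon)", "Linux x64", "Linux ARM64", "Windows x86", "Windows ARM64"]
COLS = ["qd.cpu", "qd.cuda", "qd.amdgpu", "qd.metal", "qd.vulkan"]


def is_na(os_name, backend):
    """True when the vendor platform itself is unavailable on that OS (per the rules above)."""
    if backend == "qd.cuda":
        # No NVIDIA driver on these platforms.
        return os_name in {"macOS (Apple Silicon)", "Linux ARM64", "Windows ARM64"}
    if backend == "qd.amdgpu":
        # No ROCm outside Linux x64.
        return os_name != "Linux x64"
    if backend == "qd.metal":
        # No Metal outside Apple.
        return os_name != "macOS (Apple Silicon)"
    return False


na_cells = [(row, col) for row in ROWS for col in COLS if is_na(row, col)]
for row, col in na_cells:
    print(f"{row:<22} {col:<10} n/a")
```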

@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 9c0e306 to 649f1a0 Compare April 24, 2026 13:35
Comment thread docs/source/user_guide/supported_systems.md
@hughperkins

Collaborator

checklist:

  • doc updated appropriately (added support matrix of supported backends)

=> ok to merge

@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 649f1a0 to 5b6aec9 Compare April 24, 2026 13:38
Comment thread docs/source/user_guide/supported_systems.md Outdated
Comment thread .github/workflows/scripts/ti_build/entry.py
Base automatically changed from duburcqa/adstack_bounded_loop_sizing to main April 24, 2026 13:52
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 5b6aec9 to 59101da Compare April 24, 2026 14:00
…ild.py --shell and enable spirv_has_physical_storage_buffer on Apple
…p.py develop` works after `build.py --shell` exits
…bug_printf so MoltenVK stops rejecting debug-capable kernels
…ueue every 32 launches so MoltenVK stops SIGSEGVing on atomic-float kernels and long simulation loops
…d_systems.md, fix entry.py MoltenVK path comment to match vulkan.py, escape '%' in debug-printf overflow traceback so SPIRV-Cross -> MSL on MoltenVK does not interpret it as a format specifier
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 59101da to 2859a1d Compare April 24, 2026 15:17
Comment thread .github/workflows/scripts/ti_build/vulkan.py
@duburcqa duburcqa merged commit ae2d1c0 into main Apr 24, 2026
57 of 58 checks passed
@duburcqa duburcqa deleted the duburcqa/moltenvk_sdk_source branch April 24, 2026 17:24
npoulad1 added a commit to ROCm/quadrants that referenced this pull request Jun 8, 2026
* [Misc] Warn user to disable caching when print_ir/QD_DUMP_IR enabled (Genesis-Embodied-AI#425)

Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>

* [Build] Pin torch version to CUDA 12.8 for CUDA tests (Genesis-Embodied-AI#428)

* [Misc] Fixing up taichi-dev urls (Genesis-Embodied-AI#429)

* [Perf] Rename cuda_graph to gpu_graph across the codebase (Genesis-Embodied-AI#430)

* Misc: fix typo integeral -> integral (Genesis-Embodied-AI#434)

Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>

* [Perf] CUDA graph 4: call from multiple locations (Genesis-Embodied-AI#420)

* [Bug] Fix fastcache not restoring graph_do_while_arg (Genesis-Embodied-AI#435)

* [Perf] Cache last-call result in perf_dispatch for single-compatible case (Genesis-Embodied-AI#438)

* Fix gpu_graph fallback on old Nvidia GPU. (Genesis-Embodied-AI#443)

* Fix shared memory offset not reset between CUDA kernels. (Genesis-Embodied-AI#442)

* [Misc] Allow disabling GPU graph via QD_GPU_GRAPH=0 env var (Genesis-Embodied-AI#439)

* [Misc] Add named top-level loops (Genesis-Embodied-AI#440)

* [Misc] Rename gpu_graph to graph (Genesis-Embodied-AI#446)

* [Misc] Add cross-platform shuffle (Genesis-Embodied-AI#447)

* [Bug] Fix graph_do_while on Windows: search for cudadevrt.lib (Genesis-Embodied-AI#456)

* [Bug] Also search default CUDA toolkit install location on Windows (Genesis-Embodied-AI#461)

* [SPIRV] Feature Parity Atomics & Shared Array (Genesis-Embodied-AI#432)

* [Misc] Change clang format to 120 characters (Genesis-Embodied-AI#463)

* [Misc] CUDA graph 5 Add fatbin (Genesis-Embodied-AI#464)

* [Bug] Reuse VkInstance across init/reset cycles (Genesis-Embodied-AI#465)

* [Perf] Tiles 1: _load, _store, _eye_ (Genesis-Embodied-AI#466)

* [Misc] Remove dead InternalFuncStmt type_check override (Genesis-Embodied-AI#471)

* [Perf] Tiles 2: add cholesky and ger (Genesis-Embodied-AI#472)

* [Perf] Tiles 2b: add triangular solve (Genesis-Embodied-AI#474)

* [Misc] Refactor: use _get_col/_set_col in tiles load/store/init (Genesis-Embodied-AI#475)

* [Build] Fix flaky test_clock_accuracy (Genesis-Embodied-AI#436)

* Fix AARCH64 emitting invalid asm in CUDA kernels. (Genesis-Embodied-AI#473)

Co-authored-by: Hugh Perkins <hughperkins@gmail.com>

* [AMDGPU] Enable HIP memory pool and surface pool-exhaustion errors. (Genesis-Embodied-AI#485)

* [AMDGPU] Scope hsaco tmp dir per-user to avoid collisions. (Genesis-Embodied-AI#484)

* [Perf] Tiles 3: Add slice syntax, qd.outer() and initial doc (Genesis-Embodied-AI#477)

* [AMDGPU] Fix gradient computation. (Genesis-Embodied-AI#486)

* Enable all backends that are supported in unit tests. (Genesis-Embodied-AI#488)

* Fix SPIRV ID overflow for large kernels due to autodiff. (Genesis-Embodied-AI#489)

* [Misc] Fix purity checker to allow accessing constants from quadrants modules (Genesis-Embodied-AI#487)

* [Misc] Increase tolerance for clock monotonic test (Genesis-Embodied-AI#492)

* [CI] Serialize api doc workflow (Genesis-Embodied-AI#494)

* [CI] Increase tolerance for clock test (Genesis-Embodied-AI#506)

* [CI] Increase clock test tolerance to 20% (Genesis-Embodied-AI#509)

* [Perf] Add tensor_type parametrization to tile16 tests (Genesis-Embodied-AI#504)

* [Perf] Tiles 4b: Migrate tiles16 tests to enable fastcache (Genesis-Embodied-AI#505)

* [Perf] Tiles 4c: add Tiles16x16 proxy (Genesis-Embodied-AI#507)

* [Perf] Tiles 4d: Consolidate slice error tests using parametrize (Genesis-Embodied-AI#508)

* [Perf] Tiles 4: add SharedArray slice support (Genesis-Embodied-AI#482)

* [Perf] Tiles 5: add Cholesky benchmark demo (Genesis-Embodied-AI#483)

* [Doc] Add user guide page for subgroup shuffle (Genesis-Embodied-AI#512)

* [Perf] Implement cross-platform shuffle_down (Genesis-Embodied-AI#510)

* [Perf] Add portable subgroup reduce_add and reduce_all_add (Genesis-Embodied-AI#511)

* [Perf] Add first warmup config to perf dispatch (Genesis-Embodied-AI#422)

* [AutoDiff] Autodiff 1: Add baseline adstack regression test for unary_collections (Genesis-Embodied-AI#500)

* [AutoDiff] Autodiff 2: Implement derivative for tan (Genesis-Embodied-AI#501)

* [AutoDiff] Autodiff 3: Recompute tanh/exp on the operand in the reverse pass (Genesis-Embodied-AI#502)

* [AutoDiff] Autodiff 4: Mark rsqrt as non-linear for adstack promotion (Genesis-Embodied-AI#503)

* [AutoDiff] Autodiff 5: Fix adjoint-alloca placement for GlobalLoads outside the current range-for (Genesis-Embodied-AI#496)

* [AutoDiff] Autodiff 6: Adstack regression tests (Genesis-Embodied-AI#491)

* [AutoDiff] Autodiff 7: Fix header size in AdStackAllocaStmt to match u64 runtime layout (Genesis-Embodied-AI#534)

* [AutoDiff] Autodiff 8: Surface LLVM adstack push/pop overflow as a Python exception (Genesis-Embodied-AI#535)

* [AutoDiff] Autodiff 9: Guard against LLVM worker-thread stack overflow from large per-task adstack budget (Genesis-Embodied-AI#495)

* [AutoDiff] Autodiff 10: Implement adstack for SPIR-V (Genesis-Embodied-AI#490)

* [AutoDiff] Autodiff 11: Latent adstack-adjacent fixes (AMDGPU hipFree, flush() keeps ctx_buffers_, always-preallocate) (Genesis-Embodied-AI#536)

* [Doc] Add AGENTS.md with instructions for AI agents (Genesis-Embodied-AI#541)

* [Bug] Abort kernel execution on assertion failure instead of segfaulting (Genesis-Embodied-AI#419)

* [Type] ndarray typing 1: Add eval_str=True to inspect.signature() calls (Genesis-Embodied-AI#411)

* [CI] Suppress reportPrivateImportUsage in torch-using files (Genesis-Embodied-AI#552)

* [Misc] QD_DUMP_IR dumps to files with the task_id added to the filename (Genesis-Embodied-AI#441)

* [Type] ndarray typing 2: Fix NDArray single-arg subscript crash (Genesis-Embodied-AI#412)

* [Test] Flush xdist channel before worker exit so test failure reports are visible (Genesis-Embodied-AI#555)

* [CI] Reduce test retries on CI from 3 to 1. (Genesis-Embodied-AI#554)

* [AutoDiff] Autodiff 12: Heap-backed adstack on LLVM backends (CPU/CUDA/AMDGPU) (Genesis-Embodied-AI#537)

* [AutoDiff] Autodiff 13: Heap-backed adstack on SPIR-V backends (Metal, Vulkan) (Genesis-Embodied-AI#493)

* [AutoDiff] Autodiff 14: Resolve bounded-inner-loop adstacks without default_ad_stack_size fallback (Genesis-Embodied-AI#539)

* [SPIRV] Vulkan SPIR-V correctness: atomic-view aliasing, PSB stride, narrow storage caps, u1 cast, per-init layer recheck (Genesis-Embodied-AI#513)

* [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch and sanitise MoltenVK cap advertisement (Genesis-Embodied-AI#551)

* [Test] Suppress stock pytest-timeout to avoid conflict with pytest_hardtle (Genesis-Embodied-AI#557)

* [Vulkan] Use SDK validation layer for debugPrintf instead of apt package (Genesis-Embodied-AI#562)

* [Test] Fix flaky perf_dispatch tests by increasing work amounts (Genesis-Embodied-AI#559)

* [Test] Add --maxfail CLI option to run_tests.py (default 20) (Genesis-Embodied-AI#558)

* [CI] Vulkan debug printf fix to address flaky tests (Genesis-Embodied-AI#563)

* [Docs] Add a new page to help for first time contributors (Genesis-Embodied-AI#426)

Authored-by: v01dxyz <v01dxyz@v01d.xyz>

* [AutoDiff] Autodiff 16: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr (Genesis-Embodied-AI#543)

* Fix: raise error if device memory allocation fails (Genesis-Embodied-AI#451) (Genesis-Embodied-AI#453)

Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>

* [CI] Add CI job to check line wrapping of comments and docs (Genesis-Embodied-AI#564)

* [Misc] Add coverage report to PRs, including kernels (Genesis-Embodied-AI#470)

* [CI] CI wrap check feeds only diffs to agent (Genesis-Embodied-AI#567)

* Skip 'flaky' test on MacOS CI. (Genesis-Embodied-AI#573)

* [Test] Fix missing `import sys` in test_fail_device_memory_allocation (Genesis-Embodied-AI#574)

* [CI] Fix Vulkan debugPrintf flake with session-scoped warmup (Genesis-Embodied-AI#571)

* [AutoDiff] determine_ad_stack_size: replace whole-CFG Bellman-Ford with SCC + DAG DP (Genesis-Embodied-AI#575)

* [Test] Fix macOS OOM skip reason to describe actual root cause (Genesis-Embodied-AI#576)

* [Lang] whole_kernel_cse: 2.5x compile time speedup on large kernels (Genesis-Embodied-AI#577)

* [CI] Add CI check for unnecessarily deleted comments (Genesis-Embodied-AI#570)

* [CI] Migrate coverage report to github Check page (Genesis-Embodied-AI#566)

* [Lang] Skip IR verifier between passes unless debug=true (Genesis-Embodied-AI#579)

* [Lang] Inline AdStack ops on release LLVM codegen: dramatically reduces compile time for adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#584)

* [CUDA] Honor offline_cache=False end-to-end so QD_OFFLINE_CACHE=0 actually gives a cold compile (Genesis-Embodied-AI#580)

* [Type] Tensor 24 (Genesis-Embodied-AI#561)

Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>

* [Lang] auto_diff host-walk reductions: dramatically faster front-end compile time on adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#587)

* [AutoDiff] Speed up reverse-mode kernel launches on GPU backends (Genesis-Embodied-AI#578)

* [Vulkan] Move adstack-sizer scratch out of Function-scope memory to fix SPIR-V pipeline build failures (Genesis-Embodied-AI#588)

* [AutoDiff] Improve diagnosis of unsupported reverse-mode AD patterns (Genesis-Embodied-AI#590)

* [Bug] Fix: promote Ndarray to AnyArray in build_Name for flattened struct fields (Genesis-Embodied-AI#592)

* [SPIR-V] Shrink reverse-grad kernel MSL by ~50% (Genesis-Embodied-AI#591)

* [CI] Add CI check that PR changes have test coverage (Genesis-Embodied-AI#596)

* [Perf] Enable zero-copy in to_torch() and to_numpy() (Genesis-Embodied-AI#450)

* Add BufferView: safe sub-range ndarray access for kernels (Genesis-Embodied-AI#585)

Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>

* [Doc] Add user-facing fastcache documentation (Genesis-Embodied-AI#597)

Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>

* [Misc] Upgrade to enable v1 dlpack so to_numpy(copy=False) writable (Genesis-Embodied-AI#598)

Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local>

* [AutoDiff] Cut reverse-mode adstack memory usage 10x on all backends (Genesis-Embodied-AI#599)

* [Misc] Add CI check for feature file factorization (Genesis-Embodied-AI#606)

* [Perf] Skip _recursive_set_args for all-Field frozen dataclass structs (Genesis-Embodied-AI#607)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [AutoDiff] SNode-arm bound-expr capture rejects fold-attack gate indices (Genesis-Embodied-AI#610)

* [Misc] Suppress field fastcache warning for qd.Tensor (Genesis-Embodied-AI#615)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [AutoDiff] Adstack heap: clip reducer count by per-task loop trip count (compile-time and SizeExpr-evaluated) (Genesis-Embodied-AI#611)

* [Misc] Forward copy= through qd.Tensor, add copy=None option (Genesis-Embodied-AI#616)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Doc] Update README (Genesis-Embodied-AI#617)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Fix coverage report showing def lines as uncovered (Genesis-Embodied-AI#623)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Perf] Generic launcher: persistent context, JIT-pointer reuse, Metal compute encoder, LLVM-GPU async memory ops (Part 1/2) (Genesis-Embodied-AI#619)

* [CI] Encode Python-first testing policy in coverage-check prompt (Genesis-Embodied-AI#622)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Add PR Line change report (Genesis-Embodied-AI#624)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Disable quadrants pytest plugin during quadrants internal coverage runs (Genesis-Embodied-AI#629)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [AutoDiff] Adstack load+store eliminations: EliminateRecomputableAdStackPushes pass + leaf extensions (Genesis-Embodied-AI#621)

* [CI] Simplify coverage PR comment to a single linked line (Genesis-Embodied-AI#630)

* [CUDA] Add AGX Thor, SM_110 (Genesis-Embodied-AI#631)

Co-authored-by: Johnny Nunez and Hugh Perkins

* [CI] Lines changed report: collapse PR comment to a single linked totals line (Genesis-Embodied-AI#632)

* [FEATURE] Support external Metal command queue via qd.init (Genesis-Embodied-AI#618)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Perf] Cache adstack-sizer metadata per task across SPIR-V + LLVM-GPU; per-snode / DeviceAllocation invalidation (Part 2/2) (Genesis-Embodied-AI#620)

* [AutoDiff] Disable EliminateRecomputableAdStackPushes pending mutated-SNode chain-leaf fix (Genesis-Embodied-AI#633)

* [AutoDiff] Adstack chain-clone safety: mutated-SNode leaf reject + load_top consumer-aware guard (Genesis-Embodied-AI#634)

* [Docs] Add user-guide page for qd.simt.block.* primitives (Genesis-Embodied-AI#638)

* [Docs] Expand qd.simt.subgroup user-guide page to cover every op (Genesis-Embodied-AI#639)

* [Perf] Streams 1-4 (Genesis-Embodied-AI#410)

* [Docs] Add user-guide page for matrix decompositions and solvers (Genesis-Embodied-AI#643)

* [Bug] Revert "[Perf] Streams 1-4 (Genesis-Embodied-AI#410)" (Genesis-Embodied-AI#650)

* [Docs] Add user-guide page for atomics and bit operations (Genesis-Embodied-AI#640)

* [Docs] Add user-guide page for qd.simt.grid.* primitives (Genesis-Embodied-AI#641)

* [AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch (Genesis-Embodied-AI#635)

* [AMDGPU] Fix amdgpu parallel rand init (Genesis-Embodied-AI#658)

* [Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap (Genesis-Embodied-AI#655)

* [Perf] Re-land Streams 1-4 with bug fixes (Genesis-Embodied-AI#653)

* [AMDGPU] Apply device_memory_GB=0.3 cap to AMDGPU tests (Genesis-Embodied-AI#659)

* [Perf] Per-launch host sync: drop wait_idle on SPIR-V, pin stream and drop stream_synchronize on CUDA/AMDGPU (Genesis-Embodied-AI#654)

* [AMDGPU] Unload hipModule_t in JITModuleAMDGPU destructor (Genesis-Embodied-AI#660)

* [AMDGPU] Trim default mempool on qd.reset() (Genesis-Embodied-AI#669)

* [AMDGPU] Hoist rand-state buffer to process lifetime (Genesis-Embodied-AI#668)

* [Streams] Use events for streams serialization on AMDGPU and CUDA (Genesis-Embodied-AI#667)

* [Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id (Genesis-Embodied-AI#671)

* [SPIR-V] dispatch_max_reducers: register each task with the real kernel name (Genesis-Embodied-AI#675)

* [AutoDiff] Debug-mode field/grad/dual: dtype, layout, and access-time invariants (Genesis-Embodied-AI#677)

* [Docs] Add user-guide page for qd.algorithms.* device-wide algorithms (Genesis-Embodied-AI#642)

Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>

* [Docs] Doc for existing atomics: switch support table to per-backend columns (Genesis-Embodied-AI#657)

Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>

* [GPU] Cross gpu atomics (Genesis-Embodied-AI#666)

Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>

* [GPU] Make block operations portable cross-gpu (Genesis-Embodied-AI#664)

* [Perf] CPU LLVM adstack-cache: skip per-launch bump-writes + ndarray_shapes capture on forward-only handles (Genesis-Embodied-AI#685)

* [GPU] Cross-GPU for grid ops (Genesis-Embodied-AI#670)

* [Math] Make bitop operations portable cross-gpu (Genesis-Embodied-AI#662)

* [AMDGPU] Always use wave64, on both RDNA and CDNA (Genesis-Embodied-AI#687)

* [AMDGPU] Use syncscope("agent") for atomic xor to avoid CAS livelock (Genesis-Embodied-AI#672)

* [GPU] New bit ops for QIPC (Genesis-Embodied-AI#679)

* [GPU] Subgroup ops cross-gpu (Genesis-Embodied-AI#665)

* [Graph] Rename CUDA Graph to Graph in docs (Genesis-Embodied-AI#691)

* [SPIR-V] Fix FIFO-queue ordering when sharing command queue. (Genesis-Embodied-AI#694)

* [Atomics] New QIPC ops for atomics (Genesis-Embodied-AI#690)

* Pass dataclass sub-structs into qd.func (Genesis-Embodied-AI#698)

* [AMDGPU] HIP graph runtime support for @qd.kernel(graph=True) (Genesis-Embodied-AI#692)

* [CI] Add per-file timing report to Mac Metal test job (Genesis-Embodied-AI#695)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Enable kernel disk cache during tests (Genesis-Embodied-AI#696)

* [Math] New QIPC ops for single-threaded linalg (Genesis-Embodied-AI#683)

* [BREAKING][GPU] New QIPC ops for subgroups (Genesis-Embodied-AI#676)

* [GPU] New QIPC ops for block (Genesis-Embodied-AI#684)

* [GPU] New device-level ops for QIPC (Genesis-Embodied-AI#693)

* [algorithms] PrefixSumExecutor: drop unused GRID_SZ local (Genesis-Embodied-AI#701)

* [block] sync(): fix unsupported-arch error message (Genesis-Embodied-AI#700)

* [volatile_load] add qd.volatile_load primitive (closes Genesis-Embodied-AI#648) (Genesis-Embodied-AI#702)

* [AutoDiff] Reject recycled identity_key in AdStackCache::register_adstack_sizing_info (Genesis-Embodied-AI#708)

* [Vulkan] Declare GroupNonUniform SPIR-V caps and enable shaderSubgroupExtendedTypes (Genesis-Embodied-AI#707)

* Fix duplicate HIP graph driver-function declarations after v1.0.0 merge

The amd-integration fork had cherry-picked the HIP graph driver functions
(graph_create / graph_destroy / graph_add_kernel_node / graph_instantiate /
graph_exec_destroy / graph_launch), and upstream v1.0.0 added the same set.
The per-file 3-way merge appended both copies into
amdgpu_driver_functions.inc.h, producing redeclaration errors that broke the
AMDGPU RHI/runtime compile. Drop the upstream duplicate block; the signatures
are identical to the fork's existing declarations.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix AMDGPU launcher coherence and num_instructions visibility after v1.0.0 merge

- kernel_launcher.cpp: the 3-way merge spliced upstream v1.0.0's launch_llvm_kernel
  rewrite (ephemeral arg/context buffers, explicit-stream path, AmdgpuDefaultStream
  PinGuard) onto the AMD fork's kernarg-by-value + persistent-scratch design,
  leaving references to undefined `ephemeral_context_ptr`. Restore the fork's
  coherent launch_llvm_kernel verbatim; it calls the (already merged) enhanced
  launch_offloaded_tasks, which keeps the max-reducer dispatch and stream-parallel
  groups adapted onto the AMD launch path.
- llvm_context.h: both the fork and upstream added `num_instructions`; the merge
  kept upstream's private placement, but the AMDGPU codegen force-inline heuristic
  calls it statically from outside the class. Move it back to the public section.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Restore async result D2H and hoist kernarg vectors in AMDGPU launcher

The v1.0.0 merge resolution regressed two amd-integration baseline
optimizations in launch_llvm_kernel / launch_offloaded_tasks:

  - The per-launch result-buffer copy was a blocking memcpy_device_to_host,
    forcing a host stall on every value-returning launch and serializing the
    GPU pipeline. Restore the async D2H (the caller synchronizes lazily when it
    needs the value); external-array transfers still stream_synchronize once
    before reading back.

  - launch_task constructed the kernarg std::vectors from initializer lists
    ({kernarg_payload} / {kernarg_size}) on every dispatch (heap alloc + free
    per launch). Hoist arg_ptrs/arg_sizes out of the per-task launch and reuse.
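
The hoisting described in the second bullet can be sketched as follows. This is a minimal illustration, not the launcher's actual code: `fake_launch`, `launch_all_naive`, and `launch_all_hoisted` are hypothetical names, and only `arg_ptrs` / `arg_sizes` come from the commit message.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the driver launch call; it only reads the arrays.
inline std::size_t fake_launch(const std::vector<void *> &arg_ptrs,
                               const std::vector<std::size_t> &arg_sizes) {
  return arg_ptrs.size() == arg_sizes.size() ? arg_sizes[0] : 0;
}

// Before: both vectors are built from initializer lists on every dispatch,
// which costs one heap allocation and one free per vector per task.
inline std::size_t launch_all_naive(void *payload, std::size_t size, int num_tasks) {
  std::size_t total = 0;
  for (int i = 0; i < num_tasks; ++i) {
    std::vector<void *> arg_ptrs{payload};
    std::vector<std::size_t> arg_sizes{size};
    total += fake_launch(arg_ptrs, arg_sizes);
  }
  return total;
}

// After: the vectors are built once, outside the per-task loop, and reused.
inline std::size_t launch_all_hoisted(void *payload, std::size_t size, int num_tasks) {
  std::vector<void *> arg_ptrs{payload};
  std::vector<std::size_t> arg_sizes{size};
  std::size_t total = 0;
  for (int i = 0; i < num_tasks; ++i) {
    total += fake_launch(arg_ptrs, arg_sizes);
  }
  return total;
}
```

Both variants pass the same arguments to the launch call; only the number of allocations differs.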

Co-authored-by: Cursor <cursoragent@cursor.com>

* amdgpu: default to LDS permlane64 emulation; drop host-x86 barrier asm on retarget

Two AMDGPU JIT-compile crashes surfaced after the v1.0.0 merge pulled in the QIPC subgroup
ops (Genesis-Embodied-AI#676), which made the rigid constraint solver's wave-cooperative reductions route through
`amdgpu_cross_half_shuffle_i32`. Both manifested as a SIGSEGV inside
`llvm::SIInstrInfo::getInstSizeInBytes` during `JITSessionAMDGPU::compile_module_to_hsaco`
(i.e. at first kernel launch), and reproduce on gfx942 / MI300X. Baseline 0.4.6 never emitted
these constructs, which is why it was unaffected.

1. Native `llvm.amdgcn.permlane64` lowering crashes the bundled LLVM 22.1.0 AMDGPU backend.
   Default `amdgpu_permlane64` to the existing LDS-roundtrip software emulation on every target
   (it produces identical results). Add `QD_AMDGPU_USE_NATIVE_PERMLANE64=1` to opt back into the
   native instruction once the backend bug is fixed; the fallback that
   `QD_AMDGPU_FORCE_PERMLANE64_FALLBACK` used to force is now the default, and the variable is
   still honored. This is the actual crash fix.

2. The runtime module is compiled by the host x86_64 clang and only retargeted to amdgcn here, so
   `amdgpu_cross_half_shuffle_i32`'s `__asm__ volatile("" : "+v"(byte))` optimization barrier carries
   x86 flag clobbers (`~{dirflag},~{fpsr},~{flags}`) that are meaningless on AMDGPU. The IR verifies
   but the empty-body INLINEASM is invalid on the amdgcn target. Neutralize empty-body barrier asm
   during retarget (forward the tied value, then erase) so no stale host asm reaches codegen. On the
   wave64 targets we ship, `ds_bpermute` already addresses the full wave, so the hint is a no-op.
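
One way the environment-variable selection from item 1 could look is sketched below. Only the two variable names come from the commit message. The helper names are hypothetical, and two behaviors are assumptions: that a value of exactly `1` enables a flag, and that the fallback variable wins when both are set.

```cpp
#include <cstdlib>
#include <cstring>

// Assumption: a flag counts as enabled only when its value is exactly "1".
inline bool env_flag_set(const char *value) {
  return value != nullptr && std::strcmp(value, "1") == 0;
}

// Pure decision function, kept separate from getenv so it is easy to test.
// Default (neither variable set) is the LDS-roundtrip emulation, i.e. false.
inline bool select_native_permlane64(const char *native_value,
                                     const char *fallback_value) {
  if (env_flag_set(fallback_value)) {
    return false;  // assumption: the legacy fallback request takes precedence
  }
  return env_flag_set(native_value);
}

// Reads the two variables named in the commit message from the environment.
inline bool use_native_permlane64_from_env() {
  return select_native_permlane64(
      std::getenv("QD_AMDGPU_USE_NATIVE_PERMLANE64"),
      std::getenv("QD_AMDGPU_FORCE_PERMLANE64_FALLBACK"));
}
```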

Co-authored-by: Cursor <cursoragent@cursor.com>

* style: apply clang-format (v19.1.7) to AMDGPU fn_attrs and launcher sources

CI pre-commit's clang-format hook reformatted these files (long
declarations/lambda signatures collapsed onto single lines per the repo's
clang-format config). Apply the same formatting so the hook passes.

No functional changes.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(amdgpu): use CreateNeg for branchless i32 sgn instead of CreateSub(0, input)

clang-tidy (modernize-use-nullptr, -warnings-as-errors) flagged
`builder->CreateSub(0, input)` in the i32 sgn path: the literal `0` binds to
the `llvm::Value*` LHS parameter as a null pointer, not an integer zero.
Replace with `builder->CreateNeg(input)`, which emits `0 - input` with a proper
zero constant -- identical intended semantics, and clang-tidy clean.
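
The pitfall can be reproduced without LLVM. In the sketch below, `Value`, `mock_sub`, and `mock_neg` are hypothetical stand-ins for `llvm::Value` and the builder methods. It shows only the C++ conversion rule: a literal `0` passed to a pointer parameter is a null pointer, not an integer zero operand.

```cpp
// Hypothetical stand-in for llvm::Value: just wraps an integer.
struct Value {
  int v;
};

// Mimics a builder method taking two operand pointers. Returns false when an
// operand is null instead of dereferencing it.
inline bool mock_sub(const Value *lhs, const Value *rhs, int &out) {
  if (lhs == nullptr || rhs == nullptr) {
    return false;
  }
  out = lhs->v - rhs->v;
  return true;
}

// Negation built on a real zero operand, i.e. `0 - input` with a proper
// zero constant rather than a null pointer.
inline int mock_neg(const Value *input) {
  const Value zero{0};
  int out = 0;
  mock_sub(&zero, input, out);
  return out;
}
```

With `Value x{7}`, the call `mock_sub(0, &x, out)` compiles but receives a null `lhs`, while `mock_neg(&x)` subtracts from an actual zero operand.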

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Robert Dazi <14996868+v01dXYZ@users.noreply.github.com>
Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>
Co-authored-by: Alexis DUBURCQ <alexis.duburcq@gmail.com>
Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>
Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>
Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com>
Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Johnny <johnnynuca14@gmail.com>