[SPIR-V] Shrink reverse-grad kernel MSL by ~50% #591
@claude review
Coverage Report

| File | Coverage | Missing |
|---|---|---|
| 🟢 tests/python/test_adstack.py | 100% | |

Diff coverage: 100% · Overall: 73% · 6 lines, 0 missing
| a = x[-1] # AssertionError in debug mode | ||
| ``` | ||
|
|
||
| The same flag also enables a deferred runtime check on the adstack used by reverse-mode autodiff: a push past the per-stack capacity (set via `qd.init(ad_stack_size=...)` or per-alloca by `determine_ad_stack_size`) raises `RuntimeError("[Aa]dstack overflow")` on the next `qd.sync()`. Without bounds-checking, an adstack overflow silently writes past the per-thread slab and produces a wrong gradient. |
Let's add some kind of adstack section/subsection header please.
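To make the "deferred runtime check" wording in the paragraph under review concrete, here is a minimal pure-Python sketch. It is a toy model, not Quadrants code: `ToyAdStack`, its `push`/`sync` methods, and the `overflowed` flag are all hypothetical names chosen for illustration. The only behavior taken from the doc text is that an overflowing push does not raise immediately and that the error surfaces as a `RuntimeError` at the next sync.

```python
class ToyAdStack:
    """Toy model of a fixed-capacity adstack with a deferred overflow check.

    Hypothetical illustration only; the real runtime works on per-thread
    slabs on the device, which this class does not attempt to reproduce.
    """

    def __init__(self, capacity, check_overflow):
        self.capacity = capacity
        self.check_overflow = check_overflow
        self.slots = []
        self.overflowed = False  # sticky flag, inspected only at sync time

    def push(self, value):
        if len(self.slots) >= self.capacity:
            if self.check_overflow:
                # Record the overflow; raising is deferred until sync().
                self.overflowed = True
            # With the check off, the doc says the real runtime writes past
            # the slab. The toy model just drops the value, which is also a
            # silent wrong result.
            return
        self.slots.append(value)

    def sync(self):
        # Stand-in for the point where the host inspects the overflow flag.
        if self.overflowed:
            self.overflowed = False
            raise RuntimeError("Adstack overflow")


checked = ToyAdStack(capacity=2, check_overflow=True)
for v in range(3):
    checked.push(v)  # third push overflows but does not raise here
```

In this model the failure only becomes visible when `checked.sync()` is called, which mirrors the "on the next `qd.sync()`" wording in the doc.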
|
|
||
| `debug=True` is a superset of `check_out_of_bound=True`. Setting `qd.init(check_out_of_bound=True)` without `debug=True` enables the field bounds check and the adstack overflow check, but skips kernel `assert` evaluation, integer overflow detection on arithmetic, and the other checks listed below. Use this when you want bounds-safety in a release-build app without paying the full debug-mode cost. |
This seems like a mixture of general debug stuff and adstack-specific stuff. Can we factor the general stuff out of the new adstack subsection, and keep just the adstack-specific stuff here, please?

You haven't introduced `check_out_of_bound` yet. I feel it should be a separate section from 'debug'. But... why introduce a separate flag? Why not just have a single debug flag, for simplicity?

OK, I see you've started to provide the reasons, but I feel this could be structured more clearly, and I think it's confusing to have two flags, one of which is a subset of the other, so if we can avoid that, that might be cleaner. I guess full debug is super slow?
What happens if debug is true and `check_out_of_bound` is false?
|
|
||
| On the Metal and Vulkan backends, `check_out_of_bound=True` is silently disabled at `qd.init` time because those backends lack the in-kernel assertion extension that the field bounds check relies on; passing it on its own gives you neither the field bounds check nor the adstack overflow check. Pass `debug=True` instead: that keeps the adstack overflow check live (it is gated independently and does not need the assertion extension), but the field bounds check still does not fire on these backends. |
`check_out_of_bound=True` seems like a general thing, so let's also move it outside of the adstack section, please.
|
|
|
…pu] now that PR #591 codegen is in place via a398612; this is the test where the original Genesis CI failure was observed and where the local M4 measurement put mpm_grid_op_c65 at 85.6 MB peak phys_footprint - right at the 100 MB cap, so the matrix run on macos-15 / macos-26 will resolve whether PR #591 alone is enough
…adstack-overflow checks (init_options.md) and for kernel print() (debug.md), restate that the adstack overflow check fires on all backends with debug=True regardless of whether the backend supports the assertion mechanism, and warn that kernel print() forces a queue sync after every dispatch of the containing kernel - significant overhead even when the surrounding control flow makes the print unreachable; also relocate the slot-pointer comment block in spirv_codegen.cpp from above ad_stack_count_ptr to above ad_stack_slot_ptr where it actually belongs (per the bot review on PR #591)
Coverage Report

| File | Coverage | Missing |
|---|---|---|
| 🟢 tests/python/test_adstack.py | 100% | |

Diff coverage: 100% · Overall: 67% · 6 lines, 0 missing
|
|
||
| **Note.** Output from GPU kernels appears in order despite parallel execution because all kernels are queued in the same compute stream. | ||
|
|
||
| **Important.** Avoid kernel `print()` calls in production code where you can. Quadrants synchronizes the compute queue after every dispatch of a kernel that contains a `print()` so the output appears as close as possible to the call site. The synchronization happens unconditionally on every launch of that kernel, even when the surrounding control flow leaves the `print()` unreached at runtime; the cost is the full per-launch sync overhead, not just the cost of the `print()` itself. |
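The cost model described in this warning can be sketched with a small pure-Python simulation. This is an assumption-laden toy, not the Quadrants launcher: `ToyQueue`, `launch`, and `run` are hypothetical names. It only encodes the two facts stated in the doc text: a sync happens after every dispatch of a kernel that contains a `print()`, and it happens whether or not the `print()` is reached on that launch.

```python
from dataclasses import dataclass


@dataclass
class ToyQueue:
    """Toy compute queue that counts syncs and printed lines (hypothetical)."""

    syncs: int = 0
    lines_printed: int = 0

    def launch(self, contains_print: bool, print_reached: bool) -> None:
        # Output is produced only if the print is actually hit at runtime.
        if contains_print and print_reached:
            self.lines_printed += 1
        # The sync depends on the kernel body containing a print at all,
        # not on whether this particular launch reached it.
        if contains_print:
            self.syncs += 1


def run(launches: int, contains_print: bool, print_reached: bool) -> ToyQueue:
    """Dispatch the same toy kernel `launches` times and return the counters."""
    queue = ToyQueue()
    for _ in range(launches):
        queue.launch(contains_print, print_reached)
    return queue


# A print that is never reached still costs one sync per launch in this model.
guarded = run(launches=1000, contains_print=True, print_reached=False)
clean = run(launches=1000, contains_print=False, print_reached=False)
```

Here `guarded` ends with 1000 syncs and no output, while `clean` ends with none, which is the asymmetry the warning is pointing at.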
| Backend | Field bounds check | Adstack overflow check |
|---|---|---|
| CPU | with `check_out_of_bound=True` or `debug=True` | with `check_out_of_bound=True` or `debug=True` |
| CUDA | with `check_out_of_bound=True` or `debug=True` | with `check_out_of_bound=True` or `debug=True` |
| AMDGPU | with `check_out_of_bound=True` or `debug=True` | with `check_out_of_bound=True` or `debug=True` |
| Metal | never (no in-kernel assertion mechanism) | with `debug=True` only |
Wait, why the inconsistency for 'adstack overflow check' on Metal and Vulkan?

I think the behavior should be consistent across platforms, except for features not supported by a platform at all (so 'never' is OK for Vulkan and Metal, for example, though not ideal of course).

I'm OK with that, but this behavior pre-dates this PR; here I'm just documenting the current state. I could fix it in this PR if you want.

Oh, I see, I assumed that these were changes in this PR.
Yeah, OK, let's not feature-flate this PR :) Thank you for the doc :) I think it explains the current situation clearly. 🙌
| Metal | never (no in-kernel assertion mechanism) | with `debug=True` only |
| Vulkan | never (no in-kernel assertion mechanism) | with `debug=True` only |
|
|
||
| The adstack overflow check is gated independently of the assertion mechanism, so `debug=True` activates it on every backend - including Metal and Vulkan, where the field bounds check stays unavailable. On Metal and Vulkan, `check_out_of_bound` is silently reset to `False` at `qd.init` time (a warning is logged); passing it on its own gives neither check on those backends. |
I think we should leave it enabled on Metal, and narrow the warning to say that only adstack overflows will be checked, not out-of-bounds accesses.
But actually, now I think about it, why should 'out of bound' track 'adstack overflow'?
I think these should be two different flags.

That makes sense. Do you want to use 'debug' for this or a new flag? Do you want to make the changes in this PR?

Since you are just documenting the existing behavior, let's not change this in this doc. Thank you :)
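For readers following this thread, the per-backend gating described in the table rows and paragraph under review can be written down as a small truth function. This is a pure-Python model of the documented behavior at this point in the review, not Quadrants source: `active_checks` and the backend strings are hypothetical names, and the logic is transcribed from the doc text only.

```python
# Backends that, per the doc under review, have an in-kernel assertion mechanism.
ASSERT_BACKENDS = {"cpu", "cuda", "amdgpu"}
ALL_BACKENDS = ASSERT_BACKENDS | {"metal", "vulkan"}


def active_checks(backend, debug=False, check_out_of_bound=False):
    """Return which checks are live for a flag combination (toy model)."""
    if backend not in ALL_BACKENDS:
        raise ValueError(f"unknown backend: {backend}")

    if backend in ASSERT_BACKENDS:
        # Either flag enables both checks on CPU / CUDA / AMDGPU.
        enabled = debug or check_out_of_bound
        return {"field_bounds": enabled, "adstack_overflow": enabled}

    # Metal / Vulkan: check_out_of_bound is reset to False at init time, the
    # field bounds check never fires, and only debug keeps the adstack
    # overflow check live.
    return {"field_bounds": False, "adstack_overflow": debug}


metal_oob_only = active_checks("metal", check_out_of_bound=True)
vulkan_debug = active_checks("vulkan", debug=True)
```

Under this model `metal_oob_only` reports neither check as active, which is the inconsistency raised above. A later commit in this PR gates the adstack overflow check on `debug` alone on every backend, so the first branch no longer describes the final state.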
|
OK, doc looks good to me. Whilst it looks like these changes just target reverse-grad autodiff, let's get Genesis unit test results and Genesis benchmark results please, just to be sure.

Oh, they're already there. Checklist:
=> OK to merge
…elease bounds-check elision, shared count-array
…on error with kernel name + MSL byte size
…strings (per source-comment style rule)
…path; drop shader-size-cause speculation
…EXC_RESOURCE mechanism observed on GitHub-hosted macos-15 Apple-M1 runners (XPC service hits a hard 100 MB working-set cap during AIR-to-GPU compile and is killed by the kernel) and add a Metal/Vulkan caveat to the new debug-mode paragraph clarifying that check_out_of_bound is silently disabled on those backends and only the adstack overflow check survives via debug=True
…cific 100 MB / EXC_RESOURCE framing in favor of a generic 'compiler service exceeds a per-process memory budget mid-compile' wording, since the cap is platform-specific and citing it inline overspecifies the failure
…lit the adstack overflow check into its own subsection of debug.md, move the check_out_of_bound flag interaction (table of debug/check_out_of_bound combinations + Metal/Vulkan caveat) into the dedicated check_out_of_bound entry of init_options.md so debug.md stays focused on the user-facing checks and the option-reference centralizes the flag-level details, and tighten both debug and check_out_of_bound entries to bullet/table form so the relevant facts are scannable instead of buried in prose
…ng section as a #### Adstack overflow subsection (per Hugh's #### suggestion: it's another bounds check, sharing the same check_out_of_bound flag), and add a back-cross-reference from init_options.md's Debugging section to debug.md so users landing on the option reference can find the runnable examples and develop/benchmark workflow
…adstack-overflow checks (init_options.md) and for kernel print() (debug.md), restate that the adstack overflow check fires on all backends with debug=True regardless of whether the backend supports the assertion mechanism, and warn that kernel print() forces a queue sync after every dispatch of the containing kernel - significant overhead even when the surrounding control flow makes the print unreachable; also relocate the slot-pointer comment block in spirv_codegen.cpp from above ad_stack_count_ptr to above ad_stack_slot_ptr where it actually belongs (per the bot review on PR #591)
…to autodiff.md's bold-prefix style for consistency across the user_guide
…arking' to 'Avoid ... in production code' since the queue-sync overhead matters in any production path, not just during benchmarks
…rint sync warning - the print is in the kernel body and reachable in principle; it just may not be hit on a given launch
…de if possible' - print may be unavoidable in some debugging-in-production scenarios
… close as possible to the call site' - more precise about what the sync buys
… warning to avoid the doubled 'possible' against 'as close as possible'
…e debug=True implies check_out_of_bound=True relationship first, then the actionable recommendation that follows from it
…unnecessarily dropped from the kernel-print intro line
…note - the claim is unverified and Quadrants' per-kernel sync after dispatching a print-bearing kernel may already serialize in practice
…e all kernels share one compute stream) instead of dropping it, and switch the two 'Cost:' leads in init_options.md to '**Cost.**' bold-prefix style for consistency with the autodiff.md / debug.md Note / Important / Cost convention
…s check per Hugh's review on PR #591: gate the AdStack push/pop/load_top/load_top_adj sites on compile_config.debug instead of compile_config.check_out_of_bound on the LLVM side (matches the pre-#591 behavior verbatim) and on compile_config_->debug (no longer ORed with check_out_of_bound) on the SPIR-V side, so the two checks land on independent flags and PR #591 stops introducing the coupling Hugh flagged. Also relabel the AMDGPU print() table row from 'no (compile error)' to 'no (silently dropped)' since codegen_amdgpu.cpp visit(PrintStmt) overrides with a no-op (per bot review), fix the spirv_codegen.h cross-reference from the non-existent 'ensure_ad_stack_count_array_var' to the real 'ad_stack_count_ptr' helper (per bot review), and update the init_options.md per-backend table + flag-interaction bullets to reflect the new debug-only gating for the adstack overflow check
d9a1443 to
a05e06a
Compare
Coverage Report

| File | Coverage | Missing |
|---|---|---|
| 🟢 tests/python/test_adstack.py | 100% | |

Diff coverage: 100% · Overall: 73% · 6 lines, 0 missing
Coverage Report

| File | Coverage | Missing |
|---|---|---|
| 🔴 tests/python/test_adstack.py | 33% | 1024-1027 |

Diff coverage: 33% · Overall: 65% · 6 lines, 4 missing
…'implied check_out_of_bound' references in the two adjacent overflow-test docstrings, per bot review on PR #591. After the earlier decoupling commit on this branch (which moved the LLVM adstack-visitor gates back to compile_config.debug and the SPIR-V push gate to compile_config_->debug only), check_out_of_bound=True alone no longer activates the adstack-overflow check on any backend - the test pinning that coupling is invalid by construction. The remaining test_adstack_overflow_raises[debug=True] still covers the user-facing 'I need the deferred RuntimeError on overflow' path
Coverage Report

| File | Coverage | Missing |
|---|---|---|

Diff coverage: 0% · Overall: 73% · 0 lines, 0 missing
* [Misc] Warn user to disable caching when print_ir/QD_DUMP_IR enabled (Genesis-Embodied-AI#425) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> * [Build] Pin torch version to CUDA 12.8 for CUDA tests (Genesis-Embodied-AI#428) * [Misc] Fixing up taichi-dev urls (Genesis-Embodied-AI#429) * [Perf] Rename cuda_graph to gpu_graph across the codebase (Genesis-Embodied-AI#430) * Misc: fix typo integeral -> integral (Genesis-Embodied-AI#434) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> * [Perf] CUDA graph 4: call from multiple locations (Genesis-Embodied-AI#420) * [Bug] Fix fastcache not restoring graph_do_while_arg (Genesis-Embodied-AI#435) * [Perf] Cache last-call result in perf_dispatch for single-compatible case (Genesis-Embodied-AI#438) * Fix gpu_graph fallback on old Nvidia GPU. (Genesis-Embodied-AI#443) * Fix shared memory offset not reset between CUDA kernels. (Genesis-Embodied-AI#442) * [Misc] Allow disabling GPU graph via QD_GPU_GRAPH=0 env var (Genesis-Embodied-AI#439) * [Misc] Add named top-level loops (Genesis-Embodied-AI#440) * [Misc] Rename gpu_graph to graph (Genesis-Embodied-AI#446) * [Misc] Add cross-platform shuffle (Genesis-Embodied-AI#447) * [Bug] Fix graph_do_while on Windows: search for cudadevrt.lib (Genesis-Embodied-AI#456) * [Bug] Also search default CUDA toolkit install location on Windows (Genesis-Embodied-AI#461) * [SPIRV] Feature Parity Atomics & Shared Array (Genesis-Embodied-AI#432) * [Misc] Change clang format to 120 characters (Genesis-Embodied-AI#463) * [Misc] CUDA graph 5 Add fatbin (Genesis-Embodied-AI#464) * [Bug] Reuse VkInstance across init/reset cycles (Genesis-Embodied-AI#465) * [Perf] Tiles 1: _load, _store, _eye_ (Genesis-Embodied-AI#466) * [Misc] Remove dead InternalFuncStmt type_check override (Genesis-Embodied-AI#471) * [Perf] Tiles 2: add cholesky and ger (Genesis-Embodied-AI#472) * [Perf] Tiles 2b: add triangular solve (Genesis-Embodied-AI#474) * [Misc] Refactor: use _get_col/_set_col in tiles load/store/init 
(Genesis-Embodied-AI#475) * [Build] Fix flaky test_clock_accuracy (Genesis-Embodied-AI#436) * Fix AARCH64 emitting invalid asm in CUDA kernels. (Genesis-Embodied-AI#473) Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [AMDGPU] Enable HIP memory pool and surface pool-exhaustion errors. (Genesis-Embodied-AI#485) * [AMDGPU] Scope hsaco tmp dir per-user to avoid collisions. (Genesis-Embodied-AI#484) * [Perf] Tiles 3: Add slice syntax, qd.outer() and initial doc (Genesis-Embodied-AI#477) * [AMDGPU] Fix gradient computation. (Genesis-Embodied-AI#486) * Enable all backends that are supported in unit tests. (Genesis-Embodied-AI#488) * Fix SPIRV ID overflow for large kernels due to autodiff. (Genesis-Embodied-AI#489) * [Misc] Fix purity checker to allow accessing constants from quadrants modules (Genesis-Embodied-AI#487) * [Misc] Increase tolerance for clock monotonic test (Genesis-Embodied-AI#492) * [CI] Serialize api doc workflow (Genesis-Embodied-AI#494) * [CI] Increase tolerance for clock test (Genesis-Embodied-AI#506) * [CI] Increase clock test tolerance to 20% (Genesis-Embodied-AI#509) * [Perf] Add tensor_type parametrization to tile16 tests (Genesis-Embodied-AI#504) * [Perf] Tiles 4b: Migrate tiles16 tests to enable fastcache (Genesis-Embodied-AI#505) * [Perf] Tiles 4c: add Tiles16x16 proxy (Genesis-Embodied-AI#507) * [Perf] Tiles 4d: Consolidate slice error tests using parametrize (Genesis-Embodied-AI#508) * [Perf] Tiles 4: add SharedArray slice support (Genesis-Embodied-AI#482) * [Perf] Tiles 5: add Cholesky benchmark demo (Genesis-Embodied-AI#483) * [Doc] Add user guide page for subgroup shuffle (Genesis-Embodied-AI#512) * [Perf] Implement cross-platform shuffle_down (Genesis-Embodied-AI#510) * [Perf] Add portable subgroup reduce_add and reduce_all_add (Genesis-Embodied-AI#511) * [Perf] Add first warmup config to perf dispatch (Genesis-Embodied-AI#422) * [AutoDiff] Autodiff 1: Add baseline adstack regression test for unary_collections 
(Genesis-Embodied-AI#500) * [AutoDiff] Autodiff 2: Implement derivative for tan (Genesis-Embodied-AI#501) * [AutoDiff] Autodiff 3: Recompute tanh/exp on the operand in the reverse pass (Genesis-Embodied-AI#502) * [AutoDiff] Autodiff 4: Mark rsqrt as non-linear for adstack promotion (Genesis-Embodied-AI#503) * [AutoDiff] Autodiff 5: Fix adjoint-alloca placement for GlobalLoads outside the current range-for (Genesis-Embodied-AI#496) * [AutoDiff] Autodiff 6: Adstack regression tests (Genesis-Embodied-AI#491) * [AutoDiff] Autodiff 7: Fix header size in AdStackAllocaStmt to match u64 runtime layout (Genesis-Embodied-AI#534) * [AutoDiff] Autodiff 8: Surface LLVM adstack push/pop overflow as a Python exception (Genesis-Embodied-AI#535) * [AutoDiff] Autodiff 9: Guard against LLVM worker-thread stack overflow from large per-task adstack budget (Genesis-Embodied-AI#495) * [AutoDiff] Autodiff 10: Implement adstack for SPIR-V (Genesis-Embodied-AI#490) * [AutoDiff] Autodiff 11: Latent adstack-adjacent fixes (AMDGPU hipFree, flush() keeps ctx_buffers_, always-preallocate) (Genesis-Embodied-AI#536) * [Doc] Add AGENTS.md with instructions for AI agents (Genesis-Embodied-AI#541) * [Bug] Abort kernel execution on assertion failure instead of segfaulting (Genesis-Embodied-AI#419) * [Type] ndarray typing 1: Add eval_str=True to inspect.signature() calls (Genesis-Embodied-AI#411) * [CI] Suppress reportPrivateImportUsage in torch-using files (Genesis-Embodied-AI#552) * [Misc] QD_DUMP_IR dumps to files with the task_id added to the filename (Genesis-Embodied-AI#441) * [Type] ndarray typing 2: Fix NDArray single-arg subscript crash (Genesis-Embodied-AI#412) * [Test] Flush xdist channel before worker exit so test failure reports are visible (Genesis-Embodied-AI#555) * [CI] Reduce test retries on CI from 3 to 1. 
(Genesis-Embodied-AI#554) * [AutoDiff] Autodiff 12: Heap-backed adstack on LLVM backends (CPU/CUDA/AMDGPU) (Genesis-Embodied-AI#537) * [AutoDiff] Autodiff 13: Heap-backed adstack on SPIR-V backends (Metal, Vulkan) (Genesis-Embodied-AI#493) * [AutoDiff] Autodiff 14: Resolve bounded-inner-loop adstacks without default_ad_stack_size fallback (Genesis-Embodied-AI#539) * [SPIRV] Vulkan SPIR-V correctness: atomic-view aliasing, PSB stride, narrow storage caps, u1 cast, per-init layer recheck (Genesis-Embodied-AI#513) * [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch and sanitise MoltenVK cap advertisement (Genesis-Embodied-AI#551) * [Test] Suppress stock pytest-timeout to avoid conflict with pytest_hardtle (Genesis-Embodied-AI#557) * [Vulkan] Use SDK validation layer for debugPrintf instead of apt package (Genesis-Embodied-AI#562) * [Test] Fix flaky perf_dispatch tests by increasing work amounts (Genesis-Embodied-AI#559) * [Test] Add --maxfail CLI option to run_tests.py (default 20) (Genesis-Embodied-AI#558) * [CI] Vulkan debug printf fix to address flaky tests (Genesis-Embodied-AI#563) * [Docs] Add a new page to help for first time contributors (Genesis-Embodied-AI#426) Authored-by: v01dxyz <v01dxyz@v01d.xyz> * [AutoDiff] Autodiff 16: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr (Genesis-Embodied-AI#543) * Fix: raise error if device memory allocation fails (Genesis-Embodied-AI#451) (Genesis-Embodied-AI#453) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [CI] Add CI job to check line wrapping of comments and docs (Genesis-Embodied-AI#564) * [Misc] Add coverage report to PRs, including kernels (Genesis-Embodied-AI#470) * [CI] CI wrap check feeds only diffs to agent (Genesis-Embodied-AI#567) * Skip 'flaky' test on MacOS CI. 
(Genesis-Embodied-AI#573) * [Test] Fix missing `import sys` in test_fail_device_memory_allocation (Genesis-Embodied-AI#574) * [CI] Fix Vulkan debugPrintf flake with session-scoped warmup (Genesis-Embodied-AI#571) * [AutoDiff] determine_ad_stack_size: replace whole-CFG Bellman-Ford with SCC + DAG DP (Genesis-Embodied-AI#575) * [Test] Fix macOS OOM skip reason to describe actual root cause (Genesis-Embodied-AI#576) * [Lang] whole_kernel_cse: 2.5x compile time speedup on large kernels (Genesis-Embodied-AI#577) * [CI] Add CI check for unnecessarily deleted comments (Genesis-Embodied-AI#570) * [CI] Migrate coverage report to github Check page (Genesis-Embodied-AI#566) * [Lang] Skip IR verifier between passes unless debug=true (Genesis-Embodied-AI#579) * [Lang] Inline AdStack ops on release LLVM codegen: dramatically reduces compile time for adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#584) * [CUDA] Honor offline_cache=False end-to-end so QD_OFFLINE_CACHE=0 actually gives a cold compile (Genesis-Embodied-AI#580) * [Type] Tensor 24 (Genesis-Embodied-AI#561) Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> * [Lang] auto_diff host-walk reductions: dramatically faster front-end compile time on adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#587) * [AutoDiff] Speed up reverse-mode kernel launches on GPU backends (Genesis-Embodied-AI#578) * [Vulkan] Move adstack-sizer scratch out of Function-scope memory to fix SPIR-V pipeline build failures (Genesis-Embodied-AI#588) * [AutoDiff] Improve diagnosis of unsupported reverse-mode AD patterns (Genesis-Embodied-AI#590) * [Bug] Fix: promote Ndarray to AnyArray in build_Name for flattened struct fields (Genesis-Embodied-AI#592) * [SPIR-V] Shrink reverse-grad kernel MSL by ~50% (Genesis-Embodied-AI#591) * [CI] Add CI check that PR changes have test coverage (Genesis-Embodied-AI#596) * [Perf] Enable zero-copy in to_torch() and to_numpy() (Genesis-Embodied-AI#450) * Add 
BufferView: safe sub-range ndarray access for kernels (Genesis-Embodied-AI#585) Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [Doc] Add user-facing fastcache documentation (Genesis-Embodied-AI#597) Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> * [Misc] Upgrade to enable v1 dlpack so to_numpy(copy=False) writable (Genesis-Embodied-AI#598) Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local> * [AutoDiff] Cut reverse-mode adstack memory usage 10x on all backends (Genesis-Embodied-AI#599) * [Misc] Add CI check for feature file factorization (Genesis-Embodied-AI#606) * [Perf] Skip _recursive_set_args for all-Field frozen dataclass structs (Genesis-Embodied-AI#607) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] SNode-arm bound-expr capture rejects fold-attack gate indices (Genesis-Embodied-AI#610) * [Misc] Suppress field fastcache warning for qd.Tensor (Genesis-Embodied-AI#615) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] Adstack heap: clip reducer count by per-task loop trip count (compile-time and SizeExpr-evaluated) (Genesis-Embodied-AI#611) * [Misc] Forward copy= through qd.Tensor, add copy=None option (Genesis-Embodied-AI#616) Co-authored-by: Cursor <cursoragent@cursor.com> * [Doc] Update README (Genesis-Embodied-AI#617) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Fix coverage report showing def lines as uncovered (Genesis-Embodied-AI#623) Co-authored-by: Cursor <cursoragent@cursor.com> * [Perf] Generic launcher: persistent context, JIT-pointer reuse, Metal compute encoder, LLVM-GPU async memory ops (Part 1/2) (Genesis-Embodied-AI#619) * [CI] Encode Python-first testing policy in coverage-check prompt (Genesis-Embodied-AI#622) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Add PR Line change report (Genesis-Embodied-AI#624) Co-authored-by: Cursor <cursoragent@cursor.com> 
* [CI] Disable quadrants pytest plugin during quadrants internal coverage runs (Genesis-Embodied-AI#629) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] Adstack load+store eliminations: EliminateRecomputableAdStackPushes pass + leaf extensions (Genesis-Embodied-AI#621) * [CI] Simplify coverage PR comment to a single linked line (Genesis-Embodied-AI#630) * [CUDA] Add AGX Thor, SM_110 (Genesis-Embodied-AI#631) Co-authored-by: Johnny Nunez and Hugh Perkins * [CI] Lines changed report: collapse PR comment to a single linked totals line (Genesis-Embodied-AI#632) * [FEATURE] Support external Metal command queue via qd.init (Genesis-Embodied-AI#618) Co-authored-by: Cursor <cursoragent@cursor.com> * [Perf] Cache adstack-sizer metadata per task across SPIR-V + LLVM-GPU; per-snode / DeviceAllocation invalidation (Part 2/2) (Genesis-Embodied-AI#620) * [AutoDiff] Disable EliminateRecomputableAdStackPushes pending mutated-SNode chain-leaf fix (Genesis-Embodied-AI#633) * [AutoDiff] Adstack chain-clone safety: mutated-SNode leaf reject + load_top consumer-aware guard (Genesis-Embodied-AI#634) * [Docs] Add user-guide page for qd.simt.block.* primitives (Genesis-Embodied-AI#638) * [Docs] Expand qd.simt.subgroup user-guide page to cover every op (Genesis-Embodied-AI#639) * [Perf] Streams 1-4 (Genesis-Embodied-AI#410) * [Docs] Add user-guide page for matrix decompositions and solvers (Genesis-Embodied-AI#643) * [Bug] Revert "[Perf] Streams 1-4 (Genesis-Embodied-AI#410)" (Genesis-Embodied-AI#650) * [Docs] Add user-guide page for atomics and bit operations (Genesis-Embodied-AI#640) * [Docs] Add user-guide page for qd.simt.grid.* primitives (Genesis-Embodied-AI#641) * [AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch (Genesis-Embodied-AI#635) * [AMDGPU] Fix amdgpu parallel rand init (Genesis-Embodied-AI#658) * [Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap (Genesis-Embodied-AI#655) * [Perf] Re-land Streams 1-4 with bug fixes 
(Genesis-Embodied-AI#653) * [AMDGPU] Apply device_memory_GB=0.3 cap to AMDGPU tests (Genesis-Embodied-AI#659) * [Perf] Per-launch host sync: drop wait_idle on SPIR-V, pin stream and drop stream_synchronize on CUDA/AMDGPU (Genesis-Embodied-AI#654) * [AMDGPU] Unload hipModule_t in JITModuleAMDGPU destructor (Genesis-Embodied-AI#660) * [AMDGPU] Trim default mempool on qd.reset() (Genesis-Embodied-AI#669) * [AMDGPU] Hoist rand-state buffer to process lifetime (Genesis-Embodied-AI#668) * [Streams] Use events for streams serialization on AMDGPU and CUDA (Genesis-Embodied-AI#667) * [Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id (Genesis-Embodied-AI#671) * [SPIR-V] dispatch_max_reducers: register each task with the real kernel name (Genesis-Embodied-AI#675) * [AutoDiff] Debug-mode field/grad/dual: dtype, layout, and access-time invariants (Genesis-Embodied-AI#677) * [Docs] Add user-guide page for qd.algorithms.* device-wide algorithms (Genesis-Embodied-AI#642) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [Docs] Doc for existing atomics: switch support table to per-backend columns (Genesis-Embodied-AI#657) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [GPU] Cross gpu atomics (Genesis-Embodied-AI#666) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [GPU] Make block operations portable cross-gpu (Genesis-Embodied-AI#664) * [Perf] CPU LLVM adstack-cache: skip per-launch bump-writes + ndarray_shapes capture on forward-only handles (Genesis-Embodied-AI#685) * [GPU] Cross-GPU for grid ops (Genesis-Embodied-AI#670) * [Math] Make bitop operations portable cross-gpu (Genesis-Embodied-AI#662) * [AMDGPU] Always use wave64, on both RDNA and CDNA (Genesis-Embodied-AI#687) * [AMDGPU] Use syncscope("agent") for atomix xor to avoid CAS livelock (Genesis-Embodied-AI#672) * [GPU] New bit ops for QIPC (Genesis-Embodied-AI#679) * [GPU] Subgroup ops cross-gpu (Genesis-Embodied-AI#665) * [Graph] Rename CUDA 
Graph to Graph in docs (Genesis-Embodied-AI#691) * [SPIR-V] Fix FIFO-queue ordering when sharing command queue. (Genesis-Embodied-AI#694) * [Atomics] New QIPC ops for atomics (Genesis-Embodied-AI#690) * Pass dataclass sub-structs into qd.func (Genesis-Embodied-AI#698) * [AMDGPU] HIP graph runtime support for @qd.kernel(graph=True) (Genesis-Embodied-AI#692) * [CI] Add per-file timing report to Mac Metal test job (Genesis-Embodied-AI#695) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Enable kernel disk cache during tests (Genesis-Embodied-AI#696) * [Math] New QIPC ops for single-threaded linalg (Genesis-Embodied-AI#683) * [BREAKING][GPU] New QIPC ops for subgroups (Genesis-Embodied-AI#676) * [GPU] New QIPC ops for block (Genesis-Embodied-AI#684) * [GPU] New device-level ops for QIPC (Genesis-Embodied-AI#693) * [algorithms] PrefixSumExecutor: drop unused GRID_SZ local (Genesis-Embodied-AI#701) * [block] sync(): fix unsupported-arch error message (Genesis-Embodied-AI#700) * [volatile_load] add qd.volatile_load primitive (closes Genesis-Embodied-AI#648) (Genesis-Embodied-AI#702) * [AutoDiff] Reject recycled identity_key in AdStackCache::register_adstack_sizing_info (Genesis-Embodied-AI#708) * [Vulkan] Declare GroupNonUniform SPIR-V caps and enable shaderSubgroupExtendedTypes (Genesis-Embodied-AI#707) * Fix duplicate HIP graph driver-function declarations after v1.0.0 merge The amd-integration fork had cherry-picked the HIP graph driver functions (graph_create / graph_destroy / graph_add_kernel_node / graph_instantiate / graph_exec_destroy / graph_launch), and upstream v1.0.0 added the same set. The per-file 3-way merge appended both copies into amdgpu_driver_functions.inc.h, producing redeclaration errors that broke the AMDGPU RHI/runtime compile. Drop the upstream duplicate block; the signatures are identical to the fork's existing declarations. 
Co-authored-by: Cursor <cursoragent@cursor.com> * Fix AMDGPU launcher coherence and num_instructions visibility after v1.0.0 merge - kernel_launcher.cpp: the 3-way merge spliced upstream v1.0.0's launch_llvm_kernel rewrite (ephemeral arg/context buffers, explicit-stream path, AmdgpuDefaultStream PinGuard) onto the AMD fork's kernarg-by-value + persistent-scratch design, leaving references to undefined `ephemeral_context_ptr`. Restore the fork's coherent launch_llvm_kernel verbatim; it calls the (already merged) enhanced launch_offloaded_tasks, which keeps the max-reducer dispatch and stream-parallel groups adapted onto the AMD launch path. - llvm_context.h: both the fork and upstream added `num_instructions`; the merge kept upstream's private placement, but the AMDGPU codegen force-inline heuristic calls it statically from outside the class. Move it back to the public section. Co-authored-by: Cursor <cursoragent@cursor.com> * Restore async result D2H and hoist kernarg vectors in AMDGPU launcher The v1.0.0 merge resolution regressed two amd-integration baseline optimizations in launch_llvm_kernel / launch_offloaded_tasks: - The per-launch result-buffer copy was a blocking memcpy_device_to_host, forcing a host stall on every value-returning launch and serializing the GPU pipeline. Restore the async D2H (the caller synchronizes lazily when it needs the value); external-array transfers still stream_synchronize once before reading back. - launch_task constructed the kernarg std::vectors from initializer lists ({kernarg_payload} / {kernarg_size}) on every dispatch (heap alloc + free per launch). Hoist arg_ptrs/arg_sizes out of the per-task launch and reuse. 
  Co-authored-by: Cursor <cursoragent@cursor.com>
* amdgpu: default to LDS permlane64 emulation; drop host-x86 barrier asm on retarget

  Two AMDGPU JIT-compile crashes surfaced after the v1.0.0 merge pulled in the QIPC subgroup ops (Genesis-Embodied-AI#676), which made the rigid constraint solver's wave-cooperative reductions route through `amdgpu_cross_half_shuffle_i32`. Both manifested as a SIGSEGV inside `llvm::SIInstrInfo::getInstSizeInBytes` during `JITSessionAMDGPU::compile_module_to_hsaco` (i.e. at first kernel launch), and reproduce on gfx942 / MI300X. Baseline 0.4.6 never emitted these constructs, which is why it was unaffected.

  1. Native `llvm.amdgcn.permlane64` lowering crashes the bundled LLVM 22.1.0 AMDGPU backend. Default `amdgpu_permlane64` to the existing LDS-roundtrip software emulation on every target (it produces identical results). Add `QD_AMDGPU_USE_NATIVE_PERMLANE64=1` to opt back into the native instruction once the backend bug is fixed; the old `QD_AMDGPU_FORCE_PERMLANE64_FALLBACK` is now the default and still honored. This is the actual crash fix.
  2. The runtime module is compiled by the host x86_64 clang and only retargeted to amdgcn here, so `amdgpu_cross_half_shuffle_i32`'s `__asm__ volatile("" : "+v"(byte))` optimization barrier carries x86 flag clobbers (`~{dirflag},~{fpsr},~{flags}`) that are meaningless on AMDGPU. The IR verifies but the empty-body INLINEASM is invalid on the amdgcn target. Neutralize empty-body barrier asm during retarget (forward the tied value, then erase) so no stale host asm reaches codegen. On the wave64 targets we ship `ds_bpermute` already addresses the full wave, so the hint is a no-op.

  Co-authored-by: Cursor <cursoragent@cursor.com>
* style: apply clang-format (v19.1.7) to AMDGPU fn_attrs and launcher sources

  CI pre-commit's clang-format hook reformatted these files (long declarations/lambda signatures collapsed onto single lines per the repo's clang-format config). Apply the same formatting so the hook passes.
  No functional changes.

  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(amdgpu): use CreateNeg for branchless i32 sgn instead of CreateSub(0, input)

  clang-tidy (modernize-use-nullptr, -warnings-as-errors) flagged `builder->CreateSub(0, input)` in the i32 sgn path: the literal `0` binds to the `llvm::Value*` LHS parameter as a null pointer, not an integer zero. Replace with `builder->CreateNeg(input)`, which emits `0 - input` with a proper zero constant -- identical intended semantics, and clang-tidy clean.

  Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Robert Dazi <14996868+v01dXYZ@users.noreply.github.com>
Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>
Co-authored-by: Alexis DUBURCQ <alexis.duburcq@gmail.com>
Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>
Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>
Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com>
Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Johnny <johnnynuca14@gmail.com>
SPIR-V reverse-grad kernel-size reduction: clamp + OpSelect adstack push, release-mode bounds-check elision, shared count-array, plus diagnostic-only Metal compiler-rejection logging
TL;DR

(Numbers from a local `tests/test_grad.py::test_differentiable_push[gpu]` run with `QD_DUMP_MSL=1 QD_OFFLINE_CACHE=0`. Test wall-clock on the same run dropped 91.9s -> 38.8s, a 58% reduction that comes from less time spent in the MSL compiler.)

Why
Two motivations:
- Reduce SPIR-V codegen size waste. The pre-PR `AdStackPushStmt` codegen emits a structured `OpSelectionMerge` / `OpBranchConditional` region per push that spirv-cross renders as ~13 MSL lines per push. For reverse-grad kernels with ~1000 pushes, that's the dominant size amplifier. Per-stack `count_var` `OpVariable Function` slots also become independent `OpPhi` mega-clusters at every enclosing loop header (700+ phis per merge). Both can be compressed without correctness change. The release-mode bounds-check (clamp + atomic-signal) on SPIR-V is currently always live; LLVM has always elided it in release; aligning the gates lets release builds skip the per-push branch entirely.
- Make Metal pipeline-create failures self-describing. When Apple's MSL compiler service drops the XPC connection mid-compile, `newComputePipelineStateWithFunction:error:` returns `nil` with `error == nil`. The pre-PR path silently returned `nullptr`; the user only saw the generic `runtime.cpp:298 RhiResult=-1` line with no kernel name or byte size. The new path warn-logs the kernel name and cross-compiled MSL byte size, so any future investigator reading CI artifacts immediately sees which kernel hit the path.

Mechanism end-to-end
1. AdStackPushStmt: clamp + OpSelect instead of structured if-then-else
`quadrants/codegen/spirv/spirv_codegen.cpp::TaskCodegen::visit(AdStackPushStmt*)` previously emitted a structured `OpSelectionMerge` / `OpBranchConditional` region around every push, with the then-branch doing the in-bounds store and the else-branch publishing the overflow signal. The new emit folds the entire region into:

- `clamped_idx = GLSLstd450UMin(count, max_size - 1)`
- store into `primal[clamped_idx]` (and `adjoint[clamped_idx] = 0` for heap_float)
- `count++`
- `signal = OpSelect(count >= max_size, stack_id+1, 0)`
- `OpAtomicUMax(overflow_buffer[0], signal)`

The clamp keeps the OpAccessChain in-bounds; the atomic-max with 0 is a no-op when the stack didn't overflow, so the host-readable flag still ends up at `stack_id + 1` only when an actual overflow happened. spirv-cross emits this as straight-line MSL: ~5 lines per push instead of ~13.

2. Bounds-check gate switched to `check_out_of_bound`

The clamp + atomic-signal pair from above is now gated on `compile_config->check_out_of_bound || compile_config->debug` in SPIR-V codegen. Release builds elide the entire bounds check, mirroring LLVM's release-build push (which has always relied on `determine_ad_stack_size` producing a tight static bound and dropped the per-push runtime guard). LLVM's six adstack visitors switch their gate from `compile_config.debug` to `compile_config.check_out_of_bound` so the two backends key off the same flag.

- LLVM: `check_out_of_bound` -> `stack_init` / `stack_push` runtime calls
- SPIR-V: `check_out_of_bound || debug` -> clamp + `OpAtomicUMax` signal

`CompileConfig::fit()` already promotes `debug=True` to `check_out_of_bound=True`, so existing `qd.init(debug=True)` users see no behaviour change. Users who explicitly set `qd.init(check_out_of_bound=True, debug=False)` now also get the bounds check on LLVM, which they didn't before. The OR with `debug` in the SPIR-V gate preserves the `qd.init(debug=True)` path on Metal / Vulkan, where `Program::init` force-disables `check_out_of_bound` because those arches lack `Extension::assertion`.

3. Shared count-array OpVariable for adstack `count_var`

Each adstack `count_var` used to be its own `OpVariable Function` of type `uint`. spirv-opt's `LocalMultiStoreElim` / `SSARewrite` promoted each into its own SSA chain, which became a separate `OpPhi` at every enclosing loop header. spirv-cross then emitted each phi as one `uint _N;` forward-decl + one `_N = _N;` alias copy per predecessor branch. Reverse-grad kernels with hundreds of adstacks crossing a single loop accumulated phi mega-clusters of 700+ entries per loop header.

This PR replaces the per-stack scalar OpVariable with a single Function-scope `uint[num_ad_stacks_]` array, allocated lazily on first `ad_stack_count_ptr(stack_id)` call and indexed by `OpAccessChain` per push / pop / load-top. spirv-opt's mem2reg passes do not promote OpAccessChain into an aggregate, so the slots stay memory-backed and never become per-stack phis. The array is sized from a pre-pass scan that counts `AdStackAllocaStmt` nodes (`num_ad_stacks_`).

This is the single biggest lever: `kernel_update_cartesian_space_c289_0_reverse_grad` drops from 57,853 MSL lines to 7,617 (-87%) entirely from this change.

4. Metal pipeline / library failure: `QD_WARN` + `nullptr` return (not `QD_ERROR`)

`quadrants/rhi/metal/metal_device.mm::create_compute_pipeline` and `MetalDevice::get_mtl_library` previously took the `nil pipeline + nil NSError` path silently and returned `nullptr` (or, on the `err != nil` path, called `RHI_LOG_ERROR` and returned `nullptr`). The new path logs at WARN level with `QD_WARN`, including the kernel name (where available) and cross-compiled MSL byte size.

`QD_WARN` rather than `QD_ERROR` is critical: `QD_ERROR` ends with `throw s` (where `s` is a bare `std::string`), and `MetalDevice::create_pipeline` is declared `noexcept` and only catches `std::exception` derivatives. A throw of `std::string` here would cross the `noexcept` boundary and trip `std::terminate()`, replacing the existing clean Python `RuntimeError` translation with a fatal process abort. With `QD_WARN`, no exception is thrown inside the noexcept function; the `nullptr` return is converted by the caller to `RhiResult::error`, the `runtime.cpp:298` `QD_ERROR_IF` then throws `std::string`, the existing `pybind11` translator (`quadrants/python/py_exception_translator.cpp`) catches it and raises `PyExc_RuntimeError`. Verified empirically by force-injecting the failure path locally and observing the Python-level `RuntimeError` exception.

The wording deliberately does not assert a specific cause (size, construct, driver bug, ...). The XPC connection drop is observable from this side; the actual reason in the toolchain is not.
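To make step 1 of the mechanism concrete, here is a minimal Python model of the branch-free push. It is a sketch of the semantics only, not the codegen: `push`, the dict-based stack, and the one-element `overflow_flag` list are hypothetical stand-ins, `min` and `max` play the roles of `GLSLstd450UMin` and `OpAtomicUMax`, and the overflow comparison is assumed to use the count loaded before the increment.

```python
def push(stack, value, max_size, stack_id, overflow_flag):
    """Model one bounds-checked adstack push (illustrative names, not project API)."""
    count = stack["count"]                      # count as loaded before the increment (assumption)
    clamped_idx = min(count, max_size - 1)      # UMin clamp: the store index is always in-bounds
    stack["primal"][clamped_idx] = value        # unconditional store, no branch
    stack["count"] = count + 1                  # count++ is unconditional
    signal = stack_id + 1 if count >= max_size else 0   # OpSelect
    overflow_flag[0] = max(overflow_flag[0], signal)    # atomic max; max with 0 is a no-op


stack = {"primal": [0.0, 0.0], "count": 0}
flag = [0]                                      # stands in for overflow_buffer[0]

push(stack, 1.0, max_size=2, stack_id=3, overflow_flag=flag)
push(stack, 2.0, max_size=2, stack_id=3, overflow_flag=flag)
flag_after_in_bounds = flag[0]                  # still 0: both pushes fit

push(stack, 3.0, max_size=2, stack_id=3, overflow_flag=flag)  # third push overflows
```

After the third push the store has landed on the clamped last slot, the count has still advanced, and the flag holds `stack_id + 1`, which is what the host-side check reads on the next sync.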
Per-backend coverage matrix
[Table: adstack bounds-check gate per backend. The LLVM rows gate on `check_out_of_bound`; the SPIR-V rows gate on `check_out_of_bound || debug`.]

Tests
`tests/python/test_adstack.py::test_adstack_overflow_raises`

Existing test, kept on `debug=True`. Verifies that an adstack push past the published `max_size` raises `RuntimeError("[Aa]dstack overflow")` on the next `qd.sync()`. `debug=True` implies `check_out_of_bound=True` via `CompileConfig::fit`, so the bounds-check codepath is live.

`tests/python/test_adstack.py::test_adstack_overflow_raises_check_oob_explicit` (new)

Same overflow scenario as the test above but with `check_out_of_bound=True` set explicitly without `debug=True`. Pins the gating to `check_out_of_bound` rather than `debug`: a release-build user who explicitly opts into bounds-checks gets the same `RuntimeError` as a debug-mode user. Excluded on Metal / Vulkan because `Program::init` force-disables `check_out_of_bound` on arches without `Extension::assertion`, so the explicit-flag spelling alone cannot light up the bounds check there.

`tests/python/test_adstack.py::test_adstack_overflow_flag_resets_after_catch`

Existing test, unchanged. Pins that `check_adstack_overflow()` clears the flag after raising so a subsequent `qd.sync()` returns normally.

Local AD test status with `QD_OFFLINE_CACHE=0`

1214 tests pass on this branch (`tests/python/test_adstack.py` plus the broader `test_ad_*.py` files) across `arch=arm64`, `arch=metal-2`, `arch=vulkan-0`. Same numbers on pristine `origin/main`.

Side-effect audit
- `compile_config.h::check_out_of_bound` is already part of the config feeding the cache key; no schema change.
- `task_attribs.ad_stack` serialisation: `per_thread_stride_*_compile_time` and `allocas[]` populated identically.
- `info.count_var` users: `load_variable` / `store_variable`, which accept `kVariablePtr` whether the underlying is a fresh `OpVariable` or an `OpAccessChain` element.
- `OpAtomicUMax(buffer, 0)` is a no-op for the host-visible value, so the runtime still observes a non-zero flag iff some thread actually overflowed.
- `count++` is now unconditional, so push and pop are balanced even when the in-bounds check would have skipped the increment. `LoadTop*` / `AccAdjoint` already clamp via UMin so an overflowed count of UINT_MAX still produces a clamped in-bounds index.
- `compile_config.debug` integer-overflow checks (BinaryOpStmt / shift sites): still keyed on `debug` (`codegen_llvm.cpp` lines 503/514/525/560); only the six adstack visitors switched to `check_out_of_bound`.
- `metal_device.mm` raster-fallback site (`build_mtl_render_pipeline`): `RHI_LOG_ERROR` + silent-on-`err==nil` behaviour. Out of scope for this PR; flagged here so the audit table matches the diff.
- `nil err` path: `QD_WARN` + `nullptr` return; force-injecting the path produces a Python `RuntimeError` (not `std::terminate`).
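The flag interplay described in section 2 and pinned by the tests can be summarised as a small truth-table model. This is a sketch under stated assumptions: `Config`, `resolve`, `llvm_adstack_check` and `spirv_adstack_check` are hypothetical names, and the ordering (the `CompileConfig::fit()` promotion first, then the `Program::init` force-disable on arches without `Extension::assertion`) is inferred from the description rather than read from the code.

```python
from dataclasses import dataclass


@dataclass
class Config:
    debug: bool = False
    check_out_of_bound: bool = False


def resolve(cfg, arch_has_assertion):
    """Apply the two adjustments the description mentions (assumed order)."""
    # fit(): debug=True promotes check_out_of_bound to True.
    check_oob = cfg.check_out_of_bound or cfg.debug
    # Program::init: arches lacking Extension::assertion force the flag off.
    if not arch_has_assertion:
        check_oob = False
    return Config(debug=cfg.debug, check_out_of_bound=check_oob)


def llvm_adstack_check(cfg):
    # LLVM adstack visitors key off check_out_of_bound only.
    return cfg.check_out_of_bound


def spirv_adstack_check(cfg):
    # SPIR-V keeps the OR with debug so debug=True still works on Metal / Vulkan.
    return cfg.check_out_of_bound or cfg.debug


# debug=True on a SPIR-V arch without assertion support: check stays live via the OR.
metal_debug = resolve(Config(debug=True), arch_has_assertion=False)
# Explicit check_out_of_bound=True alone on the same arch: forced off, check elided.
metal_explicit = resolve(Config(check_out_of_bound=True), arch_has_assertion=False)
# Explicit check_out_of_bound=True on an LLVM arch: the newly enabled case.
llvm_explicit = resolve(Config(check_out_of_bound=True), arch_has_assertion=True)
# Plain release build: no per-push check on either backend.
release = resolve(Config(), arch_has_assertion=True)
```

Under these assumptions the model reproduces the reason `test_adstack_overflow_raises_check_oob_explicit` is excluded on Metal / Vulkan: the explicit flag alone resolves to a disabled gate there, while `debug=True` does not.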