[SPIR-V] Shrink reverse-grad kernel MSL by ~50% by duburcqa · Pull Request #591 · Genesis-Embodied-AI/quadrants

duburcqa · 2026-04-29T16:58:24Z

SPIR-V reverse-grad kernel-size reduction: clamp + OpSelect adstack push, release-mode bounds-check elision, shared count-array, plus diagnostic-only Metal compiler-rejection logging

Three commits. The first commit halves the cross-compiled MSL size of every reverse-grad kernel via three additive SPIR-V codegen changes (faster local Metal/Vulkan builds, lower host-side memory pressure during compile). The second commit replaces the silent nil pipeline + nil NSError log path in metal_device.mm with a non-throwing QD_WARN that includes the kernel name and cross-compiled MSL byte size, so when Apple's compiler service drops the connection (a known mode on some macOS / Metal-toolchain combinations) the next investigator immediately knows which kernel and how big it was, instead of bisecting from a generic RhiResult=-1 line. The third commit drops two test cross-references from new test docstrings per the source-comment style rule.

The PR is NOT a fix for the macos-15 M1 GitHub-runner failure on its own. The same g2p kernel that fails on macos-15 (cross-compiled MSL: 689,472 bytes after this PR's reduction) compiles cleanly on macOS 26 / M4 with xcrun -sdk macosx metal -std=macos-metal2.3 -c (218 KB AIR, exit 0). So the macos-15 failure is specific to the macos-15 Metal toolchain, not an MSL-size or MSL-content issue we can resolve from this side. This PR is the right shape regardless: the size cut is a real win, and the metal_device.mm warning is the right diagnostic surface for any future host that drops the XPC connection silently.

TL;DR

+----------------------------------------+---------+---------+-------------------------+
|      Kernel (sizes in MSL lines)       | Before  |  After  |        Reduction        |
+----------------------------------------+---------+---------+-------------------------+
| g2p_c511 reverse-grad                  |  23,226 |  11,639 | -49.9%                  |
| p2g_c509 reverse-grad                  |  37,860 |  16,318 | -56.9%                  |
| mpm_grid_op_c65 reverse-grad           |  49,206 |  25,661 | -47.9%                  |
| kernel_forward_velocity_c273           |  14,874 |  11,172 | -24.9%                  |
| kernel_update_cartesian_space_c289     |  57,853 |   7,617 | -86.8%                  |
+----------------------------------------+---------+---------+-------------------------+
| Total MSL across the test              | 282,603 | 143,203 | -49.3% (-139,400 lines) |
+----------------------------------------+---------+---------+-------------------------+

(Numbers from a local tests/test_grad.py::test_differentiable_push[gpu] run with QD_DUMP_MSL=1 QD_OFFLINE_CACHE=0. Test wall-clock on the same run dropped from 91.9s to 38.8s, a 58% reduction (about 2.4x faster) that comes from less time spent in the MSL compiler.)
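
The headline figures can be re-derived from the totals quoted above. A minimal Python sketch (the helper name `reduction_pct` is illustrative and not part of the PR):

```python
def reduction_pct(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Total MSL lines across the test, before and after the change.
total_before, total_after = 282_603, 143_203
total_cut = reduction_pct(total_before, total_after)

# Wall-clock seconds for the same test run.
wall_before, wall_after = 91.9, 38.8
wall_cut = reduction_pct(wall_before, wall_after)
speedup = wall_before / wall_after

print(f"lines removed: {total_before - total_after}")
print(f"MSL reduction: {total_cut:.1f}%")
print(f"wall-clock reduction: {wall_cut:.0f}% ({speedup:.1f}x faster)")
```

Note that a 58% drop in wall-clock time corresponds to a speedup factor of roughly 2.4x, not 1.58x.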

Why

Two motivations:

1. Reduce SPIR-V codegen size waste. The pre-PR AdStackPushStmt codegen emits a structured OpSelectionMerge / OpBranchConditional region per push that spirv-cross renders as ~13 MSL lines per push. For reverse-grad kernels with ~1000 pushes, that's the dominant size amplifier. Per-stack count_var OpVariable Function slots also become independent OpPhi mega-clusters at every enclosing loop header (700+ phis per merge). Both can be compressed without correctness change. The release-mode bounds-check (clamp + atomic-signal) on SPIR-V is currently always live; LLVM has always elided it in release; aligning the gates lets release builds skip the per-push branch entirely.
2. Make Metal pipeline-create failures self-describing. When Apple's MSL compiler service drops the XPC connection mid-compile, newComputePipelineStateWithFunction:error: returns nil with error == nil. The pre-PR path silently returned nullptr; the user only saw the generic runtime.cpp:298 RhiResult=-1 line with no kernel name or byte size. The new path warn-logs the kernel name and cross-compiled MSL byte size, so any future investigator reading CI artifacts immediately sees which kernel hit the path.

Mechanism end-to-end

1. AdStackPushStmt: clamp + OpSelect instead of structured if-then-else

quadrants/codegen/spirv/spirv_codegen.cpp::TaskCodegen::visit(AdStackPushStmt*) previously emitted a structured OpSelectionMerge / OpBranchConditional region around every push, with the then-branch doing the in-bounds store and the else-branch publishing the overflow signal. The new emit folds the entire region into:

clamped_idx = GLSLstd450UMin(count, max_size - 1)
unconditional store to primal[clamped_idx] (and adjoint[clamped_idx] = 0 for heap_float)
unconditional count++
signal = OpSelect(count >= max_size, stack_id+1, 0)
unconditional OpAtomicUMax(overflow_buffer[0], signal)

The clamp keeps the OpAccessChain in-bounds; the atomic-max with 0 is a no-op when the stack didn't overflow, so the host-readable flag still ends up at stack_id + 1 only when an actual overflow happened. spirv-cross emits this as straight-line MSL: ~5 lines per push instead of ~13.
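
To see why the straight-line form reports the same overflow flag, here is a toy Python model of the two push shapes. It is not the SPIR-V codegen itself: the dict-based stack, the function names, and the one-element `overflow` list standing in for `overflow_buffer[0]` are illustrative assumptions. It also assumes the `count >= max_size` comparison uses the count loaded before the increment.

```python
def make_stack(stack_id, max_size):
    return {"id": stack_id, "max_size": max_size, "count": 0,
            "primal": [0.0] * max_size}


def push_structured(stack, value, overflow):
    """Pre-PR shape: an if/else region around every push."""
    if stack["count"] < stack["max_size"]:
        stack["primal"][stack["count"]] = value
        stack["count"] += 1
    else:
        overflow[0] = max(overflow[0], stack["id"] + 1)


def push_clamped(stack, value, overflow):
    """New shape: clamp, unconditional store, select, unconditional max."""
    count = stack["count"]
    clamped_idx = min(count, stack["max_size"] - 1)   # models UMin
    stack["primal"][clamped_idx] = value               # always in bounds
    stack["count"] = count + 1                         # unconditional count++
    signal = stack["id"] + 1 if count >= stack["max_size"] else 0  # models OpSelect
    overflow[0] = max(overflow[0], signal)             # max with 0 is a no-op


def run(push, n_pushes, stack_id=4, max_size=3):
    stack, overflow = make_stack(stack_id, max_size), [0]
    for i in range(n_pushes):
        push(stack, float(i + 1), overflow)
    return stack, overflow[0]


for n in (3, 5):
    _, flag_a = run(push_structured, n)
    _, flag_b = run(push_clamped, n)
    print(n, flag_a, flag_b)
```

Both shapes leave the flag at `stack_id + 1` only after a real overflow. They differ in that the clamped form keeps incrementing `count` and overwrites the last slot on overflow, which is acceptable because the overflow is reported to the host anyway.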

2. Bounds-check gate switched to `check_out_of_bound`

The clamp + atomic-signal pair from above is now gated on compile_config->check_out_of_bound || compile_config->debug in SPIR-V codegen. Release builds elide the entire bounds check, mirroring LLVM's release-build push (which has always relied on determine_ad_stack_size producing a tight static bound and dropped the per-push runtime guard). LLVM's six adstack visitors switch their gate from compile_config.debug to compile_config.check_out_of_bound so the two backends key off the same flag.

| Backend | Bounds-check path | Release behaviour |
| --- | --- | --- |
| LLVM (CPU / CUDA / AMDGPU) | `check_out_of_bound` -> `stack_init` / `stack_push` runtime calls | inline ops, no overflow flag |
| SPIR-V (Metal / Vulkan) | `check_out_of_bound \|\| debug` -> clamp + OpAtomicUMax signal | unconditional store, no overflow flag |

CompileConfig::fit() already promotes debug=True to check_out_of_bound=True, so existing qd.init(debug=True) users see no behaviour change. Users who explicitly set qd.init(check_out_of_bound=True, debug=False) now also get the bounds check on LLVM, which they didn't before. The OR with debug in the SPIR-V gate preserves the qd.init(debug=True) path on Metal / Vulkan, where Program::init force-disables check_out_of_bound because those arches lack Extension::assertion.
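
The flag interaction described in this section can be summarised as a small truth-table function. This is a sketch of the description only, not Quadrants code: the function name, the `arch` strings, and the ordering (promotion in `CompileConfig::fit()` first, then the force-disable in `Program::init`) are assumptions made for illustration.

```python
SPIRV_ARCHES = {"metal", "vulkan"}  # arches described as lacking Extension::assertion


def adstack_check_enabled(arch, debug=False, check_out_of_bound=False):
    """Return True if the adstack overflow check would be emitted."""
    # CompileConfig::fit(): debug=True promotes check_out_of_bound to True.
    if debug:
        check_out_of_bound = True
    if arch in SPIRV_ARCHES:
        # Program::init force-disables check_out_of_bound on these arches,
        # so the SPIR-V gate also ORs in `debug`.
        check_out_of_bound = False
        return check_out_of_bound or debug
    # LLVM backends gate on check_out_of_bound alone.
    return check_out_of_bound


for arch in ("x64", "cuda", "metal", "vulkan"):
    for debug, coob in ((False, False), (False, True), (True, False)):
        print(arch, debug, coob, adstack_check_enabled(arch, debug, coob))
```

The only combination that differs between backend families is `check_out_of_bound=True` without `debug=True`: it enables the check on LLVM backends but not on Metal or Vulkan. Later commits on this branch moved the adstack check back to `debug` alone, so this models the description as originally written.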

3. Shared count-array OpVariable for adstack `count_var`

Each adstack count_var used to be its own OpVariable Function of type uint. spirv-opt's LocalMultiStoreElim / SSARewrite promoted each into its own SSA chain, which became a separate OpPhi at every enclosing loop header. spirv-cross then emitted each phi as one uint _N; forward-decl + one _N = _N; alias copy per predecessor branch. Reverse-grad kernels with hundreds of adstacks crossing a single loop accumulated phi mega-clusters of 700+ entries per loop header.

This PR replaces the per-stack scalar OpVariable with a single Function-scope uint[num_ad_stacks_] array, allocated lazily on first ad_stack_count_ptr(stack_id) call and indexed by OpAccessChain per push / pop / load-top. spirv-opt's mem2reg passes do not promote OpAccessChain into an aggregate, so the slots stay memory-backed and never become per-stack phis. The array is sized from a pre-pass scan that counts AdStackAllocaStmt nodes (num_ad_stacks_).

This is the single biggest lever: kernel_update_cartesian_space_c289_0_reverse_grad drops from 57,853 MSL lines to 7,617 (-87%) entirely from this change.
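
The bookkeeping side of this change (pre-pass count, lazy allocation, per-stack indexing) can be sketched in Python. This toy model says nothing about how spirv-opt treats the resulting variable. The class name, the string-tagged IR list, and the `(array, index)` tuple standing in for an `OpAccessChain` result are hypothetical; only `ad_stack_count_ptr` and `num_ad_stacks` echo names from the description.

```python
class CountArrayAllocator:
    """One shared count array for all adstacks, created on first use."""

    def __init__(self, ir_statements):
        # Pre-pass scan: size the array by counting adstack allocas.
        self.num_ad_stacks = sum(
            1 for stmt in ir_statements if stmt == "AdStackAllocaStmt"
        )
        self.array = None            # not declared until someone asks for a slot
        self.variables_declared = 0

    def ad_stack_count_ptr(self, stack_id):
        if not 0 <= stack_id < self.num_ad_stacks:
            raise IndexError(f"stack_id {stack_id} out of range")
        if self.array is None:
            # Lazy allocation: a single variable regardless of stack count.
            self.array = [0] * self.num_ad_stacks
            self.variables_declared += 1
        return self.array, stack_id  # stands in for an access chain


def increment_count(ptr):
    array, index = ptr
    array[index] += 1


ir = ["AdStackAllocaStmt", "RangeForStmt", "AdStackAllocaStmt", "AdStackAllocaStmt"]
alloc = CountArrayAllocator(ir)
for stack_id in (0, 2, 2):
    increment_count(alloc.ad_stack_count_ptr(stack_id))
print(alloc.variables_declared, alloc.array)
```

However many stacks the kernel has, only one variable is declared, which is the property the description relies on to avoid one SSA chain per stack.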

4. Metal pipeline / library failure: `QD_WARN` + `nullptr` return (not `QD_ERROR`)

quadrants/rhi/metal/metal_device.mm::create_compute_pipeline and MetalDevice::get_mtl_library previously took the nil pipeline + nil NSError path silently and returned nullptr (or, on the err != nil path, called RHI_LOG_ERROR and returned nullptr). The new path logs at WARN level with QD_WARN, including the kernel name (where available) and cross-compiled MSL byte size:

[W ...] [metal_device.mm:create_compute_pipeline@206] Apple's Metal compiler service
rejected the compute-pipeline build for kernel 'g2p_c511_0_reverse_grad_0_t00'
(cross-compiled MSL size: 689472 bytes) without returning a structured error. The XPC
service drops its connection in this shape; the underlying cause is host-toolchain-
specific and is not recoverable from this side.
[E ...] [runtime.cpp:CompiledQuadrantsKernel@298] Failed to create pipeline ... RhiResult=-1

QD_WARN rather than QD_ERROR is critical: QD_ERROR ends with throw s (where s is a bare std::string), and MetalDevice::create_pipeline is declared noexcept and only catches std::exception derivatives. A throw of std::string here would cross the noexcept boundary and trip std::terminate(), replacing the existing clean Python RuntimeError translation with a fatal process abort. With QD_WARN, no exception is thrown inside the noexcept function; the nullptr return is converted by the caller to RhiResult::error, the runtime.cpp:298 QD_ERROR_IF then throws std::string, the existing pybind11 translator (quadrants/python/py_exception_translator.cpp) catches it and raises PyExc_RuntimeError. Verified empirically by force-injecting the failure path locally and observing the Python-level RuntimeError exception.

The wording deliberately does not assert a specific cause (size, construct, driver bug, ...). The XPC connection drop is observable from this side; the actual reason in the toolchain is not.

Per-backend coverage matrix

| Backend | Adstack push shrink | Bounds-check gate | Count-array shared | Metal warn-log |
| --- | --- | --- | --- | --- |
| arm64 / x64 (LLVM CPU) | N/A (LLVM emits inline) | switched to `check_out_of_bound` | N/A | N/A |
| CUDA / AMDGPU (LLVM GPU) | N/A | switched to `check_out_of_bound` | N/A | N/A |
| Vulkan (SPIR-V) | clamp + OpSelect | `check_out_of_bound \|\| debug` | yes | N/A |
| Metal (SPIR-V) | clamp + OpSelect | `check_out_of_bound \|\| debug` | yes | yes |

Tests

`tests/python/test_adstack.py::test_adstack_overflow_raises`

Existing test, kept on debug=True. Verifies that an adstack push past the published max_size raises RuntimeError("[Aa]dstack overflow") on the next qd.sync(). debug=True implies check_out_of_bound=True via CompileConfig::fit, so the bounds-check codepath is live.

`tests/python/test_adstack.py::test_adstack_overflow_raises_check_oob_explicit` (new)

Same overflow scenario as the test above but with check_out_of_bound=True set explicitly without debug=True. Pins the gating to check_out_of_bound rather than debug: a release-build user who explicitly opts into bounds-checks gets the same RuntimeError as a debug-mode user. Excluded on Metal / Vulkan because Program::init force-disables check_out_of_bound on arches without Extension::assertion, so the explicit-flag spelling alone cannot light up the bounds check there.

`tests/python/test_adstack.py::test_adstack_overflow_flag_resets_after_catch`

Existing test, unchanged. Pins that check_adstack_overflow() clears the flag after raising so a subsequent qd.sync() returns normally.

Local AD test status with `QD_OFFLINE_CACHE=0`

1214 tests pass on this branch (tests/python/test_adstack.py plus the broader test_ad_*.py files) across arch=arm64, arch=metal-2, arch=vulkan-0. Same numbers on pristine origin/main.

Side-effect audit

| Concern | Where checked | Verdict |
| --- | --- | --- |
| Offline cache key | `compile_config.h::check_out_of_bound` is already part of the config feeding the cache key; no schema change | OK |
| `task_attribs.ad_stack` serialisation | Layout unchanged: `per_thread_stride_*_compile_time` and `allocas[]` populated identically | OK |
| `info.count_var` users | All loads / stores go through `load_variable` / `store_variable` which accept `kVariablePtr` whether the underlying is a fresh `OpVariable` or an `OpAccessChain` element | OK |
| Adstack overflow signal | `OpAtomicUMax(buffer, 0)` is a no-op for the host-visible value, so the runtime still observes a clear flag iff some thread actually overflowed | OK |
| Reverse-pass count semantics | `count++` is now unconditional, so push and pop are balanced even when the in-bounds check would have skipped the increment. `LoadTop*/AccAdjoint` already clamp via UMin so an overflowed count of UINT_MAX still produces a clamped in-bounds index | OK |
| LLVM `compile_config.debug` integer-overflow checks (BinaryOpStmt / shift sites) | Untouched - still gate on `debug` (`codegen_llvm.cpp` lines 503/514/525/560); only the six adstack visitors switched to `check_out_of_bound` | OK |
| Vulkan push-constant / descriptor binding layout | Unchanged | OK |
| `metal_device.mm` raster-fallback site (`build_mtl_render_pipeline`) | This PR does NOT touch the raster site - it still uses the pre-PR `RHI_LOG_ERROR` + silent-on-`err==nil` behaviour. Out of scope for this PR; flagged here so the audit table matches the diff | Untouched (intentional) |
| Process-abort regression on the new metal `nil err` path | Verified empirically: with `QD_WARN` + `nullptr` return, force-injecting the path produces a Python `RuntimeError` (not `std::terminate`) | OK |

duburcqa · 2026-04-29T17:53:16Z

@claude review

github-actions · 2026-04-29T18:40:30Z

Coverage Report (`78e94d3f1`)

File	Coverage	Missing
🟢 `tests/python/test_adstack.py`	100%

Diff coverage: 100% · Overall: 73% · 6 lines, 0 missing


hughperkins · 2026-04-29T20:56:19Z

 a = x[-1]     # AssertionError in debug mode
 ```

+The same flag also enables a deferred runtime check on the adstack used by reverse-mode autodiff: a push past the per-stack capacity (set via `qd.init(ad_stack_size=...)` or per-alloca by `determine_ad_stack_size`) raises `RuntimeError("[Aa]dstack overflow")` on the next `qd.sync()`. Without bounds-checking, an adstack overflow silently writes past the per-thread slab and produces a wrong gradient.


Let's add some kind of adstack section/subsection header, please.

Maybe #### adstack?

hughperkins · 2026-04-29T20:57:42Z


+The same flag also enables a deferred runtime check on the adstack used by reverse-mode autodiff: a push past the per-stack capacity (set via `qd.init(ad_stack_size=...)` or per-alloca by `determine_ad_stack_size`) raises `RuntimeError("[Aa]dstack overflow")` on the next `qd.sync()`. Without bounds-checking, an adstack overflow silently writes past the per-thread slab and produces a wrong gradient.
+
+`debug=True` is a superset of `check_out_of_bound=True`. Setting `qd.init(check_out_of_bound=True)` without `debug=True` enables the field bounds check and the adstack overflow check, but skips kernel `assert` evaluation, integer overflow detection on arithmetic, and the other checks listed below. Use this when you want bounds-safety in a release-build app without paying the full debug-mode cost.


This seems like a mixture of general debug stuff and adstack-specific stuff. Can we factor out the general stuff to go outside of the new adstack subsection, and keep just the adstack-specific stuff here, please?

You haven't introduced check_out_of_bound yet. It should be a separate section from 'debug', I feel. But... why introduce a separate flag? Why not just have a single debug flag, for simplicity?

Ok, I see you've started to provide the reasons, but I feel this could be structured more clearly, and I think it's confusing to have two flags, one of which is a subset of the other, so if we can avoid that, that might be cleaner. I guess full debug is super slow?

What happens if debug is true and check_out_of_bound is false?

hughperkins · 2026-04-29T20:58:29Z

+
+`debug=True` is a superset of `check_out_of_bound=True`. Setting `qd.init(check_out_of_bound=True)` without `debug=True` enables the field bounds check and the adstack overflow check, but skips kernel `assert` evaluation, integer overflow detection on arithmetic, and the other checks listed below. Use this when you want bounds-safety in a release-build app without paying the full debug-mode cost.
+
+On the Metal and Vulkan backends, `check_out_of_bound=True` is silently disabled at `qd.init` time because those backends lack the in-kernel assertion extension that the field bounds check relies on; passing it on its own gives you neither the field bounds check nor the adstack overflow check. Pass `debug=True` instead: that keeps the adstack overflow check live (it is gated independently and does not need the assertion extension), but the field bounds check still does not fire on these backends.


check_out_of_bound true seems like a general thing, so let's also move it outside of the adstack section, please.

duburcqa · 2026-04-29T21:03:11Z

========== 649 passed, 3 skipped, 2 xfailed in 1089.97s (0:18:09) ===========

duburcqa · 2026-04-29T21:04:26Z

env	batch_size	backend	gjk_collision	constraint_solver	runtime_fps_590	runtime_fps_591	runtime_fps_delta_pct
anymal_random	30000	cuda	-	-	9314890	9225767	-0.96
anymal_uniform	30000	cuda	-	-	12361607	12197108	-1.33
anymal_uniform_kinematic	0	cpu	-	-	2028	2024	-0.20
anymal_uniform_kinematic	30000	cuda	-	-	10266957	10462211	+1.90
anymal_zero	0	cpu	-	-	7073	7295	+3.14
anymal_zero	30000	cuda	-	-	19096316	18941109	-0.81
box_pyramid_3	4096	cuda	-	-	968342	975716	+0.76
box_pyramid_4	4096	cuda	-	-	395824	389584	-1.58
box_pyramid_5	4096	cuda	-	-	141715	139459	-1.59
box_pyramid_6	4096	cuda	False	-	59547	58861	-1.15
box_pyramid_6	4096	cuda	True	-	61918	59940	-3.19
dex_hand	4096	cuda	-	-	17081	17181	+0.59
duck_in_box_easy	30000	cuda	False	-	26469341	26766339	+1.12
duck_in_box_easy	30000	cuda	True	-	9681964	9660875	-0.22
duck_in_box_hard	0	cpu	-	-	5204	5154	-0.96
duck_in_box_hard	30000	cuda	False	-	10077402	10149398	+0.71
duck_in_box_hard	30000	cuda	True	-	3493847	3354442	-3.99
franka	30000	cuda	-	-	21764102	21686917	-0.35
franka_accessors	0	cpu	-	-	1182	1214	+2.71
franka_accessors	30000	cuda	-	-	15853954	15474380	-2.39
franka_free	30000	cuda	-	-	32626193	32190516	-1.34
franka_random	0	cpu	-	-	6301	6287	-0.22
franka_random	30000	cuda	-	CG	16793645	16658204	-0.81
franka_random	30000	cuda	-	Newton	16535074	16288336	-1.49
franka_random	30000	cuda	False	-	16786850	16318058	-2.79
franka_random	30000	cuda	True	-	11327811	11463262	+1.20
g1_fall	4096	cuda	-	Newton	920318	918658	-0.18
go2	4096	cuda	False	CG	3668476	3653937	-0.40
go2	4096	cuda	False	Newton	4386696	4383261	-0.08
go2	4096	cuda	True	-	3230236	3290806	+1.88
shadow_hand_cubes	0	cpu	-	-	41	41	+0.00
shadow_hand_cubes_sparse	0	cpu	-	-	66	66	+0.00

speed_test_591.txt

…pu] now that PR #591 codegen is in place via a398612; this is the test where the original Genesis CI failure was observed and where the local M4 measurement put mpm_grid_op_c65 at 85.6 MB peak phys_footprint - right at the 100 MB cap, so the matrix run on macos-15 / macos-26 will resolve whether PR #591 alone is enough

…adstack-overflow checks (init_options.md) and for kernel print() (debug.md), restate that the adstack overflow check fires on all backends with debug=True regardless of whether the backend supports the assertion mechanism, and warn that kernel print() forces a queue sync after every dispatch of the containing kernel - significant overhead even when the surrounding control flow makes the print unreachable; also relocate the slot-pointer comment block in spirv_codegen.cpp from above ad_stack_count_ptr to above ad_stack_slot_ptr where it actually belongs (per the bot review on PR #591)

github-actions · 2026-04-29T21:30:37Z

Coverage Report (`71cf4bdcb`)

File	Coverage	Missing
🟢 `tests/python/test_adstack.py`	100%

Diff coverage: 100% · Overall: 67% · 6 lines, 0 missing


hughperkins · 2026-04-29T21:42:48Z

+
+**Note.** Output from GPU kernels appears in order despite parallel execution because all kernels are queued in the same compute stream.
+
+**Important.** Avoid kernel `print()` calls in production code where you can. Quadrants synchronizes the compute queue after every dispatch of a kernel that contains a `print()` so the output appears as close as possible to the call site. The synchronization happens unconditionally on every launch of that kernel, even when the surrounding control flow leaves the `print()` unreached at runtime; the cost is the full per-launch sync overhead, not just the cost of the `print()` itself.


hughperkins · 2026-04-29T21:45:58Z

+| CPU | with `check_out_of_bound=True` or `debug=True` | with `check_out_of_bound=True` or `debug=True` |
+| CUDA | with `check_out_of_bound=True` or `debug=True` | with `check_out_of_bound=True` or `debug=True` |
+| AMDGPU | with `check_out_of_bound=True` or `debug=True` | with `check_out_of_bound=True` or `debug=True` |
+| Metal | never (no in-kernel assertion mechanism) | with `debug=True` only |


Wait, why the inconsistency for 'adstack overflow check' on Metal and Vulkan?

I think the behavior should be consistent across platforms, except for features not supported by a platform at all (so 'never' is ok for vulkan and metal for example (though not ideal of course)).

I'm ok with that, but this is pre-existing in this PR, here I'm just documenting the current state. I could fix it in this PR if you want.

Oh, I see, I assumed that these were changes in this PR.

Yeah, ok, let's not feature-flate this PR :) Thank you for the doc :) I think it explains clearly the current situation. 🙌

hughperkins · 2026-04-29T21:48:54Z

+| Metal | never (no in-kernel assertion mechanism) | with `debug=True` only |
+| Vulkan | never (no in-kernel assertion mechanism) | with `debug=True` only |
+
+The adstack overflow check is gated independently of the assertion mechanism, so `debug=True` activates it on every backend - including Metal and Vulkan, where the field bounds check stays unavailable. On Metal and Vulkan, `check_out_of_bound` is silently reset to `False` at `qd.init` time (a warning is logged); passing it on its own gives neither check on those backends.


I think we should leave it enabled on Metal, and narrow the warning to say that only adstack overflows will be checked, not out of bounds.

But actually, now I think about it, why should 'out of bound' track 'adstack overflow'?

I think these should be two different flags.

That makes sense. Do you want to use 'debug' for this or a new flag? You want to do the changes in this PR?

Since you are just documenting the existing behavior, let's not change this in this doc. Thank you :)

hughperkins · 2026-04-29T21:57:26Z

Ok, doc looks good to me. Whilst it looks like these changes just target reverse-grad autodiff, let's get Genesis unit test results and Genesis benchmark results please, just to be sure.

hughperkins · 2026-04-29T22:00:52Z

oh they're already there.

checklist:

user-facing doc filled in
genesis benchmarks neutral
genesis unit tests passing

=> ok to merge

…elease bounds-check elision, shared count-array

…on error with kernel name + MSL byte size

…strings (per source-comment style rule)

…path; drop shader-size-cause speculation

…EXC_RESOURCE mechanism observed on GitHub-hosted macos-15 Apple-M1 runners (XPC service hits a hard 100 MB working-set cap during AIR-to-GPU compile and is killed by the kernel) and add a Metal/Vulkan caveat to the new debug-mode paragraph clarifying that check_out_of_bound is silently disabled on those backends and only the adstack overflow check survives via debug=True

…cific 100 MB / EXC_RESOURCE framing in favor of a generic 'compiler service exceeds a per-process memory budget mid-compile' wording, since the cap is platform-specific and citing it inline overspecifies the failure

…lit the adstack overflow check into its own subsection of debug.md, move the check_out_of_bound flag interaction (table of debug/check_out_of_bound combinations + Metal/Vulkan caveat) into the dedicated check_out_of_bound entry of init_options.md so debug.md stays focused on the user-facing checks and the option-reference centralizes the flag-level details, and tighten both debug and check_out_of_bound entries to bullet/table form so the relevant facts are scannable instead of buried in prose

…ng section as a #### Adstack overflow subsection (per Hugh's #### suggestion: it's another bounds check, sharing the same check_out_of_bound flag), and add a back-cross-reference from init_options.md's Debugging section to debug.md so users landing on the option reference can find the runnable examples and develop/benchmark workflow


…to autodiff.md's bold-prefix style for consistency across the user_guide

…arking' to 'Avoid ... in production code' since the queue-sync overhead matters in any production path, not just during benchmarks

…rint sync warning - the print is in the kernel body and reachable in principle; it just may not be hit on a given launch

…de if possible' - print may be unavoidable in some debugging-in-production scenarios

… close as possible to the call site' - more precise about what the sync buys

… warning to avoid the doubled 'possible' against 'as close as possible'

…e debug=True implies check_out_of_bound=True relationship first, then the actionable recommendation that follows from it

…unnecessarily dropped from the kernel-print intro line

…note - the claim is unverified and Quadrants' per-kernel sync after dispatching a print-bearing kernel may already serialize in practice

…e all kernels share one compute stream) instead of dropping it, and switch the two 'Cost:' leads in init_options.md to '**Cost.**' bold-prefix style for consistency with the autodiff.md / debug.md Note / Important / Cost convention

…s check per Hugh's review on PR #591: gate the AdStack push/pop/load_top/load_top_adj sites on compile_config.debug instead of compile_config.check_out_of_bound on the LLVM side (matches the pre-#591 behavior verbatim) and on compile_config_->debug (no longer ORed with check_out_of_bound) on the SPIR-V side, so the two checks land on independent flags and PR #591 stops introducing the coupling Hugh flagged. Also relabel the AMDGPU print() table row from 'no (compile error)' to 'no (silently dropped)' since codegen_amdgpu.cpp visit(PrintStmt) overrides with a no-op (per bot review), fix the spirv_codegen.h cross-reference from the non-existent 'ensure_ad_stack_count_array_var' to the real 'ad_stack_count_ptr' helper (per bot review), and update the init_options.md per-backend table + flag-interaction bullets to reflect the new debug-only gating for the adstack overflow check

github-actions · 2026-04-29T22:28:29Z

Coverage Report (`d9a1443b5`)

File	Coverage	Missing
🟢 `tests/python/test_adstack.py`	100%

Diff coverage: 100% · Overall: 73% · 6 lines, 0 missing


github-actions · 2026-04-29T23:25:05Z

Coverage Report (`a05e06a17`)

File	Coverage	Missing
🔴 `tests/python/test_adstack.py`	33%	1024-1027

Diff coverage: 33% · Overall: 65% · 6 lines, 4 missing


…'implied check_out_of_bound' references in the two adjacent overflow-test docstrings, per bot review on PR #591. After the earlier decoupling commit on this branch (which moved the LLVM adstack-visitor gates back to compile_config.debug and the SPIR-V push gate to compile_config_->debug only), check_out_of_bound=True alone no longer activates the adstack-overflow check on any backend - the test pinning that coupling is invalid by construction. The remaining test_adstack_overflow_raises[debug=True] still covers the user-facing 'I need the deferred RuntimeError on overflow' path

github-actions · 2026-04-30T06:30:29Z

Coverage Report (`2713efbc8`)

File	Coverage	Missing

Diff coverage: 0% · Overall: 73% · 0 lines, 0 missing


BufferView: safe sub-range ndarray access for kernels (Genesis-Embodied-AI#585) Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [Doc] Add user-facing fastcache documentation (Genesis-Embodied-AI#597) Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> * [Misc] Upgrade to enable v1 dlpack so to_numpy(copy=False) writable (Genesis-Embodied-AI#598) Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local> * [AutoDiff] Cut reverse-mode adstack memory usage 10x on all backends (Genesis-Embodied-AI#599) * [Misc] Add CI check for feature file factorization (Genesis-Embodied-AI#606) * [Perf] Skip _recursive_set_args for all-Field frozen dataclass structs (Genesis-Embodied-AI#607) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] SNode-arm bound-expr capture rejects fold-attack gate indices (Genesis-Embodied-AI#610) * [Misc] Suppress field fastcache warning for qd.Tensor (Genesis-Embodied-AI#615) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] Adstack heap: clip reducer count by per-task loop trip count (compile-time and SizeExpr-evaluated) (Genesis-Embodied-AI#611) * [Misc] Forward copy= through qd.Tensor, add copy=None option (Genesis-Embodied-AI#616) Co-authored-by: Cursor <cursoragent@cursor.com> * [Doc] Update README (Genesis-Embodied-AI#617) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Fix coverage report showing def lines as uncovered (Genesis-Embodied-AI#623) Co-authored-by: Cursor <cursoragent@cursor.com> * [Perf] Generic launcher: persistent context, JIT-pointer reuse, Metal compute encoder, LLVM-GPU async memory ops (Part 1/2) (Genesis-Embodied-AI#619) * [CI] Encode Python-first testing policy in coverage-check prompt (Genesis-Embodied-AI#622) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Add PR Line change report (Genesis-Embodied-AI#624) Co-authored-by: Cursor <cursoragent@cursor.com> 
* [CI] Disable quadrants pytest plugin during quadrants internal coverage runs (Genesis-Embodied-AI#629) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] Adstack load+store eliminations: EliminateRecomputableAdStackPushes pass + leaf extensions (Genesis-Embodied-AI#621) * [CI] Simplify coverage PR comment to a single linked line (Genesis-Embodied-AI#630) * [CUDA] Add AGX Thor, SM_110 (Genesis-Embodied-AI#631) Co-authored-by: Johnny Nunez and Hugh Perkins * [CI] Lines changed report: collapse PR comment to a single linked totals line (Genesis-Embodied-AI#632) * [FEATURE] Support external Metal command queue via qd.init (Genesis-Embodied-AI#618) Co-authored-by: Cursor <cursoragent@cursor.com> * [Perf] Cache adstack-sizer metadata per task across SPIR-V + LLVM-GPU; per-snode / DeviceAllocation invalidation (Part 2/2) (Genesis-Embodied-AI#620) * [AutoDiff] Disable EliminateRecomputableAdStackPushes pending mutated-SNode chain-leaf fix (Genesis-Embodied-AI#633) * [AutoDiff] Adstack chain-clone safety: mutated-SNode leaf reject + load_top consumer-aware guard (Genesis-Embodied-AI#634) * [Docs] Add user-guide page for qd.simt.block.* primitives (Genesis-Embodied-AI#638) * [Docs] Expand qd.simt.subgroup user-guide page to cover every op (Genesis-Embodied-AI#639) * [Perf] Streams 1-4 (Genesis-Embodied-AI#410) * [Docs] Add user-guide page for matrix decompositions and solvers (Genesis-Embodied-AI#643) * [Bug] Revert "[Perf] Streams 1-4 (Genesis-Embodied-AI#410)" (Genesis-Embodied-AI#650) * [Docs] Add user-guide page for atomics and bit operations (Genesis-Embodied-AI#640) * [Docs] Add user-guide page for qd.simt.grid.* primitives (Genesis-Embodied-AI#641) * [AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch (Genesis-Embodied-AI#635) * [AMDGPU] Fix amdgpu parallel rand init (Genesis-Embodied-AI#658) * [Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap (Genesis-Embodied-AI#655) * [Perf] Re-land Streams 1-4 with bug fixes 
(Genesis-Embodied-AI#653) * [AMDGPU] Apply device_memory_GB=0.3 cap to AMDGPU tests (Genesis-Embodied-AI#659) * [Perf] Per-launch host sync: drop wait_idle on SPIR-V, pin stream and drop stream_synchronize on CUDA/AMDGPU (Genesis-Embodied-AI#654) * [AMDGPU] Unload hipModule_t in JITModuleAMDGPU destructor (Genesis-Embodied-AI#660) * [AMDGPU] Trim default mempool on qd.reset() (Genesis-Embodied-AI#669) * [AMDGPU] Hoist rand-state buffer to process lifetime (Genesis-Embodied-AI#668) * [Streams] Use events for streams serialization on AMDGPU and CUDA (Genesis-Embodied-AI#667) * [Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id (Genesis-Embodied-AI#671) * [SPIR-V] dispatch_max_reducers: register each task with the real kernel name (Genesis-Embodied-AI#675) * [AutoDiff] Debug-mode field/grad/dual: dtype, layout, and access-time invariants (Genesis-Embodied-AI#677) * [Docs] Add user-guide page for qd.algorithms.* device-wide algorithms (Genesis-Embodied-AI#642) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [Docs] Doc for existing atomics: switch support table to per-backend columns (Genesis-Embodied-AI#657) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [GPU] Cross gpu atomics (Genesis-Embodied-AI#666) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [GPU] Make block operations portable cross-gpu (Genesis-Embodied-AI#664) * [Perf] CPU LLVM adstack-cache: skip per-launch bump-writes + ndarray_shapes capture on forward-only handles (Genesis-Embodied-AI#685) * [GPU] Cross-GPU for grid ops (Genesis-Embodied-AI#670) * [Math] Make bitop operations portable cross-gpu (Genesis-Embodied-AI#662) * [AMDGPU] Always use wave64, on both RDNA and CDNA (Genesis-Embodied-AI#687) * [AMDGPU] Use syncscope("agent") for atomix xor to avoid CAS livelock (Genesis-Embodied-AI#672) * [GPU] New bit ops for QIPC (Genesis-Embodied-AI#679) * [GPU] Subgroup ops cross-gpu (Genesis-Embodied-AI#665) * [Graph] Rename CUDA 
Graph to Graph in docs (Genesis-Embodied-AI#691) * [SPIR-V] Fix FIFO-queue ordering when sharing command queue. (Genesis-Embodied-AI#694) * [Atomics] New QIPC ops for atomics (Genesis-Embodied-AI#690) * Pass dataclass sub-structs into qd.func (Genesis-Embodied-AI#698) * [AMDGPU] HIP graph runtime support for @qd.kernel(graph=True) (Genesis-Embodied-AI#692) * [CI] Add per-file timing report to Mac Metal test job (Genesis-Embodied-AI#695) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Enable kernel disk cache during tests (Genesis-Embodied-AI#696) * [Math] New QIPC ops for single-threaded linalg (Genesis-Embodied-AI#683) * [BREAKING][GPU] New QIPC ops for subgroups (Genesis-Embodied-AI#676) * [GPU] New QIPC ops for block (Genesis-Embodied-AI#684) * [GPU] New device-level ops for QIPC (Genesis-Embodied-AI#693) * [algorithms] PrefixSumExecutor: drop unused GRID_SZ local (Genesis-Embodied-AI#701) * [block] sync(): fix unsupported-arch error message (Genesis-Embodied-AI#700) * [volatile_load] add qd.volatile_load primitive (closes Genesis-Embodied-AI#648) (Genesis-Embodied-AI#702) * [AutoDiff] Reject recycled identity_key in AdStackCache::register_adstack_sizing_info (Genesis-Embodied-AI#708) * [Vulkan] Declare GroupNonUniform SPIR-V caps and enable shaderSubgroupExtendedTypes (Genesis-Embodied-AI#707) * Fix duplicate HIP graph driver-function declarations after v1.0.0 merge The amd-integration fork had cherry-picked the HIP graph driver functions (graph_create / graph_destroy / graph_add_kernel_node / graph_instantiate / graph_exec_destroy / graph_launch), and upstream v1.0.0 added the same set. The per-file 3-way merge appended both copies into amdgpu_driver_functions.inc.h, producing redeclaration errors that broke the AMDGPU RHI/runtime compile. Drop the upstream duplicate block; the signatures are identical to the fork's existing declarations. 
Co-authored-by: Cursor <cursoragent@cursor.com> * Fix AMDGPU launcher coherence and num_instructions visibility after v1.0.0 merge - kernel_launcher.cpp: the 3-way merge spliced upstream v1.0.0's launch_llvm_kernel rewrite (ephemeral arg/context buffers, explicit-stream path, AmdgpuDefaultStream PinGuard) onto the AMD fork's kernarg-by-value + persistent-scratch design, leaving references to undefined `ephemeral_context_ptr`. Restore the fork's coherent launch_llvm_kernel verbatim; it calls the (already merged) enhanced launch_offloaded_tasks, which keeps the max-reducer dispatch and stream-parallel groups adapted onto the AMD launch path. - llvm_context.h: both the fork and upstream added `num_instructions`; the merge kept upstream's private placement, but the AMDGPU codegen force-inline heuristic calls it statically from outside the class. Move it back to the public section. Co-authored-by: Cursor <cursoragent@cursor.com> * Restore async result D2H and hoist kernarg vectors in AMDGPU launcher The v1.0.0 merge resolution regressed two amd-integration baseline optimizations in launch_llvm_kernel / launch_offloaded_tasks: - The per-launch result-buffer copy was a blocking memcpy_device_to_host, forcing a host stall on every value-returning launch and serializing the GPU pipeline. Restore the async D2H (the caller synchronizes lazily when it needs the value); external-array transfers still stream_synchronize once before reading back. - launch_task constructed the kernarg std::vectors from initializer lists ({kernarg_payload} / {kernarg_size}) on every dispatch (heap alloc + free per launch). Hoist arg_ptrs/arg_sizes out of the per-task launch and reuse. 
Co-authored-by: Cursor <cursoragent@cursor.com> * amdgpu: default to LDS permlane64 emulation; drop host-x86 barrier asm on retarget Two AMDGPU JIT-compile crashes surfaced after the v1.0.0 merge pulled in the QIPC subgroup ops (Genesis-Embodied-AI#676), which made the rigid constraint solver's wave-cooperative reductions route through `amdgpu_cross_half_shuffle_i32`. Both manifested as a SIGSEGV inside `llvm::SIInstrInfo::getInstSizeInBytes` during `JITSessionAMDGPU::compile_module_to_hsaco` (i.e. at first kernel launch), and reproduce on gfx942 / MI300X. Baseline 0.4.6 never emitted these constructs, which is why it was unaffected. 1. Native `llvm.amdgcn.permlane64` lowering crashes the bundled LLVM 22.1.0 AMDGPU backend. Default `amdgpu_permlane64` to the existing LDS-roundtrip software emulation on every target (it produces identical results). Add `QD_AMDGPU_USE_NATIVE_PERMLANE64=1` to opt back into the native instruction once the backend bug is fixed; the old `QD_AMDGPU_FORCE_PERMLANE64_FALLBACK` is now the default and still honored. This is the actual crash fix. 2. The runtime module is compiled by the host x86_64 clang and only retargeted to amdgcn here, so `amdgpu_cross_half_shuffle_i32`'s `__asm__ volatile("" : "+v"(byte))` optimization barrier carries x86 flag clobbers (`~{dirflag},~{fpsr},~{flags}`) that are meaningless on AMDGPU. The IR verifies but the empty-body INLINEASM is invalid on the amdgcn target. Neutralize empty-body barrier asm during retarget (forward the tied value, then erase) so no stale host asm reaches codegen. On the wave64 targets we ship `ds_bpermute` already addresses the full wave, so the hint is a no-op. Co-authored-by: Cursor <cursoragent@cursor.com> * style: apply clang-format (v19.1.7) to AMDGPU fn_attrs and launcher sources CI pre-commit's clang-format hook reformatted these files (long declarations/lambda signatures collapsed onto single lines per the repo's clang-format config). Apply the same formatting so the hook passes. 
No functional changes. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(amdgpu): use CreateNeg for branchless i32 sgn instead of CreateSub(0, input) clang-tidy (modernize-use-nullptr, -warnings-as-errors) flagged `builder->CreateSub(0, input)` in the i32 sgn path: the literal `0` binds to the `llvm::Value*` LHS parameter as a null pointer, not an integer zero. Replace with `builder->CreateNeg(input)`, which emits `0 - input` with a proper zero constant -- identical intended semantics, and clang-tidy clean. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Robert Dazi <14996868+v01dXYZ@users.noreply.github.com> Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> Co-authored-by: Alexis DUBURCQ <alexis.duburcq@gmail.com> Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com> Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Johnny <johnnynuca14@gmail.com>

claude Bot reviewed Apr 29, 2026

Comment thread quadrants/rhi/metal/metal_device.mm Outdated

Comment thread quadrants/rhi/metal/metal_device.mm

duburcqa changed the title ~~[SPIR-V] Shrink reverse-grad kernel MSL by ~50% to fit Apple Metal compiler~~ [SPIR-V] Shrink reverse-grad kernel MSL by ~50% Apr 29, 2026

claude Bot reviewed Apr 29, 2026

Comment thread docs/source/user_guide/debug.md Outdated

claude Bot reviewed Apr 29, 2026

Comment thread quadrants/codegen/spirv/spirv_codegen.cpp Outdated

hughperkins reviewed Apr 29, 2026

claude Bot reviewed Apr 29, 2026

Comment thread docs/source/user_guide/debug.md

hughperkins added the ok-to-merge label Apr 29, 2026

claude Bot reviewed Apr 29, 2026

Comment thread quadrants/codegen/spirv/detail/spirv_codegen.h

duburcqa added 8 commits April 30, 2026 00:26

[SPIR-V] Shrink reverse-grad kernel MSL: clamp+select adstack push, release bounds-check elision, shared count-array

6e6d71c

[Metal] Surface MSL-compiler-service shader-size rejections as a Python error with kernel name + MSL byte size

cf7c5d7

[Tests] Drop test cross-references from new adstack-overflow test docstrings (per source-comment style rule)

62f8ba0

[Metal] Use QD_WARN (not QD_ERROR) on the nil-pipeline / nil-library path; drop shader-size-cause speculation

5fa93dd

duburcqa added 12 commits April 30, 2026 00:26

[Docs] Switch the Note / Important leads in debug.md's print section to autodiff.md's bold-prefix style for consistency across the user_guide

c028d57

[Docs] Reword the kernel-print warning from 'Remove ... before benchmarking' to 'Avoid ... in production code' since the queue-sync overhead matters in any production path, not just during benchmarks

0f9f3d5

[Docs] Reword 'unreachable' to 'unreached at runtime' in the kernel-print sync warning - the print is in the kernel body and reachable in principle; it just may not be hit on a given launch

8b28771

[Docs] Soften the kernel-print warning to 'avoid ... in production code if possible' - print may be unavoidable in some debugging-in-production scenarios

a20e807

[Docs] Tighten 'appears at the right place in the log' to 'appears as close as possible to the call site' - more precise about what the sync buys

75a5c79

[Docs] Replace 'if possible' with 'where you can' in the kernel-print warning to avoid the doubled 'possible' against 'as close as possible'

d11a28b

[Docs] Flip the two clauses of the debug.md cross-reference: state the debug=True implies check_out_of_bound=True relationship first, then the actionable recommendation that follows from it

f5ba424

[Docs] Restore the 'which can be useful for debugging' clause that I unnecessarily dropped from the kernel-print intro line

3044e5c

[Docs] Drop the upstream 'GPU kernel prints may appear out of order' note - the claim is unverified and Quadrants' per-kernel sync after dispatching a print-bearing kernel may already serialize in practice

2290a83

duburcqa force-pushed the duburcqa/spirv_reverse_grad_kernel_size branch from d9a1443 to a05e06a Compare April 29, 2026 22:27

claude Bot reviewed Apr 29, 2026

Comment thread tests/python/test_adstack.py Outdated

duburcqa merged commit da5e8e0 into main Apr 30, 2026
53 checks passed

duburcqa deleted the duburcqa/spirv_reverse_grad_kernel_size branch April 30, 2026 06:31


		The same flag also enables a deferred runtime check on the adstack used by reverse-mode autodiff: a push past the per-stack capacity (set via `qd.init(ad_stack_size=...)` or per-alloca by `determine_ad_stack_size`) raises `RuntimeError("[Aa]dstack overflow")` on the next `qd.sync()`. Without bounds-checking, an adstack overflow silently writes past the per-thread slab and produces a wrong gradient.
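The deferred adstack check described above can be pictured with a small host-side toy model. This is only a sketch under stated assumptions: `ToyAdStack`, its fields, and the exact clamp/select behavior are hypothetical illustrations of a branch-free push plus a sticky overflow flag, and do not reproduce the actual SPIR-V or LLVM codegen.

```python
class ToyAdStack:
    """Hypothetical host-side model of a fixed-capacity adstack
    with a deferred (sync-time) overflow check."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.slots = [0.0] * capacity
        self.count = 0
        self.overflowed = False  # sticky flag, only inspected in sync()

    def push(self, value: float) -> None:
        in_bounds = self.count < self.capacity
        # Clamp the write index so the store can never leave the slab.
        idx = min(self.count, self.capacity - 1)
        # Select-style update: an overflowing push leaves the slot unchanged.
        self.slots[idx] = value if in_bounds else self.slots[idx]
        self.count = self.count + 1 if in_bounds else self.count
        self.overflowed = self.overflowed or not in_bounds

    def sync(self) -> None:
        # The error surfaces here, not at the push that caused it.
        if self.overflowed:
            self.overflowed = False  # assumed: the flag resets once reported
            raise RuntimeError("Adstack overflow")


stack = ToyAdStack(capacity=2)
for v in (1.0, 2.0, 3.0):  # the third push exceeds the capacity
    stack.push(v)
try:
    stack.sync()
except RuntimeError as exc:
    print(f"caught: {exc}")
stack.sync()  # does not raise again in this toy model
```

In this model the overflowing push is dropped rather than written past the slab, which is the difference from the unchecked case the paragraph above warns about.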

		`debug=True` is a superset of `check_out_of_bound=True`. Setting `qd.init(check_out_of_bound=True)` without `debug=True` enables the field bounds check and the adstack overflow check, but skips kernel `assert` evaluation, integer overflow detection on arithmetic, and the other checks listed below. Use this when you want bounds-safety in a release-build app without paying the full debug-mode cost.


		On the Metal and Vulkan backends, `check_out_of_bound=True` is silently disabled at `qd.init` time because those backends lack the in-kernel assertion extension that the field bounds check relies on; passing it on its own gives you neither the field bounds check nor the adstack overflow check. Pass `debug=True` instead: that keeps the adstack overflow check live (it is gated independently and does not need the assertion extension), but the field bounds check still does not fire on these backends.
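The flag interactions described in these excerpts can be summarized as a small decision table. The sketch below is an illustration only: `active_checks`, `ActiveChecks`, and the backend strings are hypothetical names, not part of the Quadrants API, and the function models just the two checks the text discusses.

```python
from dataclasses import dataclass

# Backends that, per the text above, drop check_out_of_bound at init time.
BACKENDS_WITHOUT_ASSERT_EXT = {"metal", "vulkan"}


@dataclass(frozen=True)
class ActiveChecks:
    field_bounds: bool
    adstack_overflow: bool


def active_checks(backend: str, debug: bool = False,
                  check_out_of_bound: bool = False) -> ActiveChecks:
    """Hypothetical helper: which runtime checks end up live."""
    if backend in BACKENDS_WITHOUT_ASSERT_EXT:
        # check_out_of_bound is silently disabled; only debug=True keeps
        # the adstack overflow check, and the field bounds check never fires.
        return ActiveChecks(field_bounds=False, adstack_overflow=debug)
    # Elsewhere debug=True is a superset of check_out_of_bound=True.
    enabled = debug or check_out_of_bound
    return ActiveChecks(field_bounds=enabled, adstack_overflow=enabled)


print(active_checks("cpu", check_out_of_bound=True))
print(active_checks("metal", check_out_of_bound=True))
print(active_checks("metal", debug=True))
```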


		Note. Output from GPU kernels appears in order despite parallel execution because all kernels are queued in the same compute stream.

		Important. Avoid kernel `print()` calls in production code where you can. Quadrants synchronizes the compute queue after every dispatch of a kernel that contains a `print()` so the output appears as close as possible to the call site. The synchronization happens unconditionally on every launch of that kernel, even when the surrounding control flow leaves the `print()` unreached at runtime; the cost is the full per-launch sync overhead, not just the cost of the `print()` itself.
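To make the cost model in the warning above concrete, here is a toy sketch in which the sync decision depends on whether the kernel source contains a `print()`, not on whether the `print()` runs. `ToyKernel` and `ToyQueue` are hypothetical stand-ins and do not mirror the real launcher.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyKernel:
    body: Callable[[bool], None]
    contains_print: bool  # static property of the kernel source


class ToyQueue:
    def __init__(self) -> None:
        self.sync_count = 0

    def launch(self, kernel: ToyKernel, flag: bool) -> None:
        kernel.body(flag)
        # Sync is keyed on the static flag, so it happens on every launch.
        if kernel.contains_print:
            self.sync_count += 1


def maybe_print(flag: bool) -> None:
    if flag:
        print("debug value")


queue = ToyQueue()
kernel = ToyKernel(body=maybe_print, contains_print=True)
for _ in range(3):
    queue.launch(kernel, False)  # the print() is never reached
```

After the three launches `queue.sync_count` is 3 even though nothing was printed, which is the per-launch overhead the warning refers to.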

duburcqa commented Apr 29, 2026 • edited

SPIR-V reverse-grad kernel-size reduction: clamp + OpSelect adstack push, release-mode bounds-check elision, shared count-array, plus diagnostic-only Metal compiler-rejection logging

TL;DR

Why

Mechanism end-to-end

1. AdStackPushStmt: clamp + OpSelect instead of structured if-then-else

2. Bounds-check gate switched to check_out_of_bound

3. Shared count-array OpVariable for adstack count_var

4. Metal pipeline / library failure: QD_WARN + nullptr return (not QD_ERROR)

Per-backend coverage matrix

Tests

tests/python/test_adstack.py::test_adstack_overflow_raises

tests/python/test_adstack.py::test_adstack_overflow_raises_check_oob_explicit (new)

tests/python/test_adstack.py::test_adstack_overflow_flag_resets_after_catch

Local AD test status with QD_OFFLINE_CACHE=0

Side-effect audit

duburcqa commented Apr 29, 2026

github-actions Bot commented Apr 29, 2026

Coverage Report (78e94d3f1)

duburcqa commented Apr 29, 2026

duburcqa commented Apr 29, 2026

github-actions Bot commented Apr 29, 2026

Coverage Report (71cf4bdcb)

hughperkins commented Apr 29, 2026

hughperkins commented Apr 29, 2026

github-actions Bot commented Apr 29, 2026

Coverage Report (d9a1443b5)

github-actions Bot commented Apr 29, 2026

Coverage Report (a05e06a17)

github-actions Bot commented Apr 30, 2026

Coverage Report (2713efbc8)
