
[SPIR-V] Shrink reverse-grad kernel MSL by ~50% #591

Merged: duburcqa merged 21 commits into main from duburcqa/spirv_reverse_grad_kernel_size on Apr 30, 2026

@duburcqa duburcqa commented Apr 29, 2026

Contributor

SPIR-V reverse-grad kernel-size reduction: clamp + OpSelect adstack push, release-mode bounds-check elision, shared count-array, plus diagnostic-only Metal compiler-rejection logging

Three commits. The first commit halves the cross-compiled MSL size of every reverse-grad kernel via three additive SPIR-V codegen changes (faster local Metal/Vulkan builds, lower host-side memory pressure during compile). The second commit replaces the silent nil pipeline + nil NSError log path in metal_device.mm with a non-throwing QD_WARN that includes the kernel name and cross-compiled MSL byte size, so when Apple's compiler service drops the connection (a known failure mode on some macOS / Metal-toolchain combinations) the next investigator immediately knows which kernel and how big it was, instead of bisecting from a generic RhiResult=-1 line. The third commit drops two test cross-references from new test docstrings per the source-comment style rule.

The PR is NOT a fix for the macos-15 M1 GitHub-runner failure on its own. The same g2p kernel that fails on macos-15 (cross-compiled MSL: 689,472 bytes after this PR's reduction) compiles cleanly on macOS 26 / M4 with xcrun -sdk macosx metal -std=macos-metal2.3 -c (218 KB AIR, exit 0). So the macos-15 failure is specific to the macos-15 Metal toolchain, not an MSL-size or MSL-content issue we can resolve from this side. This PR is the right shape regardless: the size cut is a real win, and the metal_device.mm warning is the right diagnostic surface for any future host that drops the XPC connection silently.

TL;DR

+----------------------------------------+---------+---------+----------------------+
|                Kernel                  | Before  |  After  |       Reduction      |
+----------------------------------------+---------+---------+----------------------+
| g2p_c511 reverse-grad                  |  23,226 |  11,639 | -49.9%               |
| p2g_c509 reverse-grad                  |  37,860 |  16,318 | -56.9%               |
| mpm_grid_op_c65 reverse-grad           |  49,206 |  25,661 | -47.9%               |
| kernel_forward_velocity_c273           |  14,874 |  11,172 | -24.9%               |
| kernel_update_cartesian_space_c289     |  57,853 |   7,617 | -86.8%               |
+----------------------------------------+---------+---------+----------------------+
| Total MSL across the test              | 282,603 | 143,203 | -49.3% (-139,400 LOC)|
+----------------------------------------+---------+---------+----------------------+

(Before / After are MSL line counts. Numbers from a local tests/test_grad.py::test_differentiable_push[gpu] run with QD_DUMP_MSL=1 QD_OFFLINE_CACHE=0. Test wall-clock on the same run dropped from 91.9 s to 38.8 s, a 58% reduction (about 2.4x faster) that comes from less time spent in the MSL compiler.)
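The headline percentages follow directly from the before/after figures quoted above. A minimal check of the arithmetic, using only the total line counts and the wall-clock numbers from this description (the helper name `reduction_pct` is made up for illustration):

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


# Total MSL lines across the test, from the summary table.
total_msl = reduction_pct(282_603, 143_203)
# Test wall-clock in seconds, from the note under the table.
wall_clock = reduction_pct(91.9, 38.8)
speedup = 91.9 / 38.8

print(f"total MSL: -{total_msl:.1f}%")
print(f"wall-clock: -{wall_clock:.1f}% ({speedup:.2f}x faster)")
```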

Why

Two motivations:

  1. Reduce SPIR-V codegen size waste. The pre-PR AdStackPushStmt codegen emits a structured OpSelectionMerge / OpBranchConditional region per push that spirv-cross renders as ~13 MSL lines per push. For reverse-grad kernels with ~1000 pushes, that's the dominant size amplifier. Per-stack count_var OpVariable Function slots also become independent OpPhi mega-clusters at every enclosing loop header (700+ phis per merge). Both can be compressed without correctness change. The release-mode bounds-check (clamp + atomic-signal) on SPIR-V is currently always live; LLVM has always elided it in release; aligning the gates lets release builds skip the per-push branch entirely.

  2. Make Metal pipeline-create failures self-describing. When Apple's MSL compiler service drops the XPC connection mid-compile, newComputePipelineStateWithFunction:error: returns nil with error == nil. The pre-PR path silently returned nullptr; the user only saw the generic runtime.cpp:298 RhiResult=-1 line with no kernel name or byte size. The new path warn-logs the kernel name and cross-compiled MSL byte size, so any future investigator reading CI artifacts immediately sees which kernel hit the path.

Mechanism end-to-end

1. AdStackPushStmt: clamp + OpSelect instead of structured if-then-else

quadrants/codegen/spirv/spirv_codegen.cpp::TaskCodegen::visit(AdStackPushStmt*) previously emitted a structured OpSelectionMerge / OpBranchConditional region around every push, with the then-branch doing the in-bounds store and the else-branch publishing the overflow signal. The new emit folds the entire region into:

  1. clamped_idx = GLSLstd450UMin(count, max_size - 1)
  2. unconditional store to primal[clamped_idx] (and adjoint[clamped_idx] = 0 for heap_float)
  3. unconditional count++
  4. signal = OpSelect(count >= max_size, stack_id+1, 0)
  5. unconditional OpAtomicUMax(overflow_buffer[0], signal)

The clamp keeps the OpAccessChain in-bounds; the atomic-max with 0 is a no-op when the stack didn't overflow, so the host-readable flag still ends up at stack_id + 1 only when an actual overflow happened. spirv-cross emits this as straight-line MSL: ~5 lines per push instead of ~13.
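The five steps above can be modelled on the host to see why the branchless form reports the same overflow flag as the branching one. This is only a behavioural sketch in Python, not the SPIR-V emit: `AdStack`, `push_branchy`, `push_branchless`, and `run` are hypothetical names, and it assumes the `count` tested in step 4 is the value read before the increment in step 3.

```python
class AdStack:
    """Host-side stand-in for one adstack (hypothetical, for illustration)."""

    def __init__(self, stack_id, max_size):
        self.stack_id = stack_id
        self.max_size = max_size
        self.count = 0
        self.primal = [0.0] * max_size


def push_branchy(stack, value, overflow):
    """Pre-PR shape: an if/else region around every push."""
    if stack.count < stack.max_size:
        stack.primal[stack.count] = value
        stack.count += 1
    else:
        overflow[0] = max(overflow[0], stack.stack_id + 1)


def push_branchless(stack, value, overflow):
    """Post-PR shape: clamp, store, increment, select, atomic-max."""
    count = stack.count
    clamped_idx = min(count, stack.max_size - 1)  # step 1: UMin clamp
    stack.primal[clamped_idx] = value             # step 2: unconditional store
    stack.count = count + 1                       # step 3: unconditional count++
    # Step 4: select the signal from the pre-increment count (assumption).
    signal = stack.stack_id + 1 if count >= stack.max_size else 0
    overflow[0] = max(overflow[0], signal)        # step 5: max with 0 is a no-op


def run(push, n_pushes, stack_id=3, max_size=2):
    """Push n_pushes values and return the stack plus the overflow flag."""
    stack, overflow = AdStack(stack_id, max_size), [0]
    for i in range(n_pushes):
        push(stack, float(i + 1), overflow)
    return stack, overflow[0]
```

In this model, three pushes into a stack of size 2 leave the flag at `stack_id + 1` for both variants. The branchless one differs only in that `count` reaches 3 and the last slot is overwritten, which the overflow flag already reports as an error.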

2. Bounds-check gate switched to check_out_of_bound

The clamp + atomic-signal pair from above is now gated on compile_config->check_out_of_bound || compile_config->debug in SPIR-V codegen. Release builds elide the entire bounds check, mirroring LLVM's release-build push (which has always relied on determine_ad_stack_size producing a tight static bound and dropped the per-push runtime guard). LLVM's six adstack visitors switch their gate from compile_config.debug to compile_config.check_out_of_bound so the two backends key off the same flag.

| Backend | Bounds-check path | Release behaviour |
| --- | --- | --- |
| LLVM (CPU / CUDA / AMDGPU) | check_out_of_bound -> stack_init / stack_push runtime calls | inline ops, no overflow flag |
| SPIR-V (Metal / Vulkan) | check_out_of_bound \|\| debug -> clamp + OpAtomicUMax signal | unconditional store, no overflow flag |

CompileConfig::fit() already promotes debug=True to check_out_of_bound=True, so existing qd.init(debug=True) users see no behaviour change. Users who explicitly set qd.init(check_out_of_bound=True, debug=False) now also get the bounds check on LLVM, which they didn't before. The OR with debug in the SPIR-V gate preserves the qd.init(debug=True) path on Metal / Vulkan, where Program::init force-disables check_out_of_bound because those arches lack Extension::assertion.
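The flag interaction described here can be summarised as a small truth-table function. This is a sketch of the behaviour as stated in this description, not code from the repository: `adstack_check_enabled` and the backend strings are hypothetical, and it assumes `CompileConfig::fit()` runs before the `Program::init` override. A later commit on this branch moved the adstack gate back to `debug` only, so the sketch models the description as originally written.

```python
SPIRV_BACKENDS = {"metal", "vulkan"}  # lack Extension::assertion per the text
LLVM_BACKENDS = {"cpu", "cuda", "amdgpu"}


def adstack_check_enabled(backend: str, check_out_of_bound: bool, debug: bool) -> bool:
    """Return whether the adstack overflow check is live for this configuration."""
    # CompileConfig::fit(): debug=True promotes check_out_of_bound to True.
    if debug:
        check_out_of_bound = True
    if backend in SPIRV_BACKENDS:
        # Program::init force-disables check_out_of_bound on these arches.
        check_out_of_bound = False
        # SPIR-V gate: check_out_of_bound || debug.
        return check_out_of_bound or debug
    if backend in LLVM_BACKENDS:
        # LLVM gate: check_out_of_bound only.
        return check_out_of_bound
    raise ValueError(f"unknown backend: {backend}")
```

Under these assumptions, `check_out_of_bound=True` alone enables the check on LLVM backends but not on Metal / Vulkan, while `debug=True` enables it everywhere.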

3. Shared count-array OpVariable for adstack count_var

Each adstack count_var used to be its own OpVariable Function of type uint. spirv-opt's LocalMultiStoreElim / SSARewrite promoted each into its own SSA chain, which became a separate OpPhi at every enclosing loop header. spirv-cross then emitted each phi as one uint _N; forward-decl + one _N = _N; alias copy per predecessor branch. Reverse-grad kernels with hundreds of adstacks crossing a single loop accumulated phi mega-clusters of 700+ entries per loop header.

This PR replaces the per-stack scalar OpVariable with a single Function-scope uint[num_ad_stacks_] array, allocated lazily on first ad_stack_count_ptr(stack_id) call and indexed by OpAccessChain per push / pop / load-top. spirv-opt's mem2reg passes do not promote OpAccessChain into an aggregate, so the slots stay memory-backed and never become per-stack phis. The array is sized from a pre-pass scan that counts AdStackAllocaStmt nodes (num_ad_stacks_).
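The storage change can be pictured as swapping N independent scalars for one lazily allocated array that every push, pop, and load-top indexes into. The Python model below is only an analogy for the codegen bookkeeping: `CountArray` and `count_ptr` are hypothetical stand-ins for the single `uint[num_ad_stacks_]` variable and `ad_stack_count_ptr(stack_id)`, and the IR is reduced to a list of statement-kind strings.

```python
class CountArray:
    """One shared count array instead of one scalar variable per adstack."""

    def __init__(self, ir_stmts):
        # Pre-pass: size the array by counting adstack allocas in the kernel.
        self.num_ad_stacks = sum(1 for s in ir_stmts if s == "AdStackAllocaStmt")
        self._slots = None  # allocated lazily on the first count_ptr() call

    def count_ptr(self, stack_id):
        if self._slots is None:
            self._slots = [0] * self.num_ad_stacks
        if not 0 <= stack_id < self.num_ad_stacks:
            raise IndexError(stack_id)
        # Analogue of OpAccessChain: the shared base plus an index.
        return self._slots, stack_id


ir = ["AdStackAllocaStmt", "AdStackPushStmt", "AdStackAllocaStmt"]
counts = CountArray(ir)
base, idx = counts.count_ptr(1)
base[idx] += 1  # a push on stack 1 bumps only its own slot
```

Every access goes through the same base object, which mirrors why the optimiser sees one memory-backed aggregate rather than many promotable scalars.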

This is the single biggest lever: kernel_update_cartesian_space_c289_0_reverse_grad drops from 57,853 MSL lines to 7,617 (-87%) entirely from this change.

4. Metal pipeline / library failure: QD_WARN + nullptr return (not QD_ERROR)

quadrants/rhi/metal/metal_device.mm::create_compute_pipeline and MetalDevice::get_mtl_library previously took the nil pipeline + nil NSError path silently and returned nullptr (or, on the err != nil path, called RHI_LOG_ERROR and returned nullptr). The new path logs at WARN level with QD_WARN, including the kernel name (where available) and cross-compiled MSL byte size:

[W ...] [metal_device.mm:create_compute_pipeline@206] Apple's Metal compiler service
rejected the compute-pipeline build for kernel 'g2p_c511_0_reverse_grad_0_t00'
(cross-compiled MSL size: 689472 bytes) without returning a structured error. The XPC
service drops its connection in this shape; the underlying cause is host-toolchain-
specific and is not recoverable from this side.
[E ...] [runtime.cpp:CompiledQuadrantsKernel@298] Failed to create pipeline ... RhiResult=-1

QD_WARN rather than QD_ERROR is critical: QD_ERROR ends with throw s (where s is a bare std::string), and MetalDevice::create_pipeline is declared noexcept and only catches std::exception derivatives. A throw of std::string here would cross the noexcept boundary and trip std::terminate(), replacing the existing clean Python RuntimeError translation with a fatal process abort. With QD_WARN, no exception is thrown inside the noexcept function; the nullptr return is converted by the caller to RhiResult::error, the runtime.cpp:298 QD_ERROR_IF then throws std::string, the existing pybind11 translator (quadrants/python/py_exception_translator.cpp) catches it and raises PyExc_RuntimeError. Verified empirically by force-injecting the failure path locally and observing the Python-level RuntimeError exception.
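The control flow being protected here is: log a warning, return a null handle, and let the caller turn that into the exception the Python layer already knows how to translate. A rough Python analogue of that flow follows. All names (`create_compute_pipeline`, `compile_kernel`, the `compiler` callable) are placeholders rather than Quadrants APIs, and Python has no `noexcept`, so the sketch only shows the warn-then-return-None shape.

```python
import logging

log = logging.getLogger("metal_device_sketch")


def create_compute_pipeline(kernel_name, msl_source, compiler):
    """Never raises on a silent compiler rejection: warns and returns None."""
    pipeline = compiler(msl_source)
    if pipeline is None:
        log.warning(
            "compiler rejected pipeline build for kernel %r "
            "(cross-compiled MSL size: %d bytes) without a structured error",
            kernel_name,
            len(msl_source.encode("utf-8")),
        )
    return pipeline


def compile_kernel(kernel_name, msl_source, compiler):
    """Caller converts the None handle into the user-visible error."""
    pipeline = create_compute_pipeline(kernel_name, msl_source, compiler)
    if pipeline is None:
        raise RuntimeError(f"Failed to create pipeline for {kernel_name}")
    return pipeline
```

Keeping the raise in the caller means the inner function has a single, exception-free exit path, which is the property the `noexcept` declaration relies on in the C++ code.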

The wording deliberately does not assert a specific cause (size, construct, driver bug, ...). The XPC connection drop is observable from this side; the actual reason in the toolchain is not.

Per-backend coverage matrix

| Backend | Adstack push shrink | Bounds-check gate | Count-array shared | Metal warn-log |
| --- | --- | --- | --- | --- |
| arm64 / x64 (LLVM CPU) | N/A (LLVM emits inline) | switched to check_out_of_bound | N/A | N/A |
| CUDA / AMDGPU (LLVM GPU) | N/A | switched to check_out_of_bound | N/A | N/A |
| Vulkan (SPIR-V) | clamp + OpSelect | check_out_of_bound \|\| debug | yes | N/A |
| Metal (SPIR-V) | clamp + OpSelect | check_out_of_bound \|\| debug | yes | yes |

Tests

tests/python/test_adstack.py::test_adstack_overflow_raises

Existing test, kept on debug=True. Verifies that an adstack push past the published max_size raises RuntimeError("[Aa]dstack overflow") on the next qd.sync(). debug=True implies check_out_of_bound=True via CompileConfig::fit, so the bounds-check codepath is live.

tests/python/test_adstack.py::test_adstack_overflow_raises_check_oob_explicit (new)

Same overflow scenario as the test above but with check_out_of_bound=True set explicitly without debug=True. Pins the gating to check_out_of_bound rather than debug: a release-build user who explicitly opts into bounds-checks gets the same RuntimeError as a debug-mode user. Excluded on Metal / Vulkan because Program::init force-disables check_out_of_bound on arches without Extension::assertion, so the explicit-flag spelling alone cannot light up the bounds check there.

tests/python/test_adstack.py::test_adstack_overflow_flag_resets_after_catch

Existing test, unchanged. Pins that check_adstack_overflow() clears the flag after raising so a subsequent qd.sync() returns normally.

Local AD test status with QD_OFFLINE_CACHE=0

1214 tests pass on this branch (tests/python/test_adstack.py plus the broader test_ad_*.py files) across arch=arm64, arch=metal-2, arch=vulkan-0. Same numbers on pristine origin/main.

Side-effect audit

| Concern | Where checked | Verdict |
| --- | --- | --- |
| Offline cache key | compile_config.h::check_out_of_bound is already part of the config feeding the cache key; no schema change | OK |
| task_attribs.ad_stack serialisation | Layout unchanged: per_thread_stride_*_compile_time and allocas[] populated identically | OK |
| info.count_var users | All loads / stores go through load_variable / store_variable which accept kVariablePtr whether the underlying is a fresh OpVariable or an OpAccessChain element | OK |
| Adstack overflow signal | OpAtomicUMax(buffer, 0) is a no-op for the host-visible value, so the runtime still observes a set flag iff some thread actually overflowed | OK |
| Reverse-pass count semantics | count++ is now unconditional, so push and pop are balanced even when the in-bounds check would have skipped the increment. LoadTop*/AccAdjoint already clamp via UMin so an overflowed count of UINT_MAX still produces a clamped in-bounds index | OK |
| LLVM compile_config.debug integer-overflow checks (BinaryOpStmt / shift sites) | Untouched - still gate on debug (codegen_llvm.cpp lines 503/514/525/560); only the six adstack visitors switched to check_out_of_bound | OK |
| Vulkan push-constant / descriptor binding layout | Unchanged | OK |
| metal_device.mm raster-fallback site (build_mtl_render_pipeline) | This PR does NOT touch the raster site - it still uses the pre-PR RHI_LOG_ERROR + silent-on-err==nil behaviour. Out of scope for this PR; flagged here so the audit table matches the diff | Untouched (intentional) |
| Process-abort regression on the new metal nil err path | Verified empirically: with QD_WARN + nullptr return, force-injecting the path produces a Python RuntimeError (not std::terminate) | OK |
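The reverse-pass row can be checked with a small model: because `count++` is unconditional, an overflowing run of pushes is undone by the same number of pops, and a clamped top index stays in range even if the counter wraps. This sketch assumes 32-bit unsigned wraparound and that load-top reads index `UMin(count - 1, max_size - 1)`. The helper names and that exact index expression are assumptions for illustration, not taken from the codegen.

```python
U32 = 0xFFFFFFFF  # mask emulating 32-bit unsigned arithmetic


def push(count):
    """Unconditional count++ (the store itself is omitted here)."""
    return (count + 1) & U32


def pop(count):
    """Unconditional count-- with unsigned wraparound."""
    return (count - 1) & U32


def top_index(count, max_size):
    """Clamped index of the top element."""
    # (count - 1) wraps to UINT_MAX when count == 0; the min clamps it back in range.
    return min((count - 1) & U32, max_size - 1)


count = 0
for _ in range(5):  # five pushes into a stack of size 2: three of them overflow
    count = push(count)
for _ in range(5):  # the matching five pops in the reverse pass
    count = pop(count)
```

After the balanced run `count` is back at zero, and `top_index` never leaves `[0, max_size - 1]` for any counter value in this model.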

Comment thread quadrants/rhi/metal/metal_device.mm Outdated
Comment thread quadrants/rhi/metal/metal_device.mm
@duburcqa duburcqa changed the title from "[SPIR-V] Shrink reverse-grad kernel MSL by ~50% to fit Apple Metal compiler" to "[SPIR-V] Shrink reverse-grad kernel MSL by ~50%" on Apr 29, 2026
@duburcqa

Contributor Author

@claude review

Comment thread docs/source/user_guide/debug.md Outdated
@github-actions


Coverage Report (78e94d3f1)

| File | Coverage | Missing |
| --- | --- | --- |
| 🟢 tests/python/test_adstack.py | 100% | |

Diff coverage: 100% · Overall: 73% · 6 lines, 0 missing

Full annotated report

Comment thread quadrants/codegen/spirv/spirv_codegen.cpp Outdated
Comment thread docs/source/user_guide/debug.md Outdated
```python
a = x[-1]  # AssertionError in debug mode
```

The same flag also enables a deferred runtime check on the adstack used by reverse-mode autodiff: a push past the per-stack capacity (set via `qd.init(ad_stack_size=...)` or per-alloca by `determine_ad_stack_size`) raises `RuntimeError("[Aa]dstack overflow")` on the next `qd.sync()`. Without bounds-checking, an adstack overflow silently writes past the per-thread slab and produces a wrong gradient.

Collaborator

Let's add some kind of adstack section/subsection header please.

Collaborator

maybe #### adstack ?

Comment thread docs/source/user_guide/debug.md Outdated

The same flag also enables a deferred runtime check on the adstack used by reverse-mode autodiff: a push past the per-stack capacity (set via `qd.init(ad_stack_size=...)` or per-alloca by `determine_ad_stack_size`) raises `RuntimeError("[Aa]dstack overflow")` on the next `qd.sync()`. Without bounds-checking, an adstack overflow silently writes past the per-thread slab and produces a wrong gradient.

`debug=True` is a superset of `check_out_of_bound=True`. Setting `qd.init(check_out_of_bound=True)` without `debug=True` enables the field bounds check and the adstack overflow check, but skips kernel `assert` evaluation, integer overflow detection on arithmetic, and the other checks listed below. Use this when you want bounds-safety in a release-build app without paying the full debug-mode cost.

Collaborator

This seems like a mixture of general debug stuff and adstack-specific stuff. Can we factor out the general stuff to go outside of the new adstack subsection, and keep just the adstack-specific stuff here please?

Collaborator

You haven't introduced check_out_of_bounds yet. It should be a separate section from 'debug', I feel. But... why introduce a separate flag? Why not just have a single debug flag, for simplicity?

Collaborator

Ok, I see you've started to provide the reasons, but I feel this could be structured more clearly, and I think it's confusing to have two flags, one of which is a subset of the other, so if we can avoid that that might be cleaner. I guess full debug is super slow?

what happens if debug is true, and check_out_of_bounds is false?

Comment thread docs/source/user_guide/debug.md Outdated

`debug=True` is a superset of `check_out_of_bound=True`. Setting `qd.init(check_out_of_bound=True)` without `debug=True` enables the field bounds check and the adstack overflow check, but skips kernel `assert` evaluation, integer overflow detection on arithmetic, and the other checks listed below. Use this when you want bounds-safety in a release-build app without paying the full debug-mode cost.

On the Metal and Vulkan backends, `check_out_of_bound=True` is silently disabled at `qd.init` time because those backends lack the in-kernel assertion extension that the field bounds check relies on; passing it on its own gives you neither the field bounds check nor the adstack overflow check. Pass `debug=True` instead: that keeps the adstack overflow check live (it is gated independently and does not need the assertion extension), but the field bounds check still does not fire on these backends.

Collaborator

check_out_of_bounds true seems like a general thing, so let's also move it outside of the adstack section please.

@duburcqa

Contributor Author

========== 649 passed, 3 skipped, 2 xfailed in 1089.97s (0:18:09) ===========

@duburcqa

Contributor Author

| env | batch_size | backend | gjk_collision | constraint_solver | runtime_fps_590 | runtime_fps_591 | runtime_fps_delta_pct |
| --- | --- | --- | --- | --- | --- | --- | --- |
| anymal_random | 30000 | cuda | - | - | 9314890 | 9225767 | -0.96 |
| anymal_uniform | 30000 | cuda | - | - | 12361607 | 12197108 | -1.33 |
| anymal_uniform_kinematic | 0 | cpu | - | - | 2028 | 2024 | -0.20 |
| anymal_uniform_kinematic | 30000 | cuda | - | - | 10266957 | 10462211 | +1.90 |
| anymal_zero | 0 | cpu | - | - | 7073 | 7295 | +3.14 |
| anymal_zero | 30000 | cuda | - | - | 19096316 | 18941109 | -0.81 |
| box_pyramid_3 | 4096 | cuda | - | - | 968342 | 975716 | +0.76 |
| box_pyramid_4 | 4096 | cuda | - | - | 395824 | 389584 | -1.58 |
| box_pyramid_5 | 4096 | cuda | - | - | 141715 | 139459 | -1.59 |
| box_pyramid_6 | 4096 | cuda | False | - | 59547 | 58861 | -1.15 |
| box_pyramid_6 | 4096 | cuda | True | - | 61918 | 59940 | -3.19 |
| dex_hand | 4096 | cuda | - | - | 17081 | 17181 | +0.59 |
| duck_in_box_easy | 30000 | cuda | False | - | 26469341 | 26766339 | +1.12 |
| duck_in_box_easy | 30000 | cuda | True | - | 9681964 | 9660875 | -0.22 |
| duck_in_box_hard | 0 | cpu | - | - | 5204 | 5154 | -0.96 |
| duck_in_box_hard | 30000 | cuda | False | - | 10077402 | 10149398 | +0.71 |
| duck_in_box_hard | 30000 | cuda | True | - | 3493847 | 3354442 | -3.99 |
| franka | 30000 | cuda | - | - | 21764102 | 21686917 | -0.35 |
| franka_accessors | 0 | cpu | - | - | 1182 | 1214 | +2.71 |
| franka_accessors | 30000 | cuda | - | - | 15853954 | 15474380 | -2.39 |
| franka_free | 30000 | cuda | - | - | 32626193 | 32190516 | -1.34 |
| franka_random | 0 | cpu | - | - | 6301 | 6287 | -0.22 |
| franka_random | 30000 | cuda | - | CG | 16793645 | 16658204 | -0.81 |
| franka_random | 30000 | cuda | - | Newton | 16535074 | 16288336 | -1.49 |
| franka_random | 30000 | cuda | False | - | 16786850 | 16318058 | -2.79 |
| franka_random | 30000 | cuda | True | - | 11327811 | 11463262 | +1.20 |
| g1_fall | 4096 | cuda | - | Newton | 920318 | 918658 | -0.18 |
| go2 | 4096 | cuda | False | CG | 3668476 | 3653937 | -0.40 |
| go2 | 4096 | cuda | False | Newton | 4386696 | 4383261 | -0.08 |
| go2 | 4096 | cuda | True | - | 3230236 | 3290806 | +1.88 |
| shadow_hand_cubes | 0 | cpu | - | - | 41 | 41 | +0.00 |
| shadow_hand_cubes_sparse | 0 | cpu | - | - | 66 | 66 | +0.00 |

speed_test_591.txt

duburcqa added a commit that referenced this pull request Apr 29, 2026
…pu] now that PR #591 codegen is in place via a398612; this is the test where the original Genesis CI failure was observed and where the local M4 measurement put mpm_grid_op_c65 at 85.6 MB peak phys_footprint - right at the 100 MB cap, so the matrix run on macos-15 / macos-26 will resolve whether PR #591 alone is enough
duburcqa added a commit that referenced this pull request Apr 29, 2026
…adstack-overflow checks (init_options.md) and for kernel print() (debug.md), restate that the adstack overflow check fires on all backends with debug=True regardless of whether the backend supports the assertion mechanism, and warn that kernel print() forces a queue sync after every dispatch of the containing kernel - significant overhead even when the surrounding control flow makes the print unreachable; also relocate the slot-pointer comment block in spirv_codegen.cpp from above ad_stack_count_ptr to above ad_stack_slot_ptr where it actually belongs (per the bot review on PR #591)
@github-actions


Coverage Report (71cf4bdcb)

| File | Coverage | Missing |
| --- | --- | --- |
| 🟢 tests/python/test_adstack.py | 100% | |

Diff coverage: 100% · Overall: 67% · 6 lines, 0 missing

Full annotated report


**Note.** Output from GPU kernels appears in order despite parallel execution because all kernels are queued in the same compute stream.

**Important.** Avoid kernel `print()` calls in production code where you can. Quadrants synchronizes the compute queue after every dispatch of a kernel that contains a `print()` so the output appears as close as possible to the call site. The synchronization happens unconditionally on every launch of that kernel, even when the surrounding control flow leaves the `print()` unreached at runtime; the cost is the full per-launch sync overhead, not just the cost of the `print()` itself.

Collaborator

nice

Comment thread docs/source/user_guide/init_options.md Outdated
| CPU | with `check_out_of_bound=True` or `debug=True` | with `check_out_of_bound=True` or `debug=True` |
| CUDA | with `check_out_of_bound=True` or `debug=True` | with `check_out_of_bound=True` or `debug=True` |
| AMDGPU | with `check_out_of_bound=True` or `debug=True` | with `check_out_of_bound=True` or `debug=True` |
| Metal | never (no in-kernel assertion mechanism) | with `debug=True` only |

Collaborator

Wait, why the inconsistency for 'adstack overflow check' on Metal and Vulkan?

Collaborator

I think the behavior should be consistent across platforms, except for features not supported by a platform at all (so 'never' is ok for vulkan and metal for example (though not ideal of course)).

Contributor Author

I'm ok with that, but this behaviour pre-dates this PR; here I'm just documenting the current state. I could fix it in this PR if you want.

Collaborator

Oh, I see, I assumed that these were changes in this PR.

Yeah, ok, let's not feature-flate this PR :) Thank you for the doc :) I think it explains clearly the current situation. 🙌

Comment thread docs/source/user_guide/init_options.md Outdated
| Metal | never (no in-kernel assertion mechanism) | with `debug=True` only |
| Vulkan | never (no in-kernel assertion mechanism) | with `debug=True` only |

The adstack overflow check is gated independently of the assertion mechanism, so `debug=True` activates it on every backend - including Metal and Vulkan, where the field bounds check stays unavailable. On Metal and Vulkan, `check_out_of_bound` is silently reset to `False` at `qd.init` time (a warning is logged); passing it on its own gives neither check on those backends.

Collaborator

I think we should leave it enabled on Metal, and narrow the warning to say that only adstack overflows will be checked, not out of bounds.

But actually, now I think about it, why should 'out of bound' track 'adstack overflow'?

I think these should be two different flags.

Contributor Author

That makes sense. Do you want to use 'debug' for this or a new flag? You want to do the changes in this PR?

Collaborator

Since you are just documenting the existing behavior, let's not change this in this doc. Thank you :)

Comment thread docs/source/user_guide/debug.md
@hughperkins

Collaborator

Ok, doc looks good to me. Whilst it looks like these changes just target reverse-grad autodiff, let's get Genesis unit test results and Genesis benchmark results please, just to be sure.

@hughperkins

Collaborator

oh they're already there.

checklist:

  • user-facing doc filled in
  • Genesis benchmarks neutral
  • genesis unit tests passing

=> ok to merge

Comment thread quadrants/codegen/spirv/detail/spirv_codegen.h
…elease bounds-check elision, shared count-array
…EXC_RESOURCE mechanism observed on GitHub-hosted macos-15 Apple-M1 runners (XPC service hits a hard 100 MB working-set cap during AIR-to-GPU compile and is killed by the kernel) and add a Metal/Vulkan caveat to the new debug-mode paragraph clarifying that check_out_of_bound is silently disabled on those backends and only the adstack overflow check survives via debug=True
…cific 100 MB / EXC_RESOURCE framing in favor of a generic 'compiler service exceeds a per-process memory budget mid-compile' wording, since the cap is platform-specific and citing it inline overspecifies the failure
…lit the adstack overflow check into its own subsection of debug.md, move the check_out_of_bound flag interaction (table of debug/check_out_of_bound combinations + Metal/Vulkan caveat) into the dedicated check_out_of_bound entry of init_options.md so debug.md stays focused on the user-facing checks and the option-reference centralizes the flag-level details, and tighten both debug and check_out_of_bound entries to bullet/table form so the relevant facts are scannable instead of buried in prose
…ng section as a #### Adstack overflow subsection (per Hugh's #### suggestion: it's another bounds check, sharing the same check_out_of_bound flag), and add a back-cross-reference from init_options.md's Debugging section to debug.md so users landing on the option reference can find the runnable examples and develop/benchmark workflow
duburcqa added 12 commits April 30, 2026 00:26
…adstack-overflow checks (init_options.md) and for kernel print() (debug.md), restate that the adstack overflow check fires on all backends with debug=True regardless of whether the backend supports the assertion mechanism, and warn that kernel print() forces a queue sync after every dispatch of the containing kernel - significant overhead even when the surrounding control flow makes the print unreachable; also relocate the slot-pointer comment block in spirv_codegen.cpp from above ad_stack_count_ptr to above ad_stack_slot_ptr where it actually belongs (per the bot review on PR #591)
…to autodiff.md's bold-prefix style for consistency across the user_guide
…arking' to 'Avoid ... in production code' since the queue-sync overhead matters in any production path, not just during benchmarks
…rint sync warning - the print is in the kernel body and reachable in principle; it just may not be hit on a given launch
…de if possible' - print may be unavoidable in some debugging-in-production scenarios
… close as possible to the call site' - more precise about what the sync buys
… warning to avoid the doubled 'possible' against 'as close as possible'
…e debug=True implies check_out_of_bound=True relationship first, then the actionable recommendation that follows from it
…unnecessarily dropped from the kernel-print intro line
…note - the claim is unverified and Quadrants' per-kernel sync after dispatching a print-bearing kernel may already serialize in practice
…e all kernels share one compute stream) instead of dropping it, and switch the two 'Cost:' leads in init_options.md to '**Cost.**' bold-prefix style for consistency with the autodiff.md / debug.md Note / Important / Cost convention
…s check per Hugh's review on PR #591: gate the AdStack push/pop/load_top/load_top_adj sites on compile_config.debug instead of compile_config.check_out_of_bound on the LLVM side (matches the pre-#591 behavior verbatim) and on compile_config_->debug (no longer ORed with check_out_of_bound) on the SPIR-V side, so the two checks land on independent flags and PR #591 stops introducing the coupling Hugh flagged. Also relabel the AMDGPU print() table row from 'no (compile error)' to 'no (silently dropped)' since codegen_amdgpu.cpp visit(PrintStmt) overrides with a no-op (per bot review), fix the spirv_codegen.h cross-reference from the non-existent 'ensure_ad_stack_count_array_var' to the real 'ad_stack_count_ptr' helper (per bot review), and update the init_options.md per-backend table + flag-interaction bullets to reflect the new debug-only gating for the adstack overflow check
@duburcqa duburcqa force-pushed the duburcqa/spirv_reverse_grad_kernel_size branch from d9a1443 to a05e06a on April 29, 2026 22:27
@github-actions


Coverage Report (d9a1443b5)

| File | Coverage | Missing |
| --- | --- | --- |
| 🟢 tests/python/test_adstack.py | 100% | |

Diff coverage: 100% · Overall: 73% · 6 lines, 0 missing

Full annotated report

Comment thread tests/python/test_adstack.py Outdated
@github-actions


Coverage Report (a05e06a17)

| File | Coverage | Missing |
| --- | --- | --- |
| 🔴 tests/python/test_adstack.py | 33% | 1024-1027 |

Diff coverage: 33% · Overall: 65% · 6 lines, 4 missing

Full annotated report

…'implied check_out_of_bound' references in the two adjacent overflow-test docstrings, per bot review on PR #591. After the earlier decoupling commit on this branch (which moved the LLVM adstack-visitor gates back to compile_config.debug and the SPIR-V push gate to compile_config_->debug only), check_out_of_bound=True alone no longer activates the adstack-overflow check on any backend - the test pinning that coupling is invalid by construction. The remaining test_adstack_overflow_raises[debug=True] still covers the user-facing 'I need the deferred RuntimeError on overflow' path
@github-actions


Coverage Report (2713efbc8)

File Coverage Missing

Diff coverage: 0% · Overall: 73% · 0 lines, 0 missing

Full annotated report

@duburcqa duburcqa merged commit da5e8e0 into main Apr 30, 2026
53 checks passed
@duburcqa duburcqa deleted the duburcqa/spirv_reverse_grad_kernel_size branch April 30, 2026 06:31
npoulad1 added a commit to ROCm/quadrants that referenced this pull request Jun 8, 2026
* [Misc] Warn user to disable caching when print_ir/QD_DUMP_IR enabled (Genesis-Embodied-AI#425)

Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>

* [Build] Pin torch version to CUDA 12.8 for CUDA tests (Genesis-Embodied-AI#428)

* [Misc] Fixing up taichi-dev urls (Genesis-Embodied-AI#429)

* [Perf] Rename cuda_graph to gpu_graph across the codebase (Genesis-Embodied-AI#430)

* Misc: fix typo integeral -> integral (Genesis-Embodied-AI#434)

Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>

* [Perf] CUDA graph 4: call from multiple locations (Genesis-Embodied-AI#420)

* [Bug] Fix fastcache not restoring graph_do_while_arg (Genesis-Embodied-AI#435)

* [Perf] Cache last-call result in perf_dispatch for single-compatible case (Genesis-Embodied-AI#438)

* Fix gpu_graph fallback on old Nvidia GPU. (Genesis-Embodied-AI#443)

* Fix shared memory offset not reset between CUDA kernels. (Genesis-Embodied-AI#442)

* [Misc] Allow disabling GPU graph via QD_GPU_GRAPH=0 env var (Genesis-Embodied-AI#439)

* [Misc] Add named top-level loops (Genesis-Embodied-AI#440)

* [Misc] Rename gpu_graph to graph (Genesis-Embodied-AI#446)

* [Misc] Add cross-platform shuffle (Genesis-Embodied-AI#447)

* [Bug] Fix graph_do_while on Windows: search for cudadevrt.lib (Genesis-Embodied-AI#456)

* [Bug] Also search default CUDA toolkit install location on Windows (Genesis-Embodied-AI#461)

* [SPIRV] Feature Parity Atomics & Shared Array (Genesis-Embodied-AI#432)

* [Misc] Change clang format to 120 characters (Genesis-Embodied-AI#463)

* [Misc] CUDA graph 5 Add fatbin (Genesis-Embodied-AI#464)

* [Bug] Reuse VkInstance across init/reset cycles (Genesis-Embodied-AI#465)

* [Perf] Tiles 1: _load, _store, _eye_ (Genesis-Embodied-AI#466)

* [Misc] Remove dead InternalFuncStmt type_check override (Genesis-Embodied-AI#471)

* [Perf] Tiles 2: add cholesky and ger (Genesis-Embodied-AI#472)

* [Perf] Tiles 2b: add triangular solve (Genesis-Embodied-AI#474)

* [Misc] Refactor: use _get_col/_set_col in tiles load/store/init (Genesis-Embodied-AI#475)

* [Build] Fix flaky test_clock_accuracy (Genesis-Embodied-AI#436)

* Fix AARCH64 emitting invalid asm in CUDA kernels. (Genesis-Embodied-AI#473)

Co-authored-by: Hugh Perkins <hughperkins@gmail.com>

* [AMDGPU] Enable HIP memory pool and surface pool-exhaustion errors. (Genesis-Embodied-AI#485)

* [AMDGPU] Scope hsaco tmp dir per-user to avoid collisions. (Genesis-Embodied-AI#484)

* [Perf] Tiles 3: Add slice syntax, qd.outer() and initial doc (Genesis-Embodied-AI#477)

* [AMDGPU] Fix gradient computation. (Genesis-Embodied-AI#486)

* Enable all backends that are supported in unit tests. (Genesis-Embodied-AI#488)

* Fix SPIRV ID overflow for large kernels due to autodiff. (Genesis-Embodied-AI#489)

* [Misc] Fix purity checker to allow accessing constants from quadrants modules (Genesis-Embodied-AI#487)

* [Misc] Increase tolerance for clock monotonic test (Genesis-Embodied-AI#492)

* [CI] Serialize api doc workflow (Genesis-Embodied-AI#494)

* [CI] Increase tolerance for clock test (Genesis-Embodied-AI#506)

* [CI] Increase clock test tolerance to 20% (Genesis-Embodied-AI#509)

* [Perf] Add tensor_type parametrization to tile16 tests (Genesis-Embodied-AI#504)

* [Perf] Tiles 4b: Migrate tiles16 tests to enable fastcache (Genesis-Embodied-AI#505)

* [Perf] Tiles 4c: add Tiles16x16 proxy (Genesis-Embodied-AI#507)

* [Perf] Tiles 4d: Consolidate slice error tests using parametrize (Genesis-Embodied-AI#508)

* [Perf] Tiles 4: add SharedArray slice support (Genesis-Embodied-AI#482)

* [Perf] Tiles 5: add Cholesky benchmark demo (Genesis-Embodied-AI#483)

* [Doc] Add user guide page for subgroup shuffle (Genesis-Embodied-AI#512)

* [Perf] Implement cross-platform shuffle_down (Genesis-Embodied-AI#510)

* [Perf] Add portable subgroup reduce_add and reduce_all_add (Genesis-Embodied-AI#511)

* [Perf] Add first warmup config to perf dispatch (Genesis-Embodied-AI#422)

* [AutoDiff] Autodiff 1: Add baseline adstack regression test for unary_collections (Genesis-Embodied-AI#500)

* [AutoDiff] Autodiff 2: Implement derivative for tan (Genesis-Embodied-AI#501)

* [AutoDiff] Autodiff 3: Recompute tanh/exp on the operand in the reverse pass (Genesis-Embodied-AI#502)

* [AutoDiff] Autodiff 4: Mark rsqrt as non-linear for adstack promotion (Genesis-Embodied-AI#503)

* [AutoDiff] Autodiff 5: Fix adjoint-alloca placement for GlobalLoads outside the current range-for (Genesis-Embodied-AI#496)

* [AutoDiff] Autodiff 6: Adstack regression tests (Genesis-Embodied-AI#491)

* [AutoDiff] Autodiff 7: Fix header size in AdStackAllocaStmt to match u64 runtime layout (Genesis-Embodied-AI#534)

* [AutoDiff] Autodiff 8: Surface LLVM adstack push/pop overflow as a Python exception (Genesis-Embodied-AI#535)

* [AutoDiff] Autodiff 9: Guard against LLVM worker-thread stack overflow from large per-task adstack budget (Genesis-Embodied-AI#495)

* [AutoDiff] Autodiff 10: Implement adstack for SPIR-V (Genesis-Embodied-AI#490)

* [AutoDiff] Autodiff 11: Latent adstack-adjacent fixes (AMDGPU hipFree, flush() keeps ctx_buffers_, always-preallocate) (Genesis-Embodied-AI#536)

* [Doc] Add AGENTS.md with instructions for AI agents (Genesis-Embodied-AI#541)

* [Bug] Abort kernel execution on assertion failure instead of segfaulting (Genesis-Embodied-AI#419)

* [Type] ndarray typing 1: Add eval_str=True to inspect.signature() calls (Genesis-Embodied-AI#411)

* [CI] Suppress reportPrivateImportUsage in torch-using files (Genesis-Embodied-AI#552)

* [Misc] QD_DUMP_IR dumps to files with the task_id added to the filename (Genesis-Embodied-AI#441)

* [Type] ndarray typing 2: Fix NDArray single-arg subscript crash (Genesis-Embodied-AI#412)

* [Test] Flush xdist channel before worker exit so test failure reports are visible (Genesis-Embodied-AI#555)

* [CI] Reduce test retries on CI from 3 to 1. (Genesis-Embodied-AI#554)

* [AutoDiff] Autodiff 12: Heap-backed adstack on LLVM backends (CPU/CUDA/AMDGPU) (Genesis-Embodied-AI#537)

* [AutoDiff] Autodiff 13: Heap-backed adstack on SPIR-V backends (Metal, Vulkan) (Genesis-Embodied-AI#493)

* [AutoDiff] Autodiff 14: Resolve bounded-inner-loop adstacks without default_ad_stack_size fallback (Genesis-Embodied-AI#539)

* [SPIRV] Vulkan SPIR-V correctness: atomic-view aliasing, PSB stride, narrow storage caps, u1 cast, per-init layer recheck (Genesis-Embodied-AI#513)

* [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch and sanitise MoltenVK cap advertisement (Genesis-Embodied-AI#551)

* [Test] Suppress stock pytest-timeout to avoid conflict with pytest_hardtle (Genesis-Embodied-AI#557)

* [Vulkan] Use SDK validation layer for debugPrintf instead of apt package (Genesis-Embodied-AI#562)

* [Test] Fix flaky perf_dispatch tests by increasing work amounts (Genesis-Embodied-AI#559)

* [Test] Add --maxfail CLI option to run_tests.py (default 20) (Genesis-Embodied-AI#558)

* [CI] Vulkan debug printf fix to address flaky tests (Genesis-Embodied-AI#563)

* [Docs] Add a new page to help for first time contributors (Genesis-Embodied-AI#426)

Authored-by: v01dxyz <v01dxyz@v01d.xyz>

* [AutoDiff] Autodiff 16: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr (Genesis-Embodied-AI#543)

* Fix: raise error if device memory allocation fails (Genesis-Embodied-AI#451) (Genesis-Embodied-AI#453)

Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>

* [CI] Add CI job to check line wrapping of comments and docs (Genesis-Embodied-AI#564)

* [Misc] Add coverage report to PRs, including kernels (Genesis-Embodied-AI#470)

* [CI] CI wrap check feeds only diffs to agent (Genesis-Embodied-AI#567)

* Skip 'flaky' test on MacOS CI. (Genesis-Embodied-AI#573)

* [Test] Fix missing `import sys` in test_fail_device_memory_allocation (Genesis-Embodied-AI#574)

* [CI] Fix Vulkan debugPrintf flake with session-scoped warmup (Genesis-Embodied-AI#571)

* [AutoDiff] determine_ad_stack_size: replace whole-CFG Bellman-Ford with SCC + DAG DP (Genesis-Embodied-AI#575)

* [Test] Fix macOS OOM skip reason to describe actual root cause (Genesis-Embodied-AI#576)

* [Lang] whole_kernel_cse: 2.5x compile time speedup on large kernels (Genesis-Embodied-AI#577)

* [CI] Add CI check for unnecessarily deleted comments (Genesis-Embodied-AI#570)

* [CI] Migrate coverage report to github Check page (Genesis-Embodied-AI#566)

* [Lang] Skip IR verifier between passes unless debug=true (Genesis-Embodied-AI#579)

* [Lang] Inline AdStack ops on release LLVM codegen: dramatically reduces compile time for adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#584)

* [CUDA] Honor offline_cache=False end-to-end so QD_OFFLINE_CACHE=0 actually gives a cold compile (Genesis-Embodied-AI#580)

* [Type] Tensor 24 (Genesis-Embodied-AI#561)

Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>

* [Lang] auto_diff host-walk reductions: dramatically faster front-end compile time on adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#587)

* [AutoDiff] Speed up reverse-mode kernel launches on GPU backends (Genesis-Embodied-AI#578)

* [Vulkan] Move adstack-sizer scratch out of Function-scope memory to fix SPIR-V pipeline build failures (Genesis-Embodied-AI#588)

* [AutoDiff] Improve diagnosis of unsupported reverse-mode AD patterns (Genesis-Embodied-AI#590)

* [Bug] Fix: promote Ndarray to AnyArray in build_Name for flattened struct fields (Genesis-Embodied-AI#592)

* [SPIR-V] Shrink reverse-grad kernel MSL by ~50% (Genesis-Embodied-AI#591)

* [CI] Add CI check that PR changes have test coverage (Genesis-Embodied-AI#596)

* [Perf] Enable zero-copy in to_torch() and to_numpy() (Genesis-Embodied-AI#450)

* Add BufferView: safe sub-range ndarray access for kernels (Genesis-Embodied-AI#585)

Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>

* [Doc] Add user-facing fastcache documentation (Genesis-Embodied-AI#597)

Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>

* [Misc] Upgrade to enable v1 dlpack so to_numpy(copy=False) writable (Genesis-Embodied-AI#598)

Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local>

* [AutoDiff] Cut reverse-mode adstack memory usage 10x on all backends (Genesis-Embodied-AI#599)

* [Misc] Add CI check for feature file factorization (Genesis-Embodied-AI#606)

* [Perf] Skip _recursive_set_args for all-Field frozen dataclass structs (Genesis-Embodied-AI#607)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [AutoDiff] SNode-arm bound-expr capture rejects fold-attack gate indices (Genesis-Embodied-AI#610)

* [Misc] Suppress field fastcache warning for qd.Tensor (Genesis-Embodied-AI#615)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [AutoDiff] Adstack heap: clip reducer count by per-task loop trip count (compile-time and SizeExpr-evaluated) (Genesis-Embodied-AI#611)

* [Misc] Forward copy= through qd.Tensor, add copy=None option (Genesis-Embodied-AI#616)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Doc] Update README (Genesis-Embodied-AI#617)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Fix coverage report showing def lines as uncovered (Genesis-Embodied-AI#623)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Perf] Generic launcher: persistent context, JIT-pointer reuse, Metal compute encoder, LLVM-GPU async memory ops (Part 1/2) (Genesis-Embodied-AI#619)

* [CI] Encode Python-first testing policy in coverage-check prompt (Genesis-Embodied-AI#622)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Add PR Line change report (Genesis-Embodied-AI#624)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Disable quadrants pytest plugin during quadrants internal coverage runs (Genesis-Embodied-AI#629)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [AutoDiff] Adstack load+store eliminations: EliminateRecomputableAdStackPushes pass + leaf extensions (Genesis-Embodied-AI#621)

* [CI] Simplify coverage PR comment to a single linked line (Genesis-Embodied-AI#630)

* [CUDA] Add AGX Thor, SM_110 (Genesis-Embodied-AI#631)

Co-authored-by: Johnny Nunez and Hugh Perkins

* [CI] Lines changed report: collapse PR comment to a single linked totals line (Genesis-Embodied-AI#632)

* [FEATURE] Support external Metal command queue via qd.init (Genesis-Embodied-AI#618)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Perf] Cache adstack-sizer metadata per task across SPIR-V + LLVM-GPU; per-snode / DeviceAllocation invalidation (Part 2/2) (Genesis-Embodied-AI#620)

* [AutoDiff] Disable EliminateRecomputableAdStackPushes pending mutated-SNode chain-leaf fix (Genesis-Embodied-AI#633)

* [AutoDiff] Adstack chain-clone safety: mutated-SNode leaf reject + load_top consumer-aware guard (Genesis-Embodied-AI#634)

* [Docs] Add user-guide page for qd.simt.block.* primitives (Genesis-Embodied-AI#638)

* [Docs] Expand qd.simt.subgroup user-guide page to cover every op (Genesis-Embodied-AI#639)

* [Perf] Streams 1-4 (Genesis-Embodied-AI#410)

* [Docs] Add user-guide page for matrix decompositions and solvers (Genesis-Embodied-AI#643)

* [Bug] Revert "[Perf] Streams 1-4 (Genesis-Embodied-AI#410)" (Genesis-Embodied-AI#650)

* [Docs] Add user-guide page for atomics and bit operations (Genesis-Embodied-AI#640)

* [Docs] Add user-guide page for qd.simt.grid.* primitives (Genesis-Embodied-AI#641)

* [AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch (Genesis-Embodied-AI#635)

* [AMDGPU] Fix amdgpu parallel rand init (Genesis-Embodied-AI#658)

* [Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap (Genesis-Embodied-AI#655)

* [Perf] Re-land Streams 1-4 with bug fixes (Genesis-Embodied-AI#653)

* [AMDGPU] Apply device_memory_GB=0.3 cap to AMDGPU tests (Genesis-Embodied-AI#659)

* [Perf] Per-launch host sync: drop wait_idle on SPIR-V, pin stream and drop stream_synchronize on CUDA/AMDGPU (Genesis-Embodied-AI#654)

* [AMDGPU] Unload hipModule_t in JITModuleAMDGPU destructor (Genesis-Embodied-AI#660)

* [AMDGPU] Trim default mempool on qd.reset() (Genesis-Embodied-AI#669)

* [AMDGPU] Hoist rand-state buffer to process lifetime (Genesis-Embodied-AI#668)

* [Streams] Use events for streams serialization on AMDGPU and CUDA (Genesis-Embodied-AI#667)

* [Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id (Genesis-Embodied-AI#671)

* [SPIR-V] dispatch_max_reducers: register each task with the real kernel name (Genesis-Embodied-AI#675)

* [AutoDiff] Debug-mode field/grad/dual: dtype, layout, and access-time invariants (Genesis-Embodied-AI#677)

* [Docs] Add user-guide page for qd.algorithms.* device-wide algorithms (Genesis-Embodied-AI#642)

Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>

* [Docs] Doc for existing atomics: switch support table to per-backend columns (Genesis-Embodied-AI#657)

Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>

* [GPU] Cross gpu atomics (Genesis-Embodied-AI#666)

Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>

* [GPU] Make block operations portable cross-gpu (Genesis-Embodied-AI#664)

* [Perf] CPU LLVM adstack-cache: skip per-launch bump-writes + ndarray_shapes capture on forward-only handles (Genesis-Embodied-AI#685)

* [GPU] Cross-GPU for grid ops (Genesis-Embodied-AI#670)

* [Math] Make bitop operations portable cross-gpu (Genesis-Embodied-AI#662)

* [AMDGPU] Always use wave64, on both RDNA and CDNA (Genesis-Embodied-AI#687)

* [AMDGPU] Use syncscope("agent") for atomic xor to avoid CAS livelock (Genesis-Embodied-AI#672)

* [GPU] New bit ops for QIPC (Genesis-Embodied-AI#679)

* [GPU] Subgroup ops cross-gpu (Genesis-Embodied-AI#665)

* [Graph] Rename CUDA Graph to Graph in docs (Genesis-Embodied-AI#691)

* [SPIR-V] Fix FIFO-queue ordering when sharing command queue. (Genesis-Embodied-AI#694)

* [Atomics] New QIPC ops for atomics (Genesis-Embodied-AI#690)

* Pass dataclass sub-structs into qd.func (Genesis-Embodied-AI#698)

* [AMDGPU] HIP graph runtime support for @qd.kernel(graph=True) (Genesis-Embodied-AI#692)

* [CI] Add per-file timing report to Mac Metal test job (Genesis-Embodied-AI#695)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Enable kernel disk cache during tests (Genesis-Embodied-AI#696)

* [Math] New QIPC ops for single-threaded linalg (Genesis-Embodied-AI#683)

* [BREAKING][GPU] New QIPC ops for subgroups (Genesis-Embodied-AI#676)

* [GPU] New QIPC ops for block (Genesis-Embodied-AI#684)

* [GPU] New device-level ops for QIPC (Genesis-Embodied-AI#693)

* [algorithms] PrefixSumExecutor: drop unused GRID_SZ local (Genesis-Embodied-AI#701)

* [block] sync(): fix unsupported-arch error message (Genesis-Embodied-AI#700)

* [volatile_load] add qd.volatile_load primitive (closes Genesis-Embodied-AI#648) (Genesis-Embodied-AI#702)

* [AutoDiff] Reject recycled identity_key in AdStackCache::register_adstack_sizing_info (Genesis-Embodied-AI#708)

* [Vulkan] Declare GroupNonUniform SPIR-V caps and enable shaderSubgroupExtendedTypes (Genesis-Embodied-AI#707)

* Fix duplicate HIP graph driver-function declarations after v1.0.0 merge

The amd-integration fork had cherry-picked the HIP graph driver functions
(graph_create / graph_destroy / graph_add_kernel_node / graph_instantiate /
graph_exec_destroy / graph_launch), and upstream v1.0.0 added the same set.
The per-file 3-way merge appended both copies into
amdgpu_driver_functions.inc.h, producing redeclaration errors that broke the
AMDGPU RHI/runtime compile. Drop the upstream duplicate block; the signatures
are identical to the fork's existing declarations.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix AMDGPU launcher coherence and num_instructions visibility after v1.0.0 merge

- kernel_launcher.cpp: the 3-way merge spliced upstream v1.0.0's launch_llvm_kernel
  rewrite (ephemeral arg/context buffers, explicit-stream path, AmdgpuDefaultStream
  PinGuard) onto the AMD fork's kernarg-by-value + persistent-scratch design,
  leaving references to undefined `ephemeral_context_ptr`. Restore the fork's
  coherent launch_llvm_kernel verbatim; it calls the (already merged) enhanced
  launch_offloaded_tasks, which keeps the max-reducer dispatch and stream-parallel
  groups adapted onto the AMD launch path.
- llvm_context.h: both the fork and upstream added `num_instructions`; the merge
  kept upstream's private placement, but the AMDGPU codegen force-inline heuristic
  calls it statically from outside the class. Move it back to the public section.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Restore async result D2H and hoist kernarg vectors in AMDGPU launcher

The v1.0.0 merge resolution regressed two amd-integration baseline
optimizations in launch_llvm_kernel / launch_offloaded_tasks:

  - The per-launch result-buffer copy was a blocking memcpy_device_to_host,
    forcing a host stall on every value-returning launch and serializing the
    GPU pipeline. Restore the async D2H (the caller synchronizes lazily when it
    needs the value); external-array transfers still stream_synchronize once
    before reading back.

  - launch_task constructed the kernarg std::vectors from initializer lists
    ({kernarg_payload} / {kernarg_size}) on every dispatch (heap alloc + free
    per launch). Hoist arg_ptrs/arg_sizes out of the per-task launch and reuse.

Co-authored-by: Cursor <cursoragent@cursor.com>
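
The kernarg hoist described in this commit can be sketched as follows. This is a minimal illustration, not the launcher's actual code: `LaunchScratch` and `fill_kernargs` are hypothetical names, and only `arg_ptrs`, `arg_sizes`, `kernarg_payload`, and `kernarg_size` come from the commit message. The idea is that vectors owned outside the per-task launch are cleared and refilled, so their storage is reused instead of being allocated and freed on every dispatch.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical scratch state that lives across launches (illustrative only,
// not the real AMDGPU launcher type).
struct LaunchScratch {
  std::vector<void *> arg_ptrs;
  std::vector<std::size_t> arg_sizes;
};

// Refill the reusable vectors for one dispatch. clear() drops the elements
// but, on mainstream standard libraries, keeps the allocated capacity, so
// dispatches after the first one do not touch the heap.
void fill_kernargs(LaunchScratch &scratch, void *kernarg_payload, std::size_t kernarg_size) {
  scratch.arg_ptrs.clear();
  scratch.arg_sizes.clear();
  scratch.arg_ptrs.push_back(kernarg_payload);
  scratch.arg_sizes.push_back(kernarg_size);
}
```

By contrast, building `{kernarg_payload}` and `{kernarg_size}` as fresh `std::vector` objects inside the launch path constructs and destroys two heap buffers per dispatch, which is the cost the commit removes.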

* amdgpu: default to LDS permlane64 emulation; drop host-x86 barrier asm on retarget

Two AMDGPU JIT-compile crashes surfaced after the v1.0.0 merge pulled in the QIPC subgroup
ops (Genesis-Embodied-AI#676), which made the rigid constraint solver's wave-cooperative reductions route through
`amdgpu_cross_half_shuffle_i32`. Both manifested as a SIGSEGV inside
`llvm::SIInstrInfo::getInstSizeInBytes` during `JITSessionAMDGPU::compile_module_to_hsaco`
(i.e. at first kernel launch), and reproduce on gfx942 / MI300X. Baseline 0.4.6 never emitted
these constructs, which is why it was unaffected.

1. Native `llvm.amdgcn.permlane64` lowering crashes the bundled LLVM 22.1.0 AMDGPU backend.
   Default `amdgpu_permlane64` to the existing LDS-roundtrip software emulation on every target
   (it produces identical results). Add `QD_AMDGPU_USE_NATIVE_PERMLANE64=1` to opt back into the
   native instruction once the backend bug is fixed; the fallback that `QD_AMDGPU_FORCE_PERMLANE64_FALLBACK`
   used to select is now the default, and that variable is still honored. This is the actual crash fix.

2. The runtime module is compiled by the host x86_64 clang and only retargeted to amdgcn here, so
   `amdgpu_cross_half_shuffle_i32`'s `__asm__ volatile("" : "+v"(byte))` optimization barrier carries
   x86 flag clobbers (`~{dirflag},~{fpsr},~{flags}`) that are meaningless on AMDGPU. The IR verifies
   but the empty-body INLINEASM is invalid on the amdgcn target. Neutralize empty-body barrier asm
   during retarget (forward the tied value, then erase) so no stale host asm reaches codegen. On the
   wave64 targets we ship, `ds_bpermute` already addresses the full wave, so the hint is a no-op.

Co-authored-by: Cursor <cursoragent@cursor.com>
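
The opt-in logic from item 1 of this commit might look roughly like the sketch below. The two environment variable names come from the commit message; the function names, the exact-match test against `"1"`, and the rule that the fallback variable wins when both are set are assumptions made for illustration, since the commit message does not spell them out.

```cpp
#include <cstdlib>
#include <cstring>

// Pure decision helper (hypothetical name). Arguments are the raw values of
// QD_AMDGPU_USE_NATIVE_PERMLANE64 and QD_AMDGPU_FORCE_PERMLANE64_FALLBACK,
// or nullptr when a variable is unset.
// Assumption: only the exact string "1" enables a flag, and the legacy
// fallback flag takes precedence over the native opt-in.
bool use_native_permlane64(const char *use_native, const char *force_fallback) {
  const bool fallback_forced = force_fallback != nullptr && std::strcmp(force_fallback, "1") == 0;
  const bool native_requested = use_native != nullptr && std::strcmp(use_native, "1") == 0;
  return native_requested && !fallback_forced;
}

// Thin wrapper that reads the process environment. With neither variable
// set it returns false, i.e. the LDS-roundtrip emulation is the default.
bool use_native_permlane64_from_env() {
  return use_native_permlane64(std::getenv("QD_AMDGPU_USE_NATIVE_PERMLANE64"),
                               std::getenv("QD_AMDGPU_FORCE_PERMLANE64_FALLBACK"));
}
```

Splitting the decision from the `getenv` calls keeps the policy testable without mutating the process environment.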

* style: apply clang-format (v19.1.7) to AMDGPU fn_attrs and launcher sources

CI pre-commit's clang-format hook reformatted these files (long
declarations/lambda signatures collapsed onto single lines per the repo's
clang-format config). Apply the same formatting so the hook passes.

No functional changes.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(amdgpu): use CreateNeg for branchless i32 sgn instead of CreateSub(0, input)

clang-tidy (modernize-use-nullptr, -warnings-as-errors) flagged
`builder->CreateSub(0, input)` in the i32 sgn path: the literal `0` binds to
the `llvm::Value*` LHS parameter as a null pointer, not an integer zero.
Replace with `builder->CreateNeg(input)`, which emits `0 - input` with a proper
zero constant -- identical intended semantics, and clang-tidy clean.

Co-authored-by: Cursor <cursoragent@cursor.com>
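
The pitfall behind the `CreateSub(0, input)` fix can be reproduced without LLVM. In the sketch below, `Value`, `lhs_is_null`, and `negate` are hypothetical stand-ins, not LLVM APIs; they only mirror the shape of a builder method whose operands are pointers. Because the integer literal `0` is a null pointer constant in C++, passing it where a pointer is expected silently yields a null operand rather than an integer zero.

```cpp
#include <cassert>

// Hypothetical stand-in for an IR value; this is not llvm::Value.
struct Value {
  int payload;
};

// Mirrors a builder method with pointer operands, in the spirit of
// CreateSub(Value *lhs, Value *rhs). Reports whether lhs arrived as null.
bool lhs_is_null(const Value *lhs, const Value *rhs) {
  (void)rhs;  // unused; kept so the signature resembles a binary op
  return lhs == nullptr;
}

// Negation that materializes its own zero operand, in the spirit of
// CreateNeg: the caller never has to spell the zero.
int negate(const Value *input) {
  assert(input != nullptr);
  const Value zero{0};
  return zero.payload - input->payload;
}
```

Calling `lhs_is_null(0, &v)` compiles and returns `true`, which is the same conversion clang-tidy's `modernize-use-nullptr` check flagged in the i32 sgn path; a helper like `negate` removes the opportunity for the mistake.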

---------

Co-authored-by: Robert Dazi <14996868+v01dXYZ@users.noreply.github.com>
Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>
Co-authored-by: Alexis DUBURCQ <alexis.duburcq@gmail.com>
Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>
Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>
Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com>
Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Johnny <johnnynuca14@gmail.com>