[AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch by duburcqa · Pull Request #635 · Genesis-Embodied-AI/quadrants

duburcqa · 2026-05-06T14:33:49Z

Adstack max-reducer: parallel `MaxOverRange` dispatch with `1<<24` cap-hit tripwires

Fifteen commits, no behaviour change for users whose reverse-mode kernels never had a MaxOverRange axis above the existing 1<<24 adstack-sizer cap. Adds a per-tree parallel max-reducer that pre-evaluates recognized MaxOverRange shapes at launch and substitutes the result as a Const before any of the three adstack-sizer eval paths walks the tree. Promotes the silent truncation at the cap to a hard error on every backend whose sizer can detect it.

TL;DR

A reverse-mode kernel like

@qd.kernel
def compute(a: qd.types.ndarray(dtype=qd.i32, ndim=1)):
    for i in range(a.shape[0]):
        v = x[i]
        for _ in range(a[i]):
            v = v * 0.95 + 0.01
        y[None] += v

lowers to a per-stack SizeExpr containing MaxOverRange(0, a.shape[0], a[var]). Before this PR the adstack sizer enumerated that range linearly on every launch, with a hard 1<<24 cap above which the host evaluator raised RuntimeError, the LLVM device sizer silently truncated, and the SPIR-V on-device sizer silently clamped. Above-cap axes therefore either failed loud-but-confusing on CPU or produced wrong heap strides and corrupted gradients on GPU.
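
To make the node concrete, here is a minimal pure-Python model of what `MaxOverRange(begin, end, body)` denotes. It is a sketch, not Quadrants code: `max_over_range` is a hypothetical helper, and returning 0 for an empty range is an assumption (trip counts are non-negative, so 0 is a natural floor).

```python
from typing import Callable


def max_over_range(begin: int, end: int, body: Callable[[int], int]) -> int:
    """Reference model: the largest body(var) for var in [begin, end).

    Assumption: an empty range yields 0, and values are clamped below at 0.
    """
    best = 0
    for var in range(begin, end):
        best = max(best, body(var))
    return best


# Stand-in for the ndarray `a` in the kernel above: a[i] is the inner trip count.
a = [3, 0, 5, 2]

# MaxOverRange(0, a.shape[0], a[var]) is the worst-case inner trip count.
depth = max_over_range(0, len(a), lambda i: a[i])
```

The serial loop is what the per-thread sizer does; the cost grows linearly with `a.shape[0]`, which is why the sizer bounds it with a cap.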

After this PR a recognize_adstack_max_reducer_specs pre-pass captures shapes that fit a deliberately narrow grammar (chains of nested MaxOverRanges across distinct bound variables; integer ndarray and field reads up to 32 bits wide indexed by literal constants or any captured chain bound variable; integer arithmetic combinators), the launcher dispatches a generic parallel-max compute kernel per captured spec at launch time, and substitute_precomputed_max_over_range rewrites the captured MaxOverRange to a Const carrying the dispatched value before any sizer eval path walks the tree. Out-of-grammar shapes whose iteration count exceeds the cap now raise via three explicit tripwires (host evaluator QD_ERROR_IF; SPIR-V on-device sizer metadata-trailing overflow-flag slot; LLVM device sizer cap-hit short-circuit + indirect stack_push overflow) instead of silently undersizing the heap.

Why

compute_bounded_adstack_size in quadrants/transforms/determine_ad_stack_size.cpp emits MaxOverRange(begin, end, body) nodes whose iteration count is bounded only by the underlying ndarray axis. Three eval paths consume the resulting trees per launch:

- Host evaluator (adstack/eval.cpp::evaluate_node): hard QD_ERROR_IF at end - begin > 1<<24, on by default through evaluate_adstack_size_expr on the CPU host fast path.
- LLVM device sizer interpreter (runtime_eval_adstack_size_expr in quadrants/runtime/llvm/runtime_module/runtime.cpp): break at the same threshold (silent truncation on CUDA / AMDGPU LLVM-GPU).
- SPIR-V on-device sizer (adstack_sizer_shader.cpp): silent clamp effective_end = min(end, begin + (1<<24)) on Metal / Vulkan.

When the gating ndarray axis exceeds 1<<24 cells, every device path returned an under-bound on per-thread stack depth. The heap then either overflowed at qd.sync() with an opaque message naming the wrong kernel, or silently corrupted gradients with no error at all. The host path's hard error was the loud version, on by default and used as a tripwire today; it does not cover the GPU paths.
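
The difference between the silent clamp and the hard error can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual sizer code: `clamped_max` and `checked_max` are hypothetical names, the error text is made up, and a tiny `demo_cap` stands in for the real `1<<24` threshold so the example runs instantly.

```python
CAP = 1 << 24  # the adstack-sizer cap named in this PR


def clamped_max(values, cap):
    """Modelled old device behaviour: only the first `cap` cells are visited."""
    return max(values[:cap], default=0)


def checked_max(values, cap):
    """Modelled hard-error behaviour: refuse to walk a range longer than `cap`."""
    if len(values) > cap:
        raise RuntimeError(f"iteration count {len(values)} exceeds the {cap} guard")
    return max(values, default=0)


# Tiny cap for the demo only; the real threshold is CAP.
demo_cap = 4
trip_counts = [1, 2, 1, 1, 9]  # the largest trip count sits past the cap

under_bound = clamped_max(trip_counts, demo_cap)  # misses the 9
true_bound = max(trip_counts)
```

The clamped walk reports 2 while the kernel actually needs 9 pushes, which is the under-sized heap described above.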

The fix preserves the cap as an internal safety latch (the per-thread sizer's serial walk is still bounded) but moves the actual evaluation of recognized shapes onto a parallel-dispatch path that scales past the cap, and turns cap-hits on the remaining out-of-grammar shapes into hard errors instead of silent truncation.

Surface API

None. The change is purely internal to the adstack-sizer pipeline. Users who never tripped the cap see no behaviour change; users whose recognized kernels did trip the cap stop seeing wrong gradients; users whose out-of-grammar kernels would have tripped the cap now see a RuntimeError / QuadrantsAssertionError at the next qd.sync() instead of silent truncation.

Mechanism end-to-end

1. Pre-pass shape recognition

quadrants/program/adstack/max_reducer.{h,cpp}::recognize_adstack_max_reducer_specs(size_exprs) walks each per-stack SerializedSizeExpr post-order and returns a std::vector<StaticAdStackMaxReducerSpec> describing every MaxOverRange node whose:

- begin and end subtrees are closed-form (Const / ExternalTensorShape / Add / Sub / Mul / Max, plus any MaxOverRange already captured deeper in the same tree),
- body subtree references only Const, ExternalTensorRead(arg, [...]) (single- or multi-axis, indexed by literal constants or any captured chain bound variable, leaf dtype restricted to 32-bit-or-narrower integer), FieldLoad(snode, [...]) (same index restriction; the literal-only path host-folds to Const at encode time, the bound-var path emits a kFieldLoad device node), ExternalTensorShape, and Add / Sub / Mul / Max of those.

Multi-axis support: the recognizer descends through nested MaxOverRanges as long as each inner [begin, end) is closed-form (Const / ExternalTensorShape / captured-deeper-MORs); each layer adds one axis to the captured spec, and the dispatch enumerates the cross-product of every axis. Specs come back in dependency order (deepest first); each dispatch's result becomes the substituted Const an outer spec's begin / end may reference. Captured ids are stored in task_attribs.ad_stack.max_reducer_specs (SPIR-V) and current_task->ad_stack.max_reducer_specs (LLVM); both backends populate the field at codegen time (spirv_codegen.cpp, codegen_llvm.cpp).

The integer-leaf dtype restriction (i8 / i16 / i32 / u8 / u16 / u32 only) gates the cache-revalidation sentinel: populate_max_reducer_body_observations records INT64_MIN as the observed value, and the replay path's gen-mismatch dereference must return a value strictly greater than the sentinel to force invalidation. A 64-bit leaf could legally hold INT64_MIN and false-hit on a mutated entry, so those leaves fall through to the per-task sizer's capped path.
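
The sentinel argument reduces to a range check, sketched below in Python. This is only an illustration of the reasoning: `DTYPE_RANGES` and `sentinel_is_unreachable` are hypothetical names, and the real recognizer's check is not shown in this description.

```python
INT64_MIN = -(1 << 63)  # the cache-revalidation sentinel

# Value ranges of the six accepted leaf dtypes, plus i64 for contrast.
DTYPE_RANGES = {
    "i8": (-(1 << 7), (1 << 7) - 1),
    "i16": (-(1 << 15), (1 << 15) - 1),
    "i32": (-(1 << 31), (1 << 31) - 1),
    "u8": (0, (1 << 8) - 1),
    "u16": (0, (1 << 16) - 1),
    "u32": (0, (1 << 32) - 1),
    "i64": (INT64_MIN, (1 << 63) - 1),
}

ACCEPTED = ("i8", "i16", "i32", "u8", "u16", "u32")


def sentinel_is_unreachable(dtype: str) -> bool:
    """True when no value of `dtype` can equal the INT64_MIN sentinel."""
    lowest, _highest = DTYPE_RANGES[dtype]
    return lowest > INT64_MIN
```

Every accepted dtype has a minimum strictly above `INT64_MIN`, so a re-read value can never be mistaken for the sentinel; `i64` fails that test.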

StaticAdStackMaxReducerSpec lives in quadrants/transforms/static_adstack_analysis.h with a QD_IO_DEF so the spec round-trips through the offline cache. The struct carries axis_var_ids / axis_begin_node_idxs / axis_end_node_idxs (one entry per captured axis, outermost-first) plus dependent_mor_node_idxs listing the captured deeper-MOR keys the spec's begin / end references.

2. Generic max-reducer kernels - one per backend family

| Backend | File | Mechanism |
| --- | --- | --- |
| SPIR-V | `quadrants/codegen/spirv/adstack_max_reducer_shader.{h,cpp}` | Compute shader, `kAdStackMaxReducerWorkgroupSize=128`, strided `kElementsPerThread=64` per-thread iteration to keep `num_workgroups_x` under `maxComputeWorkGroupCount[0]=65535` for spec lengths up to ~536M. Body bytecode interpreter (`kConst / kBoundVariable / kExternalTensorRead / kFieldLoad / kAdd / kSub / kMul / kMax`). Per-spec output is two u32 slots: `[2k] = OpAtomicUMax` running max, `[2k+1] = OpAtomicOr` overflow flag. The u32+overflow split sidesteps spirv-cross's MSL backend gap on i64 atomics (`MSL currently does not support 64-bit atomics`), unlocking Metal and Vulkan-via-MoltenVK. |
| LLVM | `quadrants/runtime/llvm/runtime_module/runtime.cpp::runtime_eval_adstack_max_reduce` | Single-thread serial walk over the body bytecode, cross-product of `params.per_axis_length[]` iterations, atomic-max into `runtime->adstack_max_reducer_outputs[output_slot]`. Dispatched as a host call on CPU and as a `1x1x1` JIT-launched kernel on CUDA / AMDGPU. POD device params live in `quadrants/ir/static_adstack_max_reducer_device.h`. |
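
The "~536M" figure for the SPIR-V path follows from the three constants in the table. A short Python check of the arithmetic, assuming the launcher computes the workgroup count as a ceiling division capped at the limit (the Python names are illustrative, the values come from the table):

```python
WORKGROUP_SIZE = 128       # kAdStackMaxReducerWorkgroupSize
ELEMENTS_PER_THREAD = 64   # kElementsPerThread
MAX_WORKGROUPS_X = 65535   # maxComputeWorkGroupCount[0]


def num_workgroups_x(length: int) -> int:
    """ceil(length / (workgroup size * elements per thread)), capped at the limit."""
    per_workgroup = WORKGROUP_SIZE * ELEMENTS_PER_THREAD
    return min(-(-length // per_workgroup), MAX_WORKGROUPS_X)


# Largest spec length a single dispatch can cover without exceeding the limit.
max_covered = WORKGROUP_SIZE * ELEMENTS_PER_THREAD * MAX_WORKGROUPS_X
```

Each workgroup covers 8192 elements, so an axis of exactly `1<<24` cells needs 2048 workgroups, and a full 65535-workgroup dispatch covers 536,862,720 elements.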

The body bytecode reuses the existing AdStackSizeExprDeviceNode POD format from quadrants/ir/adstack_size_expr_device.h. encode_max_reducer_body_bytecode in quadrants/program/adstack/max_reducer.cpp extracts the body subtree, renumbers nodes to dense [0, body_node_count) indices, copies referenced index entries, and resolves kExternalTensorRead arg_buffer_offset via a closure passed by the per-backend launcher. Bound-var-indexed kFieldLoad leaves take a backend-specific base resolution: SPIR-V passes a FieldLoadDeviceEmitter whose fetch returns root_psb + place_byte_offset_in_root (pre-baked PSB address), LLVM passes a null emitter and the encoder stores (snode_root_id, place_byte_offset) in the device-node POD's arg_buffer_offset / const_value slots which the LLVM device interpreter resolves at runtime via runtime->roots[snode_root_id] + place_byte_offset.

3. Launch sequencing

| Backend | File | Helper |
| --- | --- | --- |
| SPIR-V | `quadrants/runtime/gfx/adstack_max_reducer_launch.cpp` | `GfxRuntime::dispatch_max_reducers(...)` |
| LLVM | `quadrants/runtime/llvm/llvm_adstack_lazy_claim.cpp` | `LlvmRuntimeExecutor::dispatch_max_reducers_for_tasks(...)` (overload taking `std::vector<OffloadedTask>`; per-arch launchers in `runtime/cpu/`, `runtime/cuda/`, `runtime/amdgpu/` call into it as a one-liner) |

Both helpers share a level-based round dispatch:

- Pass 1 - cache lookup keyed by (registry_id, stack_id, mor_node_idx) packed into a single uint64_t via pack_max_reducer_key in adstack/max_reducer.cpp. Hits drop straight into the result map; misses go to the pending list with back-references to the source SerializedSizeExpr and StaticAdStackMaxReducerSpec.
- Per-round prepare + dispatch. Each round picks every undispatched spec whose dependent_mor_node_idxs are all already in the result map (cache hits + earlier rounds), substitutes those values into the working tree via substitute_precomputed_max_over_range, host-evaluates begin / end against the substituted tree, encodes the body bytecode, and dispatches the round as one cmdlist (gfx) / one batched runtime-function call sequence (LLVM). Most kernels finish in one round; nested patterns (e.g. an outer MaxOverRange whose end contains a captured inner max-of-array) take one round per dependency depth. A no-progress round drops every remaining pending spec and falls back to the per-task sizer's cap-hit path.
- Per-round readback. Read u32 output slots (gfx) or i64 output slots (LLVM) at round-local indices, fall back to host-eval on overflow specs (SPIR-V; the host walks the substituted tree so already-resolved deps are folded in), record into AdStackCache::record_max_reducer_eval so the next launch can short-circuit. The recorded read observations come from populate_max_reducer_body_observations which snapshots observed_devalloc + observed_gen (ndarray) and snode_write_gen (field) so a host-side mutation of either source invalidates the cache cleanly.
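
The round selection described above amounts to a level-by-level walk of the dependency graph. The Python sketch below models only that scheduling step, under simplifying assumptions: specs are plain string ids, `deps` stands in for `dependent_mor_node_idxs`, and `schedule_rounds` is a hypothetical name rather than a function in the PR.

```python
def schedule_rounds(deps, cached):
    """Group pending specs into dispatch rounds.

    deps:   spec id -> set of spec ids whose results it needs
    cached: spec ids already resolved by the cache lookup
    Returns (rounds, dropped); `dropped` holds specs left behind by a
    no-progress round, which fall back to the capped per-task sizer.
    """
    resolved = set(cached)
    pending = [spec for spec in deps if spec not in resolved]
    rounds = []
    while pending:
        ready = [spec for spec in pending if deps[spec] <= resolved]
        if not ready:
            return rounds, pending  # no progress: give up on the rest
        rounds.append(ready)
        resolved.update(ready)
        pending = [spec for spec in pending if spec not in resolved]
    return rounds, []


deps = {
    "inner": set(),         # plain max-of-array
    "outer": {"inner"},     # its end bound references the inner result
    "orphan": {"missing"},  # depends on something that was never captured
}
rounds, dropped = schedule_rounds(deps, cached=set())
```

With a cold cache the nested pattern takes two rounds; if `inner` is already cached, `outer` dispatches in the first round.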

The dispatch must precede publish_adstack_metadata_spirv (gfx) / publish_adstack_metadata (LLVM) so the substituted Consts are in place before the sizer eval pipeline runs.

On Apple Silicon Metal the body interpreter loads ndarray data buffers and SNode tree root buffers via PSB (raw bufferDeviceAddress), bypassing the descriptor-bound resource tracking, so the gfx launcher calls track_physical_buffer(...) once per cmdlist for every ndarray_alloc and every root_buffer_ (the useResource: hint Metal needs to mark those buffers resident for the dispatch).

4. Substitution into per-stack trees

quadrants/program/adstack/max_reducer.cpp::substitute_precomputed_max_over_range(expr, registry_id, stack_id, results) walks expr.nodes and replaces every captured MaxOverRange whose key is in results with a Const(dispatched_value). Empty-input fast path: when no captured spec matches, returns expr unchanged with no allocation.
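
A minimal Python model of that rewrite, including the no-match fast path. The node layout (a flat list of dicts with a `kind` field) is a hypothetical stand-in for `SerializedSizeExpr`, and `substitute_precomputed` is an illustrative name; only the key shape `(registry_id, stack_id, mor_node_idx)` comes from the description.

```python
def substitute_precomputed(nodes, registry_id, stack_id, results):
    """Replace captured MaxOverRange nodes with Const nodes holding their result.

    results: (registry_id, stack_id, node_idx) -> dispatched max value.
    Returns the input list itself when nothing matches (no allocation).
    """
    def is_hit(idx, node):
        return node["kind"] == "MaxOverRange" and (registry_id, stack_id, idx) in results

    if not any(is_hit(idx, node) for idx, node in enumerate(nodes)):
        return nodes  # fast path: same object, untouched

    return [
        {"kind": "Const", "value": results[(registry_id, stack_id, idx)]}
        if is_hit(idx, node)
        else node
        for idx, node in enumerate(nodes)
    ]


tree = [
    {"kind": "Const", "value": 1},
    {"kind": "MaxOverRange", "body": 0},
    {"kind": "Add", "lhs": 0, "rhs": 1},
]
rewritten = substitute_precomputed(tree, 7, 0, {(7, 0, 1): 5})
```

The input tree is left unmodified, so a cached original can still be evaluated or re-substituted on a later launch.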

Three eval paths consume the substituted tree:

- Host fast path (eval_per_task_metadata_on_host in runtime/gfx/adstack_sizer_launch.cpp; LLVM host-eval branch in llvm_adstack_lazy_claim.cpp). The host evaluator's pointer-keyed size_expr_cache_ cannot accept a stack-local substituted tree (a transient stack address would alias unrelated cache entries across launches and return wrong cached values), so the substitution-active branch routes through a dedicated evaluate_adstack_size_expr_no_cache(...) variant; the empty-results fast path keeps the live a.size_expr reference and the cache stays warm for kernels that never trigger the recognizer.
- SPIR-V on-device sizer encoder (encode_adstack_size_expr_device_bytecode_for_spirv). The encoder walks the substituted tree where each captured MaxOverRange is already a Const, so the body's ExternalTensorRead / FieldLoad leaves are not in the encoder's reads list; AdStackCache::lookup_max_reducer_reads(...) returns the recorded body observations for each captured spec, and the encoder appends them to its reads list before recording into spirv_bytecode_cache_. A mutation to the gating ndarray / field then invalidates the cached bytecode via the same gen-counter replay path the existing per-task metadata cache uses.
- LLVM device sizer encoder (encode_adstack_size_expr_device_bytecode). Same substitution; same downstream llvm_per_task_ad_stack_cache_ machinery.

5. Cap-hit tripwires (`1<<24`)

The 1<<24 per-task sizer cap is structurally unreachable for max-reducer-recognized shapes (those are dispatched in parallel and substituted to Const before the sizer walks). It is reachable only for out-of-grammar shapes whose iteration count exceeds the cap. Three explicit tripwires:

| Site | Mechanism |
| --- | --- |
| Host evaluator (`evaluate_node`) | Existing hard `QD_ERROR_IF`; surfaces as `RuntimeError` to Python on the CPU host fast path. |
| SPIR-V on-device sizer (`adstack_sizer_shader.cpp`) | Metadata buffer layout grew a trailing u32 overflow-flag slot at index `2 + 2*n_stacks`. The shader writes 1 there on `end - begin > cap`, and clamps `effective_end = begin` so the walk stays bounded. The host post-readback in `publish_adstack_metadata_spirv` raises `QD_ERROR_IF` when the slot is non-zero. |
| LLVM device sizer (`device_eval_node`) | Cap-hit short-circuit: `kMaxOverRange` returns 0 immediately on `end - begin > cap` to keep the single-thread on-device dispatch within the driver's TDR window. The cap-hit then surfaces indirectly through the existing `stack_push` overflow infrastructure on the subsequent main-kernel launch. The diagnostic message attribution depends on the kernel layout. |
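
The SPIR-V tripwire's host-side check can be modelled as a single read of the trailing slot. In this Python sketch only the slot index `2 + 2*n_stacks` is taken from the table; the contents of the earlier slots are not modelled, `check_sizer_metadata` is a hypothetical name, and the error text is illustrative.

```python
def overflow_slot_index(n_stacks: int) -> int:
    """Index of the trailing u32 overflow flag in the sizer metadata buffer."""
    return 2 + 2 * n_stacks


def check_sizer_metadata(metadata, n_stacks):
    """Host post-readback check: raise if the shader flagged a cap hit."""
    if metadata[overflow_slot_index(n_stacks)] != 0:
        raise RuntimeError(
            "adstack sizer hit the 1<<24 cap on an out-of-grammar MaxOverRange"
        )


n_stacks = 3
clean = [0] * (overflow_slot_index(n_stacks) + 1)  # flag slot is the last entry
tripped = list(clean)
tripped[overflow_slot_index(n_stacks)] = 1         # what the shader writes on cap hit
```

Because the flag lives in the same buffer the host already reads back, the check adds no extra transfer.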

6. Cache invalidation

The per-spec result cache integrates into the existing AdStackCache four-layer cascade:

- try_max_reducer_cache_hit (one entry per captured (registry_id, stack_id, mor_node_idx)). Hit -> no max-reducer dispatch, the cached Const is substituted into the per-stack tree.
- try_size_expr_cache_hit (per-SerializedSizeExpr after substitution). Hit -> no per-thread sizer eval call.
- try_per_task_ad_stack_cache_hit / try_llvm_per_task_ad_stack_cache_hit (per-task metadata blob). Hit -> no per-task sizer dispatch.
- try_spirv_bytecode_cache_hit (per-task bytecode blob). Hit -> no SPIR-V bytecode encode + upload.

In steady state with an unchanged gating source every layer hits and the per-launch overhead of the max-reducer pipeline collapses to zero. A host-side Ndarray.write bumps ndarray_data_gen_; a host-side field write bumps snode_write_gen. Either bump propagates through every layer's gen-counter replay walk and forces a fresh dispatch.

FieldLoadObs records produced by the bound-var FieldLoad encoder path carry indices = {} since the body is evaluated at every cross-product iteration and there is no canonical scalar to re-read; replay_one_observation's FieldLoadObs arm treats the gen counter as the sole staleness signal in that mode and unconditionally invalidates on a gen mismatch.
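
The gen-only staleness rule for bound-var `FieldLoadObs` records can be illustrated with a small Python cache. This is a sketch of the idea, not the `AdStackCache` implementation: `GenCountedCache` is a hypothetical class, and a single integer stands in for `snode_write_gen`.

```python
class GenCountedCache:
    """Cache whose entries are valid only while the source generation is unchanged."""

    def __init__(self):
        self._entries = {}  # key -> (value, generation observed when recorded)

    def record(self, key, value, observed_gen):
        self._entries[key] = (value, observed_gen)

    def lookup(self, key, current_gen):
        """Return the cached value, or None after invalidating a stale entry."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, observed_gen = entry
        if observed_gen != current_gen:
            # Gen mismatch is the sole staleness signal: drop unconditionally.
            del self._entries[key]
            return None
        return value


cache = GenCountedCache()
key = (7, 0, 1)  # (registry_id, stack_id, mor_node_idx)
cache.record(key, 5, observed_gen=0)

warm = cache.lookup(key, current_gen=0)   # source untouched: hit
stale = cache.lookup(key, current_gen=1)  # a host write bumped the generation: miss
```

A miss here is what forces the next launch to re-dispatch the max-reducer and record a fresh entry.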

Per-backend coverage matrix

| Backend | Recognized `MaxOverRange` dispatch | Cap-hit tripwire (out-of-grammar `MaxOverRange`) |
| --- | --- | --- |
| CPU (LLVM host eval) | Host call to `runtime_eval_adstack_max_reduce` ✓ | `evaluate_node` `QD_ERROR_IF` ✓ (raised as `RuntimeError`) |
| CUDA | LLVM-GPU `1x1x1` kernel ✓ | `device_eval_node` short-circuit + indirect `stack_push` overflow |
| AMDGPU | LLVM-GPU `1x1x1` kernel ✓ | same as CUDA |
| Vulkan (native + MoltenVK) | SPIR-V compute shader (u32 atomicMax + atomicOr overflow) ✓ | sizer metadata-trailing slot ✓ (raised as `QuadrantsAssertionError`) |
| Metal | same as Vulkan ✓ | same as Vulkan ✓ |

Tests - `tests/python/test_adstack.py`

Six new regression tests, all parametrized over every available backend.

`test_max_reducer_pins_stride_for_oversized_axis`

Parametrized over a (shape, body_kind) matrix that exercises the recognizer's accepted body grammar (single-axis ETR, ETR + ExternalTensorShape host-fold, closed FieldLoad host-fold, and the Add / Sub / Mul / Max arithmetic combinator). For each shape the dispatch + substitution produces the correct heap stride and the kernel runs to completion; the above-cap variants additionally pin the contract that a recognized spec ranges over an arbitrarily large axis. Uses qd.ndarray rather than numpy passthrough so the device buffer is not capped at backend-specific H2D-blit limits.

`test_max_reducer_dispatch_counts_advance_on_input_mutation`

Pins the dispatch + cache invalidation pipeline via a new Program.get_max_reducer_dispatch_count / reset_max_reducer_dispatch_count Python binding (counter on AdStackCache, bumped at every record_max_reducer_eval). The first launch fires at least one dispatch; a host mutation of the gating ndarray bumps ndarray_data_gen_ and the next launch re-dispatches.

`test_max_reducer_grammar_fallback`

A reverse-mode kernel whose inner trip count is a compile-time constant produces no MaxOverRange and the recognizer captures nothing. The dispatch counter stays at zero; the kernel still produces the correct gradient. Pins the contract that any kernel outside the captured grammar runs unchanged so future grammar broadening cannot silently drop the fallback path.

`test_max_reducer_field_load_bound_var_dispatch`

Eight-variant parametrized test pinning the bound-var-indexed FieldLoad body grammar. Body shapes cover field[i] on its own, field[i] + arr[i] (mixed FieldLoad + ETR via Add), arr[i] + field[i] (commuted), max(field[i], arr[i]), max(field[i], const), max(field[i] + 0, field[i] * 1 - 0) (full arithmetic combinator), and the conservative-wrapper path field[field[i]] / arr[field[i]] (the trip-count builder substitutes MaxOverRange(var, 0, leaf_snode.shape, body=Load(snode, [var])) for any nested-load index that does not reduce to a single bound-var or const). Across all variants the body's max value over the indexed range is N_X and the gradient assertion is uniform.

`test_max_reducer_field_load_bound_var_cache_invalidates_on_snode_mutation`

Pins the cache invalidation contract for the bound-var FieldLoad body path: the encoder pushes a FieldLoadObs keyed on the snode's write generation, mutating field_a[M-1] from Python bumps snode_write_gen, and the next launch redispatches.

`test_above_cap_out_of_grammar_kernel_raises`

A kernel whose inner range bound is an i64 ndarray read fails the recognizer's dtype restriction (the INT64_MIN cache-revalidation sentinel is unreachable for sub-i64 dtypes; for i64 a mutated cell could legally hold the sentinel and false-hit on revalidation). The whole spec is dropped and the per-task sizer walks the outer MaxOverRange itself. With a.shape[0] > 1<<24 the cap fires on every adstack-sizer eval path: RuntimeError from the host evaluator on CPU, QuadrantsAssertionError from the SPIR-V on-device sizer on Metal / Vulkan, and an indirect raise via stack_push overflow on CUDA / AMDGPU LLVM-GPU.

Side-effect audit

| Concern | Where checked | Verdict |
| --- | --- | --- |
| Offline cache key (per-task attribs) | `StaticAdStackMaxReducerSpec` round-trips via `QD_IO_DEF`; `max_reducer_specs` added to the `QD_IO_DEF` of the SPIR-V `AdStackSizingAttribs` and the LLVM `AdStackSizingInfo` | Auto-covered |
| `size_expr_cache_` pointer aliasing | `evaluate_adstack_size_expr_no_cache` for the substitution-active branch only | Direct fix |
| `spirv_bytecode_cache_` observations | `lookup_max_reducer_reads` accessor + encoder appends body reads to the cache entry's read list | Direct fix |
| `per_task_ad_stack_cache_` deps | `collect_size_expr_dep_keys` walks the original tree (pre-substitution) so body reads' arg-ids are still tracked | Auto-covered |
| Encoder enum translation | `encode_max_reducer_body_bytecode` maps `SizeExpr::Kind` -> `AdStackSizeExprDeviceKind` per kind explicitly | Direct fix (matches the per-task-sizer encoder's existing pattern) |
| Inter-spec dependency ordering | Round-based dispatch picks specs whose `dependent_mor_node_idxs` are all resolved; substitutes earlier-round results into the working tree before host-evaluating begin / end and encoding the body | Direct fix (replaces a prior single-pass dispatch that walked through unresolved nested MORs) |
| Apple Silicon Metal PSB residency | `track_physical_buffer` called once per cmdlist on every ndarray data buffer and every `root_buffers_` SNode tree root buffer (covers both `kExternalTensorRead` and `kFieldLoad` body leaves) | Direct fix |
| Hard cap-gate at `publish_adstack_metadata_spirv` | `QD_ERROR_IF(!spirv_has_physical_storage_buffer ...)` + `QD_ERROR_IF(!spirv_has_int64 ...)` at the entry; drops redundant per-helper cap gates | New gate, no backend regression - Vulkan 1.3 promotes both caps into core, Metal Tier 2 advertises both |
| `kElementsPerThread` shader strided iteration | `num_workgroups_x = ceil(length / (kAdStackMaxReducerWorkgroupSize * kElementsPerThread))`, capped at 65535 in the launcher | Covers spec lengths up to ~536M elements per dispatch |
| Per-launch dispatch cost in steady state | Four-layer cache cascade short-circuits when neither `ndarray_data_gen_` nor `snode_write_gen` advances | Zero per-launch overhead in cache-warm steady state |
| Out-of-grammar shapes | Recognizer skips silently; per-task sizer falls through; cap-hit tripwires raise hard errors | No silent gradient corruption on any backend with explicit tripwire support |

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f95788e1a4

github-actions · 2026-05-06T15:36:27Z

Total: 23 file(s) changed, +1731 -29 code lines.

github-actions · 2026-05-06T16:48:36Z

Diff coverage: 99% · 123 lines, 1 missing

duburcqa · 2026-05-06T17:15:07Z

@claude review

github-actions · 2026-05-06T17:51:18Z

Total: 23 file(s) changed, +1993 -28 code lines.

github-actions · 2026-05-06T18:41:11Z

Diff coverage: 95% · 123 lines, 6 missing

github-actions · 2026-05-06T20:22:13Z

Total: 36 file(s) changed, +4083 -1861 code lines.

github-actions · 2026-05-06T20:57:55Z

Diff coverage: 95% · 126 lines, 6 missing

github-actions · 2026-05-07T07:32:22Z

Total: 36 file(s) changed, +4190 -1861 code lines.

github-actions · 2026-05-07T07:45:20Z

Diff coverage: 95% · 124 lines, 6 missing

github-actions · 2026-05-07T08:18:36Z

Total: 42 file(s) changed, +5441 -2768 code lines.

github-actions · 2026-05-07T09:11:47Z

Diff coverage: 97% · 204 lines, 6 missing

hughperkins · 2026-05-07T12:36:21Z

This file is getting big. Thoughts?

hughperkins · 2026-05-07T12:39:09Z

  - *Sizer under-estimated the bound (Quadrants bug).* On unusually intricate nested loops - typically deeply nested `for i in range(arr[...])` with cumulative-index arithmetic - the sizer can compute a bound that is mathematically tighter than the actual push count. To file a bug: clear `/tmp/ir/`, rerun your script with `QD_DUMP_IR=1` set in the environment so Quadrants dumps the kernel IR there, then open an issue on the Quadrants repo with the contents of `/tmp/ir/` attached as a zip. Workaround: pass a generous `ad_stack_size=N` to `qd.init()` with `N` large enough to cover the real push count (bypasses the sizer).
 - **Out-of-memory before the kernel even runs.** A reverse pass through many loop-carried variables at a large ndrange can ask the runtime for more adstack memory than the device can physically back, even when the sizer's number is correct. Surfaces as an allocator OOM at launch time. Remedies are the ones listed under *Avoiding OOM on GPU* above: fewer loop-carried variables, a smaller ndrange, manual checkpointing, or more device-memory headroom.
 - **Loop bounds backed by a mutated ndarray.** A reverse-mode kernel with `for i in range(n[j])` requires `n[j]` to hold the same value at the forward call and at `.grad()`. If anything writes to `n[j]` between those two points - the differentiable kernel itself, or any other kernel call - the backward call will trigger an `Adstack overflow` exception or the computed gradient would come out silently wrong. The safe rule: populate loop-bound ndarrays before the forward call and leave them untouched until `.grad()` returns. The reason for that is Quadrants' adstack sizer design: it reads the loop bound separately at each dispatch, which includes forward and backward calls. Tape-based eager AD like [PyTorch's autograd](https://pytorch.org/docs/stable/notes/autograd.html) is not affected, since the trip count is recorded as the forward runs and reused at backward time.
+- **Inner reverse-mode loop with a complex bound at very large extent.** An arbitrarily large enclosing range works only when the inner trip count fits a fixed subset of expressions; other shapes cap at ~16 million enclosing iterations and raise `RuntimeError: ... iteration count ... exceeds the 16777216 guard` past that. Workaround: rewrite the trip count to stay within the supported subset, or shrink the enclosing loop below the threshold.


can you give an example of a 'reverse-mode loop with a complex bound at very large extent'?

Note: I feel this probably deserves its own section, rather than a bullet point, since this is very dense, and contains multiple very dense child bullet points.

> can you give an example of a 'reverse-mode loop with a complex bound at very large extent'?

`for j in range(arr[i // 2])` with `arr[0] > (1 << 24)`. Nothing more.

> Note: I feel this probably deserves its own section, rather than a bullet point, since this is very dense, and contains multiple very dense child bullet points.

This bullet should be fairly simple. Maybe I should remove details that are just confusing?

I feel these bullets are all pretty long tbh. I'm not sure if they gradually 'boil frogged' and grew over time?

My hunch is that it might be better reformatting more in the style of an 'FAQ', with a subsection heading for each current bullet point.

Understood. I can do this.

duburcqa · 2026-05-07T12:46:48Z

> This file is getting big. Thoughts? quadrants/runtime/llvm/runtime_module/runtime.cpp

I'm not in favour of refactoring this file in this PR. It is a central part of Quadrants' kernel launch orchestration. Could be nice to refactor it though.

hughperkins · 2026-05-07T12:48:36Z

> This file is getting big. Thoughts? quadrants/runtime/llvm/runtime_module/runtime.cpp

> I'm not in favour of refactoring this file in this PR. It is a central part of Quadrants' kernel launch orchestration. Could be nice to refactor it though.

It is a central part of Quadrants' kernel launch orchestration, yes.

The ask is not to refactor the file en masse, but to find a way to move autodiff-specific things outside of it (at least, the new autodiff-related things you are adding in this PR). Please.

duburcqa · 2026-05-07T12:50:50Z

> The ask is not to refactor the file en masse, but to find a way to move autodiff-specific things outside of it (at least, the new autodiff-related things you are adding in this PR). Please.

Ok I will refactor this file.

github-actions · 2026-05-07T15:32:09Z

Total: 46 file(s) changed, +5986 -3261 code lines.

github-actions · 2026-05-07T19:19:21Z

Total: 49 file(s) changed, +6008 -3265 code lines.

hughperkins · 2026-05-07T19:38:25Z

+
+### Inner reverse-mode loop with a complex bound at very large extent
+
+An arbitrarily large enclosing range works only when the inner trip count fits a fixed subset of expressions; other shapes cap at ~16 million enclosing iterations and raise `RuntimeError: ... iteration count ... exceeds the 16777216 guard` past that.


Can you give an example of what this means? When I read this, things my brain stumbles on:

"enclosing range"

I assume this means some kind of for loop over a range, but I have to think about it

'inner trip count'

inner, I suppose, means an inner loop

count counts something, but not iterations, but ... trips

not sure what a 'trip' is

'shapes' again is not a term I'm familiar with in this context

'enclosing iterations'

does this mean the iterations of the 'enclosing range'?

the iterations of the 'inner' loop?

something else?

I think it would be nice to have an example that illustrates clearly what this is talking about.

hughperkins · 2026-05-07T19:40:22Z

+
+An arbitrarily large enclosing range works only when the inner trip count fits a fixed subset of expressions; other shapes cap at ~16 million enclosing iterations and raise `RuntimeError: ... iteration count ... exceeds the 16777216 guard` past that.
+
+Two categories of bound expression:


You haven't used 'bound expression' in this sub-subsection yet. Not sure what it refers to. Again, would be nice if the example showed this, I feel.

(I kind of feel maybe this deserves its own section, outside of 'what can go wrong' potentially. Or... if you've explained all the above concepts before, then maybe refer me back to a concise definition of each concept earlier in the readme, perhaps?)

hughperkins · 2026-05-07T19:40:43Z

+Two categories of bound expression:
+
+- *Works at any enclosing-range size:* integer ndarray reads up to 32 bits wide (single- or multi-axis, indexed by literal constants or enclosing loop variables), field reads of the same width indexed by literal constants or enclosing loop variables (`my_field[None]`, `my_field[k]` for a constant `k`, `my_field[i]` where `i` is an enclosing loop variable), `arr.shape[k]` shape terms, literal integer constants, and `+`, `-`, `*`, `max` of those.
+- *Caps at the threshold:* 64-bit integer ndarray or field reads, arithmetic-indexed reads (`arr[i // 2]`, `arr[i % 4]`), and ragged inner ranges whose own bound depends on an enclosing loop variable through an unsupported leaf shape.


which threshold? Again, no mention of a threshold in this sub-subsection.

also, what is a 'leaf shape'? (again, feel free to link back to where it's concisely explained potentially)

github-actions · 2026-05-08T10:12:47Z

Total: 50 file(s) changed, +6244 -3265 code lines.

github-actions · 2026-05-08T10:56:35Z

Total: 50 file(s) changed, +6251 -3265 code lines.

github-actions · 2026-05-08T12:06:21Z

Total: 50 file(s) changed, +6251 -3265 code lines.

hughperkins · 2026-05-08T12:07:10Z

checklist:
[x] files look reasonably sized
[x] sampled several .cpp files, and there existed matching .h files (I really need to improve the file check job to show these for me)
[x] genesis benchmarks ~neutral
[x] genesis unit tests passing
[ ] (need to check docs)

hughperkins · 2026-05-08T12:08:53Z

+        ...
+```
+
+The enclosed loop's iteration count `arr[i // 2]` is what we call the enclosed loop's *bound expression*. Reverse-mode autodiff needs an upper bound on how many times the enclosed loop body executes across the whole kernel. To do so, the compiler analyses the bound expression at launch time by taking one of the two evaluation paths based on its structure:


Super nice, up to and including this paragraph. Thank you 🙌 Very clear to me :)

hughperkins · 2026-05-08T12:11:37Z

+
+The enclosed loop's iteration count `arr[i // 2]` is what we call the enclosed loop's *bound expression*. Reverse-mode autodiff needs an upper bound on how many times the enclosed loop body executes across the whole kernel. To do so, the compiler analyses the bound expression at launch time by taking one of the two evaluation paths based on its structure:
+
+- **Parallel:** integer ndarray reads up to 32 bits wide, single- or multi-axis, indexed by literal constants or outer loop variables are evaluated in parallel. Field reads of the same width and the same indexing rules apply: `my_field[None]`, `my_field[k]` for a constant `k`, or `my_field[i]` where `i` is an outer loop variable. The shape term `arr.shape[k]`. Literal integer constants. And any `+`, `-`, `*`, `max` of those. The outer loop can run any number of iterations.


I got lost after 'parallel'. I know what the word 'parallel' means. But I don't know what it means in this context.

does it mean there are multiple enclosed loops running in parallel?

does it mean the outer loop is running in parallel?

does it mean that the enclosed loop runs in parallel? (probably not, but it's not certain to me, and anyway, even if I can probably reject this option, I still had to think about it in order to reject it.)

Please could you, after 'parallel', describe what is meant by 'parallel', in this context

Ditto for 'sequential'.

I would suggest having the bullet points only define what 'parallel' and 'sequential' mean in this context, then, underneath the bullet points, having a paragraph for each approach, stating the various things you are stating.

github-actions · 2026-05-08T12:47:57Z

Total: 50 file(s) changed, +6244 -3265 code lines.

github-actions · 2026-05-08T13:22:55Z

Total: 50 file(s) changed, +6244 -3265 code lines.

hughperkins · 2026-05-08T13:34:30Z

+
+Reverse-mode autodiff needs the worst-case inner-loop trip count to size the adstack, which is allocated per outer-loop iteration. In this example each outer iteration must accommodate up to 5 inner pushes - the maximum the bound expression takes across all `i`. Quadrants computes that maximum at launch time and uses it to size the adstack. With deeper loop nests each enclosed loop's bound expression is reduced separately and the adstack is sized as the product of those maxes.
+
+The compiler picks one of two evaluation paths to compute the maximum based on the bound expression's structure:


"The compiler creates a separate kernel which will be run at runtime to compute the maximum. This kernel is called the 'bounds kernel'. There are two bounds kernels available. The parallel bounds kernel is faster, but only supports some loop types. For other loop types, a serial bounds kernel is used, which is slower, but more general.

"The following table shows the loop types compatible with the fast parallel bounds kernel:

loop type loop type description

integer fields

ndarray reads up to 32 bits

single or multi-axis ndarrays

(ok this table seems odd. Are these constraints overlapping, like a Venn diagram, and 'AND'ed together? They don't read like an orthogonal set of possible loop types to me? I think this needs a bit of clarification too.)

hughperkins · 2026-05-08T14:29:01Z

+
+### Nested loops
+
+Quadrants supports arbitrarily nested loops. When the bound expression itself contains another enclosed loop whose own bound expression must be reduced first, the enclosing bound expression takes the parallel path only if every nested bound expression also fits the parallel-path grammar; otherwise it falls back to the sequential walk. This keeps the runtime from mixing parallel and sequential evaluators inside a single bound expression, which would otherwise force per-iteration kernel launches.


This is good I feel 🙌

hughperkins · 2026-05-08T14:29:19Z

+
+In the example above, the iteration count of the enclosed loop takes the sequential path because of the `i // 2` index. As such, it would raise at launch if `arr.shape[0] > (1 << 24)`.
+
+### Workaround


workaround for what?

hughperkins · 2026-05-08T14:31:26Z

+
+### Inner reverse-mode loop with a complex bound at very large extent
+
+A reverse-mode kernel with two nested loops is in some cases limited to an outer-loop extent of at most `1 << 24`. In particular when the enclosed loop's trip count is an uncommon expression of the outer-loop variable, e.g. `for i in range(arr.shape[0]): ... for j in range(arr[i // 2]):`. See [Appendix C](#appendix-c-evaluation-of-the-enclosed-loops-bound-expression) for a complete walkthrough of the enclosed loop's bound expression and workarounds. When the limit applies and the outer extent exceeds it, the kernel raises `RuntimeError: ... iteration count ... exceeds the 16777216 guard` at launch.


I can't help suspecting this is very likely the original bullet point, just with a reference to Appendix C added :) However, having the link to Appendix C does reduce the burden of being easily understandable, I feel. So, ok :)

hughperkins · 2026-05-08T14:46:35Z

Updated checklist:
[x] Docs look good to me

=> ok to merge

…to answer Hugh's review: explain what 'parallel' / 'sequential' mean

github-actions · 2026-05-08T15:28:48Z

Total: 50 file(s) changed, +6244 -3265 code lines.

github-actions · 2026-05-08T16:24:30Z

Diff coverage: 97% · 204 lines, 6 missing

* [Misc] Warn user to disable caching when print_ir/QD_DUMP_IR enabled (Genesis-Embodied-AI#425) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> * [Build] Pin torch version to CUDA 12.8 for CUDA tests (Genesis-Embodied-AI#428) * [Misc] Fixing up taichi-dev urls (Genesis-Embodied-AI#429) * [Perf] Rename cuda_graph to gpu_graph across the codebase (Genesis-Embodied-AI#430) * Misc: fix typo integeral -> integral (Genesis-Embodied-AI#434) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> * [Perf] CUDA graph 4: call from multiple locations (Genesis-Embodied-AI#420) * [Bug] Fix fastcache not restoring graph_do_while_arg (Genesis-Embodied-AI#435) * [Perf] Cache last-call result in perf_dispatch for single-compatible case (Genesis-Embodied-AI#438) * Fix gpu_graph fallback on old Nvidia GPU. (Genesis-Embodied-AI#443) * Fix shared memory offset not reset between CUDA kernels. (Genesis-Embodied-AI#442) * [Misc] Allow disabling GPU graph via QD_GPU_GRAPH=0 env var (Genesis-Embodied-AI#439) * [Misc] Add named top-level loops (Genesis-Embodied-AI#440) * [Misc] Rename gpu_graph to graph (Genesis-Embodied-AI#446) * [Misc] Add cross-platform shuffle (Genesis-Embodied-AI#447) * [Bug] Fix graph_do_while on Windows: search for cudadevrt.lib (Genesis-Embodied-AI#456) * [Bug] Also search default CUDA toolkit install location on Windows (Genesis-Embodied-AI#461) * [SPIRV] Feature Parity Atomics & Shared Array (Genesis-Embodied-AI#432) * [Misc] Change clang format to 120 characters (Genesis-Embodied-AI#463) * [Misc] CUDA graph 5 Add fatbin (Genesis-Embodied-AI#464) * [Bug] Reuse VkInstance across init/reset cycles (Genesis-Embodied-AI#465) * [Perf] Tiles 1: _load, _store, _eye_ (Genesis-Embodied-AI#466) * [Misc] Remove dead InternalFuncStmt type_check override (Genesis-Embodied-AI#471) * [Perf] Tiles 2: add cholesky and ger (Genesis-Embodied-AI#472) * [Perf] Tiles 2b: add triangular solve (Genesis-Embodied-AI#474) * [Misc] Refactor: use _get_col/_set_col in tiles load/store/init 
(Genesis-Embodied-AI#475) * [Build] Fix flaky test_clock_accuracy (Genesis-Embodied-AI#436) * Fix AARCH64 emitting invalid asm in CUDA kernels. (Genesis-Embodied-AI#473) Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [AMDGPU] Enable HIP memory pool and surface pool-exhaustion errors. (Genesis-Embodied-AI#485) * [AMDGPU] Scope hsaco tmp dir per-user to avoid collisions. (Genesis-Embodied-AI#484) * [Perf] Tiles 3: Add slice syntax, qd.outer() and initial doc (Genesis-Embodied-AI#477) * [AMDGPU] Fix gradient computation. (Genesis-Embodied-AI#486) * Enable all backends that are supported in unit tests. (Genesis-Embodied-AI#488) * Fix SPIRV ID overflow for large kernels due to autodiff. (Genesis-Embodied-AI#489) * [Misc] Fix purity checker to allow accessing constants from quadrants modules (Genesis-Embodied-AI#487) * [Misc] Increase tolerance for clock monotonic test (Genesis-Embodied-AI#492) * [CI] Serialize api doc workflow (Genesis-Embodied-AI#494) * [CI] Increase tolerance for clock test (Genesis-Embodied-AI#506) * [CI] Increase clock test tolerance to 20% (Genesis-Embodied-AI#509) * [Perf] Add tensor_type parametrization to tile16 tests (Genesis-Embodied-AI#504) * [Perf] Tiles 4b: Migrate tiles16 tests to enable fastcache (Genesis-Embodied-AI#505) * [Perf] Tiles 4c: add Tiles16x16 proxy (Genesis-Embodied-AI#507) * [Perf] Tiles 4d: Consolidate slice error tests using parametrize (Genesis-Embodied-AI#508) * [Perf] Tiles 4: add SharedArray slice support (Genesis-Embodied-AI#482) * [Perf] Tiles 5: add Cholesky benchmark demo (Genesis-Embodied-AI#483) * [Doc] Add user guide page for subgroup shuffle (Genesis-Embodied-AI#512) * [Perf] Implement cross-platform shuffle_down (Genesis-Embodied-AI#510) * [Perf] Add portable subgroup reduce_add and reduce_all_add (Genesis-Embodied-AI#511) * [Perf] Add first warmup config to perf dispatch (Genesis-Embodied-AI#422) * [AutoDiff] Autodiff 1: Add baseline adstack regression test for unary_collections 
(Genesis-Embodied-AI#500) * [AutoDiff] Autodiff 2: Implement derivative for tan (Genesis-Embodied-AI#501) * [AutoDiff] Autodiff 3: Recompute tanh/exp on the operand in the reverse pass (Genesis-Embodied-AI#502) * [AutoDiff] Autodiff 4: Mark rsqrt as non-linear for adstack promotion (Genesis-Embodied-AI#503) * [AutoDiff] Autodiff 5: Fix adjoint-alloca placement for GlobalLoads outside the current range-for (Genesis-Embodied-AI#496) * [AutoDiff] Autodiff 6: Adstack regression tests (Genesis-Embodied-AI#491) * [AutoDiff] Autodiff 7: Fix header size in AdStackAllocaStmt to match u64 runtime layout (Genesis-Embodied-AI#534) * [AutoDiff] Autodiff 8: Surface LLVM adstack push/pop overflow as a Python exception (Genesis-Embodied-AI#535) * [AutoDiff] Autodiff 9: Guard against LLVM worker-thread stack overflow from large per-task adstack budget (Genesis-Embodied-AI#495) * [AutoDiff] Autodiff 10: Implement adstack for SPIR-V (Genesis-Embodied-AI#490) * [AutoDiff] Autodiff 11: Latent adstack-adjacent fixes (AMDGPU hipFree, flush() keeps ctx_buffers_, always-preallocate) (Genesis-Embodied-AI#536) * [Doc] Add AGENTS.md with instructions for AI agents (Genesis-Embodied-AI#541) * [Bug] Abort kernel execution on assertion failure instead of segfaulting (Genesis-Embodied-AI#419) * [Type] ndarray typing 1: Add eval_str=True to inspect.signature() calls (Genesis-Embodied-AI#411) * [CI] Suppress reportPrivateImportUsage in torch-using files (Genesis-Embodied-AI#552) * [Misc] QD_DUMP_IR dumps to files with the task_id added to the filename (Genesis-Embodied-AI#441) * [Type] ndarray typing 2: Fix NDArray single-arg subscript crash (Genesis-Embodied-AI#412) * [Test] Flush xdist channel before worker exit so test failure reports are visible (Genesis-Embodied-AI#555) * [CI] Reduce test retries on CI from 3 to 1. 
(Genesis-Embodied-AI#554) * [AutoDiff] Autodiff 12: Heap-backed adstack on LLVM backends (CPU/CUDA/AMDGPU) (Genesis-Embodied-AI#537) * [AutoDiff] Autodiff 13: Heap-backed adstack on SPIR-V backends (Metal, Vulkan) (Genesis-Embodied-AI#493) * [AutoDiff] Autodiff 14: Resolve bounded-inner-loop adstacks without default_ad_stack_size fallback (Genesis-Embodied-AI#539) * [SPIRV] Vulkan SPIR-V correctness: atomic-view aliasing, PSB stride, narrow storage caps, u1 cast, per-init layer recheck (Genesis-Embodied-AI#513) * [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch and sanitise MoltenVK cap advertisement (Genesis-Embodied-AI#551) * [Test] Suppress stock pytest-timeout to avoid conflict with pytest_hardtle (Genesis-Embodied-AI#557) * [Vulkan] Use SDK validation layer for debugPrintf instead of apt package (Genesis-Embodied-AI#562) * [Test] Fix flaky perf_dispatch tests by increasing work amounts (Genesis-Embodied-AI#559) * [Test] Add --maxfail CLI option to run_tests.py (default 20) (Genesis-Embodied-AI#558) * [CI] Vulkan debug printf fix to address flaky tests (Genesis-Embodied-AI#563) * [Docs] Add a new page to help for first time contributors (Genesis-Embodied-AI#426) Authored-by: v01dxyz <v01dxyz@v01d.xyz> * [AutoDiff] Autodiff 16: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr (Genesis-Embodied-AI#543) * Fix: raise error if device memory allocation fails (Genesis-Embodied-AI#451) (Genesis-Embodied-AI#453) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [CI] Add CI job to check line wrapping of comments and docs (Genesis-Embodied-AI#564) * [Misc] Add coverage report to PRs, including kernels (Genesis-Embodied-AI#470) * [CI] CI wrap check feeds only diffs to agent (Genesis-Embodied-AI#567) * Skip 'flaky' test on MacOS CI. 
(Genesis-Embodied-AI#573) * [Test] Fix missing `import sys` in test_fail_device_memory_allocation (Genesis-Embodied-AI#574) * [CI] Fix Vulkan debugPrintf flake with session-scoped warmup (Genesis-Embodied-AI#571) * [AutoDiff] determine_ad_stack_size: replace whole-CFG Bellman-Ford with SCC + DAG DP (Genesis-Embodied-AI#575) * [Test] Fix macOS OOM skip reason to describe actual root cause (Genesis-Embodied-AI#576) * [Lang] whole_kernel_cse: 2.5x compile time speedup on large kernels (Genesis-Embodied-AI#577) * [CI] Add CI check for unnecessarily deleted comments (Genesis-Embodied-AI#570) * [CI] Migrate coverage report to github Check page (Genesis-Embodied-AI#566) * [Lang] Skip IR verifier between passes unless debug=true (Genesis-Embodied-AI#579) * [Lang] Inline AdStack ops on release LLVM codegen: dramatically reduces compile time for adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#584) * [CUDA] Honor offline_cache=False end-to-end so QD_OFFLINE_CACHE=0 actually gives a cold compile (Genesis-Embodied-AI#580) * [Type] Tensor 24 (Genesis-Embodied-AI#561) Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> * [Lang] auto_diff host-walk reductions: dramatically faster front-end compile time on adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#587) * [AutoDiff] Speed up reverse-mode kernel launches on GPU backends (Genesis-Embodied-AI#578) * [Vulkan] Move adstack-sizer scratch out of Function-scope memory to fix SPIR-V pipeline build failures (Genesis-Embodied-AI#588) * [AutoDiff] Improve diagnosis of unsupported reverse-mode AD patterns (Genesis-Embodied-AI#590) * [Bug] Fix: promote Ndarray to AnyArray in build_Name for flattened struct fields (Genesis-Embodied-AI#592) * [SPIR-V] Shrink reverse-grad kernel MSL by ~50% (Genesis-Embodied-AI#591) * [CI] Add CI check that PR changes have test coverage (Genesis-Embodied-AI#596) * [Perf] Enable zero-copy in to_torch() and to_numpy() (Genesis-Embodied-AI#450) * Add 
BufferView: safe sub-range ndarray access for kernels (Genesis-Embodied-AI#585) Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [Doc] Add user-facing fastcache documentation (Genesis-Embodied-AI#597) Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> * [Misc] Upgrade to enable v1 dlpack so to_numpy(copy=False) writable (Genesis-Embodied-AI#598) Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local> * [AutoDiff] Cut reverse-mode adstack memory usage 10x on all backends (Genesis-Embodied-AI#599) * [Misc] Add CI check for feature file factorization (Genesis-Embodied-AI#606) * [Perf] Skip _recursive_set_args for all-Field frozen dataclass structs (Genesis-Embodied-AI#607) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] SNode-arm bound-expr capture rejects fold-attack gate indices (Genesis-Embodied-AI#610) * [Misc] Suppress field fastcache warning for qd.Tensor (Genesis-Embodied-AI#615) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] Adstack heap: clip reducer count by per-task loop trip count (compile-time and SizeExpr-evaluated) (Genesis-Embodied-AI#611) * [Misc] Forward copy= through qd.Tensor, add copy=None option (Genesis-Embodied-AI#616) Co-authored-by: Cursor <cursoragent@cursor.com> * [Doc] Update README (Genesis-Embodied-AI#617) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Fix coverage report showing def lines as uncovered (Genesis-Embodied-AI#623) Co-authored-by: Cursor <cursoragent@cursor.com> * [Perf] Generic launcher: persistent context, JIT-pointer reuse, Metal compute encoder, LLVM-GPU async memory ops (Part 1/2) (Genesis-Embodied-AI#619) * [CI] Encode Python-first testing policy in coverage-check prompt (Genesis-Embodied-AI#622) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Add PR Line change report (Genesis-Embodied-AI#624) Co-authored-by: Cursor <cursoragent@cursor.com> 
* [CI] Disable quadrants pytest plugin during quadrants internal coverage runs (Genesis-Embodied-AI#629) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] Adstack load+store eliminations: EliminateRecomputableAdStackPushes pass + leaf extensions (Genesis-Embodied-AI#621) * [CI] Simplify coverage PR comment to a single linked line (Genesis-Embodied-AI#630) * [CUDA] Add AGX Thor, SM_110 (Genesis-Embodied-AI#631) Co-authored-by: Johnny Nunez and Hugh Perkins * [CI] Lines changed report: collapse PR comment to a single linked totals line (Genesis-Embodied-AI#632) * [FEATURE] Support external Metal command queue via qd.init (Genesis-Embodied-AI#618) Co-authored-by: Cursor <cursoragent@cursor.com> * [Perf] Cache adstack-sizer metadata per task across SPIR-V + LLVM-GPU; per-snode / DeviceAllocation invalidation (Part 2/2) (Genesis-Embodied-AI#620) * [AutoDiff] Disable EliminateRecomputableAdStackPushes pending mutated-SNode chain-leaf fix (Genesis-Embodied-AI#633) * [AutoDiff] Adstack chain-clone safety: mutated-SNode leaf reject + load_top consumer-aware guard (Genesis-Embodied-AI#634) * [Docs] Add user-guide page for qd.simt.block.* primitives (Genesis-Embodied-AI#638) * [Docs] Expand qd.simt.subgroup user-guide page to cover every op (Genesis-Embodied-AI#639) * [Perf] Streams 1-4 (Genesis-Embodied-AI#410) * [Docs] Add user-guide page for matrix decompositions and solvers (Genesis-Embodied-AI#643) * [Bug] Revert "[Perf] Streams 1-4 (Genesis-Embodied-AI#410)" (Genesis-Embodied-AI#650) * [Docs] Add user-guide page for atomics and bit operations (Genesis-Embodied-AI#640) * [Docs] Add user-guide page for qd.simt.grid.* primitives (Genesis-Embodied-AI#641) * [AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch (Genesis-Embodied-AI#635) * [AMDGPU] Fix amdgpu parallel rand init (Genesis-Embodied-AI#658) * [Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap (Genesis-Embodied-AI#655) * [Perf] Re-land Streams 1-4 with bug fixes 
(Genesis-Embodied-AI#653) * [AMDGPU] Apply device_memory_GB=0.3 cap to AMDGPU tests (Genesis-Embodied-AI#659) * [Perf] Per-launch host sync: drop wait_idle on SPIR-V, pin stream and drop stream_synchronize on CUDA/AMDGPU (Genesis-Embodied-AI#654) * [AMDGPU] Unload hipModule_t in JITModuleAMDGPU destructor (Genesis-Embodied-AI#660) * [AMDGPU] Trim default mempool on qd.reset() (Genesis-Embodied-AI#669) * [AMDGPU] Hoist rand-state buffer to process lifetime (Genesis-Embodied-AI#668) * [Streams] Use events for streams serialization on AMDGPU and CUDA (Genesis-Embodied-AI#667) * [Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id (Genesis-Embodied-AI#671) * [SPIR-V] dispatch_max_reducers: register each task with the real kernel name (Genesis-Embodied-AI#675) * [AutoDiff] Debug-mode field/grad/dual: dtype, layout, and access-time invariants (Genesis-Embodied-AI#677) * [Docs] Add user-guide page for qd.algorithms.* device-wide algorithms (Genesis-Embodied-AI#642) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [Docs] Doc for existing atomics: switch support table to per-backend columns (Genesis-Embodied-AI#657) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [GPU] Cross gpu atomics (Genesis-Embodied-AI#666) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [GPU] Make block operations portable cross-gpu (Genesis-Embodied-AI#664) * [Perf] CPU LLVM adstack-cache: skip per-launch bump-writes + ndarray_shapes capture on forward-only handles (Genesis-Embodied-AI#685) * [GPU] Cross-GPU for grid ops (Genesis-Embodied-AI#670) * [Math] Make bitop operations portable cross-gpu (Genesis-Embodied-AI#662) * [AMDGPU] Always use wave64, on both RDNA and CDNA (Genesis-Embodied-AI#687) * [AMDGPU] Use syncscope("agent") for atomix xor to avoid CAS livelock (Genesis-Embodied-AI#672) * [GPU] New bit ops for QIPC (Genesis-Embodied-AI#679) * [GPU] Subgroup ops cross-gpu (Genesis-Embodied-AI#665) * [Graph] Rename CUDA 
Graph to Graph in docs (Genesis-Embodied-AI#691) * [SPIR-V] Fix FIFO-queue ordering when sharing command queue. (Genesis-Embodied-AI#694) * [Atomics] New QIPC ops for atomics (Genesis-Embodied-AI#690) * Pass dataclass sub-structs into qd.func (Genesis-Embodied-AI#698) * [AMDGPU] HIP graph runtime support for @qd.kernel(graph=True) (Genesis-Embodied-AI#692) * [CI] Add per-file timing report to Mac Metal test job (Genesis-Embodied-AI#695) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Enable kernel disk cache during tests (Genesis-Embodied-AI#696) * [Math] New QIPC ops for single-threaded linalg (Genesis-Embodied-AI#683) * [BREAKING][GPU] New QIPC ops for subgroups (Genesis-Embodied-AI#676) * [GPU] New QIPC ops for block (Genesis-Embodied-AI#684) * [GPU] New device-level ops for QIPC (Genesis-Embodied-AI#693) * [algorithms] PrefixSumExecutor: drop unused GRID_SZ local (Genesis-Embodied-AI#701) * [block] sync(): fix unsupported-arch error message (Genesis-Embodied-AI#700) * [volatile_load] add qd.volatile_load primitive (closes Genesis-Embodied-AI#648) (Genesis-Embodied-AI#702) * [AutoDiff] Reject recycled identity_key in AdStackCache::register_adstack_sizing_info (Genesis-Embodied-AI#708) * [Vulkan] Declare GroupNonUniform SPIR-V caps and enable shaderSubgroupExtendedTypes (Genesis-Embodied-AI#707) * Fix duplicate HIP graph driver-function declarations after v1.0.0 merge The amd-integration fork had cherry-picked the HIP graph driver functions (graph_create / graph_destroy / graph_add_kernel_node / graph_instantiate / graph_exec_destroy / graph_launch), and upstream v1.0.0 added the same set. The per-file 3-way merge appended both copies into amdgpu_driver_functions.inc.h, producing redeclaration errors that broke the AMDGPU RHI/runtime compile. Drop the upstream duplicate block; the signatures are identical to the fork's existing declarations. 
Co-authored-by: Cursor <cursoragent@cursor.com> * Fix AMDGPU launcher coherence and num_instructions visibility after v1.0.0 merge - kernel_launcher.cpp: the 3-way merge spliced upstream v1.0.0's launch_llvm_kernel rewrite (ephemeral arg/context buffers, explicit-stream path, AmdgpuDefaultStream PinGuard) onto the AMD fork's kernarg-by-value + persistent-scratch design, leaving references to undefined `ephemeral_context_ptr`. Restore the fork's coherent launch_llvm_kernel verbatim; it calls the (already merged) enhanced launch_offloaded_tasks, which keeps the max-reducer dispatch and stream-parallel groups adapted onto the AMD launch path. - llvm_context.h: both the fork and upstream added `num_instructions`; the merge kept upstream's private placement, but the AMDGPU codegen force-inline heuristic calls it statically from outside the class. Move it back to the public section. Co-authored-by: Cursor <cursoragent@cursor.com> * Restore async result D2H and hoist kernarg vectors in AMDGPU launcher The v1.0.0 merge resolution regressed two amd-integration baseline optimizations in launch_llvm_kernel / launch_offloaded_tasks: - The per-launch result-buffer copy was a blocking memcpy_device_to_host, forcing a host stall on every value-returning launch and serializing the GPU pipeline. Restore the async D2H (the caller synchronizes lazily when it needs the value); external-array transfers still stream_synchronize once before reading back. - launch_task constructed the kernarg std::vectors from initializer lists ({kernarg_payload} / {kernarg_size}) on every dispatch (heap alloc + free per launch). Hoist arg_ptrs/arg_sizes out of the per-task launch and reuse. 
Co-authored-by: Cursor <cursoragent@cursor.com> * amdgpu: default to LDS permlane64 emulation; drop host-x86 barrier asm on retarget Two AMDGPU JIT-compile crashes surfaced after the v1.0.0 merge pulled in the QIPC subgroup ops (Genesis-Embodied-AI#676), which made the rigid constraint solver's wave-cooperative reductions route through `amdgpu_cross_half_shuffle_i32`. Both manifested as a SIGSEGV inside `llvm::SIInstrInfo::getInstSizeInBytes` during `JITSessionAMDGPU::compile_module_to_hsaco` (i.e. at first kernel launch), and reproduce on gfx942 / MI300X. Baseline 0.4.6 never emitted these constructs, which is why it was unaffected. 1. Native `llvm.amdgcn.permlane64` lowering crashes the bundled LLVM 22.1.0 AMDGPU backend. Default `amdgpu_permlane64` to the existing LDS-roundtrip software emulation on every target (it produces identical results). Add `QD_AMDGPU_USE_NATIVE_PERMLANE64=1` to opt back into the native instruction once the backend bug is fixed; the old `QD_AMDGPU_FORCE_PERMLANE64_FALLBACK` is now the default and still honored. This is the actual crash fix. 2. The runtime module is compiled by the host x86_64 clang and only retargeted to amdgcn here, so `amdgpu_cross_half_shuffle_i32`'s `__asm__ volatile("" : "+v"(byte))` optimization barrier carries x86 flag clobbers (`~{dirflag},~{fpsr},~{flags}`) that are meaningless on AMDGPU. The IR verifies but the empty-body INLINEASM is invalid on the amdgcn target. Neutralize empty-body barrier asm during retarget (forward the tied value, then erase) so no stale host asm reaches codegen. On the wave64 targets we ship `ds_bpermute` already addresses the full wave, so the hint is a no-op. Co-authored-by: Cursor <cursoragent@cursor.com> * style: apply clang-format (v19.1.7) to AMDGPU fn_attrs and launcher sources CI pre-commit's clang-format hook reformatted these files (long declarations/lambda signatures collapsed onto single lines per the repo's clang-format config). Apply the same formatting so the hook passes. 
No functional changes. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(amdgpu): use CreateNeg for branchless i32 sgn instead of CreateSub(0, input) clang-tidy (modernize-use-nullptr, -warnings-as-errors) flagged `builder->CreateSub(0, input)` in the i32 sgn path: the literal `0` binds to the `llvm::Value*` LHS parameter as a null pointer, not an integer zero. Replace with `builder->CreateNeg(input)`, which emits `0 - input` with a proper zero constant -- identical intended semantics, and clang-tidy clean. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Robert Dazi <14996868+v01dXYZ@users.noreply.github.com> Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> Co-authored-by: Alexis DUBURCQ <alexis.duburcq@gmail.com> Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com> Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Johnny <johnnynuca14@gmail.com>

chatgpt-codex-connector Bot reviewed May 6, 2026

View reviewed changes

Comment thread quadrants/runtime/gfx/runtime.cpp Outdated

Comment thread quadrants/program/adstack_size_expr_eval.cpp Outdated

Comment thread quadrants/python/export_lang.cpp Outdated

duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch from 23a7daf to eaf4ba9 Compare May 6, 2026 17:17

duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch from eaf4ba9 to 2ef85c1 Compare May 6, 2026 18:40

duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch 2 times, most recently from 75800cc to e80f9dd Compare May 6, 2026 19:37

duburcqa changed the title ~~[AutoDiff] Adstack max-reducer: parallel MaxOverRange dispatch with 1<<24 cap-hit tripwires~~ [AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch + bound-var FieldLoad with cap-hit tripwires May 7, 2026

duburcqa mentioned this pull request May 7, 2026

DO NOT MERGE: debug CUDA T4 max-reducer hang on PR #635 (Xid 31) #589

Closed

hughperkins reviewed May 7, 2026

View reviewed changes

duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch 2 times, most recently from 95cdf06 to f3deef7 Compare May 7, 2026 18:24

hughperkins reviewed May 7, 2026

View reviewed changes

duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch from 64ec62a to f19244c Compare May 8, 2026 09:40

hughperkins reviewed May 8, 2026

View reviewed changes

duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch from bf63707 to 6a03cdd Compare May 8, 2026 12:09

hughperkins reviewed May 8, 2026

View reviewed changes

duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch from 6a03cdd to f19244c Compare May 8, 2026 12:12

duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch from b5af10c to acf1104 Compare May 8, 2026 13:27

hughperkins reviewed May 8, 2026

View reviewed changes

duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch 2 times, most recently from 3a69d60 to 5f99e0f Compare May 8, 2026 14:26

hughperkins reviewed May 8, 2026

View reviewed changes

duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch from 36d4173 to 3eaef5e Compare May 8, 2026 14:45

duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch from 3eaef5e to 5f99e0f Compare May 8, 2026 14:49

[Docs] Reword 'Inner reverse-mode loop with a complex bound' section …

9ca862f

…to answer Hugh's review: explain what 'parallel' / 'sequential' mean

duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch from 5f99e0f to 9ca862f Compare May 8, 2026 14:54

duburcqa merged commit 14bd3a9 into main May 8, 2026
79 of 80 checks passed

duburcqa deleted the duburcqa/adstack_max_reducer_shader branch May 8, 2026 18:03

duburcqa mentioned this pull request May 9, 2026

[Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id #671

Merged

hughperkins mentioned this pull request May 19, 2026

[Refactor] Extract per-task adstack codegen state and helpers from TaskCodeGenLLVM into TaskCodeGenLLVMAdStack #612

Closed


		### Inner reverse-mode loop with a complex bound at very large extent

		An arbitrarily large enclosing range works only when the inner trip count fits a fixed subset of expressions; other shapes cap at ~16 million enclosing iterations and raise `RuntimeError: ... iteration count ... exceeds the 16777216 guard` past that.


		Two categories of bound expression:


		The enclosed loop's iteration count `arr[i // 2]` is what we call the enclosed loop's bound expression. Reverse-mode autodiff needs an upper bound on how many times the enclosed loop body executes across the whole kernel. To do so, the compiler analyses the bound expression at launch time by taking one of the two evaluation paths based on its structure:

		- Parallel: the bound expression is evaluated in parallel when it is built only from the following: integer ndarray reads up to 32 bits wide (single- or multi-axis, indexed by literal constants or outer loop variables); field reads of the same width under the same indexing rules (`my_field[None]`, `my_field[k]` for a constant `k`, or `my_field[i]` where `i` is an outer loop variable); the shape term `arr.shape[k]`; literal integer constants; and any `+`, `-`, `*`, `max` of those. The outer loop can run any number of iterations.


		Reverse-mode autodiff needs the worst-case inner-loop trip count to size the adstack, which is allocated per outer-loop iteration. In this example each outer iteration must accommodate up to 5 inner pushes - the maximum the bound expression takes across all `i`. Quadrants computes that maximum at launch time and uses it to size the adstack. With deeper loop nests each enclosed loop's bound expression is reduced separately and the adstack is sized as the product of those maxes.
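
As a rough illustration of the sizing rule described above, the following pure-Python sketch takes the maximum of a bound expression over the outer range and multiplies the per-loop maxima. It is not Quadrants code: `max_bound`, `adstack_depth`, and the sample values are hypothetical, chosen only so that the inner maximum comes out as 5.

```python
from math import prod


def max_bound(bound, extent):
    """Return the largest value bound(i) takes for i in range(extent)."""
    return max((bound(i) for i in range(extent)), default=0)


def adstack_depth(level_maxes):
    """Combine per-loop maxima as a product, one factor per enclosed loop."""
    return prod(level_maxes)


# Hypothetical trip counts: the enclosed loop runs arr[i] times for outer index i.
arr = [3, 1, 5, 2]
inner_max = max_bound(lambda i: arr[i], len(arr))

# With a second enclosed loop whose bound never exceeds 4 (an assumed value),
# the sketch sizes the stack for inner_max * 4 pushes.
depth = adstack_depth([inner_max, 4])
```

The real computation happens at launch time inside the runtime; this sketch only mirrors the arithmetic.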

		The compiler picks one of two evaluation paths to compute the maximum based on the bound expression's structure:

loop type	loop type description
integer fields
ndarray reads up to 32 bits
single or multi-axis ndarrays


		### Nested loops

		Quadrants supports arbitrarily nested loops. When the bound expression itself contains another enclosed loop whose own bound expression must be reduced first, the enclosing bound expression takes the parallel path only if every nested bound expression also fits the parallel-path grammar; otherwise it falls back to the sequential walk. This keeps the runtime from mixing parallel and sequential evaluators inside a single bound expression, which would otherwise force per-iteration kernel launches.
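The all-or-nothing rule for nested bound expressions can be sketched as a recursive check. The `BoundExpr` type and `evaluation_path` function are hypothetical illustrations; the real compiler data structures are not shown in this section.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BoundExpr:
    # Whether this expression, taken alone, matches the parallel-path grammar.
    fits_parallel_grammar: bool
    # Bound expressions of enclosed loops that must be reduced first.
    nested: List["BoundExpr"] = field(default_factory=list)


def evaluation_path(expr: BoundExpr) -> str:
    """'parallel' only if expr and every nested bound expression fit the grammar."""
    def all_fit(e: BoundExpr) -> bool:
        return e.fits_parallel_grammar and all(all_fit(n) for n in e.nested)

    return "parallel" if all_fit(expr) else "sequential"


outer_ok = BoundExpr(True, [BoundExpr(True), BoundExpr(True)])
outer_mixed = BoundExpr(True, [BoundExpr(True, [BoundExpr(False)])])

print(evaluation_path(outer_ok))     # parallel
print(evaluation_path(outer_mixed))  # sequential
```

A single out-of-grammar expression anywhere in the nest sends the whole enclosing expression to the sequential walk, which is what avoids mixing the two evaluators.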


		In the example above, the iteration count of the enclosed loop takes the sequential path because of the `i // 2` index. As such, it would raise at launch if `arr.shape[0] > (1 << 24)`.
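As a rough model of the sequential path and its guard, the sketch below walks the outer range linearly and refuses extents above `1 << 24`. The function name `sequential_max` and the exact message wording are assumptions for illustration; only the threshold value and the `RuntimeError` type come from the text.

```python
GUARD = 1 << 24  # 16777216


def sequential_max(bound, extent):
    """Linear walk over range(extent); refuses extents above the guard."""
    if extent > GUARD:
        # Illustrative wording only; the real message is not reproduced here.
        raise RuntimeError(
            f"iteration count {extent} exceeds the {GUARD} guard"
        )
    return max((bound(i) for i in range(extent)), default=0)


arr = [2, 7, 4, 1]  # hypothetical input values
print(sequential_max(lambda i: arr[i // 2], len(arr)))  # 7

try:
    # The check fires before any iteration, so this returns immediately.
    sequential_max(lambda i: 1, GUARD + 1)
except RuntimeError as exc:
    print("raised:", exc)
```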

		### Workaround


		### Inner reverse-mode loop with a complex bound at very large extent

		A reverse-mode kernel with two nested loops is in some cases limited to an outer-loop extent of at most `1 << 24`. The limit applies in particular when the enclosed loop's trip count is an uncommon expression of the outer-loop variable, e.g. `for i in range(arr.shape[0]): ... for j in range(arr[i // 2]):`. See [Appendix C](#appendix-c-evaluation-of-the-enclosed-loops-bound-expression) for a complete walkthrough of the enclosed loop's bound expression and workarounds. When the limit applies and the outer extent exceeds it, the kernel raises `RuntimeError: ... iteration count ... exceeds the 16777216 guard` at launch.

duburcqa commented May 6, 2026

Adstack max-reducer: parallel MaxOverRange dispatch with 1<<24 cap-hit tripwires

TL;DR

Why

Surface API

Mechanism end-to-end

1. Pre-pass shape recognition

2. Generic max-reducer kernels - one per backend family

3. Launch sequencing

4. Substitution into per-stack trees

5. Cap-hit tripwires (1<<24)

6. Cache invalidation

Per-backend coverage matrix

Tests - tests/python/test_adstack.py

test_max_reducer_pins_stride_for_oversized_axis

test_max_reducer_dispatch_counts_advance_on_input_mutation

test_max_reducer_grammar_fallback

test_max_reducer_field_load_bound_var_dispatch

test_max_reducer_field_load_bound_var_cache_invalidates_on_snode_mutation

test_above_cap_out_of_grammar_kernel_raises

Side-effect audit

chatgpt-codex-connector Bot left a comment

💡 Codex Review

github-actions Bot commented May 6, 2026

github-actions Bot commented May 6, 2026

duburcqa commented May 6, 2026

github-actions Bot commented May 6, 2026

github-actions Bot commented May 6, 2026

github-actions Bot commented May 6, 2026

github-actions Bot commented May 6, 2026

github-actions Bot commented May 7, 2026

github-actions Bot commented May 7, 2026

github-actions Bot commented May 7, 2026

github-actions Bot commented May 7, 2026

hughperkins commented May 7, 2026

hughperkins May 7, 2026

duburcqa commented May 7, 2026

hughperkins commented May 7, 2026

duburcqa commented May 7, 2026

github-actions Bot commented May 7, 2026

github-actions Bot commented May 7, 2026
