[AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch #635

Merged
duburcqa merged 32 commits into main from duburcqa/adstack_max_reducer_shader
May 8, 2026
@duburcqa duburcqa commented May 6, 2026

Contributor

Adstack max-reducer: parallel MaxOverRange dispatch with 1<<24 cap-hit tripwires

Fifteen commits, no behaviour change for users whose reverse-mode kernels never had a MaxOverRange axis above the existing 1<<24 adstack-sizer cap. Adds a per-tree parallel max-reducer that pre-evaluates recognized MaxOverRange shapes at launch and substitutes the result as a Const before any of the four adstack-sizer eval paths walks the tree. Promotes the silent truncation at the cap to a hard error on every backend whose sizer can detect it.

TL;DR

A reverse-mode kernel like

@qd.kernel
def compute(a: qd.types.ndarray(dtype=qd.i32, ndim=1)):
    for i in range(a.shape[0]):
        v = x[i]
        for _ in range(a[i]):
            v = v * 0.95 + 0.01
        y[None] += v

lowers to a per-stack SizeExpr containing MaxOverRange(0, a.shape[0], a[var]). Before this PR the adstack sizer enumerated that range linearly on every launch, with a hard 1<<24 cap above which the host evaluator raised RuntimeError, the LLVM device sizer silently truncated, and the SPIR-V on-device sizer silently clamped. Above-cap axes therefore either failed loud-but-confusing on CPU or produced wrong heap strides and corrupted gradients on GPU.

After this PR a recognize_adstack_max_reducer_specs pre-pass captures shapes that fit a deliberately narrow grammar (chains of nested MaxOverRanges across distinct bound variables; integer ndarray and field reads up to 32 bits wide indexed by literal constants or any captured chain bound variable; integer arithmetic combinators), the launcher dispatches a generic parallel-max compute kernel per captured spec at launch time, and substitute_precomputed_max_over_range rewrites the captured MaxOverRange to a Const carrying the dispatched value before any sizer eval path walks the tree. Out-of-grammar shapes whose iteration count exceeds the cap now raise via three explicit tripwires (host evaluator QD_ERROR_IF; SPIR-V on-device sizer metadata-trailing overflow-flag slot; LLVM device sizer cap-hit short-circuit + indirect stack_push overflow) instead of silently undersizing the heap.
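As a rough illustration of what the sizer has to compute for the kernel above, the sketch below models MaxOverRange(begin, end, body) as a plain maximum over a half-open range. The helper name `max_over_range`, the list-backed `a`, and the choice of 0 for an empty range are assumptions made for the example; this is not the Quadrants implementation.

```python
def max_over_range(begin, end, body):
    """Assumed reference semantics: max of body(var) for var in [begin, end)."""
    best = 0  # assumption: an empty range contributes a bound of 0
    for var in range(begin, end):
        best = max(best, body(var))
    return best


# Stand-in for the ndarray argument `a`: each cell is an inner trip count.
a = [3, 7, 2, 5]

# Mirrors MaxOverRange(0, a.shape[0], a[var]) from the description.
bound = max_over_range(0, len(a), lambda var: a[var])
```

Once a value like `bound` is known at launch time, the node that produced it can be treated as a constant by everything downstream, which is the idea behind the substitution step described above.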

Why

compute_bounded_adstack_size in quadrants/transforms/determine_ad_stack_size.cpp emits MaxOverRange(begin, end, body) nodes whose iteration count is bounded only by the underlying ndarray axis. Three eval paths consume the resulting trees per launch:

  • Host evaluator (adstack/eval.cpp::evaluate_node): hard QD_ERROR_IF at end - begin > 1<<24, on by default through evaluate_adstack_size_expr on the CPU host fast path.
  • LLVM device sizer interpreter (runtime_eval_adstack_size_expr in quadrants/runtime/llvm/runtime_module/runtime.cpp): break at the same threshold (silent truncation on CUDA / AMDGPU LLVM-GPU).
  • SPIR-V on-device sizer (adstack_sizer_shader.cpp): silent clamp effective_end = min(end, begin + (1<<24)) on Metal / Vulkan.

When the gating ndarray axis exceeds 1<<24 cells, every device path returned an under-bound on per-thread stack depth. The heap then either overflowed at qd.sync() with an opaque message naming the wrong kernel, or silently corrupted gradients with no error at all. The host path's hard error was the loud version and is kept as a tripwire today; it does not cover the GPU paths.
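To make the under-bound concrete, here is a small sketch of the two pre-PR behaviours at the cap: a hard error (host evaluator) versus a silent clamp (the SPIR-V formula quoted above; the LLVM `break` truncates in a similar way). Function names are illustrative, and the cap is passed as a parameter so the demo can use a tiny value instead of 1<<24.

```python
def host_eval(begin, end, body, cap):
    """Loud variant: refuse to walk a range longer than the cap."""
    if end - begin > cap:
        raise RuntimeError("iteration count exceeds the guard")
    return max((body(v) for v in range(begin, end)), default=0)


def clamped_eval(begin, end, body, cap):
    """Silent variant: walk only the first `cap` iterations."""
    effective_end = min(end, begin + cap)
    return max((body(v) for v in range(begin, effective_end)), default=0)


# The largest trip count sits past the (tiny, demo-only) cap of 2.
a = [1, 1, 1, 9]
true_bound = host_eval(0, len(a), lambda v: a[v], cap=len(a))
clamped_bound = clamped_eval(0, len(a), lambda v: a[v], cap=2)
```

In this toy setup `clamped_bound` is smaller than `true_bound`, which is the kind of under-sized stack depth that leads to heap overflow or corrupted gradients.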

The fix preserves the cap as an internal safety latch (the per-thread sizer's serial walk is still bounded) but moves the actual evaluation of recognized shapes onto a parallel-dispatch path that scales past the cap, and turns cap-hits on the remaining out-of-grammar shapes into hard errors instead of silent truncation.

Surface API

None. The change is purely internal to the adstack-sizer pipeline. Users who never tripped the cap see no behaviour change; users whose recognized kernels did trip the cap stop seeing wrong gradients; users whose out-of-grammar kernels would have tripped the cap now see a RuntimeError / QuadrantsAssertionError at the next qd.sync() instead of silent truncation.

Mechanism end-to-end

1. Pre-pass shape recognition

quadrants/program/adstack/max_reducer.{h,cpp}::recognize_adstack_max_reducer_specs(size_exprs) walks each per-stack SerializedSizeExpr post-order and returns a std::vector<StaticAdStackMaxReducerSpec> describing every MaxOverRange node whose:

  • begin and end subtrees are closed-form (Const / ExternalTensorShape / Add / Sub / Mul / Max, plus any MaxOverRange already captured deeper in the same tree),
  • body subtree references only Const, ExternalTensorRead(arg, [...]) (single- or multi-axis, indexed by literal constants or any captured chain bound variable, leaf dtype restricted to 32-bit-or-narrower integer), FieldLoad(snode, [...]) (same index restriction; the literal-only path host-folds to Const at encode time, the bound-var path emits a kFieldLoad device node), ExternalTensorShape, and Add / Sub / Mul / Max of those.

Multi-axis support: the recognizer descends through nested MaxOverRanges as long as each inner [begin, end) is closed-form (Const / ExternalTensorShape / captured-deeper-MORs); each layer adds one axis to the captured spec, and the dispatch enumerates the cross-product of every axis. Specs come back in dependency order (deepest first); each dispatch's result becomes the substituted Const an outer spec's begin / end may reference. Captured ids are stored in task_attribs.ad_stack.max_reducer_specs (SPIR-V) and current_task->ad_stack.max_reducer_specs (LLVM); both backends populate the field at codegen time (spirv_codegen.cpp, codegen_llvm.cpp).
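The body-grammar check can be pictured with a small recursive predicate. This is a simplified sketch under assumptions: the `Node` dataclass, the `("const", k)` / `("var", id)` / `("expr", ...)` index tags, and the function name are invented for illustration and do not mirror the C++ data structures.

```python
from dataclasses import dataclass, field

NARROW_INT_DTYPES = {"i8", "i16", "i32", "u8", "u16", "u32"}
COMBINATORS = {"Add", "Sub", "Mul", "Max"}


@dataclass
class Node:
    kind: str
    children: list = field(default_factory=list)
    dtype: str = ""                              # only meaningful for reads
    indices: list = field(default_factory=list)  # (tag, value) pairs, hypothetical encoding


def body_in_grammar(node, bound_vars):
    """True when the body subtree uses only the leaf and combinator kinds listed above."""
    if node.kind in ("Const", "ExternalTensorShape"):
        return True
    if node.kind in ("ExternalTensorRead", "FieldLoad"):
        if node.dtype not in NARROW_INT_DTYPES:
            return False  # 64-bit leaves fall through to the capped path
        return all(
            tag == "const" or (tag == "var" and value in bound_vars)
            for tag, value in node.indices
        )
    if node.kind in COMBINATORS:
        return all(body_in_grammar(child, bound_vars) for child in node.children)
    return False


read_i32 = Node("ExternalTensorRead", dtype="i32", indices=[("var", 0)])
read_i64 = Node("ExternalTensorRead", dtype="i64", indices=[("var", 0)])
read_arith = Node("ExternalTensorRead", dtype="i32", indices=[("expr", "i // 2")])
summed = Node("Add", children=[read_i32, Node("Const")])
```

With `bound_vars = {0}`, `read_i32` and `summed` pass while `read_i64` and `read_arith` are rejected, matching the dtype and index restrictions described in the bullets.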

The integer-leaf dtype restriction (i8 / i16 / i32 / u8 / u16 / u32 only) gates the cache-revalidation sentinel: populate_max_reducer_body_observations records INT64_MIN as the observed value, and the replay path's gen-mismatch dereference must return a value strictly greater than the sentinel to force invalidation. A 64-bit leaf could legally hold INT64_MIN and false-hit on a mutated entry, so those leaves fall through to the per-task sizer's capped path.

StaticAdStackMaxReducerSpec lives in quadrants/transforms/static_adstack_analysis.h with a QD_IO_DEF so the spec round-trips through the offline cache. The struct carries axis_var_ids / axis_begin_node_idxs / axis_end_node_idxs (one entry per captured axis, outermost-first) plus dependent_mor_node_idxs listing the captured deeper-MOR keys the spec's begin / end references.

2. Generic max-reducer kernels - one per backend family

| Backend | File | Mechanism |
| --- | --- | --- |
| SPIR-V | quadrants/codegen/spirv/adstack_max_reducer_shader.{h,cpp} | Compute shader, kAdStackMaxReducerWorkgroupSize=128, strided kElementsPerThread=64 per-thread iteration to keep num_workgroups_x under maxComputeWorkGroupCount[0]=65535 for spec lengths up to ~536M. Body bytecode interpreter (kConst / kBoundVariable / kExternalTensorRead / kFieldLoad / kAdd / kSub / kMul / kMax). Per-spec output is two u32 slots: [2*k] = OpAtomicUMax running max, [2*k+1] = OpAtomicOr overflow flag. The u32+overflow split sidesteps spirv-cross's MSL backend gap on i64 atomics (MSL currently does not support 64-bit atomics), unlocking Metal and Vulkan-via-MoltenVK. |
| LLVM | quadrants/runtime/llvm/runtime_module/runtime.cpp::runtime_eval_adstack_max_reduce | Single-thread serial walk over the body bytecode, cross-product of params.per_axis_length[] iterations, atomic-max into runtime->adstack_max_reducer_outputs[output_slot]. Dispatched as a host call on CPU and as a 1x1x1 JIT-launched kernel on CUDA / AMDGPU. POD device params live in quadrants/ir/static_adstack_max_reducer_device.h. |
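The ~536M figure follows from the three constants quoted in the SPIR-V row. The short calculation below restates that arithmetic; the Python names are illustrative, and the formula is the one given in the side-effect audit further down.

```python
WORKGROUP_SIZE = 128       # kAdStackMaxReducerWorkgroupSize
ELEMENTS_PER_THREAD = 64   # kElementsPerThread
MAX_WORKGROUPS_X = 65535   # maxComputeWorkGroupCount[0] as quoted above


def num_workgroups_x(length):
    """ceil(length / (workgroup size * elements per thread)), capped at the limit."""
    per_group = WORKGROUP_SIZE * ELEMENTS_PER_THREAD
    needed = -(-length // per_group)  # integer ceiling division
    return min(needed, MAX_WORKGROUPS_X)


# Largest spec length one dispatch can cover with these constants.
max_covered = WORKGROUP_SIZE * ELEMENTS_PER_THREAD * MAX_WORKGROUPS_X
```

Each workgroup covers 128 * 64 = 8192 elements, so an axis of exactly 1<<24 cells needs 2048 workgroups, far below the 65535 limit.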

The body bytecode reuses the existing AdStackSizeExprDeviceNode POD format from quadrants/ir/adstack_size_expr_device.h. encode_max_reducer_body_bytecode in quadrants/program/adstack/max_reducer.cpp extracts the body subtree, renumbers nodes to dense [0, body_node_count) indices, copies referenced index entries, and resolves kExternalTensorRead arg_buffer_offset via a closure passed by the per-backend launcher. Bound-var-indexed kFieldLoad leaves take a backend-specific base resolution: SPIR-V passes a FieldLoadDeviceEmitter whose fetch returns root_psb + place_byte_offset_in_root (pre-baked PSB address), LLVM passes a null emitter and the encoder stores (snode_root_id, place_byte_offset) in the device-node POD's arg_buffer_offset / const_value slots which the LLVM device interpreter resolves at runtime via runtime->roots[snode_root_id] + place_byte_offset.
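A toy version of the body interpreter and the cross-product enumeration might look like the following. It assumes a dense node list in which children always precede their parent and the last node is the root; the tuple encoding, the `"read"` kind (standing in for both kExternalTensorRead and kFieldLoad), and the function names are simplifications, not the AdStackSizeExprDeviceNode layout.

```python
from itertools import product


def eval_body(nodes, bound, arrays):
    """Evaluate a dense node list; children precede parents, last node is the root."""
    vals = []
    for kind, *args in nodes:
        if kind == "const":
            vals.append(args[0])
        elif kind == "bound_var":
            vals.append(bound[args[0]])
        elif kind == "read":
            name, index_nodes = args
            cell = arrays[name]
            for n in index_nodes:  # one index node per axis
                cell = cell[vals[n]]
            vals.append(cell)
        else:  # binary combinators reference two earlier node indices
            lhs, rhs = vals[args[0]], vals[args[1]]
            ops = {"add": lhs + rhs, "sub": lhs - rhs, "mul": lhs * rhs, "max": max(lhs, rhs)}
            vals.append(ops[kind])
    return vals[-1]


def max_reduce(nodes, axis_ranges, arrays):
    """Serial reference: max of the body over the cross-product of every axis."""
    var_ids = list(axis_ranges)
    best = 0
    for point in product(*(range(b, e) for b, e in axis_ranges.values())):
        best = max(best, eval_body(nodes, dict(zip(var_ids, point)), arrays))
    return best


# Body a[i] + 1 over i in [0, 4).
single = max_reduce(
    [("bound_var", 0), ("read", "a", [0]), ("const", 1), ("add", 1, 2)],
    {0: (0, 4)},
    {"a": [3, 7, 2, 5]},
)

# Body b[i][j] over i in [0, 2), j in [0, 3).
double = max_reduce(
    [("bound_var", 0), ("bound_var", 1), ("read", "b", [0, 1])],
    {0: (0, 2), 1: (0, 3)},
    {"b": [[1, 2, 3], [9, 0, 4]]},
)
```

The serial loop corresponds most closely to the LLVM single-thread walk; the SPIR-V shader would split the same index space across threads and combine partial results with an atomic max.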

3. Launch sequencing

| Backend | File | Helper |
| --- | --- | --- |
| SPIR-V | quadrants/runtime/gfx/adstack_max_reducer_launch.cpp | GfxRuntime::dispatch_max_reducers(...) |
| LLVM | quadrants/runtime/llvm/llvm_adstack_lazy_claim.cpp | LlvmRuntimeExecutor::dispatch_max_reducers_for_tasks(...) (overload taking std::vector<OffloadedTask>; per-arch launchers in runtime/cpu/, runtime/cuda/, runtime/amdgpu/ call into it as a one-liner) |

Both helpers share a level-based round dispatch:

  1. Pass 1 - cache lookup keyed by (registry_id, stack_id, mor_node_idx) packed into a single uint64_t via pack_max_reducer_key in adstack/max_reducer.cpp. Hits drop straight into the result map; misses go to the pending list with back-references to the source SerializedSizeExpr and StaticAdStackMaxReducerSpec.
  2. Per-round prepare + dispatch. Each round picks every undispatched spec whose dependent_mor_node_idxs are all already in the result map (cache hits + earlier rounds), substitutes those values into the working tree via substitute_precomputed_max_over_range, host-evaluates begin / end against the substituted tree, encodes the body bytecode, and dispatches the round as one cmdlist (gfx) / one batched runtime-function call sequence (LLVM). Most kernels finish in one round; nested patterns (e.g. an outer MaxOverRange whose end contains a captured inner max-of-array) take one round per dependency depth. A no-progress round drops every remaining pending spec and falls back to the per-task sizer's cap-hit path.
  3. Per-round readback. Read u32 output slots (gfx) or i64 output slots (LLVM) at round-local indices, fall back to host-eval on overflow specs (SPIR-V; the host walks the substituted tree so already-resolved deps are folded in), record into AdStackCache::record_max_reducer_eval so the next launch can short-circuit. The recorded read observations come from populate_max_reducer_body_observations which snapshots observed_devalloc + observed_gen (ndarray) and snode_write_gen (field) so a host-side mutation of either source invalidates the cache cleanly.
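The level-based scheduling in the three steps above can be sketched as a small dependency-driven loop. Everything here is a simplified model: specs are reduced to a mapping from key to dependency keys, `run_spec` is a hypothetical callback standing in for prepare + dispatch + readback, and the no-progress case just returns the leftover keys instead of invoking a fallback sizer.

```python
def dispatch_in_rounds(specs, cached, run_spec):
    """specs: {key: set of dependency keys}; cached: {key: value}.

    Returns (results, rounds, dropped), where rounds lists the keys
    dispatched together and dropped lists specs that never became ready.
    """
    results = dict(cached)  # pass 1: cache hits go straight into the result map
    pending = {k: deps for k, deps in specs.items() if k not in results}
    rounds = []
    while pending:
        ready = [k for k, deps in pending.items() if all(d in results for d in deps)]
        if not ready:
            break  # no-progress round: leave the rest to the fallback path
        # Specs in one round only see results from cache hits and earlier rounds.
        round_values = {k: run_spec(k, results) for k in ready}
        results.update(round_values)
        for k in ready:
            del pending[k]
        rounds.append(sorted(ready))
    return results, rounds, sorted(pending)


specs = {"inner": set(), "outer": {"inner"}, "stuck": {"missing"}}
results, rounds, dropped = dispatch_in_rounds(
    specs,
    cached={},
    run_spec=lambda key, res: 10 if key == "inner" else res["inner"] + 1,
)
```

In this example the nested pattern takes two rounds (`inner`, then `outer`), and `stuck` is dropped because its dependency never resolves.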

The dispatch must precede publish_adstack_metadata_spirv (gfx) / publish_adstack_metadata (LLVM) so the substituted Consts are in place before the sizer eval pipeline runs.

On Apple Silicon Metal the body interpreter loads ndarray data buffers and SNode tree root buffers via PSB (raw bufferDeviceAddress), bypassing the descriptor-bound resource tracking, so the gfx launcher calls track_physical_buffer(...) once per cmdlist for every ndarray_alloc and every root_buffers_ entry (the useResource: hint Metal needs to mark those buffers resident for the dispatch).

4. Substitution into per-stack trees

quadrants/program/adstack/max_reducer.cpp::substitute_precomputed_max_over_range(expr, registry_id, stack_id, results) walks expr.nodes and replaces every captured MaxOverRange whose key is in results with a Const(dispatched_value). Empty-input fast path: when no captured spec matches, returns expr unchanged with no allocation.
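A minimal model of the substitution, including the empty-input fast path, is shown below. The bit layout in `pack_key` (32 / 16 / 16 bits) is purely hypothetical, since the description only says the triple is packed into a single uint64_t; the tuple-based node list is likewise a stand-in for SerializedSizeExpr.

```python
def pack_key(registry_id, stack_id, mor_node_idx):
    """Pack the triple into one 64-bit integer (hypothetical 32/16/16 bit layout)."""
    assert 0 <= registry_id < 1 << 32
    assert 0 <= stack_id < 1 << 16
    assert 0 <= mor_node_idx < 1 << 16
    return (registry_id << 32) | (stack_id << 16) | mor_node_idx


def substitute(nodes, registry_id, stack_id, results):
    """Replace each captured MaxOverRange whose key is in results with a Const."""
    hits = {
        i
        for i, node in enumerate(nodes)
        if node[0] == "MaxOverRange" and pack_key(registry_id, stack_id, i) in results
    }
    if not hits:
        return nodes  # fast path: same object back, no copy
    return [
        ("Const", results[pack_key(registry_id, stack_id, i)]) if i in hits else node
        for i, node in enumerate(nodes)
    ]


tree = [("Const", 1), ("MaxOverRange", 0, 5, 0)]
substituted = substitute(tree, 7, 0, {pack_key(7, 0, 1): 42})
untouched = substitute(tree, 7, 0, {})
```

Returning the original object on the fast path is what lets a pointer-keyed cache stay valid for kernels that never trigger the recognizer.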

Three eval paths consume the substituted tree:

  • Host fast path (eval_per_task_metadata_on_host in runtime/gfx/adstack_sizer_launch.cpp; LLVM host-eval branch in llvm_adstack_lazy_claim.cpp). The host evaluator's pointer-keyed size_expr_cache_ cannot accept a stack-local substituted tree (a transient stack address would alias unrelated cache entries across launches and return wrong cached values), so the substitution-active branch routes through a dedicated evaluate_adstack_size_expr_no_cache(...) variant; the empty-results fast path keeps the live a.size_expr reference and the cache stays warm for kernels that never trigger the recognizer.
  • SPIR-V on-device sizer encoder (encode_adstack_size_expr_device_bytecode_for_spirv). The encoder walks the substituted tree where each captured MaxOverRange is already a Const, so the body's ExternalTensorRead / FieldLoad leaves are not in the encoder's reads list; AdStackCache::lookup_max_reducer_reads(...) returns the recorded body observations for each captured spec, and the encoder appends them to its reads list before recording into spirv_bytecode_cache_. A mutation to the gating ndarray / field then invalidates the cached bytecode via the same gen-counter replay path the existing per-task metadata cache uses.
  • LLVM device sizer encoder (encode_adstack_size_expr_device_bytecode). Same substitution; same downstream llvm_per_task_ad_stack_cache_ machinery.

5. Cap-hit tripwires (1<<24)

The 1<<24 per-task sizer cap is structurally unreachable for max-reducer-recognized shapes (those are dispatched in parallel and substituted to Const before the sizer walks). It is reachable only for out-of-grammar shapes whose iteration count exceeds the cap. Three explicit tripwires:

| Site | Mechanism |
| --- | --- |
| Host evaluator (evaluate_node) | Existing hard QD_ERROR_IF; surfaces as RuntimeError to Python on the CPU host fast path. |
| SPIR-V on-device sizer (adstack_sizer_shader.cpp) | Metadata buffer layout grew a trailing u32 overflow-flag slot at index 2 + 2*n_stacks. The shader writes 1 there on end - begin > cap, and clamps effective_end = begin so the walk stays bounded. The host post-readback in publish_adstack_metadata_spirv raises QD_ERROR_IF when the slot is non-zero. |
| LLVM device sizer (device_eval_node) | Cap-hit short-circuit: kMaxOverRange returns 0 immediately on end - begin > cap to keep the single-thread on-device dispatch within the driver's TDR window. The cap-hit then surfaces indirectly through the existing stack_push overflow infrastructure on the subsequent main-kernel launch. The diagnostic message attribution depends on the kernel layout. |
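The SPIR-V tripwire amounts to a flag slot written on the device and checked on the host. The sketch below models that handshake with a Python list as the metadata buffer; only the slot index formula (2 + 2*n_stacks) and the "set flag, walk an empty range" behaviour come from the table, while the function names and the exception type are placeholders.

```python
def overflow_slot_index(n_stacks):
    """Trailing overflow-flag slot, per the layout described in the table."""
    return 2 + 2 * n_stacks


def sizer_walk(meta, n_stacks, begin, end, body, cap):
    """Stand-in for the shader's MaxOverRange visit."""
    if end - begin > cap:
        meta[overflow_slot_index(n_stacks)] = 1  # record the cap hit
        end = begin                              # empty range keeps the walk bounded
    return max((body(v) for v in range(begin, end)), default=0)


def host_check(meta, n_stacks):
    """Stand-in for the post-readback check on the host."""
    if meta[overflow_slot_index(n_stacks)] != 0:
        raise AssertionError("adstack sizer hit the iteration cap")


n_stacks = 3
meta = [0] * (overflow_slot_index(n_stacks) + 1)
within_cap = sizer_walk(meta, n_stacks, 0, 4, lambda v: v, cap=4)
flag_after_ok_walk = meta[overflow_slot_index(n_stacks)]
over_cap = sizer_walk(meta, n_stacks, 0, 10, lambda v: v, cap=4)
```

After the second walk the flag is set, so `host_check(meta, n_stacks)` raises instead of letting the undersized value reach the heap allocation.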

6. Cache invalidation

The per-spec result cache integrates into the existing AdStackCache four-layer cascade:

  1. try_max_reducer_cache_hit (one entry per captured (registry_id, stack_id, mor_node_idx)). Hit -> no max-reducer dispatch, the cached Const is substituted into the per-stack tree.
  2. try_size_expr_cache_hit (per-SerializedSizeExpr after substitution). Hit -> no per-thread sizer eval call.
  3. try_per_task_ad_stack_cache_hit / try_llvm_per_task_ad_stack_cache_hit (per-task metadata blob). Hit -> no per-task sizer dispatch.
  4. try_spirv_bytecode_cache_hit (per-task bytecode blob). Hit -> no SPIR-V bytecode encode + upload.

In steady state with an unchanged gating source every layer hits and the per-launch overhead of the max-reducer pipeline collapses to zero. A host-side Ndarray.write bumps ndarray_data_gen_; a host-side field write bumps snode_write_gen. Either bump propagates through every layer's gen-counter replay walk and forces a fresh dispatch.

FieldLoadObs records produced by the bound-var FieldLoad encoder path carry indices = {} since the body is evaluated at every cross-product iteration and there is no canonical scalar to re-read; replay_one_observation's FieldLoadObs arm treats the gen counter as the sole staleness signal in that mode and unconditionally invalidates on a gen mismatch.
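The generation-counter idea can be shown with a single cache layer. This is a deliberately reduced model, assuming one source per entry and treating the generation as the only staleness signal (as in the bound-var FieldLoad mode described above); the class names and the `dispatch_count` attribute are illustrative, not the AdStackCache API.

```python
class GenCountedSource:
    """Stand-in for an ndarray or field whose host-side writes bump a generation."""

    def __init__(self, data):
        self.data = list(data)
        self.gen = 0

    def write(self, index, value):
        self.data[index] = value
        self.gen += 1  # every host-side mutation advances the generation


class MaxReducerCache:
    """One cache layer: entries are valid only while the observed generation matches."""

    def __init__(self):
        self.entries = {}        # key -> (value, observed_gen)
        self.dispatch_count = 0  # how many times a real dispatch was needed

    def get(self, key, source):
        hit = self.entries.get(key)
        if hit is not None and hit[1] == source.gen:
            return hit[0]  # generation unchanged: reuse the cached value
        self.dispatch_count += 1
        value = max(source.data, default=0)  # stands in for the max-reducer dispatch
        self.entries[key] = (value, source.gen)
        return value


src = GenCountedSource([3, 7, 2])
cache = MaxReducerCache()
first = cache.get("spec0", src)
second = cache.get("spec0", src)   # cache hit, no new dispatch
src.write(0, 9)                    # host mutation invalidates the entry
third = cache.get("spec0", src)
```

The dispatch counter advances only on the first lookup and after the write, which is the behaviour the dispatch-count regression test below is described as pinning.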

Per-backend coverage matrix

| Backend | Recognized MaxOverRange dispatch | Cap-hit tripwire (out-of-grammar MaxOverRange) |
| --- | --- | --- |
| CPU (LLVM host eval) | Host call to runtime_eval_adstack_max_reduce | evaluate_node QD_ERROR_IF ✓ (raised as RuntimeError) |
| CUDA | LLVM-GPU 1x1x1 kernel ✓ | device_eval_node short-circuit + indirect stack_push overflow |
| AMDGPU | LLVM-GPU 1x1x1 kernel ✓ | same as CUDA |
| Vulkan (native + MoltenVK) | SPIR-V compute shader (u32 atomicMax + atomicOr overflow) ✓ | sizer metadata-trailing slot ✓ (raised as QuadrantsAssertionError) |
| Metal | same as Vulkan ✓ | same as Vulkan ✓ |

Tests - tests/python/test_adstack.py

Six new regression tests, all parametrized over every available backend.

test_max_reducer_pins_stride_for_oversized_axis

Parametrized over a (shape, body_kind) matrix that exercises the recognizer's accepted body grammar (single-axis ETR, ETR + ExternalTensorShape host-fold, closed FieldLoad host-fold, and the Add / Sub / Mul / Max arithmetic combinator). For each shape, the dispatch + substitution produces the correct heap stride and the kernel runs to completion; the above-cap variants additionally pin the contract that a recognized spec can range over an arbitrarily large axis. Uses qd.ndarray rather than numpy passthrough so the device buffer is not capped at backend-specific H2D-blit limits.

test_max_reducer_dispatch_counts_advance_on_input_mutation

Pins the dispatch + cache invalidation pipeline via a new Program.get_max_reducer_dispatch_count / reset_max_reducer_dispatch_count python binding (counter on AdStackCache, bumped at every record_max_reducer_eval). The first launch fires at least one dispatch; a host mutation of the gating ndarray bumps ndarray_data_gen and the next launch re-dispatches.

test_max_reducer_grammar_fallback

A reverse-mode kernel whose inner trip count is a compile-time constant produces no MaxOverRange and the recognizer captures nothing. The dispatch counter stays at zero; the kernel still produces the correct gradient. Pins the contract that any kernel outside the captured grammar runs unchanged so future grammar broadening cannot silently drop the fallback path.

test_max_reducer_field_load_bound_var_dispatch

Eight-variant parametrized test pinning the bound-var-indexed FieldLoad body grammar. Body shapes cover field[i] on its own, field[i] + arr[i] (mixed FieldLoad + ETR via Add), arr[i] + field[i] (commuted), max(field[i], arr[i]), max(field[i], const), max(field[i] + 0, field[i] * 1 - 0) (full arithmetic combinator), and the conservative-wrapper path field[field[i]] / arr[field[i]] (the trip-count builder substitutes MaxOverRange(var, 0, leaf_snode.shape, body=Load(snode, [var])) for any nested-load index that does not reduce to a single bound-var or const). Across all variants the body's max value over the indexed range is N_X and the gradient assertion is uniform.

test_max_reducer_field_load_bound_var_cache_invalidates_on_snode_mutation

Pins the cache invalidation contract for the bound-var FieldLoad body path: the encoder pushes a FieldLoadObs keyed on the snode's write generation, mutating field_a[M-1] from Python bumps snode_write_gen, and the next launch redispatches.

test_above_cap_out_of_grammar_kernel_raises

A kernel whose inner range bound is an i64 ndarray read fails the recognizer's dtype restriction (the INT64_MIN cache-revalidation sentinel is unreachable for sub-i64 dtypes; for i64 a mutated cell could legally hold the sentinel and false-hit on revalidation). The whole spec is dropped and the per-task sizer walks the outer MaxOverRange itself. With a.shape[0] > 1<<24 the cap fires on every adstack-sizer eval path: RuntimeError from the host evaluator on CPU, QuadrantsAssertionError from the SPIR-V on-device sizer on Metal / Vulkan, and an indirect raise via stack_push overflow on CUDA / AMDGPU LLVM-GPU.

Side-effect audit

| Concern | Where checked | Verdict |
| --- | --- | --- |
| Offline cache key (per-task attribs) | StaticAdStackMaxReducerSpec round-trips via QD_IO_DEF; max_reducer_specs added to the QD_IO_DEF of the SPIR-V AdStackSizingAttribs and the LLVM AdStackSizingInfo | Auto-covered |
| size_expr_cache_ pointer aliasing | evaluate_adstack_size_expr_no_cache for the substitution-active branch only | Direct fix |
| spirv_bytecode_cache_ observations | lookup_max_reducer_reads accessor + encoder appends body reads to the cache entry's read list | Direct fix |
| per_task_ad_stack_cache_ deps | collect_size_expr_dep_keys walks the original tree (pre-substitution) so body reads' arg-ids are still tracked | Auto-covered |
| Encoder enum translation | encode_max_reducer_body_bytecode maps SizeExpr::Kind -> AdStackSizeExprDeviceKind per kind explicitly | Direct fix (matches the per-task-sizer encoder's existing pattern) |
| Inter-spec dependency ordering | Round-based dispatch picks specs whose dependent_mor_node_idxs are all resolved; substitutes earlier-round results into the working tree before host-evaluating begin / end and encoding the body | Direct fix (replaces a prior single-pass dispatch that walked through unresolved nested MORs) |
| Apple Silicon Metal PSB residency | track_physical_buffer called once per cmdlist on every ndarray data buffer and every root_buffers_ SNode tree root buffer (covers both kExternalTensorRead and kFieldLoad body leaves) | Direct fix |
| Hard cap-gate at publish_adstack_metadata_spirv | QD_ERROR_IF(!spirv_has_physical_storage_buffer ...) + QD_ERROR_IF(!spirv_has_int64 ...) at the entry; drops redundant per-helper cap gates | New gate, no backend regression - Vulkan 1.3 promotes both caps into core, Metal Tier 2 advertises both |
| kElementsPerThread shader strided iteration | num_workgroups_x = ceil(length / (kAdStackMaxReducerWorkgroupSize * kElementsPerThread)), capped at 65535 in the launcher | Covers spec lengths up to ~536M elements per dispatch |
| Per-launch dispatch cost in steady state | Four-layer cache cascade short-circuits when neither ndarray_data_gen_ nor snode_write_gen advances | Zero per-launch overhead in cache-warm steady state |
| Out-of-grammar shapes | Recognizer skips silently; per-task sizer falls through; cap-hit tripwires raise hard errors | No silent gradient corruption on any backend with explicit tripwire support |

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f95788e1a4


Comment thread quadrants/runtime/gfx/runtime.cpp Outdated
Comment thread quadrants/program/adstack_size_expr_eval.cpp Outdated
Comment thread quadrants/python/export_lang.cpp Outdated
@duburcqa

duburcqa commented May 6, 2026

Contributor Author

@claude review

@duburcqa duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch from 23a7daf to eaf4ba9 Compare May 6, 2026 17:17
@duburcqa duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch from eaf4ba9 to 2ef85c1 Compare May 6, 2026 18:40
@duburcqa duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch 2 times, most recently from 75800cc to e80f9dd Compare May 6, 2026 19:37
@duburcqa duburcqa changed the title [AutoDiff] Adstack max-reducer: parallel MaxOverRange dispatch with 1<<24 cap-hit tripwires [AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch + bound-var FieldLoad with cap-hit tripwires May 7, 2026
@hughperkins

hughperkins commented May 7, 2026

Collaborator

This file is getting big. Thoughts?

[Image: Screenshot 2026-05-07 at 08 35 45]

Comment thread docs/source/user_guide/autodiff.md Outdated
- *Sizer under-estimated the bound (Quadrants bug).* On unusually intricate nested loops - typically deeply nested `for i in range(arr[...])` with cumulative-index arithmetic - the sizer can compute a bound that is mathematically tighter than the actual push count. To file a bug: clear `/tmp/ir/`, rerun your script with `QD_DUMP_IR=1` set in the environment so Quadrants dumps the kernel IR there, then open an issue on the Quadrants repo with the contents of `/tmp/ir/` attached as a zip. Workaround: pass a generous `ad_stack_size=N` to `qd.init()` with `N` large enough to cover the real push count (bypasses the sizer).
- **Out-of-memory before the kernel even runs.** A reverse pass through many loop-carried variables at a large ndrange can ask the runtime for more adstack memory than the device can physically back, even when the sizer's number is correct. Surfaces as an allocator OOM at launch time. Remedies are the ones listed under *Avoiding OOM on GPU* above: fewer loop-carried variables, a smaller ndrange, manual checkpointing, or more device-memory headroom.
- **Loop bounds backed by a mutated ndarray.** A reverse-mode kernel with `for i in range(n[j])` requires `n[j]` to hold the same value at the forward call and at `.grad()`. If anything writes to `n[j]` between those two points - the differentiable kernel itself, or any other kernel call - the backward call will trigger an `Adstack overflow` exception or the computed gradient would come out silently wrong. The safe rule: populate loop-bound ndarrays before the forward call and leave them untouched until `.grad()` returns. The reason for that is Quadrants' adstack sizer design: it reads the loop bound separately at each dispatch, which includes forward and backward calls. Tape-based eager AD like [PyTorch's autograd](https://pytorch.org/docs/stable/notes/autograd.html) is not affected, since the trip count is recorded as the forward runs and reused at backward time.
- **Inner reverse-mode loop with a complex bound at very large extent.** An arbitrarily large enclosing range works only when the inner trip count fits a fixed subset of expressions; other shapes cap at ~16 million enclosing iterations and raise `RuntimeError: ... iteration count ... exceeds the 16777216 guard` past that. Workaround: rewrite the trip count to stay within the supported subset, or shrink the enclosing loop below the threshold.

Collaborator

can you give an example of a 'reverse-mode loop with a complex bound at very large extent'?

Note: I feel this probably deserves its own section, rather than a bullet points, since this is very dense, and contains multiple very dense child bullet points.

Contributor Author

> can you give an example of a 'reverse-mode loop with a complex bound at very large extent'?

`for j in range(arr[i // 2])` with `arr[0] > (1 << 24)`. Nothing more.

> Note: I feel this probably deserves its own section, rather than a bullet point, since this is very dense, and contains multiple very dense child bullet points.

This bullet should be fairly simple. Maybe I should remove details that are just confusing?

@hughperkins hughperkins May 7, 2026

Collaborator

I feel these bullets are all pretty long tbh. I'm not sure if they gradually grew over time, 'boiled frog' style?

My hunch is that it might be better to reformat more in the style of an 'FAQ', with a subsection heading for each current bullet point.

Contributor Author

Understood. I can do this.

@duburcqa

duburcqa commented May 7, 2026

Contributor Author

> This file is getting big. Thoughts? quadrants/runtime/llvm/runtime_module/runtime.cpp

I'm not in favour of refactoring this file in this PR. It is a central part of Quadrants' kernel launch orchestration. Could be nice to refactor it though.

@hughperkins

Collaborator

> > This file is getting big. Thoughts? quadrants/runtime/llvm/runtime_module/runtime.cpp
>
> I'm not in favour of refactoring this file in this PR. It is a central part of Quadrants' kernel launch orchestration. Could be nice to refactor it though.

It is a central part of Quadrants' kernel launch orchestration, yes.

The ask is not to refactor the file en masse, but to find a way to move autodiff-specific things outside of it (at least, the new autodiff-related things you are adding in this pr). Please.

@duburcqa

duburcqa commented May 7, 2026

Contributor Author

> The ask is not to refactor the file en masse, but to find a way to move autodiff-specific things outside of it (at least, the new autodiff-related things you are adding in this pr). Please.

Ok I will refactor this file.

@duburcqa duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch 2 times, most recently from 95cdf06 to f3deef7 Compare May 7, 2026 18:24
Comment thread docs/source/user_guide/autodiff.md Outdated

### Inner reverse-mode loop with a complex bound at very large extent

An arbitrarily large enclosing range works only when the inner trip count fits a fixed subset of expressions; other shapes cap at ~16 million enclosing iterations and raise `RuntimeError: ... iteration count ... exceeds the 16777216 guard` past that.

Collaborator

Can you give an example of what this means? When I read this, things my brain stumbles on:

  • "enclosing range"
    • I assume this means some kind of for loop over range, but I have to think about it
  • 'inner trip count'
    • inner, I suppose means an inner loop
    • count, counts something, but not iterations, but ... trips
    • not sure what a 'trip' is
  • 'shapes' again is not a term I'm familiar with in this context
  • 'enclosing iterations'
    • does this mean the iterations of the 'enclosing range'
    • the iterations of the 'inner' loop?
    • something else?

I think it would be nice to have an example that illustrates clearly what this is talking about.

Comment thread docs/source/user_guide/autodiff.md Outdated

An arbitrarily large enclosing range works only when the inner trip count fits a fixed subset of expressions; other shapes cap at ~16 million enclosing iterations and raise `RuntimeError: ... iteration count ... exceeds the 16777216 guard` past that.

Two categories of bound expression:

Collaborator

You haven't used 'bound expression' in this subsubsection yet. Not sure what it refers to. Again, would be nice if the example showed this I feel.

(I kind of feel maybe this deserves its own section, outside of 'what can go wrong' potentially. Or... if you've explained all the above concepts before, then maybe refer me back to a concise definition of each concept earlier in the readme, perhaps?)

Comment thread docs/source/user_guide/autodiff.md Outdated
Two categories of bound expression:

- *Works at any enclosing-range size:* integer ndarray reads up to 32 bits wide (single- or multi-axis, indexed by literal constants or enclosing loop variables), field reads of the same width indexed by literal constants or enclosing loop variables (`my_field[None]`, `my_field[k]` for a constant `k`, `my_field[i]` where `i` is an enclosing loop variable), `arr.shape[k]` shape terms, literal integer constants, and `+`, `-`, `*`, `max` of those.
- *Caps at the threshold:* 64-bit integer ndarray or field reads, arithmetic-indexed reads (`arr[i // 2]`, `arr[i % 4]`), and ragged inner ranges whose own bound depends on an enclosing loop variable through an unsupported leaf shape.
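
As a rough illustration of the two categories above, the following plain-Python sketch models a bound expression as a small tree and checks it against the listed rules. Every name here (`Const`, `LoopVar`, `Shape`, `Read`, `BinOp`, `fits_any_extent`) is hypothetical and does not correspond to Quadrants internals; treating anything not listed, such as a bare loop variable, as capped is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class LoopVar:
    name: str


@dataclass(frozen=True)
class Shape:
    array: str
    axis: int


@dataclass(frozen=True)
class Read:
    source: str                  # ndarray or field name
    bits: int                    # integer width of the element type
    indices: Tuple["Expr", ...]  # empty tuple models my_field[None]


@dataclass(frozen=True)
class BinOp:
    op: str                      # "+", "-", "*", "max", "//", "%", ...
    lhs: "Expr"
    rhs: "Expr"


Expr = Union[Const, LoopVar, Shape, Read, BinOp]


def fits_any_extent(e: Expr) -> bool:
    """True if e uses only the leaves and combinators listed as uncapped."""
    if isinstance(e, (Const, Shape)):
        return True
    if isinstance(e, Read):
        # Up to 32 bits, indexed only by literals or enclosing loop variables.
        return e.bits <= 32 and all(
            isinstance(ix, (Const, LoopVar)) for ix in e.indices
        )
    if isinstance(e, BinOp):
        return (
            e.op in {"+", "-", "*", "max"}
            and fits_any_extent(e.lhs)
            and fits_any_extent(e.rhs)
        )
    return False  # assumption: everything else falls in the capped category


i = LoopVar("i")
plain_read = Read("arr", 32, (i,))                    # arr[i]
wide_read = Read("arr", 64, (i,))                     # 64-bit read
arith_read = Read("arr", 32, (BinOp("//", i, Const(2)),))  # arr[i // 2]
```

Under these assumptions `plain_read` lands in the first category, while `wide_read` and `arith_read` land in the second.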

Collaborator

which threshold? Again, no mention of threshold in this sub sub section.

Collaborator

also, what is 'leaf shape'? (again, feel free to link back to where it's concisely explained potentially)

@duburcqa duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch from 64ec62a to f19244c Compare May 8, 2026 09:40
@github-actions

github-actions Bot commented May 8, 2026

@github-actions

github-actions Bot commented May 8, 2026

@github-actions

github-actions Bot commented May 8, 2026

@hughperkins

Collaborator

checklist:
[x] files look reasonably sized
[x] sampled several .cpp files, and there existed matching .h files (I really need to improve the file check job to show these for me)
[x] genesis benchmarks ~neutral
[x] genesis unit tests passing
[ ] (need to check docs)

Comment thread docs/source/user_guide/autodiff.md Outdated
...
```

The enclosed loop's iteration count `arr[i // 2]` is what we call the enclosed loop's *bound expression*. Reverse-mode autodiff needs an upper bound on how many times the enclosed loop body executes across the whole kernel. To do so, the compiler analyses the bound expression at launch time by taking one of the two evaluation paths based on its structure:

Collaborator

Super nice, up to and including this paragraph. Thank you 🙌 Very clear to me :)

@duburcqa duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch from bf63707 to 6a03cdd Compare May 8, 2026 12:09
Comment thread docs/source/user_guide/autodiff.md Outdated

The enclosed loop's iteration count `arr[i // 2]` is what we call the enclosed loop's *bound expression*. Reverse-mode autodiff needs an upper bound on how many times the enclosed loop body executes across the whole kernel. To do so, the compiler analyses the bound expression at launch time by taking one of the two evaluation paths based on its structure:

- **Parallel:** integer ndarray reads up to 32 bits wide, single- or multi-axis, indexed by literal constants or outer loop variables are evaluated in parallel. Field reads of the same width and the same indexing rules apply: `my_field[None]`, `my_field[k]` for a constant `k`, or `my_field[i]` where `i` is an outer loop variable. The shape term `arr.shape[k]`. Literal integer constants. And any `+`, `-`, `*`, `max` of those. The outer loop can run any number of iterations.

Collaborator

I got lost after 'parallel'. I know what the word 'parallel' means. But I don't know what it means in this context.

  • does it mean there are multiple enclosed loops running in parallel?
  • does it mean the outer loop is running in parallel?
  • does it mean that the enclosed loop runs in parallel? (Probably not, but it's not certain to me, and anyway, even if I can probably reject this option, I still had to think about it in order to probably reject it.)

Please could you, after 'parallel', describe what is meant by 'parallel' in this context?

Ditto for 'sequential'.

I would suggest having the bullet points only define what 'parallel' and 'sequential' mean in this context, then underneath the bullet points have a paragraph for each approach, stating the various things you are stating.

@duburcqa duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch from 6a03cdd to f19244c Compare May 8, 2026 12:12
@github-actions

github-actions Bot commented May 8, 2026

@github-actions

github-actions Bot commented May 8, 2026

@duburcqa duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch from b5af10c to acf1104 Compare May 8, 2026 13:27
Comment thread docs/source/user_guide/autodiff.md Outdated

Reverse-mode autodiff needs the worst-case inner-loop trip count to size the adstack, which is allocated per outer-loop iteration. In this example each outer iteration must accommodate up to 5 inner pushes - the maximum the bound expression takes across all `i`. Quadrants computes that maximum at launch time and uses it to size the adstack. With deeper loop nests each enclosed loop's bound expression is reduced separately and the adstack is sized as the product of those maxes.
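
The sizing rule in the paragraph above can be sketched in plain Python. This is only a model of the arithmetic, not of how Quadrants implements it: `max_trip_count`, the sample array, and the second loop's maximum of 3 are all hypothetical.

```python
from math import prod


def max_trip_count(bound, extent):
    """Largest value the bound expression takes over range(extent)."""
    return max((bound(i) for i in range(extent)), default=0)


arr = [2, 5, 1, 3]  # hypothetical inner trip counts, one per outer iteration

# Single enclosed loop: each outer iteration needs room for the worst case.
inner_max = max_trip_count(lambda i: arr[i], len(arr))

# Deeper nest: one maximum per enclosed loop, combined as a product.
per_loop_maxes = [inner_max, 3]  # 3 is a made-up maximum for a second loop
entries_per_outer_iteration = prod(per_loop_maxes)
```

With these numbers the single-loop case needs room for 5 pushes per outer iteration, and the two-loop case for 5 * 3 = 15.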

The compiler picks one of two evaluation paths to compute the maximum based on the bound expression's structure:

Collaborator

"The compiler creates a separate kernel which will be run at runtime to compute the maximum. This kernel is called the 'bounds kernel'. There are two bounds kernels available. The parallel bounds kernel is faster, but only supports some loop types. For other loop types, a serial bounds kernel is used, which is slower, but more general.

"The following table shows the loop types compatible with the fast parallel bounds kernel:

| loop type | loop type description |
| --- | --- |
| integer fields | |
| ndarray reads up to 32 bits | |
| single or multi-axis ndarrays | |

(ok this table seems odd. Are these constraints overlapping, like a Venn diagram, and 'AND'ed together? They don't read like an orthogonal set of possible loop types to me? I think this needs a bit of clarification too.)

@duburcqa duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch 2 times, most recently from 3a69d60 to 5f99e0f Compare May 8, 2026 14:26

### Nested loops

Quadrants supports arbitrarily nested loops. When the bound expression itself contains another enclosed loop whose own bound expression must be reduced first, the enclosing bound expression takes the parallel path only if every nested bound expression also fits the parallel-path grammar; otherwise it falls back to the sequential walk. This keeps the runtime from mixing parallel and sequential evaluators inside a single bound expression, which would otherwise force per-iteration kernel launches.

Collaborator

This is good I feel 🙌

Comment thread docs/source/user_guide/autodiff.md Outdated

In the example above, the iteration count of the enclosed loop takes the sequential path because of the `i // 2` index. As such, it would raise at launch if `arr.shape[0] > (1 << 24)`.
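
A minimal sketch of what a guarded sequential walk could look like, assuming the guard compares the outer extent against `1 << 24` before iterating. The function name `sequential_max`, its `guard` parameter, and the error text are hypothetical; the real message is only partially quoted in this thread and is not reproduced here.

```python
GUARD = 1 << 24  # 16777216, the limit quoted in the docs text


def sequential_max(bound, extent, guard=GUARD):
    """Walk range(extent) one index at a time and return the largest bound."""
    if extent > guard:
        # Illustrative message only; not the wording Quadrants uses.
        raise RuntimeError(f"extent {extent} is above the {guard} guard")
    best = 0
    for i in range(extent):
        best = max(best, bound(i))
    return best


arr = [1, 5, 2, 4]  # hypothetical data
worst = sequential_max(lambda i: arr[i // 2], len(arr))
```

Passing a small `guard` makes the failure mode easy to reproduce without allocating a 16-million-element array.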

### Workaround

Collaborator

workaround for what?


### Inner reverse-mode loop with a complex bound at very large extent

A reverse-mode kernel with two nested loops is in some cases limited to an outer-loop extent of at most `1 << 24`. In particular when the enclosed loop's trip count is an uncommon expression of the outer-loop variable, e.g. `for i in range(arr.shape[0]): ... for j in range(arr[i // 2]):`. See [Appendix C](#appendix-c-evaluation-of-the-enclosed-loops-bound-expression) for a complete walkthrough of the enclosed loop's bound expression and workarounds. When the limit applies and the outer extent exceeds it, the kernel raises `RuntimeError: ... iteration count ... exceeds the 16777216 guard` at launch.

Collaborator

I can't help suspecting this is very likely the original bullet point, just with a reference to Appendix C added :) However, having the link to Appendix C does reduce the burden of being easily understandable, I feel. So, ok :)

Contributor Author

It is not!

@duburcqa duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch from 36d4173 to 3eaef5e Compare May 8, 2026 14:45
@hughperkins

Collaborator

Updated checklist:
[x] Docs look good to me

=> ok to merge

@duburcqa duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch from 3eaef5e to 5f99e0f Compare May 8, 2026 14:49
…to answer Hugh's review: explain what 'parallel' / 'sequential' mean
@duburcqa duburcqa force-pushed the duburcqa/adstack_max_reducer_shader branch from 5f99e0f to 9ca862f Compare May 8, 2026 14:54
@github-actions

github-actions Bot commented May 8, 2026

@github-actions

github-actions Bot commented May 8, 2026

@duburcqa duburcqa merged commit 14bd3a9 into main May 8, 2026
79 of 80 checks passed
@duburcqa duburcqa deleted the duburcqa/adstack_max_reducer_shader branch May 8, 2026 18:03
npoulad1 added a commit to ROCm/quadrants that referenced this pull request Jun 8, 2026
* [Misc] Warn user to disable caching when print_ir/QD_DUMP_IR enabled (Genesis-Embodied-AI#425)

Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>

* [Build] Pin torch version to CUDA 12.8 for CUDA tests (Genesis-Embodied-AI#428)

* [Misc] Fixing up taichi-dev urls (Genesis-Embodied-AI#429)

* [Perf] Rename cuda_graph to gpu_graph across the codebase (Genesis-Embodied-AI#430)

* Misc: fix typo integeral -> integral (Genesis-Embodied-AI#434)

Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>

* [Perf] CUDA graph 4: call from multiple locations (Genesis-Embodied-AI#420)

* [Bug] Fix fastcache not restoring graph_do_while_arg (Genesis-Embodied-AI#435)

* [Perf] Cache last-call result in perf_dispatch for single-compatible case (Genesis-Embodied-AI#438)

* Fix gpu_graph fallback on old Nvidia GPU. (Genesis-Embodied-AI#443)

* Fix shared memory offset not reset between CUDA kernels. (Genesis-Embodied-AI#442)

* [Misc] Allow disabling GPU graph via QD_GPU_GRAPH=0 env var (Genesis-Embodied-AI#439)

* [Misc] Add named top-level loops (Genesis-Embodied-AI#440)

* [Misc] Rename gpu_graph to graph (Genesis-Embodied-AI#446)

* [Misc] Add cross-platform shuffle (Genesis-Embodied-AI#447)

* [Bug] Fix graph_do_while on Windows: search for cudadevrt.lib (Genesis-Embodied-AI#456)

* [Bug] Also search default CUDA toolkit install location on Windows (Genesis-Embodied-AI#461)

* [SPIRV] Feature Parity Atomics & Shared Array (Genesis-Embodied-AI#432)

* [Misc] Change clang format to 120 characters (Genesis-Embodied-AI#463)

* [Misc] CUDA graph 5 Add fatbin (Genesis-Embodied-AI#464)

* [Bug] Reuse VkInstance across init/reset cycles (Genesis-Embodied-AI#465)

* [Perf] Tiles 1: _load, _store, _eye_ (Genesis-Embodied-AI#466)

* [Misc] Remove dead InternalFuncStmt type_check override (Genesis-Embodied-AI#471)

* [Perf] Tiles 2: add cholesky and ger (Genesis-Embodied-AI#472)

* [Perf] Tiles 2b: add triangular solve (Genesis-Embodied-AI#474)

* [Misc] Refactor: use _get_col/_set_col in tiles load/store/init (Genesis-Embodied-AI#475)

* [Build] Fix flaky test_clock_accuracy (Genesis-Embodied-AI#436)

* Fix AARCH64 emitting invalid asm in CUDA kernels. (Genesis-Embodied-AI#473)

Co-authored-by: Hugh Perkins <hughperkins@gmail.com>

* [AMDGPU] Enable HIP memory pool and surface pool-exhaustion errors. (Genesis-Embodied-AI#485)

* [AMDGPU] Scope hsaco tmp dir per-user to avoid collisions. (Genesis-Embodied-AI#484)

* [Perf] Tiles 3: Add slice syntax, qd.outer() and initial doc (Genesis-Embodied-AI#477)

* [AMDGPU] Fix gradient computation. (Genesis-Embodied-AI#486)

* Enable all backends that are supported in unit tests. (Genesis-Embodied-AI#488)

* Fix SPIRV ID overflow for large kernels due to autodiff. (Genesis-Embodied-AI#489)

* [Misc] Fix purity checker to allow accessing constants from quadrants modules (Genesis-Embodied-AI#487)

* [Misc] Increase tolerance for clock monotonic test (Genesis-Embodied-AI#492)

* [CI] Serialize api doc workflow (Genesis-Embodied-AI#494)

* [CI] Increase tolerance for clock test (Genesis-Embodied-AI#506)

* [CI] Increase clock test tolerance to 20% (Genesis-Embodied-AI#509)

* [Perf] Add tensor_type parametrization to tile16 tests (Genesis-Embodied-AI#504)

* [Perf] Tiles 4b: Migrate tiles16 tests to enable fastcache (Genesis-Embodied-AI#505)

* [Perf] Tiles 4c: add Tiles16x16 proxy (Genesis-Embodied-AI#507)

* [Perf] Tiles 4d: Consolidate slice error tests using parametrize (Genesis-Embodied-AI#508)

* [Perf] Tiles 4: add SharedArray slice support (Genesis-Embodied-AI#482)

* [Perf] Tiles 5: add Cholesky benchmark demo (Genesis-Embodied-AI#483)

* [Doc] Add user guide page for subgroup shuffle (Genesis-Embodied-AI#512)

* [Perf] Implement cross-platform shuffle_down (Genesis-Embodied-AI#510)

* [Perf] Add portable subgroup reduce_add and reduce_all_add (Genesis-Embodied-AI#511)

* [Perf] Add first warmup config to perf dispatch (Genesis-Embodied-AI#422)

* [AutoDiff] Autodiff 1: Add baseline adstack regression test for unary_collections (Genesis-Embodied-AI#500)

* [AutoDiff] Autodiff 2: Implement derivative for tan (Genesis-Embodied-AI#501)

* [AutoDiff] Autodiff 3: Recompute tanh/exp on the operand in the reverse pass (Genesis-Embodied-AI#502)

* [AutoDiff] Autodiff 4: Mark rsqrt as non-linear for adstack promotion (Genesis-Embodied-AI#503)

* [AutoDiff] Autodiff 5: Fix adjoint-alloca placement for GlobalLoads outside the current range-for (Genesis-Embodied-AI#496)

* [AutoDiff] Autodiff 6: Adstack regression tests (Genesis-Embodied-AI#491)

* [AutoDiff] Autodiff 7: Fix header size in AdStackAllocaStmt to match u64 runtime layout (Genesis-Embodied-AI#534)

* [AutoDiff] Autodiff 8: Surface LLVM adstack push/pop overflow as a Python exception (Genesis-Embodied-AI#535)

* [AutoDiff] Autodiff 9: Guard against LLVM worker-thread stack overflow from large per-task adstack budget (Genesis-Embodied-AI#495)

* [AutoDiff] Autodiff 10: Implement adstack for SPIR-V (Genesis-Embodied-AI#490)

* [AutoDiff] Autodiff 11: Latent adstack-adjacent fixes (AMDGPU hipFree, flush() keeps ctx_buffers_, always-preallocate) (Genesis-Embodied-AI#536)

* [Doc] Add AGENTS.md with instructions for AI agents (Genesis-Embodied-AI#541)

* [Bug] Abort kernel execution on assertion failure instead of segfaulting (Genesis-Embodied-AI#419)

* [Type] ndarray typing 1: Add eval_str=True to inspect.signature() calls (Genesis-Embodied-AI#411)

* [CI] Suppress reportPrivateImportUsage in torch-using files (Genesis-Embodied-AI#552)

* [Misc] QD_DUMP_IR dumps to files with the task_id added to the filename (Genesis-Embodied-AI#441)

* [Type] ndarray typing 2: Fix NDArray single-arg subscript crash (Genesis-Embodied-AI#412)

* [Test] Flush xdist channel before worker exit so test failure reports are visible (Genesis-Embodied-AI#555)

* [CI] Reduce test retries on CI from 3 to 1. (Genesis-Embodied-AI#554)

* [AutoDiff] Autodiff 12: Heap-backed adstack on LLVM backends (CPU/CUDA/AMDGPU) (Genesis-Embodied-AI#537)

* [AutoDiff] Autodiff 13: Heap-backed adstack on SPIR-V backends (Metal, Vulkan) (Genesis-Embodied-AI#493)

* [AutoDiff] Autodiff 14: Resolve bounded-inner-loop adstacks without default_ad_stack_size fallback (Genesis-Embodied-AI#539)

* [SPIRV] Vulkan SPIR-V correctness: atomic-view aliasing, PSB stride, narrow storage caps, u1 cast, per-init layer recheck (Genesis-Embodied-AI#513)

* [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch and sanitise MoltenVK cap advertisement (Genesis-Embodied-AI#551)

* [Test] Suppress stock pytest-timeout to avoid conflict with pytest_hardtle (Genesis-Embodied-AI#557)

* [Vulkan] Use SDK validation layer for debugPrintf instead of apt package (Genesis-Embodied-AI#562)

* [Test] Fix flaky perf_dispatch tests by increasing work amounts (Genesis-Embodied-AI#559)

* [Test] Add --maxfail CLI option to run_tests.py (default 20) (Genesis-Embodied-AI#558)

* [CI] Vulkan debug printf fix to address flaky tests (Genesis-Embodied-AI#563)

* [Docs] Add a new page to help for first time contributors (Genesis-Embodied-AI#426)

Authored-by: v01dxyz <v01dxyz@v01d.xyz>

* [AutoDiff] Autodiff 16: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr (Genesis-Embodied-AI#543)

* Fix: raise error if device memory allocation fails (Genesis-Embodied-AI#451) (Genesis-Embodied-AI#453)

Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>

* [CI] Add CI job to check line wrapping of comments and docs (Genesis-Embodied-AI#564)

* [Misc] Add coverage report to PRs, including kernels (Genesis-Embodied-AI#470)

* [CI] CI wrap check feeds only diffs to agent (Genesis-Embodied-AI#567)

* Skip 'flaky' test on MacOS CI. (Genesis-Embodied-AI#573)

* [Test] Fix missing `import sys` in test_fail_device_memory_allocation (Genesis-Embodied-AI#574)

* [CI] Fix Vulkan debugPrintf flake with session-scoped warmup (Genesis-Embodied-AI#571)

* [AutoDiff] determine_ad_stack_size: replace whole-CFG Bellman-Ford with SCC + DAG DP (Genesis-Embodied-AI#575)

* [Test] Fix macOS OOM skip reason to describe actual root cause (Genesis-Embodied-AI#576)

* [Lang] whole_kernel_cse: 2.5x compile time speedup on large kernels (Genesis-Embodied-AI#577)

* [CI] Add CI check for unnecessarily deleted comments (Genesis-Embodied-AI#570)

* [CI] Migrate coverage report to github Check page (Genesis-Embodied-AI#566)

* [Lang] Skip IR verifier between passes unless debug=true (Genesis-Embodied-AI#579)

* [Lang] Inline AdStack ops on release LLVM codegen: dramatically reduces compile time for adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#584)

* [CUDA] Honor offline_cache=False end-to-end so QD_OFFLINE_CACHE=0 actually gives a cold compile (Genesis-Embodied-AI#580)

* [Type] Tensor 24 (Genesis-Embodied-AI#561)

Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>

* [Lang] auto_diff host-walk reductions: dramatically faster front-end compile time on adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#587)

* [AutoDiff] Speed up reverse-mode kernel launches on GPU backends (Genesis-Embodied-AI#578)

* [Vulkan] Move adstack-sizer scratch out of Function-scope memory to fix SPIR-V pipeline build failures (Genesis-Embodied-AI#588)

* [AutoDiff] Improve diagnosis of unsupported reverse-mode AD patterns (Genesis-Embodied-AI#590)

* [Bug] Fix: promote Ndarray to AnyArray in build_Name for flattened struct fields (Genesis-Embodied-AI#592)

* [SPIR-V] Shrink reverse-grad kernel MSL by ~50% (Genesis-Embodied-AI#591)

* [CI] Add CI check that PR changes have test coverage (Genesis-Embodied-AI#596)

* [Perf] Enable zero-copy in to_torch() and to_numpy() (Genesis-Embodied-AI#450)

* Add BufferView: safe sub-range ndarray access for kernels (Genesis-Embodied-AI#585)

Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>

* [Doc] Add user-facing fastcache documentation (Genesis-Embodied-AI#597)

Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>

* [Misc] Upgrade to enable v1 dlpack so to_numpy(copy=False) writable (Genesis-Embodied-AI#598)

Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local>

* [AutoDiff] Cut reverse-mode adstack memory usage 10x on all backends (Genesis-Embodied-AI#599)

* [Misc] Add CI check for feature file factorization (Genesis-Embodied-AI#606)

* [Perf] Skip _recursive_set_args for all-Field frozen dataclass structs (Genesis-Embodied-AI#607)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [AutoDiff] SNode-arm bound-expr capture rejects fold-attack gate indices (Genesis-Embodied-AI#610)

* [Misc] Suppress field fastcache warning for qd.Tensor (Genesis-Embodied-AI#615)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [AutoDiff] Adstack heap: clip reducer count by per-task loop trip count (compile-time and SizeExpr-evaluated) (Genesis-Embodied-AI#611)

* [Misc] Forward copy= through qd.Tensor, add copy=None option (Genesis-Embodied-AI#616)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Doc] Update README (Genesis-Embodied-AI#617)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Fix coverage report showing def lines as uncovered (Genesis-Embodied-AI#623)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Perf] Generic launcher: persistent context, JIT-pointer reuse, Metal compute encoder, LLVM-GPU async memory ops (Part 1/2) (Genesis-Embodied-AI#619)

* [CI] Encode Python-first testing policy in coverage-check prompt (Genesis-Embodied-AI#622)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Add PR Line change report (Genesis-Embodied-AI#624)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Disable quadrants pytest plugin during quadrants internal coverage runs (Genesis-Embodied-AI#629)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [AutoDiff] Adstack load+store eliminations: EliminateRecomputableAdStackPushes pass + leaf extensions (Genesis-Embodied-AI#621)

* [CI] Simplify coverage PR comment to a single linked line (Genesis-Embodied-AI#630)

* [CUDA] Add AGX Thor, SM_110 (Genesis-Embodied-AI#631)

Co-authored-by: Johnny Nunez and Hugh Perkins

* [CI] Lines changed report: collapse PR comment to a single linked totals line (Genesis-Embodied-AI#632)

* [FEATURE] Support external Metal command queue via qd.init (Genesis-Embodied-AI#618)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Perf] Cache adstack-sizer metadata per task across SPIR-V + LLVM-GPU; per-snode / DeviceAllocation invalidation (Part 2/2) (Genesis-Embodied-AI#620)

* [AutoDiff] Disable EliminateRecomputableAdStackPushes pending mutated-SNode chain-leaf fix (Genesis-Embodied-AI#633)

* [AutoDiff] Adstack chain-clone safety: mutated-SNode leaf reject + load_top consumer-aware guard (Genesis-Embodied-AI#634)

* [Docs] Add user-guide page for qd.simt.block.* primitives (Genesis-Embodied-AI#638)

* [Docs] Expand qd.simt.subgroup user-guide page to cover every op (Genesis-Embodied-AI#639)

* [Perf] Streams 1-4 (Genesis-Embodied-AI#410)

* [Docs] Add user-guide page for matrix decompositions and solvers (Genesis-Embodied-AI#643)

* [Bug] Revert "[Perf] Streams 1-4 (Genesis-Embodied-AI#410)" (Genesis-Embodied-AI#650)

* [Docs] Add user-guide page for atomics and bit operations (Genesis-Embodied-AI#640)

* [Docs] Add user-guide page for qd.simt.grid.* primitives (Genesis-Embodied-AI#641)

* [AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch (Genesis-Embodied-AI#635)

* [AMDGPU] Fix amdgpu parallel rand init (Genesis-Embodied-AI#658)

* [Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap (Genesis-Embodied-AI#655)

* [Perf] Re-land Streams 1-4 with bug fixes (Genesis-Embodied-AI#653)

* [AMDGPU] Apply device_memory_GB=0.3 cap to AMDGPU tests (Genesis-Embodied-AI#659)

* [Perf] Per-launch host sync: drop wait_idle on SPIR-V, pin stream and drop stream_synchronize on CUDA/AMDGPU (Genesis-Embodied-AI#654)

* [AMDGPU] Unload hipModule_t in JITModuleAMDGPU destructor (Genesis-Embodied-AI#660)

* [AMDGPU] Trim default mempool on qd.reset() (Genesis-Embodied-AI#669)

* [AMDGPU] Hoist rand-state buffer to process lifetime (Genesis-Embodied-AI#668)

* [Streams] Use events for streams serialization on AMDGPU and CUDA (Genesis-Embodied-AI#667)

* [Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id (Genesis-Embodied-AI#671)

* [SPIR-V] dispatch_max_reducers: register each task with the real kernel name (Genesis-Embodied-AI#675)

* [AutoDiff] Debug-mode field/grad/dual: dtype, layout, and access-time invariants (Genesis-Embodied-AI#677)

* [Docs] Add user-guide page for qd.algorithms.* device-wide algorithms (Genesis-Embodied-AI#642)

Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>

* [Docs] Doc for existing atomics: switch support table to per-backend columns (Genesis-Embodied-AI#657)

Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>

* [GPU] Cross gpu atomics (Genesis-Embodied-AI#666)

Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>

* [GPU] Make block operations portable cross-gpu (Genesis-Embodied-AI#664)

* [Perf] CPU LLVM adstack-cache: skip per-launch bump-writes + ndarray_shapes capture on forward-only handles (Genesis-Embodied-AI#685)

* [GPU] Cross-GPU for grid ops (Genesis-Embodied-AI#670)

* [Math] Make bitop operations portable cross-gpu (Genesis-Embodied-AI#662)

* [AMDGPU] Always use wave64, on both RDNA and CDNA (Genesis-Embodied-AI#687)

* [AMDGPU] Use syncscope("agent") for atomix xor to avoid CAS livelock (Genesis-Embodied-AI#672)

* [GPU] New bit ops for QIPC (Genesis-Embodied-AI#679)

* [GPU] Subgroup ops cross-gpu (Genesis-Embodied-AI#665)

* [Graph] Rename CUDA Graph to Graph in docs (Genesis-Embodied-AI#691)

* [SPIR-V] Fix FIFO-queue ordering when sharing command queue. (Genesis-Embodied-AI#694)

* [Atomics] New QIPC ops for atomics (Genesis-Embodied-AI#690)

* Pass dataclass sub-structs into qd.func (Genesis-Embodied-AI#698)

* [AMDGPU] HIP graph runtime support for @qd.kernel(graph=True) (Genesis-Embodied-AI#692)

* [CI] Add per-file timing report to Mac Metal test job (Genesis-Embodied-AI#695)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Enable kernel disk cache during tests (Genesis-Embodied-AI#696)

* [Math] New QIPC ops for single-threaded linalg (Genesis-Embodied-AI#683)

* [BREAKING][GPU] New QIPC ops for subgroups (Genesis-Embodied-AI#676)

* [GPU] New QIPC ops for block (Genesis-Embodied-AI#684)

* [GPU] New device-level ops for QIPC (Genesis-Embodied-AI#693)

* [algorithms] PrefixSumExecutor: drop unused GRID_SZ local (Genesis-Embodied-AI#701)

* [block] sync(): fix unsupported-arch error message (Genesis-Embodied-AI#700)

* [volatile_load] add qd.volatile_load primitive (closes Genesis-Embodied-AI#648) (Genesis-Embodied-AI#702)

* [AutoDiff] Reject recycled identity_key in AdStackCache::register_adstack_sizing_info (Genesis-Embodied-AI#708)

* [Vulkan] Declare GroupNonUniform SPIR-V caps and enable shaderSubgroupExtendedTypes (Genesis-Embodied-AI#707)

* Fix duplicate HIP graph driver-function declarations after v1.0.0 merge

The amd-integration fork had cherry-picked the HIP graph driver functions
(graph_create / graph_destroy / graph_add_kernel_node / graph_instantiate /
graph_exec_destroy / graph_launch), and upstream v1.0.0 added the same set.
The per-file 3-way merge appended both copies into
amdgpu_driver_functions.inc.h, producing redeclaration errors that broke the
AMDGPU RHI/runtime compile. Drop the upstream duplicate block; the signatures
are identical to the fork's existing declarations.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix AMDGPU launcher coherence and num_instructions visibility after v1.0.0 merge

- kernel_launcher.cpp: the 3-way merge spliced upstream v1.0.0's launch_llvm_kernel
  rewrite (ephemeral arg/context buffers, explicit-stream path, AmdgpuDefaultStream
  PinGuard) onto the AMD fork's kernarg-by-value + persistent-scratch design,
  leaving references to undefined `ephemeral_context_ptr`. Restore the fork's
  coherent launch_llvm_kernel verbatim; it calls the (already merged) enhanced
  launch_offloaded_tasks, which keeps the max-reducer dispatch and stream-parallel
  groups adapted onto the AMD launch path.
- llvm_context.h: both the fork and upstream added `num_instructions`; the merge
  kept upstream's private placement, but the AMDGPU codegen force-inline heuristic
  calls it statically from outside the class. Move it back to the public section.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Restore async result D2H and hoist kernarg vectors in AMDGPU launcher

The v1.0.0 merge resolution regressed two amd-integration baseline
optimizations in launch_llvm_kernel / launch_offloaded_tasks:

  - The per-launch result-buffer copy was a blocking memcpy_device_to_host,
    forcing a host stall on every value-returning launch and serializing the
    GPU pipeline. Restore the async D2H (the caller synchronizes lazily when it
    needs the value); external-array transfers still stream_synchronize once
    before reading back.

  - launch_task constructed the kernarg std::vectors from initializer lists
    ({kernarg_payload} / {kernarg_size}) on every dispatch (heap alloc + free
    per launch). Hoist arg_ptrs/arg_sizes out of the per-task launch and reuse.

Co-authored-by: Cursor <cursoragent@cursor.com>

* amdgpu: default to LDS permlane64 emulation; drop host-x86 barrier asm on retarget

Two AMDGPU JIT-compile crashes surfaced after the v1.0.0 merge pulled in the QIPC subgroup
ops (Genesis-Embodied-AI#676), which made the rigid constraint solver's wave-cooperative reductions route through
`amdgpu_cross_half_shuffle_i32`. Both manifested as a SIGSEGV inside
`llvm::SIInstrInfo::getInstSizeInBytes` during `JITSessionAMDGPU::compile_module_to_hsaco`
(i.e. at first kernel launch), and reproduce on gfx942 / MI300X. Baseline 0.4.6 never emitted
these constructs, which is why it was unaffected.

1. Native `llvm.amdgcn.permlane64` lowering crashes the bundled LLVM 22.1.0 AMDGPU backend.
   Default `amdgpu_permlane64` to the existing LDS-roundtrip software emulation on every target
   (it produces identical results). Add `QD_AMDGPU_USE_NATIVE_PERMLANE64=1` to opt back into the
   native instruction once the backend bug is fixed; the old `QD_AMDGPU_FORCE_PERMLANE64_FALLBACK`
   is now the default and still honored. This is the actual crash fix.

2. The runtime module is compiled by the host x86_64 clang and only retargeted to amdgcn here, so
   `amdgpu_cross_half_shuffle_i32`'s `__asm__ volatile("" : "+v"(byte))` optimization barrier carries
   x86 flag clobbers (`~{dirflag},~{fpsr},~{flags}`) that are meaningless on AMDGPU. The IR verifies
   but the empty-body INLINEASM is invalid on the amdgcn target. Neutralize empty-body barrier asm
   during retarget (forward the tied value, then erase) so no stale host asm reaches codegen. On the
   wave64 targets we ship `ds_bpermute` already addresses the full wave, so the hint is a no-op.

Co-authored-by: Cursor <cursoragent@cursor.com>

* style: apply clang-format (v19.1.7) to AMDGPU fn_attrs and launcher sources

CI pre-commit's clang-format hook reformatted these files (long
declarations/lambda signatures collapsed onto single lines per the repo's
clang-format config). Apply the same formatting so the hook passes.

No functional changes.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(amdgpu): use CreateNeg for branchless i32 sgn instead of CreateSub(0, input)

clang-tidy (modernize-use-nullptr, -warnings-as-errors) flagged
`builder->CreateSub(0, input)` in the i32 sgn path: the literal `0` binds to
the `llvm::Value*` LHS parameter as a null pointer, not an integer zero.
Replace with `builder->CreateNeg(input)`, which emits `0 - input` with a proper
zero constant -- identical intended semantics, and clang-tidy clean.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Robert Dazi <14996868+v01dXYZ@users.noreply.github.com>
Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>
Co-authored-by: Alexis DUBURCQ <alexis.duburcq@gmail.com>
Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>
Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>
Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com>
Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Johnny <johnnynuca14@gmail.com>