[AutoDiff] Adstack load+store eliminations: EliminateRecomputableAdStackPushes pass + leaf extensions #621

Merged: duburcqa merged 6 commits into main from duburcqa/adstack_load_store_eliminations on May 5, 2026
@duburcqa duburcqa commented May 4, 2026

Contributor

Adstack load+store eliminations: EliminateRecomputableAdStackPushes pass + leaf extensions

Standalone perf PR. Applies to every backend's reverse-mode AD; no dependency on the launcher / sizer-cache work in #619 / #620. Single-file change (quadrants/transforms/auto_diff.cpp). Adds a new transform that drops adstack pushes whose value is recomputable from operands the reverse pass can already reach, plus a judger gate that keeps the load+store rule from promoting allocas onto the adstack when none of their enclosing loops is a dynamic RangeFor.

TL;DR

The reverse-mode adstack carries every primal value the backward pass needs to read - one push per loaded value per iteration of the enclosing dynamic loop. For values that are pure functions of inputs the reverse pass can already reach (constants, kernel arguments, top-of-stack reads of an adstack the reverse pass already pops, recomputable global loads), the push is wasted: the backward pass can recompute the value directly from the available operands rather than read it back from the adstack.

This PR adds EliminateRecomputableAdStackPushes as a MakeAdjoint post-pass that walks each adstack's push sites, identifies pushes whose value is reachable from a recomputable interior op chain, and rewrites the matching reverse-pass LoadTopAdj consumers to recompute the value in place instead of reading the adstack. Recomputable leaves are: ConstStmt, ArgLoadStmt, AdStackLoadTopStmt (only on adstacks the reverse pass already pops, and subject to a reverse-position correctness check), and GlobalLoadStmt (for read-only globals; the dominant Genesis FPS gain in this PR comes from inlining these).

A separate gate change (AdStackAllocaJudger) stops the load+store rule from promoting an alloca onto the adstack when the alloca's enclosing loop is statically-unrollable (compile-time bound, no dynamic-RangeFor on the path); those allocas were getting adstack-promoted unnecessarily and paying push/pop traffic the unroll otherwise made redundant.

Why

Profiling Genesis test_differentiable_rigid showed adstack push/pop traffic on dynamic loops was the dominant per-step cost on every GPU backend. A representative reverse-mode kernel pushes a GlobalLoadStmt value 8 inner-iters x 4 outer-iters = 32 times per launch per adstack just to read it back at the same offset on the reverse pass. The value is arr[k] for a read-only arr - the backward kernel can re-read arr[k] directly with one extra GlobalLoad instead of carrying the value through the adstack heap.
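The traffic described above can be illustrated with a toy model. This is only a sketch in plain Python, assuming a list stands in for the adstack and `arr` for the read-only field; the function names are hypothetical and nothing here is Quadrants API.

```python
def reverse_reads_via_adstack(arr, outer=4, inner=8):
    """Forward pass pushes every loaded value; reverse pass pops them back."""
    stack = []
    for _i in range(outer):
        for k in range(inner):
            stack.append(arr[k])  # one push per load per iteration
    pushes = len(stack)
    seen = []
    for _i in reversed(range(outer)):
        for _k in reversed(range(inner)):
            seen.append(stack.pop())  # read the primal value back
    return pushes, seen


def reverse_reads_via_recompute(arr, outer=4, inner=8):
    """Reverse pass re-reads arr[k] directly; nothing is pushed."""
    seen = []
    for _i in reversed(range(outer)):
        for k in reversed(range(inner)):
            seen.append(arr[k])  # one extra load instead of a pop
    return 0, seen


arr = [0.5 * k for k in range(8)]
pushes, from_stack = reverse_reads_via_adstack(arr)
no_pushes, recomputed = reverse_reads_via_recompute(arr)
```

Both variants hand the reverse pass the same sequence of values; the first needs 8 x 4 = 32 pushes to do so, the second needs none because `arr` is never written.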

Cumulatively across all the load-bearing primal reads in a rigid step, this is a ~55% Metal FPS gain and a ~13% CPU FPS gain on test_differentiable_rigid (validated against duburcqa/integration_perf_adstack).

The judger gate is a smaller correctness/perf clarification: the load+store rule was over-firing on allocas whose enclosing scope was statically unrollable, so the unroll path ended up paying adstack push/pop on values that don't need history-tracking at all.

Surface API

Zero user-visible change. Same qd.kernel.grad(...) shape; same gradient values; same correctness contract.

The pass runs unconditionally when adstack is enabled. Removing the pushes it makes dead depends on the downstream simplifier/DCE, which is gated on cfg_optimization (on by default; the same gate as every other reverse-mode optimisation pass).

Mechanism end-to-end

1. EliminateRecomputableAdStackPushes - new pass

quadrants/transforms/auto_diff.cpp. The pass runs as a MakeAdjoint post-step; it walks each AdStackAllocaStmt in the kernel and considers every AdStackPushStmt that targets it as a candidate.

For each push:

  1. Walk the push's value SSA through interior recomputable ops (binary, unary, casts, ternary selects) until every leaf in the chain hits one of the recomputable-leaf categories (Const, ArgLoad, AdStackLoadTop on a reverse-pass-popped adstack, GlobalLoad on a read-only global). If any leaf is non-recomputable the push is left in place.
  2. Reverse-position correctness check for AdStackLoadTopStmt leaves: the leaf's source adstack must be popped by the reverse pass at a position that's earlier or equal in reverse order to the consumer push site. Without this check the recomputation chain could try to load from an already-popped adstack frame.
  3. Backup-SSA DAG-clone fallback: when the recomputation chain references SSA values defined in the forward block but consumed from the reverse block, the pass clones the relevant SSA subtree into the reverse block (with proper interior-op scoping). This handles the common case where the recomputation chain crosses the forward/reverse boundary.
  4. Rewrite consumers: every AdStackLoadTopStmt that was reading the eliminated push is replaced by the cloned recomputation chain. The push itself is then dead and DCE'd in the next simplifier pass.
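The leaf walk in step 1 can be sketched on a toy expression tree. This is a simplified illustration, not the C++ pass: the node classes and `is_recomputable` are hypothetical names, and the reverse-position check from step 2 is assumed to have already produced the `safe_stacks` set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class ArgLoad:
    index: int


@dataclass(frozen=True)
class GlobalLoad:
    name: str


@dataclass(frozen=True)
class StackLoadTop:
    stack: str


@dataclass(frozen=True)
class Binary:
    op: str
    lhs: object
    rhs: object


@dataclass(frozen=True)
class Opaque:
    """Stand-in for any statement the walk does not recognise."""
    label: str


def is_recomputable(node, read_only_globals, safe_stacks):
    """True only if every leaf under `node` is a recomputable leaf."""
    if isinstance(node, (Const, ArgLoad)):
        return True  # always-safe leaves
    if isinstance(node, GlobalLoad):
        return node.name in read_only_globals
    if isinstance(node, StackLoadTop):
        return node.stack in safe_stacks
    if isinstance(node, Binary):  # interior op: recurse into both operands
        return (is_recomputable(node.lhs, read_only_globals, safe_stacks)
                and is_recomputable(node.rhs, read_only_globals, safe_stacks))
    return False  # unknown statement: keep the push


expr = Binary("mul", GlobalLoad("arr"), Binary("add", ArgLoad(0), Const(2.0)))
```

With `arr` in `read_only_globals` the whole chain qualifies; if `arr` is dropped from that set, or any operand is an `Opaque` node, the walk returns False and the push would stay in place.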

2. Recomputable-leaf categories

Two foundational leaves (always-safe):

  • ConstStmt: integer / float literal. Recomputation is trivial.
  • ArgLoadStmt: kernel argument read. The forward and reverse passes share the same arg buffer, so the value is available at recompute time without any extra plumbing.

Two extension leaves (gated on additional checks):

  • AdStackLoadTopStmt (b7f1730e1-equivalent): the leaf's source adstack must be popped by the reverse pass at a reverse-position earlier than or equal to the consumer push site. This is the correctness check above. Without this guard, a chain leaf could try to read from an adstack frame that the reverse pass has already discarded.
  • GlobalLoadStmt (fb5ac6328-equivalent): the global must be read-only across the kernel (no GlobalStoreStmt on the same global anywhere in the kernel). The dominant Metal FPS recovery comes from this leaf - dynamic-RangeFor primal reads of read-only fields stop being adstack-pushed and instead get recomputed from the read-only global on the reverse pass. The read-only check is conservative (any GlobalStoreStmt to the same global anywhere in the kernel disqualifies the leaf, even if the store is in an unrelated block).
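A minimal sketch of the conservative read-only test for the GlobalLoadStmt leaf, assuming the kernel has been flattened to a list of (kind, global name) pairs. The tuple encoding and the function name are hypothetical, not the pass's actual data structures.

```python
def read_only_globals(stmts):
    """Return globals that are loaded somewhere and stored nowhere.

    `stmts` is every load/store in the kernel, regardless of which block
    it sits in, so a store in an unrelated block still disqualifies.
    """
    loaded = {name for kind, name in stmts if kind == "load"}
    stored = {name for kind, name in stmts if kind == "store"}
    return loaded - stored


kernel = [
    ("load", "arr"),
    ("load", "out"),
    ("store", "out"),  # any store anywhere disqualifies "out"
]
eligible = read_only_globals(kernel)
```

Because the scan ignores control flow, it can only err on the side of keeping a push, which matches the conservative behaviour described in the bullet above.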

3. AdStackAllocaJudger gate on dynamic-RangeFor depth

quadrants/transforms/auto_diff.cpp. The AdStackAllocaJudger decides which AllocaStmts should be promoted onto the adstack (i.e. their values pushed/popped per iteration of the enclosing dynamic loop). The previous load+store heuristic flagged any alloca whose value was both stored AND loaded inside any loop body, even if every enclosing loop on the alloca's path was statically unrollable. After unrolling, those allocas are effectively top-level; pushing them is wasted traffic.

The gate now climbs the alloca's enclosing scope and disqualifies the load+store promotion rule when no enclosing scope is a dynamic RangeForStmt (positive trip count not statically known). The alloca remains unpromoted, the unroll path runs as normal, and reverse-mode reads come from the unrolled SSA values directly.
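The scope climb can be sketched as follows. This is an illustration under assumptions: `Scope`, `static_trip_count`, and `should_promote` are hypothetical names, and only the load+store rule is modelled, not any other promotion rule the judger may apply.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scope:
    kind: str                                # "kernel", "block" or "range_for"
    static_trip_count: Optional[int] = None  # None means not statically known
    parent: Optional["Scope"] = None


def has_dynamic_range_for(scope):
    """Climb the enclosing scopes looking for a dynamic range-for."""
    while scope is not None:
        if scope.kind == "range_for" and scope.static_trip_count is None:
            return True
        scope = scope.parent
    return False


def should_promote(stored_and_loaded_in_loop, scope):
    """Load+store rule, gated on at least one dynamic enclosing loop."""
    return stored_and_loaded_in_loop and has_dynamic_range_for(scope)


kernel = Scope("kernel")
static_loop = Scope("range_for", static_trip_count=4, parent=kernel)
dynamic_loop = Scope("range_for", parent=kernel)
static_in_dynamic = Scope("range_for", static_trip_count=8, parent=dynamic_loop)
```

An alloca inside `static_loop` stays unpromoted, while one inside `static_in_dynamic` is still promoted because a dynamic loop sits further up its path.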

Per-backend coverage matrix

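| Mechanism | CPU | CUDA | AMDGPU | Metal | Vulkan |
| --- | --- | --- | --- | --- | --- |
| EliminateRecomputableAdStackPushes (Const + ArgLoad leaves) | yes | yes | yes | yes | yes |
| AdStackLoadTopStmt chain leaf with reverse-position guard | yes | yes | yes | yes | yes |
| GlobalLoadStmt chain leaf for read-only globals | yes | yes | yes | yes | yes |
| AdStackAllocaJudger dynamic-RangeFor depth gate | yes | yes | yes | yes | yes |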

The pass operates on Quadrants IR before backend codegen, so every backend benefits identically. The Metal FPS gain is the largest because Metal was the most adstack-traffic-bound; CPU is less bound by adstack heap traffic, so its relative gain is smaller.

Side-effect audit

| Concern | Verdict |
| --- | --- |
| AdStackLoadTopStmt leaf reading from already-popped adstack | Reverse-position guard checks the leaf's source adstack reverse-pop position against the consumer push site; mismatch keeps the original push. |
| GlobalLoadStmt leaf on a global that's actually written | Read-only check scans the kernel for any GlobalStoreStmt on the same global; any match disqualifies the leaf. Conservative (false negatives only - a write in a never-taken branch still disqualifies). |
| Backup-SSA cloning corrupting forward block ordering | Clone targets the reverse block, not the forward block; cloned SSA is inserted in reverse-block order matching the consumer's reverse position. |
| AdStackAllocaJudger gate change letting through wrong allocas | The gate is strictly more conservative than the previous behaviour: an alloca that the previous heuristic would have promoted may now stay unpromoted, but no alloca that should have been promoted is left out. The unroll path that catches the un-promoted allocas is the same path that handles statically-bounded loops in the no-promotion baseline. |
| Pass interaction with cfg_optimization=False | Same gate as every other reverse-mode opt pass; with cfg-opt off the pass runs but downstream simplifier/DCE doesn't, so eliminated pushes stay as dead AdStackPushStmts in the IR (visible in regression tests that exercise this state). |
| Cross-pass interaction with MakeAdjoint | Pass runs as a MakeAdjoint post-step; consumes the forward_to_reverse_range_for_map_ populated by MakeAdjoint::visit(RangeForStmt*) so the reverse-position check has the necessary forward-to-reverse mapping for LoopIndexStmt-style chain dependencies. |

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9482e17775

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread quadrants/transforms/auto_diff.cpp Outdated
Comment thread quadrants/transforms/auto_diff.cpp Outdated
@duburcqa

duburcqa commented May 4, 2026

Contributor Author

@claude review

@github-actions

github-actions Bot commented May 4, 2026

Coverage Report (9482e1777)

File Coverage Missing

Diff coverage: 0% · Overall: 74% · 0 lines, 0 missing

Full annotated report

@github-actions

github-actions Bot commented May 4, 2026

Coverage Report (43f137f70)

| File | Coverage | Missing |
| --- | --- | --- |
| 🔴 python/quadrants/_tensor_wrapper.py | 38% | 192,200-203 |
| 🟢 python/quadrants/lang/_ndarray.py | 83% | 490 |
| 🔴 python/quadrants/lang/field.py | 78% | 220,613 |
| 🟢 python/quadrants/lang/matrix.py | 86% | 1445,1878,2023 |

Diff coverage: 76% · Overall: 73% · 45 lines, 11 missing

Full annotated report

@duburcqa duburcqa force-pushed the duburcqa/adstack_load_store_eliminations branch 2 times, most recently from 9482e17 to f9420a3 Compare May 4, 2026 15:42
@hughperkins

Collaborator

Nice PR description. Thank you 🙌

hughperkins added a commit that referenced this pull request May 4, 2026
…only

The agent step on auto_diff.cpp (a 2540-LoC file with a 690-line diff) was
hanging on claude-4.6-opus-high-thinking for 15+ minutes because the agent
was doing both function identification AND line counting AND output
formatting, with the slowest model variant available.

Refactor: the agent now produces ONLY a JSONL stream of function ranges
(name + HEAD/base line bounds), and only for functions whose body
intersects a diff hunk. Everything else -- per-function totals, +/-
attribution, and the report shape -- is computed deterministically from
the diff and the HEAD/base files in `render_report.py`.

To support this:
  - `build_inputs.py` now also dumps `head/<path>` (so `render_report.py`
    is self-contained), and writes `touched_ranges.txt` -- per-file HEAD
    and base hunk ranges -- as a hint so the agent only has to read code
    near touched regions.
  - `render_report.py` consumes phase-1 outputs + the agent's JSONL,
    computes total/added/removed per function (using the same
    code-line-set definition as the file-level totals), and emits the
    final report.txt.
  - The workflow agent step is rewritten with the narrow JSONL prompt,
    and the model is dropped from `claude-4.6-opus-high-thinking` to
    `claude-4.6-sonnet-thinking` (function identification is structural
    pattern-matching, not deep reasoning).
  - `touched_ranges.txt` and `function_ranges.jsonl` are uploaded with
    the existing artifacts for debuggability.

Verified locally with synthetic JSONL on PR #621's auto_diff.cpp diff
(NEW + modified functions both render correctly) and on the empty-JSONL
case (file headers preserved, with a "no per-function attribution
available" note so the report is still useful if the agent fails).

Co-authored-by: Cursor <cursoragent@cursor.com>
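The split described in this commit message (agent emits function ranges, a script does the counting) might look roughly like the sketch below. The JSONL field names `name`, `head_start`, and `head_end` are assumptions for illustration only; the actual schema consumed by `render_report.py` is not shown in this thread.

```python
import json


def attribute_added_lines(jsonl_text, added_lines):
    """Count added HEAD lines per function range, deterministically."""
    totals = {}
    for raw in jsonl_text.splitlines():
        if not raw.strip():
            continue  # tolerate blank lines in the stream
        rec = json.loads(raw)
        lo, hi = rec["head_start"], rec["head_end"]
        totals[rec["name"]] = sum(1 for n in added_lines if lo <= n <= hi)
    return totals


ranges = (
    '{"name": "f", "head_start": 10, "head_end": 20}\n'
    '{"name": "g", "head_start": 30, "head_end": 35}\n'
)
added = [9, 10, 15, 20, 31, 40]  # HEAD line numbers of added lines
per_function = attribute_added_lines(ranges, added)
```

An empty JSONL stream yields an empty mapping, which is the case where the report falls back to file-level totals only.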
@github-actions

github-actions Bot commented May 4, 2026

Coverage Report (caab5053e)

File Coverage Missing

Diff coverage: 0% · Overall: 74% · 0 lines, 0 missing

Full annotated report

hughperkins added a commit that referenced this pull request May 4, 2026
The first end-to-end run on PR #621's branch surfaced two cosmetic / accuracy
issues:

1. The agent emitted a single "<module>" entry whose range covered the entire
   file (lines 1-N), which then double-counted every function it also listed
   in that file. Sharpen the agent prompt: "<module>" is now restricted to
   top-level non-function code (imports, module-level constants, dataclass /
   type-alias declarations, free static definitions) AND must be omitted
   entirely if every changed line lands inside a function or method. Function
   ranges within a single file must not overlap each other.

2. The drift note printed "+-288 -0" when added_drift was negative because we
   were prefixing a leading sign before the number's own sign. Switch to
   ``{:+d}`` formatting so the note reads "added_drift=-288 removed_drift=+0",
   which is unambiguous.

Co-authored-by: Cursor <cursoragent@cursor.com>
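The sign fix in item 2 can be reproduced in a few lines. `buggy_note` is a hypothetical reconstruction of the old behaviour, inferred from the "+-288 -0" output quoted above; `drift_note` uses the `{:+d}` format spec the commit switches to.

```python
def buggy_note(added_drift, removed_drift):
    # Hard-coded sign prefix in front of a number that carries its own sign.
    return "+{} -{}".format(added_drift, removed_drift)


def drift_note(added_drift, removed_drift):
    # {:+d} always emits exactly one sign, for negatives, zero and positives.
    return "added_drift={:+d} removed_drift={:+d}".format(added_drift, removed_drift)


old = buggy_note(-288, 0)
new = drift_note(-288, 0)
```

Here `old` is the doubled-sign string from the first run and `new` is the unambiguous form quoted in the commit message.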
@duburcqa duburcqa force-pushed the duburcqa/adstack_load_store_eliminations branch from 673c0ea to 836847a Compare May 4, 2026 20:56
@duburcqa duburcqa marked this pull request as draft May 4, 2026 21:20
@duburcqa duburcqa marked this pull request as ready for review May 4, 2026 21:20

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 836847a271

Comment thread quadrants/transforms/auto_diff/eliminate_recomputable_pushes.cpp Outdated
@github-actions

github-actions Bot commented May 4, 2026

PR change report (bd597b119)

Code lines (excluding blank lines, comment-only lines, and Python multi-line strings).

| File | LoC | Added | Removed |
| --- | --- | --- | --- |
| quadrants/transforms/auto_diff/make_adjoint.cpp | 565 | +565 | |
| quadrants/transforms/auto_diff/ir_shaping.cpp | 413 | +413 | |
| quadrants/transforms/auto_diff/forward_state_spill.cpp | 395 | +395 | |
| quadrants/transforms/auto_diff/eliminate_recomputable_pushes.cpp | 295 | +295 | |
| quadrants/transforms/auto_diff/make_dual.cpp | 276 | +276 | |
| quadrants/transforms/auto_diff/auto_diff_common.h | 264 | +264 | |
| quadrants/transforms/auto_diff/validation.cpp | 200 | +200 | |
| quadrants/transforms/auto_diff/post_adjoint_cleanup.cpp | 168 | +168 | |
| quadrants/transforms/auto_diff/auto_diff.cpp | 57 | +57 | |
| tests/python/test_adstack.py | 2509 | +11 | -10 |
| quadrants/transforms/auto_diff/ir_shaping.h | 9 | +9 | |
| quadrants/transforms/auto_diff/validation.h | 8 | +8 | |
| quadrants/transforms/auto_diff/forward_state_spill.h | 7 | +7 | |
| quadrants/transforms/auto_diff/post_adjoint_cleanup.h | 7 | +7 | |
| quadrants/transforms/auto_diff/make_adjoint.h | 5 | +5 | |
| quadrants/transforms/auto_diff/make_dual.h | 5 | +5 | |

Total: 16 file(s) changed, +2685 -10 code lines.

Full per-function report

@duburcqa duburcqa force-pushed the duburcqa/adstack_load_store_eliminations branch 2 times, most recently from 0744039 to 1db816d Compare May 4, 2026 22:36
@github-actions

github-actions Bot commented May 4, 2026

PR change report (1db816d92)

Code lines (excluding blank lines, comment-only lines, and Python multi-line strings).

| File | LoC | Added | Removed |
| --- | --- | --- | --- |
| quadrants/transforms/auto_diff/make_adjoint.cpp | 565 | +565 | |
| quadrants/transforms/auto_diff/ir_shaping.cpp | 413 | +413 | |
| quadrants/transforms/auto_diff/forward_state_spill.cpp | 395 | +395 | |
| quadrants/transforms/auto_diff/eliminate_recomputable_pushes.cpp | 303 | +303 | |
| quadrants/transforms/auto_diff/make_dual.cpp | 276 | +276 | |
| quadrants/transforms/auto_diff/auto_diff_common.h | 264 | +264 | |
| quadrants/transforms/auto_diff/validation.cpp | 200 | +200 | |
| quadrants/transforms/auto_diff/post_adjoint_cleanup.cpp | 168 | +168 | |
| quadrants/transforms/auto_diff/auto_diff.cpp | 57 | +57 | |
| tests/python/test_adstack.py | 2537 | +39 | -10 |
| quadrants/transforms/auto_diff/ir_shaping.h | 9 | +9 | |
| quadrants/transforms/auto_diff/validation.h | 8 | +8 | |
| quadrants/transforms/auto_diff/forward_state_spill.h | 7 | +7 | |
| quadrants/transforms/auto_diff/post_adjoint_cleanup.h | 7 | +7 | |
| quadrants/transforms/auto_diff/make_adjoint.h | 5 | +5 | |
| quadrants/transforms/auto_diff/make_dual.h | 5 | +5 | |

Total: 16 file(s) changed, +2721 -10 code lines.

Full per-function report

@github-actions

github-actions Bot commented May 4, 2026

Coverage Report (1db816d92)

| File | Coverage | Missing |
| --- | --- | --- |
| 🟢 tests/python/test_adstack.py | 100% | |

Diff coverage: 100% · Overall: 74% · 39 lines, 0 missing

Full annotated report

@duburcqa duburcqa force-pushed the duburcqa/adstack_load_store_eliminations branch from 1db816d to c1b78e5 Compare May 5, 2026 05:50
@github-actions

github-actions Bot commented May 5, 2026

PR change report (c1b78e5c0)

Code lines (excluding blank lines, comment-only lines, and Python multi-line strings).

| File | LoC | Added | Removed |
| --- | --- | --- | --- |
| quadrants/transforms/auto_diff/make_adjoint.cpp | 565 | +565 | |
| quadrants/transforms/auto_diff/ir_shaping.cpp | 413 | +413 | |
| quadrants/transforms/auto_diff/forward_state_spill.cpp | 395 | +395 | |
| quadrants/transforms/auto_diff/eliminate_recomputable_pushes.cpp | 303 | +303 | |
| quadrants/transforms/auto_diff/make_dual.cpp | 276 | +276 | |
| quadrants/transforms/auto_diff/auto_diff_common.h | 264 | +264 | |
| quadrants/transforms/auto_diff/validation.cpp | 200 | +200 | |
| quadrants/transforms/auto_diff/post_adjoint_cleanup.cpp | 168 | +168 | |
| quadrants/transforms/auto_diff/auto_diff.cpp | 57 | +57 | |
| tests/python/test_adstack.py | 2537 | +39 | -10 |
| quadrants/transforms/auto_diff/ir_shaping.h | 9 | +9 | |
| quadrants/transforms/auto_diff/validation.h | 8 | +8 | |
| quadrants/transforms/auto_diff/forward_state_spill.h | 7 | +7 | |
| quadrants/transforms/auto_diff/post_adjoint_cleanup.h | 7 | +7 | |
| quadrants/transforms/auto_diff/make_adjoint.h | 5 | +5 | |
| quadrants/transforms/auto_diff/make_dual.h | 5 | +5 | |

Total: 16 file(s) changed, +2721 -10 code lines.

Full per-function report

@duburcqa duburcqa force-pushed the duburcqa/adstack_load_store_eliminations branch from c1b78e5 to 8f20dac Compare May 5, 2026 06:50
@github-actions

github-actions Bot commented May 5, 2026

Coverage Report (c1b78e5c0)

| File | Coverage | Missing |
| --- | --- | --- |
| 🟢 tests/python/test_adstack.py | 100% | |

Diff coverage: 100% · Overall: 74% · 39 lines, 0 missing

Full annotated report

@github-actions

github-actions Bot commented May 5, 2026

PR change report (8f20dac58)

Code lines (excluding blank lines, comment-only lines, and Python multi-line strings).

| File | LoC | Added | Removed |
| --- | --- | --- | --- |
| quadrants/transforms/auto_diff/make_adjoint.cpp | 565 | +565 | |
| quadrants/transforms/auto_diff/ir_shaping.cpp | 413 | +413 | |
| quadrants/transforms/auto_diff/forward_state_spill.cpp | 395 | +395 | |
| quadrants/transforms/auto_diff/eliminate_recomputable_pushes.cpp | 303 | +303 | |
| quadrants/transforms/auto_diff/make_dual.cpp | 276 | +276 | |
| quadrants/transforms/auto_diff/auto_diff_common.h | 264 | +264 | |
| quadrants/transforms/auto_diff/validation.cpp | 200 | +200 | |
| quadrants/transforms/auto_diff/post_adjoint_cleanup.cpp | 168 | +168 | |
| quadrants/transforms/auto_diff/auto_diff.cpp | 57 | +57 | |
| tests/python/test_adstack.py | 2537 | +39 | -10 |
| quadrants/transforms/auto_diff/ir_shaping.h | 9 | +9 | |
| quadrants/transforms/auto_diff/validation.h | 8 | +8 | |
| quadrants/transforms/auto_diff/forward_state_spill.h | 7 | +7 | |
| quadrants/transforms/auto_diff/post_adjoint_cleanup.h | 7 | +7 | |
| quadrants/transforms/auto_diff/make_adjoint.h | 5 | +5 | |
| quadrants/transforms/auto_diff/make_dual.h | 5 | +5 | |

Total: 16 file(s) changed, +2721 -10 code lines.

Full per-function report

@duburcqa duburcqa force-pushed the duburcqa/adstack_load_store_eliminations branch 2 times, most recently from 54e103a to 8775e27 Compare May 5, 2026 08:14
@github-actions

github-actions Bot commented May 5, 2026

PR change report (8775e2721)

Code lines (excluding blank lines, comment-only lines, and Python multi-line strings).

| File | LoC | Added | Removed |
| --- | --- | --- | --- |
| quadrants/transforms/auto_diff/make_adjoint.cpp | 565 | +565 | |
| quadrants/transforms/auto_diff/ir_shaping.cpp | 413 | +413 | |
| quadrants/transforms/auto_diff/forward_state_spill.cpp | 395 | +395 | |
| quadrants/transforms/auto_diff/eliminate_recomputable_pushes.cpp | 303 | +303 | |
| quadrants/transforms/auto_diff/make_dual.cpp | 276 | +276 | |
| quadrants/transforms/auto_diff/auto_diff_common.h | 264 | +264 | |
| quadrants/transforms/auto_diff/validation.cpp | 200 | +200 | |
| quadrants/transforms/auto_diff/post_adjoint_cleanup.cpp | 168 | +168 | |
| quadrants/transforms/auto_diff/auto_diff.cpp | 57 | +57 | |
| tests/python/test_adstack.py | 2537 | +39 | -10 |
| quadrants/transforms/auto_diff/ir_shaping.h | 9 | +9 | |
| quadrants/transforms/auto_diff/validation.h | 8 | +8 | |
| quadrants/transforms/auto_diff/forward_state_spill.h | 7 | +7 | |
| quadrants/transforms/auto_diff/post_adjoint_cleanup.h | 7 | +7 | |
| quadrants/transforms/auto_diff/make_adjoint.h | 5 | +5 | |
| quadrants/transforms/auto_diff/make_dual.h | 5 | +5 | |

Total: 16 file(s) changed, +2721 -10 code lines.

Full per-function report

…ro value, to preserve real x = 0.0 body stores
@duburcqa duburcqa force-pushed the duburcqa/adstack_load_store_eliminations branch from 8775e27 to bb5a62d Compare May 5, 2026 09:11
@github-actions

github-actions Bot commented May 5, 2026

Coverage Report (8775e2721)

| File | Coverage | Missing |
| --- | --- | --- |
| 🟢 tests/python/test_adstack.py | 100% | |

Diff coverage: 100% · Overall: 79% · 39 lines, 0 missing

Full annotated report

@github-actions

github-actions Bot commented May 5, 2026

PR change report (bb5a62d40)

Code lines (excluding blank lines, comment-only lines, and Python multi-line strings).

| File | LoC | Added | Removed |
| --- | --- | --- | --- |
| quadrants/transforms/auto_diff/make_adjoint.cpp | 565 | +565 | |
| quadrants/transforms/auto_diff/ir_shaping.cpp | 413 | +413 | |
| quadrants/transforms/auto_diff/forward_state_spill.cpp | 395 | +395 | |
| quadrants/transforms/auto_diff/eliminate_recomputable_pushes.cpp | 303 | +303 | |
| quadrants/transforms/auto_diff/make_dual.cpp | 276 | +276 | |
| quadrants/transforms/auto_diff/auto_diff_common.h | 264 | +264 | |
| quadrants/transforms/auto_diff/validation.cpp | 200 | +200 | |
| quadrants/transforms/auto_diff/post_adjoint_cleanup.cpp | 168 | +168 | |
| quadrants/transforms/auto_diff/auto_diff.cpp | 57 | +57 | |
| tests/python/test_adstack.py | 2537 | +39 | -10 |
| quadrants/transforms/auto_diff/ir_shaping.h | 9 | +9 | |
| quadrants/transforms/auto_diff/validation.h | 8 | +8 | |
| quadrants/transforms/auto_diff/forward_state_spill.h | 7 | +7 | |
| quadrants/transforms/auto_diff/post_adjoint_cleanup.h | 7 | +7 | |
| quadrants/transforms/auto_diff/make_adjoint.h | 5 | +5 | |
| quadrants/transforms/auto_diff/make_dual.h | 5 | +5 | |

Total: 16 file(s) changed, +2721 -10 code lines.

Full per-function report

@github-actions

github-actions Bot commented May 5, 2026

Coverage Report (bb5a62d40)

| File | Coverage | Missing |
| --- | --- | --- |
| 🟢 tests/python/test_adstack.py | 100% | |

Diff coverage: 100% · Overall: 72% · 39 lines, 0 missing

Full annotated report

@hughperkins

Collaborator

Nice! Thank you

I like that you've grouped together the autodiff files in their own directory 🙌

Checklist:

  • files all a reasonable size (subjective, but 600-800 sounds reasonable to me)
  • no user facing doc changes needed
  • no core (non-autodiff) files changed

=> ok to merge

@duburcqa duburcqa merged commit 2fc5472 into main May 5, 2026
58 of 59 checks passed
@duburcqa duburcqa deleted the duburcqa/adstack_load_store_eliminations branch May 5, 2026 14:04
npoulad1 added a commit to ROCm/quadrants that referenced this pull request Jun 8, 2026
* [Misc] Warn user to disable caching when print_ir/QD_DUMP_IR enabled (Genesis-Embodied-AI#425)

Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>

* [Build] Pin torch version to CUDA 12.8 for CUDA tests (Genesis-Embodied-AI#428)

* [Misc] Fixing up taichi-dev urls (Genesis-Embodied-AI#429)

* [Perf] Rename cuda_graph to gpu_graph across the codebase (Genesis-Embodied-AI#430)

* Misc: fix typo integeral -> integral (Genesis-Embodied-AI#434)

Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>

* [Perf] CUDA graph 4: call from multiple locations (Genesis-Embodied-AI#420)

* [Bug] Fix fastcache not restoring graph_do_while_arg (Genesis-Embodied-AI#435)

* [Perf] Cache last-call result in perf_dispatch for single-compatible case (Genesis-Embodied-AI#438)

* Fix gpu_graph fallback on old Nvidia GPU. (Genesis-Embodied-AI#443)

* Fix shared memory offset not reset between CUDA kernels. (Genesis-Embodied-AI#442)

* [Misc] Allow disabling GPU graph via QD_GPU_GRAPH=0 env var (Genesis-Embodied-AI#439)

* [Misc] Add named top-level loops (Genesis-Embodied-AI#440)

* [Misc] Rename gpu_graph to graph (Genesis-Embodied-AI#446)

* [Misc] Add cross-platform shuffle (Genesis-Embodied-AI#447)

* [Bug] Fix graph_do_while on Windows: search for cudadevrt.lib (Genesis-Embodied-AI#456)

* [Bug] Also search default CUDA toolkit install location on Windows (Genesis-Embodied-AI#461)

* [SPIRV] Feature Parity Atomics & Shared Array (Genesis-Embodied-AI#432)

* [Misc] Change clang format to 120 characters (Genesis-Embodied-AI#463)

* [Misc] CUDA graph 5 Add fatbin (Genesis-Embodied-AI#464)

* [Bug] Reuse VkInstance across init/reset cycles (Genesis-Embodied-AI#465)

* [Perf] Tiles 1: _load, _store, _eye_ (Genesis-Embodied-AI#466)

* [Misc] Remove dead InternalFuncStmt type_check override (Genesis-Embodied-AI#471)

* [Perf] Tiles 2: add cholesky and ger (Genesis-Embodied-AI#472)

* [Perf] Tiles 2b: add triangular solve (Genesis-Embodied-AI#474)

* [Misc] Refactor: use _get_col/_set_col in tiles load/store/init (Genesis-Embodied-AI#475)

* [Build] Fix flaky test_clock_accuracy (Genesis-Embodied-AI#436)

* Fix AARCH64 emitting invalid asm in CUDA kernels. (Genesis-Embodied-AI#473)

Co-authored-by: Hugh Perkins <hughperkins@gmail.com>

* [AMDGPU] Enable HIP memory pool and surface pool-exhaustion errors. (Genesis-Embodied-AI#485)

* [AMDGPU] Scope hsaco tmp dir per-user to avoid collisions. (Genesis-Embodied-AI#484)

* [Perf] Tiles 3: Add slice syntax, qd.outer() and initial doc (Genesis-Embodied-AI#477)

* [AMDGPU] Fix gradient computation. (Genesis-Embodied-AI#486)

* Enable all backends that are supported in unit tests. (Genesis-Embodied-AI#488)

* Fix SPIRV ID overflow for large kernels due to autodiff. (Genesis-Embodied-AI#489)

* [Misc] Fix purity checker to allow accessing constants from quadrants modules (Genesis-Embodied-AI#487)

* [Misc] Increase tolerance for clock monotonic test (Genesis-Embodied-AI#492)

* [CI] Serialize api doc workflow (Genesis-Embodied-AI#494)

* [CI] Increase tolerance for clock test (Genesis-Embodied-AI#506)

* [CI] Increase clock test tolerance to 20% (Genesis-Embodied-AI#509)

* [Perf] Add tensor_type parametrization to tile16 tests (Genesis-Embodied-AI#504)

* [Perf] Tiles 4b: Migrate tiles16 tests to enable fastcache (Genesis-Embodied-AI#505)

* [Perf] Tiles 4c: add Tiles16x16 proxy (Genesis-Embodied-AI#507)

* [Perf] Tiles 4d: Consolidate slice error tests using parametrize (Genesis-Embodied-AI#508)

* [Perf] Tiles 4: add SharedArray slice support (Genesis-Embodied-AI#482)

* [Perf] Tiles 5: add Cholesky benchmark demo (Genesis-Embodied-AI#483)

* [Doc] Add user guide page for subgroup shuffle (Genesis-Embodied-AI#512)

* [Perf] Implement cross-platform shuffle_down (Genesis-Embodied-AI#510)

* [Perf] Add portable subgroup reduce_add and reduce_all_add (Genesis-Embodied-AI#511)

* [Perf] Add first warmup config to perf dispatch (Genesis-Embodied-AI#422)

* [AutoDiff] Autodiff 1: Add baseline adstack regression test for unary_collections (Genesis-Embodied-AI#500)

* [AutoDiff] Autodiff 2: Implement derivative for tan (Genesis-Embodied-AI#501)

* [AutoDiff] Autodiff 3: Recompute tanh/exp on the operand in the reverse pass (Genesis-Embodied-AI#502)

* [AutoDiff] Autodiff 4: Mark rsqrt as non-linear for adstack promotion (Genesis-Embodied-AI#503)

* [AutoDiff] Autodiff 5: Fix adjoint-alloca placement for GlobalLoads outside the current range-for (Genesis-Embodied-AI#496)

* [AutoDiff] Autodiff 6: Adstack regression tests (Genesis-Embodied-AI#491)

* [AutoDiff] Autodiff 7: Fix header size in AdStackAllocaStmt to match u64 runtime layout (Genesis-Embodied-AI#534)

* [AutoDiff] Autodiff 8: Surface LLVM adstack push/pop overflow as a Python exception (Genesis-Embodied-AI#535)

* [AutoDiff] Autodiff 9: Guard against LLVM worker-thread stack overflow from large per-task adstack budget (Genesis-Embodied-AI#495)

* [AutoDiff] Autodiff 10: Implement adstack for SPIR-V (Genesis-Embodied-AI#490)

* [AutoDiff] Autodiff 11: Latent adstack-adjacent fixes (AMDGPU hipFree, flush() keeps ctx_buffers_, always-preallocate) (Genesis-Embodied-AI#536)

* [Doc] Add AGENTS.md with instructions for AI agents (Genesis-Embodied-AI#541)

* [Bug] Abort kernel execution on assertion failure instead of segfaulting (Genesis-Embodied-AI#419)

* [Type] ndarray typing 1: Add eval_str=True to inspect.signature() calls (Genesis-Embodied-AI#411)

* [CI] Suppress reportPrivateImportUsage in torch-using files (Genesis-Embodied-AI#552)

* [Misc] QD_DUMP_IR dumps to files with the task_id added to the filename (Genesis-Embodied-AI#441)

* [Type] ndarray typing 2: Fix NDArray single-arg subscript crash (Genesis-Embodied-AI#412)

* [Test] Flush xdist channel before worker exit so test failure reports are visible (Genesis-Embodied-AI#555)

* [CI] Reduce test retries on CI from 3 to 1. (Genesis-Embodied-AI#554)

* [AutoDiff] Autodiff 12: Heap-backed adstack on LLVM backends (CPU/CUDA/AMDGPU) (Genesis-Embodied-AI#537)

* [AutoDiff] Autodiff 13: Heap-backed adstack on SPIR-V backends (Metal, Vulkan) (Genesis-Embodied-AI#493)

* [AutoDiff] Autodiff 14: Resolve bounded-inner-loop adstacks without default_ad_stack_size fallback (Genesis-Embodied-AI#539)

* [SPIRV] Vulkan SPIR-V correctness: atomic-view aliasing, PSB stride, narrow storage caps, u1 cast, per-init layer recheck (Genesis-Embodied-AI#513)

* [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch and sanitise MoltenVK cap advertisement (Genesis-Embodied-AI#551)

* [Test] Suppress stock pytest-timeout to avoid conflict with pytest_hardtle (Genesis-Embodied-AI#557)

* [Vulkan] Use SDK validation layer for debugPrintf instead of apt package (Genesis-Embodied-AI#562)

* [Test] Fix flaky perf_dispatch tests by increasing work amounts (Genesis-Embodied-AI#559)

* [Test] Add --maxfail CLI option to run_tests.py (default 20) (Genesis-Embodied-AI#558)

* [CI] Vulkan debug printf fix to address flaky tests (Genesis-Embodied-AI#563)

* [Docs] Add a new page to help for first time contributors (Genesis-Embodied-AI#426)

Authored-by: v01dxyz <v01dxyz@v01d.xyz>

* [AutoDiff] Autodiff 16: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr (Genesis-Embodied-AI#543)

* Fix: raise error if device memory allocation fails (Genesis-Embodied-AI#451) (Genesis-Embodied-AI#453)

Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>

* [CI] Add CI job to check line wrapping of comments and docs (Genesis-Embodied-AI#564)

* [Misc] Add coverage report to PRs, including kernels (Genesis-Embodied-AI#470)

* [CI] CI wrap check feeds only diffs to agent (Genesis-Embodied-AI#567)

* Skip 'flaky' test on MacOS CI. (Genesis-Embodied-AI#573)

* [Test] Fix missing `import sys` in test_fail_device_memory_allocation (Genesis-Embodied-AI#574)

* [CI] Fix Vulkan debugPrintf flake with session-scoped warmup (Genesis-Embodied-AI#571)

* [AutoDiff] determine_ad_stack_size: replace whole-CFG Bellman-Ford with SCC + DAG DP (Genesis-Embodied-AI#575)

* [Test] Fix macOS OOM skip reason to describe actual root cause (Genesis-Embodied-AI#576)

* [Lang] whole_kernel_cse: 2.5x compile time speedup on large kernels (Genesis-Embodied-AI#577)

* [CI] Add CI check for unnecessarily deleted comments (Genesis-Embodied-AI#570)

* [CI] Migrate coverage report to github Check page (Genesis-Embodied-AI#566)

* [Lang] Skip IR verifier between passes unless debug=true (Genesis-Embodied-AI#579)

* [Lang] Inline AdStack ops on release LLVM codegen: dramatically reduces compile time for adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#584)

* [CUDA] Honor offline_cache=False end-to-end so QD_OFFLINE_CACHE=0 actually gives a cold compile (Genesis-Embodied-AI#580)

* [Type] Tensor 24 (Genesis-Embodied-AI#561)

Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>

* [Lang] auto_diff host-walk reductions: dramatically faster front-end compile time on adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#587)

* [AutoDiff] Speed up reverse-mode kernel launches on GPU backends (Genesis-Embodied-AI#578)

* [Vulkan] Move adstack-sizer scratch out of Function-scope memory to fix SPIR-V pipeline build failures (Genesis-Embodied-AI#588)

* [AutoDiff] Improve diagnosis of unsupported reverse-mode AD patterns (Genesis-Embodied-AI#590)

* [Bug] Fix: promote Ndarray to AnyArray in build_Name for flattened struct fields (Genesis-Embodied-AI#592)

* [SPIR-V] Shrink reverse-grad kernel MSL by ~50% (Genesis-Embodied-AI#591)

* [CI] Add CI check that PR changes have test coverage (Genesis-Embodied-AI#596)

* [Perf] Enable zero-copy in to_torch() and to_numpy() (Genesis-Embodied-AI#450)

* Add BufferView: safe sub-range ndarray access for kernels (Genesis-Embodied-AI#585)

Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>

* [Doc] Add user-facing fastcache documentation (Genesis-Embodied-AI#597)

Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>

* [Misc] Upgrade to enable v1 dlpack so to_numpy(copy=False) writable (Genesis-Embodied-AI#598)

Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local>

* [AutoDiff] Cut reverse-mode adstack memory usage 10x on all backends (Genesis-Embodied-AI#599)

* [Misc] Add CI check for feature file factorization (Genesis-Embodied-AI#606)

* [Perf] Skip _recursive_set_args for all-Field frozen dataclass structs (Genesis-Embodied-AI#607)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [AutoDiff] SNode-arm bound-expr capture rejects fold-attack gate indices (Genesis-Embodied-AI#610)

* [Misc] Suppress field fastcache warning for qd.Tensor (Genesis-Embodied-AI#615)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [AutoDiff] Adstack heap: clip reducer count by per-task loop trip count (compile-time and SizeExpr-evaluated) (Genesis-Embodied-AI#611)

* [Misc] Forward copy= through qd.Tensor, add copy=None option (Genesis-Embodied-AI#616)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Doc] Update README (Genesis-Embodied-AI#617)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Fix coverage report showing def lines as uncovered (Genesis-Embodied-AI#623)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Perf] Generic launcher: persistent context, JIT-pointer reuse, Metal compute encoder, LLVM-GPU async memory ops (Part 1/2) (Genesis-Embodied-AI#619)

* [CI] Encode Python-first testing policy in coverage-check prompt (Genesis-Embodied-AI#622)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Add PR Line change report (Genesis-Embodied-AI#624)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Disable quadrants pytest plugin during quadrants internal coverage runs (Genesis-Embodied-AI#629)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [AutoDiff] Adstack load+store eliminations: EliminateRecomputableAdStackPushes pass + leaf extensions (Genesis-Embodied-AI#621)

* [CI] Simplify coverage PR comment to a single linked line (Genesis-Embodied-AI#630)

* [CUDA] Add AGX Thor, SM_110 (Genesis-Embodied-AI#631)

Co-authored-by: Johnny Nunez and Hugh Perkins

* [CI] Lines changed report: collapse PR comment to a single linked totals line (Genesis-Embodied-AI#632)

* [FEATURE] Support external Metal command queue via qd.init (Genesis-Embodied-AI#618)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Perf] Cache adstack-sizer metadata per task across SPIR-V + LLVM-GPU; per-snode / DeviceAllocation invalidation (Part 2/2) (Genesis-Embodied-AI#620)

* [AutoDiff] Disable EliminateRecomputableAdStackPushes pending mutated-SNode chain-leaf fix (Genesis-Embodied-AI#633)

* [AutoDiff] Adstack chain-clone safety: mutated-SNode leaf reject + load_top consumer-aware guard (Genesis-Embodied-AI#634)

* [Docs] Add user-guide page for qd.simt.block.* primitives (Genesis-Embodied-AI#638)

* [Docs] Expand qd.simt.subgroup user-guide page to cover every op (Genesis-Embodied-AI#639)

* [Perf] Streams 1-4 (Genesis-Embodied-AI#410)

* [Docs] Add user-guide page for matrix decompositions and solvers (Genesis-Embodied-AI#643)

* [Bug] Revert "[Perf] Streams 1-4 (Genesis-Embodied-AI#410)" (Genesis-Embodied-AI#650)

* [Docs] Add user-guide page for atomics and bit operations (Genesis-Embodied-AI#640)

* [Docs] Add user-guide page for qd.simt.grid.* primitives (Genesis-Embodied-AI#641)

* [AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch (Genesis-Embodied-AI#635)

* [AMDGPU] Fix amdgpu parallel rand init (Genesis-Embodied-AI#658)

* [Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap (Genesis-Embodied-AI#655)

* [Perf] Re-land Streams 1-4 with bug fixes (Genesis-Embodied-AI#653)

* [AMDGPU] Apply device_memory_GB=0.3 cap to AMDGPU tests (Genesis-Embodied-AI#659)

* [Perf] Per-launch host sync: drop wait_idle on SPIR-V, pin stream and drop stream_synchronize on CUDA/AMDGPU (Genesis-Embodied-AI#654)

* [AMDGPU] Unload hipModule_t in JITModuleAMDGPU destructor (Genesis-Embodied-AI#660)

* [AMDGPU] Trim default mempool on qd.reset() (Genesis-Embodied-AI#669)

* [AMDGPU] Hoist rand-state buffer to process lifetime (Genesis-Embodied-AI#668)

* [Streams] Use events for streams serialization on AMDGPU and CUDA (Genesis-Embodied-AI#667)

* [Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id (Genesis-Embodied-AI#671)

* [SPIR-V] dispatch_max_reducers: register each task with the real kernel name (Genesis-Embodied-AI#675)

* [AutoDiff] Debug-mode field/grad/dual: dtype, layout, and access-time invariants (Genesis-Embodied-AI#677)

* [Docs] Add user-guide page for qd.algorithms.* device-wide algorithms (Genesis-Embodied-AI#642)

Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>

* [Docs] Doc for existing atomics: switch support table to per-backend columns (Genesis-Embodied-AI#657)

Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>

* [GPU] Cross gpu atomics (Genesis-Embodied-AI#666)

Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>

* [GPU] Make block operations portable cross-gpu (Genesis-Embodied-AI#664)

* [Perf] CPU LLVM adstack-cache: skip per-launch bump-writes + ndarray_shapes capture on forward-only handles (Genesis-Embodied-AI#685)

* [GPU] Cross-GPU for grid ops (Genesis-Embodied-AI#670)

* [Math] Make bitop operations portable cross-gpu (Genesis-Embodied-AI#662)

* [AMDGPU] Always use wave64, on both RDNA and CDNA (Genesis-Embodied-AI#687)

* [AMDGPU] Use syncscope("agent") for atomic xor to avoid CAS livelock (Genesis-Embodied-AI#672)

* [GPU] New bit ops for QIPC (Genesis-Embodied-AI#679)

* [GPU] Subgroup ops cross-gpu (Genesis-Embodied-AI#665)

* [Graph] Rename CUDA Graph to Graph in docs (Genesis-Embodied-AI#691)

* [SPIR-V] Fix FIFO-queue ordering when sharing command queue. (Genesis-Embodied-AI#694)

* [Atomics] New QIPC ops for atomics (Genesis-Embodied-AI#690)

* Pass dataclass sub-structs into qd.func (Genesis-Embodied-AI#698)

* [AMDGPU] HIP graph runtime support for @qd.kernel(graph=True) (Genesis-Embodied-AI#692)

* [CI] Add per-file timing report to Mac Metal test job (Genesis-Embodied-AI#695)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Enable kernel disk cache during tests (Genesis-Embodied-AI#696)

* [Math] New QIPC ops for single-threaded linalg (Genesis-Embodied-AI#683)

* [BREAKING][GPU] New QIPC ops for subgroups (Genesis-Embodied-AI#676)

* [GPU] New QIPC ops for block (Genesis-Embodied-AI#684)

* [GPU] New device-level ops for QIPC (Genesis-Embodied-AI#693)

* [algorithms] PrefixSumExecutor: drop unused GRID_SZ local (Genesis-Embodied-AI#701)

* [block] sync(): fix unsupported-arch error message (Genesis-Embodied-AI#700)

* [volatile_load] add qd.volatile_load primitive (closes Genesis-Embodied-AI#648) (Genesis-Embodied-AI#702)

* [AutoDiff] Reject recycled identity_key in AdStackCache::register_adstack_sizing_info (Genesis-Embodied-AI#708)

* [Vulkan] Declare GroupNonUniform SPIR-V caps and enable shaderSubgroupExtendedTypes (Genesis-Embodied-AI#707)

* Fix duplicate HIP graph driver-function declarations after v1.0.0 merge

The amd-integration fork had cherry-picked the HIP graph driver functions
(graph_create / graph_destroy / graph_add_kernel_node / graph_instantiate /
graph_exec_destroy / graph_launch), and upstream v1.0.0 added the same set.
The per-file 3-way merge appended both copies into
amdgpu_driver_functions.inc.h, producing redeclaration errors that broke the
AMDGPU RHI/runtime compile. Drop the upstream duplicate block; the signatures
are identical to the fork's existing declarations.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix AMDGPU launcher coherence and num_instructions visibility after v1.0.0 merge

- kernel_launcher.cpp: the 3-way merge spliced upstream v1.0.0's launch_llvm_kernel
  rewrite (ephemeral arg/context buffers, explicit-stream path, AmdgpuDefaultStream
  PinGuard) onto the AMD fork's kernarg-by-value + persistent-scratch design,
  leaving references to undefined `ephemeral_context_ptr`. Restore the fork's
  coherent launch_llvm_kernel verbatim; it calls the (already merged) enhanced
  launch_offloaded_tasks, which keeps the max-reducer dispatch and stream-parallel
  groups adapted onto the AMD launch path.
- llvm_context.h: both the fork and upstream added `num_instructions`; the merge
  kept upstream's private placement, but the AMDGPU codegen force-inline heuristic
  calls it statically from outside the class. Move it back to the public section.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Restore async result D2H and hoist kernarg vectors in AMDGPU launcher

The v1.0.0 merge resolution regressed two amd-integration baseline
optimizations in launch_llvm_kernel / launch_offloaded_tasks:

  - The per-launch result-buffer copy was a blocking memcpy_device_to_host,
    forcing a host stall on every value-returning launch and serializing the
    GPU pipeline. Restore the async D2H (the caller synchronizes lazily when it
    needs the value); external-array transfers still stream_synchronize once
    before reading back.

  - launch_task constructed the kernarg std::vectors from initializer lists
    ({kernarg_payload} / {kernarg_size}) on every dispatch (heap alloc + free
    per launch). Hoist arg_ptrs/arg_sizes out of the per-task launch and reuse.
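
A minimal sketch of the hoisting idea described above, not the actual launcher code: the vectors are allocated once and the per-dispatch path only overwrites their single slot. `LauncherSketch` and `prepare` are hypothetical names introduced for illustration; only `arg_ptrs`, `arg_sizes`, `kernarg_payload` and `kernarg_size` come from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the launcher state; not the real class.
struct LauncherSketch {
  // Allocated once, each sized for the single kernarg slot.
  std::vector<void *> arg_ptrs = std::vector<void *>(1);
  std::vector<std::size_t> arg_sizes = std::vector<std::size_t>(1);

  // Per-dispatch path: overwrite the existing slot in place, so no
  // vector is constructed or destroyed on each launch.
  void prepare(void *kernarg_payload, std::size_t kernarg_size) {
    arg_ptrs[0] = kernarg_payload;
    arg_sizes[0] = kernarg_size;
  }
};
```

Calling `prepare` repeatedly leaves `arg_ptrs.data()` unchanged, which is the property the hoist relies on: the backing storage is reused rather than reallocated.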

Co-authored-by: Cursor <cursoragent@cursor.com>

* amdgpu: default to LDS permlane64 emulation; drop host-x86 barrier asm on retarget

Two AMDGPU JIT-compile crashes surfaced after the v1.0.0 merge pulled in the QIPC subgroup
ops (Genesis-Embodied-AI#676), which made the rigid constraint solver's wave-cooperative reductions route through
`amdgpu_cross_half_shuffle_i32`. Both manifested as a SIGSEGV inside
`llvm::SIInstrInfo::getInstSizeInBytes` during `JITSessionAMDGPU::compile_module_to_hsaco`
(i.e. at first kernel launch), and reproduce on gfx942 / MI300X. Baseline 0.4.6 never emitted
these constructs, which is why it was unaffected.

1. Native `llvm.amdgcn.permlane64` lowering crashes the bundled LLVM 22.1.0 AMDGPU backend.
   Default `amdgpu_permlane64` to the existing LDS-roundtrip software emulation on every target
   (it produces identical results). Add `QD_AMDGPU_USE_NATIVE_PERMLANE64=1` to opt back into the
   native instruction once the backend bug is fixed; the fallback that the old
   `QD_AMDGPU_FORCE_PERMLANE64_FALLBACK` forced is now the default, and that variable is still
   honored. This is the actual crash fix.

2. The runtime module is compiled by the host x86_64 clang and only retargeted to amdgcn here, so
   `amdgpu_cross_half_shuffle_i32`'s `__asm__ volatile("" : "+v"(byte))` optimization barrier carries
   x86 flag clobbers (`~{dirflag},~{fpsr},~{flags}`) that are meaningless on AMDGPU. The IR verifies
   but the empty-body INLINEASM is invalid on the amdgcn target. Neutralize empty-body barrier asm
   during retarget (forward the tied value, then erase) so no stale host asm reaches codegen. On the
   wave64 targets we ship, `ds_bpermute` already addresses the full wave, so the hint is a no-op.

Co-authored-by: Cursor <cursoragent@cursor.com>

* style: apply clang-format (v19.1.7) to AMDGPU fn_attrs and launcher sources

CI pre-commit's clang-format hook reformatted these files (long
declarations/lambda signatures collapsed onto single lines per the repo's
clang-format config). Apply the same formatting so the hook passes.

No functional changes.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(amdgpu): use CreateNeg for branchless i32 sgn instead of CreateSub(0, input)

clang-tidy (modernize-use-nullptr, -warnings-as-errors) flagged
`builder->CreateSub(0, input)` in the i32 sgn path: the literal `0` binds to
the `llvm::Value*` LHS parameter as a null pointer, not an integer zero.
Replace with `builder->CreateNeg(input)`, which emits `0 - input` with a proper
zero constant -- identical intended semantics, and clang-tidy clean.
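
The pitfall can be reproduced without LLVM. The sketch below is illustrative only: `Value` is a hypothetical stand-in for `llvm::Value`, and `lhs_is_null` merely mimics the shape of a two-operand builder call that takes pointer operands. It shows that the literal `0` converts silently to a null pointer, whereas an explicit zero-valued object does not.

```cpp
#include <cassert>

// Hypothetical stand-in for llvm::Value; this is NOT the real LLVM class.
struct Value {
  int payload;
};

// Mimics a two-operand builder call taking Value* operands.
// Reports whether the left-hand operand arrived as a null pointer.
bool lhs_is_null(const Value *lhs, const Value *rhs) {
  (void)rhs;  // only the left-hand operand matters for this demonstration
  return lhs == nullptr;
}

// The literal 0 is a null pointer constant, so this compiles and passes null.
bool literal_zero_binds_as_null(const Value &input) {
  return lhs_is_null(0, &input);
}

// Passing the address of an explicit zero-valued object yields a real operand.
bool explicit_zero_binds_as_null(const Value &input) {
  Value zero{0};
  return lhs_is_null(&zero, &input);
}
```

With any `Value input`, `literal_zero_binds_as_null(input)` returns `true` and `explicit_zero_binds_as_null(input)` returns `false`, which is why the literal needed replacing.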

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Robert Dazi <14996868+v01dXYZ@users.noreply.github.com>
Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>
Co-authored-by: Alexis DUBURCQ <alexis.duburcq@gmail.com>
Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>
Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>
Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com>
Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Johnny <johnnynuca14@gmail.com>