[AutoDiff] Speed up reverse-mode kernel launches on GPU backends #578
I didn't find any bugs, but this PR touches the autodiff runtime on both LLVM and SPIR-V backends and introduces a host-side replica of the sizer shader's metadata layout, so it's worth a human look from someone who owns this code.
Extended reasoning...
Overview
This PR adds a host-eval fast path that bypasses the on-device adstack sizer dispatch when no SizeExpr in any adstack alloca contains an ExternalTensorRead leaf. Two files change: quadrants/runtime/gfx/adstack_sizer_launch.cpp (SPIR-V — Metal/Vulkan/MoltenVK) gets two new anonymous-namespace helpers and an early-return at the top of publish_adstack_metadata_spirv; quadrants/runtime/llvm/llvm_runtime_executor.cpp (CUDA/AMDGPU) generalizes the existing CPU-only host-eval branch to also fire on GPU when the same condition holds.
Security risks
None apparent. No auth, crypto, IPC, or network surface; this is a launch-path optimization in the autodiff runtime. The host-eval path writes the same metadata fields the GPU sizer would have written, via the same copy_h2d helpers already used elsewhere in the file.
Level of scrutiny
Higher than a typical perf tweak, lower than a public-API change. The motivating workload (Genesis test_differentiable_rigid[gpu]) and the reasoning are well-documented, but the SPIR-V path now contains a host replica of the sizer shader's metadata layout ([stride_float, stride_int, (offset_i, max_size_i)*], with 2 * max_size for Float / max_size for Int heap accumulation, and a max(_, 1) floor). Any future change to adstack_sizer_shader.cpp's running_off_f / running_off_i accumulator now has to be mirrored in eval_per_task_metadata_on_host, and divergence will silently mis-size GPU heaps for autodiff kernels — a class of bug that does not surface as a build failure or a clean test signal.
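To make the layout under discussion concrete, here is a minimal host-side sketch of how a `[stride_float, stride_int, (offset_i, max_size_i)*]` buffer could be assembled. It is not the PR's `eval_per_task_metadata_on_host`: the `AllocaSize` struct and `build_metadata` function are hypothetical names, and applying the `max(_, 1)` floor to the two strides is an assumption, since the review does not say which value is floored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one adstack alloca after its SizeExpr was evaluated.
struct AllocaSize {
  bool is_float;      // true: accounted in the Float heap, false: in the Int heap
  uint32_t max_size;  // evaluated stack capacity
};

// Builds [stride_float, stride_int, (offset_i, max_size_i)*].
// Assumption: the max(_, 1) floor applies to the two strides.
std::vector<uint32_t> build_metadata(const std::vector<AllocaSize> &allocas) {
  std::vector<uint32_t> meta(2 + 2 * allocas.size(), 0);
  uint32_t running_off_f = 0;  // Float heap accumulator
  uint32_t running_off_i = 0;  // Int heap accumulator
  for (std::size_t i = 0; i < allocas.size(); ++i) {
    const AllocaSize &a = allocas[i];
    uint32_t &running = a.is_float ? running_off_f : running_off_i;
    meta[2 + 2 * i] = running;         // offset_i within its heap
    meta[2 + 2 * i + 1] = a.max_size;  // max_size_i
    // Float entries accumulate 2 * max_size, Int entries max_size.
    running += a.is_float ? 2 * a.max_size : a.max_size;
  }
  meta[0] = std::max<uint32_t>(running_off_f, 1);  // stride_float
  meta[1] = std::max<uint32_t>(running_off_i, 1);  // stride_int
  return meta;
}
```

Under these assumptions, any change to how the shader accumulates `running_off_f` / `running_off_i` would need the same change in the loop body here, which is the maintenance coupling the review points at.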
Other factors
The bug hunter found nothing, the description is thorough and includes a side-effect audit, and the CPU + test_ad_basics tests pass. The author explicitly notes that the motivating CUDA/GPU benchmark wasn't run locally, which is the main thing I'd want a human owner to verify before merging. The change also touches the same area as several recent commits (f1dc891, 5d7cf54, 03ab3b3), so a reviewer with that recent context will catch interactions I can't see from the diff alone.
Coverage Report
| Metric | Value |
|---|---|
| Diff coverage (changed lines only) | 0% |
| Overall project coverage | 61% |
Total: 0 lines, 0 missing, 0% covered
…DGPU / Metal / Vulkan plus async pinned HtoD for LLVM adstack metadata - skips the on-device sizer dispatch and per-launch DtoH stride readback when every alloca's SizeExpr is host-resolvable, and pipelines the remaining metadata copies through a per-launch event-guarded pinned-host scratch on CUDA / AMDGPU
b1414d0 to 4a9d231
Cool. Since I remember the user-facing doc talks a lot about sizing, is there anything that needs updating in it?
I don't think so, this is purely internal. It adds a host-eval fast path that bypasses the on-device sizer kernel whenever possible. Same code logic, just a different execution path.
💡 Codex Review
The new GPU host-eval fast path dereferences program_impl_ unconditionally in program_impl_->program != nullptr. In the same function, the on-device branch explicitly documents/supports program_impl_ == nullptr for C++-only setups; with this change, kernels whose size exprs are host-resolvable can now hit this branch first and crash with a null dereference instead of taking the existing compile-time fallback behavior.
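A minimal sketch of the kind of guard this comment asks for, assuming the fix mirrors the on-device branch's fallback to the compile-time bound. The `Program`, `ProgramImpl` and `AllocaInfo` types and the `resolve_max_size` function are hypothetical stand-ins, not the runtime's real classes, and `host_evaluated_size` simply stands for whatever the host evaluator would return.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the runtime types named in the review.
struct Program {};
struct ProgramImpl {
  Program *program = nullptr;
};

struct AllocaInfo {
  uint32_t max_size_compile_time;  // conservative bound known at compile time
  uint32_t host_evaluated_size;    // stand-in for the host evaluator's result
};

// Checks the outer pointer before touching program_impl->program, and falls
// back to the compile-time bound when no Program is attached (for example in a
// C++-only test setup) instead of dereferencing a null pointer.
uint32_t resolve_max_size(const ProgramImpl *program_impl, const AllocaInfo &alloca) {
  if (program_impl == nullptr || program_impl->program == nullptr) {
    return alloca.max_size_compile_time;
  }
  return alloca.host_evaluated_size;
}
```

The short-circuit `||` is what makes the second test safe: `program_impl->program` is only read once `program_impl` is known to be non-null.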
P1 fixed in P2 fixed in
Coverage Report
| File | Coverage | Missing |
|---|---|---|
| 🟢 tests/python/test_adstack.py | 100% | |
Diff coverage: 100% · Overall: 73% · 3 lines, 0 missing
32b803f to 037c453
Coverage Report
| File | Coverage | Missing |
|---|---|---|
| 🟢 python/quadrants/_kernels.py | 100% | |
| 🔴 python/quadrants/lang/_fast_caching/args_hasher.py | 67% | 9,37 |
| 🔴 python/quadrants/lang/_func_base.py | 75% | 21 |
| 🟢 python/quadrants/lang/_kernel_impl_dataclass.py | 100% | |
| 🟢 python/quadrants/lang/_ndarray.py | 86% | 175,325 |
| 🔴 python/quadrants/lang/_ndarray_pickle.py | 67% | 8 |
| 🟢 python/quadrants/lang/_template_mapper_hotpath.py | 100% | |
| 🔴 python/quadrants/lang/any_array.py | 67% | 20 |
| 🟢 python/quadrants/lang/ast/ast_transformer.py | 88% | 669 |
| 🟢 python/quadrants/lang/ast/ast_transformers/call_transformer.py | 100% | |
| 🟢 python/quadrants/lang/ast/ast_transformers/function_def_transformer.py | 100% | |
| 🟢 python/quadrants/lang/field.py | 100% | |
| 🟢 python/quadrants/lang/impl.py | 80% | 18 |
| 🟢 python/quadrants/lang/kernel.py | 100% | |
| 🟢 python/quadrants/lang/kernel_arguments.py | 100% | |
| 🔴 python/quadrants/lang/matrix.py | 70% | 964,967,973,978,1161,1720,1831 |
| 🟢 tests/python/quadrants/lang/test_dlpack.py | 100% | |
| 🟢 tests/python/test_adstack.py | 100% | |
| 🟢 tests/python/test_api.py | 100% | |
| 🟢 tests/python/test_pickle.py | 100% | |
Diff coverage: 84% · Overall: 73% · 99 lines, 16 missing
…en every alloca's SizeExpr is host-resolvable (no ExternalTensorRead leaf): take the host evaluator path on CUDA / AMDGPU / Metal / Vulkan and write the metadata buffer directly via copy_h2d / unmapped scratch fill, eliminating one kernel launch and one DtoH stride-readback per launch on LLVM GPU backends and one cmdlist submit_synced + wait_idle pair per launch on SPIR-V backends; the on-device sizer still runs unchanged for kernels whose SizeExprs reach into ndarray data (which lives in GPU-private memory and cannot be touched from the host)
…sizes, stride) asynchronously from a pinned-host scratch on CUDA / AMDGPU; the host returns immediately after queuing the three copies on the default stream and the subsequent main-kernel launch stream-orders after them, eliminating three serial host stalls per launch in the host-eval fast path. Pinned scratch is allocated lazily via cuMemAllocHost / hipHostMalloc and grown amortised-doubling; a per-launch CUDA / HIP event guards scratch reuse against in-flight DMAs
… lines to better fill the 120-col budget
…escribe the overhead in absolute terms
… - the host fold goes through SNodeRwAccessorsBank::read_int whose nested accessor-kernel launch corrupts the publish-time launcher state and produces wrong gradients on kernels that mix FieldLoad with the on-device sizer fallback
…plified-unused-x value so any cross-stack push / pop misroute fails as a 200.0 vs 0.0 mismatch instead of a 0.2 vs 0.0 'looks-like-tolerance' delta
…ta HtoDs through CUDAContext::get_instance().get_stream() instead of a hard-coded nullptr so they stream-order against the main-kernel dispatch when the user has set a non-default stream via CUDAContext::set_stream; AMDGPU keeps nullptr because AMDGPUContext::launch always uses the default stream
…rogram against a null program_impl_ in the C++-only-tests setup, mirroring the on-device branch's existing nullptr fallback to max_size_compile_time
…pl_ null-guard comment to fit the 120-col budget
037c453 to 8d02758
Comparing Quadrants main vs this PR on Genesis main:
checklist:
=> ok to merge
Speed up reverse-mode kernel launches on GPU backends via an adstack-sizer host-eval fast path
TL;DR
Every reverse-mode kernel launch with adstack allocas runs the `SizeExpr` capacity computation. Pre-PR that always meant a GPU dispatch:

* LLVM (CUDA / AMDGPU): `HtoD` of the encoded `SizeExpr` bytecode -> single-thread `runtime_eval_adstack_size_expr` kernel launch -> synchronous `DtoH` of the per-thread stride. The `DtoH` is a stream-sync that stalls the host until the sizer kernel has finished. With ~100 substeps x forward + backward x several reverse-mode tasks per substep, the test launches the sizer thousands of times and pays one host stall per launch.
* SPIR-V: `flush()` + `device_->wait_idle()` for PSB visibility -> sizer-bytecode upload -> per-task descriptor bind / dispatch -> `submit_synced` -> blocking metadata readback. Two host-side GPU stalls per kernel launch.

This PR detects the common case where no `SizeExpr` leaf needs device-resident memory and skips the entire dispatch on both backends.

Reported impact on the motivating workload (Genesis `test_differentiable_rigid[gpu]`): roughly 4x faster on CUDA after enabling this path.

Why

`evaluate_adstack_size_expr` already handles every leaf the on-device sizer was designed to interpret, except `ExternalTensorRead`, whose data pointer is GPU-private. Detecting that one leaf at host time and skipping the dispatch is straightforward, and the `unrolling_limit` baseline is exactly the all-host-resolvable case (no adstacks at all => no `SizeExpr` at all => no dispatch), so this fast path is the closest the adstack mode can get to the unrolled baseline's launch overhead.

Mechanism

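As a rough illustration of the host-resolvability check, the C++ sketch below scans a set of size expressions for an `ExternalTensorRead` leaf and derives the dispatch decision from the result. Every type and function name here (`KindSketch`, `NodeSketch`, `SizeExprSketch`, the `_sketch` functions) is a hypothetical stand-in, not the PR's actual code; only the `ExternalTensorRead` leaf kind and the `!is_gpu_llvm || all_size_exprs_host_resolvable` condition are taken from this description.

```cpp
#include <vector>

// Hypothetical, simplified stand-ins for the real SizeExpr node kinds.
enum class KindSketch { Const, FieldLoad, BoundVariable, ExternalTensorShape, ExternalTensorRead };

struct NodeSketch {
  KindSketch kind;
};

// One size expression per adstack alloca (assumed shape, for illustration only).
struct SizeExprSketch {
  std::vector<NodeSketch> nodes;
};

// Returns true when no node needs device-resident memory, i.e. when the host
// can evaluate every size expression itself. The scan is linear in the total
// node count and stops at the first ExternalTensorRead it meets.
bool all_size_exprs_host_resolvable_sketch(const std::vector<SizeExprSketch>& allocas) {
  for (const SizeExprSketch& expr : allocas) {
    for (const NodeSketch& node : expr.nodes) {
      if (node.kind == KindSketch::ExternalTensorRead) {
        return false;
      }
    }
  }
  return true;
}

// CPU always host-evals; a GPU backend does so only when the scan succeeds.
bool use_host_eval_sketch(bool is_gpu_llvm, const std::vector<SizeExprSketch>& allocas) {
  return !is_gpu_llvm || all_size_exprs_host_resolvable_sketch(allocas);
}
```

The SPIR-V variant of the predicate also rejects `FieldLoad`; that extra check is left out of this sketch to keep it minimal.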
LLVM path (`LlvmRuntimeExecutor::publish_adstack_metadata`)

The function already had a CPU branch that host-evals each `SerializedSizeExpr` and writes the metadata arrays directly via `copy_h2d`. The branch was gated on `!is_gpu_llvm`. The new code:

* Scans each alloca's `size_expr.nodes` for any node whose kind is `SizeExpr::Kind::ExternalTensorRead`. The scan is O(total node count across allocas).
* Computes `use_host_eval = !is_gpu_llvm || all_size_exprs_host_resolvable`.
* Takes the host-eval branch whenever `use_host_eval` is true. CUDA / AMDGPU now reach this branch when no `ExternalTensorRead` is present.
* Keeps the on-device `runtime_eval_adstack_size_expr` JIT call for kernels with `ExternalTensorRead`.

What dropped per launch when the fast path fires: one bytecode `HtoD`, one device sizer kernel launch, one `DtoH` stream-sync.

LLVM async pinned-host metadata HtoD

The fast path's three small per-launch HtoD copies (`offsets`, `max_sizes`, `stride`) are issued asynchronously from a pinned-host scratch via `cuMemcpyHtoDAsync` / `hipMemcpyHtoDAsync`. The host returns immediately after queueing the three copies on the active CUDA stream (`CUDAContext::get_instance().get_stream()`, so user-set custom streams stream-order against the main-kernel dispatch correctly; AMDGPU keeps `nullptr` because `AMDGPUContext::launch` always uses the default stream). Pinned scratch is allocated lazily via `cuMemAllocHost` / `hipHostMalloc` and grown by amortised doubling; a per-launch CUDA / HIP event guards scratch reuse against in-flight DMAs. This eliminates the three serial host stalls per launch that the synchronous `cuMemcpyHtoD_v2` path had.

SPIR-V path (`GfxRuntime::publish_adstack_metadata_spirv`)

Two helpers in an anonymous namespace:

* `all_size_exprs_host_resolvable(adstack_task_indices, task_attribs)`: scans every adstack-bearing task's allocas for an `ExternalTensorRead` or `FieldLoad` leaf. `FieldLoad` is the correctness gate: the host evaluator's `FieldLoad` path goes through `SNodeRwAccessorsBank::read_int`, whose nested accessor-kernel launch from inside the publish corrupts the SPIR-V launcher's per-task metadata-upload state and produces wrong gradients on every kernel that hits it. The on-device sizer was built to handle `FieldLoad` on-device via PSB loads for exactly this reason; the host-eval predicate must therefore reject both `ExternalTensorRead` (host can't read GPU-private memory) and `FieldLoad` (nested launch is unsafe).
* `eval_per_task_metadata_on_host(adstack_task_indices, task_attribs, prog, host_ctx, per_task_ad_stack)`: replicates the sizer shader's per-task metadata layout (`[stride_float, stride_int, (offset_i, max_size_i)*]`) on the host. The Float-heap accumulator advances by `2 * max_size` (primal + adjoint) and the Int-heap accumulator by `max_size`, matching the `running_off_f` / `running_off_i` arithmetic in `quadrants/codegen/spirv/adstack_sizer_shader.cpp`.

The fast path runs after the `adstack_task_indices` early-out and before the sizer-pipeline build / bytecode upload / cmdlist record. When it fires, the function returns the host-computed `per_task_ad_stack` vector and never touches the sizer pipeline, the bytecode scratch buffer, the per-task metadata-buffer allocation, the `flush()`, the `device_->wait_idle()`, the sizer cmdlist record, the `submit_synced`, or the metadata readback.

Per-backend coverage matrix

| Backend | Pre-PR | Fast path (no `ExternalTensorRead`, no `FieldLoad`) |
| --- | --- | --- |
| CUDA | `HtoD` bytecode + sizer kernel + `DtoH` stride sync | three async `HtoD`s on the active stream, no kernel, no sync |
| AMDGPU | `HtoD` bytecode + sizer kernel + `DtoH` stride sync | three async `HtoD`s on the default stream, no kernel, no sync |
| SPIR-V | `flush` + `wait_idle` + bytecode upload + sizer cmdlist + readback | no dispatch (kernel is `ExternalTensorRead` and `FieldLoad` free) |

Kernels with an `ExternalTensorRead` leaf (or a `FieldLoad` leaf on SPIR-V) stay on the on-device sizer path.

LLVM's host-eval `FieldLoad` is serviced by `SNodeRwAccessorsBank` exactly as before: no change for the LLVM CPU / CUDA / AMDGPU paths, because the launcher reentrancy issue that gates SPIR-V doesn't apply there.

Tests

* `pytest tests/python/test_adstack.py -n 8`: 770 passed, 10 xfailed locally on macOS Vulkan / arm64 / Metal.
* `test_adstack_sub_of_max_over_range_fusion_does_not_mix_fieldload_and_extread` is parametrized on `x_unused_val=[0.1, 100.0]`. The `amplified_unused_x` variant pins any future cross-stack push / pop misroute as a `200.0 vs 0.0` mismatch (5+ orders of magnitude) instead of a `0.2 vs 0.0` "looks-like-tolerance" delta. The original `0.1` setup was added by this PR's predecessor and made the SPIR-V FieldLoad-during-publish corruption look like a numerical tolerance issue rather than the structural correctness bug it was.
* `test_differentiable_rigid[gpu]` end-to-end: ~4x faster on CUDA per the reported repro.
* `tests/test_grad.py::test_differentiable_rigid[cpu]` end-to-end: passes.

Codex / Claude bot review fixes

* Issue: the LLVM host-eval branch dereferenced `program_impl_->program` unconditionally; the on-device branch already supports `program_impl_ == nullptr` (C++-only tests) and falls back to `max_size_compile_time`. Fix: a `program_impl_` null-check before the `evaluate_adstack_size_expr` call, mirroring the on-device branch.
* Issue: the async metadata copies were queued on `default_stream = nullptr`; user calls to `CUDAContext::set_stream` would leave kernels reading stale metadata. Fix: the copies are queued on `CUDAContext::get_instance().get_stream()` so they stream-order against `CUDAContext::launch`'s dispatch handle. AMDGPU keeps `nullptr` because `AMDGPUContext::launch` always passes `nullptr` to `hipLaunchKernel`.

Side-effect audit

* `evaluate_adstack_size_expr` is the same function the on-device-bytecode encoder already calls during pre-substitution, so the leaves it can fold (`Const` / `FieldLoad` / `BoundVariable` / `ExternalTensorShape` / arithmetic / `MaxOverRange`) produce identical values.
* `FieldLoad` reentrancy: the SPIR-V host-eval predicate rejects `FieldLoad`, so the SPIR-V publish never calls `read_int` from inside `publish_adstack_metadata_spirv`; the on-device sizer's PSB-load path handles `FieldLoad` correctly.
* The SPIR-V host path accumulates `2 * max_size` for the Float heap and `max_size` for the Int heap, matching `adstack_sizer_shader.cpp`'s `running_off_f` / `running_off_i` accumulation; the final stride values are written into `metadata[0]` / `metadata[1]`.
* The stride uses the `align_up_8(sizeof(int64_t) + entry_size_bytes * max_size)` formula already used by the existing CPU host-eval branch; no change.
* `ExternalTensorRead` falls back to on-device sizer: the scan over `size_expr.nodes` returns false on the first `ExternalTensorRead` kind; the `else` arm of the dispatch retains the original LLVM `runtime_eval_adstack_size_expr` call and the SPIR-V cmdlist record path.
* The async copies are queued on `CUDAContext::get_instance().get_stream()`; pinned scratch reuse is guarded by a per-launch CUDA / HIP event so DMAs cannot race the host overwrite.
* When `size_expr.nodes.empty()` (offline-cache hit, symbolic tree not serialised), the host-eval path uses `max_size_compile_time` with the same `max(_, 1)` lower clamp the shader applies.
* `ProgramImpl` / program back-reference: the LLVM host-eval path checks `program_impl_ != nullptr` before dereferencing `program_impl_->program` (P1 review fix); the SPIR-V path's existing `QD_ASSERT_INFO` precondition is unchanged.
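
To make the layout arithmetic in this audit concrete, here is a minimal C++ sketch of a host-side pass that fills a `[stride_float, stride_int, (offset_i, max_size_i)*]` array, plus the `align_up_8(...)` stride formula quoted above. It is an illustration under stated assumptions, not the PR's `eval_per_task_metadata_on_host`: the `AllocaSketch` type, the `_sketch` function names, the `uint32_t` element type, recording each offset before advancing the accumulator, applying the `max(_, 1)` clamp to every alloca, and reading `align_up_8` as "round up to a multiple of 8" are all assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one adstack alloca after its SizeExpr was evaluated.
struct AllocaSketch {
  bool is_float;      // true: Float heap (primal + adjoint), false: Int heap
  uint32_t max_size;  // evaluated capacity, possibly 0 before clamping
};

// Builds [stride_float, stride_int, (offset_i, max_size_i)*] on the host.
std::vector<uint32_t> eval_task_metadata_sketch(const std::vector<AllocaSketch>& allocas) {
  std::vector<uint32_t> metadata(2 + 2 * allocas.size(), 0);
  uint32_t running_off_f = 0;  // Float-heap accumulator
  uint32_t running_off_i = 0;  // Int-heap accumulator
  for (std::size_t i = 0; i < allocas.size(); ++i) {
    // Assumed: the max(_, 1) floor applies to every alloca.
    const uint32_t max_size = std::max<uint32_t>(allocas[i].max_size, 1);
    uint32_t& running = allocas[i].is_float ? running_off_f : running_off_i;
    metadata[2 + 2 * i] = running;       // offset_i (assumed: offset before advancing)
    metadata[2 + 2 * i + 1] = max_size;  // max_size_i
    // Float entries take two slots per element (primal + adjoint), Int entries one.
    running += allocas[i].is_float ? 2 * max_size : max_size;
  }
  metadata[0] = running_off_f;  // stride_float
  metadata[1] = running_off_i;  // stride_int
  return metadata;
}

// Assumed meaning of align_up_8: round up to the next multiple of 8.
constexpr uint64_t align_up_8(uint64_t n) {
  return (n + 7) & ~uint64_t{7};
}

// Per-stack byte size following the formula quoted in the audit.
uint64_t stack_bytes_sketch(uint64_t entry_size_bytes, uint64_t max_size) {
  return align_up_8(sizeof(int64_t) + entry_size_bytes * max_size);
}
```

The point of writing it out is the maintenance hazard the review raises: if the shader's accumulator changes (for example a different per-element slot count), a host replica like this one has to change in lockstep, and a test that compares the two outputs for the same allocas would be one way to catch drift.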