[AutoDiff] Speed up reverse-mode kernel launches on GPU backends by duburcqa · Pull Request #578 · Genesis-Embodied-AI/quadrants

duburcqa · 2026-04-27T16:43:46Z

Speed up reverse-mode kernel launches on GPU backends via an adstack-sizer host-eval fast path

Adds a host-eval fast path on every GPU backend (CUDA / AMDGPU on the LLVM runtime; Metal / Vulkan / MoltenVK on the SPIR-V runtime) that bypasses the on-device sizer kernel + readback whenever every adstack alloca's SizeExpr is host-resolvable. Includes the LLVM-side async pinned-host HtoD optimisation, the SPIR-V predicate's FieldLoad correctness fix, a Codex/Claude-bot review pass on the LLVM path (active-stream routing, program_impl_ null-guard), a parametrized regression test, and several rounds of comment / wrap polish.

TL;DR

Every reverse-mode kernel launch with adstack allocas runs the SizeExpr capacity computation. Pre-PR that always meant a GPU dispatch:

LLVM (CUDA / AMDGPU): HtoD of the encoded SizeExpr bytecode -> single-thread runtime_eval_adstack_size_expr kernel launch -> synchronous DtoH of the per-thread stride. The DtoH is a stream-sync that stalls the host until the sizer kernel has finished. With ~100 substeps x forward + backward x several reverse-mode tasks per substep, the motivating test (Genesis test_differentiable_rigid[gpu]) launches the sizer thousands of times and pays one host stall per launch.
SPIR-V (Metal / Vulkan / MoltenVK): explicit flush() + device_->wait_idle() for PSB visibility -> sizer-bytecode upload -> per-task descriptor bind / dispatch -> submit_synced -> blocking metadata readback. Two host-side GPU stalls per kernel launch.

This PR detects the common case where no SizeExpr leaf needs device-resident memory and skips the entire dispatch on both backends.

Reported impact on the motivating workload (Genesis test_differentiable_rigid[gpu]): roughly 4x faster on CUDA after enabling this path.

Why

evaluate_adstack_size_expr already handles every leaf the on-device sizer was designed to interpret, except ExternalTensorRead whose data pointer is GPU-private. Detecting that one leaf at host time and skipping the dispatch is straightforward, and the unrolling_limit baseline is exactly the all-host-resolvable case (no adstacks at all => no SizeExpr at all => no dispatch), so this fast path is the closest the adstack mode can get to the unrolled baseline's launch overhead.

Mechanism

LLVM path (`LlvmRuntimeExecutor::publish_adstack_metadata`)

The function already had a CPU branch that host-evals each SerializedSizeExpr and writes the metadata arrays directly via copy_h2d. The branch was gated on !is_gpu_llvm. The new code:

Scans every alloca's size_expr.nodes for any node whose kind is SizeExpr::Kind::ExternalTensorRead. O(total node count across allocas).
Sets use_host_eval = !is_gpu_llvm || all_size_exprs_host_resolvable.
Routes to the existing host-eval branch when use_host_eval is true. CUDA / AMDGPU now reach this branch when no ExternalTensorRead is present.
Falls back to the existing runtime_eval_adstack_size_expr JIT call for kernels with ExternalTensorRead.

What is dropped per launch when the fast path fires: one bytecode HtoD, one device sizer kernel launch, and one DtoH stream-sync.
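The routing decision above can be modelled in a few lines. This is a minimal Python sketch of the logic only, not the C++ in `LlvmRuntimeExecutor::publish_adstack_metadata`: the `Kind` enum, the `SizeExpr` dataclass and `choose_sizer_path` are hypothetical stand-ins, and only the predicate `use_host_eval = !is_gpu_llvm || all_size_exprs_host_resolvable` is taken from the description.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Kind(Enum):
    # Illustrative subset of the node kinds named in the PR description.
    CONST = auto()
    FIELD_LOAD = auto()
    BOUND_VARIABLE = auto()
    EXTERNAL_TENSOR_SHAPE = auto()
    EXTERNAL_TENSOR_READ = auto()
    ADD = auto()
    MAX_OVER_RANGE = auto()


@dataclass
class SizeExpr:
    # Hypothetical stand-in for a serialized SizeExpr: a flat list of node kinds.
    nodes: list = field(default_factory=list)


def all_size_exprs_host_resolvable(size_exprs):
    """True when no node of any alloca's SizeExpr is an ExternalTensorRead.

    Single linear scan, O(total node count across allocas).
    """
    return not any(
        kind is Kind.EXTERNAL_TENSOR_READ
        for expr in size_exprs
        for kind in expr.nodes
    )


def choose_sizer_path(is_gpu_llvm, size_exprs):
    """Return which branch the LLVM publish step would take (labels are made up)."""
    use_host_eval = (not is_gpu_llvm) or all_size_exprs_host_resolvable(size_exprs)
    return "host_eval" if use_host_eval else "device_sizer"
```

Under this model a CUDA / AMDGPU launch whose SizeExprs only use `Const`, `FieldLoad` and arithmetic takes the host-eval branch, while a single `ExternalTensorRead` anywhere sends the whole launch back to the on-device sizer.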

LLVM async pinned-host metadata HtoD

The fast path's three small per-launch HtoD copies (offsets, max_sizes, stride) are issued asynchronously from a pinned-host scratch via cuMemcpyHtoDAsync / hipMemcpyHtoDAsync. The host returns immediately after queueing the three copies. On CUDA they are queued on the active stream, CUDAContext::get_instance().get_stream(), so that they stream-order correctly against the main-kernel dispatch even when the user has set a custom stream; AMDGPU keeps nullptr because AMDGPUContext::launch always uses the default stream. Pinned scratch is allocated lazily via cuMemAllocHost / hipHostMalloc and grown by amortised doubling; a per-launch CUDA / HIP event guards scratch reuse against in-flight DMAs. This eliminates the three serial host stalls per launch that the synchronous cuMemcpyHtoD_v2 path incurred.
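The scratch-reuse policy described above (lazy allocation, amortised doubling, an event guarding reuse) can be illustrated with a host-only model. This is a sketch under stated assumptions, not the runtime's code: `FakeEvent` and `PinnedScratch` are hypothetical names, a `bytearray` stands in for pinned host memory, and no CUDA / HIP call is made.

```python
class FakeEvent:
    """Stand-in for a CUDA / HIP event recorded after the async copies."""

    def __init__(self):
        self.done = False

    def synchronize(self):
        # A real event would block the host here until the DMAs have completed.
        self.done = True


class PinnedScratch:
    """Host-only model of a lazily allocated, doubling scratch buffer."""

    def __init__(self):
        self.capacity = 0
        self.buffer = bytearray()
        self.pending = None   # event from the previous launch, if any
        self.allocations = 0  # counts (re)allocations, for illustration

    def reserve(self, nbytes):
        # Do not overwrite the scratch while earlier copies may still read it.
        if self.pending is not None and not self.pending.done:
            self.pending.synchronize()
        if nbytes > self.capacity:
            # Amortised doubling: grow to the next power-of-two multiple.
            new_capacity = max(self.capacity, 1)
            while new_capacity < nbytes:
                new_capacity *= 2
            self.buffer = bytearray(new_capacity)
            self.capacity = new_capacity
            self.allocations += 1
        return memoryview(self.buffer)[:nbytes]

    def record(self):
        # Called after queueing the copies; guards the next reserve().
        self.pending = FakeEvent()
        return self.pending
```

In this model the host only waits when it is about to reuse the scratch for the next launch, and reallocation happens only when a launch needs more bytes than any earlier one.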

SPIR-V path (`GfxRuntime::publish_adstack_metadata_spirv`)

Two helpers in an anonymous namespace:

all_size_exprs_host_resolvable(adstack_task_indices, task_attribs): scans every adstack-bearing task's allocas for an ExternalTensorRead or FieldLoad leaf. FieldLoad is the correctness gate: the host evaluator's FieldLoad path goes through SNodeRwAccessorsBank::read_int, whose nested accessor-kernel launch from inside the publish corrupts the SPIR-V launcher's per-task metadata-upload state and produces wrong gradients on every kernel that hits it. The on-device sizer was specifically built to handle FieldLoad on-device via PSB loads precisely because of this; the host-eval predicate must therefore reject both ExternalTensorRead (host can't read GPU-private memory) and FieldLoad (nested launch is unsafe).
eval_per_task_metadata_on_host(adstack_task_indices, task_attribs, prog, host_ctx, per_task_ad_stack): replicates the sizer shader's per-task metadata layout ([stride_float, stride_int, (offset_i, max_size_i)*]) on the host. Float-heap accumulator advances by 2 * max_size (primal + adjoint), Int-heap by max_size, matching the running_off_f / running_off_i arithmetic in quadrants/codegen/spirv/adstack_sizer_shader.cpp.
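The per-task layout replicated by the second helper can be sketched as follows. This is a hedged Python model rather than the C++ helper: the function name, the `(heap, max_size)` input tuples and the choice to count offsets in heap elements are assumptions made for illustration. The layout `[stride_float, stride_int, (offset_i, max_size_i)*]`, the `2 * max_size` / `max_size` accumulation and the `max(_, 1)` floor come from the description.

```python
def eval_per_task_metadata_sketch(stacks):
    """Model of the per-task metadata layout.

    stacks: list of (heap, max_size) pairs, heap being "float" or "int".
    Returns [stride_float, stride_int, offset_0, max_size_0, offset_1, ...].
    """
    running_off_f = 0  # Float-heap accumulator
    running_off_i = 0  # Int-heap accumulator
    entries = []
    for heap, max_size in stacks:
        max_size = max(max_size, 1)  # lower clamp, as the shader applies
        if heap == "float":
            entries += [running_off_f, max_size]
            running_off_f += 2 * max_size  # primal + adjoint
        elif heap == "int":
            entries += [running_off_i, max_size]
            running_off_i += max_size
        else:
            raise ValueError(f"unknown heap: {heap!r}")
    # The final accumulator values are the per-thread strides.
    return [running_off_f, running_off_i] + entries
```

For example, two Float stacks of depth 4 and 0 plus one Int stack of depth 3 give a Float stride of `2*4 + 2*1 = 10` and an Int stride of 3 under this model.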

The fast path runs after the adstack_task_indices early-out and before the sizer-pipeline build / bytecode upload / cmdlist record. When it fires, the function returns the host-computed per_task_ad_stack vector and skips the sizer pipeline, the bytecode scratch buffer, the per-task metadata-buffer allocation, the flush(), the device_->wait_idle(), the sizer cmdlist record, the submit_synced, and the metadata readback.

Per-backend coverage matrix

Backend	Pre-PR per-launch sizer cost	Post-PR (host-resolvable: no `ExternalTensorRead`, no `FieldLoad`)	Post-PR (with `ExternalTensorRead` or `FieldLoad` on SPIR-V)
CPU	Already host-eval (no GPU dispatch)	Unchanged	Unchanged
CUDA	`HtoD` bytecode + sizer kernel + `DtoH` stride sync	3 small async pinned `HtoD`s on the active stream, no kernel, no sync	Unchanged (sizer kernel + readback)
AMDGPU	`HtoD` bytecode + sizer kernel + `DtoH` stride sync	3 small async pinned `HtoD`s on the default stream, no kernel, no sync	Unchanged
Metal / MoltenVK	`flush` + `wait_idle` + bytecode upload + sizer cmdlist + readback	None - host-eval only (when every SizeExpr is free of `ExternalTensorRead` and `FieldLoad`)	Unchanged
Vulkan	Same as Metal (shared SPIR-V path)	Same as Metal	Unchanged

LLVM's host-eval FieldLoad is serviced by SNodeRwAccessorsBank exactly as before - no change for the LLVM CPU / CUDA / AMDGPU paths because the launcher reentrancy issue that gates SPIR-V doesn't apply there.

Tests

pytest tests/python/test_adstack.py -n 8: 770 passed, 10 xfailed locally on macOS Vulkan / arm64 / Metal.
test_adstack_sub_of_max_over_range_fusion_does_not_mix_fieldload_and_extread is parametrized on x_unused_val=[0.1, 100.0]. The amplified_unused_x variant pins any future cross-stack push / pop misroute as a 200.0 vs 0.0 mismatch (5+ orders of magnitude) instead of a 0.2 vs 0.0 "looks-like-tolerance" delta - the original 0.1 setup was added by this PR's predecessor and made the SPIR-V FieldLoad-during-publish corruption look like a numerical tolerance issue rather than the structural correctness bug it was.
test_differentiable_rigid[gpu] end-to-end: ~4x faster on CUDA per the reported repro.
tests/test_grad.py::test_differentiable_rigid[cpu] end-to-end: passes.

Codex / Claude bot review fixes

Severity	Issue	Fix
P1	Host-eval branch dereferenced `program_impl_->program` unconditionally; the on-device branch already supports `program_impl_ == nullptr` (C++-only tests) and falls back to `max_size_compile_time`	Hoisted the `program_impl_` null-check before the `evaluate_adstack_size_expr` call, mirroring the on-device branch
P2	Async metadata HtoDs were hard-coded to `default_stream = nullptr`; user calls to `CUDAContext::set_stream` would leave kernels reading stale metadata	Routed CUDA copies through `CUDAContext::get_instance().get_stream()` so they stream-order against `CUDAContext::launch`'s dispatch handle. AMDGPU keeps `nullptr` because `AMDGPUContext::launch` always passes `nullptr` to `hipLaunchKernel`
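The P1 fix amounts to checking the outer pointer before the inner one. A minimal Python model of that guard, assuming a hypothetical `ProgramImpl` stand-in and an `evaluate` callback in place of `evaluate_adstack_size_expr`:

```python
class ProgramImpl:
    """Hypothetical stand-in for the C++ ProgramImpl back-reference holder."""

    def __init__(self, program=None):
        self.program = program


def resolve_max_size(expr_nodes, program_impl, max_size_compile_time, evaluate):
    """Host-eval one alloca's SizeExpr, else fall back to the compile-time bound."""
    # Check program_impl itself before touching program_impl.program.
    has_program = program_impl is not None and program_impl.program is not None
    if expr_nodes and has_program:
        return evaluate(expr_nodes, program_impl.program)
    return max_size_compile_time
```

With `program_impl` set to `None`, as in a C++-only test setup, the model returns the compile-time bound instead of failing on the attribute access.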

Side-effect audit

Concern	Where checked	Verdict
Host-eval correctness vs on-device sizer for ExternalTensorRead-free SizeExprs	`evaluate_adstack_size_expr` is the same function the on-device-bytecode encoder already calls during pre-substitution, so the leaves it can fold (`Const` / `FieldLoad` / `BoundVariable` / `ExternalTensorShape` / arithmetic / `MaxOverRange`) produce identical values	Same code path; no divergence between host fold and on-device sizer when the leaf set is identical
SPIR-V `FieldLoad` reentrancy	Predicate explicitly rejects `FieldLoad` so the SPIR-V publish never calls `read_int` from inside `publish_adstack_metadata_spirv`; the on-device sizer's PSB-load path handles `FieldLoad` correctly	Documented in the predicate's docstring and pinned by the parametrized regression test
Metadata layout match between SPIR-V host-eval and shader output	Per-stack offset uses `2 * max_size` for the Float heap and `max_size` for the Int heap, matching `adstack_sizer_shader.cpp`'s `running_off_f` / `running_off_i` accumulation; final stride values written into `metadata[0]` / `metadata[1]`	Bit-identical to shader output for the host-eval'able subset
LLVM byte-offset / 8-byte alignment match	Same `align_up_8(sizeof(int64_t) + entry_size_bytes * max_size)` formula already used by the existing CPU host-eval branch; no change	Identical to the previously CPU-only branch's behaviour
`ExternalTensorRead` falls back to on-device sizer	Single linear scan of `size_expr.nodes` returns false on the first `ExternalTensorRead` kind; the `else` arm of the dispatch retains the original LLVM `runtime_eval_adstack_size_expr` call and the SPIR-V cmdlist record path	Behaviour identical to pre-PR for kernels that hit this case
LLVM async metadata stream-ordering	Active-stream routing via `CUDAContext::get_instance().get_stream()`; pinned scratch reuse guarded by per-launch CUDA / HIP event so DMAs cannot race the host overwrite	Stream-ordered against the main-kernel dispatch on every CUDA stream the user can configure
Cache-load fallback for empty SizeExpr	When `size_expr.nodes.empty()` (offline-cache hit, symbolic tree not serialised) the host-eval path uses `max_size_compile_time` with the same `max(_, 1)` lower clamp the shader applies	Identical to shader behaviour
`ProgramImpl` / program back-reference	LLVM host-eval branch now guards on `program_impl_ != nullptr` before dereferencing `program_impl_->program` (P1 review fix); SPIR-V path's existing `QD_ASSERT_INFO` precondition is unchanged	C++-only-tests setups fall back to compile-time bounds rather than crashing
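The LLVM byte-offset row of the audit quotes the formula `align_up_8(sizeof(int64_t) + entry_size_bytes * max_size)`. A small worked sketch of that arithmetic in Python, where the formula is from the description but `layout_offsets` and the way strides are accumulated into offsets are assumptions made for illustration:

```python
INT64_SIZE = 8  # sizeof(int64_t): the per-stack header


def align_up_8(n):
    """Round n up to the next multiple of 8."""
    return (n + 7) & ~7


def stack_stride_bytes(entry_size_bytes, max_size):
    """Bytes reserved for one adstack: header plus entries, 8-byte aligned."""
    return align_up_8(INT64_SIZE + entry_size_bytes * max_size)


def layout_offsets(stacks):
    """Assumed packing: stacks laid out back to back.

    stacks: list of (entry_size_bytes, max_size) pairs.
    Returns (per-stack byte offsets, total bytes).
    """
    offsets, running = [], 0
    for entry_size_bytes, max_size in stacks:
        offsets.append(running)
        running += stack_stride_bytes(entry_size_bytes, max_size)
    return offsets, running
```

For a stack with 4-byte entries and `max_size = 3`, the unaligned size is `8 + 12 = 20` bytes, which rounds up to 24.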

claude

I didn't find any bugs, but this PR touches the autodiff runtime on both LLVM and SPIR-V backends and introduces a host-side replica of the sizer shader's metadata layout, so it's worth a human look from someone who owns this code.

Extended reasoning...

Overview

This PR adds a host-eval fast path that bypasses the on-device adstack sizer dispatch when no SizeExpr in any adstack alloca contains an ExternalTensorRead leaf. Two files change: quadrants/runtime/gfx/adstack_sizer_launch.cpp (SPIR-V — Metal/Vulkan/MoltenVK) gets two new anonymous-namespace helpers and an early-return at the top of publish_adstack_metadata_spirv; quadrants/runtime/llvm/llvm_runtime_executor.cpp (CUDA/AMDGPU) generalizes the existing CPU-only host-eval branch to also fire on GPU when the same condition holds.

Security risks

None apparent. No auth, crypto, IPC, or network surface; this is a launch-path optimization in the autodiff runtime. The host-eval path writes the same metadata fields the GPU sizer would have written, via the same copy_h2d helpers already used elsewhere in the file.

Level of scrutiny

Higher than a typical perf tweak, lower than a public-API change. The motivating workload (Genesis test_differentiable_rigid[gpu]) and the reasoning are well-documented, but the SPIR-V path now contains a host replica of the sizer shader's metadata layout ([stride_float, stride_int, (offset_i, max_size_i)*], with 2 * max_size for Float / max_size for Int heap accumulation, and a max(_, 1) floor). Any future change to adstack_sizer_shader.cpp's running_off_f / running_off_i accumulator now has to be mirrored in eval_per_task_metadata_on_host, and divergence will silently mis-size GPU heaps for autodiff kernels — a class of bug that does not surface as a build failure or a clean test signal.

Other factors

The bug hunter found nothing, the description is thorough and includes a side-effect audit, and CPU + test_ad_basics tests pass. The author explicitly notes that the motivating CUDA/GPU benchmark wasn't run locally, which is the main thing I'd want a human owner to verify before merging. The change is also touching the same area as several recent commits (f1dc891, 5d7cf54, 03ab3b3), so a reviewer with that recent context will catch interactions I can't see from the diff alone.

github-actions · 2026-04-27T18:31:08Z

Coverage Report (`4f0688fca`)

Metric	Value
Diff coverage (changed lines only)	0%
Overall project coverage	61%

Total: 0 lines, 0 missing, 0% covered

…DGPU / Metal / Vulkan plus async pinned HtoD for LLVM adstack metadata - skips the on-device sizer dispatch and per-launch DtoH stride readback when every alloca's SizeExpr is host-resolvable, and pipelines the remaining metadata copies through a per-launch event-guarded pinned-host scratch on CUDA / AMDGPU

hughperkins · 2026-04-28T15:06:46Z

Cool. Since I remember the user-facing doc talks a lot about sizing, is there anything that needs updating in them?

duburcqa · 2026-04-28T15:09:40Z

> Cool. Since I remember the user-facing doc talks a lot about sizing, is there anything that needs updating in them?

I don't think so, this is purely internal. It adds a host-eval fast path that bypasses the on-device sizer kernel whenever possible. Same logic, just a different execution path.

chatgpt-codex-connector

💡 Codex Review

quadrants/quadrants/runtime/llvm/llvm_runtime_executor.cpp

Line 742 in 4a9d231

    
           if (expr != nullptr && !expr->nodes.empty() && program_impl_->program != nullptr) {

Guard host-eval path when ProgramImpl is unavailable

The new GPU host-eval fast path dereferences program_impl_ unconditionally in program_impl_->program != nullptr. In the same function, the on-device branch explicitly documents/supports program_impl_ == nullptr for C++-only setups; with this change, kernels whose size exprs are host-resolvable can now hit this branch first and crash with a null dereference instead of taking the existing compile-time fallback behavior.


duburcqa · 2026-04-28T15:43:33Z

P1 fixed in e269abe94: gated program_impl_->program behind a null-check on program_impl_, falling back to max_size_compile_time like the on-device branch.

P2 fixed in bf7c12607: routed the CUDA async copies through CUDAContext::get_instance().get_stream() so they share the kernel-launch stream.

github-actions · 2026-04-28T16:58:03Z

Coverage Report (`32b803f0c`)

File	Coverage	Missing
🟢 `tests/python/test_adstack.py`	100%

Diff coverage: 100% · Overall: 73% · 3 lines, 0 missing


github-actions · 2026-04-28T18:46:05Z

Coverage Report (`037c4530f`)

File	Coverage	Missing
🟢 `python/quadrants/_kernels.py`	100%
🔴 `python/quadrants/lang/_fast_caching/args_hasher.py`	67%	9,37
🔴 `python/quadrants/lang/_func_base.py`	75%	21
🟢 `python/quadrants/lang/_kernel_impl_dataclass.py`	100%
🟢 `python/quadrants/lang/_ndarray.py`	86%	175,325
🔴 `python/quadrants/lang/_ndarray_pickle.py`	67%	8
🟢 `python/quadrants/lang/_template_mapper_hotpath.py`	100%
🔴 `python/quadrants/lang/any_array.py`	67%	20
🟢 `python/quadrants/lang/ast/ast_transformer.py`	88%	669
🟢 `python/quadrants/lang/ast/ast_transformers/call_transformer.py`	100%
🟢 `python/quadrants/lang/ast/ast_transformers/function_def_transformer.py`	100%
🟢 `python/quadrants/lang/field.py`	100%
🟢 `python/quadrants/lang/impl.py`	80%	18
🟢 `python/quadrants/lang/kernel.py`	100%
🟢 `python/quadrants/lang/kernel_arguments.py`	100%
🔴 `python/quadrants/lang/matrix.py`	70%	964,967,973,978,1161,1720,1831
🟢 `tests/python/quadrants/lang/test_dlpack.py`	100%
🟢 `tests/python/test_adstack.py`	100%
🟢 `tests/python/test_api.py`	100%
🟢 `tests/python/test_pickle.py`	100%

Diff coverage: 84% · Overall: 73% · 99 lines, 16 missing


…en every alloca's SizeExpr is host-resolvable (no ExternalTensorRead leaf): take the host evaluator path on CUDA / AMDGPU / Metal / Vulkan and write the metadata buffer directly via copy_h2d / unmapped scratch fill, eliminating one kernel launch and one DtoH stride-readback per launch on LLVM GPU backends and one cmdlist submit_synced + wait_idle pair per launch on SPIR-V backends; the on-device sizer still runs unchanged for kernels whose SizeExprs reach into ndarray data (which lives in GPU-private memory and cannot be touched from the host)

…sizes, stride) asynchronously from a pinned-host scratch on CUDA / AMDGPU; the host returns immediately after queuing the three copies on the default stream and the subsequent main-kernel launch stream-orders after them, eliminating three serial host stalls per launch in the host-eval fast path. Pinned scratch is allocated lazily via cuMemAllocHost / hipHostMalloc and grown amortised-doubling; a per-launch CUDA / HIP event guards scratch reuse against in-flight DMAs

… lines to better fill the 120-col budget

…ting

…escribe the overhead in absolute terms

… - the host fold goes through SNodeRwAccessorsBank::read_int whose nested accessor-kernel launch corrupts the publish-time launcher state and produces wrong gradients on kernels that mix FieldLoad with the on-device sizer fallback

…plified-unused-x value so any cross-stack push / pop misroute fails as a 200.0 vs 0.0 mismatch instead of a 0.2 vs 0.0 'looks-like-tolerance' delta

…ta HtoDs through CUDAContext::get_instance().get_stream() instead of a hard-coded nullptr so they stream-order against the main-kernel dispatch when the user has set a non-default stream via CUDAContext::set_stream; AMDGPU keeps nullptr because AMDGPUContext::launch always uses the default stream

…rogram against a null program_impl_ in the C++-only-tests setup, mirroring the on-device branch's existing nullptr fallback to max_size_compile_time

…pl_ null-guard comment to fit the 120-col budget

duburcqa · 2026-04-28T20:45:06Z

GS_ENABLE_NDARRAY=1 QD_OFFLINE_CACHE=0 pytest -v -ra --backend gpu --dev --forked ./tests

================== 649 passed, 3 skipped, 2 xfailed in 1108.25s (0:18:28) ===============

duburcqa · 2026-04-28T20:49:37Z

Comparing Quadrants main vs this PR on Genesis main:

env	batch_size	backend	gjk_collision	constraint_solver	runtime_fps_main	runtime_fps_578	runtime_fps_delta_pct
anymal_random	30000	cuda	-	-	9236721	9279341	+0.46
anymal_uniform	30000	cuda	-	-	12290801	12341098	+0.41
anymal_uniform_kinematic	0	cpu	-	-	2054	2022	-1.56
anymal_uniform_kinematic	30000	cuda	-	-	10466141	10482269	+0.15
anymal_zero	0	cpu	-	-	7257	7148	-1.50
anymal_zero	30000	cuda	-	-	18701483	18688757	-0.07
box_pyramid_3	4096	cuda	-	-	969382	972933	+0.37
box_pyramid_4	4096	cuda	-	-	388357	389805	+0.37
box_pyramid_5	4096	cuda	-	-	141016	139732	-0.91
box_pyramid_6	4096	cuda	False	-	59335	58789	-0.92
box_pyramid_6	4096	cuda	True	-	60333	59872	-0.76
dex_hand	4096	cuda	-	-	17048	17162	+0.67
duck_in_box_easy	30000	cuda	False	-	26662843	26715336	+0.20
duck_in_box_easy	30000	cuda	True	-	9549541	9736801	+1.96
duck_in_box_hard	0	cpu	-	-	5185	5159	-0.50
duck_in_box_hard	30000	cuda	False	-	10239009	10119850	-1.16
duck_in_box_hard	30000	cuda	True	-	3405035	3454206	+1.44
franka	30000	cuda	-	-	22058180	21724297	-1.51
franka_accessors	0	cpu	-	-	1227	1217	-0.81
franka_accessors	30000	cuda	-	-	15590143	15553821	-0.23
franka_free	30000	cuda	-	-	31981995	32202953	+0.69
franka_random	0	cpu	-	-	6457	6324	-2.06
franka_random	30000	cuda	-	CG	16737391	16698827	-0.23
franka_random	30000	cuda	-	Newton	16343418	16525809	+1.12
franka_random	30000	cuda	False	-	16534089	16369501	-1.00
franka_random	30000	cuda	True	-	11429393	11448748	+0.17
g1_fall	4096	cuda	-	Newton	921076	923492	+0.26
go2	4096	cuda	False	CG	3678736	3676077	-0.07
go2	4096	cuda	False	Newton	4438961	4451375	+0.28
go2	4096	cuda	True	-	3273783	3272905	-0.03
shadow_hand_cubes	0	cpu	-	-	40	41	+2.50
shadow_hand_cubes_sparse	0	cpu	-	-	66	65	-1.52

speed_comparison_main_vs_578.csv
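The `runtime_fps_delta_pct` column appears to be the relative change of the PR's FPS against main, in percent. That reading is inferred from the numbers in the table rather than stated by the author; a short sketch that reproduces a few rows:

```python
def fps_delta_pct(fps_main, fps_pr):
    """Relative FPS change of the PR build versus main, in percent."""
    return (fps_pr - fps_main) / fps_main * 100.0


# (env, backend, runtime_fps_main, runtime_fps_578) copied from the table above.
rows = [
    ("anymal_random", "cuda", 9236721, 9279341),
    ("franka_random", "cpu", 6457, 6324),
    ("shadow_hand_cubes", "cpu", 40, 41),
]

for env, backend, fps_main, fps_pr in rows:
    print(f"{env:20s} {backend:5s} {fps_delta_pct(fps_main, fps_pr):+.2f}")
```

All deltas in the table stay within a few percent in either direction, which matches the "benchmarks appear neutral" reading for these forward-simulation workloads.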

hughperkins · 2026-04-28T20:52:45Z

checklist:

nothing to update in user facing docs
Genesis unit tests passing
Genesis benchmarks appear neutral

=> ok to merge

github-actions · 2026-04-28T21:05:56Z

Coverage Report (`8d0275803`)

File	Coverage	Missing
🟢 `tests/python/test_adstack.py`	100%

Diff coverage: 100% · Overall: 73% · 3 lines, 0 missing


(Genesis-Embodied-AI#653) * [AMDGPU] Apply device_memory_GB=0.3 cap to AMDGPU tests (Genesis-Embodied-AI#659) * [Perf] Per-launch host sync: drop wait_idle on SPIR-V, pin stream and drop stream_synchronize on CUDA/AMDGPU (Genesis-Embodied-AI#654) * [AMDGPU] Unload hipModule_t in JITModuleAMDGPU destructor (Genesis-Embodied-AI#660) * [AMDGPU] Trim default mempool on qd.reset() (Genesis-Embodied-AI#669) * [AMDGPU] Hoist rand-state buffer to process lifetime (Genesis-Embodied-AI#668) * [Streams] Use events for streams serialization on AMDGPU and CUDA (Genesis-Embodied-AI#667) * [Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id (Genesis-Embodied-AI#671) * [SPIR-V] dispatch_max_reducers: register each task with the real kernel name (Genesis-Embodied-AI#675) * [AutoDiff] Debug-mode field/grad/dual: dtype, layout, and access-time invariants (Genesis-Embodied-AI#677) * [Docs] Add user-guide page for qd.algorithms.* device-wide algorithms (Genesis-Embodied-AI#642) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [Docs] Doc for existing atomics: switch support table to per-backend columns (Genesis-Embodied-AI#657) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [GPU] Cross gpu atomics (Genesis-Embodied-AI#666) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [GPU] Make block operations portable cross-gpu (Genesis-Embodied-AI#664) * [Perf] CPU LLVM adstack-cache: skip per-launch bump-writes + ndarray_shapes capture on forward-only handles (Genesis-Embodied-AI#685) * [GPU] Cross-GPU for grid ops (Genesis-Embodied-AI#670) * [Math] Make bitop operations portable cross-gpu (Genesis-Embodied-AI#662) * [AMDGPU] Always use wave64, on both RDNA and CDNA (Genesis-Embodied-AI#687) * [AMDGPU] Use syncscope("agent") for atomix xor to avoid CAS livelock (Genesis-Embodied-AI#672) * [GPU] New bit ops for QIPC (Genesis-Embodied-AI#679) * [GPU] Subgroup ops cross-gpu (Genesis-Embodied-AI#665) * [Graph] Rename CUDA 
Graph to Graph in docs (Genesis-Embodied-AI#691) * [SPIR-V] Fix FIFO-queue ordering when sharing command queue. (Genesis-Embodied-AI#694) * [Atomics] New QIPC ops for atomics (Genesis-Embodied-AI#690) * Pass dataclass sub-structs into qd.func (Genesis-Embodied-AI#698) * [AMDGPU] HIP graph runtime support for @qd.kernel(graph=True) (Genesis-Embodied-AI#692) * [CI] Add per-file timing report to Mac Metal test job (Genesis-Embodied-AI#695) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Enable kernel disk cache during tests (Genesis-Embodied-AI#696) * [Math] New QIPC ops for single-threaded linalg (Genesis-Embodied-AI#683) * [BREAKING][GPU] New QIPC ops for subgroups (Genesis-Embodied-AI#676) * [GPU] New QIPC ops for block (Genesis-Embodied-AI#684) * [GPU] New device-level ops for QIPC (Genesis-Embodied-AI#693) * [algorithms] PrefixSumExecutor: drop unused GRID_SZ local (Genesis-Embodied-AI#701) * [block] sync(): fix unsupported-arch error message (Genesis-Embodied-AI#700) * [volatile_load] add qd.volatile_load primitive (closes Genesis-Embodied-AI#648) (Genesis-Embodied-AI#702) * [AutoDiff] Reject recycled identity_key in AdStackCache::register_adstack_sizing_info (Genesis-Embodied-AI#708) * [Vulkan] Declare GroupNonUniform SPIR-V caps and enable shaderSubgroupExtendedTypes (Genesis-Embodied-AI#707) * Fix duplicate HIP graph driver-function declarations after v1.0.0 merge The amd-integration fork had cherry-picked the HIP graph driver functions (graph_create / graph_destroy / graph_add_kernel_node / graph_instantiate / graph_exec_destroy / graph_launch), and upstream v1.0.0 added the same set. The per-file 3-way merge appended both copies into amdgpu_driver_functions.inc.h, producing redeclaration errors that broke the AMDGPU RHI/runtime compile. Drop the upstream duplicate block; the signatures are identical to the fork's existing declarations. 
Co-authored-by: Cursor <cursoragent@cursor.com> * Fix AMDGPU launcher coherence and num_instructions visibility after v1.0.0 merge - kernel_launcher.cpp: the 3-way merge spliced upstream v1.0.0's launch_llvm_kernel rewrite (ephemeral arg/context buffers, explicit-stream path, AmdgpuDefaultStream PinGuard) onto the AMD fork's kernarg-by-value + persistent-scratch design, leaving references to undefined `ephemeral_context_ptr`. Restore the fork's coherent launch_llvm_kernel verbatim; it calls the (already merged) enhanced launch_offloaded_tasks, which keeps the max-reducer dispatch and stream-parallel groups adapted onto the AMD launch path. - llvm_context.h: both the fork and upstream added `num_instructions`; the merge kept upstream's private placement, but the AMDGPU codegen force-inline heuristic calls it statically from outside the class. Move it back to the public section. Co-authored-by: Cursor <cursoragent@cursor.com> * Restore async result D2H and hoist kernarg vectors in AMDGPU launcher The v1.0.0 merge resolution regressed two amd-integration baseline optimizations in launch_llvm_kernel / launch_offloaded_tasks: - The per-launch result-buffer copy was a blocking memcpy_device_to_host, forcing a host stall on every value-returning launch and serializing the GPU pipeline. Restore the async D2H (the caller synchronizes lazily when it needs the value); external-array transfers still stream_synchronize once before reading back. - launch_task constructed the kernarg std::vectors from initializer lists ({kernarg_payload} / {kernarg_size}) on every dispatch (heap alloc + free per launch). Hoist arg_ptrs/arg_sizes out of the per-task launch and reuse. 
Co-authored-by: Cursor <cursoragent@cursor.com> * amdgpu: default to LDS permlane64 emulation; drop host-x86 barrier asm on retarget Two AMDGPU JIT-compile crashes surfaced after the v1.0.0 merge pulled in the QIPC subgroup ops (Genesis-Embodied-AI#676), which made the rigid constraint solver's wave-cooperative reductions route through `amdgpu_cross_half_shuffle_i32`. Both manifested as a SIGSEGV inside `llvm::SIInstrInfo::getInstSizeInBytes` during `JITSessionAMDGPU::compile_module_to_hsaco` (i.e. at first kernel launch), and reproduce on gfx942 / MI300X. Baseline 0.4.6 never emitted these constructs, which is why it was unaffected. 1. Native `llvm.amdgcn.permlane64` lowering crashes the bundled LLVM 22.1.0 AMDGPU backend. Default `amdgpu_permlane64` to the existing LDS-roundtrip software emulation on every target (it produces identical results). Add `QD_AMDGPU_USE_NATIVE_PERMLANE64=1` to opt back into the native instruction once the backend bug is fixed; the old `QD_AMDGPU_FORCE_PERMLANE64_FALLBACK` is now the default and still honored. This is the actual crash fix. 2. The runtime module is compiled by the host x86_64 clang and only retargeted to amdgcn here, so `amdgpu_cross_half_shuffle_i32`'s `__asm__ volatile("" : "+v"(byte))` optimization barrier carries x86 flag clobbers (`~{dirflag},~{fpsr},~{flags}`) that are meaningless on AMDGPU. The IR verifies but the empty-body INLINEASM is invalid on the amdgcn target. Neutralize empty-body barrier asm during retarget (forward the tied value, then erase) so no stale host asm reaches codegen. On the wave64 targets we ship `ds_bpermute` already addresses the full wave, so the hint is a no-op. Co-authored-by: Cursor <cursoragent@cursor.com> * style: apply clang-format (v19.1.7) to AMDGPU fn_attrs and launcher sources CI pre-commit's clang-format hook reformatted these files (long declarations/lambda signatures collapsed onto single lines per the repo's clang-format config). Apply the same formatting so the hook passes. 
No functional changes. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(amdgpu): use CreateNeg for branchless i32 sgn instead of CreateSub(0, input) clang-tidy (modernize-use-nullptr, -warnings-as-errors) flagged `builder->CreateSub(0, input)` in the i32 sgn path: the literal `0` binds to the `llvm::Value*` LHS parameter as a null pointer, not an integer zero. Replace with `builder->CreateNeg(input)`, which emits `0 - input` with a proper zero constant -- identical intended semantics, and clang-tidy clean. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Robert Dazi <14996868+v01dXYZ@users.noreply.github.com> Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> Co-authored-by: Alexis DUBURCQ <alexis.duburcq@gmail.com> Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com> Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Johnny <johnnynuca14@gmail.com>
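The `CreateSub(0, input)` fix described in the commit message above comes down to a plain C++ rule: an integer literal `0` is a null pointer constant, so it converts silently to any pointer parameter. The sketch below illustrates that rule using hypothetical stand-ins (`Value`, `lhs_is_null`, `negate`); it does not use the LLVM API and is not the code from the commit.

```cpp
#include <cassert>

// Hypothetical stand-in for an IR value type (NOT llvm::Value).
struct Value {
  int payload;
};

// Hypothetical stand-in for a builder method that takes two operand pointers.
// Reports whether the left-hand operand arrived as a null pointer.
bool lhs_is_null(const Value *lhs, const Value *rhs) {
  (void)rhs;  // only the left operand matters for this illustration
  return lhs == nullptr;
}

// Hypothetical negate helper: it builds its own zero operand, so the caller
// never has to spell a zero that could be mistaken for a pointer.
int negate(const Value *input) {
  const Value zero{0};
  return zero.payload - input->payload;
}

// The literal 0 is a null pointer constant, so this call compiles and passes
// a null left operand instead of an integer zero.
bool literal_zero_becomes_null() {
  Value input{7};
  return lhs_is_null(0, &input);
}

// Small self-check tying the two behaviours together.
void self_check() {
  assert(literal_zero_becomes_null());
  Value input{7};
  assert(negate(&input) == -7);
}
```

Calling `self_check()` from a `main` of your own exercises both paths. The design point is the same one the commit message makes: a dedicated negate helper removes the ambiguous literal entirely rather than relying on the caller to construct a proper zero constant.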

duburcqa marked this pull request as draft April 27, 2026 16:45

duburcqa changed the title ~~[AutoDiff] Skip GPU adstack-sizer dispatch when SizeExprs are host-resolvable: drops the per-launch DtoH stall on CUDA / AMDGPU and the wait_idle pair on Metal / Vulkan~~ [AutoDiff] Speed up reverse-mode kernel launches on GPU backends Apr 27, 2026

claude Bot reviewed Apr 27, 2026

duburcqa added a commit that referenced this pull request Apr 27, 2026

[Integration] PR #578: adstack-sizer host-eval fast path on CUDA / AM…

cb90671

…DGPU / Metal / Vulkan plus async pinned HtoD for LLVM adstack metadata

duburcqa added a commit that referenced this pull request Apr 27, 2026

[Integration] PR #578: adstack-sizer host-eval fast path on CUDA / AM…

acdbd22

…DGPU / Metal / Vulkan plus async pinned HtoD for LLVM adstack metadata

duburcqa added a commit that referenced this pull request Apr 27, 2026

[Integration] PR #578: adstack-sizer host-eval fast path on CUDA / AM…

36b6e9e

…DGPU / Metal / Vulkan plus async pinned HtoD for LLVM adstack metadata

duburcqa added a commit that referenced this pull request Apr 27, 2026

[Integration] PR #578: adstack-sizer host-eval fast path on CUDA / AM…

65bd504

…DGPU / Metal / Vulkan plus async pinned HtoD for LLVM adstack metadata

duburcqa added a commit that referenced this pull request Apr 27, 2026

[Integration] PR #578: adstack-sizer host-eval fast path on CUDA / AM…

aebbaa6

…DGPU / Metal / Vulkan plus async pinned HtoD for LLVM adstack metadata

duburcqa added a commit that referenced this pull request Apr 27, 2026

[Integration] PR #578: adstack-sizer host-eval fast path on CUDA / AM…

6e7c96a

…DGPU / Metal / Vulkan plus async pinned HtoD for LLVM adstack metadata

duburcqa added a commit that referenced this pull request Apr 28, 2026

[Integration] PR #578: adstack-sizer host-eval fast path on CUDA / AM…

2747c3e

…DGPU / Metal / Vulkan plus async pinned HtoD for LLVM adstack metadata

duburcqa added a commit that referenced this pull request Apr 28, 2026

[Integration] PR #578: adstack-sizer host-eval fast path on CUDA / AM…

6d46ae9

…DGPU / Metal / Vulkan plus async pinned HtoD for LLVM adstack metadata

duburcqa added a commit that referenced this pull request Apr 28, 2026

[Integration] PR #578: adstack-sizer host-eval fast path on CUDA / AM…

34c6821

…DGPU / Metal / Vulkan plus async pinned HtoD for LLVM adstack metadata

duburcqa added a commit that referenced this pull request Apr 28, 2026

[Integration] PR #578: adstack-sizer host-eval fast path on CUDA / AM…

e0e9ef4

…DGPU / Metal / Vulkan plus async pinned HtoD for LLVM adstack metadata

duburcqa added a commit that referenced this pull request Apr 28, 2026

[Integration] PR #578: adstack-sizer host-eval fast path on CUDA / AM…

94fa236

…DGPU / Metal / Vulkan plus async pinned HtoD for LLVM adstack metadata

duburcqa force-pushed the duburcqa/adstack_sizer_host_eval branch from b1414d0 to 4a9d231 Compare April 28, 2026 15:02

duburcqa marked this pull request as ready for review April 28, 2026 15:05

chatgpt-codex-connector Bot reviewed Apr 28, 2026

Comment thread quadrants/runtime/llvm/llvm_runtime_executor.cpp Outdated

duburcqa force-pushed the duburcqa/adstack_sizer_host_eval branch from 32b803f to 037c453 Compare April 28, 2026 17:33

duburcqa added 4 commits April 28, 2026 21:47

[Lang] adstack-sizer host-eval comments: extend short-wrapped comment…

188507d

… lines to better fill the 120-col budget

[Lang] llvm_runtime_executor.h: tighten pinned-scratch comment format…

6490879

…ting

duburcqa added 6 commits April 28, 2026 21:47

[Lang] adstack-sizer host-eval comments: drop 'regression' framing, d…

f16a511

…escribe the overhead in absolute terms

[Test] Parametrize the FieldLoad+ExternalTensorRead fusion test on am…

0390175

…plified-unused-x value so any cross-stack push / pop misroute fails as a 200.0 vs 0.0 mismatch instead of a 0.2 vs 0.0 'looks-like-tolerance' delta

[Lang] adstack-sizer host-eval LLVM fast path: guard program_impl_->p…

cb80478

…rogram against a null program_impl_ in the C++-only-tests setup, mirroring the on-device branch's existing nullptr fallback to max_size_compile_time

[Lang] adstack-sizer host-eval LLVM fast path: tighten the program_im…

8d02758

…pl_ null-guard comment to fit the 120-col budget

duburcqa force-pushed the duburcqa/adstack_sizer_host_eval branch from 037c453 to 8d02758 Compare April 28, 2026 19:49

duburcqa merged commit 90e605c into main Apr 28, 2026
75 of 76 checks passed

duburcqa deleted the duburcqa/adstack_sizer_host_eval branch April 28, 2026 21:13

Conversation

duburcqa commented Apr 27, 2026 • edited

Speed up reverse-mode kernel launches on GPU backends via an adstack-sizer host-eval fast path

TL;DR

Why

Mechanism

LLVM path (LlvmRuntimeExecutor::publish_adstack_metadata)

LLVM async pinned-host metadata HtoD

SPIR-V path (GfxRuntime::publish_adstack_metadata_spirv)

Per-backend coverage matrix

Tests

Codex / Claude bot review fixes

Side-effect audit

claude Bot left a comment

Overview

Security risks

Level of scrutiny

Other factors

github-actions Bot commented Apr 27, 2026

Coverage Report (4f0688fca)

hughperkins commented Apr 28, 2026

duburcqa commented Apr 28, 2026

chatgpt-codex-connector Bot left a comment

💡 Codex Review

duburcqa commented Apr 28, 2026 • edited

github-actions Bot commented Apr 28, 2026

Coverage Report (32b803f0c)

github-actions Bot commented Apr 28, 2026

Coverage Report (037c4530f)

duburcqa commented Apr 28, 2026 • edited

duburcqa commented Apr 28, 2026

hughperkins commented Apr 28, 2026

github-actions Bot commented Apr 28, 2026

Coverage Report (8d0275803)

