Skip to content

[GPU] Cross-GPU for grid ops#670

Merged
hughperkins merged 5 commits into
mainfrom
hp/cross-gpu-grid
May 11, 2026
Merged

[GPU] Cross-GPU for grid ops#670
hughperkins merged 5 commits into
mainfrom
hp/cross-gpu-grid

Conversation

@hughperkins

@hughperkins hughperkins commented May 9, 2026

Copy link
Copy Markdown
Collaborator

Independent grid-side rollout (off main, no longer stacked on PR #664). Adds full cross-GPU support for qd.simt.grid.mem_fence() and renames qd.simt.grid.memfence() -> qd.simt.grid.mem_fence() with a deprecated alias for back-compat.

What this adds

User-facing surface:

  • New canonical name qd.simt.grid.mem_fence() works on CUDA, AMDGPU, Vulkan, and Metal (with an atomic-only ordering caveat on Metal / Vulkan-on-macOS, documented in grid.md).
  • The old qd.simt.grid.memfence() becomes a deprecated alias that emits DeprecationWarning on first use and forwards to mem_fence().

Per-backend lowering:

  • CUDA: nvvm_membar_gl (= __threadfence()).
  • AMDGPU: LLVM IR fence acquire_release syncscope("agent"), installed via a patch_fence body-replacement helper in llvm_context.cpp (defined inline in this PR). We can't compile this from runtime.cpp because that runtime is built with host-targeted clang and only retargeted to amdgcn here.
  • Vulkan / Metal: SPIR-V OpMemoryBarrier(ScopeDevice, UniformMemory | WorkgroupMemory | AcquireRelease). MoltenVK translates this to MSL atomic_thread_fence(metal::memory_scope_device).

C++ ops:

  • internal_ops.inc.h + type_system.cpp: register grid_mem_fence (CUDA / AMDGPU dispatch) and gridMemoryBarrier (SPIR-V dispatch). Drop the now-aliased grid_memfence registration.
  • runtime.cpp + locked_task.h: rename grid_memfence -> grid_mem_fence everywhere; rename system_memfence -> system_mem_fence along with it for naming consistency.
  • spirv_codegen.cpp: add the gridMemoryBarrier case.

Tests:

  • test_grid_mem_fence -- broadened to arch=qd.gpu, with exclude=qd.metal and a runtime pytest.skip for vulkan on macOS. Replaces the old CUDA-only test_grid_memfence.
  • test_grid_memfence_deprecated_alias -- verifies the deprecation warning fires.

Docs:

  • grid.md -- rewritten to reflect cross-GPU portability: new support table, per-backend lowering section, deprecated-alias section, updated code examples + spin-wait gotcha.
  • atomics.md -- the device-scope-fence portability claim is refreshed to "portable across all four backends" with the Metal caveat link; grid.memfence() references switch to grid.mem_fence(). The "Backend caveat for the fence-pair pattern" paragraph is reduced to the still-extant block.mem_sync() AMDGPU gap (block-side AMDGPU lowering lands in PR [GPU] Make block operations portable cross-gpu #664).
  • parallelization.md -- grid.memfence() -> grid.mem_fence() rename in the cross-block-coordination paragraph. block.md itself is intentionally not touched here (the dedup of ## Grid-scope fence lives in PR [GPU] Make block operations portable cross-gpu #664).

Test plan

  • CUDA: pytest tests/python/test_simt.py -k grid on rtx-high.
  • AMDGPU: same, on amdcloud.
  • Vulkan (Linux): same, on rtx-high with Vulkan build.
  • Metal / Vulkan-on-macOS: rely on Mac CI -- test_grid_mem_fence is excluded on Metal and runtime-skipped on Vulkan-on-macOS, so the deprecated-alias test is what runs there.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 475d6cd876

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

The fence does not require thread convergence, which is why it appears inside `if tid == 0` without deadlocking — `qd.simt.block.sync()` would deadlock there; `grid.memfence()` is safe.
The fence does not require thread convergence, which is why it appears inside `if tid == 0` without deadlocking — `qd.simt.block.sync()` would deadlock there; `grid.mem_fence()` is safe.

> **Metal / Vulkan-on-macOS portability:** the example as written above relies on the producer's plain (non-atomic) store `flags[bid] = STATE_AGGREGATE` becoming visible to other workgroups once `grid.mem_fence()` retires. CUDA, AMDGPU, and native Vulkan honor this strictly; Metal (and therefore Vulkan-on-macOS) does **not** — `atomic_thread_fence(memory_scope_device)` only orders atomic accesses across the device. To make this idiom Metal-portable, publish through an atomic store (`qd.atomic_or(flags[bid], STATE_AGGREGATE)`) or split the producer and consumer phases into separate kernel launches.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Correct the Metal-portable publication advice

For the Metal / Vulkan-on-macOS path described here, making only flags[bid] atomic does not make this example portable: the payload the reader consumes is still the plain store to partials[bid], and this same note says Metal's device fence only orders atomic accesses. Users following the suggested qd.atomic_or(flags[bid], STATE_AGGREGATE) replacement can still observe the atomic flag and then read a stale partials[bid]; the guidance should require publishing the payload itself through atomics or splitting the phases into separate kernel launches.

Useful? React with 👍 / 👎.

hughperkins added a commit that referenced this pull request May 9, 2026
This PR was meant to be block-primitive cross-GPU work. It accidentally
also took on `qd.simt.grid.memfence()` -> `qd.simt.grid.mem_fence()`
rollout because the original block.md (on main) bundled the two together
in a single page. The grid work needs its own review and its own
backend coverage notes (Metal / Vulkan-on-macOS atomic-only ordering
caveat, AMDGPU agent-scope fence), so it is now stacked separately on
hp/cross-gpu-grid.

What this commit reverts on this branch:

- Python: `grid.py` back to main (no `mem_fence()`, no deprecated
  `memfence()` alias).
- C++ ops: drop `grid_mem_fence` / `gridMemoryBarrier` registrations
  in `internal_ops.inc.h` + `type_system.cpp`. Restore main's
  `grid_memfence` spelling everywhere (`runtime.cpp`, `locked_task.h`,
  `llvm_context.cpp`).
- LLVM: drop the AMDGPU `grid_mem_fence` patch_fence("agent") hook;
  trim its rationale comment to mention only the workgroup-scope fence
  (the block-side one stays).
- SPIR-V: drop the `gridMemoryBarrier` -> `OpMemoryBarrier(ScopeDevice,
  ...)` case in `spirv_codegen.cpp`. (The `globalInvocationId` case
  for `block.global_thread_idx()` and the Metal atomic scope/semantics
  fix for `sync_*_nonzero` emulation stay -- those are block-side.)
- Tests: drop `test_grid_mem_fence` and the deprecated-alias test
  `test_grid_memfence_deprecated_alias`. Restore main's
  `test_grid_memfence` (CUDA-only smoke test) verbatim.
- Docs:
  * `block.md` -- restore the `## Grid-scope fence: qd.simt.grid.memfence()`
    section from main and the original lead/barrier-vs-fence references
    to it. Block-side improvements (the support table, sync_*_nonzero
    portability, mem_sync -> mem_fence rename, divergent control-flow
    fix on CUDA, thread_idx, global_thread_idx) all stay.
  * `atomics.md` -- restore main's grid.memfence portability claims;
    only the block-side rename `block.mem_sync() -> block.mem_fence()`
    is kept.
  * `parallelization.md` -- only the block-side rename
    `block.mem_sync -> block.mem_fence` is kept; the device-scope
    fence reference remains `qd.simt.grid.memfence()` and links into
    the in-page `block.md#grid-scope-fence` anchor that this commit
    restored.

The grid-side counterpart of this work lives on `hp/cross-gpu-grid`
(PR #670), stacked on top of this branch.
@hughperkins hughperkins force-pushed the hp/cross-gpu-grid branch from 475d6cd to 7ef37eb Compare May 9, 2026 15:52
@hughperkins hughperkins changed the title [Docs] Update grid.md for cross-gpu grid.mem_fence() [GPU] Make grid.mem_fence() portable cross-gpu May 9, 2026
hughperkins added a commit that referenced this pull request May 9, 2026
block.md was duplicating the grid-scope fence info that already lives
on its own page (`grid.md`). This commit removes the dedicated
`## Grid-scope fence` section, drops the `grid.memfence()` row from
the support table (it is not a block primitive), and replaces the
in-page references with neutral pointers to `grid.md` -- without
asserting anything about cross-GPU portability of the device-scope
fence, since that work is the subject of a separate PR
(`hp/cross-gpu-grid`, #670).

Also fixes the broken anchor `block.md#grid-scope-fence-qdsimtgridmemfence`
in parallelization.md by re-pointing it at `grid.md`.

This keeps PR #664 self-coherent: every doc reference to the
device-scope fence on this branch goes through `grid.md`, and
`grid.md` itself is unchanged from main on this branch (it gets the
cross-GPU rewrite on `hp/cross-gpu-grid`).
@hughperkins hughperkins changed the title [GPU] Make grid.mem_fence() portable cross-gpu [GPU] Cross-GPU for grid ops May 9, 2026
@hughperkins hughperkins force-pushed the hp/cross-gpu-grid branch 2 times, most recently from fcf4e4b to 88fdd9e Compare May 9, 2026 16:01
@hughperkins hughperkins changed the base branch from hp/cross-gpu-block to main May 9, 2026 16:02
@hughperkins hughperkins force-pushed the hp/cross-gpu-grid branch 2 times, most recently from c132384 to 3d8bec8 Compare May 9, 2026 16:18
Independent grid-side rollout (off main, no longer stacked on the
block-primitives PR #664). Adds full cross-GPU support for the
device-scope memory fence and renames `qd.simt.grid.memfence()` ->
`qd.simt.grid.mem_fence()` with a deprecated alias for back-compat.

User-facing surface:

- New canonical name `qd.simt.grid.mem_fence()` works on **CUDA,
  AMDGPU, Vulkan, and Metal** (with an atomic-only ordering caveat
  on Metal / Vulkan-on-macOS, documented in `grid.md`).
- The old `qd.simt.grid.memfence()` becomes a deprecated alias that
  emits `DeprecationWarning` on first use and forwards to
  `mem_fence()`.

Per-backend lowering:

- **CUDA**: `nvvm_membar_gl` (= `__threadfence()`).
- **AMDGPU**: LLVM IR `fence acquire_release syncscope("agent")`,
  installed via a `patch_fence` body-replacement helper in
  `llvm_context.cpp`. We can't compile this from `runtime.cpp`
  because that runtime is built with host-targeted clang and only
  retargeted to amdgcn here.
- **Vulkan / Metal**: SPIR-V `OpMemoryBarrier(ScopeDevice,
  UniformMemory | WorkgroupMemory | AcquireRelease)`. MoltenVK
  translates this to MSL
  `atomic_thread_fence(metal::memory_scope_device)`.

C++ ops:

- `internal_ops.inc.h` + `type_system.cpp`: register
  `grid_mem_fence` (CUDA / AMDGPU dispatch) and `gridMemoryBarrier`
  (SPIR-V dispatch). Drop the now-aliased `grid_memfence` registration.
- `runtime.cpp` + `locked_task.h`: rename `grid_memfence` ->
  `grid_mem_fence` everywhere; rename `system_memfence` ->
  `system_mem_fence` along with it for naming consistency.
- `spirv_codegen.cpp`: add the `gridMemoryBarrier` case.

Tests:

- `test_grid_mem_fence` -- broadened to `arch=qd.gpu`, with
  `exclude=qd.metal` and a runtime `pytest.skip` for `vulkan` on
  macOS (MoltenVK -> Metal inherits the atomic-only caveat). Replaces
  the old CUDA-only `test_grid_memfence`.
- `test_grid_memfence_deprecated_alias` -- verifies the
  deprecation warning fires.

Docs:

- `grid.md` -- rewritten to reflect cross-GPU portability: new
  support table, per-backend lowering section, deprecated-alias
  section, updated code examples + spin-wait gotcha.
- `atomics.md` -- the device-scope-fence portability claim is
  refreshed to "portable across all four backends" with the Metal
  caveat link; `grid.memfence()` references switch to
  `grid.mem_fence()`. The "Backend caveat for the fence-pair
  pattern" paragraph is reduced to the still-extant
  `block.mem_sync()` AMDGPU gap (block-side AMDGPU lowering lands
  in PR #664).
- `parallelization.md` -- `grid.memfence()` -> `grid.mem_fence()`
  rename in the cross-block-coordination paragraph. `block.md`
  itself is intentionally not touched here (the dedup of
  `## Grid-scope fence` lives in PR #664).
@hughperkins hughperkins force-pushed the hp/cross-gpu-grid branch from 3d8bec8 to 662b245 Compare May 9, 2026 16:18

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this file seems ok

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this file seems ok

@github-actions

github-actions Bot commented May 9, 2026

Copy link
Copy Markdown

@github-actions

github-actions Bot commented May 9, 2026

Copy link
Copy Markdown

@github-actions

github-actions Bot commented May 9, 2026

Copy link
Copy Markdown

@github-actions

github-actions Bot commented May 9, 2026

Copy link
Copy Markdown

The CI line-wrap check flagged this comment block as being wrapped
conservatively at ~95-98 chars instead of the project's 120c target.
Reflow to the full 120c budget (now 6 lines @ 104-118c, was 7 lines
@ 89-99c).
@github-actions

github-actions Bot commented May 9, 2026

Copy link
Copy Markdown

@github-actions

github-actions Bot commented May 9, 2026

Copy link
Copy Markdown

Conflicts came from PR #664 (block-primitives cross-gpu) landing on main
with overlapping changes to the same memfence machinery this branch
touches for grid scope. Both branches renamed *_memfence -> *_mem_fence
on disjoint scopes:
  - main:   block_memfence  -> block_mem_fence
  - branch: grid_memfence   -> grid_mem_fence
            system_memfence -> system_mem_fence

Resolution = union of both renames. Canonical names after merge are
`block_mem_fence` / `grid_mem_fence` / `system_mem_fence` (runtime.cpp
already auto-merged onto these). Per-file resolution:

  - docs/atomics.md, parallelization.md: keep both renames in user-facing prose.
  - docs/algorithms.md: update stale `grid.memfence()` -> `grid.mem_fence()`.
  - inc/internal_ops.inc.h, ir/type_system.cpp: register both ops.
  - runtime/llvm/llvm_context.cpp: CUDA patch_intrinsic for all three;
    AMDGPU patch_fence for both `block_mem_fence` ("workgroup") and
    `grid_mem_fence` ("agent"); merged comment block describes both.
  - tests/python/test_simt.py: keep both new test sets (grid alias test
    from branch + block.mem_fence tests from main).
@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

Comment thread tests/python/test_simt.py Outdated
# (Linux / Windows) honors it via `OpMemoryBarrier(ScopeDevice, ...)`. Metal does NOT, even on Apple Silicon: MSL
# `atomic_thread_fence(memory_scope_device)` -- which is what MoltenVK / SPIRV-Cross translate
# `OpMemoryBarrier(ScopeDevice, ...)` to -- only orders *atomic* memory accesses across the device, not plain stores;
# this is a documented Metal limitation called out in the `grid.mem_fence()` Metal caveat in `block.md`. So we exclude

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this should say grid.md instead of block.md -- the Metal caveat is documented in grid.md (which this PR rewrites), not block.md.

@alanray-tech alanray-tech left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once the nit above is addressed (block.md -> grid.md in the test comment).

The comment block on test_grid_mem_fence pointed to the `grid.mem_fence()`
Metal caveat as being in `block.md`, but it lives in `grid.md`. Likely a
leftover from when this branch was stacked on the block-primitives PR
before being unstacked.
@hughperkins hughperkins merged commit 284a19b into main May 11, 2026
25 checks passed
@hughperkins hughperkins deleted the hp/cross-gpu-grid branch May 11, 2026 17:04
@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

npoulad1 added a commit to ROCm/quadrants that referenced this pull request Jun 8, 2026
* [Misc] Warn user to disable caching when print_ir/QD_DUMP_IR enabled (Genesis-Embodied-AI#425)

Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>

* [Build] Pin torch version to CUDA 12.8 for CUDA tests (Genesis-Embodied-AI#428)

* [Misc] Fixing up taichi-dev urls (Genesis-Embodied-AI#429)

* [Perf] Rename cuda_graph to gpu_graph across the codebase (Genesis-Embodied-AI#430)

* Misc: fix typo integeral -> integral (Genesis-Embodied-AI#434)

Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>

* [Perf] CUDA graph 4: call from multiple locations (Genesis-Embodied-AI#420)

* [Bug] Fix fastcache not restoring graph_do_while_arg (Genesis-Embodied-AI#435)

* [Perf] Cache last-call result in perf_dispatch for single-compatible case (Genesis-Embodied-AI#438)

* Fix gpu_graph fallback on old Nvidia GPU. (Genesis-Embodied-AI#443)

* Fix shared memory offset not reset between CUDA kernels. (Genesis-Embodied-AI#442)

* [Misc] Allow disabling GPU graph via QD_GPU_GRAPH=0 env var (Genesis-Embodied-AI#439)

* [Misc] Add named top-level loops (Genesis-Embodied-AI#440)

* [Misc] Rename gpu_graph to graph (Genesis-Embodied-AI#446)

* [Misc] Add cross-platform shuffle (Genesis-Embodied-AI#447)

* [Bug] Fix graph_do_while on Windows: search for cudadevrt.lib (Genesis-Embodied-AI#456)

* [Bug] Also search default CUDA toolkit install location on Windows (Genesis-Embodied-AI#461)

* [SPIRV] Feature Parity Atomics & Shared Array (Genesis-Embodied-AI#432)

* [Misc] Change clang format to 120 characters (Genesis-Embodied-AI#463)

* [Misc] CUDA graph 5 Add fatbin (Genesis-Embodied-AI#464)

* [Bug] Reuse VkInstance across init/reset cycles (Genesis-Embodied-AI#465)

* [Perf] Tiles 1: _load, _store, _eye_ (Genesis-Embodied-AI#466)

* [Misc] Remove dead InternalFuncStmt type_check override (Genesis-Embodied-AI#471)

* [Perf] Tiles 2: add cholesky and ger (Genesis-Embodied-AI#472)

* [Perf] Tiles 2b: add triangular solve (Genesis-Embodied-AI#474)

* [Misc] Refactor: use _get_col/_set_col in tiles load/store/init (Genesis-Embodied-AI#475)

* [Build] Fix flaky test_clock_accuracy (Genesis-Embodied-AI#436)

* Fix AARCH64 emitting invalid asm in CUDA kernels. (Genesis-Embodied-AI#473)

Co-authored-by: Hugh Perkins <hughperkins@gmail.com>

* [AMDGPU] Enable HIP memory pool and surface pool-exhaustion errors. (Genesis-Embodied-AI#485)

* [AMDGPU] Scope hsaco tmp dir per-user to avoid collisions. (Genesis-Embodied-AI#484)

* [Perf] Tiles 3: Add slice syntax, qd.outer() and initial doc (Genesis-Embodied-AI#477)

* [AMDGPU] Fix gradient computation. (Genesis-Embodied-AI#486)

* Enable all backends that are supported in unit tests. (Genesis-Embodied-AI#488)

* Fix SPIRV ID overflow for large kernels due to autodiff. (Genesis-Embodied-AI#489)

* [Misc] Fix purity checker to allow accessing constants from quadrants modules (Genesis-Embodied-AI#487)

* [Misc] Increase tolerance for clock monotonic test (Genesis-Embodied-AI#492)

* [CI] Serialize api doc workflow (Genesis-Embodied-AI#494)

* [CI] Increase tolerance for clock test (Genesis-Embodied-AI#506)

* [CI] Increase clock test tolerance to 20% (Genesis-Embodied-AI#509)

* [Perf] Add tensor_type parametrization to tile16 tests (Genesis-Embodied-AI#504)

* [Perf] Tiles 4b: Migrate tiles16 tests to enable fastcache (Genesis-Embodied-AI#505)

* [Perf] Tiles 4c: add Tiles16x16 proxy (Genesis-Embodied-AI#507)

* [Perf] Tiles 4d: Consolidate slice error tests using parametrize (Genesis-Embodied-AI#508)

* [Perf] Tiles 4: add SharedArray slice support (Genesis-Embodied-AI#482)

* [Perf] Tiles 5: add Cholesky benchmark demo (Genesis-Embodied-AI#483)

* [Doc] Add user guide page for subgroup shuffle (Genesis-Embodied-AI#512)

* [Perf] Implement cross-platform shuffle_down (Genesis-Embodied-AI#510)

* [Perf] Add portable subgroup reduce_add and reduce_all_add (Genesis-Embodied-AI#511)

* [Perf] Add first warmup config to perf dispatch (Genesis-Embodied-AI#422)

* [AutoDiff] Autodiff 1: Add baseline adstack regression test for unary_collections (Genesis-Embodied-AI#500)

* [AutoDiff] Autodiff 2: Implement derivative for tan (Genesis-Embodied-AI#501)

* [AutoDiff] Autodiff 3: Recompute tanh/exp on the operand in the reverse pass (Genesis-Embodied-AI#502)

* [AutoDiff] Autodiff 4: Mark rsqrt as non-linear for adstack promotion (Genesis-Embodied-AI#503)

* [AutoDiff] Autodiff 5: Fix adjoint-alloca placement for GlobalLoads outside the current range-for (Genesis-Embodied-AI#496)

* [AutoDiff] Autodiff 6: Adstack regression tests (Genesis-Embodied-AI#491)

* [AutoDiff] Autodiff 7: Fix header size in AdStackAllocaStmt to match u64 runtime layout (Genesis-Embodied-AI#534)

* [AutoDiff] Autodiff 8: Surface LLVM adstack push/pop overflow as a Python exception (Genesis-Embodied-AI#535)

* [AutoDiff] Autodiff 9: Guard against LLVM worker-thread stack overflow from large per-task adstack budget (Genesis-Embodied-AI#495)

* [AutoDiff] Autodiff 10: Implement adstack for SPIR-V (Genesis-Embodied-AI#490)

* [AutoDiff] Autodiff 11: Latent adstack-adjacent fixes (AMDGPU hipFree, flush() keeps ctx_buffers_, always-preallocate) (Genesis-Embodied-AI#536)

* [Doc] Add AGENTS.md with instructions for AI agents (Genesis-Embodied-AI#541)

* [Bug] Abort kernel execution on assertion failure instead of segfaulting (Genesis-Embodied-AI#419)

* [Type] ndarray typing 1: Add eval_str=True to inspect.signature() calls (Genesis-Embodied-AI#411)

* [CI] Suppress reportPrivateImportUsage in torch-using files (Genesis-Embodied-AI#552)

* [Misc] QD_DUMP_IR dumps to files with the task_id added to the filename (Genesis-Embodied-AI#441)

* [Type] ndarray typing 2: Fix NDArray single-arg subscript crash (Genesis-Embodied-AI#412)

* [Test] Flush xdist channel before worker exit so test failure reports are visible (Genesis-Embodied-AI#555)

* [CI] Reduce test retries on CI from 3 to 1. (Genesis-Embodied-AI#554)

* [AutoDiff] Autodiff 12: Heap-backed adstack on LLVM backends (CPU/CUDA/AMDGPU) (Genesis-Embodied-AI#537)

* [AutoDiff] Autodiff 13: Heap-backed adstack on SPIR-V backends (Metal, Vulkan) (Genesis-Embodied-AI#493)

* [AutoDiff] Autodiff 14: Resolve bounded-inner-loop adstacks without default_ad_stack_size fallback (Genesis-Embodied-AI#539)

* [SPIRV] Vulkan SPIR-V correctness: atomic-view aliasing, PSB stride, narrow storage caps, u1 cast, per-init layer recheck (Genesis-Embodied-AI#513)

* [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch and sanitise MoltenVK cap advertisement (Genesis-Embodied-AI#551)

* [Test] Suppress stock pytest-timeout to avoid conflict with pytest_hardtle (Genesis-Embodied-AI#557)

* [Vulkan] Use SDK validation layer for debugPrintf instead of apt package (Genesis-Embodied-AI#562)

* [Test] Fix flaky perf_dispatch tests by increasing work amounts (Genesis-Embodied-AI#559)

* [Test] Add --maxfail CLI option to run_tests.py (default 20) (Genesis-Embodied-AI#558)

* [CI] Vulkan debug printf fix to address flaky tests (Genesis-Embodied-AI#563)

* [Docs] Add a new page to help for first time contributors (Genesis-Embodied-AI#426)

Authored-by: v01dxyz <v01dxyz@v01d.xyz>

* [AutoDiff] Autodiff 16: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr (Genesis-Embodied-AI#543)

* Fix: raise error if device memory allocation fails (Genesis-Embodied-AI#451) (Genesis-Embodied-AI#453)

Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>

* [CI] Add CI job to check line wrapping of comments and docs (Genesis-Embodied-AI#564)

* [Misc] Add coverage report to PRs, including kernels (Genesis-Embodied-AI#470)

* [CI] CI wrap check feeds only diffs to agent (Genesis-Embodied-AI#567)

* Skip 'flaky' test on MacOS CI. (Genesis-Embodied-AI#573)

* [Test] Fix missing `import sys` in test_fail_device_memory_allocation (Genesis-Embodied-AI#574)

* [CI] Fix Vulkan debugPrintf flake with session-scoped warmup (Genesis-Embodied-AI#571)

* [AutoDiff] determine_ad_stack_size: replace whole-CFG Bellman-Ford with SCC + DAG DP (Genesis-Embodied-AI#575)

* [Test] Fix macOS OOM skip reason to describe actual root cause (Genesis-Embodied-AI#576)

* [Lang] whole_kernel_cse: 2.5x compile time speedup on large kernels (Genesis-Embodied-AI#577)

* [CI] Add CI check for unnecessarily deleted comments (Genesis-Embodied-AI#570)

* [CI] Migrate coverage report to github Check page (Genesis-Embodied-AI#566)

* [Lang] Skip IR verifier between passes unless debug=true (Genesis-Embodied-AI#579)

* [Lang] Inline AdStack ops on release LLVM codegen: dramatically reduces compile time for adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#584)

* [CUDA] Honor offline_cache=False end-to-end so QD_OFFLINE_CACHE=0 actually gives a cold compile (Genesis-Embodied-AI#580)

* [Type] Tensor 24 (Genesis-Embodied-AI#561)

Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>

* [Lang] auto_diff host-walk reductions: dramatically faster front-end compile time on adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#587)

* [AutoDiff] Speed up reverse-mode kernel launches on GPU backends (Genesis-Embodied-AI#578)

* [Vulkan] Move adstack-sizer scratch out of Function-scope memory to fix SPIR-V pipeline build failures (Genesis-Embodied-AI#588)

* [AutoDiff] Improve diagnosis of unsupported reverse-mode AD patterns (Genesis-Embodied-AI#590)

* [Bug] Fix: promote Ndarray to AnyArray in build_Name for flattened struct fields (Genesis-Embodied-AI#592)

* [SPIR-V] Shrink reverse-grad kernel MSL by ~50% (Genesis-Embodied-AI#591)

* [CI] Add CI check that PR changes have test coverage (Genesis-Embodied-AI#596)

* [Perf] Enable zero-copy in to_torch() and to_numpy() (Genesis-Embodied-AI#450)

* Add BufferView: safe sub-range ndarray access for kernels (Genesis-Embodied-AI#585)

Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>

* [Doc] Add user-facing fastcache documentation (Genesis-Embodied-AI#597)

Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>

* [Misc] Upgrade to enable v1 dlpack so to_numpy(copy=False) writable (Genesis-Embodied-AI#598)

Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local>

* [AutoDiff] Cut reverse-mode adstack memory usage 10x on all backends (Genesis-Embodied-AI#599)

* [Misc] Add CI check for feature file factorization (Genesis-Embodied-AI#606)

* [Perf] Skip _recursive_set_args for all-Field frozen dataclass structs (Genesis-Embodied-AI#607)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [AutoDiff] SNode-arm bound-expr capture rejects fold-attack gate indices (Genesis-Embodied-AI#610)

* [Misc] Suppress field fastcache warning for qd.Tensor (Genesis-Embodied-AI#615)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [AutoDiff] Adstack heap: clip reducer count by per-task loop trip count (compile-time and SizeExpr-evaluated) (Genesis-Embodied-AI#611)

* [Misc] Forward copy= through qd.Tensor, add copy=None option (Genesis-Embodied-AI#616)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Doc] Update README (Genesis-Embodied-AI#617)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Fix coverage report showing def lines as uncovered (Genesis-Embodied-AI#623)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Perf] Generic launcher: persistent context, JIT-pointer reuse, Metal compute encoder, LLVM-GPU async memory ops (Part 1/2) (Genesis-Embodied-AI#619)

* [CI] Encode Python-first testing policy in coverage-check prompt (Genesis-Embodied-AI#622)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Add PR Line change report (Genesis-Embodied-AI#624)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Disable quadrants pytest plugin during quadrants internal coverage runs (Genesis-Embodied-AI#629)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [AutoDiff] Adstack load+store eliminations: EliminateRecomputableAdStackPushes pass + leaf extensions (Genesis-Embodied-AI#621)

* [CI] Simplify coverage PR comment to a single linked line (Genesis-Embodied-AI#630)

* [CUDA] Add AGX Thor, SM_110 (Genesis-Embodied-AI#631)

Co-authored-by: Johnny Nunez and Hugh Perkins

* [CI] Lines changed report: collapse PR comment to a single linked totals line (Genesis-Embodied-AI#632)

* [FEATURE] Support external Metal command queue via qd.init (Genesis-Embodied-AI#618)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Perf] Cache adstack-sizer metadata per task across SPIR-V + LLVM-GPU; per-snode / DeviceAllocation invalidation (Part 2/2) (Genesis-Embodied-AI#620)

* [AutoDiff] Disable EliminateRecomputableAdStackPushes pending mutated-SNode chain-leaf fix (Genesis-Embodied-AI#633)

* [AutoDiff] Adstack chain-clone safety: mutated-SNode leaf reject + load_top consumer-aware guard (Genesis-Embodied-AI#634)

* [Docs] Add user-guide page for qd.simt.block.* primitives (Genesis-Embodied-AI#638)

* [Docs] Expand qd.simt.subgroup user-guide page to cover every op (Genesis-Embodied-AI#639)

* [Perf] Streams 1-4 (Genesis-Embodied-AI#410)

* [Docs] Add user-guide page for matrix decompositions and solvers (Genesis-Embodied-AI#643)

* [Bug] Revert "[Perf] Streams 1-4 (Genesis-Embodied-AI#410)" (Genesis-Embodied-AI#650)

* [Docs] Add user-guide page for atomics and bit operations (Genesis-Embodied-AI#640)

* [Docs] Add user-guide page for qd.simt.grid.* primitives (Genesis-Embodied-AI#641)

* [AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch (Genesis-Embodied-AI#635)

* [AMDGPU] Fix amdgpu parallel rand init (Genesis-Embodied-AI#658)

* [Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap (Genesis-Embodied-AI#655)

* [Perf] Re-land Streams 1-4 with bug fixes (Genesis-Embodied-AI#653)

* [AMDGPU] Apply device_memory_GB=0.3 cap to AMDGPU tests (Genesis-Embodied-AI#659)

* [Perf] Per-launch host sync: drop wait_idle on SPIR-V, pin stream and drop stream_synchronize on CUDA/AMDGPU (Genesis-Embodied-AI#654)

* [AMDGPU] Unload hipModule_t in JITModuleAMDGPU destructor (Genesis-Embodied-AI#660)

* [AMDGPU] Trim default mempool on qd.reset() (Genesis-Embodied-AI#669)

* [AMDGPU] Hoist rand-state buffer to process lifetime (Genesis-Embodied-AI#668)

* [Streams] Use events for streams serialization on AMDGPU and CUDA (Genesis-Embodied-AI#667)

* [Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id (Genesis-Embodied-AI#671)

* [SPIR-V] dispatch_max_reducers: register each task with the real kernel name (Genesis-Embodied-AI#675)

* [AutoDiff] Debug-mode field/grad/dual: dtype, layout, and access-time invariants (Genesis-Embodied-AI#677)

* [Docs] Add user-guide page for qd.algorithms.* device-wide algorithms (Genesis-Embodied-AI#642)

Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>

* [Docs] Doc for existing atomics: switch support table to per-backend columns (Genesis-Embodied-AI#657)

Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>

* [GPU] Cross gpu atomics (Genesis-Embodied-AI#666)

Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>

* [GPU] Make block operations portable cross-gpu (Genesis-Embodied-AI#664)

* [Perf] CPU LLVM adstack-cache: skip per-launch bump-writes + ndarray_shapes capture on forward-only handles (Genesis-Embodied-AI#685)

* [GPU] Cross-GPU for grid ops (Genesis-Embodied-AI#670)

* [Math] Make bitop operations portable cross-gpu (Genesis-Embodied-AI#662)

* [AMDGPU] Always use wave64, on both RDNA and CDNA (Genesis-Embodied-AI#687)

* [AMDGPU] Use syncscope("agent") for atomix xor to avoid CAS livelock (Genesis-Embodied-AI#672)

* [GPU] New bit ops for QIPC (Genesis-Embodied-AI#679)

* [GPU] Subgroup ops cross-gpu (Genesis-Embodied-AI#665)

* [Graph] Rename CUDA Graph to Graph in docs (Genesis-Embodied-AI#691)

* [SPIR-V] Fix FIFO-queue ordering when sharing command queue. (Genesis-Embodied-AI#694)

* [Atomics] New QIPC ops for atomics (Genesis-Embodied-AI#690)

* Pass dataclass sub-structs into qd.func (Genesis-Embodied-AI#698)

* [AMDGPU] HIP graph runtime support for @qd.kernel(graph=True) (Genesis-Embodied-AI#692)

* [CI] Add per-file timing report to Mac Metal test job (Genesis-Embodied-AI#695)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Enable kernel disk cache during tests (Genesis-Embodied-AI#696)

* [Math] New QIPC ops for single-threaded linalg (Genesis-Embodied-AI#683)

* [BREAKING][GPU] New QIPC ops for subgroups (Genesis-Embodied-AI#676)

* [GPU] New QIPC ops for block (Genesis-Embodied-AI#684)

* [GPU] New device-level ops for QIPC (Genesis-Embodied-AI#693)

* [algorithms] PrefixSumExecutor: drop unused GRID_SZ local (Genesis-Embodied-AI#701)

* [block] sync(): fix unsupported-arch error message (Genesis-Embodied-AI#700)

* [volatile_load] add qd.volatile_load primitive (closes Genesis-Embodied-AI#648) (Genesis-Embodied-AI#702)

* [AutoDiff] Reject recycled identity_key in AdStackCache::register_adstack_sizing_info (Genesis-Embodied-AI#708)

* [Vulkan] Declare GroupNonUniform SPIR-V caps and enable shaderSubgroupExtendedTypes (Genesis-Embodied-AI#707)

* Fix duplicate HIP graph driver-function declarations after v1.0.0 merge

The amd-integration fork had cherry-picked the HIP graph driver functions
(graph_create / graph_destroy / graph_add_kernel_node / graph_instantiate /
graph_exec_destroy / graph_launch), and upstream v1.0.0 added the same set.
The per-file 3-way merge appended both copies into
amdgpu_driver_functions.inc.h, producing redeclaration errors that broke the
AMDGPU RHI/runtime compile. Drop the upstream duplicate block; the signatures
are identical to the fork's existing declarations.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix AMDGPU launcher coherence and num_instructions visibility after v1.0.0 merge

- kernel_launcher.cpp: the 3-way merge spliced upstream v1.0.0's launch_llvm_kernel
  rewrite (ephemeral arg/context buffers, explicit-stream path, AmdgpuDefaultStream
  PinGuard) onto the AMD fork's kernarg-by-value + persistent-scratch design,
  leaving references to undefined `ephemeral_context_ptr`. Restore the fork's
  coherent launch_llvm_kernel verbatim; it calls the (already merged) enhanced
  launch_offloaded_tasks, which keeps the max-reducer dispatch and stream-parallel
  groups adapted onto the AMD launch path.
- llvm_context.h: both the fork and upstream added `num_instructions`; the merge
  kept upstream's private placement, but the AMDGPU codegen force-inline heuristic
  calls it statically from outside the class. Move it back to the public section.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Restore async result D2H and hoist kernarg vectors in AMDGPU launcher

The v1.0.0 merge resolution regressed two amd-integration baseline
optimizations in launch_llvm_kernel / launch_offloaded_tasks:

  - The per-launch result-buffer copy was a blocking memcpy_device_to_host,
    forcing a host stall on every value-returning launch and serializing the
    GPU pipeline. Restore the async D2H (the caller synchronizes lazily when it
    needs the value); external-array transfers still stream_synchronize once
    before reading back.

  - launch_task constructed the kernarg std::vectors from initializer lists
    ({kernarg_payload} / {kernarg_size}) on every dispatch (heap alloc + free
    per launch). Hoist arg_ptrs/arg_sizes out of the per-task launch and reuse.

Co-authored-by: Cursor <cursoragent@cursor.com>

* amdgpu: default to LDS permlane64 emulation; drop host-x86 barrier asm on retarget

Two AMDGPU JIT-compile crashes surfaced after the v1.0.0 merge pulled in the QIPC subgroup
ops (Genesis-Embodied-AI#676), which made the rigid constraint solver's wave-cooperative reductions route through
`amdgpu_cross_half_shuffle_i32`. Both manifested as a SIGSEGV inside
`llvm::SIInstrInfo::getInstSizeInBytes` during `JITSessionAMDGPU::compile_module_to_hsaco`
(i.e. at first kernel launch), and reproduce on gfx942 / MI300X. Baseline 0.4.6 never emitted
these constructs, which is why it was unaffected.

1. Native `llvm.amdgcn.permlane64` lowering crashes the bundled LLVM 22.1.0 AMDGPU backend.
   Default `amdgpu_permlane64` to the existing LDS-roundtrip software emulation on every target
   (it produces identical results). Add `QD_AMDGPU_USE_NATIVE_PERMLANE64=1` to opt back into the
   native instruction once the backend bug is fixed; the old `QD_AMDGPU_FORCE_PERMLANE64_FALLBACK`
   is now the default and still honored. This is the actual crash fix.

2. The runtime module is compiled by the host x86_64 clang and only retargeted to amdgcn here, so
   `amdgpu_cross_half_shuffle_i32`'s `__asm__ volatile("" : "+v"(byte))` optimization barrier carries
   x86 flag clobbers (`~{dirflag},~{fpsr},~{flags}`) that are meaningless on AMDGPU. The IR verifies
   but the empty-body INLINEASM is invalid on the amdgcn target. Neutralize empty-body barrier asm
   during retarget (forward the tied value, then erase) so no stale host asm reaches codegen. On the
   wave64 targets we ship `ds_bpermute` already addresses the full wave, so the hint is a no-op.

Co-authored-by: Cursor <cursoragent@cursor.com>

* style: apply clang-format (v19.1.7) to AMDGPU fn_attrs and launcher sources

CI pre-commit's clang-format hook reformatted these files (long
declarations/lambda signatures collapsed onto single lines per the repo's
clang-format config). Apply the same formatting so the hook passes.

No functional changes.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(amdgpu): use CreateNeg for branchless i32 sgn instead of CreateSub(0, input)

clang-tidy (modernize-use-nullptr, -warnings-as-errors) flagged
`builder->CreateSub(0, input)` in the i32 sgn path: the literal `0` binds to
the `llvm::Value*` LHS parameter as a null pointer, not an integer zero.
Replace with `builder->CreateNeg(input)`, which emits `0 - input` with a proper
zero constant -- identical intended semantics, and clang-tidy clean.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Robert Dazi <14996868+v01dXYZ@users.noreply.github.com>
Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>
Co-authored-by: Alexis DUBURCQ <alexis.duburcq@gmail.com>
Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>
Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>
Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com>
Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Johnny <johnnynuca14@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants