[GPU] Cross-GPU for grid ops#670
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 475d6cd876
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| The fence does not require thread convergence, which is why it appears inside `if tid == 0` without deadlocking — `qd.simt.block.sync()` would deadlock there; `grid.memfence()` is safe. | ||
| The fence does not require thread convergence, which is why it appears inside `if tid == 0` without deadlocking — `qd.simt.block.sync()` would deadlock there; `grid.mem_fence()` is safe. | ||
|
|
||
| > **Metal / Vulkan-on-macOS portability:** the example as written above relies on the producer's plain (non-atomic) store `flags[bid] = STATE_AGGREGATE` becoming visible to other workgroups once `grid.mem_fence()` retires. CUDA, AMDGPU, and native Vulkan honor this strictly; Metal (and therefore Vulkan-on-macOS) does **not** — `atomic_thread_fence(memory_scope_device)` only orders atomic accesses across the device. To make this idiom Metal-portable, publish through an atomic store (`qd.atomic_or(flags[bid], STATE_AGGREGATE)`) or split the producer and consumer phases into separate kernel launches. |
There was a problem hiding this comment.
Correct the Metal-portable publication advice
For the Metal / Vulkan-on-macOS path described here, making only flags[bid] atomic does not make this example portable: the payload the reader consumes is still the plain store to partials[bid], and this same note says Metal's device fence only orders atomic accesses. Users following the suggested qd.atomic_or(flags[bid], STATE_AGGREGATE) replacement can still observe the atomic flag and then read a stale partials[bid]; the guidance should require publishing the payload itself through atomics or splitting the phases into separate kernel launches.
Useful? React with 👍 / 👎.
This PR was meant to be block-primitive cross-GPU work. It accidentally
also took on `qd.simt.grid.memfence()` -> `qd.simt.grid.mem_fence()`
rollout because the original block.md (on main) bundled the two together
in a single page. The grid work needs its own review and its own
backend coverage notes (Metal / Vulkan-on-macOS atomic-only ordering
caveat, AMDGPU agent-scope fence), so it is now stacked separately on
hp/cross-gpu-grid.
What this commit reverts on this branch:
- Python: `grid.py` back to main (no `mem_fence()`, no deprecated
`memfence()` alias).
- C++ ops: drop `grid_mem_fence` / `gridMemoryBarrier` registrations
in `internal_ops.inc.h` + `type_system.cpp`. Restore main's
`grid_memfence` spelling everywhere (`runtime.cpp`, `locked_task.h`,
`llvm_context.cpp`).
- LLVM: drop the AMDGPU `grid_mem_fence` patch_fence("agent") hook;
trim its rationale comment to mention only the workgroup-scope fence
(the block-side one stays).
- SPIR-V: drop the `gridMemoryBarrier` -> `OpMemoryBarrier(ScopeDevice,
...)` case in `spirv_codegen.cpp`. (The `globalInvocationId` case
for `block.global_thread_idx()` and the Metal atomic scope/semantics
fix for `sync_*_nonzero` emulation stay -- those are block-side.)
- Tests: drop `test_grid_mem_fence` and the deprecated-alias test
`test_grid_memfence_deprecated_alias`. Restore main's
`test_grid_memfence` (CUDA-only smoke test) verbatim.
- Docs:
* `block.md` -- restore the `## Grid-scope fence: qd.simt.grid.memfence()`
section from main and the original lead/barrier-vs-fence references
to it. Block-side improvements (the support table, sync_*_nonzero
portability, mem_sync -> mem_fence rename, divergent control-flow
fix on CUDA, thread_idx, global_thread_idx) all stay.
* `atomics.md` -- restore main's grid.memfence portability claims;
only the block-side rename `block.mem_sync() -> block.mem_fence()`
is kept.
* `parallelization.md` -- only the block-side rename
`block.mem_sync -> block.mem_fence` is kept; the device-scope
fence reference remains `qd.simt.grid.memfence()` and links into
the in-page `block.md#grid-scope-fence` anchor that this commit
restored.
The grid-side counterpart of this work lives on `hp/cross-gpu-grid`
(PR #670), stacked on top of this branch.
475d6cd to
7ef37eb
Compare
block.md was duplicating the grid-scope fence info that already lives on its own page (`grid.md`). This commit removes the dedicated `## Grid-scope fence` section, drops the `grid.memfence()` row from the support table (it is not a block primitive), and replaces the in-page references with neutral pointers to `grid.md` -- without asserting anything about cross-GPU portability of the device-scope fence, since that work is the subject of a separate PR (`hp/cross-gpu-grid`, #670). Also fixes the broken anchor `block.md#grid-scope-fence-qdsimtgridmemfence` in parallelization.md by re-pointing it at `grid.md`. This keeps PR #664 self-coherent: every doc reference to the device-scope fence on this branch goes through `grid.md`, and `grid.md` itself is unchanged from main on this branch (it gets the cross-GPU rewrite on `hp/cross-gpu-grid`).
fcf4e4b to
88fdd9e
Compare
c132384 to
3d8bec8
Compare
Independent grid-side rollout (off main, no longer stacked on the block-primitives PR #664). Adds full cross-GPU support for the device-scope memory fence and renames `qd.simt.grid.memfence()` -> `qd.simt.grid.mem_fence()` with a deprecated alias for back-compat. User-facing surface: - New canonical name `qd.simt.grid.mem_fence()` works on **CUDA, AMDGPU, Vulkan, and Metal** (with an atomic-only ordering caveat on Metal / Vulkan-on-macOS, documented in `grid.md`). - The old `qd.simt.grid.memfence()` becomes a deprecated alias that emits `DeprecationWarning` on first use and forwards to `mem_fence()`. Per-backend lowering: - **CUDA**: `nvvm_membar_gl` (= `__threadfence()`). - **AMDGPU**: LLVM IR `fence acquire_release syncscope("agent")`, installed via a `patch_fence` body-replacement helper in `llvm_context.cpp`. We can't compile this from `runtime.cpp` because that runtime is built with host-targeted clang and only retargeted to amdgcn here. - **Vulkan / Metal**: SPIR-V `OpMemoryBarrier(ScopeDevice, UniformMemory | WorkgroupMemory | AcquireRelease)`. MoltenVK translates this to MSL `atomic_thread_fence(metal::memory_scope_device)`. C++ ops: - `internal_ops.inc.h` + `type_system.cpp`: register `grid_mem_fence` (CUDA / AMDGPU dispatch) and `gridMemoryBarrier` (SPIR-V dispatch). Drop the now-aliased `grid_memfence` registration. - `runtime.cpp` + `locked_task.h`: rename `grid_memfence` -> `grid_mem_fence` everywhere; rename `system_memfence` -> `system_mem_fence` along with it for naming consistency. - `spirv_codegen.cpp`: add the `gridMemoryBarrier` case. Tests: - `test_grid_mem_fence` -- broadened to `arch=qd.gpu`, with `exclude=qd.metal` and a runtime `pytest.skip` for `vulkan` on macOS (MoltenVK -> Metal inherits the atomic-only caveat). Replaces the old CUDA-only `test_grid_memfence`. - `test_grid_memfence_deprecated_alias` -- verifies the deprecation warning fires. Docs: - `grid.md` -- rewritten to reflect cross-GPU portability: new support table, per-backend lowering section, deprecated-alias section, updated code examples + spin-wait gotcha. - `atomics.md` -- the device-scope-fence portability claim is refreshed to "portable across all four backends" with the Metal caveat link; `grid.memfence()` references switch to `grid.mem_fence()`. The "Backend caveat for the fence-pair pattern" paragraph is reduced to the still-extant `block.mem_sync()` AMDGPU gap (block-side AMDGPU lowering lands in PR #664). - `parallelization.md` -- `grid.memfence()` -> `grid.mem_fence()` rename in the cross-block-coordination paragraph. `block.md` itself is intentionally not touched here (the dedup of `## Grid-scope fence` lives in PR #664).
3d8bec8 to
662b245
Compare
There was a problem hiding this comment.
this file seems ok
There was a problem hiding this comment.
this file seems ok
The CI line-wrap check flagged this comment block as being wrapped conservatively at ~95-98 chars instead of the project's 120c target. Reflow to the full 120c budget (now 6 lines @ 104-118c, was 7 lines @ 89-99c).
Conflicts came from PR #664 (block-primitives cross-gpu) landing on main with overlapping changes to the same memfence machinery this branch touches for grid scope. Both branches renamed *_memfence -> *_mem_fence on disjoint scopes: - main: block_memfence -> block_mem_fence - branch: grid_memfence -> grid_mem_fence system_memfence -> system_mem_fence Resolution = union of both renames. Canonical names after merge are `block_mem_fence` / `grid_mem_fence` / `system_mem_fence` (runtime.cpp already auto-merged onto these). Per-file resolution: - docs/atomics.md, parallelization.md: keep both renames in user-facing prose. - docs/algorithms.md: update stale `grid.memfence()` -> `grid.mem_fence()`. - inc/internal_ops.inc.h, ir/type_system.cpp: register both ops. - runtime/llvm/llvm_context.cpp: CUDA patch_intrinsic for all three; AMDGPU patch_fence for both `block_mem_fence` ("workgroup") and `grid_mem_fence` ("agent"); merged comment block describes both. - tests/python/test_simt.py: keep both new test sets (grid alias test from branch + block.mem_fence tests from main).
| # (Linux / Windows) honors it via `OpMemoryBarrier(ScopeDevice, ...)`. Metal does NOT, even on Apple Silicon: MSL | ||
| # `atomic_thread_fence(memory_scope_device)` -- which is what MoltenVK / SPIRV-Cross translate | ||
| # `OpMemoryBarrier(ScopeDevice, ...)` to -- only orders *atomic* memory accesses across the device, not plain stores; | ||
| # this is a documented Metal limitation called out in the `grid.mem_fence()` Metal caveat in `block.md`. So we exclude |
There was a problem hiding this comment.
Nit: this should say grid.md instead of block.md -- the Metal caveat is documented in grid.md (which this PR rewrites), not block.md.
alanray-tech
left a comment
There was a problem hiding this comment.
LGTM once the nit above is addressed (block.md -> grid.md in the test comment).
The comment block on test_grid_mem_fence pointed to the `grid.mem_fence()` Metal caveat as being in `block.md`, but it lives in `grid.md`. Likely a leftover from when this branch was stacked on the block-primitives PR before being unstacked.
* [Misc] Warn user to disable caching when print_ir/QD_DUMP_IR enabled (Genesis-Embodied-AI#425) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> * [Build] Pin torch version to CUDA 12.8 for CUDA tests (Genesis-Embodied-AI#428) * [Misc] Fixing up taichi-dev urls (Genesis-Embodied-AI#429) * [Perf] Rename cuda_graph to gpu_graph across the codebase (Genesis-Embodied-AI#430) * Misc: fix typo integeral -> integral (Genesis-Embodied-AI#434) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> * [Perf] CUDA graph 4: call from multiple locations (Genesis-Embodied-AI#420) * [Bug] Fix fastcache not restoring graph_do_while_arg (Genesis-Embodied-AI#435) * [Perf] Cache last-call result in perf_dispatch for single-compatible case (Genesis-Embodied-AI#438) * Fix gpu_graph fallback on old Nvidia GPU. (Genesis-Embodied-AI#443) * Fix shared memory offset not reset between CUDA kernels. (Genesis-Embodied-AI#442) * [Misc] Allow disabling GPU graph via QD_GPU_GRAPH=0 env var (Genesis-Embodied-AI#439) * [Misc] Add named top-level loops (Genesis-Embodied-AI#440) * [Misc] Rename gpu_graph to graph (Genesis-Embodied-AI#446) * [Misc] Add cross-platform shuffle (Genesis-Embodied-AI#447) * [Bug] Fix graph_do_while on Windows: search for cudadevrt.lib (Genesis-Embodied-AI#456) * [Bug] Also search default CUDA toolkit install location on Windows (Genesis-Embodied-AI#461) * [SPIRV] Feature Parity Atomics & Shared Array (Genesis-Embodied-AI#432) * [Misc] Change clang format to 120 characters (Genesis-Embodied-AI#463) * [Misc] CUDA graph 5 Add fatbin (Genesis-Embodied-AI#464) * [Bug] Reuse VkInstance across init/reset cycles (Genesis-Embodied-AI#465) * [Perf] Tiles 1: _load, _store, _eye_ (Genesis-Embodied-AI#466) * [Misc] Remove dead InternalFuncStmt type_check override (Genesis-Embodied-AI#471) * [Perf] Tiles 2: add cholesky and ger (Genesis-Embodied-AI#472) * [Perf] Tiles 2b: add triangular solve (Genesis-Embodied-AI#474) * [Misc] Refactor: use _get_col/_set_col in tiles load/store/init (Genesis-Embodied-AI#475) * [Build] Fix flaky test_clock_accuracy (Genesis-Embodied-AI#436) * Fix AARCH64 emitting invalid asm in CUDA kernels. (Genesis-Embodied-AI#473) Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [AMDGPU] Enable HIP memory pool and surface pool-exhaustion errors. (Genesis-Embodied-AI#485) * [AMDGPU] Scope hsaco tmp dir per-user to avoid collisions. (Genesis-Embodied-AI#484) * [Perf] Tiles 3: Add slice syntax, qd.outer() and initial doc (Genesis-Embodied-AI#477) * [AMDGPU] Fix gradient computation. (Genesis-Embodied-AI#486) * Enable all backends that are supported in unit tests. (Genesis-Embodied-AI#488) * Fix SPIRV ID overflow for large kernels due to autodiff. (Genesis-Embodied-AI#489) * [Misc] Fix purity checker to allow accessing constants from quadrants modules (Genesis-Embodied-AI#487) * [Misc] Increase tolerance for clock monotonic test (Genesis-Embodied-AI#492) * [CI] Serialize api doc workflow (Genesis-Embodied-AI#494) * [CI] Increase tolerance for clock test (Genesis-Embodied-AI#506) * [CI] Increase clock test tolerance to 20% (Genesis-Embodied-AI#509) * [Perf] Add tensor_type parametrization to tile16 tests (Genesis-Embodied-AI#504) * [Perf] Tiles 4b: Migrate tiles16 tests to enable fastcache (Genesis-Embodied-AI#505) * [Perf] Tiles 4c: add Tiles16x16 proxy (Genesis-Embodied-AI#507) * [Perf] Tiles 4d: Consolidate slice error tests using parametrize (Genesis-Embodied-AI#508) * [Perf] Tiles 4: add SharedArray slice support (Genesis-Embodied-AI#482) * [Perf] Tiles 5: add Cholesky benchmark demo (Genesis-Embodied-AI#483) * [Doc] Add user guide page for subgroup shuffle (Genesis-Embodied-AI#512) * [Perf] Implement cross-platform shuffle_down (Genesis-Embodied-AI#510) * [Perf] Add portable subgroup reduce_add and reduce_all_add (Genesis-Embodied-AI#511) * [Perf] Add first warmup config to perf dispatch (Genesis-Embodied-AI#422) * [AutoDiff] Autodiff 1: Add baseline adstack regression test for unary_collections (Genesis-Embodied-AI#500) * [AutoDiff] Autodiff 2: Implement derivative for tan (Genesis-Embodied-AI#501) * [AutoDiff] Autodiff 3: Recompute tanh/exp on the operand in the reverse pass (Genesis-Embodied-AI#502) * [AutoDiff] Autodiff 4: Mark rsqrt as non-linear for adstack promotion (Genesis-Embodied-AI#503) * [AutoDiff] Autodiff 5: Fix adjoint-alloca placement for GlobalLoads outside the current range-for (Genesis-Embodied-AI#496) * [AutoDiff] Autodiff 6: Adstack regression tests (Genesis-Embodied-AI#491) * [AutoDiff] Autodiff 7: Fix header size in AdStackAllocaStmt to match u64 runtime layout (Genesis-Embodied-AI#534) * [AutoDiff] Autodiff 8: Surface LLVM adstack push/pop overflow as a Python exception (Genesis-Embodied-AI#535) * [AutoDiff] Autodiff 9: Guard against LLVM worker-thread stack overflow from large per-task adstack budget (Genesis-Embodied-AI#495) * [AutoDiff] Autodiff 10: Implement adstack for SPIR-V (Genesis-Embodied-AI#490) * [AutoDiff] Autodiff 11: Latent adstack-adjacent fixes (AMDGPU hipFree, flush() keeps ctx_buffers_, always-preallocate) (Genesis-Embodied-AI#536) * [Doc] Add AGENTS.md with instructions for AI agents (Genesis-Embodied-AI#541) * [Bug] Abort kernel execution on assertion failure instead of segfaulting (Genesis-Embodied-AI#419) * [Type] ndarray typing 1: Add eval_str=True to inspect.signature() calls (Genesis-Embodied-AI#411) * [CI] Suppress reportPrivateImportUsage in torch-using files (Genesis-Embodied-AI#552) * [Misc] QD_DUMP_IR dumps to files with the task_id added to the filename (Genesis-Embodied-AI#441) * [Type] ndarray typing 2: Fix NDArray single-arg subscript crash (Genesis-Embodied-AI#412) * [Test] Flush xdist channel before worker exit so test failure reports are visible (Genesis-Embodied-AI#555) * [CI] Reduce test retries on CI from 3 to 1. (Genesis-Embodied-AI#554) * [AutoDiff] Autodiff 12: Heap-backed adstack on LLVM backends (CPU/CUDA/AMDGPU) (Genesis-Embodied-AI#537) * [AutoDiff] Autodiff 13: Heap-backed adstack on SPIR-V backends (Metal, Vulkan) (Genesis-Embodied-AI#493) * [AutoDiff] Autodiff 14: Resolve bounded-inner-loop adstacks without default_ad_stack_size fallback (Genesis-Embodied-AI#539) * [SPIRV] Vulkan SPIR-V correctness: atomic-view aliasing, PSB stride, narrow storage caps, u1 cast, per-init layer recheck (Genesis-Embodied-AI#513) * [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch and sanitise MoltenVK cap advertisement (Genesis-Embodied-AI#551) * [Test] Suppress stock pytest-timeout to avoid conflict with pytest_hardtle (Genesis-Embodied-AI#557) * [Vulkan] Use SDK validation layer for debugPrintf instead of apt package (Genesis-Embodied-AI#562) * [Test] Fix flaky perf_dispatch tests by increasing work amounts (Genesis-Embodied-AI#559) * [Test] Add --maxfail CLI option to run_tests.py (default 20) (Genesis-Embodied-AI#558) * [CI] Vulkan debug printf fix to address flaky tests (Genesis-Embodied-AI#563) * [Docs] Add a new page to help for first time contributors (Genesis-Embodied-AI#426) Authored-by: v01dxyz <v01dxyz@v01d.xyz> * [AutoDiff] Autodiff 16: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr (Genesis-Embodied-AI#543) * Fix: raise error if device memory allocation fails (Genesis-Embodied-AI#451) (Genesis-Embodied-AI#453) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [CI] Add CI job to check line wrapping of comments and docs (Genesis-Embodied-AI#564) * [Misc] Add coverage report to PRs, including kernels (Genesis-Embodied-AI#470) * [CI] CI wrap check feeds only diffs to agent (Genesis-Embodied-AI#567) * Skip 'flaky' test on MacOS CI. (Genesis-Embodied-AI#573) * [Test] Fix missing `import sys` in test_fail_device_memory_allocation (Genesis-Embodied-AI#574) * [CI] Fix Vulkan debugPrintf flake with session-scoped warmup (Genesis-Embodied-AI#571) * [AutoDiff] determine_ad_stack_size: replace whole-CFG Bellman-Ford with SCC + DAG DP (Genesis-Embodied-AI#575) * [Test] Fix macOS OOM skip reason to describe actual root cause (Genesis-Embodied-AI#576) * [Lang] whole_kernel_cse: 2.5x compile time speedup on large kernels (Genesis-Embodied-AI#577) * [CI] Add CI check for unnecessarily deleted comments (Genesis-Embodied-AI#570) * [CI] Migrate coverage report to github Check page (Genesis-Embodied-AI#566) * [Lang] Skip IR verifier between passes unless debug=true (Genesis-Embodied-AI#579) * [Lang] Inline AdStack ops on release LLVM codegen: dramatically reduces compile time for adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#584) * [CUDA] Honor offline_cache=False end-to-end so QD_OFFLINE_CACHE=0 actually gives a cold compile (Genesis-Embodied-AI#580) * [Type] Tensor 24 (Genesis-Embodied-AI#561) Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> * [Lang] auto_diff host-walk reductions: dramatically faster front-end compile time on adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#587) * [AutoDiff] Speed up reverse-mode kernel launches on GPU backends (Genesis-Embodied-AI#578) * [Vulkan] Move adstack-sizer scratch out of Function-scope memory to fix SPIR-V pipeline build failures (Genesis-Embodied-AI#588) * [AutoDiff] Improve diagnosis of unsupported reverse-mode AD patterns (Genesis-Embodied-AI#590) * [Bug] Fix: promote Ndarray to AnyArray in build_Name for flattened struct fields (Genesis-Embodied-AI#592) * [SPIR-V] Shrink reverse-grad kernel MSL by ~50% (Genesis-Embodied-AI#591) * [CI] Add CI check that PR changes have test coverage (Genesis-Embodied-AI#596) * [Perf] Enable zero-copy in to_torch() and to_numpy() (Genesis-Embodied-AI#450) * Add BufferView: safe sub-range ndarray access for kernels (Genesis-Embodied-AI#585) Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [Doc] Add user-facing fastcache documentation (Genesis-Embodied-AI#597) Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> * [Misc] Upgrade to enable v1 dlpack so to_numpy(copy=False) writable (Genesis-Embodied-AI#598) Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local> * [AutoDiff] Cut reverse-mode adstack memory usage 10x on all backends (Genesis-Embodied-AI#599) * [Misc] Add CI check for feature file factorization (Genesis-Embodied-AI#606) * [Perf] Skip _recursive_set_args for all-Field frozen dataclass structs (Genesis-Embodied-AI#607) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] SNode-arm bound-expr capture rejects fold-attack gate indices (Genesis-Embodied-AI#610) * [Misc] Suppress field fastcache warning for qd.Tensor (Genesis-Embodied-AI#615) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] Adstack heap: clip reducer count by per-task loop trip count (compile-time and SizeExpr-evaluated) (Genesis-Embodied-AI#611) * [Misc] Forward copy= through qd.Tensor, add copy=None option (Genesis-Embodied-AI#616) Co-authored-by: Cursor <cursoragent@cursor.com> * [Doc] Update README (Genesis-Embodied-AI#617) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Fix coverage report showing def lines as uncovered (Genesis-Embodied-AI#623) Co-authored-by: Cursor <cursoragent@cursor.com> * [Perf] Generic launcher: persistent context, JIT-pointer reuse, Metal compute encoder, LLVM-GPU async memory ops (Part 1/2) (Genesis-Embodied-AI#619) * [CI] Encode Python-first testing policy in coverage-check prompt (Genesis-Embodied-AI#622) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Add PR Line change report (Genesis-Embodied-AI#624) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Disable quadrants pytest plugin during quadrants internal coverage runs (Genesis-Embodied-AI#629) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] Adstack load+store eliminations: EliminateRecomputableAdStackPushes pass + leaf extensions (Genesis-Embodied-AI#621) * [CI] Simplify coverage PR comment to a single linked line (Genesis-Embodied-AI#630) * [CUDA] Add AGX Thor, SM_110 (Genesis-Embodied-AI#631) Co-authored-by: Johnny Nunez and Hugh Perkins * [CI] Lines changed report: collapse PR comment to a single linked totals line (Genesis-Embodied-AI#632) * [FEATURE] Support external Metal command queue via qd.init (Genesis-Embodied-AI#618) Co-authored-by: Cursor <cursoragent@cursor.com> * [Perf] Cache adstack-sizer metadata per task across SPIR-V + LLVM-GPU; per-snode / DeviceAllocation invalidation (Part 2/2) (Genesis-Embodied-AI#620) * [AutoDiff] Disable EliminateRecomputableAdStackPushes pending mutated-SNode chain-leaf fix (Genesis-Embodied-AI#633) * [AutoDiff] Adstack chain-clone safety: mutated-SNode leaf reject + load_top consumer-aware guard (Genesis-Embodied-AI#634) * [Docs] Add user-guide page for qd.simt.block.* primitives (Genesis-Embodied-AI#638) * [Docs] Expand qd.simt.subgroup user-guide page to cover every op (Genesis-Embodied-AI#639) * [Perf] Streams 1-4 (Genesis-Embodied-AI#410) * [Docs] Add user-guide page for matrix decompositions and solvers (Genesis-Embodied-AI#643) * [Bug] Revert "[Perf] Streams 1-4 (Genesis-Embodied-AI#410)" (Genesis-Embodied-AI#650) * [Docs] Add user-guide page for atomics and bit operations (Genesis-Embodied-AI#640) * [Docs] Add user-guide page for qd.simt.grid.* primitives (Genesis-Embodied-AI#641) * [AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch (Genesis-Embodied-AI#635) * [AMDGPU] Fix amdgpu parallel rand init (Genesis-Embodied-AI#658) * [Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap (Genesis-Embodied-AI#655) * [Perf] Re-land Streams 1-4 with bug fixes (Genesis-Embodied-AI#653) * [AMDGPU] Apply device_memory_GB=0.3 cap to AMDGPU tests (Genesis-Embodied-AI#659) * [Perf] Per-launch host sync: drop wait_idle on SPIR-V, pin stream and drop stream_synchronize on CUDA/AMDGPU (Genesis-Embodied-AI#654) * [AMDGPU] Unload hipModule_t in JITModuleAMDGPU destructor (Genesis-Embodied-AI#660) * [AMDGPU] Trim default mempool on qd.reset() (Genesis-Embodied-AI#669) * [AMDGPU] Hoist rand-state buffer to process lifetime (Genesis-Embodied-AI#668) * [Streams] Use events for streams serialization on AMDGPU and CUDA (Genesis-Embodied-AI#667) * [Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id (Genesis-Embodied-AI#671) * [SPIR-V] dispatch_max_reducers: register each task with the real kernel name (Genesis-Embodied-AI#675) * [AutoDiff] Debug-mode field/grad/dual: dtype, layout, and access-time invariants (Genesis-Embodied-AI#677) * [Docs] Add user-guide page for qd.algorithms.* device-wide algorithms (Genesis-Embodied-AI#642) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [Docs] Doc for existing atomics: switch support table to per-backend columns (Genesis-Embodied-AI#657) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [GPU] Cross gpu atomics (Genesis-Embodied-AI#666) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [GPU] Make block operations portable cross-gpu (Genesis-Embodied-AI#664) * [Perf] CPU LLVM adstack-cache: skip per-launch bump-writes + ndarray_shapes capture on forward-only handles (Genesis-Embodied-AI#685) * [GPU] Cross-GPU for grid ops (Genesis-Embodied-AI#670) * [Math] Make bitop operations portable cross-gpu (Genesis-Embodied-AI#662) * [AMDGPU] Always use wave64, on both RDNA and CDNA (Genesis-Embodied-AI#687) * [AMDGPU] Use syncscope("agent") for atomix xor to avoid CAS livelock (Genesis-Embodied-AI#672) * [GPU] New bit ops for QIPC (Genesis-Embodied-AI#679) * [GPU] Subgroup ops cross-gpu (Genesis-Embodied-AI#665) * [Graph] Rename CUDA Graph to Graph in docs (Genesis-Embodied-AI#691) * [SPIR-V] Fix FIFO-queue ordering when sharing command queue. (Genesis-Embodied-AI#694) * [Atomics] New QIPC ops for atomics (Genesis-Embodied-AI#690) * Pass dataclass sub-structs into qd.func (Genesis-Embodied-AI#698) * [AMDGPU] HIP graph runtime support for @qd.kernel(graph=True) (Genesis-Embodied-AI#692) * [CI] Add per-file timing report to Mac Metal test job (Genesis-Embodied-AI#695) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Enable kernel disk cache during tests (Genesis-Embodied-AI#696) * [Math] New QIPC ops for single-threaded linalg (Genesis-Embodied-AI#683) * [BREAKING][GPU] New QIPC ops for subgroups (Genesis-Embodied-AI#676) * [GPU] New QIPC ops for block (Genesis-Embodied-AI#684) * [GPU] New device-level ops for QIPC (Genesis-Embodied-AI#693) * [algorithms] PrefixSumExecutor: drop unused GRID_SZ local (Genesis-Embodied-AI#701) * [block] sync(): fix unsupported-arch error message (Genesis-Embodied-AI#700) * [volatile_load] add qd.volatile_load primitive (closes Genesis-Embodied-AI#648) (Genesis-Embodied-AI#702) * [AutoDiff] Reject recycled identity_key in AdStackCache::register_adstack_sizing_info (Genesis-Embodied-AI#708) * [Vulkan] Declare GroupNonUniform SPIR-V caps and enable shaderSubgroupExtendedTypes (Genesis-Embodied-AI#707) * Fix duplicate HIP graph driver-function declarations after v1.0.0 merge The amd-integration fork had cherry-picked the HIP graph driver functions (graph_create / graph_destroy / graph_add_kernel_node / graph_instantiate / graph_exec_destroy / graph_launch), and upstream v1.0.0 added the same set. The per-file 3-way merge appended both copies into amdgpu_driver_functions.inc.h, producing redeclaration errors that broke the AMDGPU RHI/runtime compile. Drop the upstream duplicate block; the signatures are identical to the fork's existing declarations. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix AMDGPU launcher coherence and num_instructions visibility after v1.0.0 merge - kernel_launcher.cpp: the 3-way merge spliced upstream v1.0.0's launch_llvm_kernel rewrite (ephemeral arg/context buffers, explicit-stream path, AmdgpuDefaultStream PinGuard) onto the AMD fork's kernarg-by-value + persistent-scratch design, leaving references to undefined `ephemeral_context_ptr`. Restore the fork's coherent launch_llvm_kernel verbatim; it calls the (already merged) enhanced launch_offloaded_tasks, which keeps the max-reducer dispatch and stream-parallel groups adapted onto the AMD launch path. - llvm_context.h: both the fork and upstream added `num_instructions`; the merge kept upstream's private placement, but the AMDGPU codegen force-inline heuristic calls it statically from outside the class. Move it back to the public section. Co-authored-by: Cursor <cursoragent@cursor.com> * Restore async result D2H and hoist kernarg vectors in AMDGPU launcher The v1.0.0 merge resolution regressed two amd-integration baseline optimizations in launch_llvm_kernel / launch_offloaded_tasks: - The per-launch result-buffer copy was a blocking memcpy_device_to_host, forcing a host stall on every value-returning launch and serializing the GPU pipeline. Restore the async D2H (the caller synchronizes lazily when it needs the value); external-array transfers still stream_synchronize once before reading back. - launch_task constructed the kernarg std::vectors from initializer lists ({kernarg_payload} / {kernarg_size}) on every dispatch (heap alloc + free per launch). Hoist arg_ptrs/arg_sizes out of the per-task launch and reuse. Co-authored-by: Cursor <cursoragent@cursor.com> * amdgpu: default to LDS permlane64 emulation; drop host-x86 barrier asm on retarget Two AMDGPU JIT-compile crashes surfaced after the v1.0.0 merge pulled in the QIPC subgroup ops (Genesis-Embodied-AI#676), which made the rigid constraint solver's wave-cooperative reductions route through `amdgpu_cross_half_shuffle_i32`. Both manifested as a SIGSEGV inside `llvm::SIInstrInfo::getInstSizeInBytes` during `JITSessionAMDGPU::compile_module_to_hsaco` (i.e. at first kernel launch), and reproduce on gfx942 / MI300X. Baseline 0.4.6 never emitted these constructs, which is why it was unaffected. 1. Native `llvm.amdgcn.permlane64` lowering crashes the bundled LLVM 22.1.0 AMDGPU backend. Default `amdgpu_permlane64` to the existing LDS-roundtrip software emulation on every target (it produces identical results). Add `QD_AMDGPU_USE_NATIVE_PERMLANE64=1` to opt back into the native instruction once the backend bug is fixed; the old `QD_AMDGPU_FORCE_PERMLANE64_FALLBACK` is now the default and still honored. This is the actual crash fix. 2. The runtime module is compiled by the host x86_64 clang and only retargeted to amdgcn here, so `amdgpu_cross_half_shuffle_i32`'s `__asm__ volatile("" : "+v"(byte))` optimization barrier carries x86 flag clobbers (`~{dirflag},~{fpsr},~{flags}`) that are meaningless on AMDGPU. The IR verifies but the empty-body INLINEASM is invalid on the amdgcn target. Neutralize empty-body barrier asm during retarget (forward the tied value, then erase) so no stale host asm reaches codegen. On the wave64 targets we ship `ds_bpermute` already addresses the full wave, so the hint is a no-op. Co-authored-by: Cursor <cursoragent@cursor.com> * style: apply clang-format (v19.1.7) to AMDGPU fn_attrs and launcher sources CI pre-commit's clang-format hook reformatted these files (long declarations/lambda signatures collapsed onto single lines per the repo's clang-format config). Apply the same formatting so the hook passes. No functional changes. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(amdgpu): use CreateNeg for branchless i32 sgn instead of CreateSub(0, input) clang-tidy (modernize-use-nullptr, -warnings-as-errors) flagged `builder->CreateSub(0, input)` in the i32 sgn path: the literal `0` binds to the `llvm::Value*` LHS parameter as a null pointer, not an integer zero. Replace with `builder->CreateNeg(input)`, which emits `0 - input` with a proper zero constant -- identical intended semantics, and clang-tidy clean. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Robert Dazi <14996868+v01dXYZ@users.noreply.github.com> Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> Co-authored-by: Alexis DUBURCQ <alexis.duburcq@gmail.com> Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com> Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Johnny <johnnynuca14@gmail.com>
Independent grid-side rollout (off
main, no longer stacked on PR #664). Adds full cross-GPU support forqd.simt.grid.mem_fence()and renamesqd.simt.grid.memfence()->qd.simt.grid.mem_fence()with a deprecated alias for back-compat.What this adds
User-facing surface:
qd.simt.grid.mem_fence()works on CUDA, AMDGPU, Vulkan, and Metal (with an atomic-only ordering caveat on Metal / Vulkan-on-macOS, documented ingrid.md).qd.simt.grid.memfence()becomes a deprecated alias that emitsDeprecationWarningon first use and forwards tomem_fence().Per-backend lowering:
nvvm_membar_gl(=__threadfence()).fence acquire_release syncscope("agent"), installed via apatch_fencebody-replacement helper inllvm_context.cpp(defined inline in this PR). We can't compile this fromruntime.cppbecause that runtime is built with host-targeted clang and only retargeted to amdgcn here.OpMemoryBarrier(ScopeDevice, UniformMemory | WorkgroupMemory | AcquireRelease). MoltenVK translates this to MSLatomic_thread_fence(metal::memory_scope_device).C++ ops:
internal_ops.inc.h+type_system.cpp: registergrid_mem_fence(CUDA / AMDGPU dispatch) andgridMemoryBarrier(SPIR-V dispatch). Drop the now-aliasedgrid_memfenceregistration.runtime.cpp+locked_task.h: renamegrid_memfence->grid_mem_fenceeverywhere; renamesystem_memfence->system_mem_fencealong with it for naming consistency.spirv_codegen.cpp: add thegridMemoryBarriercase.Tests:
test_grid_mem_fence-- broadened toarch=qd.gpu, withexclude=qd.metaland a runtimepytest.skipforvulkanon macOS. Replaces the old CUDA-onlytest_grid_memfence.test_grid_memfence_deprecated_alias-- verifies the deprecation warning fires.Docs:
grid.md-- rewritten to reflect cross-GPU portability: new support table, per-backend lowering section, deprecated-alias section, updated code examples + spin-wait gotcha.atomics.md-- the device-scope-fence portability claim is refreshed to "portable across all four backends" with the Metal caveat link;grid.memfence()references switch togrid.mem_fence(). The "Backend caveat for the fence-pair pattern" paragraph is reduced to the still-extantblock.mem_sync()AMDGPU gap (block-side AMDGPU lowering lands in PR [GPU] Make block operations portable cross-gpu #664).parallelization.md--grid.memfence()->grid.mem_fence()rename in the cross-block-coordination paragraph.block.mditself is intentionally not touched here (the dedup of## Grid-scope fencelives in PR [GPU] Make block operations portable cross-gpu #664).Test plan
pytest tests/python/test_simt.py -k gridon rtx-high.test_grid_mem_fenceis excluded on Metal and runtime-skipped on Vulkan-on-macOS, so the deprecated-alias test is what runs there.