[Perf] Tiles 2b: add triangular solve #474

Merged
hughperkins merged 10 commits into main from hp/tiles-2b on Apr 13, 2026

@hughperkins hughperkins commented Apr 13, 2026

Collaborator

Issue: #

Brief Summary

This PR adds triangular solve to Tile16x16, building on the Cholesky factorization from PR 2a. Together, cholesky_ and solve_triangular_ give users the two primitives needed for blocked Cholesky-based linear solves entirely in registers.

Two methods are added to _tile16.py:
• _trsm(self, L) -- the internal workhorse. Solves X @ L^T = B in-place (self starts as B, ends as X) by forward-substituting column-by-column. For each column c, it computes a dot product of the previously solved columns against row c of L (fetched via subgroup.shuffle), subtracts it from the current column, and divides by the diagonal entry L[c, c]. Uses the _get_col/_set_col helpers from PR 2a.
• solve_triangular_(self, B, lower=True) -- the public API wrapper. Validates lower=True at compile time, then delegates to B._trsm(self). The argument order follows the convention "self is the factored matrix, B is modified in-place."
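
The column-by-column forward substitution described above can be sketched in plain Python. This is only an illustrative reference under stated assumptions, not the code in _tile16.py: the function name trsm_reference is hypothetical, matrices are plain lists of rows, and the loop over r stands in for the per-thread parallelism that the real method gets from subgroup shuffles.

```python
def trsm_reference(B, L):
    """Return X such that X @ L^T = B, for lower-triangular L.

    B and L are lists of rows. Hypothetical plain-Python stand-in for the
    in-register version: no shuffles and no DSL types are involved.
    """
    n = len(L)
    X = [row[:] for row in B]  # copy here; the real method overwrites B in place
    for c in range(n):
        for r in range(len(X)):
            # Dot product of the already-solved entries X[r][0..c-1]
            # against row c of L.
            dot = sum(X[r][j] * L[c][j] for j in range(c))
            # Subtract from the current column and divide by the diagonal.
            X[r][c] = (X[r][c] - dot) / L[c][c]
    return X


# Small hand-built example: B was constructed as X_true @ L^T.
L = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, 5.0, 6.0]]
B = [[2.0, 7.0, 32.0],
     [4.0, 5.0, 19.0],
     [6.0, 3.0, 12.0]]
X = trsm_reference(B, L)
```

Column c depends only on columns 0..c-1, which is why the solve can proceed left to right with one pass over the tile.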

The test loads a pre-computed Cholesky factor L and a random matrix B, runs L.solve_triangular_(B), and verifies X @ L^T ≈ B against scipy. Parametrized over f32/f64.
Good points:
• Very small and focused: 68 lines total (28 impl + 40 test), easy to review.
• Reuses _get_col/_set_col from PR 2a, so the algorithm reads cleanly as a loop rather than 16-way unrolled register access.
• The algorithm structure mirrors cholesky_ (column-by-column forward substitution with shuffles), making it easy to verify correctness by analogy.
• Test validates against scipy with tight f64 tolerance.

Bad points:
• Only lower=True is supported; lower=False raises at compile time rather than implementing upper-triangular solve.
• Test only exercises the full 16x16 case (offset 0, full tile). No partial-tile or offset-based parametrization like the Cholesky test in PR 2a has.
• No test for the lower=False error path.

TypeCheck has allow_undefined_visitor=true, so removing this override
is a no-op. The original code was actively wrong (overwriting correct
ret_type with i32), but nothing downstream relied on ret_type from
the type_check pass for InternalFuncStmt, so the bug was latent.
Removing the override eliminates the misleading TODO and prevents
a future pass from accidentally depending on the wrong type.
…tests

- Parametrize ger_sub and cholesky tests over f32/f64 dtypes
- Use tighter tolerance (1e-10) for f64, 1e-5 for f32
- Parametrize cholesky over src_offset (0, 5, 32) and dst_delta (0, 3, 16)
- Verify untouched regions of dst array remain at sentinel value
…olesky_

_ger_sub: 34 lines → 4 lines
cholesky_: 224 lines → 22 lines
Quadrants DSL types don't carry operator overloads in their stubs,
so pyright can't verify +=, *, /, > on shuffled values.
Fix scipy reference computation: solving X @ L^T = B requires
solve_triangular(L, B.T).T, not solve_triangular(L, B, trans='T').T.
Add type: ignore[reportOperatorIssue] for DSL operator expressions
in _trsm that pyright can't verify.
@hughperkins

Collaborator Author

Questions to opus:

Is there any code or code change in this PR that might be embarrassing in some way if published in a public github repo?

  Yes, two issues:
  1. The test reference computation is wrong. The _trsm solves X @ L^T = B, but the scipy reference computes (L^{-T} @ B)^T = B^T @ L^{-1} instead of B @ L^{-T}. This test
     actually fails (we saw this earlier). The fix is B.T not trans="T":


  # Wrong (current):
  X_ref = scipy.linalg.solve_triangular(L_ref, B, lower=True, trans="T").T
  # Correct:
  X_ref = scipy.linalg.solve_triangular(L_ref, B.T, lower=True).T

  2. Missing `# type: ignore[reportOperatorIssue]` in _trsm -- the same DSL operator expressions (+=, *, /) that needed suppression in cholesky_ appear here without it. Pyright
     CI will fail on this branch too.

  Let me also check if the protocol stubs were updated.

  ⬢ Grepped 4 greps
    Grepped "_trsm|solve_triangular_" in quadrants/python/quadrants/lang/simt/_tile16.py
    Grepped "def _trsm|def solve_triangular_" in quadrants/python/quadrants/lang/simt/_tile16.py
    Grepped "class _Tile16x16Proto" in quadrants/python/quadrants/lang/simt/_tile16.py
    Grepped "_Tile16x16Proto" in quadrants/python/quadrants/lang/simt/_tile16.py

  3. `solve_triangular_` is missing `@qd.func` -- cholesky_ and _ger_sub both have it, but solve_triangular_ does not. It'll fail when called from a kernel.

  Want me to fix all three issues?

Why don't the tests catch this?

  Good question -- let me reconsider each:
  Issue 1 (wrong reference): The test does fail -- we saw it earlier when running on hp/tiles-2b. So this IS caught. It's a test bug (wrong expected value), not a code bug.
  Issue 2 (missing pyright suppression): This would be caught by pyright CI, same as the hp/tiles-2a failure you just linked.
  Issue 3 (missing `@qd.func`): Let me check whether this actually matters.

  ⬢ Grepped "..._eye_|_load|_store|_get_col|_set_col)" in quadrants/python/quadrants/lang/simt/_tile16.py
    Found 398 matches

  I was wrong about issue #3. The tests on tiles-4 call L.solve_triangular_(B) inside a kernel and all 785 tests pass. So solve_triangular_ works without @qd.func -- Quadrants'
  AST transformer inlines dataclass method bodies regardless of the decorator. The if not lower: raise TypeError(...) acts as a compile-time guard, not runtime, which is why it
  works.
  So to correct my answer: there are really two issues, not three:
  1. Wrong test reference -- the test does fail (we saw it earlier), so it IS caught. It's just a test bug that was shipped. Fix: B.T instead of trans="T".
  2. Missing pyright suppression in `_trsm` -- pyright CI will catch this, same as it caught the cholesky_ issue on 2a.

  Want me to fix both on hp/tiles-2b and propagate?

yes

  Done. Both fixes applied to hp/tiles-2b and propagated to all downstream branches:
  • Test fix: solve_triangular(L, B.T, lower=True).T instead of solve_triangular(L, B, lower=True, trans="T").T
  • Pyright fix: # type: ignore[reportOperatorIssue] on the two operator lines in _trsm
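
The difference between the two reference computations can be checked numerically. The sketch below is an assumption-laden illustration, not the project's test: it uses np.linalg.solve as a stand-in for scipy.linalg.solve_triangular (so it does not exploit triangularity), and the matrix size, seed, and variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4  # illustrative size; the real tile is 16x16

# Lower-triangular L with a diagonal shifted away from zero (made-up data).
L = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)
B = rng.standard_normal((n, n))

# Target: X with X @ L.T == B. Transposing both sides gives L @ X.T == B.T,
# so solve against B.T and transpose the result back.
X_good = np.linalg.solve(L, B.T).T

# The earlier form solves L.T @ Y == B and transposes, which yields
# Y.T == B.T @ inv(L), a different matrix in general.
X_bad = np.linalg.solve(L.T, B).T

good_ok = np.allclose(X_good @ L.T, B)
bad_ok = np.allclose(X_bad @ L.T, B)
```

With generic data, good_ok holds and bad_ok does not, which matches the observation that the original test failed because of its expected value rather than because of _trsm.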

Base automatically changed from hp/tiles-2a to main April 13, 2026 13:32
@hughperkins

Collaborator Author

Add a test for the lower=False error path.

Done. Added test_tile16_solve_triangular_upper_raises which verifies that calling solve_triangular_(..., lower=False) raises TypeError with the expected message.

@hughperkins hughperkins marked this pull request as ready for review April 13, 2026 13:44

@claude claude Bot left a comment

LGTM — small, focused PR with correct triangular solve implementation.

Extended reasoning...

Overview

Adds two methods to _tile16.py: the internal _trsm workhorse (in-place triangular solve via subgroup shuffles) and the public solve_triangular_ wrapper. Test validates against scipy over f32/f64. Changes touch 2 files, ~68 lines total.

Security risks

None. No auth, I/O, or user-facing surfaces touched.

Level of scrutiny

Low. This is internal GPU tile arithmetic — sandbox/library code, not production-critical infrastructure. The algorithm mirrors the existing cholesky_ pattern (same shuffle mechanics), which lowers the verification bar.

Other factors

Known issues (wrong test reference, missing pyright suppressions) were fixed prior to this review based on the PR timeline. Two nits flagged by the bug hunter (misleading Lkj variable name and unnecessary GPU decorator on a pure-Python test) are cosmetic/minor — neither affects correctness of the implementation or the primary test coverage.

Comment on lines +433 to +434
Lkj = qd.simt.subgroup.shuffle(L._get_col(j), qd.u32(c))
dot += self._get_col(j) * Lkj # type: ignore[reportOperatorIssue]

🟡 The variable Lkj on line 433 of _trsm is a copy-paste artifact from cholesky_ where the outer loop variable was k; in _trsm the outer loop variable is c, so the shuffle actually fetches L[c, j] and the variable should be named Lcj. No runtime effect — purely a naming issue.

Extended reasoning...

What the bug is and how it manifests

In _trsm, the outer loop iterates over column index c (for c in range(_TILE)). Inside the inner loop, the subgroup shuffle qd.simt.subgroup.shuffle(L._get_col(j), qd.u32(c)) broadcasts to all threads the value held by thread c in register j — in other words, it retrieves L[c, j]. The result is stored in a local variable named Lkj. The name strongly implies the outer loop variable is k, not c.
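
The indexing claim can be illustrated with a small emulation of the shuffle. This is a sketch under assumptions taken from the paragraph above (lane r holds row r, with one register per column); the helper names shuffle and get_col are hypothetical stand-ins, not the Quadrants API, and the tile is shrunk from 16 to 4 for readability.

```python
TILE = 4  # the real tile is 16x16; 4 keeps the example readable


def shuffle(per_lane_values, src_lane):
    """Emulate a subgroup shuffle: every lane receives src_lane's value."""
    return [per_lane_values[src_lane]] * len(per_lane_values)


def get_col(M, j):
    """Register j across all lanes, i.e. column j of M (assumed layout)."""
    return [M[r][j] for r in range(len(M))]


# L[r][j] = 10*r + j, so each entry encodes its own (row, column) index.
L = [[10 * r + j for j in range(TILE)] for r in range(TILE)]

c, j = 3, 1
Lcj = shuffle(get_col(L, j), c)  # every lane now holds L[c][j]
```

Every lane ends up with the entry from row c, column j, which is why Lcj is the name that matches the value.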

The specific code path that triggers it

Lines 433–434 of python/quadrants/lang/simt/_tile16.py:

for c in range(_TILE):
    ...
    for j in range(_TILE):
        if c > j:
            Lkj = qd.simt.subgroup.shuffle(L._get_col(j), qd.u32(c))  # line 433
            dot += self._get_col(j) * Lkj                              # line 434

Why existing code doesn't prevent it

This is a pure naming issue. Python does not enforce any relationship between a variable name and the loop variable it conceptually tracks, and the Quadrants DSL compiler is only interested in the value of the shuffle result, not its name. No static analysis or test will flag a misleading identifier name.

What the impact would be

The code is functionally correct. The misleading name causes confusion only for readers who compare _trsm against cholesky_, where Lkj correctly names L[k, j] because the outer loop variable there is k. A developer reading _trsm expecting k to be the outer loop variable (by analogy) will be briefly confused.

How to fix it

Rename Lkj to Lcj on the two lines where it appears (assignment and use):

Lcj = qd.simt.subgroup.shuffle(L._get_col(j), qd.u32(c))
dot += self._get_col(j) * Lcj  # type: ignore[reportOperatorIssue]

Step-by-step proof

  1. In cholesky_ (lines ~411–418), the outer loop is for k in range(_TILE). The shuffle qd.simt.subgroup.shuffle(my_col, qd.u32(k)) fetches L[k, j] from thread k. Naming this Lkj is accurate.
  2. In _trsm (lines ~430–434), the outer loop is for c in range(_TILE). The shuffle qd.simt.subgroup.shuffle(L._get_col(j), qd.u32(c)) fetches L[c, j] from thread c. Naming this Lkj is inaccurate — there is no variable k anywhere in _trsm.
  3. The name was carried over verbatim when _trsm was written by analogy with cholesky_, but the loop variable was renamed from k to c without updating the derived variable name.

Comment on lines +437 to +441
@test_utils.test(arch=qd.gpu)
def test_tile16_solve_triangular_upper_raises():
    Tile = _make_tile16x16(qd.f32)
    with pytest.raises(TypeError, match="only lower=True"):
        Tile().solve_triangular_(Tile(), lower=False)

🟡 test_tile16_solve_triangular_upper_raises is decorated with @test_utils.test(arch=qd.gpu), causing it to be silently skipped on non-GPU machines. Since the test only exercises a pure-Python TypeError raised before any GPU code runs, the decorator is unnecessary and should be removed.

Extended reasoning...

The test test_tile16_solve_triangular_upper_raises (tests/python/test_tile16.py:437-441) is decorated with @test_utils.test(arch=qd.gpu). Looking at test_utils.py lines 226-234, when qd.gpu is not present in expected_archs() (i.e., no GPU is available), the arch list becomes empty and pytest.mark.skip is applied to the test function. This means the test is silently skipped on any CI agent or developer machine without GPU hardware.

However, the test body is entirely pure Python. It calls _make_tile16x16(qd.f32) to get a dataclass type, calls Tile() to instantiate it, and then calls solve_triangular_(Tile(), lower=False). The solve_triangular_ method (in _tile16.py) immediately checks "if not lower: raise TypeError(...)" before delegating to B._trsm(self) or doing anything GPU-related. No @qd.func-decorated path is ever entered, and no kernel is compiled or launched.

Existing pure-Python tests in the same file correctly omit the GPU decorator. For example, test_tile16_size_constant (which just checks Tile.SIZE == 16) and test_tile16_make_caching (which checks the caching behavior of _make_tile16x16) both run as plain pytest functions with no decorator, and conftest.py lines 79-83 confirm that undecorated functions receive (None, None) for req_arch/req_options and run unconditionally as pure Python.

The impact is silent coverage loss on CPU-only CI agents. The lower=False error-path guard in solve_triangular_ goes entirely untested in those environments, at zero cost to fix.

Step-by-step proof on a non-GPU machine: (1) pytest collects test_tile16_solve_triangular_upper_raises; (2) the @test_utils.test(arch=qd.gpu) decorator runs, finds qd.gpu not in expected_archs(), so arch=[], and applies pytest.mark.skip; (3) pytest marks the test as skipped; (4) the body -- _make_tile16x16, Tile(), solve_triangular_(Tile(), lower=False) -- never executes; (5) the TypeError guard in solve_triangular_ is never validated.

The fix is to remove the @test_utils.test(arch=qd.gpu) decorator from test_tile16_solve_triangular_upper_raises, exactly as was done for the other pure-Python tests in the same file. No other changes are needed.
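
The shape of such a hardware-independent error-path test can be sketched without Quadrants at all. Everything here is hypothetical: FakeTile is a stand-in class, the exact error message is assumed (the source only shows that it matches "only lower=True"), and unittest is used in place of pytest so the example has no third-party dependencies.

```python
import unittest


class FakeTile:
    """Hypothetical stand-in for a tile type; not the Quadrants class."""

    def solve_triangular_(self, B, lower=True):
        if not lower:
            # Assumed message text; only the "only lower=True" fragment
            # is known from the review comment.
            raise TypeError("solve_triangular_: only lower=True is supported")
        return B


class UpperRaisesTest(unittest.TestCase):
    # No arch decorator: the guard is plain Python, so nothing here
    # depends on a GPU being present.
    def test_upper_raises(self):
        with self.assertRaisesRegex(TypeError, "only lower=True"):
            FakeTile().solve_triangular_(FakeTile(), lower=False)


result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(UpperRaisesTest)
)
```

Because the guard raises before any kernel work would start, a test of this form runs the same way on CPU-only and GPU machines.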

@erizmr erizmr left a comment

Collaborator

Approving on the basis that I have reviewed the design and the public facing API, tests, and they look reasonable to me.

@hughperkins

Collaborator Author

Thanks! 🙌

@hughperkins hughperkins enabled auto-merge (squash) April 13, 2026 14:08
@hughperkins hughperkins merged commit f5847f8 into main Apr 13, 2026
71 of 72 checks passed
@hughperkins hughperkins deleted the hp/tiles-2b branch April 13, 2026 19:18
npoulad1 added a commit to ROCm/quadrants that referenced this pull request Jun 8, 2026
* [Misc] Warn user to disable caching when print_ir/QD_DUMP_IR enabled (Genesis-Embodied-AI#425)

Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>

* [Build] Pin torch version to CUDA 12.8 for CUDA tests (Genesis-Embodied-AI#428)

* [Misc] Fixing up taichi-dev urls (Genesis-Embodied-AI#429)

* [Perf] Rename cuda_graph to gpu_graph across the codebase (Genesis-Embodied-AI#430)

* Misc: fix typo integeral -> integral (Genesis-Embodied-AI#434)

Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>

* [Perf] CUDA graph 4: call from multiple locations (Genesis-Embodied-AI#420)

* [Bug] Fix fastcache not restoring graph_do_while_arg (Genesis-Embodied-AI#435)

* [Perf] Cache last-call result in perf_dispatch for single-compatible case (Genesis-Embodied-AI#438)

* Fix gpu_graph fallback on old Nvidia GPU. (Genesis-Embodied-AI#443)

* Fix shared memory offset not reset between CUDA kernels. (Genesis-Embodied-AI#442)

* [Misc] Allow disabling GPU graph via QD_GPU_GRAPH=0 env var (Genesis-Embodied-AI#439)

* [Misc] Add named top-level loops (Genesis-Embodied-AI#440)

* [Misc] Rename gpu_graph to graph (Genesis-Embodied-AI#446)

* [Misc] Add cross-platform shuffle (Genesis-Embodied-AI#447)

* [Bug] Fix graph_do_while on Windows: search for cudadevrt.lib (Genesis-Embodied-AI#456)

* [Bug] Also search default CUDA toolkit install location on Windows (Genesis-Embodied-AI#461)

* [SPIRV] Feature Parity Atomics & Shared Array (Genesis-Embodied-AI#432)

* [Misc] Change clang format to 120 characters (Genesis-Embodied-AI#463)

* [Misc] CUDA graph 5 Add fatbin (Genesis-Embodied-AI#464)

* [Bug] Reuse VkInstance across init/reset cycles (Genesis-Embodied-AI#465)

* [Perf] Tiles 1: _load, _store, _eye_ (Genesis-Embodied-AI#466)

* [Misc] Remove dead InternalFuncStmt type_check override (Genesis-Embodied-AI#471)

* [Perf] Tiles 2: add cholesky and ger (Genesis-Embodied-AI#472)

* [Perf] Tiles 2b: add triangular solve (Genesis-Embodied-AI#474)

* [Misc] Refactor: use _get_col/_set_col in tiles load/store/init (Genesis-Embodied-AI#475)

* [Build] Fix flaky test_clock_accuracy (Genesis-Embodied-AI#436)

* Fix AARCH64 emitting invalid asm in CUDA kernels. (Genesis-Embodied-AI#473)

Co-authored-by: Hugh Perkins <hughperkins@gmail.com>

* [AMDGPU] Enable HIP memory pool and surface pool-exhaustion errors. (Genesis-Embodied-AI#485)

* [AMDGPU] Scope hsaco tmp dir per-user to avoid collisions. (Genesis-Embodied-AI#484)

* [Perf] Tiles 3: Add slice syntax, qd.outer() and initial doc (Genesis-Embodied-AI#477)

* [AMDGPU] Fix gradient computation. (Genesis-Embodied-AI#486)

* Enable all backends that are supported in unit tests. (Genesis-Embodied-AI#488)

* Fix SPIRV ID overflow for large kernels due to autodiff. (Genesis-Embodied-AI#489)

* [Misc] Fix purity checker to allow accessing constants from quadrants modules (Genesis-Embodied-AI#487)

* [Misc] Increase tolerance for clock monotonic test (Genesis-Embodied-AI#492)

* [CI] Serialize api doc workflow (Genesis-Embodied-AI#494)

* [CI] Increase tolerance for clock test (Genesis-Embodied-AI#506)

* [CI] Increase clock test tolerance to 20% (Genesis-Embodied-AI#509)

* [Perf] Add tensor_type parametrization to tile16 tests (Genesis-Embodied-AI#504)

* [Perf] Tiles 4b: Migrate tiles16 tests to enable fastcache (Genesis-Embodied-AI#505)

* [Perf] Tiles 4c: add Tiles16x16 proxy (Genesis-Embodied-AI#507)

* [Perf] Tiles 4d: Consolidate slice error tests using parametrize (Genesis-Embodied-AI#508)

* [Perf] Tiles 4: add SharedArray slice support (Genesis-Embodied-AI#482)

* [Perf] Tiles 5: add Cholesky benchmark demo (Genesis-Embodied-AI#483)

* [Doc] Add user guide page for subgroup shuffle (Genesis-Embodied-AI#512)

* [Perf] Implement cross-platform shuffle_down (Genesis-Embodied-AI#510)

* [Perf] Add portable subgroup reduce_add and reduce_all_add (Genesis-Embodied-AI#511)

* [Perf] Add first warmup config to perf dispatch (Genesis-Embodied-AI#422)

* [AutoDiff] Autodiff 1: Add baseline adstack regression test for unary_collections (Genesis-Embodied-AI#500)

* [AutoDiff] Autodiff 2: Implement derivative for tan (Genesis-Embodied-AI#501)

* [AutoDiff] Autodiff 3: Recompute tanh/exp on the operand in the reverse pass (Genesis-Embodied-AI#502)

* [AutoDiff] Autodiff 4: Mark rsqrt as non-linear for adstack promotion (Genesis-Embodied-AI#503)

* [AutoDiff] Autodiff 5: Fix adjoint-alloca placement for GlobalLoads outside the current range-for (Genesis-Embodied-AI#496)

* [AutoDiff] Autodiff 6: Adstack regression tests (Genesis-Embodied-AI#491)

* [AutoDiff] Autodiff 7: Fix header size in AdStackAllocaStmt to match u64 runtime layout (Genesis-Embodied-AI#534)

* [AutoDiff] Autodiff 8: Surface LLVM adstack push/pop overflow as a Python exception (Genesis-Embodied-AI#535)

* [AutoDiff] Autodiff 9: Guard against LLVM worker-thread stack overflow from large per-task adstack budget (Genesis-Embodied-AI#495)

* [AutoDiff] Autodiff 10: Implement adstack for SPIR-V (Genesis-Embodied-AI#490)

* [AutoDiff] Autodiff 11: Latent adstack-adjacent fixes (AMDGPU hipFree, flush() keeps ctx_buffers_, always-preallocate) (Genesis-Embodied-AI#536)

* [Doc] Add AGENTS.md with instructions for AI agents (Genesis-Embodied-AI#541)

* [Bug] Abort kernel execution on assertion failure instead of segfaulting (Genesis-Embodied-AI#419)

* [Type] ndarray typing 1: Add eval_str=True to inspect.signature() calls (Genesis-Embodied-AI#411)

* [CI] Suppress reportPrivateImportUsage in torch-using files (Genesis-Embodied-AI#552)

* [Misc] QD_DUMP_IR dumps to files with the task_id added to the filename (Genesis-Embodied-AI#441)

* [Type] ndarray typing 2: Fix NDArray single-arg subscript crash (Genesis-Embodied-AI#412)

* [Test] Flush xdist channel before worker exit so test failure reports are visible (Genesis-Embodied-AI#555)

* [CI] Reduce test retries on CI from 3 to 1. (Genesis-Embodied-AI#554)

* [AutoDiff] Autodiff 12: Heap-backed adstack on LLVM backends (CPU/CUDA/AMDGPU) (Genesis-Embodied-AI#537)

* [AutoDiff] Autodiff 13: Heap-backed adstack on SPIR-V backends (Metal, Vulkan) (Genesis-Embodied-AI#493)

* [AutoDiff] Autodiff 14: Resolve bounded-inner-loop adstacks without default_ad_stack_size fallback (Genesis-Embodied-AI#539)

* [SPIRV] Vulkan SPIR-V correctness: atomic-view aliasing, PSB stride, narrow storage caps, u1 cast, per-init layer recheck (Genesis-Embodied-AI#513)

* [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch and sanitise MoltenVK cap advertisement (Genesis-Embodied-AI#551)

* [Test] Suppress stock pytest-timeout to avoid conflict with pytest_hardtle (Genesis-Embodied-AI#557)

* [Vulkan] Use SDK validation layer for debugPrintf instead of apt package (Genesis-Embodied-AI#562)

* [Test] Fix flaky perf_dispatch tests by increasing work amounts (Genesis-Embodied-AI#559)

* [Test] Add --maxfail CLI option to run_tests.py (default 20) (Genesis-Embodied-AI#558)

* [CI] Vulkan debug printf fix to address flaky tests (Genesis-Embodied-AI#563)

* [Docs] Add a new page to help for first time contributors (Genesis-Embodied-AI#426)

Authored-by: v01dxyz <v01dxyz@v01d.xyz>

* [AutoDiff] Autodiff 16: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr (Genesis-Embodied-AI#543)

* Fix: raise error if device memory allocation fails (Genesis-Embodied-AI#451) (Genesis-Embodied-AI#453)

Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>

* [CI] Add CI job to check line wrapping of comments and docs (Genesis-Embodied-AI#564)

* [Misc] Add coverage report to PRs, including kernels (Genesis-Embodied-AI#470)

* [CI] CI wrap check feeds only diffs to agent (Genesis-Embodied-AI#567)

* Skip 'flaky' test on MacOS CI. (Genesis-Embodied-AI#573)

* [Test] Fix missing `import sys` in test_fail_device_memory_allocation (Genesis-Embodied-AI#574)

* [CI] Fix Vulkan debugPrintf flake with session-scoped warmup (Genesis-Embodied-AI#571)

* [AutoDiff] determine_ad_stack_size: replace whole-CFG Bellman-Ford with SCC + DAG DP (Genesis-Embodied-AI#575)

* [Test] Fix macOS OOM skip reason to describe actual root cause (Genesis-Embodied-AI#576)

* [Lang] whole_kernel_cse: 2.5x compile time speedup on large kernels (Genesis-Embodied-AI#577)

* [CI] Add CI check for unnecessarily deleted comments (Genesis-Embodied-AI#570)

* [CI] Migrate coverage report to github Check page (Genesis-Embodied-AI#566)

* [Lang] Skip IR verifier between passes unless debug=true (Genesis-Embodied-AI#579)

* [Lang] Inline AdStack ops on release LLVM codegen: dramatically reduces compile time for adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#584)

* [CUDA] Honor offline_cache=False end-to-end so QD_OFFLINE_CACHE=0 actually gives a cold compile (Genesis-Embodied-AI#580)

* [Type] Tensor 24 (Genesis-Embodied-AI#561)

Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>

* [Lang] auto_diff host-walk reductions: dramatically faster front-end compile time on adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#587)

* [AutoDiff] Speed up reverse-mode kernel launches on GPU backends (Genesis-Embodied-AI#578)

* [Vulkan] Move adstack-sizer scratch out of Function-scope memory to fix SPIR-V pipeline build failures (Genesis-Embodied-AI#588)

* [AutoDiff] Improve diagnosis of unsupported reverse-mode AD patterns (Genesis-Embodied-AI#590)

* [Bug] Fix: promote Ndarray to AnyArray in build_Name for flattened struct fields (Genesis-Embodied-AI#592)

* [SPIR-V] Shrink reverse-grad kernel MSL by ~50% (Genesis-Embodied-AI#591)

* [CI] Add CI check that PR changes have test coverage (Genesis-Embodied-AI#596)

* [Perf] Enable zero-copy in to_torch() and to_numpy() (Genesis-Embodied-AI#450)

* Add BufferView: safe sub-range ndarray access for kernels (Genesis-Embodied-AI#585)

Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>

* [Doc] Add user-facing fastcache documentation (Genesis-Embodied-AI#597)

Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>

* [Misc] Upgrade to enable v1 dlpack so to_numpy(copy=False) writable (Genesis-Embodied-AI#598)

Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local>

* [AutoDiff] Cut reverse-mode adstack memory usage 10x on all backends (Genesis-Embodied-AI#599)

* [Misc] Add CI check for feature file factorization (Genesis-Embodied-AI#606)

* [Perf] Skip _recursive_set_args for all-Field frozen dataclass structs (Genesis-Embodied-AI#607)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [AutoDiff] SNode-arm bound-expr capture rejects fold-attack gate indices (Genesis-Embodied-AI#610)

* [Misc] Suppress field fastcache warning for qd.Tensor (Genesis-Embodied-AI#615)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [AutoDiff] Adstack heap: clip reducer count by per-task loop trip count (compile-time and SizeExpr-evaluated) (Genesis-Embodied-AI#611)

* [Misc] Forward copy= through qd.Tensor, add copy=None option (Genesis-Embodied-AI#616)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Doc] Update README (Genesis-Embodied-AI#617)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Fix coverage report showing def lines as uncovered (Genesis-Embodied-AI#623)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Perf] Generic launcher: persistent context, JIT-pointer reuse, Metal compute encoder, LLVM-GPU async memory ops (Part 1/2) (Genesis-Embodied-AI#619)

* [CI] Encode Python-first testing policy in coverage-check prompt (Genesis-Embodied-AI#622)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Add PR Line change report (Genesis-Embodied-AI#624)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Disable quadrants pytest plugin during quadrants internal coverage runs (Genesis-Embodied-AI#629)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [AutoDiff] Adstack load+store eliminations: EliminateRecomputableAdStackPushes pass + leaf extensions (Genesis-Embodied-AI#621)

* [CI] Simplify coverage PR comment to a single linked line (Genesis-Embodied-AI#630)

* [CUDA] Add AGX Thor, SM_110 (Genesis-Embodied-AI#631)

Co-authored-by: Johnny Nunez and Hugh Perkins

* [CI] Lines changed report: collapse PR comment to a single linked totals line (Genesis-Embodied-AI#632)

* [FEATURE] Support external Metal command queue via qd.init (Genesis-Embodied-AI#618)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Perf] Cache adstack-sizer metadata per task across SPIR-V + LLVM-GPU; per-snode / DeviceAllocation invalidation (Part 2/2) (Genesis-Embodied-AI#620)

* [AutoDiff] Disable EliminateRecomputableAdStackPushes pending mutated-SNode chain-leaf fix (Genesis-Embodied-AI#633)

* [AutoDiff] Adstack chain-clone safety: mutated-SNode leaf reject + load_top consumer-aware guard (Genesis-Embodied-AI#634)

* [Docs] Add user-guide page for qd.simt.block.* primitives (Genesis-Embodied-AI#638)

* [Docs] Expand qd.simt.subgroup user-guide page to cover every op (Genesis-Embodied-AI#639)

* [Perf] Streams 1-4 (Genesis-Embodied-AI#410)

* [Docs] Add user-guide page for matrix decompositions and solvers (Genesis-Embodied-AI#643)

* [Bug] Revert "[Perf] Streams 1-4 (Genesis-Embodied-AI#410)" (Genesis-Embodied-AI#650)

* [Docs] Add user-guide page for atomics and bit operations (Genesis-Embodied-AI#640)

* [Docs] Add user-guide page for qd.simt.grid.* primitives (Genesis-Embodied-AI#641)

* [AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch (Genesis-Embodied-AI#635)

* [AMDGPU] Fix amdgpu parallel rand init (Genesis-Embodied-AI#658)

* [Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap (Genesis-Embodied-AI#655)

* [Perf] Re-land Streams 1-4 with bug fixes (Genesis-Embodied-AI#653)

* [AMDGPU] Apply device_memory_GB=0.3 cap to AMDGPU tests (Genesis-Embodied-AI#659)

* [Perf] Per-launch host sync: drop wait_idle on SPIR-V, pin stream and drop stream_synchronize on CUDA/AMDGPU (Genesis-Embodied-AI#654)

* [AMDGPU] Unload hipModule_t in JITModuleAMDGPU destructor (Genesis-Embodied-AI#660)

* [AMDGPU] Trim default mempool on qd.reset() (Genesis-Embodied-AI#669)

* [AMDGPU] Hoist rand-state buffer to process lifetime (Genesis-Embodied-AI#668)

* [Streams] Use events for streams serialization on AMDGPU and CUDA (Genesis-Embodied-AI#667)

* [Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id (Genesis-Embodied-AI#671)

* [SPIR-V] dispatch_max_reducers: register each task with the real kernel name (Genesis-Embodied-AI#675)

* [AutoDiff] Debug-mode field/grad/dual: dtype, layout, and access-time invariants (Genesis-Embodied-AI#677)

* [Docs] Add user-guide page for qd.algorithms.* device-wide algorithms (Genesis-Embodied-AI#642)

Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>

* [Docs] Doc for existing atomics: switch support table to per-backend columns (Genesis-Embodied-AI#657)

Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>

* [GPU] Cross gpu atomics (Genesis-Embodied-AI#666)

Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>

* [GPU] Make block operations portable cross-gpu (Genesis-Embodied-AI#664)

* [Perf] CPU LLVM adstack-cache: skip per-launch bump-writes + ndarray_shapes capture on forward-only handles (Genesis-Embodied-AI#685)

* [GPU] Cross-GPU for grid ops (Genesis-Embodied-AI#670)

* [Math] Make bitop operations portable cross-gpu (Genesis-Embodied-AI#662)

* [AMDGPU] Always use wave64, on both RDNA and CDNA (Genesis-Embodied-AI#687)

* [AMDGPU] Use syncscope("agent") for atomix xor to avoid CAS livelock (Genesis-Embodied-AI#672)

* [GPU] New bit ops for QIPC (Genesis-Embodied-AI#679)

* [GPU] Subgroup ops cross-gpu (Genesis-Embodied-AI#665)

* [Graph] Rename CUDA Graph to Graph in docs (Genesis-Embodied-AI#691)

* [SPIR-V] Fix FIFO-queue ordering when sharing command queue. (Genesis-Embodied-AI#694)

* [Atomics] New QIPC ops for atomics (Genesis-Embodied-AI#690)

* Pass dataclass sub-structs into qd.func (Genesis-Embodied-AI#698)

* [AMDGPU] HIP graph runtime support for @qd.kernel(graph=True) (Genesis-Embodied-AI#692)

* [CI] Add per-file timing report to Mac Metal test job (Genesis-Embodied-AI#695)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Enable kernel disk cache during tests (Genesis-Embodied-AI#696)

* [Math] New QIPC ops for single-threaded linalg (Genesis-Embodied-AI#683)

* [BREAKING][GPU] New QIPC ops for subgroups (Genesis-Embodied-AI#676)

* [GPU] New QIPC ops for block (Genesis-Embodied-AI#684)

* [GPU] New device-level ops for QIPC (Genesis-Embodied-AI#693)

* [algorithms] PrefixSumExecutor: drop unused GRID_SZ local (Genesis-Embodied-AI#701)

* [block] sync(): fix unsupported-arch error message (Genesis-Embodied-AI#700)

* [volatile_load] add qd.volatile_load primitive (closes Genesis-Embodied-AI#648) (Genesis-Embodied-AI#702)

* [AutoDiff] Reject recycled identity_key in AdStackCache::register_adstack_sizing_info (Genesis-Embodied-AI#708)

* [Vulkan] Declare GroupNonUniform SPIR-V caps and enable shaderSubgroupExtendedTypes (Genesis-Embodied-AI#707)

* Fix duplicate HIP graph driver-function declarations after v1.0.0 merge

The amd-integration fork had cherry-picked the HIP graph driver functions
(graph_create / graph_destroy / graph_add_kernel_node / graph_instantiate /
graph_exec_destroy / graph_launch), and upstream v1.0.0 added the same set.
The per-file 3-way merge appended both copies into
amdgpu_driver_functions.inc.h, producing redeclaration errors that broke the
AMDGPU RHI/runtime compile. Drop the upstream duplicate block; the signatures
are identical to the fork's existing declarations.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix AMDGPU launcher coherence and num_instructions visibility after v1.0.0 merge

- kernel_launcher.cpp: the 3-way merge spliced upstream v1.0.0's launch_llvm_kernel
  rewrite (ephemeral arg/context buffers, explicit-stream path, AmdgpuDefaultStream
  PinGuard) onto the AMD fork's kernarg-by-value + persistent-scratch design,
  leaving references to undefined `ephemeral_context_ptr`. Restore the fork's
  coherent launch_llvm_kernel verbatim; it calls the (already merged) enhanced
  launch_offloaded_tasks, which keeps the max-reducer dispatch and stream-parallel
  groups adapted onto the AMD launch path.
- llvm_context.h: both the fork and upstream added `num_instructions`; the merge
  kept upstream's private placement, but the AMDGPU codegen force-inline heuristic
  calls it statically from outside the class. Move it back to the public section.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Restore async result D2H and hoist kernarg vectors in AMDGPU launcher

The v1.0.0 merge resolution regressed two amd-integration baseline
optimizations in launch_llvm_kernel / launch_offloaded_tasks:

  - The per-launch result-buffer copy was a blocking memcpy_device_to_host,
    forcing a host stall on every value-returning launch and serializing the
    GPU pipeline. Restore the async D2H (the caller synchronizes lazily when it
    needs the value); external-array transfers still stream_synchronize once
    before reading back.

  - launch_task constructed the kernarg std::vectors from initializer lists
    ({kernarg_payload} / {kernarg_size}) on every dispatch (heap alloc + free
    per launch). Hoist arg_ptrs/arg_sizes out of the per-task launch and reuse.

Co-authored-by: Cursor <cursoragent@cursor.com>

* amdgpu: default to LDS permlane64 emulation; drop host-x86 barrier asm on retarget

Two AMDGPU JIT-compile crashes surfaced after the v1.0.0 merge pulled in the QIPC subgroup
ops (Genesis-Embodied-AI#676), which made the rigid constraint solver's wave-cooperative reductions route through
`amdgpu_cross_half_shuffle_i32`. Both manifested as a SIGSEGV inside
`llvm::SIInstrInfo::getInstSizeInBytes` during `JITSessionAMDGPU::compile_module_to_hsaco`
(i.e. at first kernel launch), and reproduce on gfx942 / MI300X. Baseline 0.4.6 never emitted
these constructs, which is why it was unaffected.

1. Native `llvm.amdgcn.permlane64` lowering crashes the bundled LLVM 22.1.0 AMDGPU backend.
   Default `amdgpu_permlane64` to the existing LDS-roundtrip software emulation on every target
   (it produces identical results). Add `QD_AMDGPU_USE_NATIVE_PERMLANE64=1` to opt back into the
   native instruction once the backend bug is fixed; the old `QD_AMDGPU_FORCE_PERMLANE64_FALLBACK`
   is now the default and still honored. This is the actual crash fix.

2. The runtime module is compiled by the host x86_64 clang and only retargeted to amdgcn here, so
   `amdgpu_cross_half_shuffle_i32`'s `__asm__ volatile("" : "+v"(byte))` optimization barrier carries
   x86 flag clobbers (`~{dirflag},~{fpsr},~{flags}`) that are meaningless on AMDGPU. The IR verifies
   but the empty-body INLINEASM is invalid on the amdgcn target. Neutralize empty-body barrier asm
   during retarget (forward the tied value, then erase) so no stale host asm reaches codegen. On the
   wave64 targets we ship, `ds_bpermute` already addresses the full wave, so the hint is a no-op.
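
The selection rule in item 1 can be sketched as follows. This is a Python illustration only, not the runtime's C++ logic. The function name `use_native_permlane64` is hypothetical. Two details are assumptions: only the literal value `1` counts as "set", and the fallback variable wins when both are set.

```python
def use_native_permlane64(env):
    """Return True only when the native instruction is explicitly requested.

    `env` is any mapping of environment variable names to values, for
    example os.environ. Hypothetical helper, for illustration only.
    """
    # Assumption: only the literal value "1" counts as enabled.
    force_fallback = env.get("QD_AMDGPU_FORCE_PERMLANE64_FALLBACK") == "1"
    want_native = env.get("QD_AMDGPU_USE_NATIVE_PERMLANE64") == "1"
    # LDS emulation is the default; the old fallback flag is still honored.
    # Assumption: the fallback flag takes precedence if both are set.
    return want_native and not force_fallback
```

With an empty environment the function returns `False`, matching the new default of software emulation on every target.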

Co-authored-by: Cursor <cursoragent@cursor.com>

* style: apply clang-format (v19.1.7) to AMDGPU fn_attrs and launcher sources

CI pre-commit's clang-format hook reformatted these files (long
declarations/lambda signatures collapsed onto single lines per the repo's
clang-format config). Apply the same formatting so the hook passes.

No functional changes.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(amdgpu): use CreateNeg for branchless i32 sgn instead of CreateSub(0, input)

clang-tidy (modernize-use-nullptr, -warnings-as-errors) flagged
`builder->CreateSub(0, input)` in the i32 sgn path: the literal `0` binds to
the `llvm::Value*` LHS parameter as a null pointer, not an integer zero.
Replace with `builder->CreateNeg(input)`, which emits `0 - input` with a proper
zero constant -- identical intended semantics, and clang-tidy clean.
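
For context, one common branchless formulation of sign on 32-bit integers combines an arithmetic shift of the input with a logical shift of its negation, which is where `0 - input` comes in. The Python sketch below emulates that on i32 values. The exact instruction sequence the AMDGPU codegen emits is not shown in this commit message, so the formula and the name `sgn_i32` are assumptions.

```python
MASK32 = 0xFFFFFFFF


def sgn_i32(x):
    """Branchless sign of a signed 32-bit integer: returns -1, 0, or 1."""
    assert -2**31 <= x < 2**31, "x must fit in a signed 32-bit integer"
    neg = (-x) & MASK32  # 0 - x, wrapped to 32 bits like an i32 negate
    high = x >> 31       # arithmetic shift: -1 if x < 0, else 0
    low = neg >> 31      # logical shift: 1 if x > 0 or x == INT32_MIN, else 0
    return high | low
```

The negation must use a real integer zero as its left operand. In the flagged `CreateSub(0, input)` call the literal `0` bound to a pointer parameter instead.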

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Robert Dazi <14996868+v01dXYZ@users.noreply.github.com>
Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>
Co-authored-by: Alexis DUBURCQ <alexis.duburcq@gmail.com>
Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>
Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>
Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com>
Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Johnny <johnnynuca14@gmail.com>