Skip to content

[Misc] Upgrade to enable v1 dlpack so to_numpy(copy=False) writable#598

Merged
hughperkins merged 58 commits into
mainfrom
hp/dlpack-v1-numpy
Apr 30, 2026
Merged

[Misc] Upgrade to enable v1 dlpack so to_numpy(copy=False) writable#598
hughperkins merged 58 commits into
mainfrom
hp/dlpack-v1-numpy

Conversation

@hughperkins

Copy link
Copy Markdown
Collaborator

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

ScalarField.to_torch and MatrixField.to_torch now accept copy=False
to return a zero-copy DLPack-backed tensor, copy=True to force a copy,
and copy=None (default) to prefer zero-copy with kernel-copy fallback.

Handles metal sync, device transfers, 0-dim scalar edge cases, and
unsupported dtype/arch rejection internally so callers don't need to.
DataTypeCxx is a wrapper class, not an enum — it doesn't have f32/f64/etc
attributes. Use the proper Python-side type constants from
quadrants.types.primitive_types instead.
ScalarNdarray, MatrixNdarray, and VectorNdarray.to_torch now accept
copy= (None/True/False) to match the Field API. When copy is not True,
DLPack zero-copy is attempted first, falling back to kernel copy.

This fixes TypeError when genesis calls value.to_torch(copy=None) on
ndarray types.
copy=None (default) now always does kernel copy, matching the old
behavior. DLPack zero-copy is only attempted when copy=False is
explicitly passed. This prevents StructField.to_torch() from getting
garbage data — struct members have interleaved (AoS) memory that
DLPack's contiguous stride assumption can't represent.
…ehavior

Tests cover copy=False (DLPack zero-copy), copy=None (kernel copy),
AoS struct members (garbage via DLPack, correct via kernel), SoA
struct members (correct zero-copy), and Vulkan rejection.
Reverts unintended ruff format changes that bloated the diff. Re-applies
only the logical changes (copy= parameter, _try_zerocopy_torch helpers)
on top of the original formatting, then ran pre-commit run -a (black +
ruff + pylint) to match project conventions.
Instead of hardcoding arch=[cpu, cuda], use @test_utils.test() with no
arch filter so tests run on all available backends. Tests requiring
DLPack zero-copy call _skip_if_no_zerocopy() to skip gracefully on
backends like Vulkan or old-torch Metal.
Replace _can_zerocopy_field calls in test skip logic with a simple
_NO_ZEROCOPY_ARCHS = {qd.vulkan} set, so tests don't depend on the
implementation to decide whether to run.
Recovers docs/source/user_guide/interop.md from the pre-reset branch,
removes the layout= section (not in current design), updates copy=None
semantics to reflect that it always does a kernel copy, and fixes the
caching section to reflect that Quadrants doesn't cache views internally.
copy=None and copy=True had identical behavior (kernel copy). Defaulting
to True is simpler and more explicit -- no need for a three-valued flag
when only two behaviors exist (copy vs zero-copy).
…_dlpack

Mirrors the to_torch(copy=) API. copy=True (default) returns an
independent kernel-copied array, copy=False returns a zero-copy DLPack
view (CPU backends only, since numpy arrays cannot reference GPU memory).

Adds _try_zerocopy_numpy helper in field.py and propagates copy= through
ScalarField, MatrixField, StructField, ScalarNdarray, MatrixNdarray, and
VectorNdarray. Includes tests for all types plus GPU-raises test.
…ield

AOS struct members have interleaved memory (stride = sizeof(cell)) but
the C++ DLPack export emits contiguous strides at the member dtype size,
so a zero-copy view silently returns garbage. Now _can_zerocopy_field
detects multi-member parent snodes and returns False, causing copy=False
to raise ValueError instead of returning corrupted data.

Fixes the test to assert ValueError rather than codifying the corruption.
…or numpy

Two fixes from cluster test run:

1. _is_aos_struct_member was too aggressive for MatrixField: a plain
   VectorField has parent_snode.get_num_ch() == n*m (matrix elements),
   not struct members. Now checks grandparent for MatrixField.

2. np.from_dlpack requires DLPack v1 (__dlpack__/__dlpack_device__) but
   Quadrants' to_dlpack() returns a raw PyCapsule (v0). Added
   _DLPackV1Adapter wrapper.
…sync doc

1. MatrixField.to_torch/to_numpy(copy=False, keep_dims=True) on vector
   fields (m==1) returned shape (*shape, n) instead of (*shape, n, 1)
   because the DLPack export collapses m=1 but the reshape only handled
   the as_vector case. Now always normalises to the expected shape.

2. interop.md incorrectly claimed copy=True calls torch.mps.synchronize()
   on Metal -- it doesn't, since copy=True skips _try_zerocopy_torch
   entirely. Corrected the doc.
…ro-copy

copy=False (zero-copy view): only qd.sync(), no torch.mps.synchronize()
copy=True  (kernel copy):    qd.sync() + torch.mps.synchronize()

Also simplifies _try_zerocopy_torch since it's now only called for
copy=False -- removes dead copy=True clone branch and always raises
on failure instead of returning None.
SNode.place flattens vec/mat components directly under the struct cell
(no intermediate matrix SNode), so parent(2) overshoots to root and the
bare except swallows the error, returning False -- allowing silent data
corruption for vec/mat members of AOS structs.

Fix: use parent(1) for both ScalarField and MatrixField, but compare
num_ch > n*m for MatrixField (own components) vs > 1 for ScalarField.

Added test for vec3 AOS struct member to lock the fix.
- test_vector_field_copy_false_keep_dims: asserts copy=False and copy=True
  produce the same shape when keep_dims=True (to_torch)
- test_vector_field_to_numpy_copy_false_keep_dims: same for to_numpy
- test_metal_copy_true_syncs_mps: verifies tensor is usable immediately
  after copy=True on Metal
- test_metal_copy_false_no_mps_sync: verifies copy=False path works on
  Metal (with explicit torch.mps.synchronize() by caller)
…opy=False

torch.device('cuda:0') != torch.device('cuda') because PyTorch requires
both type and index to match. DLPack tensors always carry an explicit
index, so passing device='cuda' without an index would raise even though
no transfer is needed. Now compares type strictly and index only when
both sides specify one.
_try_zerocopy_torch and _try_zerocopy_numpy take Field as the type hint
but call field.to_dlpack(), which only exists on ScalarField/MatrixField/
Ndarray subclasses. Add a NotImplementedError stub on Field so pyright
can resolve the attribute.
SOA layout places each component in its own dense subtree, so
parent.get_num_ch() == 1 < n*m. The previous check (> n*m) missed this
and allowed DLPack export with wrong strides, silently corrupting data.

Changed to != n*m for MatrixField and != 1 for ScalarField, so both SOA
(num_ch < expected) and AOS struct members (num_ch > expected) are
rejected. Only standalone AOS layout (num_ch == expected) allows
zero-copy.

Added test for SOA Vector.field copy=False raising ValueError.
…DA GPUs

Restrict to arch=[qd.cuda] since qd.cfg.arch.name gives 'amdgpu'/'metal'
which are not valid torch device strings (torch expects 'cuda'/'mps').
@duburcqa

Copy link
Copy Markdown
Contributor

ok to merge

@github-actions

Copy link
Copy Markdown

Coverage Report (f399f11f1)

File Coverage Missing
🔴 python/quadrants/lang/_ndarray.py 33% 90,106
🔴 python/quadrants/lang/field.py 67% 528
🔴 python/quadrants/lang/matrix.py 50% 1293
🟢 tests/python/test_to_torch_copy.py 100%

Diff coverage: 89% · Overall: 67% · 36 lines, 4 missing

Full annotated report

Pyright resolves the field type as the base Field class, which lacked the
versioned parameter, causing reportCallIssue.
Base automatically changed from hp/zerocopy-interop to main April 30, 2026 17:08
Resolve conflicts keeping v1 DLPack changes (writable numpy views,
updated docstrings, to_dlpack docs section, writability tests).
Three fixes based on review feedback:

1. _patch_field_dlpack_canonical now handles both v0 ("dltensor") and v1
   ("dltensor_versioned") capsules with correct struct layouts. Previously
   it hardcoded the v0 name, breaking layout-tagged fields (order='ji')
   when the numpy path requested a v1 capsule.

2. Version-gate versioned=True behind numpy >= 2.1 (which added v1 DLPack
   consumer support). On numpy 1.26-2.0, fall back to v0 capsules
   (read-only but functional, matching pre-upgrade behavior).

3. Add tests for layout-tagged field to_numpy(copy=False), and guard
   writability tests behind numpy >= 2.1.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional findings (outside current diff — PR may have been updated during review):

  • 🔴 python/quadrants/lang/field.py:528-540 — Tensor wrapper at python/quadrants/_tensor_wrapper.py:208-209 was not updated for the new versioned= parameter, so qd.tensor(qd.f32, shape=(4,)).to_dlpack(versioned=True) raises TypeError: to_dlpack() got an unexpected keyword argument 'versioned'. The wrapper docstring (lines 7-9) explicitly markets qd.Tensor as what qd.tensor()/qd.Vector.tensor()/qd.Matrix.tensor() return, so the new writable-numpy entry point is unreachable through the documented modern factory. Fix is one line: def to_dlpack(self, versioned: bool = False) -> typing.Any: return self._impl.to_dlpack(versioned=versioned).

    Extended reasoning...

    What is the bug

    This PR adds a new versioned= parameter to the underlying to_dlpack methods on three impls:

    • ScalarField.to_dlpack (field.py:528)
    • MatrixField.to_dlpack (matrix.py:1293)
    • Ndarray.to_dlpack (_ndarray.py:90)

    But the public Tensor wrapper at python/quadrants/_tensor_wrapper.py:208-209 is hardcoded as:

    def to_dlpack(self) -> typing.Any:
        return self._impl.to_dlpack()

    — no versioned parameter, no forwarding. The wrapper file is not in this PR's diff, which is the structural reason the gap was missed.

    Why this matters

    Per the wrapper module docstring (lines 7-9), qd.Tensor is the modern public class returned by qd.tensor(), qd.Vector.tensor(), and qd.Matrix.tensor(). Line 18 explicitly lists to_dlpack as a wrapper forward. Verifying the call chain in _tensor.py: qd.tensor(...)_wrap_impl(...) → returns a Tensor wrapper instance.

    The new interop docs section Raw DLPack export with to_dlpack() (interop.md:209-232) advertises f.to_dlpack(versioned=True) as the documented entry point for writable numpy zero-copy. Users who follow the modern qd.tensor() factory pattern and reach for the new feature will hit the TypeError immediately.

    Step-by-step proof

    import quadrants as qd
    qd.init(arch=qd.cpu)
    
    t = qd.tensor(qd.f32, shape=(4,))
    # 1. qd.tensor(...) returns a Tensor wrapper (not a bare ScalarField/Ndarray).
    # 2. Tensor.to_dlpack signature is (self) — no kwargs.
    # 3. Python rejects the call before any forwarding happens.
    t.to_dlpack(versioned=True)
    # TypeError: to_dlpack() got an unexpected keyword argument 'versioned'

    By contrast, the bare-impl path works:

    f = qd.field(qd.f32, shape=(4,))
    f.to_dlpack(versioned=True)   # OK — bare ScalarField has the new kwarg

    The new docs example uses bare qd.field(...) rather than qd.tensor(...), so a user who copy-pastes the doc literally will not trip — but anyone using the wrapper factory will.

    Why existing tests do not catch it

    The new tests in tests/python/test_to_torch_copy.py (test_scalar_field_to_numpy_copy_false_is_writable, etc.) all go through f.to_numpy(copy=False) on bare qd.field(...) / qd.ndarray(...) — never through qd.tensor(...). So the wrapper's to_dlpack is not exercised on the new code path.

    How to fix

    One-line change in _tensor_wrapper.py:

    def to_dlpack(self, versioned: bool = False) -> typing.Any:
        return self._impl.to_dlpack(versioned=versioned)

    Note on scope

    The same wrapper (lines 192, 200) also drops the copy= kwarg on to_numpy/to_torch, so users of qd.tensor() already cannot reach copy=False zero-copy through the wrapper either. That is a pre-existing gap and out of scope for this bug, but the cleanest fix is consistent with closing it: forward the new kwargs symmetrically.

Comment thread docs/source/user_guide/interop.md Outdated
np.from_dlpack does not accept raw PyCapsules -- it requires an object
exposing __dlpack__/__dlpack_device__. Replace the broken example with
guidance to use to_numpy(copy=False) for NumPy, which handles the
protocol adapter internally.
Comment thread python/quadrants/lang/field.py Outdated
ctypes.pythonapi is a PyDLL which auto-raises ValueError when
PyCapsule_GetPointer fails (name mismatch), so the else branch was
unreachable. Use PyCapsule_IsValid to dispatch without raising.
Comment thread python/quadrants/lang/_ndarray.py
@hughperkins

Copy link
Copy Markdown
Collaborator Author

@claude review

Comment thread quadrants/python/dlpack_funcs.cpp
Comment thread docs/source/user_guide/interop.md
The PyCapsule destructor was a no-op, so unconsumed capsules (still
named "dltensor"/"dltensor_versioned") leaked the DLManagedTensor(Versioned)
struct, shape/strides arrays, and (for ndarrays) the pybind11::object
pinning the Python wrapper. Now follows the standard producer convention:
check if capsule is still unconsumed via PyCapsule_IsValid, and invoke
the inner deleter if so.

Also add inline comment in interop.md noting NumPy >= 2.1 requirement
for writable assignment.

Made-with: Cursor
@github-actions

Copy link
Copy Markdown

Coverage Report (b1fd13443)

File Coverage Missing
🔴 python/quadrants/_tensor_wrapper.py 0% 208-209
🔴 python/quadrants/lang/_ndarray.py 33% 90,110
🔴 python/quadrants/lang/field.py 55% 69-70,73,76,116-117,122,539,556
🔴 python/quadrants/lang/matrix.py 50% 1293
🟢 tests/python/test_to_torch_copy.py 93% 491,506,521,555

Diff coverage: 79% · Overall: 74% · 85 lines, 18 missing

Full annotated report

@hughperkins

Copy link
Copy Markdown
Collaborator Author

@claude review

…ack(versioned=True)

Covers the two test coverage gaps flagged by CI: writable v1 capsule
tests for MatrixField and the Tensor wrapper forwarding.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All previously flagged issues addressed (versioned-capsule dispatch, layout-tag patch path, capsule deleter ownership, Tensor wrapper plumbing, doc thresholds, round-trip example caveat).

Extended reasoning...

Overview

This PR upgrades the DLPack export path to optionally emit DLPack v1 (DLManagedTensorVersioned) capsules so to_numpy(copy=False) can return writable arrays on NumPy >= 2.1. Touches the C++ DLPack producer (dlpack_funcs.cpp/h, export_lang.cpp), the Python to_dlpack wrappers on Ndarray/ScalarField/MatrixField/Tensor, the canonical-view ctypes patch helper, the user-guide interop docs, and a sizable batch of new tests in test_to_torch_copy.py.

Security risks

None applicable. This is an in-process buffer export ABI; there is no auth, network, deserialization of untrusted input, or privileged operation. The only resource-management concern (the empty v0 capsule deleter pre-existing in the v0 path, plus the new v1 path) was flagged earlier in this review and the current diff fixes both: the deleters now dispatch on PyCapsule_IsValid and invoke the inner managed-tensor deleter when the capsule is GC'd unconsumed, so producer-side leaks of DLManagedTensor*, the heap shape/strides, and the pybind11::object* pinning the ndarray wrapper are no longer possible.

Level of scrutiny

Higher than a typical config change because this is C++/Python ABI code at the boundary with PyTorch and NumPy, with manual memory management on both sides of the v0/v1 split, ctypes structure punning to mutate capsule contents in place, and version-dependent consumer behaviour. I ran multiple rounds of bug hunting on it and posted six distinct issue reports across the cycle — runtime regression on NumPy < 2.1, layout-tagged + v1 capsule patch path, PyDLL-auto-raise dead code, capsule deleter ownership leak, Tensor wrapper missing the new kwarg, and doc snippet accuracy. Each iteration of the diff has substantively addressed the concrete issue raised. I have re-walked the current code (field.py:107-122, _tensor_wrapper.py:208-209, dlpack_funcs.cpp:329-356, :460-498, interop.md:101, :140, :208-232) and confirmed the fixes are correct.

Other factors

The bug hunting system reported zero bugs on this run. A human reviewer (duburcqa) said "ok to merge". CI coverage for the diff is 89% with the missing lines being the no-torch ModuleNotFoundError branches and a couple of fall-through paths, which is acceptable. The new tests cover the writable-numpy path, the layout-tagged + numpy zero-copy combination (the regression I flagged in an earlier round), and the Tensor-wrapper passthrough. Given how iterative this review has been and how thoroughly the surface has been exercised, I'm comfortable shadow-approving rather than asking for another human pass.

@github-actions

Copy link
Copy Markdown

Coverage Report (d7bb21d30)

File Coverage Missing
🔴 python/quadrants/_tensor_wrapper.py 50% 208
🔴 python/quadrants/lang/_ndarray.py 33% 90,110
🔴 python/quadrants/lang/field.py 55% 69-70,73,76,116-117,122,539,556
🔴 python/quadrants/lang/matrix.py 50% 1293
🟢 tests/python/test_to_torch_copy.py 93% 491,506,521,555,577,595

Diff coverage: 83% · Overall: 74% · 110 lines, 19 missing

Full annotated report

@hughperkins hughperkins merged commit 8d20beb into main Apr 30, 2026
54 checks passed
@hughperkins hughperkins deleted the hp/dlpack-v1-numpy branch April 30, 2026 23:09
npoulad1 added a commit to ROCm/quadrants that referenced this pull request Jun 8, 2026
* [Misc] Warn user to disable caching when print_ir/QD_DUMP_IR enabled (Genesis-Embodied-AI#425)

Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>

* [Build] Pin torch version to CUDA 12.8 for CUDA tests (Genesis-Embodied-AI#428)

* [Misc] Fixing up taichi-dev urls (Genesis-Embodied-AI#429)

* [Perf] Rename cuda_graph to gpu_graph across the codebase (Genesis-Embodied-AI#430)

* Misc: fix typo integeral -> integral (Genesis-Embodied-AI#434)

Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>

* [Perf] CUDA graph 4: call from multiple locations (Genesis-Embodied-AI#420)

* [Bug] Fix fastcache not restoring graph_do_while_arg (Genesis-Embodied-AI#435)

* [Perf] Cache last-call result in perf_dispatch for single-compatible case (Genesis-Embodied-AI#438)

* Fix gpu_graph fallback on old Nvidia GPU. (Genesis-Embodied-AI#443)

* Fix shared memory offset not reset between CUDA kernels. (Genesis-Embodied-AI#442)

* [Misc] Allow disabling GPU graph via QD_GPU_GRAPH=0 env var (Genesis-Embodied-AI#439)

* [Misc] Add named top-level loops (Genesis-Embodied-AI#440)

* [Misc] Rename gpu_graph to graph (Genesis-Embodied-AI#446)

* [Misc] Add cross-platform shuffle (Genesis-Embodied-AI#447)

* [Bug] Fix graph_do_while on Windows: search for cudadevrt.lib (Genesis-Embodied-AI#456)

* [Bug] Also search default CUDA toolkit install location on Windows (Genesis-Embodied-AI#461)

* [SPIRV] Feature Parity Atomics & Shared Array (Genesis-Embodied-AI#432)

* [Misc] Change clang format to 120 characters (Genesis-Embodied-AI#463)

* [Misc] CUDA graph 5 Add fatbin (Genesis-Embodied-AI#464)

* [Bug] Reuse VkInstance across init/reset cycles (Genesis-Embodied-AI#465)

* [Perf] Tiles 1: _load, _store, _eye_ (Genesis-Embodied-AI#466)

* [Misc] Remove dead InternalFuncStmt type_check override (Genesis-Embodied-AI#471)

* [Perf] Tiles 2: add cholesky and ger (Genesis-Embodied-AI#472)

* [Perf] Tiles 2b: add triangular solve (Genesis-Embodied-AI#474)

* [Misc] Refactor: use _get_col/_set_col in tiles load/store/init (Genesis-Embodied-AI#475)

* [Build] Fix flaky test_clock_accuracy (Genesis-Embodied-AI#436)

* Fix AARCH64 emitting invalid asm in CUDA kernels. (Genesis-Embodied-AI#473)

Co-authored-by: Hugh Perkins <hughperkins@gmail.com>

* [AMDGPU] Enable HIP memory pool and surface pool-exhaustion errors. (Genesis-Embodied-AI#485)

* [AMDGPU] Scope hsaco tmp dir per-user to avoid collisions. (Genesis-Embodied-AI#484)

* [Perf] Tiles 3: Add slice syntax, qd.outer() and initial doc (Genesis-Embodied-AI#477)

* [AMDGPU] Fix gradient computation. (Genesis-Embodied-AI#486)

* Enable all backends that are supported in unit tests. (Genesis-Embodied-AI#488)

* Fix SPIRV ID overflow for large kernels due to autodiff. (Genesis-Embodied-AI#489)

* [Misc] Fix purity checker to allow accessing constants from quadrants modules (Genesis-Embodied-AI#487)

* [Misc] Increase tolerance for clock monotonic test (Genesis-Embodied-AI#492)

* [CI] Serialize api doc workflow (Genesis-Embodied-AI#494)

* [CI] Increase tolerance for clock test (Genesis-Embodied-AI#506)

* [CI] Increase clock test tolerance to 20% (Genesis-Embodied-AI#509)

* [Perf] Add tensor_type parametrization to tile16 tests (Genesis-Embodied-AI#504)

* [Perf] Tiles 4b: Migrate tiles16 tests to enable fastcache (Genesis-Embodied-AI#505)

* [Perf] Tiles 4c: add Tiles16x16 proxy (Genesis-Embodied-AI#507)

* [Perf] Tiles 4d: Consolidate slice error tests using parametrize (Genesis-Embodied-AI#508)

* [Perf] Tiles 4: add SharedArray slice support (Genesis-Embodied-AI#482)

* [Perf] Tiles 5: add Cholesky benchmark demo (Genesis-Embodied-AI#483)

* [Doc] Add user guide page for subgroup shuffle (Genesis-Embodied-AI#512)

* [Perf] Implement cross-platform shuffle_down (Genesis-Embodied-AI#510)

* [Perf] Add portable subgroup reduce_add and reduce_all_add (Genesis-Embodied-AI#511)

* [Perf] Add first warmup config to perf dispatch (Genesis-Embodied-AI#422)

* [AutoDiff] Autodiff 1: Add baseline adstack regression test for unary_collections (Genesis-Embodied-AI#500)

* [AutoDiff] Autodiff 2: Implement derivative for tan (Genesis-Embodied-AI#501)

* [AutoDiff] Autodiff 3: Recompute tanh/exp on the operand in the reverse pass (Genesis-Embodied-AI#502)

* [AutoDiff] Autodiff 4: Mark rsqrt as non-linear for adstack promotion (Genesis-Embodied-AI#503)

* [AutoDiff] Autodiff 5: Fix adjoint-alloca placement for GlobalLoads outside the current range-for (Genesis-Embodied-AI#496)

* [AutoDiff] Autodiff 6: Adstack regression tests (Genesis-Embodied-AI#491)

* [AutoDiff] Autodiff 7: Fix header size in AdStackAllocaStmt to match u64 runtime layout (Genesis-Embodied-AI#534)

* [AutoDiff] Autodiff 8: Surface LLVM adstack push/pop overflow as a Python exception (Genesis-Embodied-AI#535)

* [AutoDiff] Autodiff 9: Guard against LLVM worker-thread stack overflow from large per-task adstack budget (Genesis-Embodied-AI#495)

* [AutoDiff] Autodiff 10: Implement adstack for SPIR-V (Genesis-Embodied-AI#490)

* [AutoDiff] Autodiff 11: Latent adstack-adjacent fixes (AMDGPU hipFree, flush() keeps ctx_buffers_, always-preallocate) (Genesis-Embodied-AI#536)

* [Doc] Add AGENTS.md with instructions for AI agents (Genesis-Embodied-AI#541)

* [Bug] Abort kernel execution on assertion failure instead of segfaulting (Genesis-Embodied-AI#419)

* [Type] ndarray typing 1: Add eval_str=True to inspect.signature() calls (Genesis-Embodied-AI#411)

* [CI] Suppress reportPrivateImportUsage in torch-using files (Genesis-Embodied-AI#552)

* [Misc] QD_DUMP_IR dumps to files with the task_id added to the filename (Genesis-Embodied-AI#441)

* [Type] ndarray typing 2: Fix NDArray single-arg subscript crash (Genesis-Embodied-AI#412)

* [Test] Flush xdist channel before worker exit so test failure reports are visible (Genesis-Embodied-AI#555)

* [CI] Reduce test retries on CI from 3 to 1. (Genesis-Embodied-AI#554)

* [AutoDiff] Autodiff 12: Heap-backed adstack on LLVM backends (CPU/CUDA/AMDGPU) (Genesis-Embodied-AI#537)

* [AutoDiff] Autodiff 13: Heap-backed adstack on SPIR-V backends (Metal, Vulkan) (Genesis-Embodied-AI#493)

* [AutoDiff] Autodiff 14: Resolve bounded-inner-loop adstacks without default_ad_stack_size fallback (Genesis-Embodied-AI#539)

* [SPIRV] Vulkan SPIR-V correctness: atomic-view aliasing, PSB stride, narrow storage caps, u1 cast, per-init layer recheck (Genesis-Embodied-AI#513)

* [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch and sanitise MoltenVK cap advertisement (Genesis-Embodied-AI#551)

* [Test] Suppress stock pytest-timeout to avoid conflict with pytest_hardtle (Genesis-Embodied-AI#557)

* [Vulkan] Use SDK validation layer for debugPrintf instead of apt package (Genesis-Embodied-AI#562)

* [Test] Fix flaky perf_dispatch tests by increasing work amounts (Genesis-Embodied-AI#559)

* [Test] Add --maxfail CLI option to run_tests.py (default 20) (Genesis-Embodied-AI#558)

* [CI] Vulkan debug printf fix to address flaky tests (Genesis-Embodied-AI#563)

* [Docs] Add a new page to help for first time contributors (Genesis-Embodied-AI#426)

Authored-by: v01dxyz <v01dxyz@v01d.xyz>

* [AutoDiff] Autodiff 16: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr (Genesis-Embodied-AI#543)

* Fix: raise error if device memory allocation fails (Genesis-Embodied-AI#451) (Genesis-Embodied-AI#453)

Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>

* [CI] Add CI job to check line wrapping of comments and docs (Genesis-Embodied-AI#564)

* [Misc] Add coverage report to PRs, including kernels (Genesis-Embodied-AI#470)

* [CI] CI wrap check feeds only diffs to agent (Genesis-Embodied-AI#567)

* Skip 'flaky' test on MacOS CI. (Genesis-Embodied-AI#573)

* [Test] Fix missing `import sys` in test_fail_device_memory_allocation (Genesis-Embodied-AI#574)

* [CI] Fix Vulkan debugPrintf flake with session-scoped warmup (Genesis-Embodied-AI#571)

* [AutoDiff] determine_ad_stack_size: replace whole-CFG Bellman-Ford with SCC + DAG DP (Genesis-Embodied-AI#575)

* [Test] Fix macOS OOM skip reason to describe actual root cause (Genesis-Embodied-AI#576)

* [Lang] whole_kernel_cse: 2.5x compile time speedup on large kernels (Genesis-Embodied-AI#577)

* [CI] Add CI check for unnecessarily deleted comments (Genesis-Embodied-AI#570)

* [CI] Migrate coverage report to github Check page (Genesis-Embodied-AI#566)

* [Lang] Skip IR verifier between passes unless debug=true (Genesis-Embodied-AI#579)

* [Lang] Inline AdStack ops on release LLVM codegen: dramatically reduces compile time for adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#584)

* [CUDA] Honor offline_cache=False end-to-end so QD_OFFLINE_CACHE=0 actually gives a cold compile (Genesis-Embodied-AI#580)

* [Type] Tensor 24 (Genesis-Embodied-AI#561)

Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>

* [Lang] auto_diff host-walk reductions: dramatically faster front-end compile time on adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#587)

* [AutoDiff] Speed up reverse-mode kernel launches on GPU backends (Genesis-Embodied-AI#578)

* [Vulkan] Move adstack-sizer scratch out of Function-scope memory to fix SPIR-V pipeline build failures (Genesis-Embodied-AI#588)

* [AutoDiff] Improve diagnosis of unsupported reverse-mode AD patterns (Genesis-Embodied-AI#590)

* [Bug] Fix: promote Ndarray to AnyArray in build_Name for flattened struct fields (Genesis-Embodied-AI#592)

* [SPIR-V] Shrink reverse-grad kernel MSL by ~50% (Genesis-Embodied-AI#591)

* [CI] Add CI check that PR changes have test coverage (Genesis-Embodied-AI#596)

* [Perf] Enable zero-copy in to_torch() and to_numpy() (Genesis-Embodied-AI#450)

* Add BufferView: safe sub-range ndarray access for kernels (Genesis-Embodied-AI#585)

Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>

* [Doc] Add user-facing fastcache documentation (Genesis-Embodied-AI#597)

Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>

* [Misc] Upgrade to enable v1 dlpack so to_numpy(copy=False) writable (Genesis-Embodied-AI#598)

Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local>

* [AutoDiff] Cut reverse-mode adstack memory usage 10x on all backends (Genesis-Embodied-AI#599)

* [Misc] Add CI check for feature file factorization (Genesis-Embodied-AI#606)

* [Perf] Skip _recursive_set_args for all-Field frozen dataclass structs (Genesis-Embodied-AI#607)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [AutoDiff] SNode-arm bound-expr capture rejects fold-attack gate indices (Genesis-Embodied-AI#610)

* [Misc] Suppress field fastcache warning for qd.Tensor (Genesis-Embodied-AI#615)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [AutoDiff] Adstack heap: clip reducer count by per-task loop trip count (compile-time and SizeExpr-evaluated) (Genesis-Embodied-AI#611)

* [Misc] Forward copy= through qd.Tensor, add copy=None option (Genesis-Embodied-AI#616)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Doc] Update README (Genesis-Embodied-AI#617)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Fix coverage report showing def lines as uncovered (Genesis-Embodied-AI#623)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Perf] Generic launcher: persistent context, JIT-pointer reuse, Metal compute encoder, LLVM-GPU async memory ops (Part 1/2) (Genesis-Embodied-AI#619)

* [CI] Encode Python-first testing policy in coverage-check prompt (Genesis-Embodied-AI#622)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Add PR Line change report (Genesis-Embodied-AI#624)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Disable quadrants pytest plugin during quadrants internal coverage runs (Genesis-Embodied-AI#629)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [AutoDiff] Adstack load+store eliminations: EliminateRecomputableAdStackPushes pass + leaf extensions (Genesis-Embodied-AI#621)

* [CI] Simplify coverage PR comment to a single linked line (Genesis-Embodied-AI#630)

* [CUDA] Add AGX Thor, SM_110 (Genesis-Embodied-AI#631)

Co-authored-by: Johnny Nunez and Hugh Perkins

* [CI] Lines changed report: collapse PR comment to a single linked totals line (Genesis-Embodied-AI#632)

* [FEATURE] Support external Metal command queue via qd.init (Genesis-Embodied-AI#618)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Perf] Cache adstack-sizer metadata per task across SPIR-V + LLVM-GPU; per-snode / DeviceAllocation invalidation (Part 2/2) (Genesis-Embodied-AI#620)

* [AutoDiff] Disable EliminateRecomputableAdStackPushes pending mutated-SNode chain-leaf fix (Genesis-Embodied-AI#633)

* [AutoDiff] Adstack chain-clone safety: mutated-SNode leaf reject + load_top consumer-aware guard (Genesis-Embodied-AI#634)

* [Docs] Add user-guide page for qd.simt.block.* primitives (Genesis-Embodied-AI#638)

* [Docs] Expand qd.simt.subgroup user-guide page to cover every op (Genesis-Embodied-AI#639)

* [Perf] Streams 1-4 (Genesis-Embodied-AI#410)

* [Docs] Add user-guide page for matrix decompositions and solvers (Genesis-Embodied-AI#643)

* [Bug] Revert "[Perf] Streams 1-4 (Genesis-Embodied-AI#410)" (Genesis-Embodied-AI#650)

* [Docs] Add user-guide page for atomics and bit operations (Genesis-Embodied-AI#640)

* [Docs] Add user-guide page for qd.simt.grid.* primitives (Genesis-Embodied-AI#641)

* [AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch (Genesis-Embodied-AI#635)

* [AMDGPU] Fix amdgpu parallel rand init (Genesis-Embodied-AI#658)

* [Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap (Genesis-Embodied-AI#655)

* [Perf] Re-land Streams 1-4 with bug fixes (Genesis-Embodied-AI#653)

* [AMDGPU] Apply device_memory_GB=0.3 cap to AMDGPU tests (Genesis-Embodied-AI#659)

* [Perf] Per-launch host sync: drop wait_idle on SPIR-V, pin stream and drop stream_synchronize on CUDA/AMDGPU (Genesis-Embodied-AI#654)

* [AMDGPU] Unload hipModule_t in JITModuleAMDGPU destructor (Genesis-Embodied-AI#660)

* [AMDGPU] Trim default mempool on qd.reset() (Genesis-Embodied-AI#669)

* [AMDGPU] Hoist rand-state buffer to process lifetime (Genesis-Embodied-AI#668)

* [Streams] Use events for streams serialization on AMDGPU and CUDA (Genesis-Embodied-AI#667)

* [Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id (Genesis-Embodied-AI#671)

* [SPIR-V] dispatch_max_reducers: register each task with the real kernel name (Genesis-Embodied-AI#675)

* [AutoDiff] Debug-mode field/grad/dual: dtype, layout, and access-time invariants (Genesis-Embodied-AI#677)

* [Docs] Add user-guide page for qd.algorithms.* device-wide algorithms (Genesis-Embodied-AI#642)

Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>

* [Docs] Doc for existing atomics: switch support table to per-backend columns (Genesis-Embodied-AI#657)

Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>

* [GPU] Cross gpu atomics (Genesis-Embodied-AI#666)

Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>

* [GPU] Make block operations portable cross-gpu (Genesis-Embodied-AI#664)

* [Perf] CPU LLVM adstack-cache: skip per-launch bump-writes + ndarray_shapes capture on forward-only handles (Genesis-Embodied-AI#685)

* [GPU] Cross-GPU for grid ops (Genesis-Embodied-AI#670)

* [Math] Make bitop operations portable cross-gpu (Genesis-Embodied-AI#662)

* [AMDGPU] Always use wave64, on both RDNA and CDNA (Genesis-Embodied-AI#687)

* [AMDGPU] Use syncscope("agent") for atomix xor to avoid CAS livelock (Genesis-Embodied-AI#672)

* [GPU] New bit ops for QIPC (Genesis-Embodied-AI#679)

* [GPU] Subgroup ops cross-gpu (Genesis-Embodied-AI#665)

* [Graph] Rename CUDA Graph to Graph in docs (Genesis-Embodied-AI#691)

* [SPIR-V] Fix FIFO-queue ordering when sharing command queue. (Genesis-Embodied-AI#694)

* [Atomics] New QIPC ops for atomics (Genesis-Embodied-AI#690)

* Pass dataclass sub-structs into qd.func (Genesis-Embodied-AI#698)

* [AMDGPU] HIP graph runtime support for @qd.kernel(graph=True) (Genesis-Embodied-AI#692)

* [CI] Add per-file timing report to Mac Metal test job (Genesis-Embodied-AI#695)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [CI] Enable kernel disk cache during tests (Genesis-Embodied-AI#696)

* [Math] New QIPC ops for single-threaded linalg (Genesis-Embodied-AI#683)

* [BREAKING][GPU] New QIPC ops for subgroups (Genesis-Embodied-AI#676)

* [GPU] New QIPC ops for block (Genesis-Embodied-AI#684)

* [GPU] New device-level ops for QIPC (Genesis-Embodied-AI#693)

* [algorithms] PrefixSumExecutor: drop unused GRID_SZ local (Genesis-Embodied-AI#701)

* [block] sync(): fix unsupported-arch error message (Genesis-Embodied-AI#700)

* [volatile_load] add qd.volatile_load primitive (closes Genesis-Embodied-AI#648) (Genesis-Embodied-AI#702)

* [AutoDiff] Reject recycled identity_key in AdStackCache::register_adstack_sizing_info (Genesis-Embodied-AI#708)

* [Vulkan] Declare GroupNonUniform SPIR-V caps and enable shaderSubgroupExtendedTypes (Genesis-Embodied-AI#707)

* Fix duplicate HIP graph driver-function declarations after v1.0.0 merge

The amd-integration fork had cherry-picked the HIP graph driver functions
(graph_create / graph_destroy / graph_add_kernel_node / graph_instantiate /
graph_exec_destroy / graph_launch), and upstream v1.0.0 added the same set.
The per-file 3-way merge appended both copies into
amdgpu_driver_functions.inc.h, producing redeclaration errors that broke the
AMDGPU RHI/runtime compile. Drop the upstream duplicate block; the signatures
are identical to the fork's existing declarations.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix AMDGPU launcher coherence and num_instructions visibility after v1.0.0 merge

- kernel_launcher.cpp: the 3-way merge spliced upstream v1.0.0's launch_llvm_kernel
  rewrite (ephemeral arg/context buffers, explicit-stream path, AmdgpuDefaultStream
  PinGuard) onto the AMD fork's kernarg-by-value + persistent-scratch design,
  leaving references to undefined `ephemeral_context_ptr`. Restore the fork's
  coherent launch_llvm_kernel verbatim; it calls the (already merged) enhanced
  launch_offloaded_tasks, which keeps the max-reducer dispatch and stream-parallel
  groups adapted onto the AMD launch path.
- llvm_context.h: both the fork and upstream added `num_instructions`; the merge
  kept upstream's private placement, but the AMDGPU codegen force-inline heuristic
  calls it statically from outside the class. Move it back to the public section.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Restore async result D2H and hoist kernarg vectors in AMDGPU launcher

The v1.0.0 merge resolution regressed two amd-integration baseline
optimizations in launch_llvm_kernel / launch_offloaded_tasks:

  - The per-launch result-buffer copy was a blocking memcpy_device_to_host,
    forcing a host stall on every value-returning launch and serializing the
    GPU pipeline. Restore the async D2H (the caller synchronizes lazily when it
    needs the value); external-array transfers still stream_synchronize once
    before reading back.

  - launch_task constructed the kernarg std::vectors from initializer lists
    ({kernarg_payload} / {kernarg_size}) on every dispatch (heap alloc + free
    per launch). Hoist arg_ptrs/arg_sizes out of the per-task launch and reuse.

Co-authored-by: Cursor <cursoragent@cursor.com>

* amdgpu: default to LDS permlane64 emulation; drop host-x86 barrier asm on retarget

Two AMDGPU JIT-compile crashes surfaced after the v1.0.0 merge pulled in the QIPC subgroup
ops (Genesis-Embodied-AI#676), which made the rigid constraint solver's wave-cooperative reductions route through
`amdgpu_cross_half_shuffle_i32`. Both manifested as a SIGSEGV inside
`llvm::SIInstrInfo::getInstSizeInBytes` during `JITSessionAMDGPU::compile_module_to_hsaco`
(i.e. at first kernel launch), and reproduce on gfx942 / MI300X. Baseline 0.4.6 never emitted
these constructs, which is why it was unaffected.

1. Native `llvm.amdgcn.permlane64` lowering crashes the bundled LLVM 22.1.0 AMDGPU backend.
   Default `amdgpu_permlane64` to the existing LDS-roundtrip software emulation on every target
   (it produces identical results). Add `QD_AMDGPU_USE_NATIVE_PERMLANE64=1` to opt back into the
   native instruction once the backend bug is fixed; the old `QD_AMDGPU_FORCE_PERMLANE64_FALLBACK`
   is now the default and still honored. This is the actual crash fix.

2. The runtime module is compiled by the host x86_64 clang and only retargeted to amdgcn here, so
   `amdgpu_cross_half_shuffle_i32`'s `__asm__ volatile("" : "+v"(byte))` optimization barrier carries
   x86 flag clobbers (`~{dirflag},~{fpsr},~{flags}`) that are meaningless on AMDGPU. The IR verifies
   but the empty-body INLINEASM is invalid on the amdgcn target. Neutralize empty-body barrier asm
   during retarget (forward the tied value, then erase) so no stale host asm reaches codegen. On the
   wave64 targets we ship `ds_bpermute` already addresses the full wave, so the hint is a no-op.

Co-authored-by: Cursor <cursoragent@cursor.com>

* style: apply clang-format (v19.1.7) to AMDGPU fn_attrs and launcher sources

CI pre-commit's clang-format hook reformatted these files (long
declarations/lambda signatures collapsed onto single lines per the repo's
clang-format config). Apply the same formatting so the hook passes.

No functional changes.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(amdgpu): use CreateNeg for branchless i32 sgn instead of CreateSub(0, input)

clang-tidy (modernize-use-nullptr, -warnings-as-errors) flagged
`builder->CreateSub(0, input)` in the i32 sgn path: the literal `0` binds to
the `llvm::Value*` LHS parameter as a null pointer, not an integer zero.
Replace with `builder->CreateNeg(input)`, which emits `0 - input` with a proper
zero constant -- identical intended semantics, and clang-tidy clean.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Robert Dazi <14996868+v01dXYZ@users.noreply.github.com>
Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>
Co-authored-by: Alexis DUBURCQ <alexis.duburcq@gmail.com>
Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>
Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>
Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com>
Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Johnny <johnnynuca14@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants