[Misc] Upgrade to enable v1 dlpack so to_numpy(copy=False) writable #598
ScalarField.to_torch and MatrixField.to_torch now accept copy=False to return a zero-copy DLPack-backed tensor, copy=True to force a copy, and copy=None (default) to prefer zero-copy with kernel-copy fallback. Handles metal sync, device transfers, 0-dim scalar edge cases, and unsupported dtype/arch rejection internally so callers don't need to.
DataTypeCxx is a wrapper class, not an enum — it doesn't have f32/f64/etc attributes. Use the proper Python-side type constants from quadrants.types.primitive_types instead.
ScalarNdarray, MatrixNdarray, and VectorNdarray.to_torch now accept copy= (None/True/False) to match the Field API. When copy is not True, DLPack zero-copy is attempted first, falling back to kernel copy. This fixes TypeError when genesis calls value.to_torch(copy=None) on ndarray types.
copy=None (default) now always does kernel copy, matching the old behavior. DLPack zero-copy is only attempted when copy=False is explicitly passed. This prevents StructField.to_torch() from getting garbage data — struct members have interleaved (AoS) memory that DLPack's contiguous stride assumption can't represent.
…ehavior Tests cover copy=False (DLPack zero-copy), copy=None (kernel copy), AoS struct members (garbage via DLPack, correct via kernel), SoA struct members (correct zero-copy), and Vulkan rejection.
Reverts unintended ruff format changes that bloated the diff. Re-applies only the logical changes (copy= parameter, _try_zerocopy_torch helpers) on top of the original formatting, then runs pre-commit run -a (black + ruff + pylint) to match project conventions.
Instead of hardcoding arch=[cpu, cuda], use @test_utils.test() with no arch filter so tests run on all available backends. Tests requiring DLPack zero-copy call _skip_if_no_zerocopy() to skip gracefully on backends like Vulkan or old-torch Metal.
Replace _can_zerocopy_field calls in test skip logic with a simple
_NO_ZEROCOPY_ARCHS = {qd.vulkan} set, so tests don't depend on the
implementation to decide whether to run.
Recovers docs/source/user_guide/interop.md from the pre-reset branch, removes the layout= section (not in current design), updates copy=None semantics to reflect that it always does a kernel copy, and fixes the caching section to reflect that Quadrants doesn't cache views internally.
copy=None and copy=True had identical behavior (kernel copy). Defaulting to True is simpler and more explicit -- no need for a three-valued flag when only two behaviors exist (copy vs zero-copy).
…_dlpack Mirrors the to_torch(copy=) API. copy=True (default) returns an independent kernel-copied array, copy=False returns a zero-copy DLPack view (CPU backends only, since numpy arrays cannot reference GPU memory). Adds _try_zerocopy_numpy helper in field.py and propagates copy= through ScalarField, MatrixField, StructField, ScalarNdarray, MatrixNdarray, and VectorNdarray. Includes tests for all types plus GPU-raises test.
…ield AOS struct members have interleaved memory (stride = sizeof(cell)) but the C++ DLPack export emits contiguous strides at the member dtype size, so a zero-copy view silently returns garbage. Now _can_zerocopy_field detects multi-member parent snodes and returns False, causing copy=False to raise ValueError instead of returning corrupted data. Fixes the test to assert ValueError rather than codifying the corruption.
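The stride mismatch described above can be reproduced without Quadrants at all. The following is a minimal sketch using only the standard library; the two-member struct (`a`, `b`) and the `read_member` helper are hypothetical and exist only to illustrate why a contiguous stride at the member dtype size reads the wrong bytes.

```python
import struct

# Hypothetical AoS struct with two float32 members, `a` and `b`.
# Memory is interleaved per cell: a0, b0, a1, b1, a2, b2.
NUM_CELLS = 3
a_values = [1.0, 2.0, 3.0]
b_values = [10.0, 20.0, 30.0]

cell = struct.Struct("<ff")   # one cell = (a, b), 8 bytes
member = struct.Struct("<f")  # one member = 4 bytes

buffer = b"".join(cell.pack(a, b) for a, b in zip(a_values, b_values))


def read_member(buf, offset, stride, count):
    """Read `count` float32 values starting at `offset`, `stride` bytes apart."""
    return [member.unpack_from(buf, offset + i * stride)[0] for i in range(count)]


# Contiguous stride at the member dtype size: walks into `b`'s storage.
contiguous_view = read_member(buffer, 0, member.size, NUM_CELLS)

# Stride equal to sizeof(cell): picks `a` out of every cell.
strided_view = read_member(buffer, 0, cell.size, NUM_CELLS)

print(contiguous_view)  # [1.0, 10.0, 2.0]
print(strided_view)     # [1.0, 2.0, 3.0]
```

Only the second read recovers member `a`; the first interleaves values from `b`, which is the kind of corruption the ValueError is meant to prevent.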
…or numpy
Two fixes from cluster test run:
1. _is_aos_struct_member was too aggressive for MatrixField: a plain VectorField has parent_snode.get_num_ch() == n*m (matrix elements), not struct members. Now checks grandparent for MatrixField.
2. np.from_dlpack requires DLPack v1 (__dlpack__/__dlpack_device__) but Quadrants' to_dlpack() returns a raw PyCapsule (v0). Added _DLPackV1Adapter wrapper.
…sync doc
1. MatrixField.to_torch/to_numpy(copy=False, keep_dims=True) on vector fields (m==1) returned shape (*shape, n) instead of (*shape, n, 1) because the DLPack export collapses m=1 but the reshape only handled the as_vector case. Now always normalises to the expected shape.
2. interop.md incorrectly claimed copy=True calls torch.mps.synchronize() on Metal -- it doesn't, since copy=True skips _try_zerocopy_torch entirely. Corrected the doc.
…ro-copy
copy=False (zero-copy view): only qd.sync(), no torch.mps.synchronize()
copy=True (kernel copy): qd.sync() + torch.mps.synchronize()
Also simplifies _try_zerocopy_torch since it's now only called for copy=False -- removes dead copy=True clone branch and always raises on failure instead of returning None.
SNode.place flattens vec/mat components directly under the struct cell (no intermediate matrix SNode), so parent(2) overshoots to root and the bare except swallows the error, returning False -- allowing silent data corruption for vec/mat members of AOS structs. Fix: use parent(1) for both ScalarField and MatrixField, but compare num_ch > n*m for MatrixField (own components) vs > 1 for ScalarField. Added test for vec3 AOS struct member to lock the fix.
- test_vector_field_copy_false_keep_dims: asserts copy=False and copy=True produce the same shape when keep_dims=True (to_torch)
- test_vector_field_to_numpy_copy_false_keep_dims: same for to_numpy
- test_metal_copy_true_syncs_mps: verifies tensor is usable immediately after copy=True on Metal
- test_metal_copy_false_no_mps_sync: verifies copy=False path works on Metal (with explicit torch.mps.synchronize() by caller)
…opy=False
torch.device('cuda:0') != torch.device('cuda') because PyTorch requires
both type and index to match. DLPack tensors always carry an explicit
index, so passing device='cuda' without an index would raise even though
no transfer is needed. Now compares type strictly and index only when
both sides specify one.
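The comparison rule in this commit message can be sketched as follows. To stay runnable without PyTorch, this version works on plain device strings rather than `torch.device` objects; `parse_device` and `devices_match` are hypothetical names, not the helpers used in the PR.

```python
from typing import Optional, Tuple


def parse_device(spec: str) -> Tuple[str, Optional[int]]:
    """Split 'cuda:0' into ('cuda', 0) and 'cuda' into ('cuda', None)."""
    dev_type, sep, index = spec.partition(":")
    return dev_type, int(index) if sep else None


def devices_match(requested: str, actual: str) -> bool:
    """Compare type strictly, and index only when both sides specify one."""
    req_type, req_index = parse_device(requested)
    act_type, act_index = parse_device(actual)
    if req_type != act_type:
        return False
    if req_index is None or act_index is None:
        # One side left the index open, so no transfer is implied.
        return True
    return req_index == act_index


print(devices_match("cuda", "cuda:0"))    # True
print(devices_match("cuda:1", "cuda:0"))  # False
```

With this rule, a caller passing `device='cuda'` is treated as compatible with a DLPack tensor that reports `cuda:0`, while an explicit index mismatch is still detected.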
_try_zerocopy_torch and _try_zerocopy_numpy take Field as the type hint but call field.to_dlpack(), which only exists on ScalarField/MatrixField/Ndarray subclasses. Add a NotImplementedError stub on Field so pyright can resolve the attribute.
SOA layout places each component in its own dense subtree, so parent.get_num_ch() == 1 < n*m. The previous check (> n*m) missed this and allowed DLPack export with wrong strides, silently corrupting data. Changed to != n*m for MatrixField and != 1 for ScalarField, so both SOA (num_ch < expected) and AOS struct members (num_ch > expected) are rejected. Only standalone AOS layout (num_ch == expected) allows zero-copy. Added test for SOA Vector.field copy=False raising ValueError.
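The equality check described above reduces to a single comparison. This is a minimal sketch under the assumption that `num_ch` is whatever `parent.get_num_ch()` returns for the field's parent SNode; `can_zerocopy` is a hypothetical stand-in and not the actual `_can_zerocopy_field` implementation.

```python
def can_zerocopy(num_ch: int, n: int = 1, m: int = 1) -> bool:
    """Return True only for a standalone AOS layout.

    num_ch: child count of the parent SNode (assumed input).
    n, m:   matrix dimensions; a ScalarField uses the defaults n = m = 1.
    """
    expected = n * m
    # num_ch < expected: SOA, each component lives in its own subtree.
    # num_ch > expected: the field shares its cell with other struct members.
    return num_ch == expected


print(can_zerocopy(3, n=3, m=1))  # standalone vec3 field: True
print(can_zerocopy(1, n=3, m=1))  # SOA vec3 field: False
print(can_zerocopy(4, n=3, m=1))  # vec3 plus one scalar in an AOS struct: False
print(can_zerocopy(2))            # scalar member of a two-member struct: False
```

Using `!=` rather than `>` is what closes the SOA gap: both directions of mismatch mean the contiguous strides emitted by the DLPack export would be wrong.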
…DA GPUs Restrict to arch=[qd.cuda] since qd.cfg.arch.name gives 'amdgpu'/'metal' which are not valid torch device strings (torch expects 'cuda'/'mps').
ok to merge
Coverage Report

| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/_ndarray.py | 33% | 90,106 |
| 🔴 python/quadrants/lang/field.py | 67% | 528 |
| 🔴 python/quadrants/lang/matrix.py | 50% | 1293 |
| 🟢 tests/python/test_to_torch_copy.py | 100% | |

Diff coverage: 89% · Overall: 67% · 36 lines, 4 missing
Pyright resolves the field type as the base Field class, which lacked the versioned parameter, causing reportCallIssue.
Resolve conflicts keeping v1 DLPack changes (writable numpy views, updated docstrings, to_dlpack docs section, writability tests).
Three fixes based on review feedback:
1. _patch_field_dlpack_canonical now handles both v0 ("dltensor") and v1
("dltensor_versioned") capsules with correct struct layouts. Previously
it hardcoded the v0 name, breaking layout-tagged fields (order='ji')
when the numpy path requested a v1 capsule.
2. Version-gate versioned=True behind numpy >= 2.1 (which added v1 DLPack
consumer support). On numpy 1.26-2.0, fall back to v0 capsules
(read-only but functional, matching pre-upgrade behavior).
3. Add tests for layout-tagged field to_numpy(copy=False), and guard
writability tests behind numpy >= 2.1.
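The version gate in item 2 could look roughly like the sketch below. It takes the version as a string so that it runs without NumPy installed; `supports_v1_dlpack` is a hypothetical name, and the real code may gate on the NumPy version in a different way.

```python
import re


def supports_v1_dlpack(numpy_version: str) -> bool:
    """Return True when the consumer is NumPy >= 2.1 (v1 DLPack support)."""
    match = re.match(r"(\d+)\.(\d+)", numpy_version)
    if match is None:
        # Unparseable version: stay on the conservative v0 path.
        return False
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (2, 1)


for version in ("1.26.4", "2.0.2", "2.1.0", "2.3.0rc1"):
    versioned = supports_v1_dlpack(version)
    print(version, "-> versioned capsule" if versioned else "-> v0 capsule (read-only)")
```

A caller would then pass the result as the `versioned=` argument to `to_dlpack`, so older NumPy releases keep receiving v0 capsules.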
Additional findings (outside current diff; PR may have been updated during review):

- 🔴 `python/quadrants/lang/field.py:528-540`: The Tensor wrapper at `python/quadrants/_tensor_wrapper.py:208-209` was not updated for the new `versioned=` parameter, so `qd.tensor(qd.f32, shape=(4,)).to_dlpack(versioned=True)` raises `TypeError: to_dlpack() got an unexpected keyword argument 'versioned'`. The wrapper docstring (lines 7-9) explicitly markets `qd.Tensor` as what `qd.tensor()` / `qd.Vector.tensor()` / `qd.Matrix.tensor()` return, so the new writable-numpy entry point is unreachable through the documented modern factory. The fix is one line: `def to_dlpack(self, versioned: bool = False) -> typing.Any: return self._impl.to_dlpack(versioned=versioned)`.

Extended reasoning...

What is the bug

This PR adds a new `versioned=` parameter to the underlying `to_dlpack` methods on three impls:

- `ScalarField.to_dlpack` (field.py:528)
- `MatrixField.to_dlpack` (matrix.py:1293)
- `Ndarray.to_dlpack` (_ndarray.py:90)

But the public `Tensor` wrapper at `python/quadrants/_tensor_wrapper.py:208-209` is hardcoded as:

    def to_dlpack(self) -> typing.Any:
        return self._impl.to_dlpack()

with no `versioned` parameter and no forwarding. The wrapper file is not in this PR's diff, which is the structural reason the gap was missed.

Why this matters

Per the wrapper module docstring (lines 7-9), `qd.Tensor` is the modern public class returned by `qd.tensor()`, `qd.Vector.tensor()`, and `qd.Matrix.tensor()`. Line 18 explicitly lists `to_dlpack` as a wrapper forward. Verifying the call chain in `_tensor.py`: `qd.tensor(...)` → `_wrap_impl(...)` → returns a `Tensor` wrapper instance.

The new interop docs section "Raw DLPack export with to_dlpack()" (interop.md:209-232) advertises `f.to_dlpack(versioned=True)` as the documented entry point for writable numpy zero-copy. Users who follow the modern `qd.tensor()` factory pattern and reach for the new feature will hit the TypeError immediately.

Step-by-step proof

    import quadrants as qd
    qd.init(arch=qd.cpu)
    t = qd.tensor(qd.f32, shape=(4,))
    # 1. qd.tensor(...) returns a Tensor wrapper (not a bare ScalarField/Ndarray).
    # 2. Tensor.to_dlpack signature is (self), with no kwargs.
    # 3. Python rejects the call before any forwarding happens.
    t.to_dlpack(versioned=True)
    # TypeError: to_dlpack() got an unexpected keyword argument 'versioned'

By contrast, the bare-impl path works:

    f = qd.field(qd.f32, shape=(4,))
    f.to_dlpack(versioned=True)  # OK: bare ScalarField has the new kwarg

The new docs example uses bare `qd.field(...)` rather than `qd.tensor(...)`, so a user who copy-pastes the doc literally will not trip, but anyone using the wrapper factory will.

Why existing tests do not catch it

The new tests in `tests/python/test_to_torch_copy.py` (`test_scalar_field_to_numpy_copy_false_is_writable`, etc.) all go through `f.to_numpy(copy=False)` on bare `qd.field(...)` / `qd.ndarray(...)`, never through `qd.tensor(...)`. So the wrapper's `to_dlpack` is not exercised on the new code path.

How to fix

One-line change in `_tensor_wrapper.py`:

    def to_dlpack(self, versioned: bool = False) -> typing.Any:
        return self._impl.to_dlpack(versioned=versioned)

Note on scope

The same wrapper (lines 192, 200) also drops the `copy=` kwarg on `to_numpy` / `to_torch`, so users of `qd.tensor()` already cannot reach `copy=False` zero-copy through the wrapper either. That is a pre-existing gap and out of scope for this bug, but the cleanest fix is consistent with closing it: forward the new kwargs symmetrically.
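The forwarding gap and its fix can be reproduced with stand-in classes, without importing Quadrants. In this sketch `FakeImpl`, `OldTensor`, and `FixedTensor` are hypothetical; only the two `to_dlpack` signatures mirror the ones quoted in the review.

```python
import typing


class FakeImpl:
    """Hypothetical stand-in for ScalarField/MatrixField/Ndarray."""

    def to_dlpack(self, versioned: bool = False) -> typing.Any:
        # Return the capsule name a real producer would attach.
        return "dltensor_versioned" if versioned else "dltensor"


class OldTensor:
    """Wrapper as described before the fix: no kwarg, no forwarding."""

    def __init__(self, impl: typing.Any) -> None:
        self._impl = impl

    def to_dlpack(self) -> typing.Any:
        return self._impl.to_dlpack()


class FixedTensor(OldTensor):
    """Wrapper with the one-line fix applied."""

    def to_dlpack(self, versioned: bool = False) -> typing.Any:
        return self._impl.to_dlpack(versioned=versioned)


try:
    OldTensor(FakeImpl()).to_dlpack(versioned=True)
    old_error = None
except TypeError as exc:
    old_error = str(exc)

print(old_error)
print(FixedTensor(FakeImpl()).to_dlpack(versioned=True))
```

The old wrapper fails at call time, before the impl is ever reached, which is why tests that only exercise the bare impl cannot detect the gap.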
np.from_dlpack does not accept raw PyCapsules -- it requires an object exposing __dlpack__/__dlpack_device__. Replace the broken example with guidance to use to_numpy(copy=False) for NumPy, which handles the protocol adapter internally.
ctypes.pythonapi is a PyDLL which auto-raises ValueError when PyCapsule_GetPointer fails (name mismatch), so the else branch was unreachable. Use PyCapsule_IsValid to dispatch without raising.
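A name-based dispatch that never raises can be sketched with the standard library alone. The helper names (`make_capsule`, `capsule_kind`) are hypothetical, and the capsules here wrap a dummy integer rather than a real DLManagedTensor; only `PyCapsule_New` and `PyCapsule_IsValid` are actual CPython C-API calls.

```python
import ctypes

api = ctypes.pythonapi
api.PyCapsule_New.restype = ctypes.py_object
api.PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
api.PyCapsule_IsValid.restype = ctypes.c_int
api.PyCapsule_IsValid.argtypes = [ctypes.py_object, ctypes.c_char_p]

# A capsule stores only a pointer to its name, so the bytes objects
# must outlive every capsule created from them.
V0_NAME = b"dltensor"
V1_NAME = b"dltensor_versioned"

# Dummy payload: PyCapsule_New rejects a NULL pointer.
_payload = ctypes.c_int(0)


def make_capsule(name: bytes):
    """Create a capsule with the given name and no destructor."""
    return api.PyCapsule_New(ctypes.addressof(_payload), name, None)


def capsule_kind(capsule):
    """Return 'v1', 'v0', or None without ever raising."""
    # PyCapsule_IsValid returns 0 on a name mismatch instead of setting
    # an error, so PyDLL has nothing to convert into an exception.
    if api.PyCapsule_IsValid(capsule, V1_NAME):
        return "v1"
    if api.PyCapsule_IsValid(capsule, V0_NAME):
        return "v0"
    return None


print(capsule_kind(make_capsule(V0_NAME)))
print(capsule_kind(make_capsule(V1_NAME)))
print(capsule_kind(object()))
```

Checking validity first keeps both branches reachable; a dispatch built on `PyCapsule_GetPointer` would need a try/except around the first lookup instead.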
@claude review
The PyCapsule destructor was a no-op, so unconsumed capsules (still named "dltensor"/"dltensor_versioned") leaked the DLManagedTensor(Versioned) struct, shape/strides arrays, and (for ndarrays) the pybind11::object pinning the Python wrapper. Now follows the standard producer convention: check if capsule is still unconsumed via PyCapsule_IsValid, and invoke the inner deleter if so. Also add inline comment in interop.md noting NumPy >= 2.1 requirement for writable assignment. Made-with: Cursor
Coverage Report

| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/_tensor_wrapper.py | 0% | 208-209 |
| 🔴 python/quadrants/lang/_ndarray.py | 33% | 90,110 |
| 🔴 python/quadrants/lang/field.py | 55% | 69-70,73,76,116-117,122,539,556 |
| 🔴 python/quadrants/lang/matrix.py | 50% | 1293 |
| 🟢 tests/python/test_to_torch_copy.py | 93% | 491,506,521,555 |

Diff coverage: 79% · Overall: 74% · 85 lines, 18 missing
Made-with: Cursor
@claude review
…ack(versioned=True) Covers the two test coverage gaps flagged by CI: writable v1 capsule tests for MatrixField and the Tensor wrapper forwarding.
All previously flagged issues addressed (versioned-capsule dispatch, layout-tag patch path, capsule deleter ownership, Tensor wrapper plumbing, doc thresholds, round-trip example caveat).
Extended reasoning...
Overview
This PR upgrades the DLPack export path to optionally emit DLPack v1 (DLManagedTensorVersioned) capsules so to_numpy(copy=False) can return writable arrays on NumPy >= 2.1. Touches the C++ DLPack producer (dlpack_funcs.cpp/h, export_lang.cpp), the Python to_dlpack wrappers on Ndarray/ScalarField/MatrixField/Tensor, the canonical-view ctypes patch helper, the user-guide interop docs, and a sizable batch of new tests in test_to_torch_copy.py.
Security risks
None applicable. This is an in-process buffer export ABI; there is no auth, network, deserialization of untrusted input, or privileged operation. The only resource-management concern (the empty v0 capsule deleter pre-existing in the v0 path, plus the new v1 path) was flagged earlier in this review and the current diff fixes both: the deleters now dispatch on PyCapsule_IsValid and invoke the inner managed-tensor deleter when the capsule is GC'd unconsumed, so producer-side leaks of DLManagedTensor*, the heap shape/strides, and the pybind11::object* pinning the ndarray wrapper are no longer possible.
Level of scrutiny
Higher than a typical config change because this is C++/Python ABI code at the boundary with PyTorch and NumPy, with manual memory management on both sides of the v0/v1 split, ctypes structure punning to mutate capsule contents in place, and version-dependent consumer behaviour. I ran multiple rounds of bug hunting on it and posted six distinct issue reports across the cycle — runtime regression on NumPy < 2.1, layout-tagged + v1 capsule patch path, PyDLL-auto-raise dead code, capsule deleter ownership leak, Tensor wrapper missing the new kwarg, and doc snippet accuracy. Each iteration of the diff has substantively addressed the concrete issue raised. I have re-walked the current code (field.py:107-122, _tensor_wrapper.py:208-209, dlpack_funcs.cpp:329-356, :460-498, interop.md:101, :140, :208-232) and confirmed the fixes are correct.
Other factors
The bug hunting system reported zero bugs on this run. A human reviewer (duburcqa) said "ok to merge". CI coverage for the diff is 89% with the missing lines being the no-torch ModuleNotFoundError branches and a couple of fall-through paths, which is acceptable. The new tests cover the writable-numpy path, the layout-tagged + numpy zero-copy combination (the regression I flagged in an earlier round), and the Tensor-wrapper passthrough. Given how iterative this review has been and how thoroughly the surface has been exercised, I'm comfortable shadow-approving rather than asking for another human pass.
Coverage Report

| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/_tensor_wrapper.py | 50% | 208 |
| 🔴 python/quadrants/lang/_ndarray.py | 33% | 90,110 |
| 🔴 python/quadrants/lang/field.py | 55% | 69-70,73,76,116-117,122,539,556 |
| 🔴 python/quadrants/lang/matrix.py | 50% | 1293 |
| 🟢 tests/python/test_to_torch_copy.py | 93% | 491,506,521,555,577,595 |

Diff coverage: 83% · Overall: 74% · 110 lines, 19 missing
* [Misc] Warn user to disable caching when print_ir/QD_DUMP_IR enabled (Genesis-Embodied-AI#425) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> * [Build] Pin torch version to CUDA 12.8 for CUDA tests (Genesis-Embodied-AI#428) * [Misc] Fixing up taichi-dev urls (Genesis-Embodied-AI#429) * [Perf] Rename cuda_graph to gpu_graph across the codebase (Genesis-Embodied-AI#430) * Misc: fix typo integeral -> integral (Genesis-Embodied-AI#434) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> * [Perf] CUDA graph 4: call from multiple locations (Genesis-Embodied-AI#420) * [Bug] Fix fastcache not restoring graph_do_while_arg (Genesis-Embodied-AI#435) * [Perf] Cache last-call result in perf_dispatch for single-compatible case (Genesis-Embodied-AI#438) * Fix gpu_graph fallback on old Nvidia GPU. (Genesis-Embodied-AI#443) * Fix shared memory offset not reset between CUDA kernels. (Genesis-Embodied-AI#442) * [Misc] Allow disabling GPU graph via QD_GPU_GRAPH=0 env var (Genesis-Embodied-AI#439) * [Misc] Add named top-level loops (Genesis-Embodied-AI#440) * [Misc] Rename gpu_graph to graph (Genesis-Embodied-AI#446) * [Misc] Add cross-platform shuffle (Genesis-Embodied-AI#447) * [Bug] Fix graph_do_while on Windows: search for cudadevrt.lib (Genesis-Embodied-AI#456) * [Bug] Also search default CUDA toolkit install location on Windows (Genesis-Embodied-AI#461) * [SPIRV] Feature Parity Atomics & Shared Array (Genesis-Embodied-AI#432) * [Misc] Change clang format to 120 characters (Genesis-Embodied-AI#463) * [Misc] CUDA graph 5 Add fatbin (Genesis-Embodied-AI#464) * [Bug] Reuse VkInstance across init/reset cycles (Genesis-Embodied-AI#465) * [Perf] Tiles 1: _load, _store, _eye_ (Genesis-Embodied-AI#466) * [Misc] Remove dead InternalFuncStmt type_check override (Genesis-Embodied-AI#471) * [Perf] Tiles 2: add cholesky and ger (Genesis-Embodied-AI#472) * [Perf] Tiles 2b: add triangular solve (Genesis-Embodied-AI#474) * [Misc] Refactor: use _get_col/_set_col in tiles load/store/init 
(Genesis-Embodied-AI#475) * [Build] Fix flaky test_clock_accuracy (Genesis-Embodied-AI#436) * Fix AARCH64 emitting invalid asm in CUDA kernels. (Genesis-Embodied-AI#473) Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [AMDGPU] Enable HIP memory pool and surface pool-exhaustion errors. (Genesis-Embodied-AI#485) * [AMDGPU] Scope hsaco tmp dir per-user to avoid collisions. (Genesis-Embodied-AI#484) * [Perf] Tiles 3: Add slice syntax, qd.outer() and initial doc (Genesis-Embodied-AI#477) * [AMDGPU] Fix gradient computation. (Genesis-Embodied-AI#486) * Enable all backends that are supported in unit tests. (Genesis-Embodied-AI#488) * Fix SPIRV ID overflow for large kernels due to autodiff. (Genesis-Embodied-AI#489) * [Misc] Fix purity checker to allow accessing constants from quadrants modules (Genesis-Embodied-AI#487) * [Misc] Increase tolerance for clock monotonic test (Genesis-Embodied-AI#492) * [CI] Serialize api doc workflow (Genesis-Embodied-AI#494) * [CI] Increase tolerance for clock test (Genesis-Embodied-AI#506) * [CI] Increase clock test tolerance to 20% (Genesis-Embodied-AI#509) * [Perf] Add tensor_type parametrization to tile16 tests (Genesis-Embodied-AI#504) * [Perf] Tiles 4b: Migrate tiles16 tests to enable fastcache (Genesis-Embodied-AI#505) * [Perf] Tiles 4c: add Tiles16x16 proxy (Genesis-Embodied-AI#507) * [Perf] Tiles 4d: Consolidate slice error tests using parametrize (Genesis-Embodied-AI#508) * [Perf] Tiles 4: add SharedArray slice support (Genesis-Embodied-AI#482) * [Perf] Tiles 5: add Cholesky benchmark demo (Genesis-Embodied-AI#483) * [Doc] Add user guide page for subgroup shuffle (Genesis-Embodied-AI#512) * [Perf] Implement cross-platform shuffle_down (Genesis-Embodied-AI#510) * [Perf] Add portable subgroup reduce_add and reduce_all_add (Genesis-Embodied-AI#511) * [Perf] Add first warmup config to perf dispatch (Genesis-Embodied-AI#422) * [AutoDiff] Autodiff 1: Add baseline adstack regression test for unary_collections 
(Genesis-Embodied-AI#500) * [AutoDiff] Autodiff 2: Implement derivative for tan (Genesis-Embodied-AI#501) * [AutoDiff] Autodiff 3: Recompute tanh/exp on the operand in the reverse pass (Genesis-Embodied-AI#502) * [AutoDiff] Autodiff 4: Mark rsqrt as non-linear for adstack promotion (Genesis-Embodied-AI#503) * [AutoDiff] Autodiff 5: Fix adjoint-alloca placement for GlobalLoads outside the current range-for (Genesis-Embodied-AI#496) * [AutoDiff] Autodiff 6: Adstack regression tests (Genesis-Embodied-AI#491) * [AutoDiff] Autodiff 7: Fix header size in AdStackAllocaStmt to match u64 runtime layout (Genesis-Embodied-AI#534) * [AutoDiff] Autodiff 8: Surface LLVM adstack push/pop overflow as a Python exception (Genesis-Embodied-AI#535) * [AutoDiff] Autodiff 9: Guard against LLVM worker-thread stack overflow from large per-task adstack budget (Genesis-Embodied-AI#495) * [AutoDiff] Autodiff 10: Implement adstack for SPIR-V (Genesis-Embodied-AI#490) * [AutoDiff] Autodiff 11: Latent adstack-adjacent fixes (AMDGPU hipFree, flush() keeps ctx_buffers_, always-preallocate) (Genesis-Embodied-AI#536) * [Doc] Add AGENTS.md with instructions for AI agents (Genesis-Embodied-AI#541) * [Bug] Abort kernel execution on assertion failure instead of segfaulting (Genesis-Embodied-AI#419) * [Type] ndarray typing 1: Add eval_str=True to inspect.signature() calls (Genesis-Embodied-AI#411) * [CI] Suppress reportPrivateImportUsage in torch-using files (Genesis-Embodied-AI#552) * [Misc] QD_DUMP_IR dumps to files with the task_id added to the filename (Genesis-Embodied-AI#441) * [Type] ndarray typing 2: Fix NDArray single-arg subscript crash (Genesis-Embodied-AI#412) * [Test] Flush xdist channel before worker exit so test failure reports are visible (Genesis-Embodied-AI#555) * [CI] Reduce test retries on CI from 3 to 1. 
(Genesis-Embodied-AI#554) * [AutoDiff] Autodiff 12: Heap-backed adstack on LLVM backends (CPU/CUDA/AMDGPU) (Genesis-Embodied-AI#537) * [AutoDiff] Autodiff 13: Heap-backed adstack on SPIR-V backends (Metal, Vulkan) (Genesis-Embodied-AI#493) * [AutoDiff] Autodiff 14: Resolve bounded-inner-loop adstacks without default_ad_stack_size fallback (Genesis-Embodied-AI#539) * [SPIRV] Vulkan SPIR-V correctness: atomic-view aliasing, PSB stride, narrow storage caps, u1 cast, per-init layer recheck (Genesis-Embodied-AI#513) * [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch and sanitise MoltenVK cap advertisement (Genesis-Embodied-AI#551) * [Test] Suppress stock pytest-timeout to avoid conflict with pytest_hardtle (Genesis-Embodied-AI#557) * [Vulkan] Use SDK validation layer for debugPrintf instead of apt package (Genesis-Embodied-AI#562) * [Test] Fix flaky perf_dispatch tests by increasing work amounts (Genesis-Embodied-AI#559) * [Test] Add --maxfail CLI option to run_tests.py (default 20) (Genesis-Embodied-AI#558) * [CI] Vulkan debug printf fix to address flaky tests (Genesis-Embodied-AI#563) * [Docs] Add a new page to help for first time contributors (Genesis-Embodied-AI#426) Authored-by: v01dxyz <v01dxyz@v01d.xyz> * [AutoDiff] Autodiff 16: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr (Genesis-Embodied-AI#543) * Fix: raise error if device memory allocation fails (Genesis-Embodied-AI#451) (Genesis-Embodied-AI#453) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [CI] Add CI job to check line wrapping of comments and docs (Genesis-Embodied-AI#564) * [Misc] Add coverage report to PRs, including kernels (Genesis-Embodied-AI#470) * [CI] CI wrap check feeds only diffs to agent (Genesis-Embodied-AI#567) * Skip 'flaky' test on MacOS CI. 
(Genesis-Embodied-AI#573) * [Test] Fix missing `import sys` in test_fail_device_memory_allocation (Genesis-Embodied-AI#574) * [CI] Fix Vulkan debugPrintf flake with session-scoped warmup (Genesis-Embodied-AI#571) * [AutoDiff] determine_ad_stack_size: replace whole-CFG Bellman-Ford with SCC + DAG DP (Genesis-Embodied-AI#575) * [Test] Fix macOS OOM skip reason to describe actual root cause (Genesis-Embodied-AI#576) * [Lang] whole_kernel_cse: 2.5x compile time speedup on large kernels (Genesis-Embodied-AI#577) * [CI] Add CI check for unnecessarily deleted comments (Genesis-Embodied-AI#570) * [CI] Migrate coverage report to github Check page (Genesis-Embodied-AI#566) * [Lang] Skip IR verifier between passes unless debug=true (Genesis-Embodied-AI#579) * [Lang] Inline AdStack ops on release LLVM codegen: dramatically reduces compile time for adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#584) * [CUDA] Honor offline_cache=False end-to-end so QD_OFFLINE_CACHE=0 actually gives a cold compile (Genesis-Embodied-AI#580) * [Type] Tensor 24 (Genesis-Embodied-AI#561) Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> * [Lang] auto_diff host-walk reductions: dramatically faster front-end compile time on adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#587) * [AutoDiff] Speed up reverse-mode kernel launches on GPU backends (Genesis-Embodied-AI#578) * [Vulkan] Move adstack-sizer scratch out of Function-scope memory to fix SPIR-V pipeline build failures (Genesis-Embodied-AI#588) * [AutoDiff] Improve diagnosis of unsupported reverse-mode AD patterns (Genesis-Embodied-AI#590) * [Bug] Fix: promote Ndarray to AnyArray in build_Name for flattened struct fields (Genesis-Embodied-AI#592) * [SPIR-V] Shrink reverse-grad kernel MSL by ~50% (Genesis-Embodied-AI#591) * [CI] Add CI check that PR changes have test coverage (Genesis-Embodied-AI#596) * [Perf] Enable zero-copy in to_torch() and to_numpy() (Genesis-Embodied-AI#450) * Add 
BufferView: safe sub-range ndarray access for kernels (Genesis-Embodied-AI#585) Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [Doc] Add user-facing fastcache documentation (Genesis-Embodied-AI#597) Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> * [Misc] Upgrade to enable v1 dlpack so to_numpy(copy=False) writable (Genesis-Embodied-AI#598) Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local> * [AutoDiff] Cut reverse-mode adstack memory usage 10x on all backends (Genesis-Embodied-AI#599) * [Misc] Add CI check for feature file factorization (Genesis-Embodied-AI#606) * [Perf] Skip _recursive_set_args for all-Field frozen dataclass structs (Genesis-Embodied-AI#607) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] SNode-arm bound-expr capture rejects fold-attack gate indices (Genesis-Embodied-AI#610) * [Misc] Suppress field fastcache warning for qd.Tensor (Genesis-Embodied-AI#615) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] Adstack heap: clip reducer count by per-task loop trip count (compile-time and SizeExpr-evaluated) (Genesis-Embodied-AI#611) * [Misc] Forward copy= through qd.Tensor, add copy=None option (Genesis-Embodied-AI#616) Co-authored-by: Cursor <cursoragent@cursor.com> * [Doc] Update README (Genesis-Embodied-AI#617) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Fix coverage report showing def lines as uncovered (Genesis-Embodied-AI#623) Co-authored-by: Cursor <cursoragent@cursor.com> * [Perf] Generic launcher: persistent context, JIT-pointer reuse, Metal compute encoder, LLVM-GPU async memory ops (Part 1/2) (Genesis-Embodied-AI#619) * [CI] Encode Python-first testing policy in coverage-check prompt (Genesis-Embodied-AI#622) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Add PR Line change report (Genesis-Embodied-AI#624) Co-authored-by: Cursor <cursoragent@cursor.com> 
* [CI] Disable quadrants pytest plugin during quadrants internal coverage runs (Genesis-Embodied-AI#629) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] Adstack load+store eliminations: EliminateRecomputableAdStackPushes pass + leaf extensions (Genesis-Embodied-AI#621) * [CI] Simplify coverage PR comment to a single linked line (Genesis-Embodied-AI#630) * [CUDA] Add AGX Thor, SM_110 (Genesis-Embodied-AI#631) Co-authored-by: Johnny Nunez and Hugh Perkins * [CI] Lines changed report: collapse PR comment to a single linked totals line (Genesis-Embodied-AI#632) * [FEATURE] Support external Metal command queue via qd.init (Genesis-Embodied-AI#618) Co-authored-by: Cursor <cursoragent@cursor.com> * [Perf] Cache adstack-sizer metadata per task across SPIR-V + LLVM-GPU; per-snode / DeviceAllocation invalidation (Part 2/2) (Genesis-Embodied-AI#620) * [AutoDiff] Disable EliminateRecomputableAdStackPushes pending mutated-SNode chain-leaf fix (Genesis-Embodied-AI#633) * [AutoDiff] Adstack chain-clone safety: mutated-SNode leaf reject + load_top consumer-aware guard (Genesis-Embodied-AI#634) * [Docs] Add user-guide page for qd.simt.block.* primitives (Genesis-Embodied-AI#638) * [Docs] Expand qd.simt.subgroup user-guide page to cover every op (Genesis-Embodied-AI#639) * [Perf] Streams 1-4 (Genesis-Embodied-AI#410) * [Docs] Add user-guide page for matrix decompositions and solvers (Genesis-Embodied-AI#643) * [Bug] Revert "[Perf] Streams 1-4 (Genesis-Embodied-AI#410)" (Genesis-Embodied-AI#650) * [Docs] Add user-guide page for atomics and bit operations (Genesis-Embodied-AI#640) * [Docs] Add user-guide page for qd.simt.grid.* primitives (Genesis-Embodied-AI#641) * [AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch (Genesis-Embodied-AI#635) * [AMDGPU] Fix amdgpu parallel rand init (Genesis-Embodied-AI#658) * [Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap (Genesis-Embodied-AI#655) * [Perf] Re-land Streams 1-4 with bug fixes 
(Genesis-Embodied-AI#653) * [AMDGPU] Apply device_memory_GB=0.3 cap to AMDGPU tests (Genesis-Embodied-AI#659) * [Perf] Per-launch host sync: drop wait_idle on SPIR-V, pin stream and drop stream_synchronize on CUDA/AMDGPU (Genesis-Embodied-AI#654) * [AMDGPU] Unload hipModule_t in JITModuleAMDGPU destructor (Genesis-Embodied-AI#660) * [AMDGPU] Trim default mempool on qd.reset() (Genesis-Embodied-AI#669) * [AMDGPU] Hoist rand-state buffer to process lifetime (Genesis-Embodied-AI#668) * [Streams] Use events for streams serialization on AMDGPU and CUDA (Genesis-Embodied-AI#667) * [Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id (Genesis-Embodied-AI#671) * [SPIR-V] dispatch_max_reducers: register each task with the real kernel name (Genesis-Embodied-AI#675) * [AutoDiff] Debug-mode field/grad/dual: dtype, layout, and access-time invariants (Genesis-Embodied-AI#677) * [Docs] Add user-guide page for qd.algorithms.* device-wide algorithms (Genesis-Embodied-AI#642) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [Docs] Doc for existing atomics: switch support table to per-backend columns (Genesis-Embodied-AI#657) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [GPU] Cross gpu atomics (Genesis-Embodied-AI#666) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [GPU] Make block operations portable cross-gpu (Genesis-Embodied-AI#664) * [Perf] CPU LLVM adstack-cache: skip per-launch bump-writes + ndarray_shapes capture on forward-only handles (Genesis-Embodied-AI#685) * [GPU] Cross-GPU for grid ops (Genesis-Embodied-AI#670) * [Math] Make bitop operations portable cross-gpu (Genesis-Embodied-AI#662) * [AMDGPU] Always use wave64, on both RDNA and CDNA (Genesis-Embodied-AI#687) * [AMDGPU] Use syncscope("agent") for atomix xor to avoid CAS livelock (Genesis-Embodied-AI#672) * [GPU] New bit ops for QIPC (Genesis-Embodied-AI#679) * [GPU] Subgroup ops cross-gpu (Genesis-Embodied-AI#665) * [Graph] Rename CUDA 
Graph to Graph in docs (Genesis-Embodied-AI#691) * [SPIR-V] Fix FIFO-queue ordering when sharing command queue. (Genesis-Embodied-AI#694) * [Atomics] New QIPC ops for atomics (Genesis-Embodied-AI#690) * Pass dataclass sub-structs into qd.func (Genesis-Embodied-AI#698) * [AMDGPU] HIP graph runtime support for @qd.kernel(graph=True) (Genesis-Embodied-AI#692) * [CI] Add per-file timing report to Mac Metal test job (Genesis-Embodied-AI#695) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Enable kernel disk cache during tests (Genesis-Embodied-AI#696) * [Math] New QIPC ops for single-threaded linalg (Genesis-Embodied-AI#683) * [BREAKING][GPU] New QIPC ops for subgroups (Genesis-Embodied-AI#676) * [GPU] New QIPC ops for block (Genesis-Embodied-AI#684) * [GPU] New device-level ops for QIPC (Genesis-Embodied-AI#693) * [algorithms] PrefixSumExecutor: drop unused GRID_SZ local (Genesis-Embodied-AI#701) * [block] sync(): fix unsupported-arch error message (Genesis-Embodied-AI#700) * [volatile_load] add qd.volatile_load primitive (closes Genesis-Embodied-AI#648) (Genesis-Embodied-AI#702) * [AutoDiff] Reject recycled identity_key in AdStackCache::register_adstack_sizing_info (Genesis-Embodied-AI#708) * [Vulkan] Declare GroupNonUniform SPIR-V caps and enable shaderSubgroupExtendedTypes (Genesis-Embodied-AI#707) * Fix duplicate HIP graph driver-function declarations after v1.0.0 merge The amd-integration fork had cherry-picked the HIP graph driver functions (graph_create / graph_destroy / graph_add_kernel_node / graph_instantiate / graph_exec_destroy / graph_launch), and upstream v1.0.0 added the same set. The per-file 3-way merge appended both copies into amdgpu_driver_functions.inc.h, producing redeclaration errors that broke the AMDGPU RHI/runtime compile. Drop the upstream duplicate block; the signatures are identical to the fork's existing declarations. 
Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix AMDGPU launcher coherence and num_instructions visibility after v1.0.0 merge

- kernel_launcher.cpp: the 3-way merge spliced upstream v1.0.0's launch_llvm_kernel rewrite (ephemeral arg/context buffers, explicit-stream path, AmdgpuDefaultStream PinGuard) onto the AMD fork's kernarg-by-value + persistent-scratch design, leaving references to undefined `ephemeral_context_ptr`. Restore the fork's coherent launch_llvm_kernel verbatim; it calls the (already merged) enhanced launch_offloaded_tasks, which keeps the max-reducer dispatch and stream-parallel groups adapted onto the AMD launch path.
- llvm_context.h: both the fork and upstream added `num_instructions`; the merge kept upstream's private placement, but the AMDGPU codegen force-inline heuristic calls it statically from outside the class. Move it back to the public section.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Restore async result D2H and hoist kernarg vectors in AMDGPU launcher

The v1.0.0 merge resolution regressed two amd-integration baseline optimizations in launch_llvm_kernel / launch_offloaded_tasks:

- The per-launch result-buffer copy was a blocking memcpy_device_to_host, forcing a host stall on every value-returning launch and serializing the GPU pipeline. Restore the async D2H (the caller synchronizes lazily when it needs the value); external-array transfers still stream_synchronize once before reading back.
- launch_task constructed the kernarg std::vectors from initializer lists ({kernarg_payload} / {kernarg_size}) on every dispatch (heap alloc + free per launch). Hoist arg_ptrs/arg_sizes out of the per-task launch and reuse.
Co-authored-by: Cursor <cursoragent@cursor.com>

* amdgpu: default to LDS permlane64 emulation; drop host-x86 barrier asm on retarget

Two AMDGPU JIT-compile crashes surfaced after the v1.0.0 merge pulled in the QIPC subgroup ops (Genesis-Embodied-AI#676), which made the rigid constraint solver's wave-cooperative reductions route through `amdgpu_cross_half_shuffle_i32`. Both manifested as a SIGSEGV inside `llvm::SIInstrInfo::getInstSizeInBytes` during `JITSessionAMDGPU::compile_module_to_hsaco` (i.e. at first kernel launch), and reproduce on gfx942 / MI300X. Baseline 0.4.6 never emitted these constructs, which is why it was unaffected.

1. Native `llvm.amdgcn.permlane64` lowering crashes the bundled LLVM 22.1.0 AMDGPU backend. Default `amdgpu_permlane64` to the existing LDS-roundtrip software emulation on every target (it produces identical results). Add `QD_AMDGPU_USE_NATIVE_PERMLANE64=1` to opt back into the native instruction once the backend bug is fixed; the old `QD_AMDGPU_FORCE_PERMLANE64_FALLBACK` is now the default and still honored. This is the actual crash fix.
2. The runtime module is compiled by the host x86_64 clang and only retargeted to amdgcn here, so `amdgpu_cross_half_shuffle_i32`'s `__asm__ volatile("" : "+v"(byte))` optimization barrier carries x86 flag clobbers (`~{dirflag},~{fpsr},~{flags}`) that are meaningless on AMDGPU. The IR verifies but the empty-body INLINEASM is invalid on the amdgcn target. Neutralize empty-body barrier asm during retarget (forward the tied value, then erase) so no stale host asm reaches codegen. On the wave64 targets we ship, `ds_bpermute` already addresses the full wave, so the hint is a no-op.

Co-authored-by: Cursor <cursoragent@cursor.com>

* style: apply clang-format (v19.1.7) to AMDGPU fn_attrs and launcher sources

CI pre-commit's clang-format hook reformatted these files (long declarations/lambda signatures collapsed onto single lines per the repo's clang-format config). Apply the same formatting so the hook passes.
No functional changes.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(amdgpu): use CreateNeg for branchless i32 sgn instead of CreateSub(0, input)

clang-tidy (modernize-use-nullptr, -warnings-as-errors) flagged `builder->CreateSub(0, input)` in the i32 sgn path: the literal `0` binds to the `llvm::Value*` LHS parameter as a null pointer, not an integer zero. Replace with `builder->CreateNeg(input)`, which emits `0 - input` with a proper zero constant -- identical intended semantics, and clang-tidy clean.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Robert Dazi <14996868+v01dXYZ@users.noreply.github.com>
Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
Co-authored-by: Hugh Perkins <hughperkins@gmail.com>
Co-authored-by: Alexis DUBURCQ <alexis.duburcq@gmail.com>
Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>
Co-authored-by: alanray-tech <alan.ray@genesis-ai.company>
Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com>
Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Johnny <johnnynuca14@gmail.com>
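
The `CreateSub(0, input)` fix in the last commit above rests on a plain C++ conversion rule: an integer literal `0` is a null pointer constant, so it converts silently to any pointer parameter. The sketch below reproduces that rule without LLVM. `Value`, `lhs_is_null`, `wrapping_neg`, and `demo` are hypothetical names invented for illustration; none of them are LLVM or project APIs, and this is not the launcher or codegen code itself.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for llvm::Value; not the real LLVM class.
struct Value {
  std::int32_t payload;
};

// Mimics the shape of a two-operand builder call such as CreateSub(lhs, rhs).
// Reports whether the left operand arrived as a null pointer.
bool lhs_is_null(const Value *lhs, const Value *rhs) {
  (void)rhs;  // unused; kept only to mirror the two-operand signature
  return lhs == nullptr;
}

// The arithmetic the fix intends: 0 - input, wrapping on overflow.
// Done in unsigned arithmetic to avoid signed-overflow undefined behaviour.
std::int32_t wrapping_neg(std::int32_t input) {
  return static_cast<std::int32_t>(0u - static_cast<std::uint32_t>(input));
}

// Small self-check tying the two pieces together.
void demo() {
  Value input{5};
  // The literal 0 is a null pointer constant, so this compiles and passes
  // a null lhs rather than an integer-zero operand.
  assert(lhs_is_null(0, &input));
  // Passing a real operand does not produce a null lhs.
  assert(!lhs_is_null(&input, &input));
  // What a negation is meant to compute instead.
  assert(wrapping_neg(input.payload) == -5);
  assert(wrapping_neg(0) == 0);
}
```

Calling `demo()` from a `main` function exercises the checks. Because the first call compiles without complaint under default warnings, a lint check such as `modernize-use-nullptr` is what surfaces the mistake.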