Add BufferView: safe sub-range ndarray access for kernels by alanray-tech · Pull Request #585 · Genesis-Embodied-AI/quadrants

alanray-tech · 2026-04-28T09:17:35Z

Migrated from #445 (fork branch) to this in-repo branch per @hughperkins' request, so that CI AI review jobs can run. All commits, review history, and discussion are preserved in #445.

Summary

BufferView provides a safe, zero-copy sub-range view into an ndarray for kernel arguments. It rewrites view[i] to arr[offset + i] at AST-translation time, requiring no IR modifications.

In debug mode (debug=True), it inserts runtime bounds assertions that report the kernel name, thread ID, file and line for every frame in the callstack.

API

from quadrants import BufferView
import quadrants as qd

qd.init(arch=qd.cpu, debug=True)

N = 32
data = qd.ndarray(qd.f32, shape=(N,))

# Slice an ndarray to create a view:
view = data[:16]          # offset=0, size=16

# Slice a view to create a subview (offsets accumulate):
sub = view[4:12]          # offset=4, size=8 into data

# Or use subview() with explicit offset and size:
sub = view.subview(offset=4, size=8)

# Kernel annotation - dtype is optional:
@qd.kernel
def scale(v: BufferView[qd.f32], factor: qd.f32):
    for i in range(v.size):
        v[i] = v[i] * factor

scale(data[:16], 2.0)

Slicing forms a closed chain: ndarray -> slice -> BufferView -> slice -> BufferView. Each step validates bounds against the parent.
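The chain can be sketched in plain Python. This is a minimal illustration and not the Quadrants implementation: `ViewSketch` and its list-backed storage are hypothetical stand-ins for `BufferView` and `qd.ndarray`, and the exception types are assumptions. The sketch also allows integer indexing on the host purely to show the `view[i]` to `arr[offset + i]` rewrite; in the PR that rewrite happens inside kernels.

```python
class ViewSketch:
    """Hypothetical stand-in for BufferView over a plain Python list."""

    def __init__(self, arr, offset, size):
        offset, size = int(offset), int(size)
        if offset < 0 or size < 0 or offset + size > len(arr):
            raise ValueError(f"view (offset={offset}, size={size}) exceeds length {len(arr)}")
        self.arr, self.offset, self.size = arr, offset, size

    def subview(self, offset, size):
        # Validate against the parent view's size, not the backing array.
        if offset < 0 or size < 0 or offset + size > self.size:
            raise ValueError(f"subview (offset={offset}, size={size}) exceeds parent size {self.size}")
        # Offsets accumulate, so the result still indexes the original storage.
        return ViewSketch(self.arr, self.offset + offset, size)

    def __getitem__(self, key):
        if isinstance(key, slice):
            start, stop, step = key.indices(self.size)
            if step != 1:
                raise ValueError("step must be 1")
            return self.subview(start, max(stop - start, 0))
        if not 0 <= key < self.size:
            raise IndexError(f"got index {key} (offset={self.offset}, size={self.size})")
        return self.arr[self.offset + key]  # view[i] -> arr[offset + i]


data = list(range(32))
view = ViewSketch(data, 0, 16)  # like data[:16]
sub = view[4:12]                # offset 4, size 8 into data
```

Because every step returns the same kind of object, a subview of a subview needs no special handling: it only adds its offset to the parent's.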

Debug output (OOB access):

quadrants.lang.exception.QuadrantsAssertionError:
BufferView Out Of Range: kernel[kernel] tid=0, got index 16 (offset=0, size=16).
Callstack:
kernel (script.py:12)
  writer (script.py:7)

Design notes

Zero C++ changes. All features are implemented purely in the Python AST transformation and runtime layers.
Host-side validation: BufferView.__init__ coerces offset/size to int and checks bounds (1D, non-negative, within ndarray length). subview() and slice validate against the parent view's size.
Device-side validation: subscript() inserts qd_assert(0 <= i < size) in debug mode, with kernel name, thread ID, and full callstack in the error message. _subview_expr() inserts qd_assert(offset >= 0, size >= 0, offset + size <= parent_size).
BufferView[dtype] annotation delegates to BufferViewType via __class_getitem__. dtype can be omitted - Quadrants infers it from the passed ndarray at compile time.
Kernel-side slicing: v[4:8] inside a kernel is intercepted by impl.subscript() and routed to subview().
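The device-side check described above can be emulated on the host to show what the debug flag changes. This is a rough sketch under stated assumptions: `translate_index` is a hypothetical helper, not a Quadrants function, and it raises a Python `AssertionError` where the PR emits `qd_assert` into the kernel.

```python
def translate_index(i, offset, size, debug=False):
    """Return the backing-array index for view[i] (hypothetical helper)."""
    if debug and not (0 <= i < size):
        # Mirrors the shape of the message in the PR's debug output.
        raise AssertionError(
            f"BufferView Out Of Range: got index {i} (offset={offset}, size={size})."
        )
    # Without debug, no check is inserted and the index is simply shifted.
    return offset + i
```

With `debug=False` an out-of-range index is translated silently, which matches the "no assertion without debug mode" case in the test plan.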

Files changed

File	Change
`python/quadrants/lang/buffer_view.py`	New - BufferView class with `subview()`, `_subview_expr()`, `__getitem__` slice, `__class_getitem__`, debug OOB
`python/quadrants/types/buffer_view_type.py`	New - BufferViewType annotation
`python/quadrants/lang/impl.py`	BufferView dispatch in subscript / assign, kernel-side slice support
`python/quadrants/lang/_func_base.py`	BufferViewType param handling, dtype-optional support
`python/quadrants/lang/_template_mapper_hotpath.py`	BufferViewType cache key + dtype validation
`python/quadrants/lang/ast/.../function_def_transformer.py`	AST decomposition for BufferViewType, @qd.func support
`python/quadrants/lang/_ndarray.py`	`_slice_to_buffer_view` enables `data[:16]` slice syntax
`python/quadrants/lang/matrix.py`	Slice guard for VectorNdarray and MatrixNdarray
`python/quadrants/lang/__init__.py`	Export BufferView, filter buffer_view module from `__all__`
`python/quadrants/types/__init__.py`	Export BufferViewType
`python/quadrants/types/enums.py`	Add BoundaryMode to `__all__`, improve error message
`tests/python/test_api.py`	Add BufferView to expected public API list
`docs/source/user_guide/buffer_view.md`	New - user-facing guide
`docs/source/user_guide/index.md`	Add buffer_view to Core concepts toctree
`tests/python/test_buffer_view.py`	New - 40 tests

See user-facing documentation for full usage guide.

Test plan (40 tests)

Group A - Slice syntax + host-side validation (17 tests)

Slice semantics: data[:16], data[8:24], data[:], data[8:], data[-8:]
Error paths: step!=1, 2D array, negative offset, negative size, exceeds length
Subview: view.subview(4, 8), subview OOB, chained view[4:12], view step error, view int index error
VectorNdarray slice, explicit constructor
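The slice semantics exercised by Group A can be illustrated with the standard `slice.indices()` method. This is a sketch only: `slice_to_offset_size` is a hypothetical name, the `ValueError` is an assumed exception type, and the clamp to zero follows the `max(stop - start, 0)` expression visible in the PR diff.

```python
def slice_to_offset_size(key, length):
    """Map a Python slice to (offset, size) for a 1D array of `length` elements."""
    if key.step not in (None, 1):
        raise ValueError("BufferView slices require step=1")
    start, stop, _ = key.indices(length)  # resolves None and negative indices
    return start, max(stop - start, 0)    # reversed slices clamp to size 0


N = 32
cases = {
    "data[:16]": slice_to_offset_size(slice(None, 16), N),
    "data[8:24]": slice_to_offset_size(slice(8, 24), N),
    "data[:]": slice_to_offset_size(slice(None, None), N),
    "data[8:]": slice_to_offset_size(slice(8, None), N),
    "data[-8:]": slice_to_offset_size(slice(-8, None), N),
}
```

For `N = 32`, `data[-8:]` resolves to offset 24 and size 8, and `data[8:]` to offset 8 and size 24.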

Group B - Kernel functional tests (10 tests)

BufferView[dtype] and plain BufferView annotations
Write isolation, v.size iteration, multiple views, @qd.func annotation
Kernel-internal: BufferView(arr, off, sz) construction, v.subview(4, 4), v[4:8] slice
Type mismatch error

Group C - Debug mode: OOB + callstack (13 tests, x64 + cuda)

subscript() bounds: upper, lower, offset/size in message, nested callstack
_subview_expr() bounds: subview OOB
No false positive, no assertion without debug mode

All 40 tests pass on x64 and cuda backends.

AI Attestation

See timestamped comment in #445.

Maintenance

I will monitor and respond to any bugs or issues related to this feature.

BufferView provides a safe, zero-copy sub-range view into an ndarray for kernel arguments. It rewrites view[i] to arr[offset + i] at AST-translation time with zero IR changes. In debug mode, inserts runtime bounds assertions with full callstack diagnostics (kernel name, thread ID, file:line per frame). Can be passed directly as a kernel parameter via qd.types.buffer_view(dtype), which auto-decomposes into (ndarray, offset, count) at compile time. Minor: improve boundary enum error message to list valid options.

…tests
- BufferView[dtype] square-bracket annotation replaces qd.types.buffer_view() call-style; removes buffer_view alias from buffer_view_type.py
- Add BufferView.__class_getitem__ enabling v: BufferView[qd.f32] annotation
- Add BufferViewType branch to _transform_func_arg so @qd.func accepts BufferView[dtype] annotations directly, no qd.template() workaround needed
- Add ndarray slice syntax: data[:16] returns BufferView(data, 0, 16). Handles start/stop/negative indices; step must be 1; 1D arrays only. Covers ScalarNdarray, VectorNdarray, MatrixNdarray.
- Add BoundaryMode to types/enums.py __all__
- Add docs/source/user_guide/buffer_view.md
- Add tests/python/test_buffer_view.py: 26 tests covering slice syntax, kernel annotation, @qd.func support, debug OOB with callstack on x64/cuda

- black -l 120: collapse single-element raise() calls to one line (_ndarray.py, function_def_transformer.py)
- Add # pylint: disable=C0415 # noqa: I001 to local BufferView import in _ndarray.py (import-outside-toplevel is intentional to avoid the impl.py -> buffer_view.py -> _ndarray.py circular dep)
- Remove erroneous _slice_to_buffer_view guard from MatrixField.__getitem__ (MatrixField does not inherit from Ndarray; only MatrixNdarray and VectorNdarray need the slice guard)

…write
- Rename BufferView.count -> BufferView.size (per duburcqa: aligns with torch)
- Add BufferView.shape property returning (size,) tuple
- Make dtype optional in annotation: v: BufferView infers dtype from the passed ndarray at compile time (add elif annotation is BufferView branch in check_parameter_annotations, converts to BufferViewType())
- Update @qd.func section in docs: annotation is optional on qd.funcs
- Rewrite buffer_view.md: simpler intro, no line wrapping, remove duplicate limitations section, update count->size throughout
- Add test_no_dtype_annotation to verify dtype-optional path
- Update all v.count -> v.size in tests and debug messages (offset=, size=)

…add BufferView to test_api expected list

… branch, reversed-slice clamp

…eation

…r, clarify func dtype validation

…ic types

…chain

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 412cb1f4f6

…xis, reject slice step!=1

github-actions · 2026-04-28T10:18:05Z

Coverage Report (`03e4d02c9`)

File	Coverage	Missing
🔴 `python/quadrants/lang/__init__.py`	0%	7
🟢 `python/quadrants/lang/_func_base.py`	92%	38
🟢 `python/quadrants/lang/_ndarray.py`	91%	266
🟢 `python/quadrants/lang/_template_mapper_hotpath.py`	92%	39
🔴 `python/quadrants/lang/ast/ast_transformers/function_def_transformer.py`	79%	24,31,226
🔴 `python/quadrants/lang/buffer_view.py`	67%	3,5-9,11,14,28,35,49-56,58-62,65,70,94,96-97,100,103,114,123-124,128,143-144,173,188-189,230,235
🟢 `python/quadrants/lang/impl.py`	83%	21,259
🔴 `python/quadrants/lang/matrix.py`	75%	1717
🔴 `python/quadrants/types/__init__.py`	0%	13
🔴 `python/quadrants/types/buffer_view_type.py`	21%	1,4,25,30-34,36-37,40
🔴 `python/quadrants/types/enums.py`	0%	15,47
🟢 `tests/python/test_buffer_view.py`	99%	366-367,468

Diff coverage: 88% · Overall: 61% · 549 lines, 68 missing

Full annotated report

alanray-tech

Inline comments for the two codex review fixes: kernel-side bounds assertion and slice step rejection.

hughperkins · 2026-04-28T10:23:40Z

Note: some comment underwraps to address: https://github.com/Genesis-Embodied-AI/quadrants/actions/runs/25045033191/job/73357813381

…-char width

github-actions · 2026-04-28T11:38:55Z

Coverage Report (`97ba51b81`)

File	Coverage	Missing
🔴 `python/quadrants/lang/__init__.py`	0%	7
🟢 `python/quadrants/lang/_func_base.py`	92%	38
🟢 `python/quadrants/lang/_ndarray.py`	91%	266
🟢 `python/quadrants/lang/_template_mapper_hotpath.py`	92%	39
🔴 `python/quadrants/lang/ast/ast_transformers/function_def_transformer.py`	79%	24,31,226
🔴 `python/quadrants/lang/buffer_view.py`	70%	3,5-9,11,14,28,35,49-56,58-62,65,70,94,96-97,100,103,124,133-134,138,153-154,183,198-199,240,245
🟢 `python/quadrants/lang/impl.py`	86%	21,259
🔴 `python/quadrants/lang/matrix.py`	75%	1717
🔴 `python/quadrants/types/__init__.py`	0%	13
🔴 `python/quadrants/types/buffer_view_type.py`	21%	1,4,24,29-33,35-36,39
🔴 `python/quadrants/types/enums.py`	0%	15,47
🟢 `tests/python/test_buffer_view.py`	98%	367-369,382-383,484,500

Diff coverage: 88% · Overall: 73% · 584 lines, 72 missing

Full annotated report

hughperkins · 2026-04-28T12:35:43Z

Once you are ready-ish to merge, could you run the Genesis benchmarks and unit tests, please?

There are three approaches that I know of to do this:

1. Create a pre-release of Quadrants (I have a video I can send you showing how to do this), then create a draft PR on Genesis that updates the Quadrants version to use that release, and wait for the PR's CI to run.
- This is the easiest way, and if you're not sure what to do, I'd go this way.
- The main downside is that it takes a long elapsed time:
  - creating the pre-release takes ~1.5 hours
  - then waiting for CI to run on Genesis takes another ~2-4 hours, depending on CI load
2. Spin up your own cluster node and run the benchmarks and unit tests manually somehow.
- I wouldn't recommend this route, on the whole. It is hard to do, easy to make errors, not very repeatable, and doesn't give you a nice results table.
3. Use my script at https://github.com/Genesis-Embodied-AI/cmp-tooling/blob/main/bench_cluster_wandb.py (and also unit_tests_cluster.py).
- It does everything for you once you've done the initial setup (primarily cloning repos in the right relative locations).
- It gives you a nice results table at the end.
- You just run a single command (once initial setup is done).

alanray-tech · 2026-04-28T14:25:46Z

I think 3. is good enough.

Do I need some account for the GPU cluster?
Anything I need to provide after running this?

hughperkins · 2026-04-28T15:08:33Z

If you configure Slack, it'll send you the unit test stats and an image of the benchmark results. Then:

- copy and paste the unit test stats here
- upload the benchmark results image here

hughperkins · 2026-04-29T11:25:57Z

Yes, please submit a pr to improve this script :) thank you

On Wed, Apr 29, 2026, 07:19 alanray-tech wrote:

@hughperkins I ran into a small issue running `unit_tests_cluster.py` on the cluster. The script writes `salloc` output to a hardcoded path, /tmp/salloc_test.txt, which is currently owned by your user (`hugh:hugh`), so my user gets `Permission denied` when trying to write to it. Could you delete it when you get a chance?

rm /tmp/salloc_test.txt

Alternatively, I could patch the script to use a per-user filename like /tmp/salloc_test_.txt; happy to open a quick PR on cmp-tooling if you prefer that.

alanray-tech · 2026-04-29T12:37:22Z

Genesis Validation Results

Unit Tests

python3 unit_tests_cluster.py --branch main --backend ndarray \
  --quadrants-branch feature/buffer-view --partition rtx-mid --time 60

Item	Value
Genesis branch	`main` (c5984a2)
Quadrants branch	`feature/buffer-view` (`97ba51b`)
Backend	ndarray
Partition	rtx-mid (8 GPU)
Duration	13m 22s

Result: 582 passed, 2 skipped, 2 xfailed, 0 failed

The 2 skips are pre-existing (SAPCoupler does not support ndarray yet). The 2 xfails are pre-existing known issues. No regressions introduced by BufferView.

Speed Benchmarks

python3 bench_cluster_wandb.py --ref buffer-view --branch main --backend ndarray \
  --quadrants-branch feature/buffer-view --partition rtx-mid --time 60

32 benchmarks passed. Branch results vs WandB main baseline (selected high-traffic envs):

Env	Backend	Batch	Branch FPS	Main FPS	Delta
franka	cuda	30000	15,999,389	15,852,506	+0.9%
franka_random	cuda	30000	12,348,453	12,508,873	-1.3%
franka_free	cuda	30000	21,251,349	21,284,966	-0.2%
anymal_uniform	cuda	30000	10,566,985	10,382,809	+1.8%
anymal_zero	cuda	30000	14,455,816	14,235,909	+1.5%
go2 (Newton)	cuda	4096	3,052,034	3,040,264	+0.4%
dex_hand	cuda	4096	14,196	14,257	-0.4%
box_pyramid_6	cuda	4096	53,062	51,841	+2.4%
shadow_hand_cubes	cpu	0	31	31	0.0%
franka_random	cpu	0	3,179	3,042	+4.5%

All deltas are within normal run-to-run variance (< 5%). No performance regressions.
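The Delta column is the relative change of the branch against the main baseline. The arithmetic can be checked with a short sketch; `delta_pct` is a hypothetical helper, not part of the benchmark script, and the FPS figures are copied from four rows of the table above.

```python
def delta_pct(branch_fps, main_fps):
    """Relative change of the branch against main, in percent."""
    if main_fps == 0:
        return 0.0
    return (branch_fps - main_fps) / main_fps * 100.0


# (branch FPS, main FPS) taken from the results table.
rows = {
    "franka (cuda)": (15_999_389, 15_852_506),
    "franka_random (cuda)": (12_348_453, 12_508_873),
    "box_pyramid_6 (cuda)": (53_062, 51_841),
    "franka_random (cpu)": (3_179, 3_042),
}
deltas = {name: round(delta_pct(b, m), 1) for name, (b, m) in rows.items()}
```

The largest swing in the table is the CPU `franka_random` row at roughly +4.5%, which is still under the 5% variance threshold quoted above.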

…rray kernel tests

alanray-tech

Final comprehensive review: inline comments on every changed file documenting the design intent and rationale for each modification. 15 files changed, 46 tests, 0 regressions in Genesis benchmarks (ndarray + field).

alanray-tech · 2026-04-30T10:23:06Z

+                impl.qd_assert((offset_expr >= Expr(0)).ptr, msg, args, dbg_info)
+                impl.qd_assert((size_expr >= Expr(0)).ptr, msg, args, dbg_info)
+                impl.qd_assert(((offset_expr + size_expr) <= arr_len).ptr, msg, args, dbg_info)
+        else:


Two construction paths gated by impl.inside_kernel():

Host-side: int() coercion + 1D shape check + bounds validation (offset >= 0, size >= 0, offset+size <= len).

Kernel-compilation: Debug-mode assertions via ExternalTensorShapeAlongAxisStmt (same IR instruction as C++ CheckOutOfBound pass). Validates offset >= 0, size >= 0, offset+size <= ndarray_length at device runtime.

alanray-tech · 2026-04-30T10:27:50Z

@hughperkins Merged latest main. I believe this PR is ready to merge once CI passes:

All reviewer feedback addressed (API rename count->size, dtype-optional, doc rewrite, coverage fixes)
Genesis validation: 582 unit tests passed (ndarray), 64 benchmarks passed (ndarray + field), no regressions
46 local tests covering all code paths (host slice, kernel annotation, subview, kernel slice, debug OOB with callstack, Matrix/Vector ndarray support)
Inline review comments added on every changed file for documentation

Let me know if anything else is needed.

alanray-tech

Corrected inline review with accurate line references (post-merge). Documents design intent for every changed file.

alanray-tech · 2026-04-30T10:47:59Z

+    _is_quadrants_class = True
+
+    @classmethod
+    def __class_getitem__(cls, dtype):


BufferView[qd.f32] -> BufferViewType(qd.f32). Lets BufferView serve as both runtime object and type annotation, consistent with Python generics (list[int]). The entire downstream pipeline (_func_base, _template_mapper_hotpath, function_def_transformer) recognizes BufferViewType instances -- no downstream changes needed.
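The dual role of the class (runtime object and annotation factory) relies on Python's `__class_getitem__` hook. The sketch below shows the mechanism in isolation; `ViewSketch` and `ViewTypeSketch` are hypothetical stand-ins for `BufferView` and `BufferViewType`, and the string `"f32"` stands in for a real dtype such as `qd.f32`.

```python
class ViewTypeSketch:
    """Hypothetical stand-in for BufferViewType: carries only the dtype."""

    def __init__(self, dtype=None):
        self.dtype = dtype


class ViewSketch:
    """Hypothetical stand-in for BufferView."""

    def __class_getitem__(cls, dtype):
        # ViewSketch["f32"] yields an annotation object, not a view instance.
        return ViewTypeSketch(dtype)

    def __init__(self, arr, offset, size):
        self.arr, self.offset, self.size = arr, offset, size


def scale(v: ViewSketch["f32"], factor: float):
    # The annotation on `v` is a ViewTypeSketch; the argument is a ViewSketch.
    for i in range(v.size):
        v.arr[v.offset + i] *= factor


annotation = ViewSketch["f32"]
instance = ViewSketch([1.0, 2.0, 3.0, 4.0], 1, 2)
scale(instance, 2.0)
```

Subscripting the class never constructs a view, so code that inspects annotations only has to recognize the type object.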

alanray-tech · 2026-04-30T10:47:59Z

+            return BufferViewType(*dtype)
+        return BufferViewType(dtype)
+
+    def __init__(self, arr, offset, size):


Two construction paths gated by impl.inside_kernel():

Host-side: int() coercion + 1D shape check + bounds validation.

Kernel-compilation: Debug-mode assertions via ExternalTensorShapeAlongAxisStmt (same IR instruction as C++ CheckOutOfBound pass) -- validates offset >= 0, size >= 0, offset+size <= ndarray_length at device runtime.

alanray-tech · 2026-04-30T10:47:59Z

+        """Returns the shape of this view as a tuple, e.g. ``(16,)``."""
+        return (self.size,)
+
+    def subview(self, offset, size):


subview() creates a narrower view within an existing view. Host path validates against self.size. Kernel path delegates to _subview_expr() which inserts three qd_assert calls (offset >= 0, size >= 0, offset+size <= parent_size) in debug mode.

alanray-tech · 2026-04-30T10:47:59Z

+        new_offset = Expr(self.offset) + offset_expr
+        return BufferView(self.arr, new_offset, size_expr)
+
+    def __getitem__(self, key):


__getitem__ is only reachable on the host. Inside kernels, v[...] is intercepted by the AST transformer and dispatched through impl.subscript() -> subscript() (index) or subview() (slice). The assert guards against accidental future routing changes.

alanray-tech · 2026-04-30T10:47:59Z

+        return self.subview(start, max(stop - start, 0))
+
+    @quadrants_scope
+    def subscript(self, *indices):


_build_callstack is called at compile time. The callstack string is frozen as a literal in the compiled IR assertion message. Dynamic values (tid, index, offset, size) are injected at runtime via %d format args -- zero runtime string-building cost.
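The split between compile-time text and runtime values can be sketched with ordinary `%` formatting. This is an illustration only: `build_message_template` is a hypothetical function, and the frame data is taken from the example debug output in the PR description.

```python
def build_message_template(kernel_name, frames):
    """Build the assertion text once; the %d slots are filled at runtime."""
    lines = [
        f"BufferView Out Of Range: kernel[{kernel_name}] tid=%d, "
        "got index %d (offset=%d, size=%d).",
        "Callstack:",
    ]
    for depth, (func, filename, lineno) in enumerate(frames):
        # Each nested frame is indented one level deeper than its caller.
        lines.append("  " * depth + f"{func} ({filename}:{lineno})")
    return "\n".join(lines)


# "Compile time": the callstack is frozen into the template string.
TEMPLATE = build_message_template(
    "kernel", [("kernel", "script.py", 12), ("writer", "script.py", 7)]
)

# "Runtime": only the four integers are substituted.
message = TEMPLATE % (0, 16, 0, 16)
```

Only the integer substitution happens per failing access; the callstack text is built once, when the template is created.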

alanray-tech · 2026-04-30T10:47:59Z


    indices_expr_group = None
    if has_slice:
+        if isinstance(value, BufferView):


Kernel-side v[4:8] slice: AST transformer's build_Subscript -> build_Slice -> impl.subscript(). Converts slice(start, stop) to subview(start, stop-start), inheriting debug-mode bounds assertions from _subview_expr(). Step != 1 is rejected.

alanray-tech · 2026-04-30T10:47:59Z

    @python_scope
    def __getitem__(self, key):
+        if isinstance(key, slice):
+            return self._slice_to_buffer_view(key)


Slice guard for MatrixNdarray.__getitem__. Without this, mat_arr[:16] reaches tuple(key) -> TypeError: 'slice' object is not iterable. Same guard on both VectorNdarray.__getitem__ overrides.
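The failure mode and the guard can be reproduced without Quadrants. In this sketch `MatrixArraySketch` is a hypothetical container, and the tuples it returns are placeholders for the real view and element objects.

```python
class MatrixArraySketch:
    """Hypothetical container whose __getitem__ normally expects index tuples."""

    def __init__(self, length):
        self.length = length

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Guard: handle slices before the tuple() conversion below.
            start, stop, _ = key.indices(self.length)
            return ("view", start, max(stop - start, 0))
        key = (key,) if isinstance(key, int) else tuple(key)
        return ("element", key)


try:
    tuple(slice(None, 16))  # what happens without the guard
    unguarded_error = None
except TypeError as exc:
    unguarded_error = str(exc)

guarded = MatrixArraySketch(32)[:16]
```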

alanray-tech · 2026-04-30T10:47:59Z

    not in [
        "any_array",
        "ast",
+        "buffer_view",


buffer_view excluded from __all__ to prevent the submodule name from leaking into the public API. BufferView (the class) is exported via from quadrants.lang.buffer_view import *; the module itself should not be user-visible.

alanray-tech · 2026-04-30T10:47:59Z



-__all__ = ["Layout", "AutodiffMode", "SNodeGradType", "Format", "DeviceCapability"]
+__all__ = ["Layout", "AutodiffMode", "SNodeGradType", "Format", "BoundaryMode", "DeviceCapability"]


BoundaryMode was defined but absent from __all__. Pre-existing issue, fixed since BufferView uses BoundaryMode for boundary checking.

alanray-tech · 2026-04-30T10:47:59Z

 user_api[qd] = [
    "Backend",
    "BitpackedFields",
+    "BufferView",


BufferView registered as a new public API symbol. test_api checks qd.__all__ against this hardcoded list.

hughperkins · 2026-04-30T10:49:39Z

@claude review

hughperkins · 2026-04-30T11:21:10Z

Note: for claude review, whilst I think we should address 🔴 comments, I'll leave it to your own discretion whether you wish to address 🟡 comments.

…ify kernel slice step error

alanray-tech · 2026-04-30T11:31:29Z

Addressed all three @claude review findings in 10b1065:

BufferViewType.boundary dead field — Removed self.boundary (string). All functional consumers read self.ndarray_type.boundary (int). __repr__ updated to use self.ndarray_type.boundary.
_build_callstack docstring typo — _CALLSTACK_WARNING_SIZE -> _CALLSTACK_WARNING_BYTE to match the actual constant name.
Kernel slice step error message — Changed from "requires step=1 (or omit step)" to "does not support an explicit step in kernels". This avoids the misleading suggestion that step=1 should work, since inside a kernel s.step is an Expr and cannot be compared to 1 at compile time.

github-actions · 2026-04-30T11:31:51Z

Coverage Report (`6f780b0c6`)

File	Coverage	Missing
🔴 `python/quadrants/lang/__init__.py`	0%	7
🟢 `python/quadrants/lang/_func_base.py`	92%	42
🟢 `python/quadrants/lang/_ndarray.py`	91%	391
🟢 `python/quadrants/lang/_template_mapper_hotpath.py`	92%	43
🔴 `python/quadrants/lang/ast/ast_transformers/function_def_transformer.py`	79%	29,36,351
🔴 `python/quadrants/lang/buffer_view.py`	76%	3,5-10,12,14,17,31,38,51,59,83,85-86,89,92,113,122-123,127,142-143,172,187-188,229,234
🟢 `python/quadrants/lang/impl.py`	86%	21,271
🟢 `python/quadrants/lang/matrix.py`	100%
🔴 `python/quadrants/types/__init__.py`	0%	13
🔴 `python/quadrants/types/buffer_view_type.py`	33%	1,4,23,28-29,32
🔴 `python/quadrants/types/enums.py`	0%	15,47
🟢 `tests/python/test_buffer_view.py`	98%	420-422,435-436,537,553

Diff coverage: 91% · Overall: 74% · 604 lines, 55 missing

Full annotated report

github-actions · 2026-04-30T12:43:11Z

Coverage Report (`10b106524`)

File	Coverage	Missing
🔴 `python/quadrants/lang/__init__.py`	0%	7
🟢 `python/quadrants/lang/_func_base.py`	92%	42
🟢 `python/quadrants/lang/_ndarray.py`	91%	391
🟢 `python/quadrants/lang/_template_mapper_hotpath.py`	92%	43
🔴 `python/quadrants/lang/ast/ast_transformers/function_def_transformer.py`	79%	29,36,351
🔴 `python/quadrants/lang/buffer_view.py`	76%	3,5-10,12,14,17,31,38,51,59,83,85-86,89,92,113,122-123,127,142-143,172,187-188,229,234
🟢 `python/quadrants/lang/impl.py`	86%	21,271
🟢 `python/quadrants/lang/matrix.py`	100%
🔴 `python/quadrants/types/__init__.py`	0%	13
🔴 `python/quadrants/types/buffer_view_type.py`	25%	1,4,23,27-28,31
🔴 `python/quadrants/types/enums.py`	0%	15,47
🟢 `tests/python/test_buffer_view.py`	98%	420-422,435-436,537,553

Diff coverage: 91% · Overall: 67% · 603 lines, 55 missing

Full annotated report

alanray-tech · 2026-04-30T12:53:34Z

Addressed @claude review (7c61de7):

1. _build_callstack dedup by funcname only (line 35) — Fixed. Now deduplicates by (filepath, funcname) instead of funcname alone. Same-named functions in different modules are no longer collapsed into a single frame.

2. __init__ kernel branch missing 1D check + callstack (line 106) — Not fixing. The ndim is already guaranteed by the compiler: function_def_transformer._decl_and_create_variable calls decl_ndarray_arg(..., ndim=1), so the backing array is always 1D by construction. Adding a redundant runtime check + callstack here would add complexity without catching any real bug. The existing three assertion args (offset >= 0, size >= 0, offset+size <= length) are sufficient.
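The deduplication change in item 1 can be sketched as follows. `dedup_frames` and the frame tuples are hypothetical and only illustrate keying on `(filepath, funcname)` rather than on `funcname` alone.

```python
def dedup_frames(frames):
    """Keep the first occurrence of each (filepath, funcname) pair, in order."""
    seen = set()
    unique = []
    for filepath, funcname, lineno in frames:
        key = (filepath, funcname)
        if key not in seen:
            seen.add(key)
            unique.append((filepath, funcname, lineno))
    return unique


frames = [
    ("a.py", "helper", 3),
    ("b.py", "helper", 9),  # same name, different module: kept
    ("a.py", "helper", 5),  # same module and name: dropped
]
unique = dedup_frames(frames)
```

Keying on `funcname` alone would have dropped the `b.py` frame as well.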

…eeds_grad ndarrays

alanray-tech · 2026-04-30T13:33:03Z

Fixed both @claude findings in 7cc3e3f:

1. Launch-context cache id-reuse bug (_func_base.py:685) — Changed return 3, True to return 3, False. Temporary BufferView objects created by data[:16] are GC'd between calls, and CPython can reuse the same memory address for the next data[16:32], causing cache hits with stale offset/size. Disabling cacheability is the correct fix — the ~540ns pybind11 overhead per call is negligible vs kernel execution time.

2. Silent gradient drop (buffer_view.py) — Added TypeError in host-side __init__ when the backing ndarray has .grad is not None. BufferView currently hardcodes needs_grad=False at three sites (cache key, decl_ndarray_arg, launch_ctx). Rather than threading gradient support through all three, we reject it early with a clear error. Gradient support can be added later if needed.
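The cache problem in item 1 can be modelled deterministically by passing the reused key explicitly instead of relying on CPython reusing an `id()`. Everything here is a hypothetical model (`LaunchCacheSketch`, the dict-based views, the `REUSED_ID` constant); it only shows why a non-cacheable result must never be stored.

```python
class LaunchCacheSketch:
    """Toy launch-context cache keyed by an id-like integer."""

    def __init__(self):
        self._entries = {}

    def launch(self, key, view, cacheable):
        if key in self._entries:
            return self._entries[key]  # stale if the key was reused
        args = (view["offset"], view["size"])
        if cacheable:
            self._entries[key] = args
        return args


REUSED_ID = 0x1000  # stands in for id() of two successive temporaries

buggy = LaunchCacheSketch()
buggy.launch(REUSED_ID, {"offset": 0, "size": 16}, cacheable=True)
stale = buggy.launch(REUSED_ID, {"offset": 16, "size": 16}, cacheable=True)

fixed = LaunchCacheSketch()
fixed.launch(REUSED_ID, {"offset": 0, "size": 16}, cacheable=False)
fresh = fixed.launch(REUSED_ID, {"offset": 16, "size": 16}, cacheable=False)
```

In the cacheable run the second launch replays the first call's offset and size; in the non-cacheable run each launch recomputes them.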

claude · 2026-04-30T13:57:48Z

+        if needed_arg_basetype is buffer_view_type.BufferViewType and isinstance(v, BufferViewInstance):
+            inner = v.get_ndarray()
+            assert isinstance(inner, Ndarray)
+            launch_ctx_buffer[_QD_ARRAY].append((index, inner.arr))
+            launch_ctx_buffer[_INT].append((index + 1, int(v.offset)))
+            launch_ctx_buffer[_INT].append((index + 2, int(v.size)))
+            return 3, False


🔴 🟡 The frozen-dataclass fast path at _func_base.py:626-650 discards the cacheability flag from each child's recursive call (line 637 destructures with _) and returns a hardcoded True (line 650), overriding the (3, False) that the BufferView branch correctly returns at line 685. So a frozen dataclass containing a BufferView field reintroduces the same id()-reuse silent corruption that commit 7cc3e3f closed at top level: fill(MyStruct(v=data[:16])) followed by fill(MyStruct(v=data[16:32])) can replay stale offset/size from the launch_ctx cache and write the wrong region. This is a pre-existing gap in the frozen-DC fast path (introduced in PR #561), but BufferView is the first user-facing arg type that documents transient-instance construction (data[:16]) where cacheability=False is correctness-critical. Fix is mechanical: mirror the non-frozen branch at lines 664-678 — capture each child's cacheability and AND-aggregate into a returned bool.

Extended reasoning...

What the bug is

The frozen-dataclass fast path in FuncBase._recursive_set_args discards the is_launch_ctx_cacheable flag from every recursive child call and unconditionally tells the caller the dataclass is cacheable. The non-frozen branch immediately below does this correctly — it AND-aggregates each child's cacheability and returns the result.

# python/quadrants/lang/_func_base.py:626-650 (frozen-DC, BUGGY)
if is_frozen:
    plan = _get_frozen_dc_plan(...)
    unwrapped = _get_frozen_dc_unwrapped(v, needed_arg_fields)
    for field_name, field_full_name, field_type in plan:
        field_value = unwrapped[field_name]
        num_args_, _ = FuncBase._recursive_set_args(...)  # ← cacheability discarded
        idx += num_args_
    return idx, True  # ← hardcoded True

# python/quadrants/lang/_func_base.py:651-678 (non-frozen, CORRECT pattern)
is_launch_ctx_cacheable = False  # (separately buggy; should be True, but irrelevant here)
for field in needed_arg_fields.values():
    ...
    num_args_, is_launch_ctx_cacheable_ = FuncBase._recursive_set_args(...)
    idx += num_args_
    is_launch_ctx_cacheable &= is_launch_ctx_cacheable_
return idx, is_launch_ctx_cacheable

The BufferView branch at line 685 correctly returns (3, False) — this is the fix from commit 7cc3e3f that closed the top-level id()-reuse bug for transient slices like data[:16]. But when a BufferView is wrapped in a frozen dataclass, that False is silently overridden back to True by line 650.

How it manifests

@dataclass(frozen=True)
class MyStruct:
    v: BufferView[qd.f32]

@qd.kernel
def fill(s: MyStruct):
    for i in range(s.v.size):
        s.v[i] = 1.0

data = qd.ndarray(qd.f32, shape=(64,))
fill(MyStruct(v=data[:16]))    # call 1: caches launch_ctx with offset=0, size=16
# MyStruct + BufferView GC'd; pymalloc returns slot to LIFO free list
fill(MyStruct(v=data[16:32]))  # call 2: NEW MyStruct lands at SAME id()
# args_hash = (id(t_kernel), id(MyStruct)) → cache hit
# populate_launch_ctx_from_cache copies STALE offset=0, size=16
# kernel writes data[0:16] AGAIN, data[16:32] untouched

Step-by-step:

kernel.py:461 computes args_hash = (id(t_kernel), *[id(arg) for arg in args]) — uses Python id() of each top-level arg, including the transient MyStruct wrapper.

After call 1, MyStruct and its BufferView field are GC'd. CPython pymalloc puts the slot back on a per-size-class LIFO free list.

Call 2 allocates a new MyStruct at the same address (deterministic on CPython for same-size-class objects).

_recursive_set_args runs the frozen-DC fast path; at line 637 it gets (3, False) from the BufferView child but discards the False.

Line 650 returns True → top-level caller treats the cache entry as valid.

populate_launch_ctx_from_cache (kernel.py:178) copies the cached launch_ctx wholesale, including the stale offset=0, size=16 values that launch_ctx.set_args_int baked in during call 1.

Kernel runs with stale offset/size — writes data[0:16] instead of data[16:32].

Expected: data[0:32] all 1.0. Actual: data[0:16] all 1.0, data[16:32] untouched. No exception.

Why existing safeguards don't prevent it

The line-685 return 3, False fix from 7cc3e3f covers top-level BufferView args, but only because the top-level dispatch in Kernel.__call__ reads the returned cacheability directly. The frozen-DC parent intercepts and discards that signal one wrapping level deeper.

LaunchContextBufferCache weakref-evict callbacks (kernel.py:122-153) only track _QD_ARRAY entries, which point at the long-lived inner.arr (the backing ndarray), not the transient wrapper. So GC of the wrapper does not invalidate the cache entry.

The 40 BufferView tests in tests/python/test_buffer_view.py do not exercise frozen-DC + BufferView. test_multiple_views passes views in a single call (both alive at once, no id collision), not in sequential calls.

The _template_mapper_hotpath cache key for BufferView at line 152 is (element_type, ndim, False, boundary) with no offset/size — so two slices like data[:16] and data[16:32] produce the same template-cache entry. That's correct for template caching but means they collide on the post-template launch-context cache too.
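The key collision described in the last point can be shown with plain tuples. This is a hedged illustration: `ToyView`, `template_key`, and `launch_key` are hypothetical, the tuple layout copies the one quoted above, and the `"boundary"` string is a placeholder for whatever value the real code uses.

```python
from typing import NamedTuple


class ToyView(NamedTuple):
    """Hypothetical stand-in for a sub-range view: only the fields needed for the keys."""
    element_type: str
    ndim: int
    offset: int
    size: int


def template_key(view):
    # Same layout as the quoted key: (element_type, ndim, False, boundary); no offset/size.
    return (view.element_type, view.ndim, False, "boundary")


def launch_key(view):
    # Illustration only: a key that also carries offset and size would tell the slices apart.
    return template_key(view) + (view.offset, view.size)


first = ToyView("f32", 1, offset=0, size=16)    # models data[:16]
second = ToyView("f32", 1, offset=16, size=16)  # models data[16:32]

same_template = template_key(first) == template_key(second)
same_launch = launch_key(first) == launch_key(second)
print(same_template, same_launch)
```

Sharing one template entry is harmless because the compiled kernel does not depend on offset or size. The launch context does depend on them, which is why a key without those values cannot safely identify it.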

Pre-existing nature

This gap was introduced in PR #561 (the frozen-DC fast path). It would also affect frozen DCs with float fields (line 609 returns 1, False), large-int fields (line 619 conditional), and external arrays — any leaf that returns False. BufferView is the most user-facing new participant because the user guide explicitly documents data[:16] (transient slice) as the idiomatic usage pattern, and a user wrapping that in a @dataclass(frozen=True) is plausible.

The fix lives entirely in pre-existing code (lines 626-650 are not touched by this PR), so this technically meets the bar for pre_existing severity. I'm filing it as normal because:

The user-facing surface area materially expanded with this PR — BufferView is the first arg type whose documented idiom produces transient instances, and a user wrapping that pattern in a frozen DC is plausible.

Silent data corruption with a documented usage pattern warrants pre-merge attention rather than a follow-up.

The fix is mechanical (~3 lines) and matches an existing precedent already in the file.

Maintainers may reasonably disagree and prefer to land this and address the frozen-DC gap comprehensively in a follow-up that also covers float/int/external-array participants.

How to fix

Mirror the non-frozen branch's pattern (lines 664-678). Capture each child's cacheability, AND-aggregate into a running flag, return that instead of hardcoded True:

if is_frozen:
    plan = _get_frozen_dc_plan(...)
    unwrapped = _get_frozen_dc_unwrapped(v, needed_arg_fields)
    is_launch_ctx_cacheable = True  # ← initialize True for AND-aggregation
    for field_name, field_full_name, field_type in plan:
        field_value = unwrapped[field_name]
        num_args_, is_launch_ctx_cacheable_ = FuncBase._recursive_set_args(...)
        idx += num_args_
        is_launch_ctx_cacheable &= is_launch_ctx_cacheable_
    return idx, is_launch_ctx_cacheable

Note: the non-frozen branch at line 652 initializes is_launch_ctx_cacheable = False, so the aggregate is always False (False & x = False). The non-frozen path is therefore over-conservative (no cache hits), while the frozen path is over-optimistic (stale cache hits). The frozen-path fix above uses the correct True initializer. The False initializer in the non-frozen path is a separate pre-existing bug, worth addressing in the same patch, but it does not affect this verdict.
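The effect of the initializer on AND-aggregation can be checked in isolation. The sketch below is a toy model, not the code in _func_base.py: `Leaf` and `count_args` are hypothetical, and it assumes an ndarray leaf contributes one argument and is cacheable, while a BufferView leaf mirrors the `return 3, False` behaviour discussed in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    """Hypothetical leaf argument: how many kernel args it sets and whether it is cacheable."""
    num_args: int
    cacheable: bool


def count_args(node, initial):
    """Walk nested lists of leaves, returning (total_args, cacheable).

    `initial` is the starting value of the running cacheability flag.
    """
    if isinstance(node, Leaf):
        return node.num_args, node.cacheable
    total, cacheable = 0, initial
    for child in node:
        n, c = count_args(child, initial)
        total += n
        cacheable &= c  # one non-cacheable child makes the whole container non-cacheable
    return total, cacheable


ndarray_leaf = Leaf(1, True)        # assumption: plain ndarray is cacheable
buffer_view_leaf = Leaf(3, False)   # mirrors `return 3, False`

all_cacheable = [ndarray_leaf, [ndarray_leaf]]
with_view = [ndarray_leaf, [buffer_view_leaf]]

good_init_clean = count_args(all_cacheable, initial=True)
good_init_view = count_args(with_view, initial=True)
bad_init_clean = count_args(all_cacheable, initial=False)
print(good_init_clean, good_init_view, bad_init_clean)
```

Starting from True, the flag stays True only when every leaf is cacheable. Starting from False, it can never become True, which matches the over-conservative behaviour attributed to the non-frozen branch.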

hughperkins · 2026-04-30T14:08:50Z

Reminder: no obligation to handle the 🟡 . And every time you handle a 🟡 you risk getting a new 🔴 :). Just handle the 🟡 if you want to (if you think it's going to be useful to you, in the short-term). You can always make newer PRs later.

…ith clear error

alanray-tech · 2026-04-30T14:10:31Z

Addressed 2 of 3 @claude findings in ffd7334:

Doc: needs_grad limitation — Added to the Limitations section in buffer_view.md.

Nested BufferView(view, ...) rejection — Added isinstance(arr, BufferView) check in __init__ with clear error message pointing to subview().

Re: frozen dataclass cache override (_func_base.py:626-650) — This is a pre-existing issue in the frozen-dataclass fast path (introduced in PR #561), not specific to BufferView. The fast path discards is_launch_ctx_cacheable from child fields and hardcodes True, which would override our return 3, False fix for any dataclass containing a BufferView field.

@hughperkins Should this be tracked as a separate issue? The fix would be to propagate the cacheability flag in the frozen-DC fast path, but that changes behavior for all frozen dataclass args, not just BufferView.

alanray-tech · 2026-04-30T14:16:03Z

For context on the launch-ctx cache issue — here's a concrete example of the id-reuse bug that return 3, False fixes:

data = qd.ndarray(qd.f32, shape=(32,))
data.from_numpy(np.zeros(32, dtype=np.float32))

@qd.kernel
def fill(v: BufferView[qd.f32]):
    for i in range(v.size):
        v[i] = 1.0

# Call 1
fill(data[:16])       # temp BufferView(data, 0, 16), id=0xABC
                       # cache miss -> set launch_ctx: arr=data, offset=0, size=16
                       # cache stores: key=id(0xABC) -> launch_ctx
                       # executes: data[0:16] = 1.0  (correct)
                       # call returns, temp object has no refs -> GC frees 0xABC

# Call 2
fill(data[16:32])     # new BufferView(data, 16, 16)
                       # CPython pymalloc reuses address -> id=0xABC (same as call 1!)
                       # cache hit -> reuses old launch_ctx: offset=0, size=16
                       # executes: data[0:16] = 1.0 again  (WRONG! should write data[16:32])

# Expected: all 32 elements = 1.0
# Actual:   data[0:16] = 1.0, data[16:32] = 0.0 (second call wrote to wrong region)

The root cause is that id(arg) is used as the cache key, but CPython reuses memory addresses for short-lived objects of the same size. return 3, False disables caching for BufferView, so launch_ctx is always rebuilt with the correct offset/size.

The frozen-dataclass fast path (_func_base.py:626-650) has the same vulnerability: it discards child is_launch_ctx_cacheable flags and hardcodes True, so a frozen dataclass containing a BufferView field would re-enable caching and reproduce this bug. This is a pre-existing issue in the DC fast path, not introduced by BufferView.

alanray-tech · 2026-04-30T15:17:41Z

@hughperkins Thanks for the guidance. I'll stop addressing further 🟡 nits here — the remaining findings (kernel-side nested BufferView rejection, frozen-DC cache propagation) are edge cases that can be handled in follow-up PRs.

The PR is ready for merge from my side. Summary of current state:

46 tests, all passing (host slice, kernel annotation, subview, kernel slice, debug OOB with callstack, Matrix/Vector ndarray support)
Genesis validation: 582 unit tests passed, 64 benchmarks passed (ndarray + field), no regressions
All reviewer feedback addressed (API, docs, coverage, cache safety, gradient guard)

github-actions · 2026-04-30T15:36:07Z

Coverage Report (`ffd7334c1`)

File	Coverage	Missing
🔴 `python/quadrants/lang/__init__.py`	0%	7
🟢 `python/quadrants/lang/_func_base.py`	92%	42
🟢 `python/quadrants/lang/_ndarray.py`	91%	391
🟢 `python/quadrants/lang/_template_mapper_hotpath.py`	92%	43
🔴 `python/quadrants/lang/ast/ast_transformers/function_def_transformer.py`	79%	29,36,351
🔴 `python/quadrants/lang/buffer_view.py`	75%	3,5-10,12,14,17,31,39,52,60,84,86-87,90,93,112,114,118,127-128,132,147-148,177,192-193,234,239
🟢 `python/quadrants/lang/impl.py`	86%	21,271
🟢 `python/quadrants/lang/matrix.py`	100%
🔴 `python/quadrants/types/__init__.py`	0%	13
🔴 `python/quadrants/types/buffer_view_type.py`	25%	1,4,23,27-28,31
🔴 `python/quadrants/types/enums.py`	0%	15,47
🟢 `tests/python/test_buffer_view.py`	98%	420-422,435-436,537,553

Diff coverage: 91% · Overall: 74% · 608 lines, 57 missing

hughperkins · 2026-04-30T17:13:14Z

Thanks! 🙌

* [Misc] Warn user to disable caching when print_ir/QD_DUMP_IR enabled (Genesis-Embodied-AI#425) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> * [Build] Pin torch version to CUDA 12.8 for CUDA tests (Genesis-Embodied-AI#428) * [Misc] Fixing up taichi-dev urls (Genesis-Embodied-AI#429) * [Perf] Rename cuda_graph to gpu_graph across the codebase (Genesis-Embodied-AI#430) * Misc: fix typo integeral -> integral (Genesis-Embodied-AI#434) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> * [Perf] CUDA graph 4: call from multiple locations (Genesis-Embodied-AI#420) * [Bug] Fix fastcache not restoring graph_do_while_arg (Genesis-Embodied-AI#435) * [Perf] Cache last-call result in perf_dispatch for single-compatible case (Genesis-Embodied-AI#438) * Fix gpu_graph fallback on old Nvidia GPU. (Genesis-Embodied-AI#443) * Fix shared memory offset not reset between CUDA kernels. (Genesis-Embodied-AI#442) * [Misc] Allow disabling GPU graph via QD_GPU_GRAPH=0 env var (Genesis-Embodied-AI#439) * [Misc] Add named top-level loops (Genesis-Embodied-AI#440) * [Misc] Rename gpu_graph to graph (Genesis-Embodied-AI#446) * [Misc] Add cross-platform shuffle (Genesis-Embodied-AI#447) * [Bug] Fix graph_do_while on Windows: search for cudadevrt.lib (Genesis-Embodied-AI#456) * [Bug] Also search default CUDA toolkit install location on Windows (Genesis-Embodied-AI#461) * [SPIRV] Feature Parity Atomics & Shared Array (Genesis-Embodied-AI#432) * [Misc] Change clang format to 120 characters (Genesis-Embodied-AI#463) * [Misc] CUDA graph 5 Add fatbin (Genesis-Embodied-AI#464) * [Bug] Reuse VkInstance across init/reset cycles (Genesis-Embodied-AI#465) * [Perf] Tiles 1: _load, _store, _eye_ (Genesis-Embodied-AI#466) * [Misc] Remove dead InternalFuncStmt type_check override (Genesis-Embodied-AI#471) * [Perf] Tiles 2: add cholesky and ger (Genesis-Embodied-AI#472) * [Perf] Tiles 2b: add triangular solve (Genesis-Embodied-AI#474) * [Misc] Refactor: use _get_col/_set_col in tiles load/store/init 
(Genesis-Embodied-AI#475) * [Build] Fix flaky test_clock_accuracy (Genesis-Embodied-AI#436) * Fix AARCH64 emitting invalid asm in CUDA kernels. (Genesis-Embodied-AI#473) Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [AMDGPU] Enable HIP memory pool and surface pool-exhaustion errors. (Genesis-Embodied-AI#485) * [AMDGPU] Scope hsaco tmp dir per-user to avoid collisions. (Genesis-Embodied-AI#484) * [Perf] Tiles 3: Add slice syntax, qd.outer() and initial doc (Genesis-Embodied-AI#477) * [AMDGPU] Fix gradient computation. (Genesis-Embodied-AI#486) * Enable all backends that are supported in unit tests. (Genesis-Embodied-AI#488) * Fix SPIRV ID overflow for large kernels due to autodiff. (Genesis-Embodied-AI#489) * [Misc] Fix purity checker to allow accessing constants from quadrants modules (Genesis-Embodied-AI#487) * [Misc] Increase tolerance for clock monotonic test (Genesis-Embodied-AI#492) * [CI] Serialize api doc workflow (Genesis-Embodied-AI#494) * [CI] Increase tolerance for clock test (Genesis-Embodied-AI#506) * [CI] Increase clock test tolerance to 20% (Genesis-Embodied-AI#509) * [Perf] Add tensor_type parametrization to tile16 tests (Genesis-Embodied-AI#504) * [Perf] Tiles 4b: Migrate tiles16 tests to enable fastcache (Genesis-Embodied-AI#505) * [Perf] Tiles 4c: add Tiles16x16 proxy (Genesis-Embodied-AI#507) * [Perf] Tiles 4d: Consolidate slice error tests using parametrize (Genesis-Embodied-AI#508) * [Perf] Tiles 4: add SharedArray slice support (Genesis-Embodied-AI#482) * [Perf] Tiles 5: add Cholesky benchmark demo (Genesis-Embodied-AI#483) * [Doc] Add user guide page for subgroup shuffle (Genesis-Embodied-AI#512) * [Perf] Implement cross-platform shuffle_down (Genesis-Embodied-AI#510) * [Perf] Add portable subgroup reduce_add and reduce_all_add (Genesis-Embodied-AI#511) * [Perf] Add first warmup config to perf dispatch (Genesis-Embodied-AI#422) * [AutoDiff] Autodiff 1: Add baseline adstack regression test for unary_collections 
(Genesis-Embodied-AI#500) * [AutoDiff] Autodiff 2: Implement derivative for tan (Genesis-Embodied-AI#501) * [AutoDiff] Autodiff 3: Recompute tanh/exp on the operand in the reverse pass (Genesis-Embodied-AI#502) * [AutoDiff] Autodiff 4: Mark rsqrt as non-linear for adstack promotion (Genesis-Embodied-AI#503) * [AutoDiff] Autodiff 5: Fix adjoint-alloca placement for GlobalLoads outside the current range-for (Genesis-Embodied-AI#496) * [AutoDiff] Autodiff 6: Adstack regression tests (Genesis-Embodied-AI#491) * [AutoDiff] Autodiff 7: Fix header size in AdStackAllocaStmt to match u64 runtime layout (Genesis-Embodied-AI#534) * [AutoDiff] Autodiff 8: Surface LLVM adstack push/pop overflow as a Python exception (Genesis-Embodied-AI#535) * [AutoDiff] Autodiff 9: Guard against LLVM worker-thread stack overflow from large per-task adstack budget (Genesis-Embodied-AI#495) * [AutoDiff] Autodiff 10: Implement adstack for SPIR-V (Genesis-Embodied-AI#490) * [AutoDiff] Autodiff 11: Latent adstack-adjacent fixes (AMDGPU hipFree, flush() keeps ctx_buffers_, always-preallocate) (Genesis-Embodied-AI#536) * [Doc] Add AGENTS.md with instructions for AI agents (Genesis-Embodied-AI#541) * [Bug] Abort kernel execution on assertion failure instead of segfaulting (Genesis-Embodied-AI#419) * [Type] ndarray typing 1: Add eval_str=True to inspect.signature() calls (Genesis-Embodied-AI#411) * [CI] Suppress reportPrivateImportUsage in torch-using files (Genesis-Embodied-AI#552) * [Misc] QD_DUMP_IR dumps to files with the task_id added to the filename (Genesis-Embodied-AI#441) * [Type] ndarray typing 2: Fix NDArray single-arg subscript crash (Genesis-Embodied-AI#412) * [Test] Flush xdist channel before worker exit so test failure reports are visible (Genesis-Embodied-AI#555) * [CI] Reduce test retries on CI from 3 to 1. 
(Genesis-Embodied-AI#554) * [AutoDiff] Autodiff 12: Heap-backed adstack on LLVM backends (CPU/CUDA/AMDGPU) (Genesis-Embodied-AI#537) * [AutoDiff] Autodiff 13: Heap-backed adstack on SPIR-V backends (Metal, Vulkan) (Genesis-Embodied-AI#493) * [AutoDiff] Autodiff 14: Resolve bounded-inner-loop adstacks without default_ad_stack_size fallback (Genesis-Embodied-AI#539) * [SPIRV] Vulkan SPIR-V correctness: atomic-view aliasing, PSB stride, narrow storage caps, u1 cast, per-init layer recheck (Genesis-Embodied-AI#513) * [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch and sanitise MoltenVK cap advertisement (Genesis-Embodied-AI#551) * [Test] Suppress stock pytest-timeout to avoid conflict with pytest_hardtle (Genesis-Embodied-AI#557) * [Vulkan] Use SDK validation layer for debugPrintf instead of apt package (Genesis-Embodied-AI#562) * [Test] Fix flaky perf_dispatch tests by increasing work amounts (Genesis-Embodied-AI#559) * [Test] Add --maxfail CLI option to run_tests.py (default 20) (Genesis-Embodied-AI#558) * [CI] Vulkan debug printf fix to address flaky tests (Genesis-Embodied-AI#563) * [Docs] Add a new page to help for first time contributors (Genesis-Embodied-AI#426) Authored-by: v01dxyz <v01dxyz@v01d.xyz> * [AutoDiff] Autodiff 16: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr (Genesis-Embodied-AI#543) * Fix: raise error if device memory allocation fails (Genesis-Embodied-AI#451) (Genesis-Embodied-AI#453) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [CI] Add CI job to check line wrapping of comments and docs (Genesis-Embodied-AI#564) * [Misc] Add coverage report to PRs, including kernels (Genesis-Embodied-AI#470) * [CI] CI wrap check feeds only diffs to agent (Genesis-Embodied-AI#567) * Skip 'flaky' test on MacOS CI. 
(Genesis-Embodied-AI#573) * [Test] Fix missing `import sys` in test_fail_device_memory_allocation (Genesis-Embodied-AI#574) * [CI] Fix Vulkan debugPrintf flake with session-scoped warmup (Genesis-Embodied-AI#571) * [AutoDiff] determine_ad_stack_size: replace whole-CFG Bellman-Ford with SCC + DAG DP (Genesis-Embodied-AI#575) * [Test] Fix macOS OOM skip reason to describe actual root cause (Genesis-Embodied-AI#576) * [Lang] whole_kernel_cse: 2.5x compile time speedup on large kernels (Genesis-Embodied-AI#577) * [CI] Add CI check for unnecessarily deleted comments (Genesis-Embodied-AI#570) * [CI] Migrate coverage report to github Check page (Genesis-Embodied-AI#566) * [Lang] Skip IR verifier between passes unless debug=true (Genesis-Embodied-AI#579) * [Lang] Inline AdStack ops on release LLVM codegen: dramatically reduces compile time for adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#584) * [CUDA] Honor offline_cache=False end-to-end so QD_OFFLINE_CACHE=0 actually gives a cold compile (Genesis-Embodied-AI#580) * [Type] Tensor 24 (Genesis-Embodied-AI#561) Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> * [Lang] auto_diff host-walk reductions: dramatically faster front-end compile time on adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#587) * [AutoDiff] Speed up reverse-mode kernel launches on GPU backends (Genesis-Embodied-AI#578) * [Vulkan] Move adstack-sizer scratch out of Function-scope memory to fix SPIR-V pipeline build failures (Genesis-Embodied-AI#588) * [AutoDiff] Improve diagnosis of unsupported reverse-mode AD patterns (Genesis-Embodied-AI#590) * [Bug] Fix: promote Ndarray to AnyArray in build_Name for flattened struct fields (Genesis-Embodied-AI#592) * [SPIR-V] Shrink reverse-grad kernel MSL by ~50% (Genesis-Embodied-AI#591) * [CI] Add CI check that PR changes have test coverage (Genesis-Embodied-AI#596) * [Perf] Enable zero-copy in to_torch() and to_numpy() (Genesis-Embodied-AI#450) * Add 
BufferView: safe sub-range ndarray access for kernels (Genesis-Embodied-AI#585) Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> * [Doc] Add user-facing fastcache documentation (Genesis-Embodied-AI#597) Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> * [Misc] Upgrade to enable v1 dlpack so to_numpy(copy=False) writable (Genesis-Embodied-AI#598) Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local> * [AutoDiff] Cut reverse-mode adstack memory usage 10x on all backends (Genesis-Embodied-AI#599) * [Misc] Add CI check for feature file factorization (Genesis-Embodied-AI#606) * [Perf] Skip _recursive_set_args for all-Field frozen dataclass structs (Genesis-Embodied-AI#607) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] SNode-arm bound-expr capture rejects fold-attack gate indices (Genesis-Embodied-AI#610) * [Misc] Suppress field fastcache warning for qd.Tensor (Genesis-Embodied-AI#615) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] Adstack heap: clip reducer count by per-task loop trip count (compile-time and SizeExpr-evaluated) (Genesis-Embodied-AI#611) * [Misc] Forward copy= through qd.Tensor, add copy=None option (Genesis-Embodied-AI#616) Co-authored-by: Cursor <cursoragent@cursor.com> * [Doc] Update README (Genesis-Embodied-AI#617) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Fix coverage report showing def lines as uncovered (Genesis-Embodied-AI#623) Co-authored-by: Cursor <cursoragent@cursor.com> * [Perf] Generic launcher: persistent context, JIT-pointer reuse, Metal compute encoder, LLVM-GPU async memory ops (Part 1/2) (Genesis-Embodied-AI#619) * [CI] Encode Python-first testing policy in coverage-check prompt (Genesis-Embodied-AI#622) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Add PR Line change report (Genesis-Embodied-AI#624) Co-authored-by: Cursor <cursoragent@cursor.com> 
* [CI] Disable quadrants pytest plugin during quadrants internal coverage runs (Genesis-Embodied-AI#629) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] Adstack load+store eliminations: EliminateRecomputableAdStackPushes pass + leaf extensions (Genesis-Embodied-AI#621) * [CI] Simplify coverage PR comment to a single linked line (Genesis-Embodied-AI#630) * [CUDA] Add AGX Thor, SM_110 (Genesis-Embodied-AI#631) Co-authored-by: Johnny Nunez and Hugh Perkins * [CI] Lines changed report: collapse PR comment to a single linked totals line (Genesis-Embodied-AI#632) * [FEATURE] Support external Metal command queue via qd.init (Genesis-Embodied-AI#618) Co-authored-by: Cursor <cursoragent@cursor.com> * [Perf] Cache adstack-sizer metadata per task across SPIR-V + LLVM-GPU; per-snode / DeviceAllocation invalidation (Part 2/2) (Genesis-Embodied-AI#620) * [AutoDiff] Disable EliminateRecomputableAdStackPushes pending mutated-SNode chain-leaf fix (Genesis-Embodied-AI#633) * [AutoDiff] Adstack chain-clone safety: mutated-SNode leaf reject + load_top consumer-aware guard (Genesis-Embodied-AI#634) * [Docs] Add user-guide page for qd.simt.block.* primitives (Genesis-Embodied-AI#638) * [Docs] Expand qd.simt.subgroup user-guide page to cover every op (Genesis-Embodied-AI#639) * [Perf] Streams 1-4 (Genesis-Embodied-AI#410) * [Docs] Add user-guide page for matrix decompositions and solvers (Genesis-Embodied-AI#643) * [Bug] Revert "[Perf] Streams 1-4 (Genesis-Embodied-AI#410)" (Genesis-Embodied-AI#650) * [Docs] Add user-guide page for atomics and bit operations (Genesis-Embodied-AI#640) * [Docs] Add user-guide page for qd.simt.grid.* primitives (Genesis-Embodied-AI#641) * [AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch (Genesis-Embodied-AI#635) * [AMDGPU] Fix amdgpu parallel rand init (Genesis-Embodied-AI#658) * [Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap (Genesis-Embodied-AI#655) * [Perf] Re-land Streams 1-4 with bug fixes 
(Genesis-Embodied-AI#653) * [AMDGPU] Apply device_memory_GB=0.3 cap to AMDGPU tests (Genesis-Embodied-AI#659) * [Perf] Per-launch host sync: drop wait_idle on SPIR-V, pin stream and drop stream_synchronize on CUDA/AMDGPU (Genesis-Embodied-AI#654) * [AMDGPU] Unload hipModule_t in JITModuleAMDGPU destructor (Genesis-Embodied-AI#660) * [AMDGPU] Trim default mempool on qd.reset() (Genesis-Embodied-AI#669) * [AMDGPU] Hoist rand-state buffer to process lifetime (Genesis-Embodied-AI#668) * [Streams] Use events for streams serialization on AMDGPU and CUDA (Genesis-Embodied-AI#667) * [Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id (Genesis-Embodied-AI#671) * [SPIR-V] dispatch_max_reducers: register each task with the real kernel name (Genesis-Embodied-AI#675) * [AutoDiff] Debug-mode field/grad/dual: dtype, layout, and access-time invariants (Genesis-Embodied-AI#677) * [Docs] Add user-guide page for qd.algorithms.* device-wide algorithms (Genesis-Embodied-AI#642) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [Docs] Doc for existing atomics: switch support table to per-backend columns (Genesis-Embodied-AI#657) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [GPU] Cross gpu atomics (Genesis-Embodied-AI#666) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [GPU] Make block operations portable cross-gpu (Genesis-Embodied-AI#664) * [Perf] CPU LLVM adstack-cache: skip per-launch bump-writes + ndarray_shapes capture on forward-only handles (Genesis-Embodied-AI#685) * [GPU] Cross-GPU for grid ops (Genesis-Embodied-AI#670) * [Math] Make bitop operations portable cross-gpu (Genesis-Embodied-AI#662) * [AMDGPU] Always use wave64, on both RDNA and CDNA (Genesis-Embodied-AI#687) * [AMDGPU] Use syncscope("agent") for atomix xor to avoid CAS livelock (Genesis-Embodied-AI#672) * [GPU] New bit ops for QIPC (Genesis-Embodied-AI#679) * [GPU] Subgroup ops cross-gpu (Genesis-Embodied-AI#665) * [Graph] Rename CUDA 
Graph to Graph in docs (Genesis-Embodied-AI#691) * [SPIR-V] Fix FIFO-queue ordering when sharing command queue. (Genesis-Embodied-AI#694) * [Atomics] New QIPC ops for atomics (Genesis-Embodied-AI#690) * Pass dataclass sub-structs into qd.func (Genesis-Embodied-AI#698) * [AMDGPU] HIP graph runtime support for @qd.kernel(graph=True) (Genesis-Embodied-AI#692) * [CI] Add per-file timing report to Mac Metal test job (Genesis-Embodied-AI#695) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Enable kernel disk cache during tests (Genesis-Embodied-AI#696) * [Math] New QIPC ops for single-threaded linalg (Genesis-Embodied-AI#683) * [BREAKING][GPU] New QIPC ops for subgroups (Genesis-Embodied-AI#676) * [GPU] New QIPC ops for block (Genesis-Embodied-AI#684) * [GPU] New device-level ops for QIPC (Genesis-Embodied-AI#693) * [algorithms] PrefixSumExecutor: drop unused GRID_SZ local (Genesis-Embodied-AI#701) * [block] sync(): fix unsupported-arch error message (Genesis-Embodied-AI#700) * [volatile_load] add qd.volatile_load primitive (closes Genesis-Embodied-AI#648) (Genesis-Embodied-AI#702) * [AutoDiff] Reject recycled identity_key in AdStackCache::register_adstack_sizing_info (Genesis-Embodied-AI#708) * [Vulkan] Declare GroupNonUniform SPIR-V caps and enable shaderSubgroupExtendedTypes (Genesis-Embodied-AI#707) * Fix duplicate HIP graph driver-function declarations after v1.0.0 merge The amd-integration fork had cherry-picked the HIP graph driver functions (graph_create / graph_destroy / graph_add_kernel_node / graph_instantiate / graph_exec_destroy / graph_launch), and upstream v1.0.0 added the same set. The per-file 3-way merge appended both copies into amdgpu_driver_functions.inc.h, producing redeclaration errors that broke the AMDGPU RHI/runtime compile. Drop the upstream duplicate block; the signatures are identical to the fork's existing declarations. 
Co-authored-by: Cursor <cursoragent@cursor.com> * Fix AMDGPU launcher coherence and num_instructions visibility after v1.0.0 merge - kernel_launcher.cpp: the 3-way merge spliced upstream v1.0.0's launch_llvm_kernel rewrite (ephemeral arg/context buffers, explicit-stream path, AmdgpuDefaultStream PinGuard) onto the AMD fork's kernarg-by-value + persistent-scratch design, leaving references to undefined `ephemeral_context_ptr`. Restore the fork's coherent launch_llvm_kernel verbatim; it calls the (already merged) enhanced launch_offloaded_tasks, which keeps the max-reducer dispatch and stream-parallel groups adapted onto the AMD launch path. - llvm_context.h: both the fork and upstream added `num_instructions`; the merge kept upstream's private placement, but the AMDGPU codegen force-inline heuristic calls it statically from outside the class. Move it back to the public section. Co-authored-by: Cursor <cursoragent@cursor.com> * Restore async result D2H and hoist kernarg vectors in AMDGPU launcher The v1.0.0 merge resolution regressed two amd-integration baseline optimizations in launch_llvm_kernel / launch_offloaded_tasks: - The per-launch result-buffer copy was a blocking memcpy_device_to_host, forcing a host stall on every value-returning launch and serializing the GPU pipeline. Restore the async D2H (the caller synchronizes lazily when it needs the value); external-array transfers still stream_synchronize once before reading back. - launch_task constructed the kernarg std::vectors from initializer lists ({kernarg_payload} / {kernarg_size}) on every dispatch (heap alloc + free per launch). Hoist arg_ptrs/arg_sizes out of the per-task launch and reuse. 
Co-authored-by: Cursor <cursoragent@cursor.com> * amdgpu: default to LDS permlane64 emulation; drop host-x86 barrier asm on retarget Two AMDGPU JIT-compile crashes surfaced after the v1.0.0 merge pulled in the QIPC subgroup ops (Genesis-Embodied-AI#676), which made the rigid constraint solver's wave-cooperative reductions route through `amdgpu_cross_half_shuffle_i32`. Both manifested as a SIGSEGV inside `llvm::SIInstrInfo::getInstSizeInBytes` during `JITSessionAMDGPU::compile_module_to_hsaco` (i.e. at first kernel launch), and reproduce on gfx942 / MI300X. Baseline 0.4.6 never emitted these constructs, which is why it was unaffected. 1. Native `llvm.amdgcn.permlane64` lowering crashes the bundled LLVM 22.1.0 AMDGPU backend. Default `amdgpu_permlane64` to the existing LDS-roundtrip software emulation on every target (it produces identical results). Add `QD_AMDGPU_USE_NATIVE_PERMLANE64=1` to opt back into the native instruction once the backend bug is fixed; the old `QD_AMDGPU_FORCE_PERMLANE64_FALLBACK` is now the default and still honored. This is the actual crash fix. 2. The runtime module is compiled by the host x86_64 clang and only retargeted to amdgcn here, so `amdgpu_cross_half_shuffle_i32`'s `__asm__ volatile("" : "+v"(byte))` optimization barrier carries x86 flag clobbers (`~{dirflag},~{fpsr},~{flags}`) that are meaningless on AMDGPU. The IR verifies but the empty-body INLINEASM is invalid on the amdgcn target. Neutralize empty-body barrier asm during retarget (forward the tied value, then erase) so no stale host asm reaches codegen. On the wave64 targets we ship `ds_bpermute` already addresses the full wave, so the hint is a no-op. Co-authored-by: Cursor <cursoragent@cursor.com> * style: apply clang-format (v19.1.7) to AMDGPU fn_attrs and launcher sources CI pre-commit's clang-format hook reformatted these files (long declarations/lambda signatures collapsed onto single lines per the repo's clang-format config). Apply the same formatting so the hook passes. 
No functional changes. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(amdgpu): use CreateNeg for branchless i32 sgn instead of CreateSub(0, input) clang-tidy (modernize-use-nullptr, -warnings-as-errors) flagged `builder->CreateSub(0, input)` in the i32 sgn path: the literal `0` binds to the `llvm::Value*` LHS parameter as a null pointer, not an integer zero. Replace with `builder->CreateNeg(input)`, which emits `0 - input` with a proper zero constant -- identical intended semantics, and clang-tidy clean. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Robert Dazi <14996868+v01dXYZ@users.noreply.github.com> Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> Co-authored-by: Alexis DUBURCQ <alexis.duburcq@gmail.com> Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com> Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Johnny <johnnynuca14@gmail.com>

alanray-tech and others added 16 commits April 1, 2026 16:42

Merge branch 'main' into feature/buffer-view

efd8ab9

Doc: clarify that BufferView dtype annotation is optional

dbb57c9

Remove @qd.func doc section, fix buffer_view module leak in __all__, add BufferView to test_api expected list

b28bea2

Fix @claude review findings: stale count refs, dtype validation, dead branch, reversed-slice clamp

1fcc336

Merge branch 'main' into feature/buffer-view

7c70144

Validate offset/size bounds in BufferView constructor on host-side creation

aaf511b

Harden BufferView: numpy-int bounds check, 1D assertion in constructor, clarify func dtype validation

b3d4fe5

Coerce numpy integers to int in BufferView constructor

6328dc2

Simplify BufferView init: check for Expr instead of enumerating numeric types

375e617

Remove unused numpy import from buffer_view.py

dd50434

Add subview() and BufferView slice support, update docs with slicing chain

6393a78

Add subview, kernel slice, device-side bounds assertions, and 40 tests

412cb1f

alanray-tech mentioned this pull request Apr 28, 2026

Add BufferView: safe sub-range ndarray access for kernels #445

Closed

chatgpt-codex-connector Bot reviewed Apr 28, 2026

Comment thread python/quadrants/lang/buffer_view.py Outdated

Comment thread python/quadrants/lang/impl.py

hughperkins and others added 2 commits April 28, 2026 05:28

Merge branch 'main' into feature/buffer-view

03e4d02

Kernel-side BufferView bounds assertion via ExternalTensorShapeAlongAxis, reject slice step!=1

6d7355f

alanray-tech commented Apr 28, 2026

Comment thread python/quadrants/lang/buffer_view.py

Comment thread python/quadrants/lang/impl.py

Fix line wrapping: merge short-wrapped docstrings and comments to 120-char width

97ba51b

Rename count->size in function_def_transformer, add Matrix/Vector ndarray kernel tests

19fee47

alanray-tech commented Apr 30, 2026

Merge main into feature/buffer-view

6f780b0

alanray-tech commented Apr 30, 2026
