Add BufferView: safe sub-range ndarray access for kernels #585
BufferView provides a safe, zero-copy sub-range view into an ndarray for kernel arguments. It rewrites view[i] to arr[offset + i] at AST-translation time, with no IR changes. In debug mode, it inserts runtime bounds assertions with full callstack diagnostics (kernel name, thread ID, and file:line per frame). A view can be passed directly as a kernel parameter via qd.types.buffer_view(dtype), which auto-decomposes into (ndarray, offset, count) at compile time. Minor: also improves the boundary enum error message to list the valid options.
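As a rough illustration of the index rewrite described above, the following pure-Python sketch models a view whose element access is redirected to `arr[offset + i]`, with an optional debug-only bounds check. `SketchView` and its `debug` flag are hypothetical names for this example; the real BufferView performs the rewrite at AST-translation time inside kernels, which this host-side model does not attempt to reproduce.

```python
class SketchView:
    """Pure-Python stand-in for a zero-copy sub-range view (not the real BufferView)."""

    def __init__(self, arr, offset, size, debug=True):
        self.arr = arr        # shared backing storage, never copied
        self.offset = offset  # first element of the backing storage covered by the view
        self.size = size      # number of elements covered by the view
        self.debug = debug    # assumption: models "debug mode" as a plain flag

    def _check(self, i):
        # Debug-only guard, mirroring the idea of a runtime bounds assertion.
        if self.debug and not 0 <= i < self.size:
            raise IndexError(
                f"index {i} out of range for view (offset={self.offset}, size={self.size})"
            )

    def __getitem__(self, i):
        self._check(i)
        return self.arr[self.offset + i]   # view[i] -> arr[offset + i]

    def __setitem__(self, i, value):
        self._check(i)
        self.arr[self.offset + i] = value


data = list(range(32))
view = SketchView(data, offset=8, size=4)
view[0] = -1   # writes through to data[8]; no copy is made
```

Because the view only stores a reference plus two integers, writes land in the original storage, which is the property the description calls zero-copy.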
…tests
- BufferView[dtype] square-bracket annotation replaces qd.types.buffer_view() call-style; removes buffer_view alias from buffer_view_type.py
- Add BufferView.__class_getitem__ enabling v: BufferView[qd.f32] annotation
- Add BufferViewType branch to _transform_func_arg so @qd.func accepts BufferView[dtype] annotations directly, no qd.template() workaround needed
- Add ndarray slice syntax: data[:16] returns BufferView(data, 0, 16). Handles start/stop/negative indices; step must be 1; 1D arrays only. Covers ScalarNdarray, VectorNdarray, MatrixNdarray.
- Add BoundaryMode to types/enums.py __all__
- Add docs/source/user_guide/buffer_view.md
- Add tests/python/test_buffer_view.py: 26 tests covering slice syntax, kernel annotation, @qd.func support, debug OOB with callstack on x64/cuda
- black -l 120: collapse single-element raise() calls to one line (_ndarray.py, function_def_transformer.py)
- Add # pylint: disable=C0415 # noqa: I001 to local BufferView import in _ndarray.py (import-outside-toplevel is intentional to avoid the impl.py -> buffer_view.py -> _ndarray.py circular dep)
- Remove erroneous _slice_to_buffer_view guard from MatrixField.__getitem__ (MatrixField does not inherit from Ndarray; only MatrixNdarray and VectorNdarray need the slice guard)
…write
- Rename BufferView.count -> BufferView.size (per duburcqa: aligns with torch)
- Add BufferView.shape property returning (size,) tuple
- Make dtype optional in annotation: v: BufferView infers dtype from the passed ndarray at compile time (add elif annotation is BufferView branch in check_parameter_annotations, converts to BufferViewType())
- Update @qd.func section in docs: annotation is optional on qd.funcs
- Rewrite buffer_view.md: simpler intro, no line wrapping, remove duplicate limitations section, update count->size throughout
- Add test_no_dtype_annotation to verify dtype-optional path
- Update all v.count -> v.size in tests and debug messages (offset=, size=)
…add BufferView to test_api expected list
… branch, reversed-slice clamp
…r, clarify func dtype validation
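The slice syntax mentioned in the commit notes above (data[:16] becoming a view with offset 0 and size 16, negative indices handled, step restricted to 1, reversed slices clamped) can be sketched with Python's built-in `slice.indices`. The helper name `slice_to_offset_size` is hypothetical, and clamping out-of-range stops to the array length is an assumption of this sketch rather than a confirmed detail of the PR.

```python
def slice_to_offset_size(key, length):
    """Convert a step-1 slice over a 1D array of `length` items into (offset, size)."""
    if not isinstance(key, slice):
        raise TypeError("expected a slice")
    if key.step not in (None, 1):
        raise ValueError("slice step must be 1")
    # slice.indices resolves None and negative bounds against the array length.
    start, stop, _ = key.indices(length)
    # A reversed slice such as [8:4] yields an empty view instead of a negative size.
    return start, max(stop - start, 0)


first_16 = slice_to_offset_size(slice(None, 16), 64)
last_16 = slice_to_offset_size(slice(-16, None), 64)
reversed_slice = slice_to_offset_size(slice(8, 4), 64)
```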
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 412cb1f4f6
Coverage Report (
| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/__init__.py | 0% | 7 |
| 🟢 python/quadrants/lang/_func_base.py | 92% | 38 |
| 🟢 python/quadrants/lang/_ndarray.py | 91% | 266 |
| 🟢 python/quadrants/lang/_template_mapper_hotpath.py | 92% | 39 |
| 🔴 python/quadrants/lang/ast/ast_transformers/function_def_transformer.py | 79% | 24,31,226 |
| 🔴 python/quadrants/lang/buffer_view.py | 67% | 3,5-9,11,14,28,35,49-56,58-62,65,70,94,96-97,100,103,114,123-124,128,143-144,173,188-189,230,235 |
| 🟢 python/quadrants/lang/impl.py | 83% | 21,259 |
| 🔴 python/quadrants/lang/matrix.py | 75% | 1717 |
| 🔴 python/quadrants/types/__init__.py | 0% | 13 |
| 🔴 python/quadrants/types/buffer_view_type.py | 21% | 1,4,25,30-34,36-37,40 |
| 🔴 python/quadrants/types/enums.py | 0% | 15,47 |
| 🟢 tests/python/test_buffer_view.py | 99% | 366-367,468 |
Diff coverage: 88% · Overall: 61% · 549 lines, 68 missing
alanray-tech
left a comment
Inline comments for the two codex review fixes: kernel-side bounds assertion and slice step rejection.
|
Note: some comment underwraps to address: https://github.com/Genesis-Embodied-AI/quadrants/actions/runs/25045033191/job/73357813381 |
Coverage Report (
| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/__init__.py | 0% | 7 |
| 🟢 python/quadrants/lang/_func_base.py | 92% | 38 |
| 🟢 python/quadrants/lang/_ndarray.py | 91% | 266 |
| 🟢 python/quadrants/lang/_template_mapper_hotpath.py | 92% | 39 |
| 🔴 python/quadrants/lang/ast/ast_transformers/function_def_transformer.py | 79% | 24,31,226 |
| 🔴 python/quadrants/lang/buffer_view.py | 70% | 3,5-9,11,14,28,35,49-56,58-62,65,70,94,96-97,100,103,124,133-134,138,153-154,183,198-199,240,245 |
| 🟢 python/quadrants/lang/impl.py | 86% | 21,259 |
| 🔴 python/quadrants/lang/matrix.py | 75% | 1717 |
| 🔴 python/quadrants/types/__init__.py | 0% | 13 |
| 🔴 python/quadrants/types/buffer_view_type.py | 21% | 1,4,24,29-33,35-36,39 |
| 🔴 python/quadrants/types/enums.py | 0% | 15,47 |
| 🟢 tests/python/test_buffer_view.py | 98% | 367-369,382-383,484,500 |
Diff coverage: 88% · Overall: 73% · 584 lines, 72 missing
|
Once you are ready-ish to merge, could you run Genesis benchmarks and unit-tests please. There are three approaches that I know of to do this:
|
I think 3. is good enough.
|
If you configure slack, it'll slack you the unit test stats, and an image of the benchmark results
|
|
Yes, please submit a pr to improve this script :) thank you
…On Wed, Apr 29, 2026, 07:19 alanray-tech ***@***.***> wrote:
*alanray-tech* left a comment (Genesis-Embodied-AI/quadrants#585)
<#585 (comment)>
@hughperkins <https://github.com/hughperkins> I ran into a small issue running `unit_tests_cluster.py` on the cluster. The script writes `salloc` output to a hardcoded path /tmp/salloc_test.txt, which is currently owned by your user (`hugh:hugh`), so my user gets `Permission denied` when trying to write to it.
Could you delete it when you get a chance?
    rm /tmp/salloc_test.txt
Alternatively, I could patch the script to use a per-user filename like /tmp/salloc_test_.txt; happy to open a quick PR on cmp-tooling if you prefer that.
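One way the per-user filename proposed above could look, as a minimal sketch only: the function name `per_user_salloc_path` is hypothetical, and the actual `unit_tests_cluster.py` script is not shown in this thread, so how it opens the file is an assumption.

```python
import getpass
import os
import tempfile


def per_user_salloc_path(user=None):
    """Build a scratch-file path that embeds the user name to avoid ownership clashes."""
    # Fall back to the current login name when no user is given explicitly.
    user = user or getpass.getuser()
    return os.path.join(tempfile.gettempdir(), f"salloc_test_{user}.txt")


alice_path = per_user_salloc_path("alice")
bob_path = per_user_salloc_path("bob")
```

Since each user writes to a distinct file, a stale file owned by someone else no longer causes a `Permission denied` error.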
|
Genesis Validation Results
Unit Tests
Result: 582 passed, 2 skipped, 2 xfailed, 0 failed. The 2 skips are pre-existing (
Speed Benchmarks
32 benchmarks passed. Branch results vs WandB main baseline (selected high-traffic envs):
All deltas are within normal run-to-run variance (< 5%). No performance regressions. |
…rray kernel tests
alanray-tech
left a comment
Final comprehensive review: inline comments on every changed file documenting the design intent and rationale for each modification. 15 files changed, 46 tests, 0 regressions in Genesis benchmarks (ndarray + field).
| impl.qd_assert((offset_expr >= Expr(0)).ptr, msg, args, dbg_info) | ||
| impl.qd_assert((size_expr >= Expr(0)).ptr, msg, args, dbg_info) | ||
| impl.qd_assert(((offset_expr + size_expr) <= arr_len).ptr, msg, args, dbg_info) | ||
| else: |
Two construction paths gated by impl.inside_kernel():
Host-side: int() coercion + 1D shape check + bounds validation (offset >= 0, size >= 0, offset+size <= len).
Kernel-compilation: Debug-mode assertions via ExternalTensorShapeAlongAxisStmt (same IR instruction as C++ CheckOutOfBound pass). Validates offset >= 0, size >= 0, offset+size <= ndarray_length at device runtime.
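The host-side checks listed above (int coercion, then offset >= 0, size >= 0, offset + size <= len) can be sketched as a small validation helper. `validate_view_bounds` is a hypothetical name, and the exception types are assumptions; the real constructor may raise different errors and also performs a 1D shape check that this sketch omits.

```python
def validate_view_bounds(length, offset, size):
    """Return (offset, size) as ints after checking they fit inside `length` items."""
    offset, size = int(offset), int(size)   # coerce e.g. NumPy integers to plain ints
    if offset < 0:
        raise ValueError(f"offset must be >= 0, got {offset}")
    if size < 0:
        raise ValueError(f"size must be >= 0, got {size}")
    if offset + size > length:
        raise ValueError(
            f"view [{offset}, {offset + size}) exceeds array length {length}"
        )
    return offset, size


ok = validate_view_bounds(32, 16, 16)   # exactly reaches the end of the array
```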
|
@hughperkins Merged latest main. I believe this PR is ready to merge once CI passes:
Let me know if anything else is needed. |
alanray-tech
left a comment
Corrected inline review with accurate line references (post-merge). Documents design intent for every changed file.
| _is_quadrants_class = True | ||
|
|
||
| @classmethod | ||
| def __class_getitem__(cls, dtype): |
BufferView[qd.f32] -> BufferViewType(qd.f32). Lets BufferView serve as both runtime object and type annotation, consistent with Python generics (list[int]). The entire downstream pipeline (_func_base, _template_mapper_hotpath, function_def_transformer) recognizes BufferViewType instances -- no downstream changes needed.
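To illustrate how one class can act as both a runtime object and a subscriptable annotation, here is a minimal sketch using `__class_getitem__`. `AnnotatedView` and `AnnotatedViewType` are hypothetical stand-ins for BufferView and BufferViewType, and `float` stands in for a dtype such as qd.f32.

```python
class AnnotatedViewType:
    """Type descriptor produced by subscripting the view class (stand-in for BufferViewType)."""

    def __init__(self, dtype=None):
        self.dtype = dtype


class AnnotatedView:
    """Runtime object that is also usable as `AnnotatedView[dtype]` in annotations."""

    def __class_getitem__(cls, dtype):
        # Python treats __class_getitem__ as a classmethod implicitly.
        return AnnotatedViewType(dtype)

    def __init__(self, arr, offset, size):
        self.arr, self.offset, self.size = arr, offset, size


annotation = AnnotatedView[float]            # descriptor, usable as a parameter annotation
instance = AnnotatedView([1.0, 2.0, 3.0], 0, 2)  # ordinary runtime construction still works
```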
| return BufferViewType(*dtype) | ||
| return BufferViewType(dtype) | ||
|
|
||
| def __init__(self, arr, offset, size): |
Two construction paths gated by impl.inside_kernel():
Host-side: int() coercion + 1D shape check + bounds validation.
Kernel-compilation: Debug-mode assertions via ExternalTensorShapeAlongAxisStmt (same IR instruction as C++ CheckOutOfBound pass) -- validates offset >= 0, size >= 0, offset+size <= ndarray_length at device runtime.
| """Returns the shape of this view as a tuple, e.g. ``(16,)``.""" | ||
| return (self.size,) | ||
|
|
||
| def subview(self, offset, size): |
subview() creates a narrower view within an existing view. Host path validates against self.size. Kernel path delegates to _subview_expr() which inserts three qd_assert calls (offset >= 0, size >= 0, offset+size <= parent_size) in debug mode.
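The offset composition behind subview() can be sketched on the host side as follows. `NestedView` is a hypothetical class for this example; it validates against the parent's size and adds the parent's offset, which matches the `new_offset = Expr(self.offset) + offset_expr` line in the diff but does not model the kernel-side assertion path.

```python
class NestedView:
    """Host-side model of a view that can be narrowed into sub-views."""

    def __init__(self, arr, offset, size):
        self.arr, self.offset, self.size = arr, offset, size

    def subview(self, offset, size):
        # The three checks named in the comment: offset >= 0, size >= 0,
        # and offset + size <= parent size.
        if offset < 0 or size < 0 or offset + size > self.size:
            raise IndexError(
                f"subview(offset={offset}, size={size}) exceeds parent size {self.size}"
            )
        # Offsets compose additively; the backing array is shared, not copied.
        return NestedView(self.arr, self.offset + offset, size)

    def __getitem__(self, i):
        return self.arr[self.offset + i]


data = list(range(64))
outer = NestedView(data, 16, 32)   # covers data[16:48]
inner = outer.subview(4, 8)        # covers data[20:28]
```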
| new_offset = Expr(self.offset) + offset_expr | ||
| return BufferView(self.arr, new_offset, size_expr) | ||
|
|
||
| def __getitem__(self, key): |
__getitem__ is only reachable on the host. Inside kernels, v[...] is intercepted by the AST transformer and dispatched through impl.subscript() -> subscript() (index) or subview() (slice). The assert guards against accidental future routing changes.
| return self.subview(start, max(stop - start, 0)) | ||
|
|
||
| @quadrants_scope | ||
| def subscript(self, *indices): |
_build_callstack is called at compile time. The callstack string is frozen as a literal in the compiled IR assertion message. Dynamic values (tid, index, offset, size) are injected at runtime via %d format args -- zero runtime string-building cost.
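The split between a compile-time callstack literal and runtime `%d` arguments can be modelled in plain Python. The message layout, the function names `build_assert_message` and `report`, and the frame tuples below are all hypothetical; only the idea of freezing the static part once and substituting integers later comes from the comment above.

```python
def build_assert_message(kernel_name, frames):
    """Build the message template once; only the %d slots vary at runtime."""
    # Static part: rendered a single time, analogous to a compile-time literal.
    # Note: this sketch assumes file and function names contain no '%' characters.
    callstack = "\n".join(f"  at {file}:{line} in {func}" for file, line, func in frames)
    return (
        f"BufferView out of bounds in kernel '{kernel_name}': "
        f"tid=%d index=%d offset=%d size=%d\n{callstack}"
    )


TEMPLATE = build_assert_message("fill", [("demo.py", 12, "fill"), ("demo.py", 30, "main")])


def report(tid, index, offset, size):
    # Runtime part: plain integer substitution into the frozen template.
    return TEMPLATE % (tid, index, offset, size)


message = report(3, 16, 0, 16)
```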
|
|
||
| indices_expr_group = None | ||
| if has_slice: | ||
| if isinstance(value, BufferView): |
Kernel-side v[4:8] slice: AST transformer's build_Subscript -> build_Slice -> impl.subscript(). Converts slice(start, stop) to subview(start, stop-start), inheriting debug-mode bounds assertions from _subview_expr(). Step != 1 is rejected.
| @python_scope | ||
| def __getitem__(self, key): | ||
| if isinstance(key, slice): | ||
| return self._slice_to_buffer_view(key) |
Slice guard for MatrixNdarray.__getitem__. Without this, mat_arr[:16] reaches tuple(key) -> TypeError: 'slice' object is not iterable. Same guard on both VectorNdarray.__getitem__ overrides.
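The failure mode described above is easy to reproduce in plain Python: calling `tuple()` on a slice raises `TypeError` because slices are not iterable. The two functions below are hypothetical stand-ins for an unguarded and a guarded `__getitem__`; the returned `("view", offset, size)` tuple merely represents a view object.

```python
def unguarded_getitem(data, key):
    """Mimics an indexing path that assumes `key` is a tuple of indices."""
    indices = tuple(key)   # raises TypeError when key is a slice
    return data[indices[0]]


def guarded_getitem(data, key):
    """Same path, but slices are intercepted first and turned into a view description."""
    if isinstance(key, slice):
        if key.step not in (None, 1):
            raise ValueError("slice step must be 1")
        start, stop, _ = key.indices(len(data))
        return ("view", start, max(stop - start, 0))
    return unguarded_getitem(data, key)


data = list(range(32))
view = guarded_getitem(data, slice(None, 16))
element = guarded_getitem(data, (3,))
```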
| not in [ | ||
| "any_array", | ||
| "ast", | ||
| "buffer_view", |
buffer_view excluded from __all__ to prevent the submodule name from leaking into the public API. BufferView (the class) is exported via from quadrants.lang.buffer_view import *; the module itself should not be user-visible.
|
|
||
|
|
||
| __all__ = ["Layout", "AutodiffMode", "SNodeGradType", "Format", "DeviceCapability"] | ||
| __all__ = ["Layout", "AutodiffMode", "SNodeGradType", "Format", "BoundaryMode", "DeviceCapability"] |
BoundaryMode was defined but absent from __all__. Pre-existing issue, fixed since BufferView uses BoundaryMode for boundary checking.
| user_api[qd] = [ | ||
| "Backend", | ||
| "BitpackedFields", | ||
| "BufferView", |
BufferView registered as a new public API symbol. test_api checks qd.__all__ against this hardcoded list.
|
@claude review |
|
Note: for claude review, whilst I think we should address 🔴 comments, I'll leave it to your own discretion whether you wish to address 🟡 comments. |
…ify kernel slice step error
|
Addressed all three @claude review findings in 10b1065:
|
Coverage Report (
| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/__init__.py | 0% | 7 |
| 🟢 python/quadrants/lang/_func_base.py | 92% | 42 |
| 🟢 python/quadrants/lang/_ndarray.py | 91% | 391 |
| 🟢 python/quadrants/lang/_template_mapper_hotpath.py | 92% | 43 |
| 🔴 python/quadrants/lang/ast/ast_transformers/function_def_transformer.py | 79% | 29,36,351 |
| 🔴 python/quadrants/lang/buffer_view.py | 76% | 3,5-10,12,14,17,31,38,51,59,83,85-86,89,92,113,122-123,127,142-143,172,187-188,229,234 |
| 🟢 python/quadrants/lang/impl.py | 86% | 21,271 |
| 🟢 python/quadrants/lang/matrix.py | 100% | |
| 🔴 python/quadrants/types/__init__.py | 0% | 13 |
| 🔴 python/quadrants/types/buffer_view_type.py | 33% | 1,4,23,28-29,32 |
| 🔴 python/quadrants/types/enums.py | 0% | 15,47 |
| 🟢 tests/python/test_buffer_view.py | 98% | 420-422,435-436,537,553 |
Diff coverage: 91% · Overall: 74% · 604 lines, 55 missing
Coverage Report (
| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/__init__.py | 0% | 7 |
| 🟢 python/quadrants/lang/_func_base.py | 92% | 42 |
| 🟢 python/quadrants/lang/_ndarray.py | 91% | 391 |
| 🟢 python/quadrants/lang/_template_mapper_hotpath.py | 92% | 43 |
| 🔴 python/quadrants/lang/ast/ast_transformers/function_def_transformer.py | 79% | 29,36,351 |
| 🔴 python/quadrants/lang/buffer_view.py | 76% | 3,5-10,12,14,17,31,38,51,59,83,85-86,89,92,113,122-123,127,142-143,172,187-188,229,234 |
| 🟢 python/quadrants/lang/impl.py | 86% | 21,271 |
| 🟢 python/quadrants/lang/matrix.py | 100% | |
| 🔴 python/quadrants/types/__init__.py | 0% | 13 |
| 🔴 python/quadrants/types/buffer_view_type.py | 25% | 1,4,23,27-28,31 |
| 🔴 python/quadrants/types/enums.py | 0% | 15,47 |
| 🟢 tests/python/test_buffer_view.py | 98% | 420-422,435-436,537,553 |
Diff coverage: 91% · Overall: 67% · 603 lines, 55 missing
|
Addressed @claude review (7c61de7): 1. 2. |
…eeds_grad ndarrays
|
Fixed both @claude findings in 7cc3e3f: 1. Launch-context cache id-reuse bug ( 2. Silent gradient drop ( |
| if needed_arg_basetype is buffer_view_type.BufferViewType and isinstance(v, BufferViewInstance): | ||
| inner = v.get_ndarray() | ||
| assert isinstance(inner, Ndarray) | ||
| launch_ctx_buffer[_QD_ARRAY].append((index, inner.arr)) | ||
| launch_ctx_buffer[_INT].append((index + 1, int(v.offset))) | ||
| launch_ctx_buffer[_INT].append((index + 2, int(v.size))) | ||
| return 3, False |
🔴 🟡 The frozen-dataclass fast path at _func_base.py:626-650 discards the cacheability flag from each child's recursive call (line 637 destructures with _) and returns a hardcoded True (line 650), overriding the (3, False) that the BufferView branch correctly returns at line 685. So a frozen dataclass containing a BufferView field reintroduces the same id()-reuse silent corruption that commit 7cc3e3f closed at top level: fill(MyStruct(v=data[:16])) followed by fill(MyStruct(v=data[16:32])) can replay stale offset/size from the launch_ctx cache and write the wrong region. This is a pre-existing gap in the frozen-DC fast path (introduced in PR #561), but BufferView is the first user-facing arg type that documents transient-instance construction (data[:16]) where cacheability=False is correctness-critical. Fix is mechanical: mirror the non-frozen branch at lines 664-678 — capture each child's cacheability and AND-aggregate into a returned bool.
Extended reasoning...
What the bug is
The frozen-dataclass fast path in FuncBase._recursive_set_args discards the is_launch_ctx_cacheable flag from every recursive child call and unconditionally tells the caller the dataclass is cacheable. The non-frozen branch immediately below does this correctly — it AND-aggregates each child's cacheability and returns the result.
# python/quadrants/lang/_func_base.py:626-650 (frozen-DC, BUGGY)
if is_frozen:
    plan = _get_frozen_dc_plan(...)
    unwrapped = _get_frozen_dc_unwrapped(v, needed_arg_fields)
    for field_name, field_full_name, field_type in plan:
        field_value = unwrapped[field_name]
        num_args_, _ = FuncBase._recursive_set_args(...)  # ← cacheability discarded
        idx += num_args_
    return idx, True  # ← hardcoded True

# python/quadrants/lang/_func_base.py:651-678 (non-frozen, CORRECT pattern)
is_launch_ctx_cacheable = False  # (separately buggy; should be True, but irrelevant here)
for field in needed_arg_fields.values():
    ...
    num_args_, is_launch_ctx_cacheable_ = FuncBase._recursive_set_args(...)
    idx += num_args_
    is_launch_ctx_cacheable &= is_launch_ctx_cacheable_
return idx, is_launch_ctx_cacheable

The BufferView branch at line 685 correctly returns (3, False). This is the fix from commit 7cc3e3f that closed the top-level id()-reuse bug for transient slices like data[:16]. But when a BufferView is wrapped in a frozen dataclass, that False is silently overridden back to True by line 650.
How it manifests
@dataclass(frozen=True)
class MyStruct:
    v: BufferView[qd.f32]

@qd.kernel
def fill(s: MyStruct):
    for i in range(s.v.size):
        s.v[i] = 1.0

data = qd.ndarray(qd.f32, shape=(64,))
fill(MyStruct(v=data[:16]))    # call 1: caches launch_ctx with offset=0, size=16
# MyStruct + BufferView GC'd; pymalloc returns slot to LIFO free list
fill(MyStruct(v=data[16:32]))  # call 2: NEW MyStruct lands at SAME id()
# args_hash = (id(t_kernel), id(MyStruct)) → cache hit
# populate_launch_ctx_from_cache copies STALE offset=0, size=16
# kernel writes data[0:16] AGAIN, data[16:32] untouched

Step-by-step:
1. kernel.py:461 computes args_hash = (id(t_kernel), *[id(arg) for arg in args]), which uses the Python id() of each top-level arg, including the transient MyStruct wrapper.
2. After call 1, MyStruct and its BufferView field are GC'd. CPython pymalloc puts the slot back on a per-size-class LIFO free list.
3. Call 2 allocates a new MyStruct at the same address (deterministic on CPython for same-size-class objects).
4. _recursive_set_args runs the frozen-DC fast path; at line 637 it gets (3, False) from the BufferView child but discards the False.
5. Line 650 returns True → top-level caller treats the cache entry as valid.
6. populate_launch_ctx_from_cache (kernel.py:178) copies the cached launch_ctx wholesale, including the stale offset=0, size=16 values that launch_ctx.set_args_int baked in during call 1.
7. Kernel runs with stale offset/size and writes data[0:16] instead of data[16:32].
Expected: data[0:32] all 1.0. Actual: data[0:16] all 1.0, data[16:32] untouched. No exception.
Why existing safeguards don't prevent it
- The line-685 return 3, False fix from 7cc3e3f covers top-level BufferView args, but only because the top-level dispatch in Kernel.__call__ reads the returned cacheability directly. The frozen-DC parent intercepts and discards that signal one wrapping level deeper.
- LaunchContextBufferCache weakref-evict callbacks (kernel.py:122-153) only track _QD_ARRAY entries, which point at the long-lived inner.arr (the backing ndarray), not the transient wrapper. So GC of the wrapper does not invalidate the cache entry.
- The 40 BufferView tests in tests/python/test_buffer_view.py do not exercise frozen-DC + BufferView. test_multiple_views passes views in a single call (both alive at once, no id collision), not in sequential calls.
- The _template_mapper_hotpath cache key for BufferView at line 152 is (element_type, ndim, False, boundary) with no offset/size, so two slices like data[:16] and data[16:32] produce the same template-cache entry. That's correct for template caching but means they collide on the post-template launch-context cache too.
Pre-existing nature
This gap was introduced in PR #561 (the frozen-DC fast path). It would also affect frozen DCs with float fields (line 609 returns 1, False), large-int fields (line 619 conditional), and external arrays — any leaf that returns False. BufferView is the most user-facing new participant because the user guide explicitly documents data[:16] (transient slice) as the idiomatic usage pattern, and a user wrapping that in a @dataclass(frozen=True) is plausible.
The fix lives entirely in pre-existing code (lines 626-650 are not touched by this PR), so this technically meets the bar for pre_existing severity. I'm filing it as normal because:
- The user-facing surface area materially expanded with this PR — BufferView is the first arg type whose documented idiom produces transient instances, and a user wrapping that pattern in a frozen DC is plausible.
- Silent data corruption with a documented usage pattern warrants pre-merge attention rather than a follow-up.
- The fix is mechanical (~3 lines) and matches an existing precedent already in the file.
Maintainers may reasonably disagree and prefer to land this and address the frozen-DC gap comprehensively in a follow-up that also covers float/int/external-array participants.
How to fix
Mirror the non-frozen branch's pattern (lines 664-678). Capture each child's cacheability, AND-aggregate into a running flag, return that instead of hardcoded True:
if is_frozen:
    plan = _get_frozen_dc_plan(...)
    unwrapped = _get_frozen_dc_unwrapped(v, needed_arg_fields)
    is_launch_ctx_cacheable = True  # ← initialize True for AND-aggregation
    for field_name, field_full_name, field_type in plan:
        field_value = unwrapped[field_name]
        num_args_, is_launch_ctx_cacheable_ = FuncBase._recursive_set_args(...)
        idx += num_args_
        is_launch_ctx_cacheable &= is_launch_ctx_cacheable_
    return idx, is_launch_ctx_cacheable

Note: the non-frozen branch at line 652 initializes is_launch_ctx_cacheable = False, which makes False & x = False always. The non-frozen path is over-conservative (no cache hits), while the frozen path is over-optimistic (stale cache hits). The frozen-path fix above uses the correct True initializer; the non-frozen path's False initializer is a separate pre-existing bug worth addressing in the same patch but doesn't affect this verdict.
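The AND-aggregation pattern proposed in the fix can be exercised with a small self-contained model. Everything here is hypothetical: dicts stand in for frozen dataclasses, a `("view", offset, size)` tuple stands in for a BufferView, and other leaves are treated as cacheable single arguments, which simplifies the real rules in `_recursive_set_args`.

```python
def set_args_model(value):
    """Return (num_args, cacheable) for a nested argument in this simplified model."""
    if isinstance(value, dict):                 # stands in for a frozen dataclass
        count, cacheable = 0, True              # True is the identity for AND
        for child in value.values():
            n, child_cacheable = set_args_model(child)
            count += n
            cacheable &= child_cacheable        # one non-cacheable child poisons the parent
        return count, cacheable
    if isinstance(value, tuple) and value and value[0] == "view":
        return 3, False                         # (array, offset, size); never cacheable
    return 1, True                              # simplification: other leaves are cacheable


with_view = set_args_model({"a": 1, "v": ("view", 0, 16)})
without_view = set_args_model({"a": 1, "b": 2})
nested = set_args_model({"s": {"v": ("view", 16, 16)}})
```

Starting the flag at `True` matters: starting at `False` would make every aggregate non-cacheable, which is the over-conservative behaviour the note attributes to the non-frozen branch.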
|
Reminder: no obligation to handle the 🟡 . And every time you handle a 🟡 you risk getting a new 🔴 :). Just handle the 🟡 if you want to (if you think it's going to be useful to you, in the short-term). You can always make newer PRs later. |
|
Addressed 2 of 3 @claude findings in ffd7334: Doc: Nested Re: frozen dataclass cache override ( @hughperkins Should this be tracked as a separate issue? The fix would be to propagate the cacheability flag in the frozen-DC fast path, but that changes behavior for all frozen dataclass args, not just BufferView. |
|
For context on the launch-ctx cache issue, here's a concrete example of the id-reuse bug that

data = qd.ndarray(qd.f32, shape=(32,))
data.from_numpy(np.zeros(32, dtype=np.float32))

@qd.kernel
def fill(v: BufferView[qd.f32]):
    for i in range(v.size):
        v[i] = 1.0

# Call 1
fill(data[:16])    # temp BufferView(data, 0, 16), id=0xABC
# cache miss -> set launch_ctx: arr=data, offset=0, size=16
# cache stores: key=id(0xABC) -> launch_ctx
# executes: data[0:16] = 1.0 (correct)
# call returns, temp object has no refs -> GC frees 0xABC

# Call 2
fill(data[16:32])  # new BufferView(data, 16, 16)
# CPython pymalloc reuses address -> id=0xABC (same as call 1!)
# cache hit -> reuses old launch_ctx: offset=0, size=16
# executes: data[0:16] = 1.0 again (WRONG! should write data[16:32])

# Expected: all 32 elements = 1.0
# Actual: data[0:16] = 1.0, data[16:32] = 0.0 (second call wrote to wrong region)

The root cause is that The frozen-dataclass fast path ( |
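Because real `id()` reuse depends on the CPython allocator and is awkward to reproduce reliably, the sketch below models the cache with an explicit identity key instead. `LaunchCacheModel` and its `respect_cacheable` switch are hypothetical; the model only shows why a cache keyed on object identity must not store entries for transient, non-cacheable arguments.

```python
class LaunchCacheModel:
    """Toy launch-context cache keyed by an identity token (stand-in for id(arg))."""

    def __init__(self, respect_cacheable):
        self.respect_cacheable = respect_cacheable
        self.entries = {}

    def launch(self, identity, offset, size, cacheable):
        if identity in self.entries:
            return self.entries[identity]   # cache hit: replays stored offset/size
        ctx = (offset, size)
        # The buggy variant stores every entry; the fixed one honours the flag.
        if cacheable or not self.respect_cacheable:
            self.entries[identity] = ctx
        return ctx


REUSED_ID = 0xABC  # models two short-lived objects landing at the same address

buggy = LaunchCacheModel(respect_cacheable=False)
buggy.launch(REUSED_ID, 0, 16, cacheable=False)
buggy_second = buggy.launch(REUSED_ID, 16, 16, cacheable=False)

fixed = LaunchCacheModel(respect_cacheable=True)
fixed.launch(REUSED_ID, 0, 16, cacheable=False)
fixed_second = fixed.launch(REUSED_ID, 16, 16, cacheable=False)
```

In the buggy variant the second launch silently reuses the first call's offset and size, which corresponds to the wrong-region write described in the example.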
|
@hughperkins Thanks for the guidance. I'll stop addressing further 🟡 nits here — the remaining findings (kernel-side nested BufferView rejection, frozen-DC cache propagation) are edge cases that can be handled in follow-up PRs. The PR is ready for merge from my side. Summary of current state:
|
Coverage Report (
| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/__init__.py | 0% | 7 |
| 🟢 python/quadrants/lang/_func_base.py | 92% | 42 |
| 🟢 python/quadrants/lang/_ndarray.py | 91% | 391 |
| 🟢 python/quadrants/lang/_template_mapper_hotpath.py | 92% | 43 |
| 🔴 python/quadrants/lang/ast/ast_transformers/function_def_transformer.py | 79% | 29,36,351 |
| 🔴 python/quadrants/lang/buffer_view.py | 75% | 3,5-10,12,14,17,31,39,52,60,84,86-87,90,93,112,114,118,127-128,132,147-148,177,192-193,234,239 |
| 🟢 python/quadrants/lang/impl.py | 86% | 21,271 |
| 🟢 python/quadrants/lang/matrix.py | 100% | |
| 🔴 python/quadrants/types/__init__.py | 0% | 13 |
| 🔴 python/quadrants/types/buffer_view_type.py | 25% | 1,4,23,27-28,31 |
| 🔴 python/quadrants/types/enums.py | 0% | 15,47 |
| 🟢 tests/python/test_buffer_view.py | 98% | 420-422,435-436,537,553 |
Diff coverage: 91% · Overall: 74% · 608 lines, 57 missing
|
Thanks! 🙌 |
* [Misc] Warn user to disable caching when print_ir/QD_DUMP_IR enabled (Genesis-Embodied-AI#425) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
* [Build] Pin torch version to CUDA 12.8 for CUDA tests (Genesis-Embodied-AI#428)
* [Misc] Fixing up taichi-dev urls (Genesis-Embodied-AI#429)
* [Perf] Rename cuda_graph to gpu_graph across the codebase (Genesis-Embodied-AI#430)
* Misc: fix typo integeral -> integral (Genesis-Embodied-AI#434) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
* [Perf] CUDA graph 4: call from multiple locations (Genesis-Embodied-AI#420)
* [Bug] Fix fastcache not restoring graph_do_while_arg (Genesis-Embodied-AI#435)
* [Perf] Cache last-call result in perf_dispatch for single-compatible case (Genesis-Embodied-AI#438)
* Fix gpu_graph fallback on old Nvidia GPU. (Genesis-Embodied-AI#443)
* Fix shared memory offset not reset between CUDA kernels. (Genesis-Embodied-AI#442)
* [Misc] Allow disabling GPU graph via QD_GPU_GRAPH=0 env var (Genesis-Embodied-AI#439)
* [Misc] Add named top-level loops (Genesis-Embodied-AI#440)
* [Misc] Rename gpu_graph to graph (Genesis-Embodied-AI#446)
* [Misc] Add cross-platform shuffle (Genesis-Embodied-AI#447)
* [Bug] Fix graph_do_while on Windows: search for cudadevrt.lib (Genesis-Embodied-AI#456)
* [Bug] Also search default CUDA toolkit install location on Windows (Genesis-Embodied-AI#461)
* [SPIRV] Feature Parity Atomics & Shared Array (Genesis-Embodied-AI#432)
* [Misc] Change clang format to 120 characters (Genesis-Embodied-AI#463)
* [Misc] CUDA graph 5 Add fatbin (Genesis-Embodied-AI#464)
* [Bug] Reuse VkInstance across init/reset cycles (Genesis-Embodied-AI#465)
* [Perf] Tiles 1: _load, _store, _eye_ (Genesis-Embodied-AI#466)
* [Misc] Remove dead InternalFuncStmt type_check override (Genesis-Embodied-AI#471)
* [Perf] Tiles 2: add cholesky and ger (Genesis-Embodied-AI#472)
* [Perf] Tiles 2b: add triangular solve (Genesis-Embodied-AI#474)
* [Misc] Refactor: use _get_col/_set_col in tiles load/store/init (Genesis-Embodied-AI#475)
* [Build] Fix flaky test_clock_accuracy (Genesis-Embodied-AI#436)
* Fix AARCH64 emitting invalid asm in CUDA kernels. (Genesis-Embodied-AI#473) Co-authored-by: Hugh Perkins <hughperkins@gmail.com>
* [AMDGPU] Enable HIP memory pool and surface pool-exhaustion errors. (Genesis-Embodied-AI#485)
* [AMDGPU] Scope hsaco tmp dir per-user to avoid collisions. (Genesis-Embodied-AI#484)
* [Perf] Tiles 3: Add slice syntax, qd.outer() and initial doc (Genesis-Embodied-AI#477)
* [AMDGPU] Fix gradient computation. (Genesis-Embodied-AI#486)
* Enable all backends that are supported in unit tests. (Genesis-Embodied-AI#488)
* Fix SPIRV ID overflow for large kernels due to autodiff. (Genesis-Embodied-AI#489)
* [Misc] Fix purity checker to allow accessing constants from quadrants modules (Genesis-Embodied-AI#487)
* [Misc] Increase tolerance for clock monotonic test (Genesis-Embodied-AI#492)
* [CI] Serialize api doc workflow (Genesis-Embodied-AI#494)
* [CI] Increase tolerance for clock test (Genesis-Embodied-AI#506)
* [CI] Increase clock test tolerance to 20% (Genesis-Embodied-AI#509)
* [Perf] Add tensor_type parametrization to tile16 tests (Genesis-Embodied-AI#504)
* [Perf] Tiles 4b: Migrate tiles16 tests to enable fastcache (Genesis-Embodied-AI#505)
* [Perf] Tiles 4c: add Tiles16x16 proxy (Genesis-Embodied-AI#507)
* [Perf] Tiles 4d: Consolidate slice error tests using parametrize (Genesis-Embodied-AI#508)
* [Perf] Tiles 4: add SharedArray slice support (Genesis-Embodied-AI#482)
* [Perf] Tiles 5: add Cholesky benchmark demo (Genesis-Embodied-AI#483)
* [Doc] Add user guide page for subgroup shuffle (Genesis-Embodied-AI#512)
* [Perf] Implement cross-platform shuffle_down (Genesis-Embodied-AI#510)
* [Perf] Add portable subgroup reduce_add and reduce_all_add (Genesis-Embodied-AI#511)
* [Perf] Add first warmup config to perf dispatch (Genesis-Embodied-AI#422)
* [AutoDiff] Autodiff 1: Add baseline adstack regression test for unary_collections (Genesis-Embodied-AI#500)
* [AutoDiff] Autodiff 2: Implement derivative for tan (Genesis-Embodied-AI#501)
* [AutoDiff] Autodiff 3: Recompute tanh/exp on the operand in the reverse pass (Genesis-Embodied-AI#502)
* [AutoDiff] Autodiff 4: Mark rsqrt as non-linear for adstack promotion (Genesis-Embodied-AI#503)
* [AutoDiff] Autodiff 5: Fix adjoint-alloca placement for GlobalLoads outside the current range-for (Genesis-Embodied-AI#496)
* [AutoDiff] Autodiff 6: Adstack regression tests (Genesis-Embodied-AI#491)
* [AutoDiff] Autodiff 7: Fix header size in AdStackAllocaStmt to match u64 runtime layout (Genesis-Embodied-AI#534)
* [AutoDiff] Autodiff 8: Surface LLVM adstack push/pop overflow as a Python exception (Genesis-Embodied-AI#535)
* [AutoDiff] Autodiff 9: Guard against LLVM worker-thread stack overflow from large per-task adstack budget (Genesis-Embodied-AI#495)
* [AutoDiff] Autodiff 10: Implement adstack for SPIR-V (Genesis-Embodied-AI#490)
* [AutoDiff] Autodiff 11: Latent adstack-adjacent fixes (AMDGPU hipFree, flush() keeps ctx_buffers_, always-preallocate) (Genesis-Embodied-AI#536)
* [Doc] Add AGENTS.md with instructions for AI agents (Genesis-Embodied-AI#541)
* [Bug] Abort kernel execution on assertion failure instead of segfaulting (Genesis-Embodied-AI#419)
* [Type] ndarray typing 1: Add eval_str=True to inspect.signature() calls (Genesis-Embodied-AI#411)
* [CI] Suppress reportPrivateImportUsage in torch-using files (Genesis-Embodied-AI#552)
* [Misc] QD_DUMP_IR dumps to files with the task_id added to the filename (Genesis-Embodied-AI#441)
* [Type] ndarray typing 2: Fix NDArray single-arg subscript crash (Genesis-Embodied-AI#412)
* [Test] Flush xdist channel before worker exit so test failure reports are visible (Genesis-Embodied-AI#555)
* [CI] Reduce test retries on CI from 3 to 1. (Genesis-Embodied-AI#554)
* [AutoDiff] Autodiff 12: Heap-backed adstack on LLVM backends (CPU/CUDA/AMDGPU) (Genesis-Embodied-AI#537)
* [AutoDiff] Autodiff 13: Heap-backed adstack on SPIR-V backends (Metal, Vulkan) (Genesis-Embodied-AI#493)
* [AutoDiff] Autodiff 14: Resolve bounded-inner-loop adstacks without default_ad_stack_size fallback (Genesis-Embodied-AI#539)
* [SPIRV] Vulkan SPIR-V correctness: atomic-view aliasing, PSB stride, narrow storage caps, u1 cast, per-init layer recheck (Genesis-Embodied-AI#513)
* [Build] Autodiff 15: Replace 2022 MoltenVK pin with LunarG Vulkan SDK fetch and sanitise MoltenVK cap advertisement (Genesis-Embodied-AI#551)
* [Test] Suppress stock pytest-timeout to avoid conflict with pytest_hardtle (Genesis-Embodied-AI#557)
* [Vulkan] Use SDK validation layer for debugPrintf instead of apt package (Genesis-Embodied-AI#562)
* [Test] Fix flaky perf_dispatch tests by increasing work amounts (Genesis-Embodied-AI#559)
* [Test] Add --maxfail CLI option to run_tests.py (default 20) (Genesis-Embodied-AI#558)
* [CI] Vulkan debug printf fix to address flaky tests (Genesis-Embodied-AI#563)
* [Docs] Add a new page to help for first time contributors (Genesis-Embodied-AI#426) Authored-by: v01dxyz <v01dxyz@v01d.xyz>
* [AutoDiff] Autodiff 16: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr (Genesis-Embodied-AI#543)
* Fix: raise error if device memory allocation fails (Genesis-Embodied-AI#451) (Genesis-Embodied-AI#453) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> Co-authored-by: Hugh Perkins <hughperkins@gmail.com>
* [CI] Add CI job to check line wrapping of comments and docs (Genesis-Embodied-AI#564)
* [Misc] Add coverage report to PRs, including kernels (Genesis-Embodied-AI#470)
* [CI] CI wrap check feeds only diffs to agent (Genesis-Embodied-AI#567)
* Skip 'flaky' test on MacOS CI. (Genesis-Embodied-AI#573)
* [Test] Fix missing `import sys` in test_fail_device_memory_allocation (Genesis-Embodied-AI#574)
* [CI] Fix Vulkan debugPrintf flake with session-scoped warmup (Genesis-Embodied-AI#571)
* [AutoDiff] determine_ad_stack_size: replace whole-CFG Bellman-Ford with SCC + DAG DP (Genesis-Embodied-AI#575)
* [Test] Fix macOS OOM skip reason to describe actual root cause (Genesis-Embodied-AI#576)
* [Lang] whole_kernel_cse: 2.5x compile time speedup on large kernels (Genesis-Embodied-AI#577)
* [CI] Add CI check for unnecessarily deleted comments (Genesis-Embodied-AI#570)
* [CI] Migrate coverage report to github Check page (Genesis-Embodied-AI#566)
* [Lang] Skip IR verifier between passes unless debug=true (Genesis-Embodied-AI#579)
* [Lang] Inline AdStack ops on release LLVM codegen: dramatically reduces compile time for adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#584)
* [CUDA] Honor offline_cache=False end-to-end so QD_OFFLINE_CACHE=0 actually gives a cold compile (Genesis-Embodied-AI#580)
* [Type] Tensor 24 (Genesis-Embodied-AI#561) Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>
* [Lang] auto_diff host-walk reductions: dramatically faster front-end compile time on adstack-enabled reverse-mode kernels (Genesis-Embodied-AI#587)
* [AutoDiff] Speed up reverse-mode kernel launches on GPU backends (Genesis-Embodied-AI#578)
* [Vulkan] Move adstack-sizer scratch out of Function-scope memory to fix SPIR-V pipeline build failures (Genesis-Embodied-AI#588)
* [AutoDiff] Improve diagnosis of unsupported reverse-mode AD patterns (Genesis-Embodied-AI#590)
* [Bug] Fix: promote Ndarray to AnyArray in build_Name for flattened struct fields (Genesis-Embodied-AI#592)
* [SPIR-V] Shrink reverse-grad kernel MSL by ~50% (Genesis-Embodied-AI#591)
* [CI] Add CI check that PR changes have test coverage (Genesis-Embodied-AI#596)
* [Perf] Enable zero-copy in to_torch() and to_numpy() (Genesis-Embodied-AI#450)
* Add BufferView: safe sub-range ndarray access for kernels (Genesis-Embodied-AI#585) Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com> Co-authored-by: Hugh Perkins <hughperkins@gmail.com>
* [Doc] Add user-facing fastcache documentation (Genesis-Embodied-AI#597) Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>
* [Misc] Upgrade to enable v1 dlpack so to_numpy(copy=False) writable (Genesis-Embodied-AI#598) Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local>
* [AutoDiff] Cut reverse-mode adstack memory usage 10x on all backends (Genesis-Embodied-AI#599)
* [Misc] Add CI check for feature file factorization (Genesis-Embodied-AI#606)
* [Perf] Skip _recursive_set_args for all-Field frozen dataclass structs (Genesis-Embodied-AI#607) Co-authored-by: Cursor <cursoragent@cursor.com>
* [AutoDiff] SNode-arm bound-expr capture rejects fold-attack gate indices (Genesis-Embodied-AI#610)
* [Misc] Suppress field fastcache warning for qd.Tensor (Genesis-Embodied-AI#615) Co-authored-by: Cursor <cursoragent@cursor.com>
* [AutoDiff] Adstack heap: clip reducer count by per-task loop trip count (compile-time and SizeExpr-evaluated) (Genesis-Embodied-AI#611)
* [Misc] Forward copy= through qd.Tensor, add copy=None option (Genesis-Embodied-AI#616) Co-authored-by: Cursor <cursoragent@cursor.com>
* [Doc] Update README (Genesis-Embodied-AI#617) Co-authored-by: Cursor <cursoragent@cursor.com>
* [CI] Fix coverage report showing def lines as uncovered (Genesis-Embodied-AI#623) Co-authored-by: Cursor <cursoragent@cursor.com>
* [Perf] Generic launcher: persistent context, JIT-pointer reuse, Metal compute encoder, LLVM-GPU async memory ops (Part 1/2) (Genesis-Embodied-AI#619)
* [CI] Encode Python-first testing policy in coverage-check prompt (Genesis-Embodied-AI#622) Co-authored-by: Cursor <cursoragent@cursor.com>
* [CI] Add PR Line change report (Genesis-Embodied-AI#624) Co-authored-by: Cursor <cursoragent@cursor.com>
* [CI] Disable quadrants pytest plugin during quadrants internal coverage runs (Genesis-Embodied-AI#629) Co-authored-by: Cursor <cursoragent@cursor.com> * [AutoDiff] Adstack load+store eliminations: EliminateRecomputableAdStackPushes pass + leaf extensions (Genesis-Embodied-AI#621) * [CI] Simplify coverage PR comment to a single linked line (Genesis-Embodied-AI#630) * [CUDA] Add AGX Thor, SM_110 (Genesis-Embodied-AI#631) Co-authored-by: Johnny Nunez and Hugh Perkins * [CI] Lines changed report: collapse PR comment to a single linked totals line (Genesis-Embodied-AI#632) * [FEATURE] Support external Metal command queue via qd.init (Genesis-Embodied-AI#618) Co-authored-by: Cursor <cursoragent@cursor.com> * [Perf] Cache adstack-sizer metadata per task across SPIR-V + LLVM-GPU; per-snode / DeviceAllocation invalidation (Part 2/2) (Genesis-Embodied-AI#620) * [AutoDiff] Disable EliminateRecomputableAdStackPushes pending mutated-SNode chain-leaf fix (Genesis-Embodied-AI#633) * [AutoDiff] Adstack chain-clone safety: mutated-SNode leaf reject + load_top consumer-aware guard (Genesis-Embodied-AI#634) * [Docs] Add user-guide page for qd.simt.block.* primitives (Genesis-Embodied-AI#638) * [Docs] Expand qd.simt.subgroup user-guide page to cover every op (Genesis-Embodied-AI#639) * [Perf] Streams 1-4 (Genesis-Embodied-AI#410) * [Docs] Add user-guide page for matrix decompositions and solvers (Genesis-Embodied-AI#643) * [Bug] Revert "[Perf] Streams 1-4 (Genesis-Embodied-AI#410)" (Genesis-Embodied-AI#650) * [Docs] Add user-guide page for atomics and bit operations (Genesis-Embodied-AI#640) * [Docs] Add user-guide page for qd.simt.grid.* primitives (Genesis-Embodied-AI#641) * [AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch (Genesis-Embodied-AI#635) * [AMDGPU] Fix amdgpu parallel rand init (Genesis-Embodied-AI#658) * [Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap (Genesis-Embodied-AI#655) * [Perf] Re-land Streams 1-4 with bug fixes 
(Genesis-Embodied-AI#653) * [AMDGPU] Apply device_memory_GB=0.3 cap to AMDGPU tests (Genesis-Embodied-AI#659) * [Perf] Per-launch host sync: drop wait_idle on SPIR-V, pin stream and drop stream_synchronize on CUDA/AMDGPU (Genesis-Embodied-AI#654) * [AMDGPU] Unload hipModule_t in JITModuleAMDGPU destructor (Genesis-Embodied-AI#660) * [AMDGPU] Trim default mempool on qd.reset() (Genesis-Embodied-AI#669) * [AMDGPU] Hoist rand-state buffer to process lifetime (Genesis-Embodied-AI#668) * [Streams] Use events for streams serialization on AMDGPU and CUDA (Genesis-Embodied-AI#667) * [Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id (Genesis-Embodied-AI#671) * [SPIR-V] dispatch_max_reducers: register each task with the real kernel name (Genesis-Embodied-AI#675) * [AutoDiff] Debug-mode field/grad/dual: dtype, layout, and access-time invariants (Genesis-Embodied-AI#677) * [Docs] Add user-guide page for qd.algorithms.* device-wide algorithms (Genesis-Embodied-AI#642) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [Docs] Doc for existing atomics: switch support table to per-backend columns (Genesis-Embodied-AI#657) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [GPU] Cross gpu atomics (Genesis-Embodied-AI#666) Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> * [GPU] Make block operations portable cross-gpu (Genesis-Embodied-AI#664) * [Perf] CPU LLVM adstack-cache: skip per-launch bump-writes + ndarray_shapes capture on forward-only handles (Genesis-Embodied-AI#685) * [GPU] Cross-GPU for grid ops (Genesis-Embodied-AI#670) * [Math] Make bitop operations portable cross-gpu (Genesis-Embodied-AI#662) * [AMDGPU] Always use wave64, on both RDNA and CDNA (Genesis-Embodied-AI#687) * [AMDGPU] Use syncscope("agent") for atomix xor to avoid CAS livelock (Genesis-Embodied-AI#672) * [GPU] New bit ops for QIPC (Genesis-Embodied-AI#679) * [GPU] Subgroup ops cross-gpu (Genesis-Embodied-AI#665) * [Graph] Rename CUDA 
Graph to Graph in docs (Genesis-Embodied-AI#691) * [SPIR-V] Fix FIFO-queue ordering when sharing command queue. (Genesis-Embodied-AI#694) * [Atomics] New QIPC ops for atomics (Genesis-Embodied-AI#690) * Pass dataclass sub-structs into qd.func (Genesis-Embodied-AI#698) * [AMDGPU] HIP graph runtime support for @qd.kernel(graph=True) (Genesis-Embodied-AI#692) * [CI] Add per-file timing report to Mac Metal test job (Genesis-Embodied-AI#695) Co-authored-by: Cursor <cursoragent@cursor.com> * [CI] Enable kernel disk cache during tests (Genesis-Embodied-AI#696) * [Math] New QIPC ops for single-threaded linalg (Genesis-Embodied-AI#683) * [BREAKING][GPU] New QIPC ops for subgroups (Genesis-Embodied-AI#676) * [GPU] New QIPC ops for block (Genesis-Embodied-AI#684) * [GPU] New device-level ops for QIPC (Genesis-Embodied-AI#693) * [algorithms] PrefixSumExecutor: drop unused GRID_SZ local (Genesis-Embodied-AI#701) * [block] sync(): fix unsupported-arch error message (Genesis-Embodied-AI#700) * [volatile_load] add qd.volatile_load primitive (closes Genesis-Embodied-AI#648) (Genesis-Embodied-AI#702) * [AutoDiff] Reject recycled identity_key in AdStackCache::register_adstack_sizing_info (Genesis-Embodied-AI#708) * [Vulkan] Declare GroupNonUniform SPIR-V caps and enable shaderSubgroupExtendedTypes (Genesis-Embodied-AI#707) * Fix duplicate HIP graph driver-function declarations after v1.0.0 merge The amd-integration fork had cherry-picked the HIP graph driver functions (graph_create / graph_destroy / graph_add_kernel_node / graph_instantiate / graph_exec_destroy / graph_launch), and upstream v1.0.0 added the same set. The per-file 3-way merge appended both copies into amdgpu_driver_functions.inc.h, producing redeclaration errors that broke the AMDGPU RHI/runtime compile. Drop the upstream duplicate block; the signatures are identical to the fork's existing declarations. 
Co-authored-by: Cursor <cursoragent@cursor.com> * Fix AMDGPU launcher coherence and num_instructions visibility after v1.0.0 merge - kernel_launcher.cpp: the 3-way merge spliced upstream v1.0.0's launch_llvm_kernel rewrite (ephemeral arg/context buffers, explicit-stream path, AmdgpuDefaultStream PinGuard) onto the AMD fork's kernarg-by-value + persistent-scratch design, leaving references to undefined `ephemeral_context_ptr`. Restore the fork's coherent launch_llvm_kernel verbatim; it calls the (already merged) enhanced launch_offloaded_tasks, which keeps the max-reducer dispatch and stream-parallel groups adapted onto the AMD launch path. - llvm_context.h: both the fork and upstream added `num_instructions`; the merge kept upstream's private placement, but the AMDGPU codegen force-inline heuristic calls it statically from outside the class. Move it back to the public section. Co-authored-by: Cursor <cursoragent@cursor.com> * Restore async result D2H and hoist kernarg vectors in AMDGPU launcher The v1.0.0 merge resolution regressed two amd-integration baseline optimizations in launch_llvm_kernel / launch_offloaded_tasks: - The per-launch result-buffer copy was a blocking memcpy_device_to_host, forcing a host stall on every value-returning launch and serializing the GPU pipeline. Restore the async D2H (the caller synchronizes lazily when it needs the value); external-array transfers still stream_synchronize once before reading back. - launch_task constructed the kernarg std::vectors from initializer lists ({kernarg_payload} / {kernarg_size}) on every dispatch (heap alloc + free per launch). Hoist arg_ptrs/arg_sizes out of the per-task launch and reuse. 
Co-authored-by: Cursor <cursoragent@cursor.com> * amdgpu: default to LDS permlane64 emulation; drop host-x86 barrier asm on retarget Two AMDGPU JIT-compile crashes surfaced after the v1.0.0 merge pulled in the QIPC subgroup ops (Genesis-Embodied-AI#676), which made the rigid constraint solver's wave-cooperative reductions route through `amdgpu_cross_half_shuffle_i32`. Both manifested as a SIGSEGV inside `llvm::SIInstrInfo::getInstSizeInBytes` during `JITSessionAMDGPU::compile_module_to_hsaco` (i.e. at first kernel launch), and reproduce on gfx942 / MI300X. Baseline 0.4.6 never emitted these constructs, which is why it was unaffected. 1. Native `llvm.amdgcn.permlane64` lowering crashes the bundled LLVM 22.1.0 AMDGPU backend. Default `amdgpu_permlane64` to the existing LDS-roundtrip software emulation on every target (it produces identical results). Add `QD_AMDGPU_USE_NATIVE_PERMLANE64=1` to opt back into the native instruction once the backend bug is fixed; the old `QD_AMDGPU_FORCE_PERMLANE64_FALLBACK` is now the default and still honored. This is the actual crash fix. 2. The runtime module is compiled by the host x86_64 clang and only retargeted to amdgcn here, so `amdgpu_cross_half_shuffle_i32`'s `__asm__ volatile("" : "+v"(byte))` optimization barrier carries x86 flag clobbers (`~{dirflag},~{fpsr},~{flags}`) that are meaningless on AMDGPU. The IR verifies but the empty-body INLINEASM is invalid on the amdgcn target. Neutralize empty-body barrier asm during retarget (forward the tied value, then erase) so no stale host asm reaches codegen. On the wave64 targets we ship `ds_bpermute` already addresses the full wave, so the hint is a no-op. Co-authored-by: Cursor <cursoragent@cursor.com> * style: apply clang-format (v19.1.7) to AMDGPU fn_attrs and launcher sources CI pre-commit's clang-format hook reformatted these files (long declarations/lambda signatures collapsed onto single lines per the repo's clang-format config). Apply the same formatting so the hook passes. 
No functional changes. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(amdgpu): use CreateNeg for branchless i32 sgn instead of CreateSub(0, input) clang-tidy (modernize-use-nullptr, -warnings-as-errors) flagged `builder->CreateSub(0, input)` in the i32 sgn path: the literal `0` binds to the `llvm::Value*` LHS parameter as a null pointer, not an integer zero. Replace with `builder->CreateNeg(input)`, which emits `0 - input` with a proper zero constant -- identical intended semantics, and clang-tidy clean. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Robert Dazi <14996868+v01dXYZ@users.noreply.github.com> Co-authored-by: v01dxyz <v01dxyz@v01d.xyz> Co-authored-by: Hugh Perkins <hughperkins@gmail.com> Co-authored-by: Alexis DUBURCQ <alexis.duburcq@gmail.com> Co-authored-by: hugh <hugh@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local> Co-authored-by: alanray-tech <alan.ray@genesis-ai.company> Co-authored-by: alanray-tech <alanray-tech@users.noreply.github.com> Co-authored-by: root <root@rtx-209-201.slurm-compute.tenant-slurm.svc.cluster.local> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Johnny <johnnynuca14@gmail.com>
Summary
`BufferView` provides a safe, zero-copy sub-range view into an ndarray for kernel arguments. It rewrites `view[i]` to `arr[offset + i]` at AST-translation time, requiring no IR modifications.

In debug mode (`debug=True`), it inserts runtime bounds assertions that report the kernel name, thread ID, file and line for every frame in the callstack.

API
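The view semantics summarized above can be pictured with a small pure-Python model. This is only a sketch under stated assumptions: `ViewModel` is a hypothetical stand-in class, it is backed by a plain Python list rather than an ndarray, and it always checks bounds, whereas the PR describes in-kernel checks as debug-mode only. It does not use or reproduce the Quadrants API.

```python
class ViewModel:
    """Hypothetical pure-Python stand-in for a sub-range view; not Quadrants code."""

    def __init__(self, arr, offset, size):
        offset, size = int(offset), int(size)
        if offset < 0 or size < 0 or offset + size > len(arr):
            raise IndexError(f"view [{offset}, {offset + size}) exceeds length {len(arr)}")
        self.arr, self.offset, self.size = arr, offset, size

    def _check(self, i):
        if not 0 <= i < self.size:
            raise IndexError(f"index {i} out of range (offset={self.offset}, size={self.size})")

    def __getitem__(self, i):
        # Mirrors the described rewrite: view[i] -> arr[offset + i].
        self._check(i)
        return self.arr[self.offset + i]

    def __setitem__(self, i, value):
        self._check(i)
        self.arr[self.offset + i] = value

    def subview(self, offset, size):
        # Validate against the parent view, then re-base onto the same storage.
        if offset < 0 or size < 0 or offset + size > self.size:
            raise IndexError(f"subview [{offset}, {offset + size}) exceeds parent size {self.size}")
        return ViewModel(self.arr, self.offset + offset, size)


data = list(range(32))
view = ViewModel(data, 8, 16)  # covers data[8:24]
inner = view.subview(4, 8)     # covers data[12:20]
inner[0] = -1                  # writes through to data[12]; nothing is copied
```

Because each sub-view stores an absolute offset into the same backing storage, chaining views never adds indirection: an access is still a single `arr[offset + i]` lookup.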
Slicing forms a closed chain: `ndarray` -> slice -> `BufferView` -> slice -> `BufferView`. Each step validates bounds against the parent.

Debug output (OOB access):
Design notes
- `BufferView.__init__` coerces offset/size to int and checks bounds (1D, non-negative, within ndarray length).
- `subview()` and slice validate against the parent view's size.
- `subscript()` inserts `qd_assert(0 <= i < size)` in debug mode, with kernel name, thread ID, and full callstack in the error message.
- `_subview_expr()` inserts `qd_assert(offset >= 0, size >= 0, offset + size <= parent_size)`.
- `BufferView[dtype]` annotation delegates to `BufferViewType` via `__class_getitem__`. dtype can be omitted - Quadrants infers it from the passed ndarray at compile time.
- `v[4:8]` inside a kernel is intercepted by `impl.subscript()` and routed to `subview()`.

Files changed
- `python/quadrants/lang/buffer_view.py`: `subview()`, `_subview_expr()`, `__getitem__` slice, `__class_getitem__`, debug OOB
- `python/quadrants/types/buffer_view_type.py`
- `python/quadrants/lang/impl.py`
- `python/quadrants/lang/_func_base.py`
- `python/quadrants/lang/_template_mapper_hotpath.py`
- `python/quadrants/lang/ast/.../function_def_transformer.py`
- `python/quadrants/lang/_ndarray.py`: `_slice_to_buffer_view` enables `data[:16]` slice syntax
- `python/quadrants/lang/matrix.py`
- `python/quadrants/lang/__init__.py`
- `python/quadrants/types/__init__.py`
- `python/quadrants/types/enums.py`
- `tests/python/test_api.py`
- `docs/source/user_guide/buffer_view.md`
- `docs/source/user_guide/index.md`
- `tests/python/test_buffer_view.py`

See user-facing documentation for full usage guide.
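The slice syntax enabled in `_ndarray.py` amounts to turning a Python slice into an (offset, size) pair. The sketch below shows one way that mapping could look; `slice_to_offset_size` is a hypothetical helper written for illustration, not the PR's `_slice_to_buffer_view`. It assumes Python's usual `slice.indices()` behaviour, which clamps out-of-range bounds, while the real implementation may reject them instead.

```python
def slice_to_offset_size(s, length):
    """Hypothetical helper: map a step-1 slice over `length` elements to (offset, size)."""
    if s.step not in (None, 1):
        raise ValueError("slice step must be 1")
    # slice.indices() resolves omitted and negative bounds and clamps them to [0, length].
    start, stop, _ = s.indices(length)
    return start, max(stop - start, 0)


n = 32
print(slice_to_offset_size(slice(None, 16), n))    # data[:16]  -> (0, 16)
print(slice_to_offset_size(slice(8, 24), n))       # data[8:24] -> (8, 16)
print(slice_to_offset_size(slice(None, None), n))  # data[:]    -> (0, 32)
print(slice_to_offset_size(slice(8, None), n))     # data[8:]   -> (8, 24)
print(slice_to_offset_size(slice(-8, None), n))    # data[-8:]  -> (24, 8)
```

The five calls correspond to the slice forms listed under Group A of the test plan.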
Test plan (40 tests)
Group A - Slice syntax + host-side validation (17 tests)
- `data[:16]`, `data[8:24]`, `data[:]`, `data[8:]`, `data[-8:]`
- `view.subview(4, 8)`, subview OOB, chained `view[4:12]`, view step error, view int index error

Group B - Kernel functional tests (10 tests)
- `BufferView[dtype]` and plain `BufferView` annotations
- `v.size` iteration, multiple views, `@qd.func` annotation
- `BufferView(arr, off, sz)` construction, `v.subview(4, 4)`, `v[4:8]` slice

Group C - Debug mode: OOB + callstack (13 tests, x64 + cuda)
- `subscript()` bounds: upper, lower, offset/size in message, nested callstack
- `_subview_expr()` bounds: subview OOB

All 40 tests pass on `x64` and `cuda` backends.

AI Attestation
See timestamped comment in #445.
Maintenance
I will monitor and respond to any bugs or issues related to this feature.