[Doc] Fix graph doc issues by hughperkins · Pull Request #762 · Genesis-Embodied-AI/quadrants

hughperkins · 2026-06-24T14:59:55Z

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

Add RULE 3 to the doc-quality agent prompt: within a single doc, information must be ordered so a first-time reader can follow it top-to-bottom without jumping ahead (no forward references). Includes carve-outs for roadmaps, optional "see below" pointers, backward references, cross-doc references, and conventional preamble ordering, and hands pure undefined-term cases to Rule 1. Adds the [order] tag and reworks the violation cap so one rule cannot crowd out the others.

The doc-quality check now enforces a third rule (reading order / no forward references); describe it and its carve-outs in contributing.md.

Both added cognitive load to the agent for little gain. Revert to the original "stop after 10 violations" cap and remove the Rule 1/Rule 3 delineation note.

github-actions · 2026-06-24T16:34:14Z

Diff coverage: 0% · 0 lines, 0 missing

github-actions · 2026-06-25T14:03:30Z

Diff coverage: 100% · 3 lines, 0 missing

github-actions · 2026-06-26T14:34:20Z

Diff coverage: 0% · 0 lines, 0 missing

github-actions · 2026-06-26T19:37:23Z

Diff coverage: 0% · 0 lines, 0 missing

github-actions · 2026-06-26T20:33:41Z

Diff coverage: 0% · 0 lines, 0 missing

github-actions · 2026-06-26T22:29:34Z

Diff coverage: 0% · 0 lines, 0 missing

duburcqa · 2026-07-01T15:58:43Z

 ### Restrictions

- The counter ndarray may be swapped between calls: the cached graph reads each counter through an indirection slot that is refreshed on every launch, so passing a different ndarray (or alternating between several) replays the cached graph without rebuilding it.
+- The counter ndarray may be swapped between calls: passing a different ndarray (or alternating between several) replays the cached graph without rebuilding it.


I realise that the notion of "replaying graph" was never introduced. It is not very intuitive I feel, unless one already has a clear view of what a compute graph is concretely.
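For readers who do not yet have a concrete picture of a compute graph, a toy model may help. The sketch below is plain Python and is not the Quadrants API: `ToyGraph`, `build`, `launch`, and `increment` are hypothetical names. It assumes "replay" means re-running an already-built sequence of tasks, with the counter read through a slot that is refreshed on every launch, as the removed sentence described.

```python
class ToyGraph:
    """Hypothetical stand-in for a cached compute graph (not the Quadrants API)."""

    def __init__(self):
        self.nodes = []    # recorded tasks, in launch order
        self.slots = {}    # indirection slots, refreshed on every launch
        self.builds = 0    # how many times the graph was (re)built

    def build(self, nodes):
        # Building is the expensive step we want to do only once.
        self.nodes = list(nodes)
        self.builds += 1

    def launch(self, counter):
        # Replay: point the slot at the current counter, then run the recorded tasks.
        self.slots["counter"] = counter
        for node in self.nodes:
            node(self.slots)


def increment(slots):
    # Each task reads the counter through the slot, not through a captured reference.
    slots["counter"][0] += 1


graph = ToyGraph()
graph.build([increment, increment])

a, b = [0], [10]
graph.launch(a)   # replay with counter a
graph.launch(b)   # swap in a different counter, no rebuild
graph.launch(a)   # alternate back to a
print(a, b, graph.builds)  # [4] [12] 1
```

In this model, swapping the counter only changes what the slot points at, so the recorded tasks are reused as-is.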

duburcqa · 2026-07-01T15:59:51Z

 ### Caveats

-On platforms without native device-side conditional graph nodes — currently CUDA pre-SM 9.0 and **AMDGPU** (HIP has no conditional / while node API as of ROCm 7.2) — the value of the `graph_do_while` parameter will be copied from the GPU to the host each iteration, in order to check whether we should continue iterating. This causes a GPU pipeline stall. For nested loops this host round-trip happens once per iteration of each loop level, and each loop-body task is replayed individually, so deeply nested loops on these backends pay correspondingly more host overhead (they remain correct, just slower than the CUDA SM 9.0+ native path). At the end of each loop iteration:
+On platforms without native device-side conditional graph nodes — currently CUDA pre-SM 9.0 and **AMDGPU** ([HIP](https://rocm.docs.amd.com/projects/HIP/en/latest/what_is_hip.html) has no conditional / while node API as of [ROCm](https://www.amd.com/en/products/software/rocm.html) 7.2) — the value of the `graph_do_while` parameter will be copied from the GPU to the host each iteration, in order to check whether we should continue iterating. This causes a GPU pipeline stall. For nested loops this host round-trip happens once per iteration of each loop level, and each loop-body task is replayed individually, so deeply nested loops on these backends pay correspondingly more host overhead (they remain correct, just slower than the CUDA SM 9.0+ native path). At the end of each loop iteration:


— currently CUDA pre-SM 9.0 and AMDGPU (HIP has no conditional / while node API as of ROCm 7.2) —

Just cross-reference 'Backend support' section. It is good enough.

would fail doc check, because it's a forward reference.

As I said, I think forward references that do not need to be read to understand what's going on should be considered fine. It should be reframed as "(for details about the supported backends, see the 'Backend support' section)". At this point, we don't care which backends are supported in practice.

duburcqa · 2026-07-01T16:01:00Z

 ### Caveats

-On platforms without native device-side conditional graph nodes — currently CUDA pre-SM 9.0 and **AMDGPU** (HIP has no conditional / while node API as of ROCm 7.2) — the value of the `graph_do_while` parameter will be copied from the GPU to the host each iteration, in order to check whether we should continue iterating. This causes a GPU pipeline stall. For nested loops this host round-trip happens once per iteration of each loop level, and each loop-body task is replayed individually, so deeply nested loops on these backends pay correspondingly more host overhead (they remain correct, just slower than the CUDA SM 9.0+ native path). At the end of each loop iteration:
+On platforms without native device-side conditional graph nodes — currently CUDA pre-SM 9.0 and **AMDGPU** ([HIP](https://rocm.docs.amd.com/projects/HIP/en/latest/what_is_hip.html) has no conditional / while node API as of [ROCm](https://www.amd.com/en/products/software/rocm.html) 7.2) — the value of the `graph_do_while` parameter will be copied from the GPU to the host each iteration, in order to check whether we should continue iterating. This causes a GPU pipeline stall. For nested loops this host round-trip happens once per iteration of each loop level, and each loop-body task is replayed individually, so deeply nested loops on these backends pay correspondingly more host overhead (they remain correct, just slower than the CUDA SM 9.0+ native path). At the end of each loop iteration:


On platforms without [...]

On GPU backends [...] ?
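As a rough illustration of the fallback path this paragraph describes, here is a toy host-driven do-while loop in plain Python. It is a sketch under assumptions, not the actual runtime: `run_do_while_on_host`, `step`, `update_flag`, and the `device_state` dictionary are hypothetical, and reading `continue_flag` stands in for the GPU-to-host copy that causes the pipeline stall.

```python
def run_do_while_on_host(body_tasks, device_state):
    """Toy model of the fallback: the host re-checks the flag after every iteration."""
    host_round_trips = 0
    while True:
        # Each loop-body task is launched individually.
        for task in body_tasks:
            task(device_state)
        # Stand-in for copying the graph_do_while value from GPU to host (the stall).
        flag = device_state["continue_flag"]
        host_round_trips += 1
        if not flag:
            break
    return host_round_trips


def step(state):
    state["x"] += 1


def update_flag(state):
    # Keep iterating while x is below 5.
    state["continue_flag"] = 1 if state["x"] < 5 else 0


state = {"x": 0, "continue_flag": 1}
trips = run_do_while_on_host([step, update_flag], state)
print(trips, state["x"])  # 5 5
```

One round trip per iteration is the cost being discussed. With nested loops, each loop level would pay it once per iteration of that level.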

duburcqa · 2026-07-01T16:02:11Z

- on hardware-accelerated platforms, we only launch a single graph from the host, rather than 3 kernels
- on other platforms, there is no change: we still launch 3 gpu kernels: no change: not better, not worse
+- on hardware-accelerated platforms, we only launch a single graph from the host, rather than 3 separate kernels
+- on other platforms, there is no change: we still launch 3 separate kernels: no change: not better, not worse


3 separate kernels

3 separate gpu kernels ?

duburcqa · 2026-07-01T16:03:17Z

 - on unsupported hardware, we still incur the pipeline stall, as before
-    - note that there will be some small acceleration, because the condition evaluation and kernel launch will take place entirely from c++, bypassing python
-    - no worse, incrementally better


because the condition evaluation and kernel launch will take place entirely from c++, bypassing python

Only this part is worth removing, the rest is relevant for users no?

I think it's confusing without the explanation? We could put this in an 'advanced' section potentially. A typical end-user is not going to notice the difference until they are heavily optimizing, at which time, they could read an advanced section.

I think it's confusing without the explanation?

Why confusing? These statements are very clear. They may raise more questions, but they are not confusing in themselves, I feel.

We could put this in an 'advanced' section potentially. A typical end-user is not going to notice the difference until they are heavily optimizing, at which time, they could read an advanced section.

Yes, sounds like a good idea.

duburcqa · 2026-07-01T16:03:39Z

 - there is kernel launch latency associated with:
-    - running k1 from host-side python
-    - launching the gpu kernels for each of fn_1, fn_2, fn_3 from host-side c++
+    - running k1 from host-side


"host-side python" was more explicit I think.

It was, but then we have to make a distinction between python and c++, and talking about c++ is adding information that the user perhaps doesn't need to know.

duburcqa · 2026-07-01T16:05:53Z

-In practice, for our own kernels, i.e. in genesis-world, they largely fall under the do while formulation, see the previous section. However, also have some that used to be do while, but have been migrated to an optimized fixed-size, see next section.
+In practice, for our own [genesis-world](https://github.com/Genesis-Embodied-AI/genesis-world) kernels, they largely fall under the do while formulation, see the previous section. However, also have some that used to be do while, but have been migrated to an optimized fixed-size, see next section.


What is the point of this sentence? Like, what Quadrants' users are supposed to do with this information?

replace "for our own genesis-world kernels" with "for many real-world cases"?

duburcqa · 2026-07-01T17:07:25Z

ok to merge

graphite-app · 2026-07-02T13:51:24Z

+On GPU backends without native device-side conditional graph nodes (see [Backend Support](#backend-support) below):
+— the value of the `graph_do_while` parameter will be copied from the GPU to the host each iteration, in order to check whether we should continue iterating. This causes a GPU pipeline stall. For nested loops this host round-trip happens once per iteration of each loop level, and each loop-body task is replayed individually, so deeply nested loops on these backends pay correspondingly more host overhead (they remain correct, just slower than the CUDA SM 9.0+ native path). At the end of each loop iteration:


Markdown formatting issue: Line 158 ends with a colon, but line 159 starts with an em-dash (—) instead of a proper list marker or continuation. This will likely render incorrectly.

Suggestion:

On GPU backends without native device-side conditional graph nodes (see [Backend Support](#backend-support) below), the value of the `graph_do_while` parameter will be copied...

Or use a proper bullet list:

On GPU backends without native device-side conditional graph nodes (see [Backend Support](#backend-support) below):
- The value of the `graph_do_while` parameter will be copied...

Explain what a GPU kernel launch is, break down launch latency into python/c++/gpu-driver sources, and describe why graph_do_while helps even on unsupported hardware (host-side loop driven in c++, avoiding per-iteration python latency).
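A small timing model can show when launch latency matters at all. This is a sketch under simplifying assumptions that are mine, not the doc's: launches are fully asynchronous, kernels run in order on one stream, and the launch queue is unbounded. The function name `total_time` and all numbers (in microseconds) are hypothetical.

```python
def total_time(n_kernels, launch_latency, exec_time):
    """Finish time of n back-to-back kernels when launching overlaps with execution."""
    host_ready = 0   # time at which the host has finished issuing the current launch
    gpu_free = 0     # time at which the GPU finishes the previous kernel
    for _ in range(n_kernels):
        host_ready += launch_latency
        start = max(host_ready, gpu_free)   # a kernel needs both: launched and GPU idle
        gpu_free = start + exec_time
    return gpu_free


# Large kernels: launch latency hides behind execution, so halving it barely helps.
print(total_time(100, 10, 100), total_time(100, 5, 100))  # 10010 10005

# Small kernels: launch latency is exposed, so halving it roughly halves total time.
print(total_time(100, 10, 2), total_time(100, 5, 2))      # 1002 502
```

In this model, reducing launch latency only pays off in the second case, where each kernel finishes before the next launch arrives.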

github-actions · 2026-07-02T15:50:01Z

Diff coverage: 0% · 0 lines, 0 missing

hughperkins added 4 commits June 24, 2026 10:59

address doc CI issues

3d3327d

[Doc] Document reading-order rule in doc-quality check description

aafcc41

The doc-quality check now enforces a third rule (reading order / no forward references); describe it and its carve-outs in contributing.md.

[CI] Simplify doc-quality RULE 3: drop reworked cap and double-flag note

899600e

Both added cognitive load to the agent for little gain. Revert to the original "stop after 10 violations" cap and remove the Rule 1/Rule 3 delineation note.

Merge remote-tracking branch 'origin/hp/doc-quality-with-ordering' in…

5f7aeb0

…to hp/fixup-graph-doc

hughperkins added 2 commits June 26, 2026 09:01

fix doc ci failures

57f07bd

Merge branch 'main' into hp/fixup-graph-doc

715d454

hughperkins marked this pull request as ready for review June 26, 2026 15:25

graphite-app Bot reviewed Jun 26, 2026

Comment thread docs/source/user_guide/graph.md Outdated

hughperkins and others added 3 commits June 26, 2026 14:13

doc tewaks for ci

cc8ea8a

Merge remote-tracking branch 'origin/main' into hp/fixup-graph-doc

2fc9569

Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # docs/source/user_guide/contributing.md

Apply suggestion from @graphite-app[bot]

ee7754c

Co-authored-by: graphite-app[bot] <96075541+graphite-app[bot]@users.noreply.github.com>

hughperkins added the awaiting review pass New PR or review comments addressed label Jun 26, 2026

Merge branch 'main' into hp/fixup-graph-doc

1af37b9

address doc CI

04f2f0e

address doc ci issues

a1d925b

duburcqa reviewed Jul 1, 2026

duburcqa removed the awaiting review pass New PR or review comments addressed label Jul 1, 2026

address some comments

04f1f09

graphite-app Bot reviewed Jul 2, 2026

hughperkins added 12 commits July 2, 2026 07:05

Trim python-side latency bullet in graph doc Advanced section

e323513

Drop the detailed enumeration of per-call python work; too much detail.

Trim c++-side latency bullet in graph doc Advanced section

e78e40e

Drop the detailed enumeration of per-task c++ runtime work; too much info.

Trim gpu/driver-side latency bullet in graph doc Advanced section

deb8969

Drop the "irreducible floor" sentence; doesn't add signal.

Use 'launch' instead of undefined 'replay' in graph doc Advanced section

b6f9732

Add latency hiding section to graph doc Advanced section

1801688

Explain that launches overlap with execution, so reducing launch latency only helps throughput when kernels are small enough that launch latency is exposed rather than hidden behind execution.

Reword 'This is why' to 'So' in graph doc latency hiding section

b192ca8

Clarify subject in graph doc latency hiding section ('It' -> 'Kernel …

b58a08b

…launch latency')

Soften claim in graph doc latency hiding section ('directly increases…

80eb935

…' -> 'can directly increase')

Soften unsupported-hardware claim in graph doc graph_do_while section

d634672

Reframe from 'typically faster' to 'launch latency can be slightly reduced'.

Remove closing summary paragraph from graph doc graph_do_while section

c14a85e

Soften graph_do_while subsection heading in graph doc ('helps' -> 'ca…

cfde9ce

…n help slightly')

hughperkins added the awaiting review pass New PR or review comments addressed label Jul 2, 2026
