Skip to content

[Doc] Fix graph doc issues#762

Open
hughperkins wants to merge 26 commits into
mainfrom
hp/fixup-graph-doc
Open

[Doc] Fix graph doc issues#762
hughperkins wants to merge 26 commits into
mainfrom
hp/fixup-graph-doc

Conversation

@hughperkins

Copy link
Copy Markdown
Collaborator

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

Add RULE 3 to the doc-quality agent prompt: within a single doc, information
must be ordered so a first-time reader can follow it top-to-bottom without
jumping ahead (no forward references). Includes carve-outs for roadmaps,
optional "see below" pointers, backward references, cross-doc references, and
conventional preamble ordering, and hands pure undefined-term cases to Rule 1.
Adds the [order] tag and reworks the violation cap so one rule cannot crowd
out the others.
The doc-quality check now enforces a third rule (reading order / no forward
references); describe it and its carve-outs in contributing.md.
Both added cognitive load to the agent for little gain. Revert to the original
"stop after 10 violations" cap and remove the Rule 1/Rule 3 delineation note.
@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

@hughperkins hughperkins marked this pull request as ready for review June 26, 2026 15:25
Comment thread docs/source/user_guide/graph.md Outdated
hughperkins and others added 3 commits June 26, 2026 14:13
Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	docs/source/user_guide/contributing.md
Co-authored-by: graphite-app[bot] <96075541+graphite-app[bot]@users.noreply.github.com>
@hughperkins hughperkins added the awaiting review pass New PR or review comments addressed label Jun 26, 2026
@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

Comment thread docs/source/user_guide/graph.md Outdated
### Restrictions

- The counter ndarray may be swapped between calls: the cached graph reads each counter through an indirection slot that is refreshed on every launch, so passing a different ndarray (or alternating between several) replays the cached graph without rebuilding it.
- The counter ndarray may be swapped between calls: passing a different ndarray (or alternating between several) replays the cached graph without rebuilding it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realise that the notion of "replaying graph" was never introduced. It is not very intuitive I feel, unless one already has a clear view of what a compute graph is concretely.

Comment thread docs/source/user_guide/graph.md Outdated
### Caveats

On platforms without native device-side conditional graph nodes — currently CUDA pre-SM 9.0 and **AMDGPU** (HIP has no conditional / while node API as of ROCm 7.2) — the value of the `graph_do_while` parameter will be copied from the GPU to the host each iteration, in order to check whether we should continue iterating. This causes a GPU pipeline stall. For nested loops this host round-trip happens once per iteration of each loop level, and each loop-body task is replayed individually, so deeply nested loops on these backends pay correspondingly more host overhead (they remain correct, just slower than the CUDA SM 9.0+ native path). At the end of each loop iteration:
On platforms without native device-side conditional graph nodes — currently CUDA pre-SM 9.0 and **AMDGPU** ([HIP](https://rocm.docs.amd.com/projects/HIP/en/latest/what_is_hip.html) has no conditional / while node API as of [ROCm](https://www.amd.com/en/products/software/rocm.html) 7.2) — the value of the `graph_do_while` parameter will be copied from the GPU to the host each iteration, in order to check whether we should continue iterating. This causes a GPU pipeline stall. For nested loops this host round-trip happens once per iteration of each loop level, and each loop-body task is replayed individually, so deeply nested loops on these backends pay correspondingly more host overhead (they remain correct, just slower than the CUDA SM 9.0+ native path). At the end of each loop iteration:

@duburcqa duburcqa Jul 1, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

— currently CUDA pre-SM 9.0 and AMDGPU (HIP has no conditional / while node API as of ROCm 7.2) —

Just cross-reference 'Backend support' section. It is good enough.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would fail doc check, because it's a forward reference.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said, I think forward reference which are not necessary to read to understand what's going on should be considered fine. It should be reframed as (for details about the supported backend, see 'Backend support' section). At this point, we don't care about which backend is supported or not in practice.

Comment thread docs/source/user_guide/graph.md Outdated
### Caveats

On platforms without native device-side conditional graph nodes — currently CUDA pre-SM 9.0 and **AMDGPU** (HIP has no conditional / while node API as of ROCm 7.2) — the value of the `graph_do_while` parameter will be copied from the GPU to the host each iteration, in order to check whether we should continue iterating. This causes a GPU pipeline stall. For nested loops this host round-trip happens once per iteration of each loop level, and each loop-body task is replayed individually, so deeply nested loops on these backends pay correspondingly more host overhead (they remain correct, just slower than the CUDA SM 9.0+ native path). At the end of each loop iteration:
On platforms without native device-side conditional graph nodes — currently CUDA pre-SM 9.0 and **AMDGPU** ([HIP](https://rocm.docs.amd.com/projects/HIP/en/latest/what_is_hip.html) has no conditional / while node API as of [ROCm](https://www.amd.com/en/products/software/rocm.html) 7.2) — the value of the `graph_do_while` parameter will be copied from the GPU to the host each iteration, in order to check whether we should continue iterating. This causes a GPU pipeline stall. For nested loops this host round-trip happens once per iteration of each loop level, and each loop-body task is replayed individually, so deeply nested loops on these backends pay correspondingly more host overhead (they remain correct, just slower than the CUDA SM 9.0+ native path). At the end of each loop iteration:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On platforms without [...]

On GPU backends [...] ?

Comment thread docs/source/user_guide/graph.md Outdated
- on hardware-accelerated platforms, we only launch a single graph from the host, rather than 3 kernels
- on other platforms, there is no change: we still launch 3 gpu kernels: no change: not better, not worse
- on hardware-accelerated platforms, we only launch a single graph from the host, rather than 3 separate kernels
- on other platforms, there is no change: we still launch 3 separate kernels: no change: not better, not worse

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 separate kernels

3 separate gpu kernels ?

Comment on lines 352 to -353
- on unsupported hardware, we still incur the pipeline stall, as before
- note that there will be some small acceleration, because the condition evaluation and kernel launch will take place entirely from c++, bypassing python
- no worse, incrementally better

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the condition evaluation and kernel launch will take place entirely from c++, bypassing python

Only this part is worth removing, the rest is relevant for users no?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's confusing without the explanation? We could put this in an 'advanced' section potentailly. A typical end-user is not going to notice the difference until they are heavily optimizing, at which time, they could read an advanced section.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's confusing without the explanation?

Why confusing? This statements are very clear. Maybe raising more questions, but not confusing by itself I feel.

We could put this in an 'advanced' section potentailly. A typical end-user is not going to notice the difference until they are heavily optimizing, at which time, they could read an advanced section.

Yes, sounds like a good idea.

- there is kernel launch latency associated with:
- running k1 from host-side python
- launching the gpu kernels for each of fn_1, fn_2, fn_3 from host-side c++
- running k1 from host-side

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"host-side python" was more explicit I think.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was, but then we have to make a distinction btween python and c++, and talking about c++ is adding information that the user perhaps doesn't need to know.

Comment on lines -400 to +397
In practice, for our own kernels, i.e. in genesis-world, they largely fall under the do while formulation, see the previous section. However, also have some that used to be do while, but have been migrated to an optimized fixed-size, see next section.
In practice, for our own [genesis-world](https://github.com/Genesis-Embodied-AI/genesis-world) kernels, they largely fall under the do while formulation, see the previous section. However, also have some that used to be do while, but have been migrated to an optimized fixed-size, see next section.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the point of this sentence? Like, what Quadrants' users are supposed to do with this information?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace "for our own genesis-world kernels" with "for many real-world cases"?

@duburcqa duburcqa removed the awaiting review pass New PR or review comments addressed label Jul 1, 2026
@duburcqa

duburcqa commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

ok to merge

Comment on lines +158 to +159
On GPU backends without native device-side conditional graph nodes (see [Backend Support](#backend-support) below):
— the value of the `graph_do_while` parameter will be copied from the GPU to the host each iteration, in order to check whether we should continue iterating. This causes a GPU pipeline stall. For nested loops this host round-trip happens once per iteration of each loop level, and each loop-body task is replayed individually, so deeply nested loops on these backends pay correspondingly more host overhead (they remain correct, just slower than the CUDA SM 9.0+ native path). At the end of each loop iteration:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Markdown formatting issue: Line 158 ends with a colon, but line 159 starts with an em-dash (—) instead of a proper list marker or continuation. This will likely render incorrectly.

Suggestion:

On GPU backends without native device-side conditional graph nodes (see [Backend Support](#backend-support) below), the value of the `graph_do_while` parameter will be copied...

Or use a proper bullet list:

On GPU backends without native device-side conditional graph nodes (see [Backend Support](#backend-support) below):

- The value of the `graph_do_while` parameter will be copied...

Spotted by Graphite

Fix in Graphite


Is this helpful? React 👍 or 👎 to let us know.

Explain what a GPU kernel launch is, break down launch latency into
python/c++/gpu-driver sources, and describe why graph_do_while helps
even on unsupported hardware (host-side loop driven in c++, avoiding
per-iteration python latency).
Drop the detailed enumeration of per-call python work; too much detail.
Drop the detailed enumeration of per-task c++ runtime work; too much info.
Drop the "irreducible floor" sentence; doesn't add signal.
Explain that launches overlap with execution, so reducing launch latency
only helps throughput when kernels are small enough that launch latency
is exposed rather than hidden behind execution.
Reframe from 'typically faster' to 'launch latency can be slightly reduced'.
@hughperkins hughperkins added the awaiting review pass New PR or review comments addressed label Jul 2, 2026
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting review pass New PR or review comments addressed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants