[Doc] Fix graph doc issues#762
Conversation
Add RULE 3 to the doc-quality agent prompt: within a single doc, information must be ordered so a first-time reader can follow it top-to-bottom without jumping ahead (no forward references). Includes carve-outs for roadmaps, optional "see below" pointers, backward references, cross-doc references, and conventional preamble ordering, and hands pure undefined-term cases to Rule 1. Adds the [order] tag and reworks the violation cap so one rule cannot crowd out the others.
The doc-quality check now enforces a third rule (reading order / no forward references); describe it and its carve-outs in contributing.md.
Both added cognitive load to the agent for little gain. Revert to the original "stop after 10 violations" cap and remove the Rule 1/Rule 3 delineation note.
…to hp/fixup-graph-doc
Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # docs/source/user_guide/contributing.md
Co-authored-by: graphite-app[bot] <96075541+graphite-app[bot]@users.noreply.github.com>
| ### Restrictions | ||
|
|
||
| - The counter ndarray may be swapped between calls: the cached graph reads each counter through an indirection slot that is refreshed on every launch, so passing a different ndarray (or alternating between several) replays the cached graph without rebuilding it. | ||
| - The counter ndarray may be swapped between calls: passing a different ndarray (or alternating between several) replays the cached graph without rebuilding it. |
There was a problem hiding this comment.
I realise that the notion of "replaying graph" was never introduced. It is not very intuitive I feel, unless one already has a clear view of what a compute graph is concretely.
| ### Caveats | ||
|
|
||
| On platforms without native device-side conditional graph nodes — currently CUDA pre-SM 9.0 and **AMDGPU** (HIP has no conditional / while node API as of ROCm 7.2) — the value of the `graph_do_while` parameter will be copied from the GPU to the host each iteration, in order to check whether we should continue iterating. This causes a GPU pipeline stall. For nested loops this host round-trip happens once per iteration of each loop level, and each loop-body task is replayed individually, so deeply nested loops on these backends pay correspondingly more host overhead (they remain correct, just slower than the CUDA SM 9.0+ native path). At the end of each loop iteration: | ||
| On platforms without native device-side conditional graph nodes — currently CUDA pre-SM 9.0 and **AMDGPU** ([HIP](https://rocm.docs.amd.com/projects/HIP/en/latest/what_is_hip.html) has no conditional / while node API as of [ROCm](https://www.amd.com/en/products/software/rocm.html) 7.2) — the value of the `graph_do_while` parameter will be copied from the GPU to the host each iteration, in order to check whether we should continue iterating. This causes a GPU pipeline stall. For nested loops this host round-trip happens once per iteration of each loop level, and each loop-body task is replayed individually, so deeply nested loops on these backends pay correspondingly more host overhead (they remain correct, just slower than the CUDA SM 9.0+ native path). At the end of each loop iteration: |
There was a problem hiding this comment.
would fail doc check, because it's a forward reference.
There was a problem hiding this comment.
As I said, I think forward reference which are not necessary to read to understand what's going on should be considered fine. It should be reframed as (for details about the supported backend, see 'Backend support' section). At this point, we don't care about which backend is supported or not in practice.
| ### Caveats | ||
|
|
||
| On platforms without native device-side conditional graph nodes — currently CUDA pre-SM 9.0 and **AMDGPU** (HIP has no conditional / while node API as of ROCm 7.2) — the value of the `graph_do_while` parameter will be copied from the GPU to the host each iteration, in order to check whether we should continue iterating. This causes a GPU pipeline stall. For nested loops this host round-trip happens once per iteration of each loop level, and each loop-body task is replayed individually, so deeply nested loops on these backends pay correspondingly more host overhead (they remain correct, just slower than the CUDA SM 9.0+ native path). At the end of each loop iteration: | ||
| On platforms without native device-side conditional graph nodes — currently CUDA pre-SM 9.0 and **AMDGPU** ([HIP](https://rocm.docs.amd.com/projects/HIP/en/latest/what_is_hip.html) has no conditional / while node API as of [ROCm](https://www.amd.com/en/products/software/rocm.html) 7.2) — the value of the `graph_do_while` parameter will be copied from the GPU to the host each iteration, in order to check whether we should continue iterating. This causes a GPU pipeline stall. For nested loops this host round-trip happens once per iteration of each loop level, and each loop-body task is replayed individually, so deeply nested loops on these backends pay correspondingly more host overhead (they remain correct, just slower than the CUDA SM 9.0+ native path). At the end of each loop iteration: |
There was a problem hiding this comment.
On platforms without [...]
On GPU backends [...] ?
| - on hardware-accelerated platforms, we only launch a single graph from the host, rather than 3 kernels | ||
| - on other platforms, there is no change: we still launch 3 gpu kernels: no change: not better, not worse | ||
| - on hardware-accelerated platforms, we only launch a single graph from the host, rather than 3 separate kernels | ||
| - on other platforms, there is no change: we still launch 3 separate kernels: no change: not better, not worse |
There was a problem hiding this comment.
3 separate kernels
3 separate gpu kernels ?
| - on unsupported hardware, we still incur the pipeline stall, as before | ||
| - note that there will be some small acceleration, because the condition evaluation and kernel launch will take place entirely from c++, bypassing python | ||
| - no worse, incrementally better |
There was a problem hiding this comment.
because the condition evaluation and kernel launch will take place entirely from c++, bypassing python
Only this part is worth removing, the rest is relevant for users no?
There was a problem hiding this comment.
I think it's confusing without the explanation? We could put this in an 'advanced' section potentailly. A typical end-user is not going to notice the difference until they are heavily optimizing, at which time, they could read an advanced section.
There was a problem hiding this comment.
I think it's confusing without the explanation?
Why confusing? This statements are very clear. Maybe raising more questions, but not confusing by itself I feel.
We could put this in an 'advanced' section potentailly. A typical end-user is not going to notice the difference until they are heavily optimizing, at which time, they could read an advanced section.
Yes, sounds like a good idea.
| - there is kernel launch latency associated with: | ||
| - running k1 from host-side python | ||
| - launching the gpu kernels for each of fn_1, fn_2, fn_3 from host-side c++ | ||
| - running k1 from host-side |
There was a problem hiding this comment.
"host-side python" was more explicit I think.
There was a problem hiding this comment.
It was, but then we have to make a distinction btween python and c++, and talking about c++ is adding information that the user perhaps doesn't need to know.
| In practice, for our own kernels, i.e. in genesis-world, they largely fall under the do while formulation, see the previous section. However, also have some that used to be do while, but have been migrated to an optimized fixed-size, see next section. | ||
| In practice, for our own [genesis-world](https://github.com/Genesis-Embodied-AI/genesis-world) kernels, they largely fall under the do while formulation, see the previous section. However, also have some that used to be do while, but have been migrated to an optimized fixed-size, see next section. |
There was a problem hiding this comment.
What is the point of this sentence? Like, what Quadrants' users are supposed to do with this information?
There was a problem hiding this comment.
replace "for our own genesis-world kernels" with "for many real-world cases"?
|
ok to merge |
| On GPU backends without native device-side conditional graph nodes (see [Backend Support](#backend-support) below): | ||
| — the value of the `graph_do_while` parameter will be copied from the GPU to the host each iteration, in order to check whether we should continue iterating. This causes a GPU pipeline stall. For nested loops this host round-trip happens once per iteration of each loop level, and each loop-body task is replayed individually, so deeply nested loops on these backends pay correspondingly more host overhead (they remain correct, just slower than the CUDA SM 9.0+ native path). At the end of each loop iteration: |
There was a problem hiding this comment.
Markdown formatting issue: Line 158 ends with a colon, but line 159 starts with an em-dash (—) instead of a proper list marker or continuation. This will likely render incorrectly.
Suggestion:
On GPU backends without native device-side conditional graph nodes (see [Backend Support](#backend-support) below), the value of the `graph_do_while` parameter will be copied...Or use a proper bullet list:
On GPU backends without native device-side conditional graph nodes (see [Backend Support](#backend-support) below):
- The value of the `graph_do_while` parameter will be copied...Spotted by Graphite
Is this helpful? React 👍 or 👎 to let us know.
Explain what a GPU kernel launch is, break down launch latency into python/c++/gpu-driver sources, and describe why graph_do_while helps even on unsupported hardware (host-side loop driven in c++, avoiding per-iteration python latency).
Drop the detailed enumeration of per-call python work; too much detail.
Drop the detailed enumeration of per-task c++ runtime work; too much info.
Drop the "irreducible floor" sentence; doesn't add signal.
Explain that launches overlap with execution, so reducing launch latency only helps throughput when kernels are small enough that launch latency is exposed rather than hidden behind execution.
…' -> 'can directly increase')
Reframe from 'typically faster' to 'launch latency can be slightly reduced'.
…n help slightly')
Issue: #
Brief Summary
copilot:summary
Walkthrough
copilot:walkthrough