
[Pallas] [Mosaic GPU] Add GPU pipelining docs #28135

Open · wants to merge 1 commit into main

Conversation

justinjfu (Collaborator) commented on Apr 19, 2025:

Adds docs covering emit_pipeline and emit_pipeline_warp_specialized.

Remaining TODOs possibly include a flash attention example with ping-pong scheduling.


- `body` and `grid` have the same semantics as in `pl.pallas_call`. The `grid` denotes how many invocations of the `body` function to run. In contrast with a CUDA grid, the pipeline grid is guaranteed to run sequentially.
- `in_specs` and `out_specs` also work the same as in `pl.pallas_call`, except they additionally accept `plgpu.GPUBlockSpec` instances that can be used to specify GPU-specific transforms, such as swizzling. See [memory reference transforms](https://docs.jax.dev/en/latest/pallas/gpu/reference.html#memory-reference-transforms) for more detail on the available transformations.
- `max_concurrent_steps` controls the maximum number of pipeline stages to use. Using additional stages will consume more SMEM to hold temporary buffers, so this option should be used carefully.

Collaborator:

Not sure if "pipeline stages" is quite right here. Should we explain this in terms of the number of copy operations running concurrently?

CC @apaszke

Collaborator Author:

I updated this to just say the number of concurrent memory copies in flight.

- `delay_release` allows the user to specify the number of steps to wait before re-using a buffer. This is useful for certain optimizations such as allowing multiple async matmuls in flight to keep the tensor core pipeline filled.
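
To make the parameters above concrete, here is a minimal sketch of a pipelined copy kernel. The block shape, grid, and copy body are invented for illustration, and the exact `emit_pipeline` body signature should be double-checked against the Pallas GPU reference; treat this as a sketch rather than the documented example.

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl
from jax.experimental.pallas import mosaic_gpu as plgpu


def copy_kernel(x_gmem, o_gmem):
  # The pipeline body sees one SMEM block per input/output spec.
  def copy_block(x_smem, o_smem):
    o_smem[...] = x_smem[...]

  block_spec = pl.BlockSpec(
      block_shape=(256, 256),
      index_map=lambda i, j: (i, j),
  )
  pipeline = plgpu.emit_pipeline(
      copy_block,
      grid=(2, 2),              # 4 sequential steps over a (512, 512) array
      in_specs=[block_spec],
      out_specs=[block_spec],
      max_concurrent_steps=2,   # double-buffer the SMEM staging buffers
  )
  # Calling the pipeline with the GMEM refs issues the async copies.
  pipeline(x_gmem, o_gmem)
```

The kernel would then be launched with its operands left in GMEM (for example via `plgpu.kernel`, or via the `pl.pallas_call` compatibility path described below), so that the pipeline rather than the caller manages the SMEM staging buffers.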


As an alternative to `emit_pipeline`, Mosaic GPU also implements the existing `pl.pallas_call` API for pipelining. Pipelining with `pl.pallas_call` directly requires the user to pass in a `plgpu.GPUCompilerParams` object as the `compiler_params` argument, which specifies the following options that are relevant for pipelining:
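
For example, a sketch of this compatibility path could look like the following. The kernel and block shapes are made up, the `GPUCompilerParams` fields shown (`max_concurrent_steps`, `delay_release`) are the ones named in the surrounding text, and selecting the Mosaic GPU lowering via the `backend` argument is an assumption worth verifying against the current `pallas_call` API:

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl
from jax.experimental.pallas import mosaic_gpu as plgpu


def add_one(x_smem, o_smem):
  o_smem[...] = x_smem[...] + 1.0


x = jnp.zeros((512, 128), jnp.float32)
y = pl.pallas_call(
    add_one,
    grid=(4,),  # pipelined sequentially over 4 row blocks
    in_specs=[pl.BlockSpec((128, 128), lambda i: (i, 0))],
    out_specs=pl.BlockSpec((128, 128), lambda i: (i, 0)),
    out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
    backend="mosaic_gpu",  # assumed: select the Mosaic GPU lowering
    compiler_params=plgpu.GPUCompilerParams(
        max_concurrent_steps=2,  # number of memory copies kept in flight
        delay_release=1,         # wait one extra step before reusing a buffer
    ),
)(x)
```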

Collaborator:

I would personally just leave this out tbh and tell people to use plgpu.kernel, since it allows you to do everything pl.pallas_call can and more (if you need to).

Collaborator Author:

I'll mark this as a "Compatibility" API only and nudge people to use plgpu.kernel instead? I think some users might prefer having "one way to do things" and use pallas_call for both GPU/TPU.

@justinjfu force-pushed the gpu_pipe_docs branch 2 times, most recently from 9f69f60 to b33dd7d on April 22, 2025 at 21:24
@google-ml-butler bot added the kokoro:force-run and pull ready (Ready for copybara import and testing) labels on Apr 25, 2025

@cperivol (Contributor) left a comment:

Nice work!

@apaszke self-assigned this on Apr 29, 2025

Member:

I think the picture is a bit misleading in that it makes it seem as if the WS schedule was actually worse? Perhaps compact the space in WG 1 so that while WG0 is doing copy_start, WG1 is doing copy_wait and matmul? Also, consider making the pipelined copy_waits short since the whole point is that we should no longer wait this much for memory if the transaction had enough time to complete?

Collaborator Author:

I compacted the figure so the two are mostly aligned in time.

So the memory thread spends most of its time on the consumed_barrier_wait, and the compute thread on matmul/copy_wait.

@@ -0,0 +1,325 @@
---

Member:

Just OOC, is there any benefit for us in keeping this guide as both ipynb and .md? The only thing I can think of is that people might be able to run it on Colab, but if it's mostly an explanation without much self-contained code then I'm not sure it's useful?

Collaborator Author:

I'm fine with just keeping the md, but it's also not really any extra work to have the ipynb, and there are a couple of runnable examples here for the matmuls.

name: python3
---

(pallas_mgpu_pipelining)=

Member:

OOC, is that some way to create links to docs?

Collaborator Author (@justinjfu, May 1, 2025):

Yes, you can refer back to this with {ref}`pallas_mgpu_pipelining`.

<!-- #region id="OkWmfqn7b53M" -->
We use the `carry_coroutine` pattern to initialize the WGMMA accumulator and to copy the final accumulator from registers into SMEM. Here, the carry coroutine is defined in the function `compute_thread`. It is critical that the accumulator be created inside the `compute_thread` function to avoid allocating it in the memory warpgroup, which would waste registers. To perform the WGMMA, we wrap the `wgmma` instruction in `pl.run_state` in order to create an accumulator ref that is initialized to the carry value.
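
A heavily abbreviated sketch of that last step is shown below. The names `a_smem`, `b_smem`, and `acc_carry` are placeholders, and the use of `plgpu.ACC.init` to seed the accumulator ref from the carry is an assumption about the API rather than a quote from this guide:

```python
def pipeline_step(a_smem, b_smem, acc_carry):
  # Wrap the WGMMA in pl.run_state so that the carried accumulator value
  # becomes a mutable accumulator ref for the duration of the call.
  def do_wgmma(acc_ref):
    plgpu.wgmma(acc_ref, a_smem, b_smem)

  # run_state returns the final accumulator value, which becomes the
  # carry for the next pipeline step.
  return pl.run_state(do_wgmma)(plgpu.ACC.init(acc_carry))
```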

Instead of using `pl.pallas_call` to call the kernel, we use the GPU-specific `plgpu.kernel` entry point. `plgpu.kernel` allows us to specify the number of warpgroups to launch per CUDA block via the `num_threads` argument, as well as a `thread_name` we can use to query the warpgroup index inside the kernel.
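
A minimal sketch of that entry point follows. The `"wg"` thread name, the use of `lax.axis_index` to read the thread index, and the toy body are illustrative assumptions; in the real matmul kernel one warpgroup runs the compute pipeline while the other drives the memory copies:

```python
import jax
import jax.numpy as jnp
from jax import lax
from jax.experimental import pallas as pl
from jax.experimental.pallas import mosaic_gpu as plgpu


def kernel_body(x_ref, o_ref):
  # thread_name="wg" below lets us query which Mosaic thread (warpgroup)
  # is running this body.
  wg = lax.axis_index("wg")

  @pl.when(wg == 0)
  def _():
    # Stand-in for the compute warpgroup's work.
    o_ref[...] = x_ref[...] + 1.0
  # The second warpgroup does nothing in this toy sketch; in the
  # warp-specialized matmul it would be the memory warpgroup.


x = jnp.zeros((128, 128), jnp.float32)
run = plgpu.kernel(
    kernel_body,
    out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
    num_threads=2,    # two warpgroups (Mosaic threads) per CUDA block
    thread_name="wg",
)
out = run(x)          # out == x + 1
```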

Member:

Please please please stop talking about warpgroups this much. It can be helpful to mention it from time to time, but the warpgroup really is not visible and is not a concept that makes sense in the semantics of Pallas:MGPU.

Collaborator Author:

I'll refer to this as "Mosaic thread" since there's potential for ambiguity between "CUDA thread" and "Mosaic thread", as Christos noted above.
