
Commit 3d28ad3

Fix figures in design doc (#18612)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

1 parent 6a7988c · commit 3d28ad3

File tree

1 file changed: +12 −26 lines

docs/design/kernel/paged_attention.md

Lines changed: 12 additions & 26 deletions
````diff
@@ -140,22 +140,18 @@ title: vLLM Paged Attention
 const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
 ```
 
-<figure markdown="span">
-  ![](../../assets/kernel/query.png){ align="center" alt="query" width="70%" }
-  <figcaption>
-  </figcaption>
-</figure>
+<figure markdown="span">
+  ![](../../assets/kernel/query.png){ align="center" alt="query" width="70%" }
+</figure>
 
 - Each thread defines its own `q_ptr` which points to the assigned
   query token data on global memory. For example, if `VEC_SIZE` is 4
   and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains
   total of 128 elements divided into 128 / 4 = 32 vecs.
 
-<figure markdown="span">
-  ![](../../assets/kernel/q_vecs.png){ align="center" alt="q_vecs" width="70%" }
-  <figcaption>
-  </figcaption>
-</figure>
+<figure markdown="span">
+  ![](../../assets/kernel/q_vecs.png){ align="center" alt="q_vecs" width="70%" }
+</figure>
 
 ```cpp
 __shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];
````
````diff
@@ -192,11 +188,9 @@ title: vLLM Paged Attention
   points to key token data based on `k_cache` at assigned block,
   assigned head and assigned token.
 
-<figure markdown="span">
-  ![](../../assets/kernel/key.png){ align="center" alt="key" width="70%" }
-  <figcaption>
-  </figcaption>
-</figure>
+<figure markdown="span">
+  ![](../../assets/kernel/key.png){ align="center" alt="key" width="70%" }
+</figure>
 
 - The diagram above illustrates the memory layout for key data. It
   assumes that the `BLOCK_SIZE` is 16, `HEAD_SIZE` is 128, `x` is
````
````diff
@@ -209,11 +203,9 @@ title: vLLM Paged Attention
   elements for one token) that will be processed by 2 threads (one
   thread group) separately.
 
-<figure markdown="span">
-  ![](../../assets/kernel/k_vecs.png){ align="center" alt="k_vecs" width="70%" }
-  <figcaption>
-  </figcaption>
-</figure>
+<figure markdown="span">
+  ![](../../assets/kernel/k_vecs.png){ align="center" alt="k_vecs" width="70%" }
+</figure>
 
 ```cpp
 K_vec k_vecs[NUM_VECS_PER_THREAD]
````
````diff
@@ -372,20 +364,14 @@ title: vLLM Paged Attention
 
 <figure markdown="span">
   ![](../../assets/kernel/value.png){ align="center" alt="value" width="70%" }
-  <figcaption>
-  </figcaption>
 </figure>
 
 <figure markdown="span">
   ![](../../assets/kernel/logits_vec.png){ align="center" alt="logits_vec" width="50%" }
-  <figcaption>
-  </figcaption>
 </figure>
 
 <figure markdown="span">
   ![](../../assets/kernel/v_vec.png){ align="center" alt="v_vec" width="70%" }
-  <figcaption>
-  </figcaption>
 </figure>
 
 - Now we need to retrieve the value data and perform dot multiplication
````

0 commit comments
