Commit fc95e41

fixed image size
1 parent c5ef781 commit fc95e41

1 file changed

_blogs/tokasaurus.md

Lines changed: 6 additions & 2 deletions
```diff
@@ -89,13 +89,17 @@ Tokasaurus can also efficiently serve bigger models across multiple GPUs! Here,
 
 One of our original goals with Tokasaurus was to efficiently run multi-GPU inference on our lab’s L40S GPUs, which don’t have fast inter-GPU NVLink connections. Without NVLink, the communication costs incurred running TP across a node of eight GPUs are substantial. Therefore, efficient support for PP (which requires much less inter-GPU communication) was a high priority. PP needs a large batch in order to run efficiently, since batches from the manager are subdivided into microbatches that are spread out across pipeline stages. When optimizing for throughput, we’re generally already using the largest batch size that fits in GPU memory, so PP is often a natural fit for throughput-focused workloads. When benchmarking against vLLM’s pipeline implementation using Llama-3.1-70B on eight L40S GPUs, Tokasaurus improves throughput by over 2.5x:
 
-![Tokasaurus pipeline parallelism](/imgs/blog/tokasaurus/pipeline.png)
+<div style="display: flex; gap: 16px; align-items: center;">
+<img src="/imgs/blog/tokasaurus/pipeline.png" alt="Tokasaurus pipeline parallelism" style="max-width: 98%; height: auto; display: block;">
+</div>
 
 ### Async Tensor Parallel for the GPU Rich
 
 If you do have GPUs with NVLink (e.g. B200s and certain models of H100s and A100s), Tokasaurus has something for you too! Models in Tokasaurus can be torch compiled end-to-end, allowing us to take advantage of [Async Tensor Parallelism (Async-TP)](https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch/209487). This is a relatively new feature in PyTorch that can overlap inter-GPU communication with other computations, partially hiding the cost of communication. In our benchmarks, we found that Async-TP adds a lot of CPU overhead to the model forward pass and only starts improving throughput with very large batch sizes (e.g. 6k+ tokens). Tokasaurus maintains torch-compiled versions of our models with and without Async-TP enabled, allowing us to automatically switch on Async-TP whenever the batch size is big enough:
 
-![Tokasaurus big models](/imgs/blog/tokasaurus/big.png)
+<div style="display: flex; gap: 16px; align-items: center;">
+<img src="/imgs/blog/tokasaurus/big.png" alt="Tokasaurus big models" style="max-width: 98%; height: auto; display: block;">
+</div>
 
 ---
 
```
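As context for the pipeline-parallelism paragraph in the hunk above, here is a minimal back-of-the-envelope sketch of why PP wants a large batch. It assumes a simple GPipe-style fill-and-drain schedule (an assumption; this commit does not show Tokasaurus's actual scheduler), where a batch split into `m` microbatches flowing through `s` stages takes `(s - 1) + m` stage-steps, so the idle "bubble" fraction shrinks as `m` grows:

```python
# Illustrative sketch, not Tokasaurus code: a GPipe-style fill/drain model
# of pipeline parallelism with s stages and m microbatches.

def pipeline_steps(num_stages: int, num_microbatches: int) -> int:
    # A microbatch must traverse every stage; stages overlap in steady
    # state, so total timesteps = fill latency + steady-state work.
    return (num_stages - 1) + num_microbatches

def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    # Fraction of stage-timesteps wasted waiting for the pipeline
    # to fill and drain.
    total = num_stages * pipeline_steps(num_stages, num_microbatches)
    useful = num_stages * num_microbatches
    return 1.0 - useful / total

for m in (1, 4, 16, 64):
    print(f"{m:>3} microbatches over 8 stages: "
          f"{bubble_fraction(8, m):.0%} of stage time idle")
```

Under this model, on eight stages (e.g. one per L40S), a single microbatch leaves the GPUs idle roughly 88% of the time, while 64 microbatches cut that to about 10%, which is why running with the largest batch that fits in memory is a natural fit for PP.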

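Similarly, for the Async-TP paragraph: a hedged sketch of the batch-size-based switching it describes. The `CompiledModelPair` wrapper, the `build_model` factory, and the 6144-token threshold are all hypothetical stand-ins (the threshold is read off the "6k+ tokens" figure in the text); `torch.compile` is the only real API relied on here:

```python
import torch

# Hypothetical threshold, inferred from the "6k+ tokens" figure above.
ASYNC_TP_MIN_TOKENS = 6144

class CompiledModelPair:
    """Keeps two compiled variants of one model and dispatches per batch."""

    def __init__(self, build_model):
        # `build_model(async_tp=...)` is a hypothetical factory that would
        # construct the (TP-sharded) model with async-TP compile options
        # toggled on or off.
        self.plain = torch.compile(build_model(async_tp=False))
        self.with_async_tp = torch.compile(build_model(async_tp=True))

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # input_ids is (batch, seq), so numel() is the batch's token count.
        # Async-TP adds CPU overhead per forward pass, so only use it once
        # the batch is large enough to amortize that cost.
        num_tokens = input_ids.numel()
        model = (self.with_async_tp if num_tokens >= ASYNC_TP_MIN_TOKENS
                 else self.plain)
        return model(input_ids)
```

Compiling both variants up front keeps the switch essentially free at serving time; the alternative of recompiling whenever the batch crosses the threshold would stall the engine mid-stream.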