Getting Started
Our benchmark configs are written in YAML. Here is a sample config that will run a single benchmark of Llama-3.1-8B on an H100 with vLLM.
```yaml
id: llama3
base_config:
  model: meta-llama/Llama-3.1-8B-Instruct
  region: us-chicago-1
  llm_server_type: vllm
  data: prompt_tokens=1024,output_tokens=128
  gpu: H100
```
Now, make sure you've deployed Stopwatch as described in README.md.
Then, you can save this file to configs/llama.yaml
and run this benchmark config with the following command:
⚠️ This command costs money! The cost will vary depending on the number of configs and the sizes of the models you're benchmarking, but a rough estimate for an H100 Llama-3.1-8B vLLM benchmark like this one would be (1 synchronous + 1 throughput + 10 fixed-rate benchmarks) * (90 second startup + 120 second benchmark) * ($0.001097/H100-second) * (1.25 region-pinning multiplier) = $3.46.
```bash
modal run -d -m cli.run_benchmark_suite --config-path configs/llama.yaml
```
This will run one synchronous test to determine the server's minimum request rate, then one throughput test to determine the server's maximum request rate, and then a bunch of fixed-rate tests in parallel to determine the server's latency and throughput at various request rates in between. This should take about 8 minutes in total. Once this is done, you can open your deployed Datasette UI to view and plot the results of your benchmark.
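For reference, the cost estimate in the warning above is just the product of these pieces. Here is that arithmetic as a small Python sketch; the timings, per-second GPU rate, and region multiplier are the rough figures quoted above and will vary in practice:

```python
# Rough cost estimate for the H100 Llama-3.1-8B vLLM benchmark above.
# All figures are approximate and will vary with pricing and model size.
num_benchmarks = 1 + 1 + 10        # synchronous + throughput + fixed-rate
seconds_per_benchmark = 90 + 120   # ~90 s server startup + ~120 s benchmark
usd_per_h100_second = 0.001097
region_pinning_multiplier = 1.25

estimate = (
    num_benchmarks
    * seconds_per_benchmark
    * usd_per_h100_second
    * region_pinning_multiplier
)
print(f"~${estimate:.2f}")  # ~$3.46
```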
The URL to the Datasette UI will be printed in your terminal after the benchmark completes, but it should look like this:
https://YOUR-ACCOUNT-NAME--datasette.modal.run/stopwatch/-/query?sql=select+*+from+llama3_averaged+where+rate_type+%21%3D+"throughput"
And you should see a plot like this, which you can adjust using the dropdowns at the bottom:
While the plots look nice, running a single benchmark is pretty boring: Stopwatch's strength lies in its ability to run many benchmarks at the same time.
For example, we might want to compare the performance of vLLM and SGLang.
Rather than writing out two separate benchmark configs, we can instead swap out the llm_server_type
parameter for a list, which Stopwatch will iterate over.
If any of the benchmarks you've defined have been run before, Stopwatch will reuse their cached results when building this config's results.
So if you already ran the vLLM config above, the following config will only launch new experiments for SGLang, but your results files will contain the results from both the vLLM and SGLang experiments:
```yaml
id: llama3-vllm-sglang
base_config:
  model: meta-llama/Llama-3.1-8B-Instruct
  region: us-chicago-1
  llm_server_type:
    - vllm
    - sglang
  data: prompt_tokens=1024,output_tokens=128
  gpu: H100
```
Now, let's say we also want to compare the performance of different data distributions with each server. If we provide Stopwatch with multiple lists, it will iterate over all combinations of each value in each list. For example, we can define a suite of 4 benchmarks (vLLM + data config 1, vLLM + data config 2, SGLang + data config 1, SGLang + data config 2) like this:
```yaml
id: llama3-data-configs
base_config:
  model: meta-llama/Llama-3.1-8B-Instruct
  region: us-chicago-1
  llm_server_type:
    - vllm
    - sglang
  data:
    - prompt_tokens=1024,output_tokens=128
    - prompt_tokens=128,output_tokens=1024
  gpu: H100
```
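To make the expansion concrete, here is a small Python sketch of how the two lists above multiply out into four benchmarks (an illustration of the semantics, not Stopwatch's actual code):

```python
from itertools import product

# The two swept parameters from the config above.
llm_server_types = ["vllm", "sglang"]
data_configs = [
    "prompt_tokens=1024,output_tokens=128",
    "prompt_tokens=128,output_tokens=1024",
]

# Stopwatch iterates over every combination of the listed values,
# i.e. the Cartesian product: 2 server types x 2 data configs = 4 benchmarks.
for llm_server_type, data in product(llm_server_types, data_configs):
    print(llm_server_type, data)
```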
Running this config will yield a plot like this:
Models with 8 billion parameters are getting pretty small these days, so now, let's say we want to use multiple GPUs to benchmark Llama-3.1-70B. To run any LLM inference server with multiple GPUs, you will likely want to set the server's tensor parallelism, which needs to be passed as a command-line argument. Stopwatch supports passing arguments to each LLM server on startup. For example, if you want to benchmark an 8xH100 configuration with vLLM, you can set it up like this:
```yaml
id: llama3-8xh100-vllm
base_config:
  model: meta-llama/Llama-3.1-70B-Instruct
  region: us-chicago-1
  llm_server_type: vllm
  data: prompt_tokens=1024,output_tokens=128
  gpu: H100:8
  llm_server_config:
    extra_args: ["--tensor-parallel-size", "8"]
```
However, if you now want to compare the performance of vLLM, SGLang, and TensorRT-LLM, you'll run into a problem: SGLang and TensorRT-LLM don't accept the --tensor-parallel-size command-line argument in this way.
You'll instead need to define the configuration for each server individually, each of which will inherit from your base_config.
Setting tensor parallelism is handled for you in TensorRT-LLM, which is generally more difficult to configure.
```yaml
id: llama3-8xh100
base_config:
  model: meta-llama/Llama-3.1-70B-Instruct
  region: us-chicago-1
  data: prompt_tokens=1024,output_tokens=128
  gpu: H100:8
configs:
  - llm_server_type: vllm
    llm_server_config:
      extra_args: ["--tensor-parallel-size", "8"]
  - llm_server_type: sglang
    llm_server_config:
      extra_args: ["--tp", "8"]
  - llm_server_type: tensorrt-llm
```
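If it helps, here is a minimal Python sketch of the inheritance described above (illustrative only, not Stopwatch's implementation): each entry in configs is layered on top of base_config, with the entry's own keys taking precedence.

```python
# Each entry in `configs` inherits every key from `base_config` and then
# adds or overrides its own keys (illustrative merge, not Stopwatch's code).
base_config = {
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "region": "us-chicago-1",
    "data": "prompt_tokens=1024,output_tokens=128",
    "gpu": "H100:8",
}
configs = [
    {"llm_server_type": "vllm",
     "llm_server_config": {"extra_args": ["--tensor-parallel-size", "8"]}},
    {"llm_server_type": "sglang",
     "llm_server_config": {"extra_args": ["--tp", "8"]}},
    {"llm_server_type": "tensorrt-llm"},
]

benchmarks = [{**base_config, **override} for override in configs]
for benchmark in benchmarks:
    print(benchmark["llm_server_type"], benchmark.get("llm_server_config"))
```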
You now know everything you need to know to start writing Stopwatch configs!
Check out the configs in the configs directory for more examples of how you can tune vLLM, SGLang, and TensorRT-LLM, and please open an issue, submit a pull request, or send us an email if you have any questions.