Getting Started
Our benchmark configs are written in YAML. Here is a sample config that will run a single benchmark of Llama-3.1-8B on an H100 with vLLM.
```yaml
id: llama3
base_config:
  model: meta-llama/Llama-3.1-8B-Instruct
  region: us-chicago-1
  llm_server_type: vllm
  data: prompt_tokens=1024,output_tokens=128
  gpu: H100
```
Now, make sure you've deployed Stopwatch as described in README.md.
Then, you can save this file to configs/llama.yaml
and run this benchmark config with the following command:
⚠️ This command costs money! The cost will vary depending on the number of configs and the sizes of the models you're benchmarking, but a rough estimate for an H100 Llama-3.1-8B vLLM benchmark like this one would be (1 synchronous + 1 throughput + 10 fixed-rate benchmarks) * (90 second startup + 120 second benchmark) * ($0.001097/H100-second) * (1.25 region-pinning multiplier) = $3.46.
```bash
modal run -d -m cli.run_benchmark_suite --config-path configs/llama.yaml
```
This will run one synchronous test to determine the server's minimum request rate, then one throughput test to determine the server's maximum request rate, and then a bunch of fixed-rate tests in parallel to determine the server's latency and throughput at various request rates in between. This should take about 8 minutes in total. Once this is done, you can open your deployed Datasette UI to view and plot the results of your benchmark.
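For reference, the cost estimate in the warning above is just the product of these pieces. Here is that arithmetic as a small Python sketch; the timings, per-second GPU rate, and region multiplier are the rough figures quoted above and will vary in practice:

```python
# Rough cost estimate for the H100 Llama-3.1-8B vLLM benchmark above.
# All figures are approximate and will vary with pricing and model size.
num_benchmarks = 1 + 1 + 10        # synchronous + throughput + fixed-rate
seconds_per_benchmark = 90 + 120   # ~90 s server startup + ~120 s benchmark
usd_per_h100_second = 0.001097
region_pinning_multiplier = 1.25

estimate = (
    num_benchmarks
    * seconds_per_benchmark
    * usd_per_h100_second
    * region_pinning_multiplier
)
print(f"~${estimate:.2f}")  # ~$3.46
```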
The URL to the Datasette UI will be printed in your terminal after the benchmark completes, but it should look like this:
https://YOUR-ACCOUNT-NAME--datasette.modal.run/stopwatch/-/query?sql=select+*+from+llama3_averaged+where+rate_type+%21%3D+"throughput"
And you should see a plot like this, which you can adjust using the dropdowns at the bottom:
While the plots look nice, running a single benchmark is pretty boring: Stopwatch's strength lies in its ability to run many benchmarks at the same time.
For example, we might want to compare the performance of vLLM and SGLang.
Rather than writing out two separate benchmark configs, we can instead swap out the llm_server_type
parameter for a list, which Stopwatch will iterate over.
If any of the benchmarks you've defined have been run before, Stopwatch will reuse their cached results when building this config's results.
So if you already ran the vLLM config above, the following config will only launch new experiments for SGLang, but your results files will contain the results from both the vLLM and SGLang experiments:
```yaml
id: llama3-vllm-sglang
base_config:
  model: meta-llama/Llama-3.1-8B-Instruct
  region: us-chicago-1
  llm_server_type:
    - vllm
    - sglang
  data: prompt_tokens=1024,output_tokens=128
  gpu: H100
```
Now, let's say we also want to compare the performance of different data distributions with each server. If we provide Stopwatch with multiple lists, it will iterate over all combinations of each value in each list. For example, we can define a suite of 4 benchmarks (vLLM + data config 1, vLLM + data config 2, SGLang + data config 1, SGLang + data config 2) like this:
```yaml
id: llama3-data-configs
base_config:
  model: meta-llama/Llama-3.1-8B-Instruct
  region: us-chicago-1
  llm_server_type:
    - vllm
    - sglang
  data:
    - prompt_tokens=1024,output_tokens=128
    - prompt_tokens=128,output_tokens=1024
  gpu: H100
```
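To make the expansion concrete, here is a small Python sketch of how the two lists above multiply out into four benchmarks (an illustration of the semantics, not Stopwatch's actual code):

```python
from itertools import product

# The two swept parameters from the config above.
llm_server_types = ["vllm", "sglang"]
data_configs = [
    "prompt_tokens=1024,output_tokens=128",
    "prompt_tokens=128,output_tokens=1024",
]

# Stopwatch iterates over every combination of the listed values,
# i.e. the Cartesian product: 2 server types x 2 data configs = 4 benchmarks.
for llm_server_type, data in product(llm_server_types, data_configs):
    print(llm_server_type, data)
```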
Running this config will yield a plot like this:
Models with 8 billion parameters are getting pretty small these days, so now, let's say we want to use multiple GPUs to benchmark Llama-3.1-70B. To run any LLM inference server with multiple GPUs, you will likely want to set the server's tensor parallelism, which needs to be passed as a command-line argument. Stopwatch supports passing arguments to each LLM server on startup. For example, if you want to benchmark an 8xH100 configuration with vLLM, you can set it up like this:
```yaml
id: llama3-8xh100-vllm
base_config:
  model: meta-llama/Llama-3.1-70B-Instruct
  region: us-chicago-1
  llm_server_type: vllm
  data: prompt_tokens=1024,output_tokens=128
  gpu: H100:8
  llm_server_config:
    extra_args: ["--tensor-parallel-size", "8"]
```
However, if you now want to compare the performance of vLLM, SGLang, and TensorRT-LLM, you'll run into a problem: SGLang and TensorRT-LLM don't accept the --tensor-parallel-size command-line argument in this way.
You'll instead need to define the configuration for each server individually, each of which will inherit from your base_config.
Setting tensor parallelism is handled for you in TensorRT-LLM, which is generally more difficult to configure.
```yaml
id: llama3-8xh100
base_config:
  model: meta-llama/Llama-3.1-70B-Instruct
  region: us-chicago-1
  data: prompt_tokens=1024,output_tokens=128
  gpu: H100:8
configs:
  - llm_server_type: vllm
    llm_server_config:
      extra_args: ["--tensor-parallel-size", "8"]
  - llm_server_type: sglang
    llm_server_config:
      extra_args: ["--tp", "8"]
  - llm_server_type: tensorrt-llm
```
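If it helps, here is a minimal Python sketch of the inheritance described above (illustrative only, not Stopwatch's implementation): each entry in configs is layered on top of base_config, with the entry's own keys taking precedence.

```python
# Each entry in `configs` inherits every key from `base_config` and then
# adds or overrides its own keys (illustrative merge, not Stopwatch's code).
base_config = {
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "region": "us-chicago-1",
    "data": "prompt_tokens=1024,output_tokens=128",
    "gpu": "H100:8",
}
configs = [
    {"llm_server_type": "vllm",
     "llm_server_config": {"extra_args": ["--tensor-parallel-size", "8"]}},
    {"llm_server_type": "sglang",
     "llm_server_config": {"extra_args": ["--tp", "8"]}},
    {"llm_server_type": "tensorrt-llm"},
]

benchmarks = [{**base_config, **override} for override in configs]
for benchmark in benchmarks:
    print(benchmark["llm_server_type"], benchmark.get("llm_server_config"))
```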
You now know everything you need to know to start writing Stopwatch configs!
Check out the configs in the configs directory for more examples of how you can tune vLLM, SGLang, and TensorRT-LLM, and please open an issue, submit a pull request, or send us an email if you have any questions.