The vLLM benchmarking script is located at https://github.com/tenstorrent/vllm/blob/dev/examples/offline_inference_tt.py.
It is recommended to run the vLLM model implementation via `docker run`, following the instructions in tt-inference-server/vllm-tt-metal-llama3-70b/README.md.
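As a rough sketch only, a container launch typically has the shape below; the image tag, device path, and mount paths are placeholders and should be replaced with the values documented in that README:

```bash
# Placeholder values: replace the image tag, device path, and volume mount with
# those documented in tt-inference-server/vllm-tt-metal-llama3-70b/README.md.
docker run -it --rm \
  --device /dev/tenstorrent \
  -v /path/to/model_weights:/mnt/model_weights \
  vllm-tt-metal-llama3-70b-image:latest  # placeholder image name
```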
To measure performance for a single batch (with the default prompt length of 128 tokens):

```bash
export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml
python examples/offline_inference_tt.py --measure_perf --max_seqs_in_batch 32 --perf_prompt_len 128 --max_tokens 128

# for example, changing to input 2048, output 2048
python examples/offline_inference_tt.py --measure_perf --max_seqs_in_batch 32 --perf_prompt_len 2048 --max_tokens 2048
```
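To collect numbers for several sequence-length combinations in one pass, the same flags can be driven from a small shell loop. This is only a convenience sketch around the commands above; the (input, output) token-length pairs are illustrative:

```bash
# Sweep a few input/output token-length combinations with the same benchmark flags.
# The (prompt_len, max_tokens) pairs below are illustrative; adjust as needed.
export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml
for lens in "128 128" "2048 2048" "128 2048"; do
  set -- $lens
  prompt_len=$1
  max_tokens=$2
  echo "Benchmarking prompt_len=${prompt_len}, max_tokens=${max_tokens}"
  python examples/offline_inference_tt.py --measure_perf \
    --max_seqs_in_batch 32 \
    --perf_prompt_len "${prompt_len}" \
    --max_tokens "${max_tokens}"
done
```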
The script accepts the following arguments:

- `--prompts_json` (default: `"tt_metal/prompts.json"`): Path to prompts JSON file used for inference. Prompts should be in a list format (see the sketch after this list for an example file). This will not be used if `measure_perf` is set.
- `--measure_perf`: Measure model performance using synthetic inputs. If enabled, any provided `prompts_json` is ignored, and dummy prompts are used instead for benchmarking.
- `--perf_prompt_len` (default: `128`): Length of dummy prompts (in tokens) for benchmarking. Used only when `--measure_perf` is provided.
- `--max_tokens` (default: `128`): Maximum output length (in tokens) generated by the model for each prompt.
- `--greedy_sampling`: Use greedy decoding instead of probabilistic sampling (top-k/top-p). Greedy sampling always selects the token with the highest probability, leading to more deterministic output.
- `--max_seqs_in_batch` (default: `32`): Maximum batch size for inference, determining the number of prompts processed in parallel.
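For non-benchmark runs, the flags above can be combined with a custom prompts file. The following is a minimal sketch (the file name and prompt texts are arbitrary examples); note that `--measure_perf` must be omitted, otherwise the prompts file is ignored:

```bash
export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml

# Create a prompts file containing a JSON list of prompt strings
# (file name and prompt texts are arbitrary examples).
cat > my_prompts.json << 'EOF'
[
  "What is the capital of France?",
  "Write a short poem about the sea."
]
EOF

# Run inference on the custom prompts with deterministic (greedy) decoding.
python examples/offline_inference_tt.py \
  --prompts_json my_prompts.json \
  --max_tokens 128 \
  --max_seqs_in_batch 32 \
  --greedy_sampling
```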