Benchmarking

Llama 3.1 70B Instruct

vLLM offline benchmarking

The vLLM benchmarking script is located at https://github.com/tenstorrent/vllm/blob/dev/examples/offline_inference_tt.py

It is recommended to run the vLLM model implementation via docker run; see the setup instructions in tt-inference-server/vllm-tt-metal-llama3-70b/README.md.
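
For orientation only, a docker run invocation generally follows the shape sketched below. The image name is a placeholder and the device path is an assumption about how the Tenstorrent device is exposed; use the exact command from the linked README for a working setup.

# Sketch only: image name is a placeholder, /dev/tenstorrent is an assumed device path
docker run --rm -it \
  --device /dev/tenstorrent \
  --shm-size 32G \
  -v $PWD:/workspace \
  <vllm-tt-metal-llama3-70b-image> bash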

To measure performance for a single batch (with the default prompt length of 128 tokens):

export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml
python examples/offline_inference_tt.py --measure_perf --max_seqs_in_batch 32 --perf_prompt_len 128 --max_tokens 128
# for example, to benchmark with a 2048-token input and 2048-token output
python examples/offline_inference_tt.py --measure_perf --max_seqs_in_batch 32 --perf_prompt_len 2048 --max_tokens 2048
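
To collect numbers for several sequence-length configurations in one run, the same flags can be wrapped in a small shell loop. This is a sketch using only the flags shown above; the length pairs are illustrative.

# Sweep a few (input, output) token-length combinations
for lens in "128 128" "2048 2048" "128 2048"; do
  set -- $lens
  python examples/offline_inference_tt.py --measure_perf \
    --max_seqs_in_batch 32 --perf_prompt_len $1 --max_tokens $2
done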

Command Line Arguments

  • --prompts_json (default: "tt_metal/prompts.json"):
    • Path to the prompts JSON file used for inference. Prompts should be provided as a JSON list (see the sketch after this list). Ignored when --measure_perf is set.
  • --measure_perf:
    • Measure model performance using synthetic inputs. If enabled, any provided prompts JSON is ignored and dummy prompts are used instead for benchmarking.
  • --perf_prompt_len (default: 128):
    • Length of the dummy prompts (in tokens) used for benchmarking. Only used when --measure_perf is provided.
  • --max_tokens (default: 128):
    • Maximum output length (in tokens) generated by the model for each prompt.
  • --greedy_sampling:
    • Use greedy decoding instead of probabilistic sampling (top-k/top-p). Greedy decoding always selects the highest-probability token, producing more deterministic output.
  • --max_seqs_in_batch (default: 32):
    • Maximum batch size for inference, i.e. the number of prompts processed in parallel.
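
As referenced in the --prompts_json entry above, a custom prompts file can be supplied instead of synthetic inputs. The sketch below assumes the file is a JSON list of prompt strings (the description above only says "list format", so the exact element layout is an assumption), and the file name and prompt text are illustrative.

# Write a hypothetical prompts file as a JSON list of strings
cat > my_prompts.json << 'EOF'
[
  "What is the capital of France?",
  "Summarize the plot of Hamlet in two sentences."
]
EOF
# Run inference on the custom prompts (no --measure_perf, so the JSON file is used)
python examples/offline_inference_tt.py --prompts_json my_prompts.json --max_seqs_in_batch 32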