adding online benchmarking scripts #55

Merged: tstescoTT merged 42 commits into main from tstesco/online-benchmark on Dec 31, 2024

Conversation

@tstescoTT (Contributor) commented on Dec 12, 2024:

change log

  • add utils/prompt_client.py::PromptClient as a vLLM client that handles authentication and health checks (see the sketch after this list)
  • address trace capture: pre-capture prefill + decode traces in the vLLM run script so TTFT on the first completions is not unexpectedly high or stalled #56
  • improve prompt generation and handling with utils/prompt_configs.py and utils/batch_processor.py
  • remove explicitly setting the stop token in the prompt client; it causes issues with instruct models that are correctly configured with an instruct tokenizer
  • add trace capturing ahead of performance measurement in benchmarking scripts
  • add online benchmarking script using vllm/benchmarks/benchmark_serving.py
  • add vLLM benchmarking patch at benchmarking/benchmark_serving.patch to handle best_of, which is unsupported in the current Tenstorrent vLLM fork
  • add benchmarking/prompt_client_online_benchmark.py to measure performance with different batch handling
  • update benchmarking docs
  • update prompt CLI and util docs
  • update mock model to be faster and to not send stop tokens unexpectedly
  • add benchmarking, evals, and tests to Docker image vllm-tt-metal-llama3-70b/vllm.llama3.src.Dockerfile
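
For context, here is a minimal, hypothetical sketch of what a PromptClient-style helper could look like. It assumes only vLLM's OpenAI-compatible endpoints (GET /health, POST /v1/completions) and a bearer token read from the environment; the class and method names are illustrative, not the actual utils/prompt_client.py API.

```python
# Hypothetical sketch only -- not the actual utils/prompt_client.py implementation.
import os
import time
from typing import Optional

import requests


class PromptClient:
    def __init__(self, base_url: str = "http://localhost:8000", api_key: Optional[str] = None):
        self.base_url = base_url.rstrip("/")
        # Auth token; the environment variable name here is illustrative.
        self.api_key = api_key or os.environ.get("AUTHORIZATION", "")

    def _headers(self) -> dict:
        return {"Authorization": f"Bearer {self.api_key}"} if self.api_key else {}

    def wait_for_healthy(self, timeout_s: float = 600.0) -> bool:
        """Poll the vLLM /health endpoint until the server reports ready."""
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            try:
                r = requests.get(f"{self.base_url}/health", headers=self._headers(), timeout=5)
                if r.status_code == 200:
                    return True
            except requests.RequestException:
                pass
            time.sleep(2.0)
        return False

    def completion(self, prompt: str, model: str, max_tokens: int = 128) -> dict:
        """Send one completion request to the OpenAI-compatible endpoint."""
        payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens, "stream": False}
        r = requests.post(
            f"{self.base_url}/v1/completions",
            json=payload,
            headers=self._headers(),
            timeout=600,
        )
        r.raise_for_status()
        return r.json()
```

Per the trace pre-capture items above, the benchmarking scripts capture prefill and decode traces before performance measurement so the first timed completions do not pay the trace-capture cost.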

Contributor:
Can you provide a description of what this Class is trying to achieve?

@tstescoTT (Contributor, Author):

BatchProcessor runs multiple concurrent requests against the backend inference server (vLLM in this case). It adds support for capping the number of in-flight requests independently of the backend batch_size. This is mostly for testing continuous batching and sequence lengths, but it can also be used as an alternative benchmarking method, as in benchmarking/prompt_client_online_benchmark.py: by not exceeding the backend's concurrent-user capacity, it measures TTFT as a user would experience it, rather than including time requests spend queued on the server before the model starts processing them.
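
As a rough sketch of the concurrency-capping pattern described above (not the actual utils/batch_processor.py code), a thread pool can bound the number of in-flight requests; prompt_client.completion here refers to the hypothetical client method sketched in the PR description.

```python
# Illustrative sketch only -- not the actual utils/batch_processor.py code.
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_prompts(prompt_client, prompts, model, max_concurrent=32):
    """Send prompts with at most max_concurrent requests in flight,
    independent of the backend's own batch_size."""
    results = []
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = {pool.submit(prompt_client.completion, p, model): p for p in prompts}
        for fut in as_completed(futures):
            # Pair each response with the prompt that produced it.
            results.append((futures[fut], fut.result()))
    return results
```

Keeping max_concurrent at or below the backend's concurrent-user capacity is what lets TTFT be measured as a user would see it, rather than including server-side queueing time.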

Contributor:

A suggestion is to add that description to the file / Class. That's a good explanation for the user.

…seq_lengths and output_seq_lengths directly args to test_api_call_threaded_full_queue() to allow for varied isl and osl within batch
…d utils/batch_processor.py with configs in utils/prompt_configs.py and utils/prompt_generation.py for prompt generation
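
As a purely illustrative aside on the commit above: passing per-prompt sequence lengths lets a single batch mix input and output lengths. The function below is hypothetical and is not the actual test_api_call_threaded_full_queue() signature.

```python
# Hypothetical illustration of varied isl/osl within one batch.
def build_requests(prompts, input_seq_lengths, output_seq_lengths):
    assert len(prompts) == len(input_seq_lengths) == len(output_seq_lengths)
    # Each request carries its own input length (isl) and output length (osl).
    return [
        {"prompt": p, "input_seq_len": isl, "max_tokens": osl}
        for p, isl, osl in zip(prompts, input_seq_lengths, output_seq_lengths)
    ]
```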
@tstescoTT force-pushed the tstesco/online-benchmark branch from af1325f to d9e163c on December 20, 2024 03:25
@milank94 (Contributor) left a comment:
Looks great. Pending one suggestion to add a description under BatchProcessor.

…ly provide incremental output saving for debugging, default to not saving output for benchmarking
@tstescoTT merged commit fe563af into main on Dec 31, 2024 (1 check passed)
@tstescoTT deleted the tstesco/online-benchmark branch on January 15, 2025 01:13