adding online benchmarking scripts #55

Merged
42 commits merged, Dec 31, 2024
Changes from 1 commit
Commits (42)
7d68656
add print_prompts cli arg
tstescoTT Dec 4, 2024
8d78d64
remove redundant stop token from vLLM example api calls
tstescoTT Dec 4, 2024
3108bc0
add capture_trace.py util to pre-prompt vllm server to capture all tr…
tstescoTT Dec 4, 2024
ea3d75d
adding utils/startup_utils.py to refine handling of startup in automa…
tstescoTT Dec 4, 2024
cc1d17a
adding force_max_tokens as option to call_inference_api(), add input_…
tstescoTT Dec 4, 2024
059d513
faster mock model prefill
tstescoTT Dec 4, 2024
48d17de
make it not send stop tokens by default and speed up mock model decod…
tstescoTT Dec 5, 2024
fead1aa
adding token count verification for vllm open ai api server to prompt…
tstescoTT Dec 5, 2024
5a80551
add max-log-len to limit logging of prompts to avoid clutter in logs
tstescoTT Dec 5, 2024
d845f08
add InferenceServerContext to startup_utils.py, improve wait_for_healthy
tstescoTT Dec 5, 2024
632ac83
add all_responses to utils/prompt_client_cli.py not using globals
tstescoTT Dec 5, 2024
f563e32
adding new utils/prompt_client_cli.py using utils/prompt_client.py an…
tstescoTT Dec 5, 2024
2467c74
fix health endpoint
tstescoTT Dec 5, 2024
af5e8dc
add vllm_model to EnvironmentConfig instead of BatchConfig
tstescoTT Dec 5, 2024
60c7ab2
refactor utils/capture_traces.py with new prompt_client
tstescoTT Dec 5, 2024
10993a2
fix utils imports
tstescoTT Dec 5, 2024
20ccdf4
fix BatchConfig usage
tstescoTT Dec 6, 2024
eab7e76
add benchmarking/online_benchmark_prompt_client.py using prompt_clien…
tstescoTT Dec 6, 2024
90acdf6
add benchmarking/online_benchmark_prompt_client.py using prompt_clien…
tstescoTT Dec 6, 2024
ec486ad
add benchmarking, evals, and tests dirs to Dockerfile
tstescoTT Dec 6, 2024
c58d7b3
update patchfile and benchmarking README.md with commands
tstescoTT Dec 6, 2024
fe4f96d
update Docker IMAGE_VERSION to v0.0.3
tstescoTT Dec 6, 2024
f3d815a
improve doc
tstescoTT Dec 6, 2024
8246a72
update benchmark_serving.patch
tstescoTT Dec 6, 2024
765c4be
add tt_model_runner.py patch for best_of
tstescoTT Dec 6, 2024
b93370d
update benchmarking/benchmark_serving.patch
tstescoTT Dec 6, 2024
5e07baa
use CACHE_ROOT for vllm_online_benchmark_results dir
tstescoTT Dec 6, 2024
d0e0b0f
adding timestamped online benchmark run result directory, rps=1 for v…
tstescoTT Dec 9, 2024
5db2523
update benchmark output file naming convention
tstescoTT Dec 9, 2024
5ab742c
rename benchmarking/online_benchmark_prompt_client.py to benchmarking…
tstescoTT Dec 9, 2024
06420bd
increase num_prompts default, default to 128/128 online test
tstescoTT Dec 9, 2024
b7e4cfc
use min_tokens and ignore_eos=True to force output seq len
tstescoTT Dec 9, 2024
dda29a9
adding min_tokens to locust requests
tstescoTT Dec 9, 2024
f8b3033
add --ignore-eos to vllm_online_benchmark.py to force the output seq …
tstescoTT Dec 10, 2024
12c38fc
add context_lens (isl, osl) pairs to capture_traces() to capture corr…
tstescoTT Dec 10, 2024
1cabdc9
add trace pre-capture to prompt_client_cli.py with option to disable
tstescoTT Dec 10, 2024
68f08d0
better comment and logs for trace capture
tstescoTT Dec 10, 2024
962c507
use TPOT and TPS in benchmarking/prompt_client_online_benchmark.py, a…
tstescoTT Dec 12, 2024
62bf427
update utils/prompt_client_cli.py and docs
tstescoTT Dec 12, 2024
d9e163c
remove WIP utils/startup_utils.py from this branch
tstescoTT Dec 12, 2024
cd29085
adding doc string to BatchProcessor
tstescoTT Dec 31, 2024
376403d
add output_path arg to batch_processor.py::BatchProcessor to optional…
tstescoTT Dec 31, 2024
refactor utils/capture_traces.py with new prompt_client
tstescoTT committed Dec 20, 2024
commit 60c7ab28674aa167f30cf18f8329c69627878b0b
79 changes: 7 additions & 72 deletions utils/capture_traces.py
@@ -2,16 +2,10 @@
 #
 # SPDX-FileCopyrightText: © 2024 Tenstorrent AI ULC
 
-import os
 import logging
-import argparse
-from utils.prompt_generation import generate_prompts
-from utils.prompt_client_cli import (
-    call_inference_api,
-    get_api_base_url,
-    get_authorization,
-)
-from utils.startup_utils import wait_for_healthy
 
+from prompt_configs import EnvironmentConfig
+from prompt_client import PromptClient
+
 logging.basicConfig(
     level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
@@ -21,69 +15,10 @@
 
 
-def capture_input_sizes():
-    """
-    Capture different input size graphs with the TT model on vLLM.
-    get_padded_prefill_len() defines the different input sizes for prefill:
-    https://github.com/tenstorrent/tt-metal/blob/main/models/demos/t3000/llama2_70b/tt/llama_generation.py#L341
-    """
-    input_sizes = [sz - 8 for sz in [32, 64, 128, 256, 512, 1024, 2048, 3072, 4096]]
-    prompts_per_size = 1
-    output_seq_len = 1
-
-    base_url = get_api_base_url()
-    if not wait_for_healthy(base_url):
-        raise RuntimeError("vLLM did not start correctly!")
-
-    api_url = f"{base_url}/completions"
-    headers = {"Authorization": f"Bearer {get_authorization()}"}
-    vllm_model = os.environ.get("VLLM_MODEL", "meta-llama/Llama-3.1-70B-Instruct")
-
-    for size in input_sizes:
-        logger.info(f"Capture input size: {size}")
-
-        args = argparse.Namespace(
-            tokenizer_model=vllm_model,
-            dataset="random",
-            max_prompt_length=size,
-            input_seq_len=size,
-            distribution="fixed",
-            template=None,
-            save_path=None,
-            print_prompts=False,
-            num_prompts=prompts_per_size,
-        )
-
-        prompts, prompt_lengths = generate_prompts(args)
-
-        for i, (prompt, prompt_len) in enumerate(zip(prompts, prompt_lengths)):
-            try:
-                response_data = call_inference_api(
-                    prompt=prompt,
-                    response_idx=i,
-                    prompt_len=prompt_len,
-                    stream=True,
-                    headers=headers,
-                    api_url=api_url,
-                    max_tokens=output_seq_len,
-                    vll_model=vllm_model,
-                    tokenizer=None,
-                )
-
-                logger.info(
-                    f"Input size: {size}, input_seq_len: {prompt_len}, TTFT: {response_data['ttft']:.3f}s"
-                )
-
-            except Exception as e:
-                logger.error(f"Error processing prompt: {e}")
-
-
 def main():
-    try:
-        capture_input_sizes()
-    except Exception as e:
-        logger.error(f"Capturing input sizes failed: {e}")
-        raise
+    env_config = EnvironmentConfig()
+    prompt_client = PromptClient(env_config)
+    prompt_client.capture_traces()
 
 
 if __name__ == "__main__":
-    capture_input_sizes()
+    main()
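
For context, the deleted capture_input_sizes() swept a list of padded prefill lengths, sending one short completion per length so the vLLM server records a trace for each input-size graph. That behavior is now expected to live inside PromptClient.capture_traces(). The sketch below is a hypothetical, stand-alone illustration of the same pattern; it is not the prompt_client implementation, and the environment variables, default URL, and prompt construction are assumptions carried over from the removed code.

# Hypothetical sketch of the trace pre-capture pattern; not the actual
# PromptClient code. Env vars, URL layout, and prompt building are assumptions.
import os
import logging

import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Padded prefill sizes used by the removed capture_input_sizes(), minus 8
# tokens of headroom for template/special tokens.
TRACE_INPUT_SIZES = [sz - 8 for sz in [32, 64, 128, 256, 512, 1024, 2048, 3072, 4096]]


def capture_traces_sketch() -> None:
    base_url = os.environ.get("VLLM_BASE_URL", "http://localhost:8000/v1")
    model = os.environ.get("VLLM_MODEL", "meta-llama/Llama-3.1-70B-Instruct")
    headers = {"Authorization": f"Bearer {os.environ.get('AUTHORIZATION', '')}"}

    for size in TRACE_INPUT_SIZES:
        # One throwaway prompt per size; a single output token is enough to
        # force a prefill at that padded length. Word count only approximates
        # token count here; the real code uses a tokenizer-aware prompt generator.
        payload = {
            "model": model,
            "prompt": " ".join(["hello"] * size),
            "max_tokens": 1,
        }
        resp = requests.post(
            f"{base_url}/completions", json=payload, headers=headers, timeout=600
        )
        resp.raise_for_status()
        logger.info("Captured trace for input size %d", size)


if __name__ == "__main__":
    capture_traces_sketch()

As later commits in this PR indicate (12c38fc), capture_traces() also gains context_lens (isl, osl) pairs so that traces are captured for the exact sequence lengths a benchmark will exercise.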