Add script for benchmarking serving throughput #145

WoosukKwon · 2023-06-10T20:05:41Z

This PR adds a script to benchmark online and offline serving throughput. It uses a Poisson process to synthesize the request arrival times and simulates the serving process with a simple FastAPI frontend.

I think this closes #45
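For readers unfamiliar with the approach, here is a minimal sketch of synthesizing arrival times from a Poisson process: inter-arrival gaps are drawn from an exponential distribution with mean `1 / request_rate`. The function name, its parameters, and the fixed seed are assumptions made for illustration and are not taken from the script in this PR.

```python
import random
from typing import List


def poisson_arrival_times(num_requests: int,
                          request_rate: float,
                          seed: int = 0) -> List[float]:
    """Return cumulative arrival times in seconds for a Poisson process.

    Hypothetical helper: `request_rate` is the mean number of requests
    per second, so each gap is exponential with mean 1 / request_rate.
    """
    if request_rate <= 0:
        raise ValueError("request_rate must be positive")
    rng = random.Random(seed)  # seeded so that runs are reproducible
    now = 0.0
    arrivals = []
    for _ in range(num_requests):
        # Exponentially distributed gap between consecutive requests.
        now += rng.expovariate(request_rate)
        arrivals.append(now)
    return arrivals
```

A benchmark client would typically sleep until each arrival time before sending the corresponding request, so the offered load averages `request_rate` requests per second.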

WoosukKwon · 2023-06-11T23:26:10Z

@zhuohan123 The PR is ready for review. Please check it out.

zhuohan123

LGTM! Left some comments.

zhuohan123 · 2023-06-14T15:59:14Z

benchmarks/benchmark_serving.py

+REQUEST_LATENCY: List[Tuple[int, int, float]] = []
+
+
+def get_tokenizer(model_name: str) -> PreTrainedTokenizerBase:


Import from cacheflow.server.tokenizer_utils?

I didn't use it because I thought that's an internal function that cacheflow does not expose to the users.
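As a side note on the `REQUEST_LATENCY` list in the hunk above, the sketch below shows one way such records could be summarized. It assumes each tuple is `(prompt_len, output_len, latency_in_seconds)`; that field order and the function name `summarize_latency` are assumptions, since the diff excerpt does not show how the list is filled.

```python
from typing import Dict, List, Tuple


def summarize_latency(
        records: List[Tuple[int, int, float]]) -> Dict[str, float]:
    """Compute average latency statistics over per-request records.

    Assumed record layout: (prompt_len, output_len, latency_seconds).
    """
    if not records:
        raise ValueError("no latency records to summarize")
    n = len(records)
    # Mean end-to-end latency per request.
    avg_latency = sum(latency for _, _, latency in records) / n
    # Mean latency per token, counting prompt and output tokens.
    avg_per_token = sum(
        latency / (prompt_len + output_len)
        for prompt_len, output_len, latency in records) / n
    # Mean latency per generated (output) token only.
    avg_per_output_token = sum(
        latency / output_len for _, output_len, latency in records) / n
    return {
        "avg_latency": avg_latency,
        "avg_per_token": avg_per_token,
        "avg_per_output_token": avg_per_output_token,
    }
```

Records with `output_len == 0` would need to be filtered out before calling this helper, since it divides by the output length.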

zhuohan123 · 2023-06-14T16:09:43Z

benchmarks/benchmark_serving.py

+    parser.add_argument("--port", type=int, default=8001)
+    parser.add_argument("--dataset", type=str, required=True,
+                        help="Path to the dataset.")
+    parser.add_argument("--tokenizer", type=str, required=True,


Does this argument actually ask for model_name?

Yes, is it confusing?

zhuohan123 · 2023-06-14T16:10:41Z

benchmarks/benchmark_throughput.py

+from tqdm import tqdm
+
+
+def get_tokenizer(model_name: str) -> PreTrainedTokenizerBase:


This function differs slightly from get_tokenizer in CacheFlow. It has one additional line

tokenizer.pad_token = tokenizer.eos_token

which is required for batched generation in the HF backend.
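To make the point concrete, here is a small sketch of the pad-token fallback. The helper name `ensure_pad_token` is hypothetical, and unlike the single line quoted above it only assigns when no pad token is set, which is a deviation chosen for this example. A stub object stands in for a real Hugging Face tokenizer so the snippet runs without downloading a model.

```python
from types import SimpleNamespace


def ensure_pad_token(tokenizer):
    """Fall back to the EOS token when the tokenizer has no pad token.

    Hypothetical helper: batched generation pads prompts to a common
    length, which requires a pad token to be defined.
    """
    if getattr(tokenizer, "pad_token", None) is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


# Stub standing in for a tokenizer that ships without a pad token.
stub = SimpleNamespace(eos_token="</s>", pad_token=None)
ensure_pad_token(stub)
print(stub.pad_token)
```

With a real tokenizer, the same helper would be applied to the object returned by `AutoTokenizer.from_pretrained(model_name)`.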

zhuohan123 · 2023-06-14T16:12:32Z

benchmarks/benchmark_throughput.py

+    return AutoTokenizer.from_pretrained(model_name)


 def sample_requests(


Does this function duplicate the one in benchmark_serving.py?

Yes, it does. I didn't find a good way to avoid that.



WoosukKwon added 30 commits June 10, 2023 05:00

Minor fix

473c5b8

Minor

a644a9b

Minor

67ed51c

Minor

83acd5e

Add log-requests option to AsyncLLMServer

4957281

[WIP] Add benchmark_serving.py

c6b38d2

Minor

5210de0

Delete unused files

d4df348

Minor

fab12d6

Add docstring

3ddadf4

Bugfix

4269b11

Minor

af8974d

Minor

f8dee6e

Add script to launch HF server

d181f10

Add HF backend

fc02a02

Minor

99d9ce3

Bugfix

bc9ec63

Filter out long prompts

9477f2f

Minor fix

51a5332

Merge branch 'main' into benchmark-llama

6b0d77b

Repeat failed requests

00d158d

Stream=False

0c55c40

Minor

bcb8e16

Prune short sequences

6a7baaa

Add 1 hour timeout

071b4aa

Increase timeout

983cf97

Add shortcut

b55b1ee

Simplify

c45a2dd

Merge branch 'opt' into benchmark-llama

66f8c60

n -> best_of

a1b513e

WoosukKwon added 4 commits June 11, 2023 22:26

Minor

72d6a63

Add latency stats

44bc461

Increase max_best_of in HF server

6990fc5

Merge branch 'main' into benchmark-llama

2c610bd

WoosukKwon requested a review from zhuohan123 June 11, 2023 23:23

WoosukKwon added 5 commits June 13, 2023 00:08

hf -> tgi

5687f10

Add HF backend

672fbbd

Fix batching

60bccc4

Fix a bug & Add tqdm

b7fcade

Minor

6accbfd

zhuohan123 approved these changes Jun 14, 2023

WoosukKwon added 4 commits June 15, 2023 02:21

Fix

c7360d1

Comment

bf1bae6

Add docstring

7bebe29

Comment

5c1b852

WoosukKwon merged commit 311490a into main Jun 15, 2023

WoosukKwon deleted the benchmark-llama branch June 15, 2023 02:55

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

Add script for benchmarking serving throughput (vllm-project#145)

6836716

Xaenalt pushed a commit to Xaenalt/vllm that referenced this pull request Aug 15, 2024

Re-enable FusedRoPE (vllm-project#145)

667c7f3

dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request Sep 5, 2024

Merge pull request vllm-project#145 from RH-steve-grubb/do-updates

ac04abb

Start by updating the image

mht-sharma pushed a commit to mht-sharma/vllm that referenced this pull request Oct 30, 2024

Update test-template.j2 (vllm-project#145)

7c5fd50
