[Misc] add gpu_memory_utilization arg #5079
Conversation
Signed-off-by: pandyamarut <pandyamarut@gmail.com>
@@ -211,5 +212,11 @@ def run_to_completion(profile_dir: Optional[str] = None):
                        type=str,
                        default=None,
                        help='Path to save the latency results in JSON format.')
    parser.add_argument('--gpu-memory-utilization',
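The hunk above is truncated. As a minimal sketch of how such a flag is usually wired through to the engine in a vLLM benchmark script (the default value, help text, and `--model` flag here are assumptions, not necessarily the exact code in this PR):

```python
import argparse

from vllm import LLM

parser = argparse.ArgumentParser()
# Plausible shape of the new flag; default and help text are assumptions.
parser.add_argument('--gpu-memory-utilization',
                    type=float,
                    default=0.9,
                    help='Fraction of GPU memory to reserve for the model '
                         'executor (forwarded to the LLM engine).')
parser.add_argument('--model', type=str, default='facebook/opt-125m')
args = parser.parse_args()

# The parsed value is then forwarded when constructing the engine,
# which is what lets larger models fit their KV cache.
llm = LLM(model=args.model,
          gpu_memory_utilization=args.gpu_memory_utilization)
```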
Wondering if we want to expose max-model-len in the args as well. We already do that in benchmark_throughput (https://sourcegraph.com/github.com/vllm-project/vllm/-/blob/benchmarks/benchmark_throughput.py?L301); should we add it here too, since that is another parameter to tune for this error?
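For reference, a sketch of what exposing that second knob could look like, mirroring the style of benchmark_throughput.py (exact help text is an assumption):

```python
import argparse

parser = argparse.ArgumentParser()
# Optional cap on the context length; None lets the engine derive it
# from the model config, as in benchmark_throughput.py.
parser.add_argument('--max-model-len',
                    type=int,
                    default=None,
                    help='Maximum sequence length the engine will support; '
                         'defaults to the value derived from the model config.')
```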
For larger models like meta-llama/Meta-Llama-3-70B, the benchmark fails with:

ValueError: The model's max seq len (8192) is larger than the maximum number of tokens that can be stored in KV cache (4688). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

Specifying `gpu_memory_utilization` is therefore required to run the benchmark successfully.
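To illustrate the two knobs named in the error message, a hedged example of constructing the engine directly (the `tensor_parallel_size` value is an assumption for a 70B model, not taken from this PR):

```python
from vllm import LLM

# Option 1: give the KV cache a larger share of GPU memory
# (the engine default is 0.9).
llm = LLM(model='meta-llama/Meta-Llama-3-70B',
          tensor_parallel_size=8,          # assumed multi-GPU setup
          gpu_memory_utilization=0.95)

# Option 2: cap the context length so it fits in the available KV cache.
llm = LLM(model='meta-llama/Meta-Llama-3-70B',
          tensor_parallel_size=8,
          max_model_len=4096)
```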