[Misc] add gpu_memory_utilization arg #5079
Conversation
Signed-off-by: pandyamarut <pandyamarut@gmail.com>
@@ -211,5 +212,11 @@ def run_to_completion(profile_dir: Optional[str] = None):
                        type=str,
                        default=None,
                        help='Path to save the latency results in JSON format.')
    parser.add_argument('--gpu-memory-utilization',
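The hunk above is truncated. As a minimal sketch of how such a flag is usually wired through to the engine in a vLLM benchmark script (the default value, help text, and `--model` flag here are assumptions, not necessarily the exact code in this PR):

```python
import argparse

from vllm import LLM

parser = argparse.ArgumentParser()
# Plausible shape of the new flag; default and help text are assumptions.
parser.add_argument('--gpu-memory-utilization',
                    type=float,
                    default=0.9,
                    help='Fraction of GPU memory to reserve for the model '
                         'executor (forwarded to the LLM engine).')
parser.add_argument('--model', type=str, default='facebook/opt-125m')
args = parser.parse_args()

# The parsed value is then forwarded when constructing the engine,
# which is what lets larger models fit their KV cache.
llm = LLM(model=args.model,
          gpu_memory_utilization=args.gpu_memory_utilization)
```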
Wondering if we want to expose max-model-len in the args as well. We already do that in benchmark_throughput (https://sourcegraph.com/github.com/vllm-project/vllm/-/blob/benchmarks/benchmark_throughput.py?L301); should we add it here too, since that is another parameter to tune for this error?
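For reference, a sketch of what exposing that second knob could look like, mirroring the style of benchmark_throughput.py (exact help text is an assumption):

```python
import argparse

parser = argparse.ArgumentParser()
# Optional cap on the context length; None lets the engine derive it
# from the model config, as in benchmark_throughput.py.
parser.add_argument('--max-model-len',
                    type=int,
                    default=None,
                    help='Maximum sequence length the engine will support; '
                         'defaults to the value derived from the model config.')
```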
For larger models like meta-llama/Meta-Llama-3-70B, the benchmark fails with:

ValueError: The model's max seq len (8192) is larger than the maximum number of tokens that can be stored in KV cache (4688). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

Specifying `gpu_memory_utilization` is therefore required to run the benchmark successfully.
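To illustrate the two knobs named in the error message, a hedged example of constructing the engine directly (the `tensor_parallel_size` value is an assumption for a 70B model, not taken from this PR):

```python
from vllm import LLM

# Option 1: give the KV cache a larger share of GPU memory
# (the engine default is 0.9).
llm = LLM(model='meta-llama/Meta-Llama-3-70B',
          tensor_parallel_size=8,          # assumed multi-GPU setup
          gpu_memory_utilization=0.95)

# Option 2: cap the context length so it fits in the available KV cache.
llm = LLM(model='meta-llama/Meta-Llama-3-70B',
          tensor_parallel_size=8,
          max_model_len=4096)
```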