[Misc]: Enable memory usage logging for vLLM GPU worker #17122
Conversation
Signed-off-by: datta0 <venkatadattasainimmaturi@gmail.com>
@markmc yeah, I pretty much ported over what was being done in v0 to maintain consistency (with v0 master, to be precise).
vllm/v1/worker/gpu_worker.py
Outdated
torch_allocated_bytes = torch.cuda.memory_stats(
)["allocated_bytes.all.current"]
total_allocated_bytes = torch.cuda.mem_get_info(
)[1] - torch.cuda.mem_get_info()[0]
non_torch_allocations = total_allocated_bytes - torch_allocated_bytes
if non_torch_allocations > 0:
    peak_memory += non_torch_allocations
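For readers without the surrounding diff context, the hunk above amounts to the following self-contained sketch; this is not the exact vLLM code, and the function name and the `peak_torch_memory` parameter are illustrative:

```python
import torch

def add_non_torch_allocations(peak_torch_memory: int) -> int:
    """Return the peak memory including usage outside the torch allocator.

    `peak_torch_memory` stands in for the peak torch-allocated bytes
    observed during profiling (hypothetical parameter, not vLLM's name).
    """
    # Bytes currently held by the torch caching allocator.
    torch_allocated_bytes = torch.cuda.memory_stats()[
        "allocated_bytes.all.current"]
    # mem_get_info() returns (free, total) for the device, so
    # total - free is everything in use, torch-managed or not.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    total_allocated_bytes = total_bytes - free_bytes
    # Memory the device reports as used but torch did not allocate
    # (CUDA context, NCCL buffers, cuDNN workspaces, ...).
    non_torch_allocations = total_allocated_bytes - torch_allocated_bytes
    if non_torch_allocations > 0:
        peak_torch_memory += non_torch_allocations
    return peak_torch_memory
```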
These might still be needed for accuracy? cc @WoosukKwon @youkaichao @ywang96 who might be familiar with this part.
I have added them back for the current calculations.
Thanks for adding this, very much needed!
vllm/v1/worker/gpu_worker.py
Outdated
total_allocated_bytes = torch.cuda.mem_get_info(
)[1] - torch.cuda.mem_get_info()[0]
non_torch_allocations = total_allocated_bytes - torch_allocated_bytes
if non_torch_allocations > 0:
    peak_memory += non_torch_allocations
available_kv_cache_memory = (
    total_gpu_memory * self.cache_config.gpu_memory_utilization -
    peak_memory)
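Put differently, the KV-cache budget is the utilization-capped device memory minus the profiled peak. A minimal sketch with made-up numbers (the helper name and the figures are illustrative, not taken from the PR):

```python
GiB = 1 << 30

def kv_cache_budget(total_gpu_memory: int,
                    gpu_memory_utilization: float,
                    peak_memory: int) -> int:
    """Memory left for the KV cache after reserving the profiled peak."""
    return int(total_gpu_memory * gpu_memory_utilization) - peak_memory

# e.g. a 48 GiB card at 0.9 utilization with an 18 GiB profiled peak
# leaves roughly 25.2 GiB for the KV cache.
print(kv_cache_budget(48 * GiB, 0.9, 18 * GiB) / GiB)
```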
I think peak_memory is now missing the non_torch_allocation factor.
If the bit pointed out by Simon gets resolved, we could use result.non_torch_increase instead.
I was just trying to match v0 as much as possible.
If we want to include the non-torch increase in peak memory, I can do that. Should we then change v0 as well?
I wouldn't touch v0 until we hear from the long-term maintainers.
The sensitive bit here is available_kv_cache_memory, which can have unintended consequences (e.g. we're no longer adding non_torch_allocation to peak memory).
Rather than mimicking v0, I would focus on showing that the available_kv_cache_memory value does not change pre- and post-PR.
Once that is checked, we should be able to land the logging fairly easily.
Thanks for pointing that out. I modified the code to match V1's previous numbers.
I verified the following models at 2-3 different --max-model-len values (one below and one above the chunking threshold):
- meta-llama/Llama-3.1-8B-Instruct
- meta-llama/Llama-3.2-1B-Instruct
- Qwen/Qwen3-0.6B
- mistralai/Mistral-7B-Instruct-v0.3
- google/gemma-3-12b-it
In all cases, available_kv_cache_memory matches up after the latest commit.
This pull request has merge conflicts that must be resolved before it can be merged.
Could you show that the value of available_kv_cache_memory does not change pre- and post-PR for different models, if you find the time?
I feel that once that is asserted, we could land the logging bit fairly easily.
Don't worry too much about consistency with V0.
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
@NickLucche @simon-mo can you please help by reviewing this :)
Nice one! I think with the current changes we're now sure we're maintaining the same available_kv_cache_memory value.
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
vllm/v1/worker/gpu_worker.py
Outdated
# Note that `result.non_torch_increase` is not the same as
# `non_torch_allocations`. `result.non_torch_increase` doesn't
# include the usage before `baseline_snapshot` or that of
# torch initialisation.
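To make the distinction in that comment concrete, here is a conceptual sketch; it is not vLLM's actual profiling code, and the helper name is illustrative:

```python
import torch

def non_torch_bytes_now() -> int:
    """Absolute non-torch usage on the device at the time of the call
    (the quantity the quoted comment calls `non_torch_allocations`)."""
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    used_bytes = total_bytes - free_bytes
    torch_bytes = torch.cuda.memory_stats()["allocated_bytes.all.current"]
    return used_bytes - torch_bytes

# A baseline-relative measurement (roughly what `result.non_torch_increase`
# reports) only counts growth after a snapshot, so anything already resident
# before the baseline (e.g. the CUDA context created during torch init)
# is excluded.
if torch.cuda.is_available():
    baseline = non_torch_bytes_now()      # taken before model loading
    # ... load the model, run the profiling forward pass ...
    non_torch_increase = non_torch_bytes_now() - baseline
```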
There's a fix in #18296, can you take a look?
I just checked the implementation and ran it on my machine. I found that there is a 0.5 GiB difference in available_kv_cache_memory between the base of the PR and the changes made in the PR.
From my previous testing, this is due to the difference between result.non_torch_increase and the above-mentioned non_torch_allocations.
Hey @youkaichao any thoughts on this?
…mory_logging Signed-off-by: datta0 <datta.nimmaturi@nutanix.com>
Signed-off-by: datta0 <datta.nimmaturi@nutanix.com>
Hey @ProExpertProg, thanks for mentioning it. I think with your changes and the ones you mentioned, this PR is not needed.
Sorry about that @Datta0!
You could've at least added @Datta0 as a co-author; this PR has been waiting for weeks.
Guys, it's alright :)
Enable memory usage logging for vLLM GPU worker.
On the v1 engine, I noticed that memory stats are not logged, so I referred to v0's worker.py and followed a similar approach to enable memory logging on v1.
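As a rough illustration of the kind of breakdown such logging can report (the function, logger name, and message format below are illustrative and do not reproduce vLLM's actual log line):

```python
import logging

logger = logging.getLogger("memory_profiling_example")
GiB = float(1 << 30)

def log_memory_breakdown(total_gpu_memory: int,
                         peak_torch_memory: int,
                         non_torch_allocations: int,
                         available_kv_cache_memory: int) -> None:
    """Log a GiB-formatted summary of the profiling results (illustrative)."""
    logger.info(
        "Memory profiling: total=%.2f GiB, torch peak=%.2f GiB, "
        "non-torch=%.2f GiB, available for KV cache=%.2f GiB",
        total_gpu_memory / GiB, peak_torch_memory / GiB,
        non_torch_allocations / GiB, available_kv_cache_memory / GiB)
```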
Samples: meta-llama/Llama-3.1-8B-Instruct on 1x L40S GPU.
Before this change: (log output not captured in this excerpt)
Behaviour (after this change): (log output not captured in this excerpt)
Note:
On non_torch_allocations, there seems to be a difference between result.non_torch_increase and the previous v1 computation non_torch_allocations = total_allocated_bytes - torch_allocated_bytes. To keep the current behaviour intact, as suggested by @NickLucche, I have used non_torch_allocations. I have tested the following configs to check whether the currently reported available_kv_cache_memory matches v1's previously reported available_kv_cache_memory (before the changes), and it does match.
Configs Tested: