[Misc]: Enable memory usage logging for vLLM GPU worker #17122

Closed
Datta0 wants to merge 8 commits

Conversation

@Datta0 Datta0 commented Apr 24, 2025

Enable memory usage logging for vLLM GPU worker.

On the v1 engine, I noticed that memory stats are not logged, so I referred to v0's worker.py and followed a similar approach to enable memory logging on v1.

Samples: meta-llama/Llama-3.1-8B-Instruct on 1xL40S GPU.

Before this change:

INFO 04-24 16:32:59 [loader.py:458] Loading weights took 2.95 seconds
INFO 04-24 16:33:00 [gpu_model_runner.py:1316] Model loading took 14.9889 GiB and 3.652170 seconds
INFO 04-24 16:33:06 [backends.py:420] Using cache directory: /home/datta0/.cache/vllm/torch_compile_cache/0dd1850a5d/rank_0_0 for vLLM's torch.compile
INFO 04-24 16:33:06 [backends.py:430] Dynamo bytecode transform time: 6.41 s
INFO 04-24 16:33:11 [backends.py:118] Directly load the compiled graph(s) for shape None from the cache, took 4.673 s
INFO 04-24 16:33:12 [monitor.py:33] torch.compile takes 6.41 s in total
INFO 04-24 16:33:13 [kv_cache_utils.py:634] GPU KV cache size: 158,656 tokens
INFO 04-24 16:33:13 [kv_cache_utils.py:637] Maximum concurrency for 32,768 tokens per request: 4.84x

After this change:

INFO 04-24 16:32:03 [loader.py:458] Loading weights took 2.72 seconds
INFO 04-24 16:32:03 [gpu_model_runner.py:1316] Model loading took 14.9889 GiB and 3.561512 seconds
INFO 05-16 07:27:49 [backends.py:420] Using cache directory: /home/datta0/.cache/vllm/torch_compile_cache/16766e47a3/rank_0_0 for vLLM's torch.compile
INFO 05-16 07:27:49 [backends.py:430] Dynamo bytecode transform time: 7.19 s
INFO 05-16 07:27:55 [backends.py:118] Directly load the compiled graph(s) for shape None from the cache, took 4.822 s
INFO 05-16 07:27:56 [monitor.py:33] torch.compile takes 7.19 s in total
INFO 05-16 07:27:57 [gpu_worker.py:247] Memory profiling takes 15.12 seconds
INFO 05-16 07:27:57 [gpu_worker.py:247] the current vLLM instance can use total_gpu_memory (44.53GiB) x gpu_memory_utilization (0.90) = 40.07GiB
INFO 05-16 07:27:57 [gpu_worker.py:247] model weights take 14.99GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 1.17GiB; the rest of the memory reserved for KV Cache is 23.28GiB.

Note:

Regarding non-torch allocations: there is a difference between result.non_torch_increase and v1's previous non_torch_allocations = total_allocated_bytes - torch_allocated_bytes.
To keep the current behaviour intact, as suggested by @NickLucche, I have used non_torch_allocations.
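
(For reference, a rough sketch of how the two quantities are obtained; it mirrors the snippet under review further down rather than the exact vLLM code, and assumes a CUDA device is available.)

```python
import torch

# v1's existing accounting: everything CUDA reports as used on the device,
# minus what the torch caching allocator currently has allocated.
free_bytes, total_bytes = torch.cuda.mem_get_info()
torch_allocated_bytes = torch.cuda.memory_stats()["allocated_bytes.all.current"]
non_torch_allocations = (total_bytes - free_bytes) - torch_allocated_bytes

# `result.non_torch_increase` from the memory-profiling helper is measured
# relative to a baseline snapshot instead, so it excludes non-torch memory
# that was already resident before the snapshot (e.g. CUDA context / torch
# initialisation), which is why the two numbers can differ.
```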

To check this, I tested the following configs and confirmed that the reported available_kv_cache_memory matches the value v1 reported before these changes.

Configs Tested:

  • meta-llama/Llama-3.1-8B-Instruct (8K and 128K)
  • meta-llama/Llama-3.2-1B-Instruct (8K and full 128K)
  • Qwen/Qwen3-0.6B
  • mistralai/Mistral-7B-Instruct-v0.3
  • google/gemma-3-12b-it

Signed-off-by: datta0 <venkatadattasainimmaturi@gmail.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Apr 24, 2025
@simon-mo simon-mo requested a review from youkaichao April 24, 2025 16:48
@markmc
Member

markmc commented Apr 25, 2025

For reference, the V0 log message was first added by #9352 and then overhauled in #10511. At a glance it looks like this PR is faithfully porting over the latter version. HTH.

@Datta0 Datta0 marked this pull request as ready for review April 28, 2025 01:51
@Datta0
Author

Datta0 commented Apr 28, 2025

@markmc yeah I pretty much ported over what was being done in v0 to maintain consistency (with v0 master to be precise).

Comment on lines 196 to 209
torch_allocated_bytes = torch.cuda.memory_stats()["allocated_bytes.all.current"]
total_allocated_bytes = torch.cuda.mem_get_info()[1] - torch.cuda.mem_get_info()[0]
non_torch_allocations = total_allocated_bytes - torch_allocated_bytes
if non_torch_allocations > 0:
    peak_memory += non_torch_allocations
Collaborator

These might still be needed for accuracy? cc @WoosukKwon @youkaichao @ywang96 who might be familiar with this part.

Author

I have added them back for the current calculations.

Contributor

@NickLucche NickLucche left a comment

Thanks for adding this, very much needed!

total_allocated_bytes = torch.cuda.mem_get_info()[1] - torch.cuda.mem_get_info()[0]
non_torch_allocations = total_allocated_bytes - torch_allocated_bytes
if non_torch_allocations > 0:
    peak_memory += non_torch_allocations
available_kv_cache_memory = (
    total_gpu_memory * self.cache_config.gpu_memory_utilization -
    peak_memory)
Contributor

I think peak_memory is now missing the non_torch_allocations factor.
If the bit pointed out by simon is resolved, we could use result.non_torch_increase instead.

Author

I was just trying to match v0 as much as possible.
If we want to include the non-torch increase in peak_memory, I can do that. Should we then change v0 as well?

Contributor

I wouldn't touch v0 until we hear from long-term maintainers.
The sensitive bit here is available_kv_cache_memory, which can have unintended consequences (e.g. we're not adding non_torch_allocations to peak memory).

Rather than mimicking v0, I would focus on showing that the available_kv_cache_memory value does not change pre and post PR.
Once that is checked, we should be able to land the logging fairly easily.
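
For example, one low-tech way to check this (just a sketch, not part of the PR; the helper names are made up) is to compare the "GPU KV cache size" line that kv_cache_utils.py already logs between a pre-PR and a post-PR run, since that token count is derived from available_kv_cache_memory:

```python
import re

# Hypothetical helper: pull the token count out of the existing
# "GPU KV cache size: N tokens" log line and compare two runs.
KV_CACHE_RE = re.compile(r"GPU KV cache size: ([\d,]+) tokens")

def kv_cache_tokens(log_text: str) -> int:
    match = KV_CACHE_RE.search(log_text)
    if match is None:
        raise ValueError("KV cache size line not found in log")
    return int(match.group(1).replace(",", ""))

def assert_kv_cache_unchanged(pre_log: str, post_log: str) -> None:
    pre, post = kv_cache_tokens(pre_log), kv_cache_tokens(post_log)
    assert pre == post, f"KV cache size changed: {pre} -> {post} tokens"
```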

Author

Thanks for pointing that out. I modified the code to match v1's previous numbers.
I verified the following models at 2-3 different --max-model-len values (one below and one above the chunking threshold):

  • meta-llama/Llama-3.1-8B-Instruct
  • meta-llama/Llama-3.2-1B-Instruct
  • Qwen/Qwen3-0.6B
  • mistralai/Mistral-7B-Instruct-v0.3
  • google/gemma-3-12b-it

For all cases, available_kv_cache_memory matches after the latest commit.


mergify bot commented May 12, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @Datta0.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 12, 2025
Contributor

@NickLucche NickLucche left a comment

Could you show that the value of available_kv_cache_memory does not change pre and post PR for different models, if you find the time?
I feel once that is asserted, we could land the logging bit fairly easily.

Don't worry too much about consistency with V0.


Datta0 added 2 commits May 16, 2025 07:22
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
@mergify mergify bot removed the needs-rebase label May 16, 2025
@Datta0
Author

Datta0 commented May 22, 2025

@NickLucche @simon-mo can you please help by reviewing this :)

Contributor

@NickLucche NickLucche left a comment

Nice one! I think with the current changes we're now sure we're maintaining the same available_kv_cache_memory value.

Datta0 and others added 2 commits May 22, 2025 08:46
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
Comment on lines 213 to 216
# Note that `result.non_torch_increase` is not the same as
# `non_torch_allocations`. `result.non_torch_increase` doesn't
# include the usage before `baseline_snapshot` or that of
# torch initialisation
Member

There's a fix in #18296, can you take a look?

Author

@Datta0 Datta0 May 23, 2025

I just checked the implementation and ran it on my machine. I found a 0.5 GiB difference in available_kv_cache_memory between the base of the PR and the changes made in the PR.
From my previous testing, this is due to the difference between result.non_torch_increase and the above-mentioned non_torch_allocations.
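
To illustrate where such a gap can come from (a minimal sketch, not vLLM's actual profiling code; assumes a CUDA device), a baseline-relative counter only sees growth that happens after the snapshot:

```python
import torch

def current_non_torch_bytes() -> int:
    # Device memory in use that the torch allocator does not account for.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    torch_allocated = torch.cuda.memory_stats()["allocated_bytes.all.current"]
    return (total_bytes - free_bytes) - torch_allocated

# Baseline snapshot, taken after CUDA context / torch initialisation.
baseline_non_torch = current_non_torch_bytes()

# ... model loading and the profile run would happen here ...

non_torch_increase = current_non_torch_bytes() - baseline_non_torch  # baseline-relative
non_torch_allocations = current_non_torch_bytes()                    # absolute, as in the old v1 code

# Whatever was already resident before the baseline shows up in
# `non_torch_allocations` but not in `non_torch_increase`, so the two
# can differ by a few hundred MiB.
```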

Author

Hey @youkaichao any thoughts on this?

@ProExpertProg
Collaborator

@Datta0 #18974 landed and #19312 is adding more cleanup - should we close this PR and add any additional improvements into #19312?

Signed-off-by: datta0 <datta.nimmaturi@nutanix.com>
@Datta0
Author

Datta0 commented Jun 9, 2025

Hey @ProExpertProg, thanks for the mention. I think with your changes and the ones you mentioned, this PR is not needed.
Will close this shortly.

@Datta0 Datta0 closed this Jun 9, 2025
@simon-mo
Collaborator

Sorry about that, @Datta0!

@NickLucche
Contributor

You could've at least added @Datta0 as co-author; this PR has been waiting for weeks.

@Datta0
Author

Datta0 commented Jun 10, 2025

Guys, it's alright :)
I'm happy as long as the functionality is there.
Also, thanks for all the reviews!
