[Misc]: Enable memory usage logging for vLLM GPU worker #17122

Closed
Datta0 wants to merge 8 commits

Conversation

@Datta0 Datta0 commented Apr 24, 2025

Enable memory usage logging for vLLM GPU worker.

On the v1 engine, I noticed that memory stats are not logged, so I referred to v0's worker.py and followed a similar approach to enable memory logging on v1.

Samples: meta-llama/Llama-3.1-8B-Instruct on 1xL40S GPU.

Before this change:

INFO 04-24 16:32:59 [loader.py:458] Loading weights took 2.95 seconds
INFO 04-24 16:33:00 [gpu_model_runner.py:1316] Model loading took 14.9889 GiB and 3.652170 seconds
INFO 04-24 16:33:06 [backends.py:420] Using cache directory: /home/datta0/.cache/vllm/torch_compile_cache/0dd1850a5d/rank_0_0 for vLLM's torch.compile
INFO 04-24 16:33:06 [backends.py:430] Dynamo bytecode transform time: 6.41 s
INFO 04-24 16:33:11 [backends.py:118] Directly load the compiled graph(s) for shape None from the cache, took 4.673 s
INFO 04-24 16:33:12 [monitor.py:33] torch.compile takes 6.41 s in total
INFO 04-24 16:33:13 [kv_cache_utils.py:634] GPU KV cache size: 158,656 tokens
INFO 04-24 16:33:13 [kv_cache_utils.py:637] Maximum concurrency for 32,768 tokens per request: 4.84x

After this change:

INFO 04-24 16:32:03 [loader.py:458] Loading weights took 2.72 seconds
INFO 04-24 16:32:03 [gpu_model_runner.py:1316] Model loading took 14.9889 GiB and 3.561512 seconds
INFO 05-16 07:27:49 [backends.py:420] Using cache directory: /home/datta0/.cache/vllm/torch_compile_cache/16766e47a3/rank_0_0 for vLLM's torch.compile
INFO 05-16 07:27:49 [backends.py:430] Dynamo bytecode transform time: 7.19 s
INFO 05-16 07:27:55 [backends.py:118] Directly load the compiled graph(s) for shape None from the cache, took 4.822 s
INFO 05-16 07:27:56 [monitor.py:33] torch.compile takes 7.19 s in total
INFO 05-16 07:27:57 [gpu_worker.py:247] Memory profiling takes 15.12 seconds
INFO 05-16 07:27:57 [gpu_worker.py:247] the current vLLM instance can use total_gpu_memory (44.53GiB) x gpu_memory_utilization (0.90) = 40.07GiB
INFO 05-16 07:27:57 [gpu_worker.py:247] model weights take 14.99GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 1.17GiB; the rest of the memory reserved for KV Cache is 23.28GiB.

Note:

Regarding non-torch allocations: there is a difference between result.non_torch_increase and v1's previous non_torch_allocations = total_allocated_bytes - torch_allocated_bytes.
To keep the current behaviour intact, as suggested by @NickLucche, I have used non_torch_allocations.
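
(For reference, a rough sketch of how the two quantities are obtained; it mirrors the snippet under review further down rather than the exact vLLM code, and assumes a CUDA device is available.)

```python
import torch

# v1's existing accounting: everything CUDA reports as used on the device,
# minus what the torch caching allocator currently has allocated.
free_bytes, total_bytes = torch.cuda.mem_get_info()
torch_allocated_bytes = torch.cuda.memory_stats()["allocated_bytes.all.current"]
non_torch_allocations = (total_bytes - free_bytes) - torch_allocated_bytes

# `result.non_torch_increase` from the memory-profiling helper is measured
# relative to a baseline snapshot instead, so it excludes non-torch memory
# that was already resident before the snapshot (e.g. CUDA context / torch
# initialisation), which is why the two numbers can differ.
```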

To check this, I tested the following configs and confirmed that the reported available_kv_cache_memory matches the value v1 reported before these changes.

Configs Tested:

  • meta-llama/Llama-3.1-8B-Instruct (8K and 128K)
  • meta-llama/Llama-3.2-1B-Instruct (8K and full 128K)
  • Qwen/Qwen3-0.6B
  • mistralai/Mistral-7B-Instruct-v0.3
  • google/gemma-3-12b-it

Signed-off-by: datta0 <venkatadattasainimmaturi@gmail.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Apr 24, 2025
@simon-mo simon-mo requested a review from youkaichao April 24, 2025 16:48
@markmc
Member

markmc commented Apr 25, 2025

For reference, the V0 log message was first added by #9352 and then overhauled in #10511. At a glance it looks like this PR is faithfully porting over the latter version. HTH.

@Datta0 Datta0 marked this pull request as ready for review April 28, 2025 01:51
@Datta0
Author

Datta0 commented Apr 28, 2025

@markmc yeah I pretty much ported over what was being done in v0 to maintain consistency (with v0 master to be precise).

Comment on lines 196 to 209
torch_allocated_bytes = torch.cuda.memory_stats()["allocated_bytes.all.current"]
total_allocated_bytes = torch.cuda.mem_get_info()[1] - torch.cuda.mem_get_info()[0]
non_torch_allocations = total_allocated_bytes - torch_allocated_bytes
if non_torch_allocations > 0:
    peak_memory += non_torch_allocations
Collaborator

These might still be needed for accuracy? cc @WoosukKwon @youkaichao @ywang96 who might be familiar with this part.

Author

I have added them back for the current calculations.

Contributor

@NickLucche NickLucche left a comment

Thanks for adding this, very much needed!

total_allocated_bytes = torch.cuda.mem_get_info()[1] - torch.cuda.mem_get_info()[0]
non_torch_allocations = total_allocated_bytes - torch_allocated_bytes
if non_torch_allocations > 0:
    peak_memory += non_torch_allocations
available_kv_cache_memory = (
    total_gpu_memory * self.cache_config.gpu_memory_utilization -
    peak_memory)
Contributor

I think peak_memory is now missing the non_torch_allocations factor.
If the bit pointed out by simon is resolved, we could use result.non_torch_increase instead.

Author

I was just trying to match v0 as much as possible.
If we want to include the non-torch increase in peak_memory, I can do that. Should we then change v0 as well?

Contributor

I wouldn't touch v0 until we hear from long-term maintainers.
The sensitive bit here is available_kv_cache_memory, which can have unintended consequences (e.g. we're not adding non_torch_allocations to peak memory).

Rather than mimicking v0, I would focus on showing that the available_kv_cache_memory value does not change pre and post PR.
Once that is checked, we should be able to land the logging fairly easily.
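
For example, one low-tech way to check this (just a sketch, not part of the PR; the helper names are made up) is to compare the "GPU KV cache size" line that kv_cache_utils.py already logs between a pre-PR and a post-PR run, since that token count is derived from available_kv_cache_memory:

```python
import re

# Hypothetical helper: pull the token count out of the existing
# "GPU KV cache size: N tokens" log line and compare two runs.
KV_CACHE_RE = re.compile(r"GPU KV cache size: ([\d,]+) tokens")

def kv_cache_tokens(log_text: str) -> int:
    match = KV_CACHE_RE.search(log_text)
    if match is None:
        raise ValueError("KV cache size line not found in log")
    return int(match.group(1).replace(",", ""))

def assert_kv_cache_unchanged(pre_log: str, post_log: str) -> None:
    pre, post = kv_cache_tokens(pre_log), kv_cache_tokens(post_log)
    assert pre == post, f"KV cache size changed: {pre} -> {post} tokens"
```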

Author

Thanks for pointing that out. I modified the code to match v1's previous numbers.
I verified the following models at 2-3 different --max-model-len values (one below and one above the chunking threshold):

  • meta-llama/Llama-3.1-8B-Instruct
  • meta-llama/Llama-3.2-1B-Instruct
  • Qwen/Qwen3-0.6B
  • mistralai/Mistral-7B-Instruct-v0.3
  • google/gemma-3-12b-it

For all cases, available_kv_cache_memory matches after the latest commit.


mergify bot commented May 12, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @Datta0.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 12, 2025
Contributor

@NickLucche NickLucche left a comment

Could you show that the value of available_kv_cache_memory does not change pre and post PR for different models, if you find the time?
I feel once that is asserted, we could land the logging bit fairly easily.

Don't worry too much about consistency with V0.


Datta0 added 2 commits May 16, 2025 07:22
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
@mergify mergify bot removed the needs-rebase label May 16, 2025
@Datta0
Author

Datta0 commented May 22, 2025

@NickLucche @simon-mo can you please help by reviewing this :)

Contributor

@NickLucche NickLucche left a comment

Nice one! I think with the current changes we're now sure we're maintaining the same available_kv_cache_memory value.

Datta0 and others added 2 commits May 22, 2025 08:46
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
Comment on lines 213 to 216
# Note that `result.non_torch_increase` is not the same as
# `non_torch_allocations`. `result.non_torch_increase` doesn't
# include the usage before `baseline_snapshot` or that of
# torch initialisation
Member

There's a fix in #18296, can you take a look?

Author

@Datta0 Datta0 May 23, 2025

I just checked the implementation and ran it on my machine. I found a 0.5 GiB difference in available_kv_cache_memory between the base of the PR and the changes made in the PR.
From my previous testing, this is due to the difference between result.non_torch_increase and the above-mentioned non_torch_allocations.
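
To illustrate where such a gap can come from (a minimal sketch, not vLLM's actual profiling code; assumes a CUDA device), a baseline-relative counter only sees growth that happens after the snapshot:

```python
import torch

def current_non_torch_bytes() -> int:
    # Device memory in use that the torch allocator does not account for.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    torch_allocated = torch.cuda.memory_stats()["allocated_bytes.all.current"]
    return (total_bytes - free_bytes) - torch_allocated

# Baseline snapshot, taken after CUDA context / torch initialisation.
baseline_non_torch = current_non_torch_bytes()

# ... model loading and the profile run would happen here ...

non_torch_increase = current_non_torch_bytes() - baseline_non_torch  # baseline-relative
non_torch_allocations = current_non_torch_bytes()                    # absolute, as in the old v1 code

# Whatever was already resident before the baseline shows up in
# `non_torch_allocations` but not in `non_torch_increase`, so the two
# can differ by a few hundred MiB.
```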

Author

Hey @youkaichao any thoughts on this?

@ProExpertProg
Collaborator

@Datta0 #18974 landed and #19312 is adding more cleanup - should we close this PR and add any additional improvements into #19312?

Signed-off-by: datta0 <datta.nimmaturi@nutanix.com>
@Datta0
Author

Datta0 commented Jun 9, 2025

Hey @ProExpertProg, thanks for the mention. I think with your changes and the ones you mentioned, this PR is not needed.
Will close this shortly.

@Datta0 Datta0 closed this Jun 9, 2025
@simon-mo
Collaborator

Sorry about that, @Datta0!

@NickLucche
Contributor

You could've at least added @Datta0 as co-author; this PR has been waiting for weeks.

@Datta0
Author

Datta0 commented Jun 10, 2025

Guys, it's alright :)
I'm happy as long as the functionality is there.
Also, thanks for all the reviews!
