[V1][Core] Fix memory issue with logits & sampling #13721
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Maybe this PR helps fix the LoRA test failures.
LGTM!
@jeejeelee I turned on LoRA CI for this PR to see if it passes.
These modifications caused the V1 LoRA tests to fail.
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
This PR is a follow-up to #13594 (comment), which describes the memory issue during online serving that persists even after sampler profiling is added to `profile_run`. After some investigation, the root cause is memory fragmentation from the logits and other sampling-related tensors, since we don't preallocate buffers for them beforehand.

This memory issue can be reproduced by changing the temperature in the file below to a non-zero value so that the logits sampling code path is triggered:

vllm/benchmarks/backend_request_func.py (line 249 in eb24dc4)
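For reference, the change amounts to flipping the hard-coded sampling temperature in the benchmark's request payload to a non-zero value, roughly like the sketch below (field names are an assumption about the benchmark script, not an exact diff):

```python
# Sketch of the reproduction tweak in benchmarks/backend_request_func.py
# (approximate; field names are assumed, not copied from the file).
payload = {
    "model": request_func_input.model,
    "prompt": request_func_input.prompt,
    "temperature": 0.01,  # was 0.0; any non-zero value exercises the sampling path
    "max_tokens": request_func_input.output_len,
}
```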
Server command:
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B --disable-log-requests --no-enable-prefix-caching
Client command:
The graphs below track the memory usage of the server process on 1xH100. The timestamp starts when the server is ready to receive traffic, and it begins receiving traffic at around T=25.
On main, since `logits.shape[0]` goes up incrementally due to the nature of online traffic patterns, memory usage keeps growing as a result of fragmentation from the intermediate tensors derived from `logits` in `Sampler`. This issue will not crash the server, since PyTorch itself garbage-collects these cached buffers to prevent OOM (as seen in the dips in the graph), but it should be fixed and handled by vLLM for a few obvious reasons (e.g., releasing that memory requires synchronization).
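To make the fragmentation mechanism concrete, here is a minimal standalone illustration (not vLLM code, assuming a CUDA device): allocations whose size keeps growing cannot reuse the blocks cached for earlier, smaller sizes, so PyTorch's reserved memory keeps climbing even though only one batch of tensors is alive at a time.

```python
import torch

vocab_size = 128_256  # Llama-3.1 vocab size, only to make the numbers realistic
for num_tokens in range(8, 2049, 64):  # stands in for logits.shape[0] growing over time
    logits = torch.empty(num_tokens, vocab_size, device="cuda")
    probs = torch.softmax(logits, dim=-1)  # intermediate sampling tensors fragment further
    del logits, probs                      # freed, but the blocks stay cached
    print(f"{num_tokens:5d} tokens -> "
          f"{torch.cuda.memory_reserved() / 2**20:7.0f} MiB reserved")
```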
The root cause is that the sampler was not included in `compile_or_warm_up_model`, while `capture_model` implicitly calls `torch.cuda.empty_cache()`; therefore, even though the sampler's memory usage was accounted for in `profile_run`, the corresponding buffers were cleared again by that call.
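The interaction is easy to see in isolation. A tiny standalone sketch (not vLLM code) of why the capture step undoes the headroom established during profiling:

```python
import torch

# Stand-in for the sampler buffers allocated during profile_run.
x = torch.empty(1024, 128_256, device="cuda")
del x
print(torch.cuda.memory_reserved())  # still reserved: the caching allocator keeps the blocks

torch.cuda.empty_cache()             # what capture_model() does implicitly
print(torch.cuda.memory_reserved())  # reservation released back to the driver
```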

This PR addresses the issue by adding a `dummy_sampler_run` and calling it after the model forward pass itself has been warmed up and captured. We do not want to fold both into `_dummy_run`, since that method is needed elsewhere for other purposes.
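For context, the general shape of such a sampler warm-up looks roughly like the sketch below; the toy sampler and function signature are illustrative stand-ins, not the exact vLLM API.

```python
import torch

@torch.inference_mode()
def dummy_sampler_run(sampler, max_num_reqs: int, vocab_size: int,
                      device: torch.device) -> None:
    """Warm the sampler up once at the maximum batch size.

    `sampler` is any callable mapping a logits tensor to sampled token ids;
    this is a simplified stand-in for vLLM's Sampler, which also takes
    sampling metadata.
    """
    # Allocate max-size dummy logits *after* capture_model() has emptied the
    # CUDA cache, so the sampler's intermediates are reserved up front and
    # later, smaller real batches can reuse the cached blocks.
    dummy_logits = torch.zeros(max_num_reqs, vocab_size, device=device)
    sampler(dummy_logits)

# Example usage with a trivial temperature-sampling stand-in:
def toy_sampler(logits: torch.Tensor) -> torch.Tensor:
    probs = torch.softmax(logits / 0.7, dim=-1)
    return torch.multinomial(probs, num_samples=1)

dummy_sampler_run(toy_sampler, max_num_reqs=1024, vocab_size=128_256,
                  device=torch.device("cuda"))
```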
Memory usage is much more stable with this PR. The initial server memory usage increases from ~68K MiB to ~73K MiB (this accounts for all the sampling-related buffers that were included during profiling but cleared by the warmup), and it stays stable during actual inference. We do observe a very small bump when traffic starts, but IMO it is small enough to leave for a later investigation.
In addition, with the default GMU (gpu_memory_utilization) of 0.9, one might expect the initial server launch to take roughly 81559 * 0.9 = 73403 MiB plus whatever the CUDA graphs require (although, technically speaking, this is not how it works), so the memory usage with this PR's fix (~73714 MiB) should be acceptable. Without needing PyTorch to garbage-collect the cached buffers, we also observe a tiny perf improvement.
An alternative fix is to keep a persistent buffer for `logits`, but this may run into some practical issues, so we will leave it for future investigation as well; a rough sketch of the idea follows.
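For illustration only, a persistent-buffer approach (not what this PR implements) could look roughly like the following, with all names being placeholders:

```python
import torch

class PersistentLogitsBuffer:
    """Preallocates logits storage at the maximum size and hands out views."""

    def __init__(self, max_num_reqs: int, vocab_size: int,
                 device: torch.device, dtype: torch.dtype = torch.float32):
        # One allocation up front, sized for the worst case, so per-step
        # logits never go through the caching allocator again.
        self._buffer = torch.empty(max_num_reqs, vocab_size,
                                   device=device, dtype=dtype)

    def get(self, num_reqs: int) -> torch.Tensor:
        # A view into the preallocated storage for the current batch size.
        return self._buffer[:num_reqs]
```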