Use runtime profiling to replace manual memory analyzers #81


Status: Merged (16 commits into main on May 19, 2023)

Conversation

zhuohan123 (Member) commented May 7, 2023

Fix #59.

Previously we used a manual memory profiler, which had to be implemented separately for each model. This PR replaces it with a general memory profiler.
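
For context, the core idea of runtime profiling is roughly the following (a minimal sketch, not the actual vLLM code; `dummy_inputs` stands for a batch that saturates the scheduler limits and `cache_block_size` for the byte size of one KV-cache block, both hypothetical names here):

```python
import torch


def profile_num_gpu_blocks(model, dummy_inputs, gpu_memory_utilization,
                           cache_block_size):
    # Run one forward pass with a worst-case batch and record the peak
    # activation + weight memory, instead of estimating it per model by hand.
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    with torch.inference_mode():
        model(*dummy_inputs)
    peak_memory = torch.cuda.max_memory_allocated()

    # Hand whatever is left (under the utilization cap) to the KV cache.
    total_memory = torch.cuda.get_device_properties(0).total_memory
    free_for_cache = total_memory * gpu_memory_utilization - peak_memory
    return max(int(free_for_cache // cache_block_size), 0)
```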

zhuohan123 requested a review from WoosukKwon on May 9, 2023 at 04:11
zhuohan123 (Member, Author) commented:

@WoosukKwon This PR is ready for review.

WoosukKwon (Collaborator) left a comment


Thanks @zhuohan123 for this PR. Please check my comments.

WoosukKwon (Collaborator) commented:

Additionally, I've found that the PR changes the output of the examples in simple_server.py. We should reset the random generator states after the profiling run.

WoosukKwon (Collaborator) commented:

@zhuohan123 I will start merging the other PRs first, so that this PR does not block them.

zhuohan123 (Member, Author) commented May 13, 2023

@WoosukKwon I addressed all the review comments and merged with the latest master. Please review again, and feel free to merge.

One thing I noticed is this bug in the sampler. While fixing it, I discovered that these two sorts (link1, link2) take a lot of memory (4-5 GB) when sampling for 2560 tokens at the same time; in this case they are sorting a tensor of shape (2560, 51200). I'm not sure why these two sorts need that much memory, or whether there is a more memory-efficient way to implement top-p and top-k.

zhuohan123 (Member, Author) commented:

> Additionally, I've found that the PR changes the output of the examples in simple_server.py. We should reset the random generator states after the profiling run.

I haven't added it in the current PR, since it seems redundant to add an extra "set_random_seed" call to all the workers after profiling. It would require a nested Ray call and complicate the code, which I think would confuse future readers.

WoosukKwon (Collaborator) commented:

> I haven't added it in the current PR, since it seems redundant to add an extra "set_random_seed" call to all the workers after profiling. It would require a nested Ray call and complicate the code, which I think would confuse future readers.

What do you mean by "nested Ray call"? I didn't quite get that. The purpose of resetting the random state is to make sure that nothing that happens before serving starts affects the outputs. In that sense, we should reset the random states after the profiling run, though we don't need to reset them before it.
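
For reference, resetting the generator state after profiling would amount to something like the following (a sketch; the helper name `set_random_seed` mirrors the one mentioned above, but its exact signature and call site in the codebase may differ):

```python
import random

import numpy as np
import torch


def set_random_seed(seed: int) -> None:
    # Reset every RNG that sampling might touch, so that nothing executed
    # before serving starts (e.g. the profiling run) affects the outputs.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


# After the profiling forward pass, each worker would call set_random_seed(seed)
# before serving begins.
```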

WoosukKwon (Collaborator) commented:

> One thing I noticed is this bug in the sampler. While fixing it, I discovered that these two sorts (link1, link2) take a lot of memory (4-5 GB) when sampling for 2560 tokens at the same time; in this case they are sorting a tensor of shape (2560, 51200). I'm not sure why these two sorts need that much memory, or whether there is a more memory-efficient way to implement top-p and top-k.

  1. Thanks for finding the bug. As you did, -1 should be replaced with vocab_size.
  2. On the memory consumption, this seems to be a known issue: "Peak GPU-memory usage extremely huge when sorting with torch.sort" (pytorch/pytorch#77049). For now, I think we can take max_num_sequences (256 by default) into account when calculating the peak memory usage, which will reduce the memory consumption of sorting by roughly 10x (see the sketch below).
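
To illustrate the suggestion, here is a small measurement harness (illustrative only, not vLLM code; the shapes come from the numbers quoted in this thread):

```python
import torch


def peak_sort_memory_gb(num_seqs: int, vocab_size: int = 51200) -> float:
    # Measure the peak GPU memory used by the descending sort that the
    # top-p/top-k sampler performs over the vocab dimension.
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    logits = torch.randn(num_seqs, vocab_size, device="cuda")
    torch.sort(logits, dim=-1, descending=True)
    return torch.cuda.max_memory_allocated() / 1e9


# 2560 sequences: the worst case hit during the profiling run in this thread.
print(f"{peak_sort_memory_gb(2560):.2f} GB peak for 2560 sequences")
# 256 sequences: max_num_sequences, the cap suggested above (~10x smaller sort).
print(f"{peak_sort_memory_gb(256):.2f} GB peak for 256 sequences")
```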

WoosukKwon (Collaborator) left a comment


LGTM. Thanks!

zhuohan123 merged commit f756799 into main on May 19, 2023
zhuohan123 deleted the dynamic-memory-profiler branch on May 24, 2023
Linked issue: Profile memory usage (#59)