[V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE #17211
Conversation
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
Thanks for the PR. This is so cool!
@luyuzhe111 Thanks for the PR!
Left some comments. Please take a look!
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
@luyuzhe111 Thanks for addressing the comments! One last thing: Can you please add tests for this?
Hi @ekagra-ranjan, thanks for this PR! But I'm wondering if you could create a new PR later that makes sure the draft model takes in …
@WoosukKwon I feel like acceptance length tests are probably the most meaningful tests for this PR. I tested the acceptance length by cherry-picking commits from this PR on MT-Bench with max number of generated tokens = 256; the difference is less than 0.01.
I can do a follow-up PR adding acceptance length tests after #17010 is merged.
Thank you @luyuzhe111 for the PR!
This is fantastic!
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
Force-pushed from 0f3d045 to 85b3ade
Hey @ekagra-ranjan Re: …
LGTM. @youkaichao Any final comments?
@luyuzhe111 - here is an example setup for the benchmark. Looking forward to the results!
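For readers without access to the linked setup, here is a rough offline sketch of the kind of run being discussed; the model names, `speculative_config` keys, and token counts are assumptions, not the exact benchmark configuration used in this thread.

```python
# Hedged sketch of an offline EAGLE run on vLLM V1; names and config keys are
# illustrative and may differ across vLLM versions.
import os
import time

os.environ["VLLM_USE_V1"] = "1"  # opt into the V1 engine before importing vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
        "num_speculative_tokens": 2,
    },
)

prompts = ["Summarize the idea behind speculative decoding in two sentences."]
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start
print(f"{elapsed:.2f}s: {outputs[0].outputs[0].text[:80]!r}")
```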
@ekagra-ranjan I got results with the suggested setup; it looks like a further 10% speedup.
These checks lead to performance degradation, so vLLM decided to drop all of them. Some of the lower compilation levels (e.g. 1) should do the checking, but those also have slightly different behavior in other respects.
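If someone wants those guard checks back while debugging, selecting a lower compilation level is the knob to reach for. A hedged sketch (the exact level values and defaults can vary across vLLM versions):

```python
# Hedged sketch: explicitly selecting a compilation level. Level 1 roughly
# means "run torch.compile as-is" (keeping Dynamo's usual guard behavior),
# while higher levels trade those checks for performance.
from vllm import LLM
from vllm.config import CompilationConfig

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model choice
    compilation_config=CompilationConfig(level=1),
)
```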
```python
if compilation_counter.num_graphs_seen > 0:
    cache_dir = self.compilation_config.cache_dir + \
        f'-{compilation_counter.num_graphs_seen}'
else:
    cache_dir = self.compilation_config.cache_dir
os.makedirs(cache_dir, exist_ok=True)
self.compilation_config.cache_dir = cache_dir
```
@luyuzhe111 This is suspicious. Why do you need a different cache directory for each graph? Also, this looks like it modifies everything, even the models that don't use eagle.
If there isn't a good reason I would prefer going back to the "single cache directory" that we had previously.
@zou3519 thanks for reviewing! if there isn't a separate cache directory, the compiled code for the draft model (EAGLE) will not be saved at all. for models without EAGLE, my understanding is that the backend is invoked only once so this should not impact other models.
@luyuzhe111 thanks for the response and clarifying that. Woosuk also filled me in on some more details offline. I understand why we need a separate cache directory.
Which of the "original model" and the "eagle head" get compiled first? (I'm trying to figure out if the first cache dir is for the original model or for the eagle head)
@zou3519 original model should be compiled first! Also, if you wanna double check, the transformed code of EAGLE in the cache directory has a slightly different signature, with hidden_states as an additional arg. If there is a more elegant solution, that would be great! I think my approach is a bit hacky indeed : )))
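Roughly what that signature difference looks like (simplified stubs, not the actual vLLM code):

```python
# Simplified illustration: the two cached artifacts are distinguishable because
# the EAGLE head's forward takes the target model's hidden states as an extra input.
def target_model_forward(input_ids, positions):
    ...

def eagle_head_forward(input_ids, positions, hidden_states):
    ...
```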
Thanks for the discussion! I added some comments and an assertion into #17662 , please take a look.
I think in the future we'll want a better way to handle multiple compiled regions in a vLLM model, but that will take some re-designing
@luyuzhe111 the asserts in #17662 triggered, which means that this PR does affect non-eagle models
@zou3519 Thanks for the catch! I guess a simple fix would be just to create a separate cache directory only for EAGLE, by looking at the vllm speculative config, for example?
Yeah that would work
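A minimal sketch of that idea (the helper, its arguments, and the `-eagle` suffix are hypothetical, not the final change):

```python
import os


def select_cache_dir(base_cache_dir: str, num_graphs_seen: int,
                     has_speculative_config: bool) -> str:
    """Hypothetical helper: only the draft (EAGLE) graph gets its own cache
    directory; non-speculative models keep the original single directory."""
    cache_dir = base_cache_dir
    if has_speculative_config and num_graphs_seen > 0:
        # In an EAGLE run, the second compiled graph is the draft head.
        cache_dir += "-eagle"
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir
```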
…ect#17211) Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
I'm recording my understanding of how EAGLE and the compilation cache work after discussing vllm-project#17211 with @luyuzhe111 and @WoosukKwon. In the future we will likely have a situation where we want to torch.compile multiple pieces of code (e.g. decoder and encoder separately), and then we'll need to refactor the system to support it (each compiled region needs its own cache directory with its own hash). But until then the current design seems fine. Signed-off-by: rzou <zou3519@gmail.com>
…ect#17211) Signed-off-by: Bryan Lu <yuzhelu@amazon.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
…ect#17211) Signed-off-by: Bryan Lu <yuzhelu@amazon.com> Signed-off-by: Yuqi Zhang <yuqizhang@google.com>
```python
cache_dir = self.compilation_config.cache_dir
if compilation_counter.num_graphs_seen > 0:
    cache_dir = self.compilation_config.cache_dir + \
        f'-{compilation_counter.num_graphs_seen}'
```
I think we should be able to get the component's prefix to use as the cache directory; it could be more meaningful.
I'll make that change; @luyuzhe111 and I were discussing something similar above.
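A rough sketch of the prefix-based naming (the function and prefix handling are illustrative only):

```python
import os


def cache_dir_for_component(base_cache_dir: str, prefix: str) -> str:
    """Illustrative helper: derive a per-component cache directory from the
    module prefix (e.g. "" for the target model, "draft_model" for the EAGLE
    head), so each compiled region gets a stable, readable location."""
    component = prefix.replace(".", "_") if prefix else "model"
    cache_dir = os.path.join(base_cache_dir, component)
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir
```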
…ect#17211) Signed-off-by: Bryan Lu <yuzhelu@amazon.com> Signed-off-by: minpeter <kali2005611@gmail.com>
Task 8 of #15901
A few notes regarding the implementation.
Torch.compile

- The changes in `vllm/model_executor/models/llama_eagle.py` are mainly about making the EAGLE head satisfy torch.compile's requirements.
- Without the changes in `vllm/compilation/backends.py`, we wouldn't be able to cache EAGLE's compilation properly, since the EAGLE module was registered under the vllm config from the target model (see here).
- `tensor.argmax()` returns int64 by default. Feeding int64 input ids to the compiled model will completely mess things up and lead to gibberish draft tokens. Currently the compiled model does not even give a warning when the input data type mismatches; I wonder if we can prevent similar bugs in the future with some more checks (see the short sketch below).
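A tiny illustration of the pitfall described in the last note above (the buffer dtype is an assumption; the point is only the int64-vs-int32 mismatch):

```python
import torch

logits = torch.randn(4, 32000)

# torch.argmax returns int64 by default ...
draft_token_ids = logits.argmax(dim=-1)
assert draft_token_ids.dtype == torch.int64

# ... so cast explicitly before writing into an int32 input-ids buffer that a
# compiled / graph-captured forward pass reads from; feeding int64 silently
# changes what the captured kernels see.
input_ids_buffer = torch.zeros(4, dtype=torch.int32)
input_ids_buffer.copy_(draft_token_ids.to(torch.int32))
```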
CudaGraph

- Changes in `vllm/v1/spec_decode/eagle.py` and `vllm/v1/worker/gpu_model_runner.py` are mostly for CudaGraph. Nothing fancy other than registering additional persistent buffers and making sure to use them for EAGLE's forward pass (a sketch of this buffer pattern follows these notes). I do want to mention that with torch.compile & CudaGraph, the EAGLE model's forward pass has been drastically improved (2.5x faster), which makes the small but abundant torch operations look inefficient. Any advice to further optimize these overheads is greatly appreciated.

Finally, it would be great to have #17010 reviewed and merged so that we don't have to pull in other PRs to test the acceptance length.
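A hedged sketch of the persistent-buffer pattern mentioned above (requires a CUDA device; buffer names, sizes, and the toy forward function are illustrative, not vLLM's actual code):

```python
import torch

# The captured graph always reads the same tensors, so each step only copies
# new data into these fixed buffers and then replays the graph.
MAX_TOKENS, HIDDEN = 2048, 4096
input_ids = torch.zeros(MAX_TOKENS, dtype=torch.int32, device="cuda")
hidden_states = torch.zeros(MAX_TOKENS, HIDDEN, dtype=torch.float16, device="cuda")


def draft_forward():
    # Stand-in for the EAGLE head's forward; it must only touch the
    # persistent buffers above, never freshly allocated inputs.
    return hidden_states + input_ids.unsqueeze(-1).to(hidden_states.dtype)


# Warm up once outside the graph, then capture.
draft_forward()
torch.cuda.synchronize()
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    out = draft_forward()

# Per decoding step: refresh the buffers in place, replay, read the output.
num_tokens = 8
input_ids[:num_tokens].copy_(
    torch.randint(0, 32000, (num_tokens,), dtype=torch.int32, device="cuda"))
graph.replay()
draft_hidden = out[:num_tokens]
```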
Note that the current PR does not directly make torch.compile & cuda graph available for EAGLE-3. I think it's worth a separate PR, since the work is non-trivial because EAGLE-3's input hidden states have dynamic shapes. Maybe @benchislett could help.
@WoosukKwon @LiuXiaoxuanPKU