[V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE #17211
Changes from all commits: 644e254, 262356d, 92796e1, b6a5c3d, 7bea2d1, a49c3e4, 784d1b6, 85b3ade
@@ -347,8 +347,12 @@ def configure_post_pass(self):
         PASS_KEY = "post_grad_custom_post_pass"
         if PASS_KEY in inductor_config:
             # Config should automatically wrap all inductor passes
-            assert isinstance(inductor_config[PASS_KEY], InductorPass)
-            self.post_grad_pass_manager.add(inductor_config[PASS_KEY])
+            if isinstance(inductor_config[PASS_KEY], PostGradPassManager):
+                assert (inductor_config[PASS_KEY].uuid() ==
+                        self.post_grad_pass_manager.uuid())
+            else:
+                assert isinstance(inductor_config[PASS_KEY], InductorPass)
+                self.post_grad_pass_manager.add(inductor_config[PASS_KEY])
             inductor_config[PASS_KEY] = self.post_grad_pass_manager
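This hunk makes configure_post_pass safe to run more than once: when the backend is invoked a second time (for the EAGLE draft head), inductor_config already holds the PostGradPassManager that was installed on the first run, and the uuid comparison verifies it is the same manager instead of wrapping it again. A minimal, self-contained sketch of the pattern, where PassManager is a hypothetical stand-in for vLLM's PostGradPassManager rather than the real class:

class PassManager:
    def __init__(self):
        self.passes = []

    def uuid(self):
        # Stand-in for PostGradPassManager.uuid(); object identity is
        # enough here to detect "this is the manager we installed".
        return id(self)

    def add(self, p):
        self.passes.append(p)


def configure_post_pass(inductor_config: dict, manager: PassManager,
                        key: str = "post_grad_custom_post_pass") -> None:
    if key in inductor_config:
        existing = inductor_config[key]
        if isinstance(existing, PassManager):
            # Second invocation of the backend: the slot must already
            # hold the manager we installed on the first invocation.
            assert existing.uuid() == manager.uuid()
        else:
            # First invocation with a user-supplied pass: wrap it.
            manager.add(existing)
    inductor_config[key] = manager


# Invoking the backend twice with the same manager is now safe:
cfg: dict = {}
pm = PassManager()
configure_post_pass(cfg, pm)  # first compiled graph (target model)
configure_post_pass(cfg, pm)  # second compiled graph (EAGLE head)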
@@ -408,8 +412,13 @@ def __call__(self, graph: fx.GraphModule, example_inputs) -> Callable:
             )
             self.compilation_config.cache_dir = cache_dir

-        cache_dir = self.compilation_config.cache_dir
+        if compilation_counter.num_graphs_seen > 0:
+            cache_dir = self.compilation_config.cache_dir + \
+                f'-{compilation_counter.num_graphs_seen}'
+        else:
+            cache_dir = self.compilation_config.cache_dir
         os.makedirs(cache_dir, exist_ok=True)
         self.compilation_config.cache_dir = cache_dir

         rank = vllm_config.parallel_config.rank
         dp_rank = vllm_config.parallel_config.data_parallel_rank
         local_cache_dir = os.path.join(cache_dir, f"rank_{rank}_{dp_rank}")
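This hunk is what gives each compiled graph its own cache directory: the first graph (the target model) keeps the base path, and each later graph (here, the EAGLE head) gets a numeric suffix, with per-rank subdirectories underneath. A standalone sketch of the naming scheme, with compilation_counter.num_graphs_seen reduced to a plain integer argument:

import os


def per_graph_cache_dir(base_dir: str, num_graphs_seen: int,
                        rank: int, dp_rank: int) -> str:
    # The first graph keeps the base directory; later graphs get a
    # numeric suffix so their compiled artifacts are kept separately
    # instead of overwriting the first graph's cache.
    if num_graphs_seen > 0:
        cache_dir = f"{base_dir}-{num_graphs_seen}"
    else:
        cache_dir = base_dir
    os.makedirs(cache_dir, exist_ok=True)
    # Mirror the rank_{rank}_{dp_rank} layout from the diff above.
    return os.path.join(cache_dir, f"rank_{rank}_{dp_rank}")


print(per_graph_cache_dir("/tmp/vllm_compile_cache", 0, 0, 0))  # target model
print(per_graph_cache_dir("/tmp/vllm_compile_cache", 1, 0, 0))  # EAGLE head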
Comment on lines +415 to +421

zou3519: @luyuzhe111 This is suspicious. Why do you need a different cache directory for each graph? Also, this looks like it modifies everything, even the models that don't use EAGLE. If there isn't a good reason, I would prefer going back to the "single cache directory" that we had previously.

luyuzhe111: @zou3519 thanks for reviewing! If there isn't a separate cache directory, the compiled code for the draft model (EAGLE) will not be saved at all. For models without EAGLE, my understanding is that the backend is invoked only once, so this should not impact other models.

zou3519: @luyuzhe111 thanks for the response and for clarifying that. Woosuk also filled me in on some more details offline. I understand why we need a separate cache directory. Which of the "original model" and the "eagle head" gets compiled first? (I'm trying to figure out whether the first cache dir is for the original model or for the eagle head.)

luyuzhe111: @zou3519 the original model should be compiled first! Also, if you want to double-check, the transformed code of EAGLE in the cache directory has a slightly different signature with

zou3519: Thanks for the discussion! I added some comments and an assertion in #17662, please take a look. I think in the future we'll want a better way to handle multiple compiled regions in a vLLM model, but that will take some re-designing.

zou3519: @luyuzhe111 the asserts in #17662 triggered, which means that this PR does affect non-EAGLE models.

luyuzhe111: @zou3519 Thanks for the catch! I guess a simple fix would be to create a separate cache directory only for EAGLE, for example by looking at the vLLM speculative config?

zou3519: Yeah, that would work.
Reviewer: I think we should be able to get the component's prefix to use as the cache directory; it could be more meaningful.

zou3519: I'll make that change. @luyuzhe111 and I were discussing something similar above.
zou3519: #19027
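The prefix-based naming suggested in this thread would make the cache layout self-describing, since each compiled component already carries a prefix string. A hypothetical sketch of the idea; the actual change in #19027 may differ:

import os


def prefix_cache_dir(base_dir: str, prefix: str) -> str:
    # Hypothetical: derive the subdirectory from the component's
    # prefix (e.g. "model" for the target, "draft_model" for the
    # EAGLE head) instead of from a bare graph counter.
    safe = prefix.strip(".").replace(".", "_") or "model"
    return os.path.join(base_dir, safe)


print(prefix_cache_dir("/tmp/vllm_compile_cache", "model"))        # target
print(prefix_cache_dir("/tmp/vllm_compile_cache", "draft_model"))  # drafter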