[Bugfix] Fix TP inference for Flex attention backend #19657
Changes from all commits
```diff
@@ -84,6 +84,8 @@ def __init__(self,
         vllm_config.cache_config.num_gpu_blocks = num_gpu_blocks
         vllm_config.cache_config.num_cpu_blocks = num_cpu_blocks
+        self.collective_rpc("initialize_cache",
+                            args=(num_gpu_blocks, num_cpu_blocks))

         self.structured_output_manager = StructuredOutputManager(vllm_config)
```

Review thread on the added `collective_rpc("initialize_cache", ...)` call:

Reviewer: Wondering why only TP + FlexAttention needs this?

Author: Because FlexAttention reads the block counts from the worker's `cache_config`. Not sure if this is intended, but in V1 only the engine core's `cache_config` gets the updated `num_gpu_blocks` / `num_cpu_blocks`. Therefore, in distributed inference, the worker's `cache_config` never receives these values.

Reviewer: I see. Could you check whether we need to add a condition so this function is only called when tp > 1?

Author: For single-process execution we use the uniproc executor (see vllm/vllm/v1/executor/abstract.py, lines 44 to 48 at 91b2c17). Given that it also implements `collective_rpc` (vllm/vllm/executor/uniproc_executor.py, lines 50 to 58 at 91b2c17), the call is handled in the single-process case as well, so an extra tp > 1 condition shouldn't be needed. Have checked that this path still works. A sketch of this dispatch follows below.
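For reference, here is a minimal sketch of why no tp > 1 guard is required: in a single-process setup, `collective_rpc` can simply dispatch the named method to the local worker object, so issuing the RPC unconditionally is harmless. The class and attribute names below are illustrative, loosely modeled on the linked uniproc_executor.py snippet, and are not a verbatim copy of vLLM's code.

```python
from typing import Any, Callable, Optional, Union


class UniProcExecutorSketch:
    """Illustrative single-process executor: an "RPC" is just a local call."""

    def __init__(self, worker: Any) -> None:
        self.driver_worker = worker

    def collective_rpc(self,
                       method: Union[str, Callable],
                       timeout: Optional[float] = None,
                       args: tuple = (),
                       kwargs: Optional[dict] = None) -> list[Any]:
        kwargs = kwargs or {}
        # Resolve the target: either a bound method looked up by name, or a
        # callable invoked with the worker as its first argument.
        if isinstance(method, str):
            result = getattr(self.driver_worker, method)(*args, **kwargs)
        else:
            result = method(self.driver_worker, *args, **kwargs)
        # Return a list to match the shape returned by multi-worker executors.
        return [result]
```

With this shape, `collective_rpc("initialize_cache", args=(num_gpu_blocks, num_cpu_blocks))` updates the single local worker the same way it updates every worker under tensor parallelism.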
```diff
@@ -112,6 +112,11 @@ def wake_up(self, tags: Optional[list[str]] = None) -> None:
                     buffer.data.copy_(self._sleep_saved_buffers[name].data)
             self._sleep_saved_buffers = {}

+    def initialize_cache(self, num_gpu_blocks: int,
+                         num_cpu_blocks: int) -> None:
+        self.cache_config.num_gpu_blocks = num_gpu_blocks
+        self.cache_config.num_cpu_blocks = num_cpu_blocks
+
     def init_device(self):
         if self.device_config.device.type == "cuda":
             # torch.distributed.all_reduce does not free the input tensor until
```

Review thread on the added `initialize_cache` method:

Reviewer: This sounds more like "setting_cache_size" instead of initialize_cache?

Author: Hmmm, cache_config's block counts are already set through an `initialize_cache` method in V0 (vllm/vllm/worker/worker_base.py, lines 65 to 69 at 3d330c4; see also lines 312 to 325 at 3d330c4). Although this method was not used by V1 before this PR, I think reusing the method shared with V0 keeps the worker implementation consistent.
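To make the failure mode concrete, here is a hedged, self-contained sketch of a worker whose backend sizes a tensor from `cache_config.num_gpu_blocks`. The names `CacheConfigSketch`, `WorkerSketch`, and `build_physical_to_logical` are hypothetical, not vLLM's API; the point is only that if `initialize_cache` is never called on the worker, the count stays `None` and the backend fails, which is what happened for FlexAttention under TP > 1.

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class CacheConfigSketch:
    # Block counts start unset; the engine core fills them in after profiling.
    num_gpu_blocks: Optional[int] = None
    num_cpu_blocks: Optional[int] = None


class WorkerSketch:
    def __init__(self, cache_config: CacheConfigSketch) -> None:
        self.cache_config = cache_config

    def initialize_cache(self, num_gpu_blocks: int, num_cpu_blocks: int) -> None:
        # Mirrors the method added in the diff above: record the block counts
        # computed by the engine core so backend code on this worker can use them.
        self.cache_config.num_gpu_blocks = num_gpu_blocks
        self.cache_config.num_cpu_blocks = num_cpu_blocks

    def build_physical_to_logical(self, block_size: int) -> torch.Tensor:
        # A FlexAttention-style backend needs the total number of GPU blocks to
        # size a per-worker mapping tensor; an unset value fails immediately.
        num_blocks = self.cache_config.num_gpu_blocks
        if num_blocks is None:
            raise RuntimeError(
                "cache_config.num_gpu_blocks was never propagated to this worker")
        return torch.full((num_blocks * block_size,), -1, dtype=torch.int64)
```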
Review thread on the FIXME comment for `create_block_mask_compiled`:

Reviewer: The FIXME comment clearly explains the issue with `create_block_mask_compiled` when the tensor-parallel world size is greater than 1. To ensure this is addressed in the future, consider creating a GitHub issue to track the underlying CUDA error if one doesn't exist already. This would help in eventually enabling the compiled version universally.
Author: The full traceback of the illegal memory access error is attached as a collapsed log in the PR thread. @drisspg Any idea about this error?
Comment: cc @zou3519 for the torch.compile-related issue.
Reply: Hey, I actually just noticed this too; it was not the case until pretty recently. Going to create an issue and start tracking this.
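For context, a hedged sketch of the workaround the FIXME describes: wrap torch's `create_block_mask` with `torch.compile` for the common case, but fall back to the eager builder when the tensor-parallel world size is greater than 1, where the compiled path currently triggers the illegal memory access. The wrapper `build_block_mask` and the `tp_world_size` parameter are illustrative; only `create_block_mask` and `BlockMask` come from `torch.nn.attention.flex_attention`.

```python
import torch
from torch.nn.attention.flex_attention import BlockMask, create_block_mask

# Compiling the block-mask builder avoids materializing the full dense mask,
# but see the FIXME discussed above: it currently crashes under TP > 1.
create_block_mask_compiled = torch.compile(create_block_mask, fullgraph=True)


def build_block_mask(mask_mod, q_len: int, kv_len: int, device: str,
                     tp_world_size: int) -> BlockMask:
    # FIXME: the compiled variant hits a CUDA illegal memory access with
    # tensor parallelism > 1; use the eager builder there until the upstream
    # issue is resolved.
    builder = (create_block_mask
               if tp_world_size > 1 else create_block_mask_compiled)
    return builder(mask_mod, B=None, H=None, Q_LEN=q_len, KV_LEN=kv_len,
                   device=device)
```

The eager fallback is slower but correct; once the upstream compile issue is fixed, the guard can be dropped and the compiled builder used universally, as the review comment above suggests.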