Use NCCL instead of ray for control-plane communication to remove serialization overhead #2221
Conversation
def broadcast(input_, src=0):
    """Broadcast the input tensor."""
    world_size = torch.distributed.get_world_size
This is a function. You need to call it
😅 great catch. Will fix
Lucky catch. Browsed through the code in 10s.
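For reference, the fix is presumably just to call the function; a minimal sketch of what the corrected helper might look like (not necessarily the exact committed code):

```python
import torch

def broadcast(input_, src=0):
    """Broadcast the input tensor from the src rank to every rank."""
    world_size = torch.distributed.get_world_size()  # call the function, not reference it
    if world_size == 1:
        return input_
    torch.distributed.broadcast(input_, src=src)
    return input_
```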
Hello, when can this branch be merged into the main branch, and will it bring significant performance improvements?
Hi @Lvjinhong, please find the performance numbers in the description of this PR. This PR is waiting for review and will be merged soon.
Hi @zhuohan123, this PR doesn't work for me.
@esmeetu Sorry, this is a bug! Can you test the latest commit again?
@zhuohan123 Awesome! Thanks for addressing my comments. Let's merge this asap after addressing the remaining minor comments from @njhill!
@zhuohan123 Can you give me a few hours to look over this? Thanks!
vllm/engine/ray_utils.py (outdated)
placement_group_specs = ([{
    "GPU": 1,
    "node:__internal_head__": 0.01
}] + [{
    "GPU": 1
}] * (parallel_config.world_size - 1))  # previously: * parallel_config.world_size
current_placement_group = ray.util.placement_group(
    placement_group_specs)
@Yard1 In this PR, we assume the placement group will have one bundle that includes "node:__internal_head__". This is to reserve a GPU for the driver worker. This may conflict with some of the existing logic using placement groups.
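A minimal sketch of the scheme described above, assuming a single-node setup and a hypothetical world size of 4 (not the exact vLLM code):

```python
import ray

# One bundle pinned to the head node via the "node:__internal_head__" resource
# reserves a GPU for the driver worker; the remaining bundles are for the
# Ray worker actors.
ray.init()
world_size = 4  # assumed tensor-parallel size, for illustration only
placement_group_specs = ([{"GPU": 1, "node:__internal_head__": 0.01}] +
                         [{"GPU": 1}] * (world_size - 1))
pg = ray.util.placement_group(placement_group_specs)
ray.get(pg.ready())  # block until all bundles have been reserved
```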
vllm/engine/llm_engine.py (outdated)
if (bundle.get("node:__internal_head__", 0) > 0
        and self.driver_dummy_worker is None):
    self.driver_dummy_worker = ray.remote(
        num_cpus=0,
        num_gpus=num_gpus,
        scheduling_strategy=scheduling_strategy,
        **ray_remote_kwargs,
    )(RayWorkerVllm).remote()
    continue
@Yard1 This is where we use the bundle with the node:__internal_head__ resource to reserve the resource for the driver worker.
Sure! I just highlighted several places where we changed the logic of how we use placement groups, which I think are important for you to take a look at.
worker = ray.remote(
    num_cpus=0,
    num_gpus=num_gpus,
    scheduling_strategy=scheduling_strategy,
    **ray_remote_kwargs,
)(RayWorkerVllm).remote(self.model_config.trust_remote_code)
self.workers.append(worker)

worker_ip = ray.get(worker.get_node_ip.remote())
As a minor optimization, you could make this a separate loop so that the workers can be initialized in a non-blocking fashion, but considering that there's nothing really happening in __init__, I think it's OK to leave it as is (though it is an anti-pattern).
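A rough sketch of the suggested two-pass pattern, with hypothetical names (num_workers and trust_remote_code are assumed to be defined elsewhere; this is not code from the PR):

```python
import ray

# First pass: create all worker actors without blocking on any of them.
workers = []
for _ in range(num_workers):
    worker = ray.remote(
        num_cpus=0,
        num_gpus=1,
    )(RayWorkerVllm).remote(trust_remote_code)
    workers.append(worker)

# Second pass: resolve the node IPs in a single ray.get, so the actor
# constructors run concurrently instead of being awaited one by one.
worker_ips = ray.get([w.get_node_ip.remote() for w in workers])
```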
Yeah, and this only happens once, so I don't think it affects performance.
thanks, looks good!
This PR modifies vLLM to use NCCL instead of Ray for control-plane communication. The architectural change can be summarized as follows:

Before this change, vLLM has one driver process (without a GPU) and N worker processes, each of which is a Ray actor that manages one GPU. After this change, we move one worker into the driver process (the driver worker) and keep N-1 Ray workers. All control messages are broadcast from the driver worker to the remaining workers with NCCL. This avoids the high serialization cost of Ray communication.
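A minimal illustration of the idea, not the actual vLLM implementation: the driver worker (rank 0) broadcasts the per-step control metadata over the existing torch.distributed process group instead of sending it through Ray.

```python
import torch.distributed as dist

def broadcast_control_message(metadata, rank):
    """Rank 0 sends `metadata` (any picklable object) to all other ranks."""
    container = [metadata if rank == 0 else None]
    dist.broadcast_object_list(container, src=0)
    return container[0]
```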
For the throughput benchmark of LLaMA-70B on 8 A100-40G GPUs:
This PR has passed most of the tests and is ready for review.
Should be merged after #2270, #2273.