
[WIP][Core] fully composable launcher/task/coordinator/communicator design and implementation #3762

youkaichao (Member):

An implementation draft of #3587.

Check the test code for an example:

import pytest
import torch

from vllm.implementations.launcher.mp_launcher import MPLauncher
from vllm.implementations.coordinator import CoordinatorType
from vllm.implementations.communicator import CommunicatorType
from vllm.implementations.distributed_tasks.global_coordinator_task import GlobalCoordinatorDistributedTask
from vllm.implementations.distributed_tasks.group_coordinator_task import GroupCoordinatorDistributedTask

class AllReduceDistributedTask(GlobalCoordinatorDistributedTask):
    def post_init_distributed(self, **kwargs):
        tensor = torch.ones(16, 1024, 1024, dtype=torch.float32).cuda(
            self.coordinator.get_local_rank())
        self.communicator.all_reduce(tensor_in=tensor)
        result = tensor.mean().cpu().item()
        assert result == self.coordinator.get_local_world_size()

@pytest.mark.skipif(torch.cuda.device_count() < 2,
                    reason="Need at least 2 GPUs to run the test.")
def test_pynccl():
    MPLauncher(n_tasks=2).launch(
        task_type=AllReduceDistributedTask,
        coordinator_type=CoordinatorType.TORCH_DISTRIBUTED,
        communicator_type=CommunicatorType.PYNCCL,
    )

And here is the code defining a task type:

from vllm.implementations.coordinator import CoordinatorType, get_coordinator_class
from vllm.implementations.communicator import CommunicatorType, get_communicator_class
from vllm.interfaces.launcher import DistributedTask
from vllm.interfaces.communicator import Communicator
from vllm.interfaces.coordinator import Coordinator


class GlobalCoordinatorDistributedTask(DistributedTask):
    def run(self, *, coordinator_type: CoordinatorType,
            communicator_type: CommunicatorType, **kwargs):
        coordinator_cls = get_coordinator_class(coordinator_type)
        communicator_cls = get_communicator_class(communicator_type)
        self.coordinator: Coordinator = coordinator_cls()
        self.coordinator.initialize()
        self.communicator: Communicator = communicator_cls(self.coordinator)
        self.post_init_distributed(**kwargs)

    def post_init_distributed(self, **kwargs):
        """Subclasses can override this method to do whatever they want.
        They can use `self.coordinator` for global communication over the whole process group.
        They can use `self.communicator` for communication between devices.
        """
        return

We can see the full composability: MPLauncher only accepts launcher-specific args plus task_type, and passes the remaining args to the task, which initializes the coordinator and communicator.

It is worth noting that GlobalCoordinatorDistributedTask knows nothing about any specific coordinator or communicator. It just operates on the interfaces provided by Communicator and Coordinator.
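
To make the flow concrete, here is a rough, hypothetical sketch of the launcher side (the helper names and the zero-argument task construction are assumptions, not the actual MPLauncher code; the real launcher also has to set up rendezvous information such as rank and world size for the coordinator, which is omitted here):

import multiprocessing


def _run_task(task_type, kwargs):
    # Each worker process constructs the task and hands it the forwarded kwargs.
    task = task_type()
    task.run(**kwargs)


class SketchMPLauncher:
    """Launcher-specific args stop here; everything else flows through to the task."""

    def __init__(self, n_tasks: int):
        self.n_tasks = n_tasks

    def launch(self, *, task_type, **kwargs):
        procs = [
            multiprocessing.Process(target=_run_task, args=(task_type, kwargs))
            for _ in range(self.n_tasks)
        ]
        for p in procs:
            p.start()
        for p in procs:
            p.join()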

This is a draft implementation, and is open for discussion.

youkaichao (Member Author):

Here is a description of the general design:

[image: diagram of the general design]

youkaichao (Member Author):

Just for visibility: I also noticed that there are several concurrent efforts to refactor multi-GPU execution, e.g. #3466 and #3691. I will take a look at all of these and try to find out whether they can be unified to reduce redundant work.

njhill (Member) commented Mar 31, 2024:

@youkaichao #3466 has actually been ready now for about 6 weeks (it is a rebase of prior PR #2898). I just opened another one, #3763, which should be orthogonal/complementary to #3466. And I think either/both of these should address the performance issues that #3691 attempts to circumvent.

cadedaniel (Collaborator) left a comment:
Great stuff! Some questions:

  • Can the interfaces be kept as pure abstract classes (see the sketch after this list)? It is a little hard to follow the layers upon first reading. I feel we should avoid super().method() for user implementations as much as possible, as then the code relevant to the layer is completely contained within the implementation and not spread between implementation and interface.
  • The MPLauncher starts processes for the task. Is it possible with this design to run the Engine in a different process? The concern I have is that the Engine will do a lot of work that can contend for CPU with the model execution: things like receiving requests, async detokenization, and async tokenization. Of course we can push all of those to different threads, but the LLM ecosystem's dependency on Python means we won't be able to escape the GIL unless we cordon that work off into a different process.
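
A minimal sketch of what a pure-abstract interface could look like (hypothetical; only the all_reduce name is taken from the test code in this PR):

from abc import ABC, abstractmethod

import torch


class Communicator(ABC):
    """Pure interface: no shared logic, so each implementation is fully self-contained."""

    @abstractmethod
    def all_reduce(self, tensor_in: torch.Tensor) -> None:
        """Reduce tensor_in in place across all participating devices."""
        ...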

cadedaniel (Collaborator) commented on these lines of the Communicator interface:

from vllm.interfaces.coordinator import Coordinator


class Communicator(object):

why does this extend object?

youkaichao (Member Author):

I feel we should avoid super().method() for user implementations as much as possible, as then the code relevant to the layer is completely contained within the implementation and not spread between implementation and interface.

Technically it is doable. But then some code will be replicated over many implementations. There is a trade-off here.

Is it possible with this design to run the Engine in a different process?

Yes, we can. The launcher design is general, and we can have both CPU workers and GPU workers, although we need to think about how they coordinate with each other.

why does this extend object?

see https://stackoverflow.com/questions/4015417/why-do-python-classes-inherit-object

cadedaniel (Collaborator):

👍

btw, Python 2 was EOL'd a long time ago (in 2020 IIRC); we don't need to extend object anymore!
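
For reference, in Python 3 every class is a new-style class, so the two declarations below are equivalent (illustrative only):

class Communicator(object):  # old Python 2 style, still works
    ...


class Communicator:  # equivalent in Python 3
    ...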

cadedaniel (Collaborator):

I disagree that code will be replicated if we don't use super().method(). There is a third way: factor out common utilities that can be used across multiple implementations. Basically, composition vs. inheritance.
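
For illustration, the composition-style alternative could look roughly like this (a hypothetical helper, not code from this PR; only get_coordinator_class and get_communicator_class are taken from the PR):

from vllm.implementations.coordinator import CoordinatorType, get_coordinator_class
from vllm.implementations.communicator import CommunicatorType, get_communicator_class


def init_distributed(coordinator_type: CoordinatorType,
                     communicator_type: CommunicatorType):
    # Shared setup lives in a plain function that any task can call,
    # instead of in a base-class run() reached through super().
    coordinator = get_coordinator_class(coordinator_type)()
    coordinator.initialize()
    communicator = get_communicator_class(communicator_type)(coordinator)
    return coordinator, communicator


class AllReduceTask:
    def run(self, *, coordinator_type, communicator_type, **kwargs):
        # The task composes the utility instead of inheriting its behavior.
        self.coordinator, self.communicator = init_distributed(
            coordinator_type, communicator_type)
        # ... task-specific work using self.communicator ...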

zhuohan123 (Member):

I am a bit confused about this PR. Where will these communicators be used? How do they interact with the existing code?

youkaichao (Member Author):

Where will these communicators be used?

They will be used for GPU collective communication, like allreduce. For example (in the test code):

class AllReduceDistributedTask(GlobalCoordinatorDistributedTask):

    def post_init_distributed(self, **kwargs):
        tensor = torch.ones(16, 1024, 1024, dtype=torch.float32).cuda(
            self.coordinator.get_local_rank())
        self.communicator.all_reduce(tensor_in=tensor)
        result = tensor.mean().cpu().item()
        assert result == self.coordinator.get_local_world_size()

I haven't pushed this code into the existing codebase yet. This is just a draft for comments.

davidthomas426 commented Apr 9, 2024:

I think for several current uses, an interface like communicator.all_reduce is too high-level, as we may want the parallelization strategy to be configurable. An example is gather-logits-to-rank-0 with a single driver running sampling, vs. allgather-logits with sampling replicated on every rank. We could also replicate scheduling, in which case there would be no need to broadcast the bunch of input tensors that we currently broadcast.

But this can happen at higher abstraction layers, and in stages.
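
To illustrate the two strategies mentioned above, here is a rough, hypothetical sketch using plain torch.distributed (not code from this PR; it assumes the logits are sharded along the vocab dimension and uses greedy argmax as a stand-in for the sampler):

import torch
import torch.distributed as dist


def sample(full_logits: torch.Tensor) -> torch.Tensor:
    # Stand-in for the real sampler: greedy argmax over the vocab dimension,
    # which is deterministic and therefore safe to replicate across ranks.
    return full_logits.argmax(dim=-1)


def gather_to_rank0_and_sample(logits: torch.Tensor):
    # Strategy 1: gather the vocab-sharded logits to rank 0, where a single
    # driver runs sampling; the other ranks return nothing.
    world_size = dist.get_world_size()
    if dist.get_rank() == 0:
        shards = [torch.empty_like(logits) for _ in range(world_size)]
        dist.gather(logits, gather_list=shards, dst=0)
        return sample(torch.cat(shards, dim=-1))
    dist.gather(logits, dst=0)
    return None


def allgather_and_sample(logits: torch.Tensor):
    # Strategy 2: all-gather the shards so every rank holds the full logits
    # and replicates the (deterministic) sampling step locally.
    world_size = dist.get_world_size()
    shards = [torch.empty_like(logits) for _ in range(world_size)]
    dist.all_gather(shards, logits)
    return sample(torch.cat(shards, dim=-1))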

"Coordinator" and "Communicator" is a confusing split to me, and it doesn't help that the words sound too similar to each other. Maybe this is yak shaving, but could we consider different names that make the distinction clearer? Like "WorkerCoordinator" vs. "DeviceCommunicator", or something? Or even "CollectiveComms" or "CollectiveCommunicator" to make the "NCCL wrapper" idea more clear?

youkaichao (Member Author):

could we consider different names that make the distinction clearer? Like "WorkerCoordinator" vs. "DeviceCommunicator"

That's a great idea! Indeed, I ran into this exact problem when trying to convey this abstraction to others: they got confused about "Coordinator" vs. "Communicator". I think "WorkerCoordinator" and "DeviceCommunicator" are much better. Thank you for the proposal!

youkaichao (Member Author):

Closing this, as it has been broken up into several PRs.

youkaichao closed this Jun 14, 2024
youkaichao deleted the interface branch June 14, 2024 00:56