This repository was archived by the owner on Jul 24, 2025. It is now read-only.
Is your feature request related to a problem? Please describe.
ModelService currently supports disaggregated prefill and decode deployments, including tensor-parallel setups across multiple GPUs on a single node. To enable multi-node vLLM inference, however, ModelService needs to integrate with the LeaderWorkerSet controller.
For example, consider running vLLM with tensor_parallel_size=8, distributed across 2 nodes with 8 GPUs each. For this use-case, ModelService should be able to orchestrate a multi-pod workload under a LeaderWorkerSet, where each pod (worker) forms part of a single logical model instance.
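From the user's side, this could look like a hypothetical ModelService spec along these lines (the API group and all field names here are illustrative assumptions, not an existing schema; `useLeaderWorkerSet` is the field proposed in this issue):

```yaml
# Hypothetical sketch: apiVersion and the parallelism stanza are assumed
# field names, not an existing ModelService schema.
apiVersion: example.com/v1alpha1   # placeholder group/version
kind: ModelService
metadata:
  name: llama-70b
spec:
  useLeaderWorkerSet: true   # proposed field from this issue
  parallelism:
    tensor: 8                # would map to vLLM --tensor-parallel-size
    nodes: 2                 # pods per LeaderWorkerSet group
```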
Describe the solution approach you'd like
We need native support in the ModelService controller for:

- Creating and managing LeaderWorkerSet CRs.
- Propagating the relevant vLLM CLI/env args to each pod based on its role (leader vs. worker).
- Handling lifecycle and readiness conditions for a distributed setup.
- Reusing existing prefill/decode fields as much as possible, perhaps adding a new field like `useLeaderWorkerSet: true`.
- Completing the implementation of the parallelism stanza in ModelService (including node parallelism).
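Concretely, for the two-node use-case above, the controller could generate a LeaderWorkerSet along the lines of the following sketch. The LeaderWorkerSet API itself (`leaderWorkerTemplate`, `size`, `restartPolicy`, the injected `LWS_LEADER_ADDRESS` env var) is real; the image, model name, entrypoints, and parallelism values are illustrative assumptions, not a verified deployment:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llama-70b
spec:
  replicas: 1                        # one logical model instance
  leaderWorkerTemplate:
    size: 2                          # leader + 1 worker = 2 nodes
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      spec:
        containers:
        - name: vllm-leader
          image: vllm/vllm-openai:latest   # illustrative image
          # Leader starts the Ray head and serves the model; TP=8 within
          # each node, PP=2 across the two nodes (16 GPUs total).
          command: ["sh", "-c"]
          args:
          - ray start --head --port=6379 &&
            vllm serve meta-llama/Llama-3.1-70B-Instruct
            --tensor-parallel-size 8 --pipeline-parallel-size 2
          resources:
            limits:
              nvidia.com/gpu: "8"
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          image: vllm/vllm-openai:latest
          # Workers join the Ray cluster via the leader address that the
          # LeaderWorkerSet controller injects as LWS_LEADER_ADDRESS.
          command: ["sh", "-c"]
          args:
          - ray start --block --address=$(LWS_LEADER_ADDRESS):6379
          resources:
            limits:
              nvidia.com/gpu: "8"
```

This is essentially the role-based arg propagation described above: the controller would render different entrypoints and env wiring into the leader and worker templates from a single ModelService spec.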
Describe alternatives you've considered
Manually creating LeaderWorkerSet CRs and bypassing ModelService entirely, which breaks the declarative, unified model-serving approach.