Description
Hi vLLM geniuses @zhuohan123 @WoosukKwon,
We noticed the plan to support the Triton server on the vLLM roadmap. Together with @defined1007, I have made some attempts of our own, and here we share our choices and practices in the hope of jointly pushing this work forward.
Background and Objectives
We intend to use the Triton server internally to simplify model management and to integrate with our internal services.
Current Situation
At the RPC level, the Triton server handles requests asynchronously, but at the instance-execution level requests are processed synchronously with static batching. Consequently, with only a single instance the pipeline degenerates into a multi-producer single-consumer (MPSC) pattern, whereas what we want is multi-producer multi-consumer (MPMC).
Strategy
Strategy One: Triton Server + Python Backend
This approach runs multiple instances as separate processes, so there is no memory sharing between them. Our observations:
- We cannot start enough instances, which results in low throughput.
- With max_batch_size enabled, throughput can match that of the API server, but latency is too high to meet our requirements.
- Using the Python backend as a proxy that talks to the API server process over HTTP avoids initializing the model multiple times. The implementation may not be elegant, but both throughput and latency meet our requirements (see the sketch below).
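Below is a minimal sketch of the proxy idea. It assumes the demo vllm.entrypoints.api_server is running as a separate process (here on port 8081) and that the model config declares a BYTES input named PROMPT and a BYTES output named COMPLETION; the tensor names, port, and sampling parameters are our own assumptions, not part of any official example.

```python
# model.py -- Triton Python backend instance acting as a thin HTTP proxy to the vLLM API server.
# Sketch only: tensor names (PROMPT/COMPLETION), the server URL, and the /generate payload
# are assumptions based on the demo vllm.entrypoints.api_server.
import numpy as np
import requests
import triton_python_backend_utils as pb_utils

VLLM_API_URL = "http://localhost:8081/generate"  # assumed address of the API server process


class TritonPythonModel:
    def initialize(self, args):
        # No weights are loaded here; the separate API server process owns the engine.
        self.session = requests.Session()

    def execute(self, triton_requests):
        responses = []
        for request in triton_requests:
            # Assumes one prompt per request (no batching on the Triton side).
            prompt = pb_utils.get_input_tensor_by_name(request, "PROMPT").as_numpy()[0]
            payload = {
                "prompt": prompt.decode("utf-8"),
                "max_tokens": 256,  # assumed sampling parameters
                "temperature": 0.0,
            }
            # Forward the prompt to the vLLM API server and wait for the full completion.
            result = self.session.post(VLLM_API_URL, json=payload).json()
            completion = result["text"][0]
            out_tensor = pb_utils.Tensor(
                "COMPLETION", np.array([completion.encode("utf-8")], dtype=np.object_)
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        self.session.close()
```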
Strategy Two: Triton Server + Custom Backend (C++)
This approach uses multi-threading with shared memory, so we can start enough instances. We use pybind11 to call the vLLM async engine from the C++ backend; however, the Python GIL still constrains concurrency.
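For reference, the Python side that such a C++ custom backend could drive through pybind11 might look roughly like the following. This is only a sketch under our assumptions (the wrapper class and its blocking entry point are ours); it shows the vLLM async engine API we bind against and where the GIL becomes the bottleneck.

```python
# engine_wrapper.py -- thin wrapper a C++ custom backend could call via an embedded
# interpreter (pybind11). Sketch only: the wrapper and method names are ours; the
# vLLM classes are the public async engine API.
import asyncio
import uuid

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams


class EngineWrapper:
    def __init__(self, model: str):
        self.engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model=model))
        # A single event loop owned by the wrapper; every backend thread funnels into it.
        self.loop = asyncio.new_event_loop()
        asyncio.set_event_loop(self.loop)

    async def _generate(self, prompt: str, max_tokens: int) -> str:
        params = SamplingParams(max_tokens=max_tokens)
        final_output = None
        async for output in self.engine.generate(prompt, params, str(uuid.uuid4())):
            final_output = output  # keep the last (finished) RequestOutput
        return final_output.outputs[0].text

    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        # Blocking entry point for the C++ side. Each calling thread must hold the GIL
        # while this runs, so multiple C++ instance threads end up serialized here.
        return self.loop.run_until_complete(self._generate(prompt, max_tokens))
```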
Other
Short-term Resolution
For the immediate term, our choice is to stick with Triton Server + Python backend, using the proxy method to interact with the API server.
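For completeness, a client can then reach the proxied model through Triton's standard HTTP frontend; the model and tensor names below (vllm_proxy, PROMPT, COMPLETION) are the ones assumed in the earlier sketch.

```python
# client.py -- query the proxy model through Triton's HTTP endpoint (default port 8000).
# Model/tensor names match the assumptions in the proxy sketch above.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

prompt = np.array([b"San Francisco is a"], dtype=np.object_)
infer_input = httpclient.InferInput("PROMPT", [1], "BYTES")
infer_input.set_data_from_numpy(prompt)

result = client.infer(
    model_name="vllm_proxy",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("COMPLETION")],
)
print(result.as_numpy("COMPLETION")[0].decode("utf-8"))
```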
Long-term Perspective
- Enable the Triton server to support continuous batching in its scheduler, or
- re-implement the vLLM library in C++ to make integration easier.
We welcome any advice on this matter.