Description
Hi vLLM geniuses @zhuohan123 @WoosukKwon,
We noticed the plan to support the Triton server on the vLLM roadmap. Together with @defined1007, I have made some attempts of our own, and here we share our choices and practices in the hope of jointly pushing this work forward.
Background and Objectives
We intend to use the Triton server internally to simplify model management and to integrate with our internal services.
Current Situation
At the RPC level, the Triton server handles requests asynchronously, but at the instance-execution level requests are processed synchronously with static batching. Consequently, with only a single instance the pipeline degenerates into a multi-producer single-consumer (MPSC) pattern, whereas what we want is multi-producer multi-consumer (MPMC).
Strategy
Strategy One: Triton Server + Python Backend
This approach runs multiple instances as separate processes, so there is no memory sharing between them. Our observations:
- We cannot start enough instances, which results in low throughput.
- With max_batch_size enabled, throughput can match that of the API server, but latency is too high to meet our requirements.
- Using the Python backend as a proxy that talks to the API server process over HTTP avoids initializing the model multiple times. The implementation may not be elegant, but both throughput and latency meet our requirements (see the sketch below).
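Below is a minimal sketch of the proxy idea. It assumes the demo vllm.entrypoints.api_server is running as a separate process (here on port 8081) and that the model config declares a BYTES input named PROMPT and a BYTES output named COMPLETION; the tensor names, port, and sampling parameters are our own assumptions, not part of any official example.

```python
# model.py -- Triton Python backend instance acting as a thin HTTP proxy to the vLLM API server.
# Sketch only: tensor names (PROMPT/COMPLETION), the server URL, and the /generate payload
# are assumptions based on the demo vllm.entrypoints.api_server.
import numpy as np
import requests
import triton_python_backend_utils as pb_utils

VLLM_API_URL = "http://localhost:8081/generate"  # assumed address of the API server process


class TritonPythonModel:
    def initialize(self, args):
        # No weights are loaded here; the separate API server process owns the engine.
        self.session = requests.Session()

    def execute(self, triton_requests):
        responses = []
        for request in triton_requests:
            # Assumes one prompt per request (no batching on the Triton side).
            prompt = pb_utils.get_input_tensor_by_name(request, "PROMPT").as_numpy()[0]
            payload = {
                "prompt": prompt.decode("utf-8"),
                "max_tokens": 256,  # assumed sampling parameters
                "temperature": 0.0,
            }
            # Forward the prompt to the vLLM API server and wait for the full completion.
            result = self.session.post(VLLM_API_URL, json=payload).json()
            completion = result["text"][0]
            out_tensor = pb_utils.Tensor(
                "COMPLETION", np.array([completion.encode("utf-8")], dtype=np.object_)
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        self.session.close()
```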
Strategy Two: Triton Server + Custom Backend (C++)
This approach uses multi-threading with shared memory, so we can start enough instances. We use pybind11 to call the vLLM async engine from the C++ backend; however, the Python GIL still constrains concurrency.
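For reference, the Python side that such a C++ custom backend could drive through pybind11 might look roughly like the following. This is only a sketch under our assumptions (the wrapper class and its blocking entry point are ours); it shows the vLLM async engine API we bind against and where the GIL becomes the bottleneck.

```python
# engine_wrapper.py -- thin wrapper a C++ custom backend could call via an embedded
# interpreter (pybind11). Sketch only: the wrapper and method names are ours; the
# vLLM classes are the public async engine API.
import asyncio
import uuid

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams


class EngineWrapper:
    def __init__(self, model: str):
        self.engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model=model))
        # A single event loop owned by the wrapper; every backend thread funnels into it.
        self.loop = asyncio.new_event_loop()
        asyncio.set_event_loop(self.loop)

    async def _generate(self, prompt: str, max_tokens: int) -> str:
        params = SamplingParams(max_tokens=max_tokens)
        final_output = None
        async for output in self.engine.generate(prompt, params, str(uuid.uuid4())):
            final_output = output  # keep the last (finished) RequestOutput
        return final_output.outputs[0].text

    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        # Blocking entry point for the C++ side. Each calling thread must hold the GIL
        # while this runs, so multiple C++ instance threads end up serialized here.
        return self.loop.run_until_complete(self._generate(prompt, max_tokens))
```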
Other
Short-term Resolution
For the immediate term, our choice is to stick with Triton Server + Python backend, using the proxy method to interact with the API server.
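For completeness, a client can then reach the proxied model through Triton's standard HTTP frontend; the model and tensor names below (vllm_proxy, PROMPT, COMPLETION) are the ones assumed in the earlier sketch.

```python
# client.py -- query the proxy model through Triton's HTTP endpoint (default port 8000).
# Model/tensor names match the assumptions in the proxy sketch above.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

prompt = np.array([b"San Francisco is a"], dtype=np.object_)
infer_input = httpclient.InferInput("PROMPT", [1], "BYTES")
infer_input.set_data_from_numpy(prompt)

result = client.infer(
    model_name="vllm_proxy",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("COMPLETION")],
)
print(result.as_numpy("COMPLETION")[0].decode("utf-8"))
```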
Long-term Perspective
- Enable the Triton server to support continuous batching in its scheduler, or
- re-implement the vLLM library in C++ to make integration easier.
We welcome any advice on this matter.