Triton can support backends and models that send multiple responses for a request or zero responses for a request. A decoupled model/backend may also send responses out-of-order relative to the order in which the request batches are executed. This allows the backend to deliver a response whenever it deems fit. This is particularly useful in Automatic Speech Recognition (ASR), where requests with a large number of responses will not block the responses from other requests from being delivered.
Read carefully about the Triton Backend API, Inference Requests and Responses, and Decoupled Responses. The repeat backend and square backend demonstrate how the Triton Backend API can be used to implement a decoupled backend. These examples are designed to show the flexibility of the Triton API and in no way should be used in production. They may process multiple batches of requests at the same time without having to increase the instance count. In a real deployment, the backend should not allow the caller thread to return from TRITONBACKEND_ModelInstanceExecute until that instance is ready to handle another set of requests. If not designed properly, the backend can easily be over-subscribed. This can also cause under-utilization of features like Dynamic Batching, as it leads to eager batching.
Read carefully about the Python Backend, and specifically the execute function.
The decoupled examples demonstrate how the decoupled API can be used to implement a decoupled Python model. As noted in the examples, they are designed to show the flexibility of the decoupled API and in no way should be used in production.
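As a rough sketch of the pattern (not one of the shipped examples), a decoupled Python model sends its responses through each request's response sender rather than returning them from execute. The "IN"/"OUT" tensor names and the one-response-per-element behavior below are illustrative assumptions only:

```python
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        # In decoupled mode, responses are pushed through each request's
        # response sender instead of being returned from execute().
        for request in requests:
            sender = request.get_response_sender()
            # "IN" and "OUT" are hypothetical tensor names for illustration.
            in_values = pb_utils.get_input_tensor_by_name(request, "IN").as_numpy()
            for element in in_values:
                out_tensor = pb_utils.Tensor("OUT", np.array([element], dtype=np.int32))
                sender.send(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
            # Signal that no further responses will be sent for this request.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        # A decoupled execute() must return None; responses were already sent.
        return None
```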
The decoupled model transaction policy must be set in the model configuration file provided for the model. Triton requires this information to enable the special handling required for decoupled models. Deploying decoupled models without this configuration setting will throw errors at runtime.
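For reference, enabling the decoupled transaction policy in the model's config.pbtxt looks like this:

```
model_transaction_policy {
  decoupled: True
}
```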
Inference Protocols and APIs describes the various ways a client can communicate with the server and run inference. For decoupled models, Triton's HTTP endpoint cannot be used for running inference, as it supports exactly one response per request. Even the standard ModelInfer RPC in the GRPC endpoint does not support decoupled responses. In order to run inference on a decoupled model, the client must use the bi-directional streaming RPC. See here for more details. The decoupled_test.py demonstrates how gRPC streaming can be used to run inference on decoupled models.
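A minimal streaming client sketch in Python is shown below. The model name square_int32, the IN/OUT tensor names, and the one-response-per-unit-of-input behavior are assumed to follow the square backend example; adjust them for your own model:

```python
from functools import partial
import queue

import numpy as np
import tritonclient.grpc as grpcclient


def callback(responses, result, error):
    # Every decoupled response (or error) on the stream arrives here.
    responses.put(error if error is not None else result)


responses = queue.Queue()
with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.start_stream(callback=partial(callback, responses))

    # Input value 4 should produce four responses from the square model.
    value = np.array([4], dtype=np.int32)
    inputs = [grpcclient.InferInput("IN", [1], "INT32")]
    inputs[0].set_data_from_numpy(value)
    client.async_stream_infer(model_name="square_int32", inputs=inputs)

    # Drain the expected number of responses before closing the stream.
    for _ in range(int(value[0])):
        result = responses.get()
        if isinstance(result, Exception):
            raise result
        print(result.as_numpy("OUT"))
```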
If using Triton's in-process C API, your application should be cognizant that the callback function you registered with TRITONSERVER_InferenceRequestSetResponseCallback can be invoked any number of times, each time with a new response. You can take a look at grpc_server.cc to see how this is handled.