This document describes Triton's sequence extension. The sequence extension allows Triton to support stateful models that expect a sequence of related inference requests. Because this extension is supported, Triton reports “sequence” in the extensions field of its Server Metadata.
An inference request specifies that it is part of a sequence using the “sequence_id” parameter, and uses the “sequence_start” and “sequence_end” parameters to indicate the start and end of the sequence.
- "sequence_id" : uint64 value that indicates which sequence the request belongs to. All inference requests that belong to the same sequence must use the same sequence ID. A sequence ID of 0 indicates the inference request is not part of a sequence.
- "sequence_start" : boolean value that, if set to true, indicates that the request is the first in a sequence. If not set, or set to false, the request is not the first in a sequence. If set to true, the "sequence_id" parameter must be set to a non-zero value.
- "sequence_end" : boolean value that, if set to true, indicates that the request is the last in a sequence. If not set, or set to false, the request is not the last in a sequence. If set to true, the "sequence_id" parameter must be set to a non-zero value.
The following example shows how a request is marked as part of a sequence. In this case the sequence_start and sequence_end parameters are not used, which means that this request is neither the first nor the last request in the sequence.
POST /v2/models/mymodel/infer HTTP/1.1
Host: localhost:8000
Content-Type: application/json
Content-Length: <xx>
{
  "parameters" : { "sequence_id" : 42 },
  "inputs" : [
    {
      "name" : "input0",
      "shape" : [ 2, 2 ],
      "datatype" : "UINT32",
      "data" : [ 1, 2, 3, 4 ]
    }
  ],
  "outputs" : [
    {
      "name" : "output0"
    }
  ]
}
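The example above is a middle-of-sequence request. As a sketch of how a complete sequence might be driven from a client, the following uses the Python tritonclient.http package; the model name "mymodel" and the input name, shape, and datatype are carried over from the example above and are assumptions about the deployed model.

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# All requests in the sequence share the same non-zero sequence ID.
SEQUENCE_ID = 42

def make_inputs():
    # "input0" with shape [2, 2] and UINT32 data, matching the example above.
    data = np.array([[1, 2], [3, 4]], dtype=np.uint32)
    infer_input = httpclient.InferInput("input0", list(data.shape), "UINT32")
    infer_input.set_data_from_numpy(data)
    return [infer_input]

# First request: sequence_start marks the beginning of the sequence.
client.infer("mymodel", make_inputs(), sequence_id=SEQUENCE_ID, sequence_start=True)

# Middle request: only the sequence ID is needed.
client.infer("mymodel", make_inputs(), sequence_id=SEQUENCE_ID)

# Last request: sequence_end marks the end of the sequence.
client.infer("mymodel", make_inputs(), sequence_id=SEQUENCE_ID, sequence_end=True)

The client maps the sequence_id, sequence_start and sequence_end keyword arguments onto the request parameters described above.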
In addition to supporting the sequence parameters described above, the GRPC API adds a streaming version of the inference API that allows a sequence of inference requests to be sent over the same GRPC stream. This streaming API is not required for requests that specify a sequence_id, and may also be used by requests that do not specify a sequence_id. The ModelInferRequest is the same as for the ModelInfer API. The ModelStreamInferResponse message is shown below.
service GRPCInferenceService
{
  …

  // Perform inference using a specific model with GRPC streaming.
  rpc ModelStreamInfer(stream ModelInferRequest) returns (stream ModelStreamInferResponse) {}
}
// Response message for ModelStreamInfer.
message ModelStreamInferResponse
{
  // The message describing the error. An empty message
  // indicates the inference was successful without errors.
  string error_message = 1;

  // Holds the results of the request.
  ModelInferResponse infer_response = 2;
}
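As a sketch of how a sequence might be sent over a single stream, the following uses the Python tritonclient.grpc package, whose start_stream/async_stream_infer calls wrap ModelStreamInfer; the model name and input tensor are the same assumptions as in the earlier example.

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Called once per ModelStreamInferResponse received on the stream.
def callback(result, error):
    if error is not None:
        print("error:", error)
    else:
        print("response:", result.get_response())

data = np.array([[1, 2], [3, 4]], dtype=np.uint32)
infer_input = grpcclient.InferInput("input0", list(data.shape), "UINT32")
infer_input.set_data_from_numpy(data)

# Open the stream, send a three-request sequence over it, then close it.
client.start_stream(callback=callback)
client.async_stream_infer("mymodel", [infer_input], sequence_id=42, sequence_start=True)
client.async_stream_infer("mymodel", [infer_input], sequence_id=42)
client.async_stream_infer("mymodel", [infer_input], sequence_id=42, sequence_end=True)
client.stop_stream()

The requests are sent asynchronously; stop_stream() closes the stream after the outstanding responses have been processed, so the callback runs once for each request.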