In this section we demonstrate an end-to-end example of developing and serving decoupled models in the Python backend.
repeat_model.py and square_model.py demonstrate
how to write a decoupled model where each request can generate zero to many responses.
These files are heavily commented to describe each function call.

These example models are designed to show the flexibility available to decoupled models
and in no way should be used in production. These examples circumvent
the restriction placed by the instance count
and allow multiple requests to be in process even for a single instance. In
a real deployment, the model should not allow the caller thread to return from
execute
until that instance is ready to handle another set of requests.
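For orientation, here is a minimal sketch of the decoupled execution pattern these examples build on. It is not the actual repeat_model.py or square_model.py code; the tensor names and the echo logic are illustrative only, and error handling is omitted.

```python
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            # Each request owns a response sender; a decoupled model may
            # call send() on it zero or more times for that request.
            sender = request.get_response_sender()
            in_tensor = pb_utils.get_input_tensor_by_name(request, "IN")

            # Illustrative logic only: emit one response per input element.
            for value in in_tensor.as_numpy():
                out_tensor = pb_utils.Tensor("OUT", np.array([value], dtype=np.int32))
                sender.send(pb_utils.InferenceResponse(output_tensors=[out_tensor]))

            # Tell Triton that no more responses will follow for this request.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)

        # Decoupled models return None; responses travel through the senders.
        return None
```

The shipped examples go further and hand the sender off to a separate thread so that execute can return before all responses are sent, which is exactly the behavior the caution above refers to.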
- Create the model repository:
$ mkdir -p models/repeat_int32/1
$ mkdir -p models/square_int32/1
# Copy the Python models
$ cp examples/decoupled/repeat_model.py models/repeat_int32/1/model.py
$ cp examples/decoupled/repeat_config.pbtxt models/repeat_int32/config.pbtxt
$ cp examples/decoupled/square_model.py models/square_int32/1/model.py
$ cp examples/decoupled/square_config.pbtxt models/square_int32/config.pbtxt
- Start the tritonserver:
$ tritonserver --model-repository `pwd`/models
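The config.pbtxt files copied above are what mark these models as decoupled. The copied files are authoritative; the sketch below only illustrates the important model_transaction_policy setting, with the input/output entries inferred from the client output shown later in this section.

```
backend: "python"
max_batch_size: 0

# This setting is what allows the model to send 0..N responses per request.
model_transaction_policy {
  decoupled: True
}

input [
  { name: "IN",    data_type: TYPE_INT32,  dims: [ -1 ] },
  { name: "DELAY", data_type: TYPE_UINT32, dims: [ -1 ] },
  { name: "WAIT",  data_type: TYPE_UINT32, dims: [ 1 ] }
]
output [
  { name: "OUT", data_type: TYPE_INT32,  dims: [ 1 ] },
  { name: "IDX", data_type: TYPE_UINT32, dims: [ 1 ] }
]
```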
- Send inference requests to the repeat model using repeat_client.py:
$ python3 examples/decoupled/repeat_client.py
You should see output similar to the following:
stream started...
async_stream_infer
model_name: "repeat_int32"
id: "0"
inputs {
name: "IN"
datatype: "INT32"
shape: 4
}
inputs {
name: "DELAY"
datatype: "UINT32"
shape: 4
}
inputs {
name: "WAIT"
datatype: "UINT32"
shape: 1
}
outputs {
name: "OUT"
}
outputs {
name: "IDX"
}
raw_input_contents: "\004\000\000\000\002\000\000\000\000\000\000\000\001\000\000\000"
raw_input_contents: "\001\000\000\000\002\000\000\000\003\000\000\000\004\000\000\000"
raw_input_contents: "\005\000\000\000"
enqueued request 0 to stream...
infer_response {
model_name: "repeat_int32"
model_version: "1"
id: "0"
outputs {
name: "IDX"
datatype: "UINT32"
shape: 1
}
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\000\000\000\000"
raw_output_contents: "\004\000\000\000"
}
infer_response {
model_name: "repeat_int32"
model_version: "1"
id: "0"
outputs {
name: "IDX"
datatype: "UINT32"
shape: 1
}
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\001\000\000\000"
raw_output_contents: "\002\000\000\000"
}
infer_response {
model_name: "repeat_int32"
model_version: "1"
id: "0"
outputs {
name: "IDX"
datatype: "UINT32"
shape: 1
}
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\002\000\000\000"
raw_output_contents: "\000\000\000\000"
}
infer_response {
model_name: "repeat_int32"
model_version: "1"
id: "0"
outputs {
name: "IDX"
datatype: "UINT32"
shape: 1
}
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\003\000\000\000"
raw_output_contents: "\001\000\000\000"
}
PASS: repeat_int32
stream stopped...
Note how a single request generated four responses.
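For reference, here is a rough sketch of how a streaming client like repeat_client.py can produce the exchange above using tritonclient.grpc. The input values mirror the dump, but the structure is illustrative rather than a copy of the shipped client.

```python
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient


def callback(responses, result, error):
    # Stream callback: queue each response (or error) as it arrives.
    responses.put(error if error is not None else result)


client = grpcclient.InferenceServerClient("localhost:8001")
responses = queue.Queue()
client.start_stream(callback=partial(callback, responses))

# Same values as the dump above; the repeat model produces one response
# per element of IN (see repeat_model.py for the DELAY/WAIT semantics).
in_data = np.array([4, 2, 0, 1], dtype=np.int32)
delay_data = np.array([1, 2, 3, 4], dtype=np.uint32)
wait_data = np.array([5], dtype=np.uint32)

inputs = [
    grpcclient.InferInput("IN", in_data.shape, "INT32"),
    grpcclient.InferInput("DELAY", delay_data.shape, "UINT32"),
    grpcclient.InferInput("WAIT", wait_data.shape, "UINT32"),
]
for tensor, data in zip(inputs, (in_data, delay_data, wait_data)):
    tensor.set_data_from_numpy(data)

client.async_stream_infer(model_name="repeat_int32", inputs=inputs, request_id="0")

# Expect as many responses as there are elements in IN.
for _ in range(len(in_data)):
    result = responses.get()
    if isinstance(result, Exception):
        raise result
    print(result.as_numpy("IDX"), result.as_numpy("OUT"))

client.stop_stream()
```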
- Send inference requests to the square model using square_client.py:
$ python3 examples/decoupled/square_client.py
You should see output similar to the following:
stream started...
async_stream_infer
model_name: "square_int32"
id: "0"
inputs {
name: "IN"
datatype: "INT32"
shape: 1
}
outputs {
name: "OUT"
}
raw_input_contents: "\004\000\000\000"
enqueued request 0 to stream...
async_stream_infer
model_name: "square_int32"
id: "1"
inputs {
name: "IN"
datatype: "INT32"
shape: 1
}
outputs {
name: "OUT"
}
raw_input_contents: "\002\000\000\000"
enqueued request 1 to stream...
async_stream_infer
model_name: "square_int32"
id: "2"
inputs {
name: "IN"
datatype: "INT32"
shape: 1
}
outputs {
name: "OUT"
}
raw_input_contents: "\000\000\000\000"
enqueued request 2 to stream...
async_stream_infer
model_name: "square_int32"
id: "3"
inputs {
name: "IN"
datatype: "INT32"
shape: 1
}
outputs {
name: "OUT"
}
raw_input_contents: "\001\000\000\000"
enqueued request 3 to stream...
infer_response {
model_name: "square_int32"
model_version: "1"
id: "0"
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\004\000\000\000"
}
infer_response {
model_name: "square_int32"
model_version: "1"
id: "1"
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\002\000\000\000"
}
infer_response {
model_name: "square_int32"
model_version: "1"
id: "0"
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\004\000\000\000"
}
infer_response {
model_name: "square_int32"
model_version: "1"
id: "3"
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\001\000\000\000"
}
infer_response {
model_name: "square_int32"
model_version: "1"
id: "1"
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\002\000\000\000"
}
infer_response {
model_name: "square_int32"
model_version: "1"
id: "0"
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\004\000\000\000"
}
infer_response {
model_name: "square_int32"
model_version: "1"
id: "0"
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\004\000\000\000"
}
PASS: square_int32
stream stopped...
Note how the responses were not delivered in the same order as the requests were sent.
Each generated response can be matched to its originating request using
the id
field.
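For example, a stream callback could group responses by that id. This is only a sketch under the same tritonclient.grpc streaming setup shown earlier, not code taken from square_client.py.

```python
# Group streamed responses by the id of the request they answer.
outputs_by_request = {}

def callback(result, error):
    if error is not None:
        print(error)
        return
    # The id in each response echoes the request_id that was passed to
    # async_stream_infer for the originating request.
    request_id = result.get_response().id
    outputs_by_request.setdefault(request_id, []).append(result.as_numpy("OUT"))
```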