Add extension documentation (triton-inference-server#1315)
* Add extension documentation

* Add GRPC proto file for core protocol
deadeyegoodwin authored Apr 15, 2020
1 parent fd79735 commit cfe7b78
Showing 10 changed files with 2,057 additions and 0 deletions.
44 changes: 44 additions & 0 deletions docs/protocol/README.md
<!--
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# HTTP/REST and GRPC Protocol

This directory contains documents related to the HTTP/REST and GRPC
protocols used by Triton. Triton uses the [KFServing community
standard inference
protocols](https://github.com/kubeflow/kfserving/tree/master/docs/predict-api/v2)
plus several extensions that are defined in the following documents:

- [Binary tensor data extension](./extension_binary_data.md)
- [Classification extension](./extension_classification.md)
- [Model configuration extension](./extension_model_configuration.md)
- [Model repository extension](./extension_model_repository.md)
- [Schedule policy extension](./extension_schedule_policy.md)
- [Sequence extension](./extension_sequence.md)
- [Shared-memory extension](./extension_shared_memory.md)
- [Statistics extension](./extension_statistics.md)
138 changes: 138 additions & 0 deletions docs/protocol/extension_binary_data.md

# Binary Tensor Data Extension

This document describes Triton's binary tensor data extension. The
binary tensor data extension allows Triton to support tensor data
represented in a binary format in the body of an HTTP/REST
request. Because this extension is supported, Triton reports
"binary_tensor_data" in the extensions field of its Server Metadata.

Tensor data represented as binary data is organized in little-endian
byte order, row major, without stride or padding between elements. All
tensor data types are representable as binary data in the native size
of the data type. For the BOOL type, a true element is a single byte
with value 1 and a false element is a single byte with value 0. For
the BYTES type, an element is represented by a 4-byte unsigned integer
giving the length, followed by the actual bytes. The binary data for a
tensor is delivered in the HTTP body after the JSON object (see
Examples).
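
As an illustration only (this sketch is not part of the protocol
definition), the following shows how a client might serialize numpy
tensors into this binary layout. The serialize_tensor helper is
hypothetical and numpy is assumed to be available.

```
import struct

import numpy as np


def serialize_tensor(array):
    # Pack a numpy array into the binary tensor layout: little-endian,
    # row major, no stride or padding between elements.
    if array.dtype == np.object_:
        # BYTES type: each element is a 4-byte length followed by the bytes.
        out = b""
        for element in array.flatten():
            data = element if isinstance(element, bytes) else str(element).encode()
            out += struct.pack("<I", len(data)) + data
        return out
    # Fixed-size types: force little-endian byte order and C (row major) order.
    return array.astype(array.dtype.newbyteorder("<")).tobytes(order="C")


# A UINT32 [ 2, 2 ] tensor packs to 16 bytes; a BOOL [ 3 ] tensor to 3 bytes.
assert len(serialize_tensor(np.array([[1, 2], [3, 4]], dtype=np.uint32))) == 16
assert len(serialize_tensor(np.array([True, False, True], dtype=np.bool_))) == 3
```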

The binary tensor data extension uses a set of parameters to indicate
that an input or output tensor is communicated as binary data. The
first parameter is used in $request_input and $response_output to
indicate that the input or output tensor is communicated as binary
data:

- "binary_data_size" : int64 parameter indicating the size of the
tensor binary data, in bytes.

The second parameter is used in $request_output to indicate that the
output should be returned from Triton as binary data.

- "binary_data" : bool parameter that is true if the output should be
returned as binary data and false (or not given) if the tensor
should be returned as JSON.

When one or more tensors are communicated as binary data, the HTTP
body of the request or response will contain the JSON inference
request or response object followed by the binary tensor data, in the
same order as the tensors are specified in the JSON. If any binary
data is present in the request or response, the
Inference-Header-Content-Length header must be provided to give the
length of the JSON object, and Content-Length continues to give the
full body length (as HTTP requires).
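
Putting the pieces together, the following sketch builds a complete
request body. It is illustrative only, assuming the serialize_tensor
helper above and Python's requests package; note that requests
computes Content-Length itself, while Inference-Header-Content-Length
must be set explicitly.

```
import json

import numpy as np
import requests

input0 = np.array([[1, 2], [3, 4]], dtype=np.uint32)
input1 = np.array([True, False, True], dtype=np.bool_)
raw0 = serialize_tensor(input0)
raw1 = serialize_tensor(input1)

header = json.dumps({
    "model_name": "mymodel",
    "inputs": [
        {"name": "input0", "shape": [2, 2], "datatype": "UINT32",
         "parameters": {"binary_data_size": len(raw0)}},
        {"name": "input1", "shape": [3], "datatype": "BOOL",
         "parameters": {"binary_data_size": len(raw1)}},
    ],
    "outputs": [{"name": "output0", "parameters": {"binary_data": True}}],
}).encode()

# The JSON object comes first, then the binary tensors in the same
# order as they appear in "inputs".
response = requests.post(
    "http://localhost:8000/v2/models/mymodel/infer",
    headers={"Inference-Header-Content-Length": str(len(header))},
    data=header + raw0 + raw1,
)
```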

## Examples

For the following request the input tensors are sent as binary data,
and the output tensor is requested as binary data. Note that the
total size of the binary data is 19 bytes, and that size must be
reflected in the content length headers.

```
POST /v2/models/mymodel/infer HTTP/1.1
Host: localhost:8000
Content-Type: application/octet-stream
Inference-Header-Content-Length: <xx>
Content-Length: <xx+19>
{
"model_name" : "mymodel",
"inputs" : [
{
"name" : "input0",
"shape" : [ 2, 2 ],
"datatype" : "UINT32",
"parameters" : {
"binary_data_size" : 16
}
},
{
"name" : "input1",
"shape" : [ 3 ],
"datatype" : "BOOL",
"parameters" : {
"binary_data_size" : 3
}
}
],
"outputs" : [
{
"name" : "output0",
"parameters" : {
"binary_data" : true
}
}
]
}
<16 bytes of data for input0 tensor>
<3 bytes of data for input1 tensor>
```

Assuming the model returns a [ 3, 2 ] tensor of data type FP32 the
following response would be returned.

```
HTTP/1.1 200 OK
Content-Type: application/octet-stream
Inference-Header-Content-Length: <yy>
Content-Length: <yy+24>
{
"outputs" : [
{
"name" : "output0",
"shape" : [ 3, 2 ],
"datatype" : "FP32",
"parameters" : {
"binary_data_size" : 24
}
}
]
}
<24 bytes of data for output0 tensor>
```
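
Continuing the client-side sketch above (again illustrative, not a
normative part of the extension), the response can be split back into
its JSON and binary parts using Inference-Header-Content-Length. The
response object from the earlier requests call is assumed.

```
import json

import numpy as np

json_size = int(response.headers["Inference-Header-Content-Length"])
reply = json.loads(response.content[:json_size])
binary = response.content[json_size:]

# Binary tensors follow the JSON in the order listed in "outputs".
offset = 0
for output in reply["outputs"]:
    size = output["parameters"]["binary_data_size"]
    if output["datatype"] == "FP32":
        data = np.frombuffer(binary, dtype="<f4", count=size // 4,
                             offset=offset).reshape(output["shape"])
        print(output["name"], data)
    offset += size
```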
200 changes: 200 additions & 0 deletions docs/protocol/extension_classification.md

# Classification Extension

This document describes Triton's classification extension. The
classification extension allows Triton to return an output as a
classification index and (optional) label instead of returning the
output as raw tensor data. Because this extension is supported,
Triton reports "classification" in the extensions field of its Server
Metadata.

An inference request can use the "classification" parameter to request
that one or more classifications be returned for an output. For such
an output the returned tensor will not have the shape and type produced
by the model, but will instead be type BYTES with shape [ batch-size,
<count> ], where each element gives the classification value, index,
and (optional) label as a single string. The <count> dimension of the
returned tensor will equal the "count" value specified in the
classification parameter.

When the classification parameter is used, Triton will determine the
top-n classifications as the n highest-valued elements in the output
tensor, compared using the output tensor's data type. For example, if
an output tensor is [ 1, 5, 10, 4 ], the highest-valued element is 10
(index 2), followed by 5 (index 1), followed by 4 (index 3), followed
by 1 (index 0). So, for example, the top-2 classifications by index
are [ 2, 1 ].

The format of the returned string is "<value>:<index>[:<label>]",
where <index> is the index of the class in the model output tensor,
<value> is the value associated with that index in the model output,
and the <label> associated with that index is optional. For example,
continuing the example from above, the returned tensor will be [
"10:2", "5:1" ]. If the model has labels associated with those
indices, the returned tensor will be [ "10:2:apple", "5:1:pickle" ].
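
The selection and string format can be sketched as follows. This is
an illustration only (Triton performs this computation server side);
numpy is assumed, and the label list in the second call is
hypothetical filler around the doc's "apple" and "pickle" examples.

```
import numpy as np


def classification_strings(values, count, labels=None):
    # Select the `count` highest-valued elements and format each as
    # "<value>:<index>[:<label>]", highest value first.
    order = np.argsort(values)[::-1][:count]
    results = []
    for index in order:
        entry = "{}:{}".format(values[index], index)
        if labels is not None:
            entry += ":" + labels[index]
        results.append(entry)
    return results


print(classification_strings(np.array([1, 5, 10, 4]), 2))
# ['10:2', '5:1']
print(classification_strings(np.array([1, 5, 10, 4]), 2,
                             ["grape", "pickle", "apple", "mango"]))
# ['10:2:apple', '5:1:pickle']
```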

## HTTP/REST

In all JSON schemas shown in this document $number, $string, $boolean,
$object and $array refer to the fundamental JSON types. #optional
indicates an optional JSON field.

The classification extension requires that the "classification"
parameter, when applied to a requested inference output, be recognized
by Triton as follows:

- "classification" : $number indicating the number of classes that
  should be returned for the output.

The following example shows how the classification parameter is used
in an inference request.

```
POST /v2/models/mymodel/infer HTTP/1.1
Host: localhost:8000
Content-Type: application/json
Content-Length: <xx>
{
"id" : "42",
"inputs" : [
{
"name" : "input0",
"shape" : [ 2, 2 ],
"datatype" : "UINT32",
"data" : [ 1, 2, 3, 4 ]
}
],
"outputs" : [
{
"name" : "output0",
"parameters" : { "classification" : 2 }
}
]
}
```

For the above request Triton will return the "output0" output tensor
as a BYTES tensor with shape [ 2 ]. Assuming the model produces the
output0 tensor [ 1.1, 3.3, 0.5, 2.4 ] from the above inputs, the
response will be the following.

```
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: <yy>
{
"id" : "42"
"outputs" : [
{
"name" : "output0",
"shape" : [ 2 ],
"datatype" : "STRING",
"data" : [ "3.3:1", "2.4:3" ]
}
]
}
```

If the model has labels associated with each classification index,
Triton will return those as well, as shown below.

```
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: <yy>
{
"id" : "42"
"outputs" : [
{
"name" : "output0",
"shape" : [ 2 ],
"datatype" : "STRING",
"data" : [ "3.3:1:index_1_label", "2.4:3:index_3_label" ]
}
]
}
```
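
Any HTTP client can issue such a request. The following is a minimal
sketch using Python's requests package (an assumption for the
example, not a requirement of the extension):

```
import requests

request = {
    "id": "42",
    "inputs": [{"name": "input0", "shape": [2, 2],
                "datatype": "UINT32", "data": [1, 2, 3, 4]}],
    "outputs": [{"name": "output0",
                 "parameters": {"classification": 2}}],
}
response = requests.post(
    "http://localhost:8000/v2/models/mymodel/infer", json=request)

# Each element has the form "<value>:<index>[:<label>]".
for entry in response.json()["outputs"][0]["data"]:
    value, index = entry.split(":")[:2]
    print(index, value)
```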

## GRPC

The classification extension requires that the "classification"
parameter, when applied to a requested inference output, be recognized
by Triton as follows:

- "classification" : int64_param indicating the number of classes that
  should be returned for the output.

The following example shows how the classification parameter is used
in an inference request.

```
ModelInferRequest {
model_name : "mymodel"
model_version : -1
inputs [
{
name : "input0"
shape : [ 2, 2 ]
datatype : "UINT32"
contents { int_contents : [ 1, 2, 3, 4 ] }
}
]
outputs [
{
name : "output0"
parameters [
{
key : "classification"
value : { int64_param : 2 }
}
]
}
]
}
```

For the above request Triton will return the "output0" output tensor
as a BYTES tensor with shape [ 2 ]. Assuming the model produces the
output0 tensor [ 1.1, 3.3, 0.5, 2.4 ] from the above inputs, the
response will be the following.

```
ModelInferResponse {
model_name : "mymodel"
outputs [
{
name : "output0"
shape : [ 2 ]
datatype : "STRING"
contents { bytes_contents : [ "3.3:1", "2.4:3" ] }
}
]
}
```
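
As a convenience illustration, the tritonclient Python package (a
separate install, and an assumption here; the protocol messages above
remain the normative interface) exposes the classification parameter
through its class_count argument:

```
import numpy as np
import tritonclient.grpc as grpcclient  # assumption: tritonclient is installed

client = grpcclient.InferenceServerClient("localhost:8001")

input0 = grpcclient.InferInput("input0", [2, 2], "UINT32")
input0.set_data_from_numpy(np.array([[1, 2], [3, 4]], dtype=np.uint32))

# class_count populates the "classification" parameter described above.
output0 = grpcclient.InferRequestedOutput("output0", class_count=2)

result = client.infer("mymodel", inputs=[input0], outputs=[output0])
print(result.as_numpy("output0"))  # e.g. [b'3.3:1' b'2.4:3']
```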