Fleet Command Deploy: Triton Inference Server Cluster

A Helm chart is provided for installing a single cluster of Triton Inference Server on Fleet Command. By default, the cluster contains a single instance of the inference server, but you can set the replicaCount configuration parameter to create a cluster of any size. Like other chart parameters, it can be overridden in the Application Configuration section when you create the Deployment.

This guide assumes you already have a functional Fleet Command Location deployed. If you do not, refer to the Fleet Command documentation.

The steps below describe how to set up a model repository, deploy the inference server with the Helm chart, and then send inference requests to the running server. You can also access a Grafana endpoint to view real-time metrics reported by the inference server.

Model Repository

If you already have a model repository, you may use it with this Helm chart. If you do not, you can check out a local copy of the inference server source repository to create an example model repository:

$ git clone https://github.com/triton-inference-server/server.git

Triton Server needs a repository of models that it will make available for inference. For this example, you will place the model repository in an S3 storage bucket, either in AWS or in on-premises object storage that is compatible with the S3 API.

$ aws s3 mb s3://triton-inference-server-repository

Following the QuickStart, download the example model repository to your system and copy it into the S3 bucket.

$ aws s3 cp --recursive docs/examples/model_repository s3://triton-inference-server-repository/model_repository
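Before running the copy command, it can be worth checking that the local repository has the layout Triton expects: one directory per model, usually holding a `config.pbtxt`, plus at least one numerically named version directory. The `check_model_repo` function below is a hypothetical helper written for this guide, a minimal sketch rather than a complete validator; it only inspects directory names and does not parse `config.pbtxt`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sanity-check a local Triton model repository.
# Expected layout (per Triton's model repository docs):
#   <repo>/<model>/config.pbtxt
#   <repo>/<model>/<version>/<model files>
# Returns non-zero if the repository is missing or if any model directory
# has no usable numeric version subdirectory.
check_model_repo() {
  local repo="$1" status=0 model version found
  if [ ! -d "$repo" ]; then
    echo "error: $repo is not a directory" >&2
    return 1
  fi
  for model in "$repo"/*/; do
    [ -d "$model" ] || continue          # glob matched nothing
    model="${model%/}"
    # config.pbtxt can be optional for some backends, so only warn.
    if [ ! -f "$model/config.pbtxt" ]; then
      echo "warning: $(basename "$model") has no config.pbtxt" >&2
    fi
    found=0
    for version in "$model"/*/; do
      version="$(basename "$version")"
      # Triton ignores version names that are non-numeric or start with 0.
      case "$version" in
        ''|0*|*[!0-9]*) ;;
        *) found=1 ;;
      esac
    done
    if [ "$found" -eq 0 ]; then
      echo "error: $(basename "$model") has no numeric version directory" >&2
      status=1
    fi
  done
  return "$status"
}
```

For example, `check_model_repo docs/examples/model_repository && aws s3 cp --recursive docs/examples/model_repository s3://triton-inference-server-repository/model_repository` uploads the repository only if the check passes.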

AWS Model Repository

To load models from AWS S3, encode the following AWS credentials in base64 and add them to the Application Configuration section when creating the Fleet Command Deployment.

$ echo -n 'REGION' | base64

$ echo -n 'ACCESS_KEY_ID' | base64

$ echo -n 'SECRET_ACCESS_KEY' | base64
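POSIX leaves the behavior of `echo -n` implementation-defined, and GNU `base64` wraps output longer than 76 characters. A minimal sketch of a more portable helper, assuming a bash shell; the `encode_credential` name is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: base64-encode one credential value.
# printf '%s' never appends a newline, and tr -d '\n' removes any line
# wrapping that GNU base64 adds to long output.
encode_credential() {
  printf '%s' "$1" | base64 | tr -d '\n'
  echo    # end the terminal line; $(...) strips this newline again
}
```

For example, `encode_credential "$AWS_SECRET_ACCESS_KEY"` prints the encoded secret, assuming the value is already set in the standard AWS CLI environment variable.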

Deploy the Inference Server

Deploy the inference server to your Location in Fleet Command by creating a Deployment. In the Application Configuration section, you can specify configuration parameters that override the defaults in values.yaml.

Note: You must provide a --model-repository parameter with a path to your prepared model repository in your S3 bucket. Otherwise, the Triton Inference Server will not start.
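The hypothetical `model_repository_flag` helper below sketches the two S3 path forms that Triton's model repository documentation describes: `s3://bucket/path` for AWS, and `s3://host:port/bucket/path` for a private S3-compatible endpoint. How the flag is passed through the Application Configuration depends on the chart's values.yaml, which this sketch does not assume.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build Triton's --model-repository argument for an
# S3 bucket, following the path forms in Triton's model repository docs:
#   AWS S3:                 s3://<bucket>/<prefix>
#   private S3-compatible:  s3://<host>:<port>/<bucket>/<prefix>
#   $1 = bucket, $2 = prefix inside the bucket, $3 = optional host:port
model_repository_flag() {
  local bucket="$1" prefix="$2" endpoint="${3:-}"
  if [ -n "$endpoint" ]; then
    printf '%s\n' "--model-repository=s3://$endpoint/$bucket/$prefix"
  else
    printf '%s\n' "--model-repository=s3://$bucket/$prefix"
  fi
}
```

For the bucket created above, `model_repository_flag triton-inference-server-repository model_repository` prints `--model-repository=s3://triton-inference-server-repository/model_repository`.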

See the Fleet Command documentation for more information.

Using Triton Inference Server

Now that the inference server is running, you can send HTTP or gRPC requests to it to perform inference. By default, the inference service is exposed with a NodePort service type, so the same port is opened on every system in a Location.

The inference server exposes an HTTP endpoint on port 30343, a gRPC endpoint on port 30344, and a Prometheus metrics endpoint on port 30345. These ports can be overridden in the Application Configuration when deploying. You can use curl to get the metadata of the inference server from the HTTP endpoint. For example, if a system in your Location has the IP address 34.83.9.133:

$ curl 34.83.9.133:30343/v2
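Right after a Deployment is created, the server may still be loading models. The `/v2/health/ready` path, part of the same KServe v2 HTTP API as `/v2`, returns HTTP 200 once the server is ready. The hypothetical `wait_for_triton` function below polls it; it is a sketch that assumes bash and curl are available on the client.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll Triton's readiness endpoint until it returns
# HTTP 200 or the retry budget is exhausted.
#   $1 = host or IP of any system in the Location
#   $2 = HTTP NodePort (defaults to 30343, the chart default above)
#   $3 = number of attempts (defaults to 30, two seconds apart)
wait_for_triton() {
  local host="$1" port="${2:-30343}" tries="${3:-30}" i code
  for ((i = 1; i <= tries; i++)); do
    # -o discards the body and -w prints only the status code, which is
    # "000" when the connection fails; || true keeps set -e shells alive.
    code="$(curl -s --max-time 5 -o /dev/null -w '%{http_code}' \
      "http://$host:$port/v2/health/ready" || true)"
    if [ "$code" = "200" ]; then
      echo "Triton at $host:$port is ready"
      return 0
    fi
    # Skip the sleep after the final attempt.
    if [ "$i" -lt "$tries" ]; then
      sleep 2
    fi
  done
  echo "Triton at $host:$port did not become ready after $tries attempts" >&2
  return 1
}
```

For example, `wait_for_triton 34.83.9.133 && curl 34.83.9.133:30343/v2` fetches the server metadata only once the server reports ready.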

Follow the QuickStart to get the example image classification client, which can perform inference with the image classification models served by the inference server. For example:

$ image_client -u 34.83.9.133:30343 -m inception_graphdef -s INCEPTION -c3 images/mug.jpg
Request 0, batch size 1
Image 'images/mug.jpg':
    504 (COFFEE MUG) = 0.723992
    968 (CUP) = 0.270953
    967 (ESPRESSO) = 0.00115997
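The Prometheus endpoint on port 30345 can also be read directly with curl. As a sketch, the hypothetical `count_successful_requests` helper sums Triton's `nv_inference_request_success` counter over all models, which is a quick way to confirm that requests like the one above were counted.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read Prometheus text-format metrics on stdin and
# print the total of every nv_inference_request_success sample, summed
# across all models and versions. Comment lines (# HELP / # TYPE) are
# skipped because their first field is "#". Assumes label values contain
# no spaces, so the metric name and labels form a single awk field.
count_successful_requests() {
  awk '$1 ~ /^nv_inference_request_success([{]|$)/ { total += $NF }
       END { print total + 0 }'
}
```

For example: `curl -s 34.83.9.133:30345/metrics | count_successful_requests`.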
