A Helm chart is provided for installing a single cluster of Triton Inference Server on Fleet Command. By default, the cluster contains a single instance of the inference server, but the replicaCount configuration parameter can be set to create a cluster of any size, as described below.
This guide assumes you already have a functional Fleet Command location deployed. Please refer to the Fleet Command Documentation.
The steps below describe how to set up a model repository, use Helm to launch the inference server, and then send inference requests to the running server. You can access a Grafana endpoint to see real-time metrics reported by the inference server.
If you already have a model repository, you may use it with this Helm chart. If you do not have a model repository, you can check out a local copy of the inference server source repository to create an example model repository:
$ git clone https://github.com/triton-inference-server/server.git
Triton Server needs a repository of models that it will make available for inferencing. For this example, you will place the model repository in an S3 storage bucket (either in AWS or in other S3 API-compatible on-premises object storage).
$ aws s3 mb s3://triton-inference-server-repository
Following the QuickStart, download the example model repository to your system and copy it into the S3 bucket.
$ aws s3 cp --recursive docs/examples/model_repository s3://triton-inference-server-repository/model_repository
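Before copying a repository to the bucket, it can help to check its layout locally. The sketch below assumes the usual Triton layout of `<repository>/<model-name>/<version>/...`, where `<version>` is a numeric directory name. The helper name `models_missing_version` is hypothetical and is not part of Triton or the AWS CLI. It only checks directory names and does not validate the model files themselves.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every model directory under "$1"
# that has no purely numeric version subdirectory.
models_missing_version() {
    repo="$1"
    for model_dir in "$repo"/*/; do
        [ -d "$model_dir" ] || continue        # skip an unmatched glob
        has_version=no
        for sub in "$model_dir"*/; do
            [ -d "$sub" ] || continue
            name=$(basename "$sub")
            case "$name" in
                ''|*[!0-9]*) ;;                # empty or contains a non-digit
                *) has_version=yes ;;          # digits only: a version directory
            esac
        done
        [ "$has_version" = yes ] || basename "$model_dir"
    done
}

# Example usage, run from the root of the cloned server repository:
#   models_missing_version docs/examples/model_repository
```

An empty result means every model directory has at least one version directory. Any name printed is a model that probably still needs its files fetched or a version directory created.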
To load the model from AWS S3, you need to encode the following AWS credentials in base64 and add them to the Application Configuration section when creating the Fleet Command Deployment.
echo -n 'REGION' | base64
echo -n 'SECRET_KEY_ID' | base64
echo -n 'SECRET_ACCESS_KEY' | base64
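The same encoding step can be wrapped in a small function, sketched below. The function name `b64` and the sample region `us-west-2` are illustrative assumptions. `printf '%s'` is used in place of `echo -n` because the behavior of `echo -n` differs between shells, and `tr -d '\n'` removes any line wrapping that `base64` may add to long values.

```shell
#!/bin/sh
# Encode the first argument as single-line base64.
# printf '%s' writes the value without a trailing newline.
b64() {
    printf '%s' "$1" | base64 | tr -d '\n'
}

# Sample, non-secret value for illustration only.
echo "region: $(b64 'us-west-2')"   # prints: region: dXMtd2VzdC0y

# For real credentials, pass your own values, for example:
#   b64 "$AWS_ACCESS_KEY_ID"
#   b64 "$AWS_SECRET_ACCESS_KEY"
```

Base64 is an encoding, not encryption, so treat the encoded strings with the same care as the original credentials.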
Deploy the inference server to your Location in Fleet Command by creating a Deployment. You can override the default values in values.yaml by specifying configuration parameters in the Application Configuration section.
Note: You must provide a --model-repository parameter with a path to your prepared model repository in your S3 bucket. Otherwise, the Triton Inference Server will not start.
See Fleet Command documentation for more info.
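As a minimal sketch, the value of the `--model-repository` parameter for the bucket and path used in this guide can be assembled as follows. The function name `model_repository_arg` is hypothetical. Where the resulting argument goes in the Application Configuration depends on the chart's values.yaml, which is not shown here.

```shell
#!/bin/sh
# Hypothetical helper: build the --model-repository argument for an
# S3-hosted repository. $1 = bucket name, $2 = path inside the bucket.
model_repository_arg() {
    bucket="$1"
    prefix="$2"
    printf '%s\n' "--model-repository=s3://${bucket}/${prefix}"
}

# Bucket and path taken from the upload step in this guide.
model_repository_arg triton-inference-server-repository model_repository
```

The path printed should match the destination used when the repository was copied to S3. A mismatch here is a likely cause of the server failing to start.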
Now that the inference server is running, you can send HTTP or gRPC requests to it to perform inferencing. By default, the inferencing service is exposed with a NodePort service type, where the same port is opened on all systems in a Location.
The inference server exposes an HTTP endpoint on port 30343, a gRPC endpoint on port 30344, and a Prometheus metrics endpoint on port 30345. These ports can be overridden in the application configuration when deploying. You can use curl to get the metadata of the inference server from the HTTP endpoint. For example, if a system in your location has the IP 34.83.9.133:
$ curl 34.83.9.133:30343/v2
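Beyond the metadata request, a readiness probe and a metrics fetch can be scripted along the following lines. This is a sketch that assumes the default ports listed in this guide, the `/v2/health/ready` path of Triton's HTTP API, and the conventional `/metrics` path on the Prometheus port. The function names and the `TRITON_HOST`, `HTTP_PORT`, and `METRICS_PORT` variables are hypothetical.

```shell
#!/bin/sh
# Defaults taken from this guide; override them through the environment.
TRITON_HOST="${TRITON_HOST:-34.83.9.133}"
HTTP_PORT="${HTTP_PORT:-30343}"
METRICS_PORT="${METRICS_PORT:-30345}"

# Build a URL for the given port ($1) and path ($2).
triton_url() {
    printf 'http://%s:%s%s\n' "$TRITON_HOST" "$1" "$2"
}

# Exit status 0 only if the readiness probe returns a successful HTTP status.
triton_ready() {
    curl --silent --fail --max-time 5 --output /dev/null \
        "$(triton_url "$HTTP_PORT" /v2/health/ready)"
}

# Print the Prometheus metrics exposed by the server.
triton_metrics() {
    curl --silent --fail --max-time 5 \
        "$(triton_url "$METRICS_PORT" /metrics)"
}

# Example usage:
#   triton_ready && echo "server is ready"
#   triton_metrics | head
```

If you override the ports in the application configuration, set `HTTP_PORT` and `METRICS_PORT` to the same values before calling these functions.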
Follow the QuickStart to get the example image classification client, which can be used to perform inferencing with the image classification models served by the inference server. For example:
$ image_client -u 34.83.9.133:30343 -m inception_graphdef -s INCEPTION -c3 mug.jpg
Request 0, batch size 1
Image 'images/mug.jpg':
504 (COFFEE MUG) = 0.723992
968 (CUP) = 0.270953
967 (ESPRESSO) = 0.00115997