To simplify communication with Triton, the Triton project provides C++ and Python client libraries. This section describes several example applications that show how to use these libraries. Many of these examples use models from the example model repository.
- C++ and Python versions of image_client, an example application that uses the C++ or Python client library to execute image classification models on Triton. See Image Classification Example.
- Several simple C++ examples show how to use the C++ library to communicate with Triton to perform inferencing and other tasks. The C++ examples demonstrating the HTTP/REST client are named with a simple_http_ prefix and the examples demonstrating the GRPC client are named with a simple_grpc_ prefix. See Simple Example Applications.
- Several simple Python examples show how to use the Python library to communicate with Triton to perform inferencing and other tasks. The Python examples demonstrating the HTTP/REST client are named with a simple_http_ prefix and the examples demonstrating the GRPC client are named with a simple_grpc_ prefix. See Simple Example Applications.
- A couple of Python examples communicate with Triton using a Python GRPC API generated by the protoc compiler. grpc_client.py is a simple example that shows basic API usage. grpc_image_client.py is functionally equivalent to image_client but uses a generated GRPC client stub to communicate with Triton.
- The protoc compiler can generate a GRPC API in a large number of programming languages. See src/clients/go for an example for the Go programming language.
The client examples are included with the client libraries when you download from NGC, download from GitHub or build using Docker or cmake.
This section describes several of the simple example applications and the features that they illustrate.
Some frameworks support tensors where each element in the tensor is a string (see Datatypes for information on supported datatypes).
String tensors are demonstrated in the C++ example applications simple_http_string_infer_client.cc and simple_grpc_string_infer_client.cc, and in the Python example applications simple_http_string_infer_client.py and simple_grpc_string_infer_client.py.
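As a rough illustration of what "each element is a string" can mean on the wire, the sketch below packs variable-length elements back to back, each preceded by a 4-byte little-endian length. That layout is an assumption made for this example, and the helper names serialize_strings and deserialize_strings are hypothetical; the client libraries handle this step for you, so treat the code as a picture of the idea rather than the library's implementation.

```python
import struct


def serialize_strings(elements):
    """Pack elements as <4-byte little-endian length><raw bytes>, back to back.

    Assumed layout for illustration only.
    """
    chunks = []
    for element in elements:
        data = element.encode("utf-8") if isinstance(element, str) else bytes(element)
        chunks.append(struct.pack("<I", len(data)))
        chunks.append(data)
    return b"".join(chunks)


def deserialize_strings(buffer):
    """Inverse of serialize_strings: walk the buffer and return a list of bytes."""
    elements = []
    offset = 0
    while offset < len(buffer):
        (length,) = struct.unpack_from("<I", buffer, offset)
        offset += 4
        elements.append(buffer[offset:offset + length])
        offset += length
    return elements


packed = serialize_strings(["mug", "sports car"])
print(deserialize_strings(packed))
```

Because the elements differ in length, the total byte size of a string tensor cannot be computed from its shape alone, which is why a length prefix (or a similar scheme) is needed.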
Using system shared memory to communicate tensors between the client library and Triton can significantly improve performance in some cases.
Using system shared memory is demonstrated in the C++ example applications simple_http_shm_client.cc and simple_grpc_shm_client.cc, and in the Python example applications simple_http_shm_client.py and simple_grpc_shm_client.py.
Python does not have a standard way of allocating and accessing shared memory, so a simple system shared memory module is provided as an example. It can be used with the Python client library to create, set, and destroy system shared memory.
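The create, set, and destroy lifecycle can be sketched as follows. This sketch uses multiprocessing.shared_memory from the Python 3.8+ standard library purely as a stand-in; it is not the module provided with the client examples, and the function names create_region, set_region, get_region, and destroy_region are hypothetical. Registering the region with Triton is not shown.

```python
import os
import struct
from multiprocessing import shared_memory


def create_region(name, byte_size):
    """Create a named system shared memory region of at least byte_size bytes."""
    return shared_memory.SharedMemory(name=name, create=True, size=byte_size)


def set_region(region, values, offset=0):
    """Copy a list of int32 values into the region, starting at offset."""
    payload = struct.pack(f"<{len(values)}i", *values)
    region.buf[offset:offset + len(payload)] = payload
    return len(payload)


def get_region(region, count, offset=0):
    """Read count int32 values back out of the region."""
    return list(struct.unpack_from(f"<{count}i", region.buf, offset))


def destroy_region(region):
    """Detach from the region and remove it from the system."""
    region.close()
    region.unlink()


# The process ID keeps the hypothetical region name unique per run.
region = create_region(f"example_input_{os.getpid()}", 64)
try:
    set_region(region, list(range(16)))
    print(get_region(region, 16))
finally:
    destroy_region(region)
```

The point of the pattern is that the tensor bytes are written once into memory that both processes can map, so only the region name, offset, and byte size would need to travel with the request instead of the tensor contents.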
Using CUDA shared memory to communicate tensors between the client library and Triton can significantly improve performance in some cases.
Using CUDA shared memory is demonstrated in the C++ example applications simple_http_cudashm_client.cc and simple_grpc_cudashm_client.cc, and in the Python example applications simple_http_cudashm_client.py and simple_grpc_cudashm_client.py.
Python does not have a standard way of allocating and accessing shared memory, so a simple CUDA shared memory module is provided as an example. It can be used with the Python client library to create, set, and destroy CUDA shared memory.
When performing inference using a stateful model, a client must identify which inference requests belong to the same sequence and also when a sequence starts and ends.
Each sequence is identified with a sequence ID that is provided when an inference request is made. It is up to the client to create a unique sequence ID. For each sequence, the first inference request should be marked as the start of the sequence and the last inference request should be marked as the end of the sequence.
The use of the sequence ID and the start and end flags is demonstrated in the C++ example applications simple_http_sequence_stream_infer_client.cc and simple_grpc_sequence_stream_infer_client.cc, and in the Python example applications simple_http_sequence_stream_infer_client.py and simple_grpc_sequence_stream_infer_client.py.
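The bookkeeping described above can be sketched without any client library. In this minimal example the dictionary keys sequence_id, sequence_start, and sequence_end, and the helpers new_sequence_id and build_sequence, are hypothetical names chosen for illustration; the counter only guarantees uniqueness within a single client process, so several clients sharing one server would need a different scheme.

```python
from itertools import count

# Process-local counter; an assumption made to keep the sketch simple.
_next_sequence_id = count(start=1)


def new_sequence_id():
    """Return an ID that is unique within this client process."""
    return next(_next_sequence_id)


def build_sequence(values):
    """Tag each value with the sequence ID and the start and end flags."""
    sequence_id = new_sequence_id()
    last = len(values) - 1
    return [
        {
            "sequence_id": sequence_id,
            "sequence_start": index == 0,   # only the first request starts the sequence
            "sequence_end": index == last,  # only the last request ends it
            "value": value,
        }
        for index, value in enumerate(values)
    ]


for request in build_sequence([11, 7, 5]):
    print(request)
```

A sequence with a single request carries both flags at once, since that request is both the first and the last.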
The image classification example that uses the C++ client API is available at src/clients/c++/examples/image_client.cc. The Python version of the image classification client is available at src/clients/python/examples/image_client.py.
To use image_client (or image_client.py) you must first have a running Triton that is serving one or more image classification models. The image_client application requires that the model have a single image input and produce a single classification output. If you don't have a model repository with image classification models see QuickStart for instructions on how to create one.
Once Triton is running you can use the image_client application to send inference requests. You can specify a single image or a directory holding images. Here we send a request to the inception_graphdef model for an image from the qa/images directory.
$ image_client -m inception_graphdef -s INCEPTION qa/images/mug.jpg
Request 0, batch size 1
Image '../qa/images/mug.jpg':
0.754130 (505) = COFFEE MUG
The Python version of the application accepts the same command-line arguments.
$ python image_client.py -m inception_graphdef -s INCEPTION qa/images/mug.jpg
Request 0, batch size 1
Image '../qa/images/mug.jpg':
0.826384 (505) = COFFEE MUG
The image_client and image_client.py applications use the client libraries to talk to Triton. By default image_client instructs the client library to use the HTTP/REST protocol, but you can use the GRPC protocol by providing the -i flag. When you do, you must also use the -u flag to point at the GRPC endpoint on Triton.
$ image_client -i grpc -u localhost:8001 -m inception_graphdef -s INCEPTION qa/images/mug.jpg
Request 0, batch size 1
Image '../qa/images/mug.jpg':
0.754130 (505) = COFFEE MUG
By default the client prints the most probable classification for the image. Use the -c flag to see more classifications.
$ image_client -m inception_graphdef -s INCEPTION -c 3 qa/images/mug.jpg
Request 0, batch size 1
Image '../qa/images/mug.jpg':
0.754130 (505) = COFFEE MUG
0.157077 (969) = CUP
0.002880 (968) = ESPRESSO
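The effect of asking for more classifications can be pictured as a top-k selection over per-class scores. The sketch below is only an illustration of that selection and of the output format shown above; the function top_k is a hypothetical helper, the scores are copied from the example output, and the sketch says nothing about where the ranking is actually computed.

```python
def top_k(scores, labels, k=1):
    """Return the k highest-scoring classes as formatted lines.

    scores: {class index: probability}; labels: {class index: name}.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [f"{scores[i]:f} ({i}) = {labels[i]}" for i in ranked[:k]]


# Values taken from the example output above.
scores = {968: 0.002880, 505: 0.754130, 969: 0.157077}
labels = {505: "COFFEE MUG", 968: "ESPRESSO", 969: "CUP"}

for line in top_k(scores, labels, k=3):
    print(line)
```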
The -b flag allows you to send a batch of images for inferencing. The image_client application will form the batch from the image or images that you specified. If the batch is bigger than the number of images then image_client will just repeat the images to fill the batch.
$ image_client -m inception_graphdef -s INCEPTION -c 3 -b 2 qa/images/mug.jpg
Request 0, batch size 2
Image '../qa/images/mug.jpg':
0.754130 (505) = COFFEE MUG
0.157077 (969) = CUP
0.002880 (968) = ESPRESSO
Image '../qa/images/mug.jpg':
0.754130 (505) = COFFEE MUG
0.157077 (969) = CUP
0.002880 (968) = ESPRESSO
Provide a directory instead of a single image to perform inferencing on all images in the directory.
$ image_client -m inception_graphdef -s INCEPTION -c 3 -b 2 qa/images
Request 0, batch size 2
Image '/opt/tritonserver/qa/images/car.jpg':
0.819196 (818) = SPORTS CAR
0.033457 (437) = BEACH WAGON
0.031232 (480) = CAR WHEEL
Image '/opt/tritonserver/qa/images/mug.jpg':
0.754130 (505) = COFFEE MUG
0.157077 (969) = CUP
0.002880 (968) = ESPRESSO
Request 1, batch size 2
Image '/opt/tritonserver/qa/images/vulture.jpeg':
0.977632 (24) = VULTURE
0.000613 (9) = HEN
0.000560 (137) = EUROPEAN GALLINULE
Image '/opt/tritonserver/qa/images/car.jpg':
0.819196 (818) = SPORTS CAR
0.033457 (437) = BEACH WAGON
0.031232 (480) = CAR WHEEL
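The way batches are filled in the two outputs above (a single image repeated, and three images wrapping around to the first) can be sketched as follows. The function form_batches is a hypothetical helper written to reproduce that observed behavior; it is not taken from the image_client source.

```python
def form_batches(images, batch_size):
    """Split images into batches, repeating from the start to fill the last batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if not images:
        return []
    total = len(images)
    # Enough requests to cover every image at least once (ceiling division).
    num_requests = -(-total // batch_size)
    batches = []
    for request in range(num_requests):
        start = request * batch_size
        # The modulo wraps around to the first image when the list runs out.
        batches.append([images[(start + j) % total] for j in range(batch_size)])
    return batches


print(form_batches(["car.jpg", "mug.jpg", "vulture.jpeg"], 2))
print(form_batches(["mug.jpg"], 2))
```

With three images and a batch size of 2, the second request contains vulture.jpeg followed by car.jpg again, matching the output shown above.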
The grpc_image_client.py application behaves the same as image_client except that, instead of using the client library, it uses the generated GRPC library to communicate with Triton.
In comparison to the image classification example above, this example uses an ensemble of an image-preprocessing model implemented as a DALI backend and a TensorFlow Inception model. The ensemble model allows you to send the raw image binaries in the request and receive classification results without preprocessing the images on the client.
To try this example you should follow the DALI ensemble example instructions.