Deploying the BERT model on Triton Inference Server

This folder contains instructions for deployment to run inference on Triton Inference Server, as well as detailed performance analysis. The purpose of this document is to help you with achieving the best inference performance.

Table of contents

  • Solution overview
  • Introduction
  • Deployment process
  • Setup
  • Quick Start Guide
  • Release Notes
  • Changelog
  • Known issues

Solution overview

Introduction

The NVIDIA Triton Inference Server provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
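
A minimal request sketch, assuming the Python tritonclient package (pip install tritonclient[http]) and a server that already has a model named "bert" loaded; the tensor names, shapes, and data types below are placeholders, and the real ones come from the generated model configuration:

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server listening on the default HTTP port.
client = httpclient.InferenceServerClient(url="localhost:8000")

seq_len = 384
input_ids = np.zeros((1, seq_len), dtype=np.int64)      # token ids from a tokenizer
attention_mask = np.ones((1, seq_len), dtype=np.int64)  # 1 for real tokens, 0 for padding

# Describe the request tensors; names and dtypes must match the model configuration.
inputs = [
    httpclient.InferInput("input_ids", list(input_ids.shape), "INT64"),
    httpclient.InferInput("attention_mask", list(attention_mask.shape), "INT64"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

outputs = [httpclient.InferRequestedOutput("logits")]

# Send a single synchronous inference request over HTTP.
response = client.infer(model_name="bert", inputs=inputs, outputs=outputs)
print(response.as_numpy("logits").shape)
```

The same request can be sent over gRPC by swapping tritonclient.http for tritonclient.grpc and pointing the client at port 8001.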

This README provides step-by-step deployment instructions for models generated during training (as described in the model README), together with the deployment scripts that ensure optimal GPU utilization during inference on the Triton Inference Server.

Deployment process

The deployment process consists of two steps:

  1. Conversion.

The purpose of conversion is to find the best-performing model format supported by the Triton Inference Server. Triton Inference Server uses a number of runtime backends, such as TensorRT, LibTorch, and ONNX Runtime, to support various model types. Refer to the Triton documentation for a list of available backends.

  2. Configuration.

Model configuration on the Triton Inference Server, which generates the necessary configuration files. Illustrative sketches of both steps are shown after this list.
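
As an illustration of the conversion step, the sketch below exports a fine-tuned Hugging Face BERT checkpoint to ONNX so that it could be served by the ONNX Runtime backend. The checkpoint path, tensor names, and sequence length are placeholders; the repository's runner script performs the actual conversion and may choose a different format.

```python
import torch
from transformers import BertForQuestionAnswering

# Placeholder checkpoint path; return_dict=False yields tuple outputs,
# which trace cleanly during ONNX export.
model = BertForQuestionAnswering.from_pretrained(
    "path/to/finetuned-checkpoint", return_dict=False)
model.eval()

seq_len = 384
dummy = (
    torch.zeros(1, seq_len, dtype=torch.long),  # input_ids
    torch.ones(1, seq_len, dtype=torch.long),   # attention_mask
    torch.zeros(1, seq_len, dtype=torch.long),  # token_type_ids
)

names = ["input_ids", "attention_mask", "token_type_ids"]
torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=names,
    output_names=["start_logits", "end_logits"],
    # Let the batch dimension vary so Triton can batch requests.
    dynamic_axes={n: {0: "batch"} for n in names + ["start_logits", "end_logits"]},
    opset_version=13,
)
```

The configuration step produces a config.pbtxt file per model. A hand-written example for an ONNX model is sketched below with placeholder names and shapes; the runner script generates the real file for the chosen backend:

```
name: "bert"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  { name: "input_ids"       data_type: TYPE_INT64 dims: [ 384 ] },
  { name: "attention_mask"  data_type: TYPE_INT64 dims: [ 384 ] },
  { name: "token_type_ids"  data_type: TYPE_INT64 dims: [ 384 ] }
]
output [
  { name: "start_logits" data_type: TYPE_FP32 dims: [ 384 ] },
  { name: "end_logits"   data_type: TYPE_FP32 dims: [ 384 ] }
]
instance_group [ { count: 1, kind: KIND_GPU } ]
```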

After deployment, the Triton Inference Server is used to evaluate the converted model in two steps:

  1. Accuracy tests.

Produce results that are tested against the given accuracy thresholds.

  2. Performance tests.

Produce latency and throughput results for offline (static batching) and online (dynamic batching) scenarios; a sketch of the dynamic batching settings involved follows this list.
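
In the online scenario, batching behavior is controlled by the dynamic_batching block of config.pbtxt. The values below are purely illustrative; the generated configuration holds the settings actually benchmarked:

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```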

All of these steps are executed by the provided runner script. Refer to the Quick Start Guide for details.

Setup

Ensure you have the following components:

Quick Start Guide

Deployment is supported for the following architectures. For the deployment steps, refer to the appropriate readme file:

Release Notes

We’re constantly refining and improving our performance on AI and HPC workloads, with frequent updates to our software stack. For our latest performance data, refer to these pages for AI and HPC benchmarks.

Changelog

Known issues

  • There are no known issues with this model.