Deploying the BERT model on Triton Inference Server

This folder contains instructions for deployment to run inference on Triton Inference Server, as well as detailed performance analysis. The purpose of this document is to help you with achieving the best inference performance.

Table of contents

  • Solution overview
  • Introduction
  • Deployment process
  • Setup
  • Quick Start Guide
  • Release Notes
  • Changelog
  • Known issues

Solution overview

Introduction

The NVIDIA Triton Inference Server provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
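
A minimal request sketch, assuming the Python tritonclient package (pip install tritonclient[http]) and a server that already has a model named "bert" loaded; the tensor names, shapes, and data types below are placeholders, and the real ones come from the generated model configuration:

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server listening on the default HTTP port.
client = httpclient.InferenceServerClient(url="localhost:8000")

seq_len = 384
input_ids = np.zeros((1, seq_len), dtype=np.int64)      # token ids from a tokenizer
attention_mask = np.ones((1, seq_len), dtype=np.int64)  # 1 for real tokens, 0 for padding

# Describe the request tensors; names and dtypes must match the model configuration.
inputs = [
    httpclient.InferInput("input_ids", list(input_ids.shape), "INT64"),
    httpclient.InferInput("attention_mask", list(attention_mask.shape), "INT64"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

outputs = [httpclient.InferRequestedOutput("logits")]

# Send a single synchronous inference request over HTTP.
response = client.infer(model_name="bert", inputs=inputs, outputs=outputs)
print(response.as_numpy("logits").shape)
```

The same request can be sent over gRPC by swapping tritonclient.http for tritonclient.grpc and pointing the client at port 8001.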

This README provides step-by-step deployment instructions for models generated during training (as described in the model README), together with the deployment scripts that ensure optimal GPU utilization during inference on the Triton Inference Server.

Deployment process

The deployment process consists of two steps:

  1. Conversion.

The purpose of conversion is to find the best-performing model format supported by the Triton Inference Server. Triton Inference Server uses a number of runtime backends, such as TensorRT, LibTorch, and ONNX Runtime, to support various model types. Refer to the Triton documentation for a list of available backends.

  2. Configuration.

Model configuration on the Triton Inference Server, which generates the necessary configuration files. Illustrative sketches of both steps are shown after this list.
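
As an illustration of the conversion step, the sketch below exports a fine-tuned Hugging Face BERT checkpoint to ONNX so that it could be served by the ONNX Runtime backend. The checkpoint path, tensor names, and sequence length are placeholders; the repository's runner script performs the actual conversion and may choose a different format.

```python
import torch
from transformers import BertForQuestionAnswering

# Placeholder checkpoint path; return_dict=False yields tuple outputs,
# which trace cleanly during ONNX export.
model = BertForQuestionAnswering.from_pretrained(
    "path/to/finetuned-checkpoint", return_dict=False)
model.eval()

seq_len = 384
dummy = (
    torch.zeros(1, seq_len, dtype=torch.long),  # input_ids
    torch.ones(1, seq_len, dtype=torch.long),   # attention_mask
    torch.zeros(1, seq_len, dtype=torch.long),  # token_type_ids
)

names = ["input_ids", "attention_mask", "token_type_ids"]
torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=names,
    output_names=["start_logits", "end_logits"],
    # Let the batch dimension vary so Triton can batch requests.
    dynamic_axes={n: {0: "batch"} for n in names + ["start_logits", "end_logits"]},
    opset_version=13,
)
```

The configuration step produces a config.pbtxt file per model. A hand-written example for an ONNX model is sketched below with placeholder names and shapes; the runner script generates the real file for the chosen backend:

```
name: "bert"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  { name: "input_ids"       data_type: TYPE_INT64 dims: [ 384 ] },
  { name: "attention_mask"  data_type: TYPE_INT64 dims: [ 384 ] },
  { name: "token_type_ids"  data_type: TYPE_INT64 dims: [ 384 ] }
]
output [
  { name: "start_logits" data_type: TYPE_FP32 dims: [ 384 ] },
  { name: "end_logits"   data_type: TYPE_FP32 dims: [ 384 ] }
]
instance_group [ { count: 1, kind: KIND_GPU } ]
```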

After deployment, the Triton Inference Server is used to evaluate the converted model in two steps:

  1. Accuracy tests.

Produce results that are tested against the given accuracy thresholds.

  2. Performance tests.

Produce latency and throughput results for offline (static batching) and online (dynamic batching) scenarios; a sketch of the dynamic batching settings involved follows this list.
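
In the online scenario, batching behavior is controlled by the dynamic_batching block of config.pbtxt. The values below are purely illustrative; the generated configuration holds the settings actually benchmarked:

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```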

All of these steps are executed by the provided runner script. Refer to the Quick Start Guide for details.

Setup

Ensure you have the following components:

Quick Start Guide

Deployment is supported for the following architectures. For the deployment steps, refer to the appropriate readme file:

Release Notes

We’re constantly refining and improving our performance on AI and HPC workloads, with frequent updates to our software stack. For our latest performance data, refer to these pages for AI and HPC benchmarks.

Changelog

Known issues

  • There are no known issues with this model.