# TensorRT-LLM Backend
The Triton backend for [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).
## Introduction
This document describes how to serve models with the TensorRT-LLM Triton backend. The backend is only an interface that calls TensorRT-LLM from Triton; the heavy lifting, in terms of implementation, can be found in the TensorRT-LLM source code.
## Setup Environment
### Prepare the repository
Clone the repository, and update submodules recursively.
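For example (the repository URL below is an assumption based on the upstream `tensorrtllm_backend` project; substitute your fork if needed):

```shell
# Clone the backend repository and pull in its submodules recursively.
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git submodule update --init --recursive
# If large files are tracked with Git LFS, also run:
#   git lfs install && git lfs pull
```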
The rest of the documentation assumes that the Docker image has already been built.
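A typical build invocation looks like the following; the image tag and Dockerfile path are illustrative assumptions, so check the repository for the actual Dockerfile name:

```shell
# Build the Triton + TensorRT-LLM image from the repository root.
# Tag and Dockerfile path are placeholders; adjust to your checkout.
DOCKER_BUILDKIT=1 docker build \
    -t triton_trt_llm \
    -f dockerfile/Dockerfile.trt_llm_backend .
```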
### How to select the models
There are two models under `all_models/`:
- `gpt`: a Python implementation of the TensorRT-LLM Triton backend
- `inflight_batcher_llm`: a C++ implementation of the TensorRT-LLM Triton backend
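Either implementation can be served by copying it into a Triton model repository. A minimal sketch, where `triton_model_repo` is just an illustrative directory name:

```shell
# Copy the chosen implementation into a fresh Triton model repository.
mkdir -p triton_model_repo
cp -r all_models/inflight_batcher_llm/* triton_model_repo/
```

Triton can then be pointed at this directory with `--model-repository=triton_model_repo`.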
### Prepare TensorRT-LLM engines
Follow the [guide](https://github.com/NVIDIA/TensorRT-LLM/blob/main/README.md) in TensorRT-LLM to prepare the engines for deployment.
For example, see the TensorRT-LLM GPT documentation for instructions on building GPT engines: [link](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gpt#usage)
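As a rough sketch, the build follows the linked example; every script name and flag below is an assumption that may differ between TensorRT-LLM versions, so consult the linked documentation for the exact commands:

```shell
# Sketch only: script names and flags follow the TensorRT-LLM GPT example
# and may vary by version.
cd tensorrt_llm/examples/gpt

# Convert the Hugging Face checkpoint into a TensorRT-LLM format.
python3 hf_gpt_convert.py -i gpt2 -o ./c-model/gpt2 --storage-type float16

# Build the engine; in-flight batching needs these build-time options.
python3 build.py --model_dir ./c-model/gpt2/1-gpu \
                 --dtype float16 \
                 --use_inflight_batching \
                 --paged_kv_cache \
                 --output_dir engines/fp16/1-gpu
```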
Each implemented backend consists of several components, and each component has its own `config.pbtxt`. These files mainly describe the server settings and the TensorRT-LLM inference hyperparameters. Take `all_models/inflight_batcher_llm` as an example.
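For instance, the core component's `config.pbtxt` points Triton at the compiled engines through a parameter block like the following; the parameter name and path here are illustrative assumptions, so check the shipped `config.pbtxt` for the exact keys:

```
parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/path/to/engines"
  }
}
```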
The accompanying Dockerfile installs TensorRT 9.0.1 as follows:

```dockerfile
RUN wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/secure/9.0.1/tars/TensorRT-9.0.1.4.Linux.x86_64-gnu.cuda-12.2.tar.gz -P /workspace
RUN tar -xvf /workspace/TensorRT-9.0.1.4.Linux.x86_64-gnu.cuda-12.2.tar.gz -C /usr/local/ && mv /usr/local/TensorRT-9.0.1.4 /usr/local/tensorrt
RUN pip install /usr/local/tensorrt/python/tensorrt-9.0.1*cp310-none-linux_x86_64.whl && rm -fr /workspace/TensorRT-9.0.1.4.Linux.x86_64-gnu.cuda-12.2.tar.gz
```