
Commit 357e763

Merge pull request #2 from triton-inference-server/kaiyu/update
Update TensorRT-LLM backend code
2 parents d5c3ef6 + 7d171ca commit 357e763

File tree: 12 files changed (+1123 −834 lines)

README.md

Lines changed: 86 additions & 7 deletions
@@ -1,7 +1,78 @@
 # TensorRT-LLM Backend
-The Triton backend for TensorRT-LLM.
+The Triton backend for [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).
 
-## Usage
+## Introduction
+
+This document describes how to serve models with the TensorRT-LLM Triton backend. The backend is only a thin interface that calls TensorRT-LLM from Triton; the heavy lifting of the implementation lives in the TensorRT-LLM source code.
+
+## Setup Environment
+
+### Prepare the repository
+
+Clone the repository and update the submodules recursively.
+```
+git clone git@github.com:triton-inference-server/tensorrtllm_backend.git
+git submodule update --init --recursive
+git lfs install
+git lfs pull
+```
+
+### Build the Docker image
+```
+cd tensorrtllm_backend
+docker build -f dockerfile/Dockerfile.trt_llm_backend -t tritonserver:w_trt_llm_backend .
+```
+
+The rest of the documentation assumes that the Docker image has already been built.
+
+### How to select the models
+There are two models under `all_models/`:
+- gpt: A Python implementation of the TensorRT-LLM Triton backend
+- inflight_batcher_llm: A C++ implementation of the TensorRT-LLM Triton backend
+
+### Prepare TensorRT-LLM engines
+Follow the [guide](https://github.com/NVIDIA/TensorRT-LLM/blob/main/README.md) in TensorRT-LLM to prepare the engines for deployment.
+
+For example, see the TensorRT-LLM GPT example documentation for instructions on building GPT engines: [link](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gpt#usage)
+
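For orientation only, a single-GPU GPT2 engine build with the TensorRT-LLM example scripts looks roughly like the sketch below. The script names (`hf_gpt_convert.py`, `build.py`) and flags are taken from the TensorRT-LLM GPT example and may differ between TensorRT-LLM versions; the linked guide is the source of truth.

```bash
# Sketch only: run inside the TensorRT-LLM repository; script names and flags
# follow the GPT example linked above and may change between releases.
cd tensorrt_llm/examples/gpt

# Convert the Hugging Face GPT2 checkpoint into the format expected by build.py.
python3 hf_gpt_convert.py -i gpt2 -o ./c-model/gpt2 \
    --tensor-parallelism 1 --storage-type float16

# Build a single-GPU FP16 engine; the plugin/batching flags below are what the
# C++ in-flight batching backend expects, but verify them against your version.
python3 build.py --model_dir=./c-model/gpt2/1-gpu \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --remove_input_padding \
    --use_inflight_batching \
    --paged_kv_cache \
    --output_dir=engines/fp16/1-gpu
```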
+### How to set the model configuration
+
+**TensorRT-LLM Triton Serving Configuration: config.pbtxt**
+
+- This file is loaded by the Triton server.
+- It mainly describes the server and TensorRT-LLM inference hyperparameters.
+
+Each implemented backend consists of several components, and each component has its own config.pbtxt. Taking `all_models/inflight_batcher_llm` as an example:
+- preprocessing: Used for tokenizing.
+- tensorrt_llm: Inferencing.
+- postprocessing: Used for de-tokenizing.
+- ensemble: Connects preprocessing -> tensorrt_llm -> postprocessing.
+
+The following tables show the fields that need to be modified before deployment:
+
+*all_models/inflight_batcher_llm/preprocessing/config.pbtxt*
+
+| Name | Description |
+| :--------------: | :----------------------------------------------------------: |
+| `tokenizer_dir`  | The path to the tokenizer for the model |
+| `tokenizer_type` | The type of the tokenizer for the model; t5, auto and llama are supported |
+
+*all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt*
+
+| Name | Description |
+| :--------------: | :----------------------------------------------------------: |
+| `decoupled`      | Controls streaming. Decoupled mode must be set to true when using the streaming option from the client. |
+| `gpt_model_type` | "inflight_fused_batching" to enable in-flight batching, or "V1" to disable it |
+| `gpt_model_path` | Path to the TensorRT-LLM engines for deployment |
+
+*all_models/inflight_batcher_llm/postprocessing/config.pbtxt*
+
+| Name | Description |
+| :--------------: | :----------------------------------------------------------: |
+| `tokenizer_dir`  | The path to the tokenizer for the model |
+| `tokenizer_type` | The type of the tokenizer for the model; t5, auto and llama are supported |
+
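The `${...}` strings above are template placeholders in the shipped config files (the `${engine_dir}` placeholder, for example, is visible in the `tensorrt_llm/config.pbtxt` diff further down). A minimal way to fill them is plain `sed`; the sketch below assumes a GPT2 tokenizer and a single-GPU engine directory, and the placeholder names other than `${engine_dir}` are taken from the tables rather than verified against every config file.

```bash
# Illustrative only: substitute the template placeholders before starting Triton.
# Paths are assumptions; point them at your own tokenizer and engine directories,
# and check each config.pbtxt for the exact placeholder names it uses.
cd all_models/inflight_batcher_llm

# Tokenizer settings shared by preprocessing and postprocessing.
sed -i 's|${tokenizer_dir}|/workspace/gpt2|g; s|${tokenizer_type}|auto|g' \
    preprocessing/config.pbtxt postprocessing/config.pbtxt

# Engine location for the tensorrt_llm model.
sed -i 's|${engine_dir}|/workspace/engines/fp16/1-gpu|g' tensorrt_llm/config.pbtxt
```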
+## Run Serving on Single Node
 
 ### Launch the backend *within Docker*
 
@@ -15,7 +86,7 @@ nvidia-docker run -it --rm -e LOCAL_USER_ID=`id -u ${USER}` --shm-size=2g -v <yo
 3. all_models/<model>/postprocessing/config.pbtxt
 
 # 3. Launch triton server
-python3 scripts/launch_triton_server.py --world_size=1 \
+python3 scripts/launch_triton_server.py --world_size=<num_gpus> \
     --model_repo=all_models/<model>
 ```
 
@@ -56,20 +127,28 @@ ${TRITONSERVER} --model-repository=${MODEL_REPO} --disable-auto-complete-config
 sbatch tensorrt_llm_triton.sub
 ```
 
+When successfully deployed, the server produces logs similar to the following.
+```
+I0919 14:52:10.475738 293 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
+I0919 14:52:10.475968 293 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
+I0919 14:52:10.517138 293 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
+```
+
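Once those lines appear, Triton's standard HTTP health endpoints can be used to confirm that the server and the models are ready (adjust host and port if they were remapped):

```bash
# Returns HTTP 200 once the server is live and ready.
curl -sf -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready

# Readiness of a single model, e.g. the ensemble that chains pre/post-processing.
curl -sf -o /dev/null -w "%{http_code}\n" localhost:8000/v2/models/ensemble/ready
```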
 ### Kill the backend
 
 ```bash
 pgrep tritonserver | xargs kill -9
 ```
 
-## Examples
+## C++ backend examples (with in-flight batching support)
+Please follow the guide in [`inflight_batcher_llm/README.md`](inflight_batcher_llm/README.md).
+
+## Python backend examples (no in-flight batching support)
 
-### GPT/OPT/LLaMA/GPT-J...
+### GPT
 ```bash
 cd tools/gpt/
 
-# Download vocab and merge table for HF models
-# Take GPT as an example:
 rm -rf gpt2 && git clone https://huggingface.co/gpt2
 pushd gpt2 && rm pytorch_model.bin model.safetensors && \
     wget -q https://huggingface.co/gpt2/resolve/main/pytorch_model.bin && popd

all_models/gpt/preprocessing/1/model.py

Lines changed: 1 addition & 2 deletions
@@ -5,11 +5,10 @@
 import numpy as np
 import torch
 import triton_python_backend_utils as pb_utils
+from tensorrt_llm.runtime import to_word_list_format
 from torch.nn.utils.rnn import pad_sequence
 from transformers import AutoTokenizer, LlamaTokenizer, T5Tokenizer
 
-from tensorrt_llm.runtime import to_word_list_format
-
 
 class TritonPythonModel:
     """Your Python model must use the same class name. Every Python model

all_models/gpt/tensorrt_llm/1/model.py

Lines changed: 1 addition & 1 deletion
@@ -3,10 +3,10 @@
 
 import torch
 import triton_python_backend_utils as pb_utils
+from tensorrt_llm.runtime import GenerationSession, ModelConfig, SamplingConfig
 from torch import from_numpy
 
 import tensorrt_llm
-from tensorrt_llm.runtime import GenerationSession, ModelConfig, SamplingConfig
 
 
 def mpi_comm():

all_models/inflight_batcher_llm/preprocessing/1/model.py

Lines changed: 1 addition & 2 deletions
@@ -5,11 +5,10 @@
 import numpy as np
 import torch
 import triton_python_backend_utils as pb_utils
+from tensorrt_llm.runtime import to_word_list_format
 from torch.nn.utils.rnn import pad_sequence
 from transformers import AutoTokenizer, LlamaTokenizer, T5Tokenizer
 
-from tensorrt_llm.runtime import to_word_list_format
-
 
 class TritonPythonModel:
     """Your Python model must use the same class name. Every Python model

all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt

Lines changed: 24 additions & 0 deletions
@@ -162,3 +162,27 @@ parameters: {
     string_value: "${engine_dir}"
   }
 }
+parameters: {
+  key: "max_tokens_in_paged_kv_cache"
+  value: {
+    string_value: "${max_tokens_in_paged_kv_cache}"
+  }
+}
+parameters: {
+  key: "batch_scheduler_policy"
+  value: {
+    string_value: "${batch_scheduler_policy}"
+  }
+}
+parameters: {
+  key: "kv_cache_free_gpu_mem_fraction"
+  value: {
+    string_value: "${kv_cache_free_gpu_mem_fraction}"
+  }
+}
+parameters: {
+  key: "max_num_sequences"
+  value: {
+    string_value: "${max_num_sequences}"
+  }
+}
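Like `${engine_dir}`, the four new keys are exposed as template placeholders, so they can be filled the same way before the server starts. The values in the sketch below are illustrative assumptions (including the scheduler policy name), not recommendations; consult the backend documentation for the accepted values and sensible defaults.

```bash
# Illustrative fill of the new in-flight batching parameters; tune the numbers
# for your GPU memory budget and workload, and verify the accepted policy names.
sed -i \
    -e 's|${max_tokens_in_paged_kv_cache}|10000|g' \
    -e 's|${batch_scheduler_policy}|max_utilization|g' \
    -e 's|${kv_cache_free_gpu_mem_fraction}|0.9|g' \
    -e 's|${max_num_sequences}|64|g' \
    all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt
```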

dockerfile/Dockerfile.trt_llm_backend

Lines changed: 34 additions & 13 deletions
@@ -1,6 +1,7 @@
-ARG BASE_IMAGE=nvcr.io/nvidia/tritonserver:23.07-py3
+ARG BASE_IMAGE=nvcr.io/nvidia/tritonserver
+ARG BASE_TAG=23.08-py3
 
-FROM ${BASE_IMAGE} as base
+FROM ${BASE_IMAGE}:${BASE_TAG} as base
 
 COPY requirements.txt /tmp/
 RUN pip3 install -r /tmp/requirements.txt --extra-index-url https://pypi.ngc.nvidia.com
@@ -10,17 +11,37 @@ RUN pip3 install -r /tmp/requirements.txt --extra-index-url https://pypi.ngc.nvi
 RUN apt-get remove --purge -y tensorrt*
 RUN pip uninstall -y tensorrt
 
-# Download and install TensorRT
-RUN wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/secure/9.0.1/tars/TensorRT-9.0.1.4.Linux.x86_64-gnu.cuda-12.2.tar.gz -P /workspace
-RUN tar -xvf /workspace/TensorRT-9.0.1.4.Linux.x86_64-gnu.cuda-12.2.tar.gz -C /usr/local/ && mv /usr/local/TensorRT-9.0.1.4 /usr/local/tensorrt
-RUN pip install /usr/local/tensorrt/python/tensorrt-9.0.1*cp310-none-linux_x86_64.whl && rm -fr /workspace/TensorRT-9.0.1.4.Linux.x86_64-gnu.cuda-12.2.tar.gz
-ENV LD_LIBRARY_PATH=/usr/local/tensorrt/lib/:$LD_LIBRARY_PATH
-ENV TRT_ROOT=/usr/local/tensorrt
-
 FROM base as dev
 
-# Download and install polygraphy, only required if you need to run TRT-LLM python tests
-RUN pip install https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/secure/9.0.1/tars/polygraphy-0.48.1-py2.py3-none-any.whl
+# Download & install internal TRT release
+ARG TENSOR_RT_VERSION="9.1.0.1"
+ARG CUDA_VERSION="12.2"
+ARG RELEASE_URL_TRT
+ARG TARGETARCH
+
+RUN --mount=type=cache,target=/root/.cache \
+    if [ -z "$RELEASE_URL_TRT" ]; then \
+        ARCH=${TARGETARCH} && \
+        if [ "$ARCH" = "arm64" ]; then ARCH="aarch64"; fi && \
+        if [ "$ARCH" = "amd64" ]; then ARCH="x86_64"; fi && \
+        if [ "$ARCH" = "x86_64" ]; then DIR_NAME="x64-agnostic"; else DIR_NAME=${ARCH}; fi && \
+        if [ "$ARCH" = "aarch64" ]; then OS1="Ubuntu22_04" && OS2="Ubuntu-22.04"; else OS1="Linux" && OS2="Linux"; fi && \
+        RELEASE_URL_TRT=http://cuda-repo.nvidia.com/release-candidates/Libraries/TensorRT/v9.1/${TENSOR_RT_VERSION}-b6aa91dc/${CUDA_VERSION}-r535/${OS1}-${DIR_NAME}/tar/TensorRT-${TENSOR_RT_VERSION}.${OS2}.${ARCH}-gnu.cuda-${CUDA_VERSION}.tar.gz; \
+    fi && \
+    wget --no-verbose ${RELEASE_URL_TRT} -O /workspace/TensorRT.tar && \
+    tar -xf /workspace/TensorRT.tar -C /usr/local/ && \
+    mv /usr/local/TensorRT-${TENSOR_RT_VERSION} /usr/local/tensorrt && \
+    pip install /usr/local/tensorrt/python/tensorrt-*-cp310-*.whl && \
+    rm -rf /workspace/TensorRT.tar
+
+ENV LD_LIBRARY_PATH=/usr/local/tensorrt/lib:${LD_LIBRARY_PATH}
+ENV TRT_ROOT=/usr/local/tensorrt
+
+# Install latest Polygraphy
+ARG RELEASE_URL_PG=https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/secure/9.0.1/tars/polygraphy-0.48.1-py2.py3-none-any.whl
+RUN --mount=type=cache,target=/root/.cache \
+    pip uninstall -y polygraphy && \
+    pip install ${RELEASE_URL_PG}
 
 # CMake
 RUN wget https://github.com/Kitware/CMake/releases/download/v3.18.1/cmake-3.18.1-Linux-x86_64.sh
@@ -35,13 +56,13 @@ FROM dev as trt_llm_builder
 WORKDIR /app
 COPY scripts scripts
 COPY tensorrt_llm tensorrt_llm
-RUN cd tensorrt_llm; python3 scripts/build_wheel.py --trt_root="${TRT_ROOT}" -i; cd ..
+RUN cd tensorrt_llm && python3 scripts/build_wheel.py --trt_root="${TRT_ROOT}" -i && cd ..
 
 FROM trt_llm_builder as trt_llm_backend_builder
 
 WORKDIR /app/
 COPY inflight_batcher_llm inflight_batcher_llm
-RUN cd inflight_batcher_llm; bash scripts/build.sh; cd ..
+RUN cd inflight_batcher_llm && bash scripts/build.sh && cd ..
 
 FROM trt_llm_backend_builder as final
 
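With `BASE_IMAGE`/`BASE_TAG` split out and `RELEASE_URL_TRT` optional, the image build can be parameterized from the command line. The following sketch shows the default build plus a hypothetical override; the override URL is a placeholder, not a real download location.

```bash
# Default build, matching the values hard-coded in the Dockerfile.
docker build -f dockerfile/Dockerfile.trt_llm_backend \
    -t tritonserver:w_trt_llm_backend .

# Hypothetical build pinning a different Triton base tag and supplying an
# explicit TensorRT tarball URL instead of the auto-derived one.
docker build -f dockerfile/Dockerfile.trt_llm_backend \
    --build-arg BASE_TAG=23.08-py3 \
    --build-arg RELEASE_URL_TRT=https://example.com/TensorRT-9.1.0.1.Linux.x86_64-gnu.cuda-12.2.tar.gz \
    -t tritonserver:w_trt_llm_backend .
```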
