@@ -47,7 +47,7 @@ repo. If you don't find your answer there you can ask questions on the
There are several ways to access the TensorRT-LLM Backend.
- **Before Triton 23.10 release, please use [Option 3 to build TensorRT-LLM backend via CMake](#option-3-build-via-cmake)**
+ **Before Triton 23.10 release, please use [Option 3 to build TensorRT-LLM backend via Docker](#option-3-build-via-docker)**
### Option 1. Run the Docker Container
@@ -96,7 +96,7 @@ the TensorRT-LLM backend and Python backend repositories that will be used
to build the container. You can also remove the features or endpoints that you
don't need by removing the corresponding flags.
- ### Option 3. Build via CMake
+ ### Option 3. Build via Docker
```bash
# Update the submodules
@@ -105,43 +105,10 @@ git submodule update --init --recursive
git lfs install
git lfs pull
- # Patch the CMakeLists.txt file for different ABI builds
- patch inflight_batcher_llm/CMakeLists.txt < inflight_batcher_llm/CMakeLists.txt.patch
-
- # Move the source code to the current directory
- mv inflight_batcher_llm/src .
- mv inflight_batcher_llm/cmake .
- mv inflight_batcher_llm/CMakeLists.txt .
-
- # Create a build directory and run cmake
- mkdir build
- cd build
- cmake -DTRITON_BUILD=ON -DTRTLLM_BUILD_CONTAINER=nvcr.io/nvidia/tritonserver:23.09-py3-min -DTRITON_BACKEND_REPO_TAG=<GIT_BRANCH_NAME> -DTRITON_COMMON_REPO_TAG=<GIT_BRANCH_NAME> -DTRITON_CORE_REPO_TAG=<GIT_BRANCH_NAME> -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install ..
- make install
+ # Use the Dockerfile to build the backend in a container
+ DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
```
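
A quick, optional sanity check after the Docker build is to list the image and start it interactively. This is only an illustrative sketch; it reuses the `triton_trt_llm` tag from the build command above and the same run flags and bind-mount path shown later in this README.

```bash
# List the image produced by the docker build above
docker images triton_trt_llm

# Start an interactive shell in the freshly built image
# (same flags as the serving example further down in this README)
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 \
  --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend triton_trt_llm bash
```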
- The resulting `install/backends/tensorrtllm` directory can be added to a
- Triton installation as `/opt/tritonserver/backends/tensorrtllm` within the Triton
- NGC container.
-
- When building the TensorRT-LLM Backend with the flag `TRITON_BUILD` set to `ON`,
- it will launch a separate docker image to build an appropriate TRT-LLM
- implementation as part of the build. This setting is useful to avoid having
- extra dependencies that are not needed for building the backend. The image used
- to build the TRT-LLM is specified by the CMake variable
- `TRTLLM_BUILD_CONTAINER`. It is recommended to use the Triton min image on the
- NGC that matches the Triton release you are building for so that it contains
- the required CUDA dependencies.
-
- The following required Triton repositories will be pulled and used in
- the build. If the CMake variables below are not specified, "main" branch
- of those repositories will be used. `[tag]` should be the same
- as the TensorRT-LLM backend repository branch that you are trying to compile.
-
- * triton-inference-server/backend: `-DTRITON_BACKEND_REPO_TAG=[tag]`
- * triton-inference-server/common: `-DTRITON_COMMON_REPO_TAG=[tag]`
- * triton-inference-server/core: `-DTRITON_CORE_REPO_TAG=[tag]`
-
## Using the TensorRT-LLM Backend
Below is an example of how to serve a TensorRT-LLM model with the Triton
@@ -247,11 +214,11 @@ The following table shows the fields that need to be modified before deployment:
Before the Triton 23.10 release, you can launch the Triton 23.09 container
`nvcr.io/nvidia/tritonserver:23.09-py3` and add the directory
`/opt/tritonserver/backends/tensorrtllm` within the container following the
- instructions in [Option 3 Build via CMake](#option-3-build-via-cmake).
+ instructions in [Option 3 Build via Docker](#option-3-build-via-docker).
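
If you do need that pre-23.10 path, one possible way is to copy the backend directory out of the image built in Option 3 and place it in the 23.09 container. This is only a sketch: it assumes the built image ships the backend under `/opt/tritonserver/backends/tensorrtllm`, and the container name `trtllm_backend_tmp` and the local destination directory are placeholders.

```bash
# Copy the backend directory out of the image built in Option 3
docker create --name trtllm_backend_tmp triton_trt_llm
docker cp trtllm_backend_tmp:/opt/tritonserver/backends/tensorrtllm ./tensorrtllm
docker rm trtllm_backend_tmp

# Then place the copied directory at /opt/tritonserver/backends/tensorrtllm
# inside the 23.09 container (e.g. via a bind mount or another docker cp)
```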
```bash
# Launch the Triton container
- docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:23.10-trtllm-py3 bash
+ docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend triton_trt_llm bash
cd /tensorrtllm_backend
# --world_size is the number of GPUs you want to use for serving
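# (it should match the tensor/pipeline parallelism the TensorRT-LLM engines were built with)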
@@ -360,7 +327,7 @@ You can have a look at the client code to see how early stopping is achieved.
sudo nvidia-smi -lgc 1410,1410
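# (-lgc above locks the GPU clocks at 1410 MHz for stable benchmark numbers; reset them afterwards with: sudo nvidia-smi -rgc)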
- srun --mpi=pmix --container-image nvcr.io/nvidia/tritonserver:23.10-trtllm-py3 \
+ srun --mpi=pmix --container-image triton_trt_llm \
--container-mounts /path/to/tensorrtllm_backend:/tensorrtllm_backend \
--container-workdir /tensorrtllm_backend \
--output logs/tensorrt_llm_%t.out \