
Commit 35d8620

Update README (#14)
1 parent 9717e97 commit 35d8620


README.md

Lines changed: 7 additions & 40 deletions
@@ -47,7 +47,7 @@ repo. If you don't find your answer there you can ask questions on the
 
 There are several ways to access the TensorRT-LLM Backend.
 
-**Before Triton 23.10 release, please use [Option 3 to build TensorRT-LLM backend via CMake](#option-3-build-via-cmake)**
+**Before Triton 23.10 release, please use [Option 3 to build TensorRT-LLM backend via CMake](#option-3-build-via-docker)**
 
 ### Option 1. Run the Docker Container
 
@@ -96,7 +96,7 @@ the TensorRT-LLM backend and Python backend repositories that will be used
 to build the container. You can also remove the features or endpoints that you
 don't need by removing the corresponding flags.
 
-### Option 3. Build via CMake
+### Option 3. Build via Docker
 
 ```bash
 # Update the submodules
@@ -105,43 +105,10 @@ git submodule update --init --recursive
 git lfs install
 git lfs pull
 
-# Patch the CMakeLists.txt file for different ABI builds
-patch inflight_batcher_llm/CMakeLists.txt < inflight_batcher_llm/CMakeLists.txt.patch
-
-# Move the source code to the current directory
-mv inflight_batcher_llm/src .
-mv inflight_batcher_llm/cmake .
-mv inflight_batcher_llm/CMakeLists.txt .
-
-# Create a build directory and run cmake
-mkdir build
-cd build
-cmake -DTRITON_BUILD=ON -DTRTLLM_BUILD_CONTAINER=nvcr.io/nvidia/tritonserver:23.09-py3-min -DTRITON_BACKEND_REPO_TAG=<GIT_BRANCH_NAME> -DTRITON_COMMON_REPO_TAG=<GIT_BRANCH_NAME> -DTRITON_CORE_REPO_TAG=<GIT_BRANCH_NAME> -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install ..
-make install
+# Use the Dockerfile to build the backend in a container
+DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
 ```
 
-The resulting `install/backends/tensorrtllm directory` can be added to a
-Triton installation as `/opt/tritonserver/backends/tensorrtllm` within the Triton
-NGC container.
-
-When building the TensorRT-LLM Backend with the flag `TRITON_BUILD` set to `ON`,
-it will launch a separate docker image to build an appropriate TRT-LLM
-implementation as part of the build. This setting is useful to avoid having
-extra dependencies that are not needed for building the backend. The image used
-to build the TRT-LLM is specified by the CMake variable
-`TRTLLM_BUILD_CONTAINER`. It is recommended to use the Triton min image on the
-NGC that matches the Triton release you are building for so that it contains
-the required CUDA dependencies.
-
-The following required Triton repositories will be pulled and used in
-the build. If the CMake variables below are not specified, "main" branch
-of those repositories will be used. `[tag]` should be the same
-as the TensorRT-LLM backend repository branch that you are trying to compile.
-
-* triton-inference-server/backend: `-DTRITON_BACKEND_REPO_TAG=[tag]`
-* triton-inference-server/common: `-DTRITON_COMMON_REPO_TAG=[tag]`
-* triton-inference-server/core: `-DTRITON_CORE_REPO_TAG=[tag]`
-
 ## Using the TensorRT-LLM Backend
 
 Below is an example of how to serve a TensorRT-LLM model with the Triton
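With the CMake steps removed, the build reduces to a single `docker build`. A minimal sketch of building the image and then checking its contents, assuming the Dockerfile installs the backend at the standard Triton location `/opt/tritonserver/backends/tensorrtllm` referenced elsewhere in this README (that placement is not guaranteed by the hunk above):

```bash
# Build the backend image from the repository root (same command as in the hunk above)
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .

# Sanity check: list the backend directory inside the freshly built image
# (assumes the Dockerfile places the backend under the standard Triton backend path)
docker run --rm triton_trt_llm ls /opt/tritonserver/backends/tensorrtllm
```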
@@ -247,11 +214,11 @@ The following table shows the fields that need to be modified before deployment:
 Before the Triton 23.10 release, you can launch the Triton 23.09 container
 `nvcr.io/nvidia/tritonserver:23.09-py3` and add the directory
 `/opt/tritonserver/backends/tensorrtllm` within the container following the
-instructions in [Option 3 Build via CMake](#option-3-build-via-cmake).
+instructions in [Option 3 Build via Docker](#option-3-build-via-docker).
 
 ```bash
 # Launch the Triton container
-docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:23.10-trtllm-py3 bash
+docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend triton_trt_llm bash
 
 cd /tensorrtllm_backend
 # --world_size is the number of GPUs you want to use for serving
@@ -360,7 +327,7 @@ You can have a look at the client code to see how early stopping is achieved.
 
 sudo nvidia-smi -lgc 1410,1410
 
-srun --mpi=pmix --container-image nvcr.io/nvidia/tritonserver:23.10-trtllm-py3 \
+srun --mpi=pmix --container-image triton_trt_llm \
     --container-mounts /path/to/tensorrtllm_backend:/tensorrtllm_backend \
     --container-workdir /tensorrtllm_backend \
     --output logs/tensorrt_llm_%t.out \
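Across the later hunks the NGC image `nvcr.io/nvidia/tritonserver:23.10-trtllm-py3` is consistently replaced by the locally built `triton_trt_llm` tag. A short consolidated sketch of the resulting single-node flow, using only commands that appear in this diff (the host path is a placeholder):

```bash
# 1. Build the TensorRT-LLM backend image (Option 3 above)
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .

# 2. Launch the locally built image with the backend repository mounted
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 \
    --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend triton_trt_llm bash

# 3. Inside the container, work from the mounted repository
cd /tensorrtllm_backend
```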
