# Update documentation #78

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

**Merged** · 1 commit · Nov 2, 2023
README.md: 50 changes (28 additions, 22 deletions)
@@ -26,8 +26,6 @@
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

[![License](https://img.shields.io/badge/License-BSD3-lightgrey.svg)](https://opensource.org/licenses/BSD-3-Clause)

# TensorRT-LLM Backend
The Triton backend for [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).
You can learn more about Triton backends in the [backend repo](https://github.com/triton-inference-server/backend).
@@ -51,18 +49,14 @@

There are several ways to access the TensorRT-LLM Backend.

### Option 1. Run the Docker Container

Starting with the Triton 23.10 release, Triton includes a container with the
TensorRT-LLM Backend and Python Backend. This container should have everything
you need to run a TensorRT-LLM model. You can find this container on the
[Triton NGC page](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver).
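
As a sketch of the first step, the container can be pulled directly (the image
tag below matches the launch command later in this README; pick the tag for
your release):

```bash
# Pull the Triton container that bundles the TensorRT-LLM and Python backends.
docker pull nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3
```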

### Option 2. Build via the build.py Script in Server Repo

Starting with the Triton 23.10 release, you can follow the steps described in the
[Building With Docker](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md#building-with-docker)
guide and use the
[build.py](https://github.com/triton-inference-server/server/blob/main/build.py)
@@ -73,7 +67,7 @@

shown below, which will build the same TRT-LLM container as the one on the NGC.

```bash
BASE_CONTAINER_IMAGE_NAME=nvcr.io/nvidia/tritonserver:23.10-py3-min
TENSORRTLLM_BACKEND_REPO_TAG=release/0.5.0
PYTHON_BACKEND_REPO_TAG=r23.10

# Run the build script. The flags for some features or endpoints can be removed if not needed.
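# A sketch of the invocation (assumed flags; the full command used for the NGC
# image enables more features and endpoints than shown here):
./build.py -v \
    --image=base,${BASE_CONTAINER_IMAGE_NAME} \
    --backend=python:${PYTHON_BACKEND_REPO_TAG} \
    --backend=tensorrtllm:${TENSORRTLLM_BACKEND_REPO_TAG}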
```

@@ -98,12 +92,14 @@

don't need by removing the corresponding flags.

### Option 3. Build via Docker

The version of Triton Server used in this build option can be found in the
[Dockerfile](./dockerfile/Dockerfile.trt_llm_backend).

```bash
# Fetch the Git LFS files and update the submodules
cd tensorrtllm_backend
git lfs install
git lfs pull
git submodule update --init --recursive

# Use the Dockerfile to build the backend in a container
# For x86_64
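# A sketch of the build command (assumed invocation; the image tag
# triton_trt_llm matches the launch command later in this README):
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm \
    -f dockerfile/Dockerfile.trt_llm_backend .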
```

@@ -210,19 +206,31 @@

The following table shows the fields that need to be modified before deployment:
| Field | Description |
| :---- | :---------- |
| `tokenizer_dir` | The path to the tokenizer for the model. In this example, the path should be set to `/tensorrtllm_backend/tensorrt_llm/examples/gpt/gpt2`, as the tensorrtllm_backend directory will be mounted to `/tensorrtllm_backend` within the container |
| `tokenizer_type` | The type of the tokenizer for the model; `t5`, `auto`, and `llama` are supported. In this example, the type should be set to `auto` |
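
In the model's `config.pbtxt`, these fields are plain Triton parameters. A
minimal sketch of the filled-in entries for this example (assuming the standard
`config.pbtxt` parameter syntax; the real file contains many more fields):

```pbtxt
# Hypothetical excerpt: only the two fields from the table above are shown.
parameters {
  key: "tokenizer_dir"
  value: { string_value: "/tensorrtllm_backend/tensorrt_llm/examples/gpt/gpt2" }
}
parameters {
  key: "tokenizer_type"
  value: { string_value: "auto" }
}
```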

### Launch Triton server

Please follow the option corresponding to the way you built the TensorRT-LLM backend.

#### Option 1. Launch Triton server *within the Triton NGC container*

```bash
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 bash
```

#### Option 2. Launch Triton server *within the Triton container built via the build.py script*

```bash
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend tritonserver bash
```

#### Option 3. Launch Triton server *within the Triton container built via Docker*

```bash
# Launch the Triton container
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend triton_trt_llm bash
```

Once inside the container, you can launch the Triton server with the following command:

```bash
cd /tensorrtllm_backend
# --world_size is the number of GPUs you want to use for serving
python3 scripts/launch_triton_server.py --world_size=4 --model_repo=/tensorrtllm_backend/triton_model_repo
```

@@ -237,9 +245,7 @@

I0919 14:52:10.517138 293 http_server.cc:187] Started Metrics Service at 0.0.0.0
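
Assuming the default Triton ports (8000 for HTTP inference, 8002 for the
metrics service shown in the log line above), a quick sanity check from another
shell might look like this:

```bash
# Readiness probe: prints 200 once the server and all models are ready.
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready

# Prometheus metrics from the metrics service in the startup log.
curl -s localhost:8002/metrics | head
```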

### Query the server with the Triton generate endpoint

Starting with the Triton 23.10 release, you can query the server using Triton's
[generate endpoint](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md)
with a curl command based on the following general format within your client
environment/container:
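
A sketch of such a query, assuming the example `ensemble` model is deployed and
exposes the `text_input`, `max_tokens`, `bad_words`, and `stop_words` inputs
(the prompt is hypothetical):

```bash
curl -X POST localhost:8000/v2/models/ensemble/generate -d \
  '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'
```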
@@ -364,7 +370,7 @@

You might have to contact your cluster's administrator to help you customize the
### Kill the Triton server

```bash
pkill tritonserver
```
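
`pkill` sends SIGTERM by default, which gives the server a chance to shut down
cleanly. To verify that no server processes remain:

```bash
# pgrep exits non-zero when nothing matches, so the echo runs only after shutdown.
pgrep tritonserver || echo "tritonserver has stopped"
```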

## Testing the TensorRT-LLM Backend
ci/README.md: 4 changes (1 addition, 3 deletions)
@@ -32,8 +32,6 @@

Tests in this CI directory can be run manually to provide extensive testing.

## Run QA Tests

Before the Triton 23.10 release, you can launch the Triton 23.09 container
`nvcr.io/nvidia/tritonserver:23.09-py3` and add the directory
`/opt/tritonserver/backends/tensorrtllm` within the container following the
@@ -42,7 +40,7 @@

instructions in [Option 3 Build via CMake](../README.md#option-3-build-via-cmake).
Run the tests within the Triton container.

```bash
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 bash

# Change directory to the test and run the test.sh script
cd /tensorrtllm_backend/ci/<test directory>
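# A sketch of the final step (the script name comes from the comment above;
# arguments, if any, depend on the individual test directory):
bash -x test.sh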
```