
Commit 5283ae5

Llama tutorial for TRTLLM (#62)
Added Llama2 tutorial for TensorRT-LLM backend
1 parent 17d6b79 commit 5283ae5

File tree

2 files changed: +157 -1 lines changed

Popular_Models_Guide/Llama2/trtllm_guide.md

Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
<!--
# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

## Pre-build instructions

For this tutorial, we are using the Llama2-7B HuggingFace model with pre-trained weights.
Clone the model repository, which contains the weights and tokenizer, from [here](https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main).
You will need to request access to the Llama2 repository and generate a Hugging Face access token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
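
As a minimal sketch (the exact flow depends on how your git credentials are set up), cloning the gated repository with Git LFS could look like this:

```bash
# Git LFS is required because the model weights are stored as LFS objects
git lfs install

# Cloning the gated repository prompts for your Hugging Face username
# and an access token (used in place of a password)
git clone https://huggingface.co/meta-llama/Llama-2-7b-hf
```
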
## Installation

1. The installation starts with cloning the TensorRT-LLM backend repository and updating the TensorRT-LLM submodule:
```bash
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
# Update the submodules
cd tensorrtllm_backend
git submodule update --init --recursive
git lfs install
git lfs pull
```

2. Then launch the Triton docker container with the TensorRT-LLM backend:

```bash
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:23.10-trtllm-py3 bash
```

Alternatively, you can follow the instructions [here](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md) to build Triton Server with the TensorRT-LLM backend if you want a specialized container.

Don't forget to allow GPU usage when you launch the container.
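
As a quick sanity check (not part of the original steps), you can confirm that the GPUs are visible from inside the container:

```bash
# Should list the same GPUs that are available on the host
nvidia-smi
```
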
## Create Engines for each model [skip this step if you already have an engine]
TensorRT-LLM requires each model to be compiled for your target configuration before it can be served. Before you run the model on Triton Server for the first time, you will need to create a TensorRT-LLM engine for the configuration you want with the following steps:

1. Install the TensorRT-LLM python package
```bash
# TensorRT-LLM is required for generating engines.
pip install git+https://github.com/NVIDIA/TensorRT-LLM.git

# Copy the TensorRT-LLM libraries shipped with the Triton backend into the Python package
mkdir /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
cp /opt/tritonserver/backends/tensorrtllm/* /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
```

2. Log in to huggingface-cli

```bash
huggingface-cli login --token hf_*****
```

3. Compile model engines

The script for building Llama models is located in the [TensorRT-LLM repository](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples). We use the copy located in the docker container at `/tensorrtllm_backend/tensorrt_llm/examples/llama/build.py`.
The command below compiles the model with in-flight batching enabled for a single GPU. To run with more GPUs, you will need to change the build command to use `--world_size X`.
For more details on the build script, please see the documentation for the Llama example [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama/README.md).

```bash
python build.py --model_dir /<path to your llama repo>/Llama-2-7b-hf/ \
                --dtype bfloat16 \
                --use_gpt_attention_plugin bfloat16 \
                --use_inflight_batching \
                --paged_kv_cache \
                --remove_input_padding \
                --use_gemm_plugin bfloat16 \
                --output_dir /<path to your engine>/1-gpu/ \
                --world_size 1
```

> Optional: You can check the output of the model with `run.py`
> located in the same llama examples folder.
>
> ```bash
> python3 run.py --engine_dir=<path to your engine>/1-gpu/ --max_output_len 100 --tokenizer_dir <path to your llama repo>/Llama-2-7b-hf --input_text "How do I count to ten in French?"
> ```

## Serving with Triton

The last step is to create a Triton-readable model repository. You can
find a template of a model that uses in-flight batching in [tensorrtllm_backend/all_models/inflight_batcher_llm](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/inflight_batcher_llm).
To run our Llama2-7B model, you will need to:

1. Copy over the inflight batcher models repository

```bash
cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/.
```

2. Modify the `config.pbtxt` files for the preprocessing, postprocessing and tensorrt_llm models. See details in the [documentation](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md#create-the-model-repository):

```bash
# preprocessing
sed -i 's#${tokenizer_dir}#/<path to your llama repo>/Llama-2-7b-hf/#' /opt/tritonserver/inflight_batcher_llm/preprocessing/config.pbtxt
sed -i 's#${tokenizer_type}#auto#' /opt/tritonserver/inflight_batcher_llm/preprocessing/config.pbtxt
# postprocessing
sed -i 's#${tokenizer_dir}#/<path to your llama repo>/Llama-2-7b-hf/#' /opt/tritonserver/inflight_batcher_llm/postprocessing/config.pbtxt
sed -i 's#${tokenizer_type}#auto#' /opt/tritonserver/inflight_batcher_llm/postprocessing/config.pbtxt

# tensorrt_llm model
sed -i 's#${decoupled_mode}#false#' /opt/tritonserver/inflight_batcher_llm/tensorrt_llm/config.pbtxt
sed -i 's#${engine_dir}#/<path to your engine>/1-gpu/#' /opt/tritonserver/inflight_batcher_llm/tensorrt_llm/config.pbtxt
```

Also, ensure that the `gpt_model_type` parameter is set to `inflight_fused_batching`.

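For reference, after this change the corresponding parameter block in `tensorrt_llm/config.pbtxt` should look roughly like the snippet below (surrounding fields omitted; shown only as an illustration of the expected value):

```
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
```
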
3. Launch Triton Server

```bash
tritonserver --model-repository=/opt/tritonserver/inflight_batcher_llm
```

Note: if you built the engine with `--world_size X` where `X` is greater than 1, you will need to use the [launch_triton_server.py](https://github.com/triton-inference-server/tensorrtllm_backend/blob/release/0.5.0/scripts/launch_triton_server.py) script instead:

```bash
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=X --model_repo=/opt/tritonserver/inflight_batcher_llm
```
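
Once the server reports that the models are ready, you can verify it from another shell using Triton's standard health endpoint (this assumes the default HTTP port 8000):

```bash
curl -v localhost:8000/v2/health/ready
```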

## Client

You can test the results of the run with:
1. The [inflight_batcher_llm_client.py script](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/inflight_batcher_llm)

```bash
python3 /tensorrtllm_backend/inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200
```

2. The [generate endpoint](https://github.com/triton-inference-server/tensorrtllm_backend/tree/release/0.5.0#query-the-server-with-the-triton-generate-endpoint) if you are using the Triton TensorRT-LLM Backend container with versions greater than `r23.10`.
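
For example, assuming the ensemble model from the template is named `ensemble` and the server uses the default HTTP port 8000, a request to the generate endpoint could look like this (see the linked documentation for the exact input names):

```bash
curl -X POST localhost:8000/v2/models/ensemble/generate \
     -d '{"text_input": "How do I count to ten in French?", "max_tokens": 100, "bad_words": "", "stop_words": ""}'
```
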
README.md

Lines changed: 12 additions & 1 deletion
@@ -10,10 +10,21 @@ For users experiencing the "Tensor in" & "Tensor out" approach to Deep Learning

The focus of these examples is to demonstrate deployment for models trained with various frameworks. These are quick demonstrations made with an understanding that the user is somewhat familiar with Triton.

-#### Deploy a ...
+### Deploy a ...
| [PyTorch Model](./Quick_Deploy/PyTorch/README.md) | [TensorFlow Model](./Quick_Deploy/TensorFlow/README.md) | [ONNX Model](./Quick_Deploy/ONNX/README.md) | [TensorRT Accelerated Model](https://github.com/NVIDIA/TensorRT/tree/main/quickstart/deploy_to_triton) | [vLLM Model](./Quick_Deploy/vLLM/README.md)
| --------------- | ------------ | --------------- | --------------- | --------------- |

## LLM Tutorials
The table below contains some popular models that are supported in our tutorials.
| Example Models | Tutorial Link |
| :-------------: | :------------------------------: |
| [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main) | [TensorRT-LLM Tutorial](Popular_Models_Guide/Llama2/trtllm_guide.md) |
| [Persimmon-8B](https://www.adept.ai/blog/persimmon-8b) | [HuggingFace Transformers Tutorial](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/HuggingFaceTransformers) |
| [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) | [HuggingFace Transformers Tutorial](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/HuggingFaceTransformers) |

**Note:**
This is not an exhaustive list of what Triton supports, just what is included in the tutorials.

## What does this repository contain?
This repository contains the following resources:
* [Conceptual Guide](./Conceptual_Guide/): This guide focuses on building a conceptual understanding of the general challenges faced whilst building inference infrastructure and how to best tackle these challenges with Triton Inference Server.
