A comprehensive toolkit for managing different quantizations, builds, and models for fast testing and benchmarking with TensorRT-LLM. This project provides a streamlined workflow to build, tag, and serve multiple model configurations for performance evaluation.
This project simplifies the process of:
- Managing multiple model variants with tagged configurations
- Building TensorRT-LLM engines with different quantization settings
- Serving models for benchmarking and testing
- Comparing performance across different build configurations
Prerequisites:
- Docker with GPU support
- NVIDIA TensorRT-LLM container
- HuggingFace access token for downloading models (see the example below)
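If you haven't authenticated with HuggingFace before, the standard huggingface_hub tooling accepts a token either interactively or via an environment variable. A minimal sketch (hf_xxx is a placeholder for your own token):

# Make a HuggingFace token available for downloads
huggingface-cli login          # interactive; stores the token locally
# ...or non-interactively, e.g. for scripts:
export HF_TOKEN=hf_xxx         # placeholder: substitute your own token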
Use the build-container wizard:
./build-container
Create a directory for your model in model_weights/ and download the model files:
# Download with git (requires git-lfs):
git clone https://huggingface.co/nvidia/Llama-3.1-8B-Instruct-FP8 model_weights/nvidia--Llama-3.1-8B-Instruct-FP8

# ...or with huggingface-cli:
huggingface-cli download nvidia/Llama-3.1-8B-Instruct-FP8 --local-dir ./model_weights/nvidia--Llama-3.1-8B-Instruct-FP8 --local-dir-use-symlinks False
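Either way, it's worth a quick sanity check that the weights landed where the toolkit expects them (the exact file list varies by model, but a config.json plus safetensors shards is typical):

# The directory should contain config.json and *.safetensors shards
ls model_weights/nvidia--Llama-3.1-8B-Instruct-FP8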
This path covers serving a model with the PyTorch backend, as opposed to a native TensorRT-LLM engine. It's unclear whether NVIDIA will keep supporting native TensorRT-LLM engines in the future, since folks at NVIDIA say they're moving toward PyTorch.
Create a JSON configuration file in ./model_serve_args that will be used to serve your model:
Configuration format for serving with the PyTorch backend:
# Example: ./model_serve_args/nvidia--Llama-3.1-8B-Instruct-FP8.default.json
{
  "notes": "From Alex Steiner NVIDIA saying serve it straight",
  "args": {
    "--backend": "pytorch",
    "--extra_llm_api_options": "pytorch-small-batch.yml",
    "--host": "0.0.0.0",
    "--port": "8000",
    "--max_batch_size": 128,
    "--tp_size": 1,
    "--max_num_tokens": 2048
  }
}

Notice that we're passing additional LLM API options there. Let's create that file in ./extra_llm_api_options:
# extra_llm_api_options/pytorch-small-batch.yml
print_iter_log: true        # log per-iteration runtime stats while serving
cuda_graph_config:
  # batch sizes for which CUDA graphs are captured
  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]
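For reference, the args in the JSON config map onto trtllm-serve flags, so the wrapper's launch is roughly equivalent to the direct invocation below (a sketch, with the options file path spelled out; the wrapper may resolve paths differently):

# Rough equivalent of what the wrapper runs, flags taken from the JSON config
trtllm-serve model_weights/nvidia--Llama-3.1-8B-Instruct-FP8 \
  --backend pytorch \
  --extra_llm_api_options extra_llm_api_options/pytorch-small-batch.yml \
  --host 0.0.0.0 --port 8000 \
  --max_batch_size 128 --tp_size 1 --max_num_tokens 2048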
With the weights, serve args, and extra options in place, launch the server through the wrapper:

./trt-llab trtllm-serve
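trtllm-serve exposes an OpenAI-compatible HTTP API, so once the server reports ready you can smoke-test it with curl. The model name below is an assumption; check what GET /v1/models returns on your server:

# Smoke test against the OpenAI-compatible endpoint
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia--Llama-3.1-8B-Instruct-FP8", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 32}'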