This repository demonstrates how to optimize and deploy machine learning models using Pruna alongside NVIDIA's Triton Inference Server. The project includes examples for both Stable Diffusion (image generation) and LLM (text generation) models using Pruna's step caching compiler for efficient inference.
The repository contains the following structure:
- `README.md`: Project documentation.
- `Dockerfile`: Docker image setup for Triton Server and Pruna.
- `client.py`: Example client script for Stable Diffusion inference.
- `client_llm.py`: Example client script for LLM inference.
- `model_repository/`: Model repository for Triton Server.
  - `stable_diffusion/`: Stable Diffusion model directory.
    - `config.pbtxt`: Model configuration for Triton.
    - `1/`: Version-specific model folder.
      - `model.py`: Triton Python backend model implementation.
  - `llm_model/`: LLM model directory.
    - `config.pbtxt`: Model configuration for Triton.
    - `1/`: Version-specific model folder.
      - `model.py`: Triton Python backend model implementation.
This repository includes:
- Pruna Model Optimization: Integration of step caching for accelerated inference.
- Triton Inference Server Integration: Simplifies scalable model deployment.
- Stable Diffusion Deployment: End-to-end setup of a Stable Diffusion pipeline.
- LLM Deployment: End-to-end setup of a Large Language Model pipeline.
- Ease of Use: Simple instructions for setup and deployment.
- Install Docker from Docker's official website.
- Install Python Dependencies using pip:

  ```bash
  pip install tritonclient[grpc]
  ```
- Clone the Repository: Clone the repository and navigate to the project directory.
- Build the Docker Image: Build the Docker image with:

  ```bash
  docker build -t tritonserver_pruna .
  ```

- Run the Docker Container: Start the container with:

  ```bash
  docker run --rm --gpus=all \
    -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v "$(pwd)/model_repository:/models" \
    tritonserver_pruna tritonserver --model-repository=/models
  ```

- Run the Client Scripts: After starting the Triton Server, execute the client scripts:
  - For Stable Diffusion:

    ```bash
    python3 client.py
    ```

  - For LLM:

    ```bash
    python3 client_llm.py
    ```
For Stable Diffusion, the `stable_diffusion/` folder contains:
- Inputs: Named `INPUT_TEXT`, of type `TYPE_STRING`, and shape `[1]` (text prompts).
- Outputs: Named `OUTPUT`, of type `TYPE_FP32`, and shape `[3, 512, 512]` (generated images; see the conversion sketch below).
- Batch Size: Supports up to 4 concurrent requests.
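The `[3, 512, 512]` output is a channel-first (CHW) float image. A minimal post-processing sketch, assuming pixel values in `[0, 1]` (the repo's convention and file name here are illustrative):

```python
# Hypothetical conversion of the OUTPUT tensor into a saved PNG; the value
# range assumption ([0, 1]) is not taken from the repo.
import numpy as np
from PIL import Image

output = np.random.rand(3, 512, 512).astype(np.float32)  # stand-in for result.as_numpy("OUTPUT")[0]
hwc = np.transpose(output, (1, 2, 0))                     # CHW -> HWC for PIL
Image.fromarray((hwc * 255).clip(0, 255).astype(np.uint8)).save("generated.png")
```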
For the LLM, the `llm_model/` folder contains:
- Inputs: Named `INPUT_TEXT`, of type `TYPE_STRING`, and shape `[1]` (text prompts).
- Outputs: Named `OUTPUT_TEXT`, of type `TYPE_STRING`, and shape `[1]` (generated text responses).
- Batch Size: Supports up to 4 concurrent requests.
- Default Model: Uses `microsoft/DialoGPT-medium` (can be changed in `model.py`).
The Stable Diffusion client (`client.py`) demonstrates how to send text prompts and retrieve the generated images as NumPy arrays.
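A minimal sketch of such a client, assuming the `INPUT_TEXT`/`OUTPUT` interface described above (the prompt text and gRPC port are illustrative; the shipped `client.py` may differ in its details):

```python
# Sketch of a Stable Diffusion gRPC client; not the repo's exact client.py.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")  # 8001 = Triton gRPC port

# TYPE_STRING inputs are sent as BYTES tensors of numpy objects; shape is [batch, 1].
prompt = np.array([["a photo of an astronaut riding a horse"]], dtype=object)
infer_input = grpcclient.InferInput("INPUT_TEXT", prompt.shape, "BYTES")
infer_input.set_data_from_numpy(prompt)

result = client.infer(
    model_name="stable_diffusion",
    inputs=[infer_input],
    outputs=[grpcclient.InferRequestedOutput("OUTPUT")],
)

image = result.as_numpy("OUTPUT")  # float32 array, e.g. shape [1, 3, 512, 512]
print(image.shape, image.dtype)
```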
The LLM client (`client_llm.py`) demonstrates how to send text prompts and retrieve generated text responses from the language model.
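The same pattern applies for the LLM, except the output is a string tensor that needs decoding (again a sketch, not the exact `client_llm.py`):

```python
# Sketch of an LLM gRPC client; model name and prompt are illustrative.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

prompt = np.array([["Hello, how are you?"]], dtype=object)
infer_input = grpcclient.InferInput("INPUT_TEXT", prompt.shape, "BYTES")
infer_input.set_data_from_numpy(prompt)

result = client.infer(
    model_name="llm_model",
    inputs=[infer_input],
    outputs=[grpcclient.InferRequestedOutput("OUTPUT_TEXT")],
)

# TYPE_STRING outputs come back as bytes; decode to get the reply text.
reply = result.as_numpy("OUTPUT_TEXT")[0][0].decode("utf-8")
print(reply)
```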
The Dockerfile builds a container that includes:
- NVIDIA Triton Inference Server.
- Pruna with GPU support.
- Additional dependencies for Stable Diffusion (diffusers, PIL).
- Transformers library for LLM support.
- PyTorch and accelerate for model optimization.
Both models are preconfigured to use Pruna's step caching compiler for optimized inference.
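For orientation, a Triton Python backend `model.py` follows the skeleton below. The model loading, Pruna optimization, and pre/post-processing shown here are placeholders, not the repo's exact implementation:

```python
# Skeleton of a Triton Python backend model; the real model.py files load a
# diffusion pipeline / LLM and optimize them with Pruna in initialize().
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Load the model and apply Pruna's optimization here
        # (see the repo's model.py for the actual smash_config).
        self.model = None  # placeholder

    def execute(self, requests):
        responses = []
        for request in requests:
            # Decode the incoming TYPE_STRING prompt.
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT_TEXT")
            prompt = in_tensor.as_numpy()[0][0].decode("utf-8")

            # Run inference (placeholder) and wrap the result in an output tensor.
            output = np.zeros((1, 3, 512, 512), dtype=np.float32)
            out_tensor = pb_utils.Tensor("OUTPUT", output)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        # Release resources when the model is unloaded.
        self.model = None
```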
To use a different LLM, modify the `model_name` variable in `model_repository/llm_model/1/model.py`:

```python
model_name = "your-preferred-model"  # e.g., "gpt2", "microsoft/DialoGPT-large", etc.
```
You can modify the text generation parameters in the LLM model's `execute` method (see the sketch after this list):
- `max_length`: Maximum length of the generated text.
- `temperature`: Controls randomness (0.1 = more focused, 1.0 = more random).
- `do_sample`: Whether to use sampling or greedy decoding.
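These parameters map onto Hugging Face `transformers`' `generate()` call. A self-contained sketch with illustrative values (the repo's `execute` method may use different ones):

```python
# Illustrative generation call; parameter values are examples, not the repo's.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/DialoGPT-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(
    inputs["input_ids"],
    max_length=128,            # cap on the total sequence length
    temperature=0.7,           # lower = more focused, higher = more random
    do_sample=True,            # sample instead of greedy decoding
    pad_token_id=tokenizer.eos_token_id,  # GPT-style models define no pad token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```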
- Ensure your hardware supports CUDA and has sufficient GPU memory for the models.
- LLM models may require significant GPU memory depending on the model size.
- For additional optimizations, explore the `smash_config` options in both model implementations.
- The first inference request may take longer due to model loading and optimization.
Feel free to customize the repository and reach out for support or feature requests! 🚀