This repository demonstrates how to optimize and deploy machine learning models using Pruna alongside NVIDIA's Triton Inference Server. The project includes examples for both Stable Diffusion (image generation) and LLM (text generation) models using Pruna's step caching compiler for efficient inference.
The repository contains the following structure:
- `README.md`: Project documentation.
- `Dockerfile`: Docker image setup for Triton Server and Pruna.
- `client.py`: Example client script for Stable Diffusion inference.
- `client_llm.py`: Example client script for LLM inference.
- `model_repository/`: Model repository for Triton Server.
  - `stable_diffusion/`: Stable Diffusion model directory.
    - `config.pbtxt`: Model configuration for Triton.
    - `1/`: Version-specific model folder.
      - `model.py`: Triton Python backend model implementation.
  - `llm_model/`: LLM model directory.
    - `config.pbtxt`: Model configuration for Triton.
    - `1/`: Version-specific model folder.
      - `model.py`: Triton Python backend model implementation.
This repository includes:
- Pruna Model Optimization: Integration of step caching for accelerated inference.
- Triton Inference Server Integration: Simplifies scalable model deployment.
- Stable Diffusion Deployment: End-to-end setup of a Stable Diffusion pipeline.
- LLM Deployment: End-to-end setup of a Large Language Model pipeline.
- Ease of Use: Simple instructions for setup and deployment.
- Install Docker from Docker's official website.
- Install Python Dependencies using pip:

  ```bash
  pip install tritonclient[grpc]
  ```
- Clone the Repository: Clone the repository and navigate to the project directory.
- Build the Docker Image: Build the Docker image with:

  ```bash
  docker build -t tritonserver_pruna .
  ```

- Run the Docker Container: Start the container with:

  ```bash
  docker run --rm --gpus=all \
    -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v "$(pwd)/model_repository:/models" \
    tritonserver_pruna tritonserver --model-repository=/models
  ```

- Run the Client Scripts: After starting the Triton Server, execute the client scripts:
  - For Stable Diffusion:

    ```bash
    python3 client.py
    ```

  - For LLM:

    ```bash
    python3 client_llm.py
    ```
For Stable Diffusion, the `stable_diffusion/` folder contains:
- Inputs: Named `INPUT_TEXT`, of type `TYPE_STRING`, and shape `[1]` (text prompts).
- Outputs: Named `OUTPUT`, of type `TYPE_FP32`, and shape `[3, 512, 512]` (generated images; see the conversion sketch below).
- Batch Size: Supports up to 4 concurrent requests.
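The `[3, 512, 512]` output is a channel-first (CHW) float image. A minimal post-processing sketch, assuming pixel values in `[0, 1]` (the repo's convention and file name here are illustrative):

```python
# Hypothetical conversion of the OUTPUT tensor into a saved PNG; the value
# range assumption ([0, 1]) is not taken from the repo.
import numpy as np
from PIL import Image

output = np.random.rand(3, 512, 512).astype(np.float32)  # stand-in for result.as_numpy("OUTPUT")[0]
hwc = np.transpose(output, (1, 2, 0))                     # CHW -> HWC for PIL
Image.fromarray((hwc * 255).clip(0, 255).astype(np.uint8)).save("generated.png")
```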
For the LLM, the `llm_model/` folder contains:
- Inputs: Named `INPUT_TEXT`, of type `TYPE_STRING`, and shape `[1]` (text prompts).
- Outputs: Named `OUTPUT_TEXT`, of type `TYPE_STRING`, and shape `[1]` (generated text responses).
- Batch Size: Supports up to 4 concurrent requests.
- Default Model: Uses `microsoft/DialoGPT-medium` (can be changed in `model.py`).
The Stable Diffusion client (`client.py`) demonstrates how to send text prompts and retrieve the generated images as NumPy arrays.
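A minimal sketch of such a client, assuming the `INPUT_TEXT`/`OUTPUT` interface described above (the prompt text and gRPC port are illustrative; the shipped `client.py` may differ in its details):

```python
# Sketch of a Stable Diffusion gRPC client; not the repo's exact client.py.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")  # 8001 = Triton gRPC port

# TYPE_STRING inputs are sent as BYTES tensors of numpy objects; shape is [batch, 1].
prompt = np.array([["a photo of an astronaut riding a horse"]], dtype=object)
infer_input = grpcclient.InferInput("INPUT_TEXT", prompt.shape, "BYTES")
infer_input.set_data_from_numpy(prompt)

result = client.infer(
    model_name="stable_diffusion",
    inputs=[infer_input],
    outputs=[grpcclient.InferRequestedOutput("OUTPUT")],
)

image = result.as_numpy("OUTPUT")  # float32 array, e.g. shape [1, 3, 512, 512]
print(image.shape, image.dtype)
```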
The LLM client (`client_llm.py`) demonstrates how to send text prompts and retrieve generated text responses from the language model.
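The same pattern applies for the LLM, except the output is a string tensor that needs decoding (again a sketch, not the exact `client_llm.py`):

```python
# Sketch of an LLM gRPC client; model name and prompt are illustrative.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

prompt = np.array([["Hello, how are you?"]], dtype=object)
infer_input = grpcclient.InferInput("INPUT_TEXT", prompt.shape, "BYTES")
infer_input.set_data_from_numpy(prompt)

result = client.infer(
    model_name="llm_model",
    inputs=[infer_input],
    outputs=[grpcclient.InferRequestedOutput("OUTPUT_TEXT")],
)

# TYPE_STRING outputs come back as bytes; decode to get the reply text.
reply = result.as_numpy("OUTPUT_TEXT")[0][0].decode("utf-8")
print(reply)
```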
The Dockerfile builds a container that includes:
- NVIDIA Triton Inference Server.
- Pruna with GPU support.
- Additional dependencies for Stable Diffusion (diffusers, PIL).
- Transformers library for LLM support.
- PyTorch and accelerate for model optimization.
Both models are preconfigured to use Pruna's step caching compiler for optimized inference.
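For orientation, a Triton Python backend `model.py` follows the skeleton below. The model loading, Pruna optimization, and pre/post-processing shown here are placeholders, not the repo's exact implementation:

```python
# Skeleton of a Triton Python backend model; the real model.py files load a
# diffusion pipeline / LLM and optimize them with Pruna in initialize().
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Load the model and apply Pruna's optimization here
        # (see the repo's model.py for the actual smash_config).
        self.model = None  # placeholder

    def execute(self, requests):
        responses = []
        for request in requests:
            # Decode the incoming TYPE_STRING prompt.
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT_TEXT")
            prompt = in_tensor.as_numpy()[0][0].decode("utf-8")

            # Run inference (placeholder) and wrap the result in an output tensor.
            output = np.zeros((1, 3, 512, 512), dtype=np.float32)
            out_tensor = pb_utils.Tensor("OUTPUT", output)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        # Release resources when the model is unloaded.
        self.model = None
```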
To use a different LLM, modify the `model_name` variable in `model_repository/llm_model/1/model.py`:

```python
model_name = "your-preferred-model"  # e.g., "gpt2", "microsoft/DialoGPT-large", etc.
```
You can modify the text generation parameters in the LLM model's `execute` method (see the sketch after this list):
- `max_length`: Maximum length of the generated text.
- `temperature`: Controls randomness (0.1 = more focused, 1.0 = more random).
- `do_sample`: Whether to use sampling or greedy decoding.
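These parameters map onto Hugging Face `transformers`' `generate()` call. A self-contained sketch with illustrative values (the repo's `execute` method may use different ones):

```python
# Illustrative generation call; parameter values are examples, not the repo's.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/DialoGPT-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(
    inputs["input_ids"],
    max_length=128,            # cap on the total sequence length
    temperature=0.7,           # lower = more focused, higher = more random
    do_sample=True,            # sample instead of greedy decoding
    pad_token_id=tokenizer.eos_token_id,  # GPT-style models define no pad token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```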
- Ensure your hardware supports CUDA and has sufficient GPU memory for the models.
- LLM models may require significant GPU memory depending on the model size.
- For additional optimizations, explore the `smash_config` options in both model implementations.
- The first inference request may take longer due to model loading and optimization.
Feel free to customize the repository and reach out for support or feature requests! 🚀