Pruna AI Triton Server Integration

This repository demonstrates how to optimize and deploy machine learning models with Pruna alongside NVIDIA's Triton Inference Server. It includes end-to-end examples for a Stable Diffusion (image generation) model and an LLM (text generation) model, both accelerated with Pruna's step caching compiler for efficient inference.

Directory Structure

The repository contains the following structure:

  • README.md: Project documentation.
  • Dockerfile: Docker image setup for Triton Server and Pruna.
  • client.py: Example client script for Stable Diffusion inference.
  • client_llm.py: Example client script for LLM inference.
  • model_repository/: Model repository for Triton Server.
    • stable_diffusion/: Stable Diffusion model directory.
      • config.pbtxt: Model configuration for Triton.
      • 1/: Version-specific model folder.
        • model.py: Triton Python backend model implementation.
    • llm_model/: LLM model directory.
      • config.pbtxt: Model configuration for Triton.
      • 1/: Version-specific model folder.
        • model.py: Triton Python backend model implementation.

Features

This repository includes:

  • Pruna Model Optimization: Integration of step caching for accelerated inference.
  • Triton Inference Server Integration: Simplifies scalable model deployment.
  • Stable Diffusion Deployment: End-to-end setup of a Stable Diffusion pipeline.
  • LLM Deployment: End-to-end setup of a Large Language Model pipeline.
  • Ease of Use: Simple instructions for setup and deployment.

Getting Started

Prerequisites

  1. Install Docker from Docker's official website. Running the container with --gpus=all also requires the NVIDIA Container Toolkit.
  2. Install Python Dependencies using pip: pip install "tritonclient[grpc]".

Steps to Deploy and Run

  1. Clone the Repository: Clone this repository and change into the project directory.

  2. Build the Docker Image: Build the Docker image with:

    docker build -t tritonserver_pruna .
  3. Run the Docker Container: Start the container with:

    docker run --rm --gpus=all -p 8000:8000 -p 8001:8001 -p 8002:8002 -v "$(pwd)/model_repository:/models" tritonserver_pruna tritonserver --model-repository=/models
  4. Run the Client Scripts: After starting the Triton Server, execute the client scripts:

    • For Stable Diffusion: python3 client.py
    • For LLM: python3 client_llm.py
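
Before running the client scripts, you can optionally confirm that the server and both models have finished loading. A minimal sketch using the gRPC client (assuming the default port mapping from the docker run command above and the model names from model_repository/):

    import tritonclient.grpc as grpcclient

    # Connect to the gRPC endpoint exposed by the container (port 8001).
    client = grpcclient.InferenceServerClient(url="localhost:8001")

    print("Server ready:", client.is_server_ready())
    print("stable_diffusion ready:", client.is_model_ready("stable_diffusion"))
    print("llm_model ready:", client.is_model_ready("llm_model"))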

Model Configurations

Stable Diffusion Model

The stable_diffusion/ folder contains:

  • Inputs: Named INPUT_TEXT, of type TYPE_STRING, and shape [1] (text prompts).
  • Outputs: Named OUTPUT, of type TYPE_FP32, and shape [3, 512, 512] (generated images).
  • Batch Size: Supports up to 4 concurrent requests.

LLM Model

The llm_model/ folder contains:

  • Inputs: Named INPUT_TEXT, of type TYPE_STRING, and shape [1] (text prompts).
  • Outputs: Named OUTPUT_TEXT, of type TYPE_STRING, and shape [1] (generated text responses).
  • Batch Size: Supports up to 4 concurrent requests.
  • Default Model: Uses microsoft/DialoGPT-medium (can be changed in model.py).
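
For reference, a Triton Python-backend model.py follows a standard shape: a TritonPythonModel class whose execute method turns incoming requests into responses. The sketch below is only a structural outline that mirrors the LLM model's tensor names; the repository's actual implementation loads the model and runs it through Pruna, whereas the placeholder below merely echoes the prompt:

    import numpy as np
    import triton_python_backend_utils as pb_utils  # provided inside the Triton container

    class TritonPythonModel:
        def initialize(self, args):
            # The real implementation loads the model and applies Pruna's optimization here.
            pass

        def execute(self, requests):
            responses = []
            for request in requests:
                text = pb_utils.get_input_tensor_by_name(request, "INPUT_TEXT").as_numpy()
                prompt = text.reshape(-1)[0].decode("utf-8")
                reply = f"echo: {prompt}"  # placeholder; the real model generates text here
                out = pb_utils.Tensor(
                    "OUTPUT_TEXT", np.array([reply.encode("utf-8")], dtype=np.object_)
                )
                responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
            return responses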

Example Client Scripts

Stable Diffusion Client (client.py)

Demonstrates how to send text prompts and retrieve generated images as numpy arrays.
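
For orientation, a request against the configuration above can be built roughly as follows (a sketch assuming a local server on the default gRPC port, the model name stable_diffusion, and an arbitrary example prompt; see client.py for the repository's actual script):

    import numpy as np
    import tritonclient.grpc as grpcclient

    client = grpcclient.InferenceServerClient(url="localhost:8001")

    # Triton TYPE_STRING tensors are sent as numpy object arrays of bytes ("BYTES").
    # If the model config enables batching, add a leading batch dimension to the shape.
    prompt = np.array([b"an astronaut riding a horse on the moon"], dtype=np.object_)
    inp = grpcclient.InferInput("INPUT_TEXT", prompt.shape, "BYTES")
    inp.set_data_from_numpy(prompt)

    result = client.infer(
        model_name="stable_diffusion",
        inputs=[inp],
        outputs=[grpcclient.InferRequestedOutput("OUTPUT")],
    )

    image = result.as_numpy("OUTPUT")  # float32 image data, [3, 512, 512] per the config
    print(image.shape, image.dtype)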

LLM Client (client_llm.py)

Demonstrates how to send text prompts and retrieve generated text responses from the language model.
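
The equivalent call for the LLM model differs only in the model and tensor names (same assumptions as the sketch above; see client_llm.py for the repository's actual script):

    import numpy as np
    import tritonclient.grpc as grpcclient

    client = grpcclient.InferenceServerClient(url="localhost:8001")

    prompt = np.array([b"Hello, how are you?"], dtype=np.object_)
    inp = grpcclient.InferInput("INPUT_TEXT", prompt.shape, "BYTES")
    inp.set_data_from_numpy(prompt)

    result = client.infer(
        model_name="llm_model",
        inputs=[inp],
        outputs=[grpcclient.InferRequestedOutput("OUTPUT_TEXT")],
    )

    # TYPE_STRING outputs come back as numpy arrays of bytes objects.
    reply = result.as_numpy("OUTPUT_TEXT").reshape(-1)[0].decode("utf-8")
    print(reply)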

Dockerfile Overview

The Dockerfile builds a container that includes:

  • NVIDIA Triton Inference Server.
  • Pruna with GPU support.
  • Additional dependencies for Stable Diffusion (diffusers, PIL).
  • Transformers library for LLM support.
  • PyTorch and accelerate for model optimization.

Both models are preconfigured to use Pruna's step caching compiler for optimized inference.
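
As a rough illustration of the Pruna side, the optimization step inside each model.py amounts to wrapping the loaded model with smash. The sketch below is an assumption-heavy outline rather than the repository's exact code: the Hugging Face checkpoint is only an example, and the SmashConfig key and value used for step caching vary between Pruna releases, so check the repo's model.py and Pruna's documentation for the exact settings:

    import torch
    from diffusers import StableDiffusionPipeline
    from pruna import SmashConfig, smash

    # Example checkpoint; the repository's model.py chooses its own.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Assumption: a caching entry stands in for the repo's step-caching configuration.
    smash_config = SmashConfig()
    smash_config["cacher"] = "deepcache"

    smashed_pipe = smash(model=pipe, smash_config=smash_config)
    image = smashed_pipe("an astronaut riding a horse").images[0]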

Customization

Changing the LLM Model

To use a different LLM model, modify the model_name variable in model_repository/llm_model/1/model.py:

model_name = "your-preferred-model"  # e.g., "gpt2", "microsoft/DialoGPT-large", etc.

Adjusting Generation Parameters

You can modify the text generation parameters in the LLM model's execute method (see the sketch after this list):

  • max_length: Maximum length of generated text
  • temperature: Controls randomness (0.1 = more focused, 1.0 = more random)
  • do_sample: Whether to use sampling or greedy decoding
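
For context, these parameters map directly onto a standard transformers generate() call; a self-contained sketch with illustrative values (not the repository's exact settings), using the default DialoGPT model:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "microsoft/DialoGPT-medium"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer("Hello, how are you?", return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_length=128,                       # cap on total sequence length, in tokens
        temperature=0.7,                      # below 1.0 = more focused, above 1.0 = more random
        do_sample=True,                       # sample instead of greedy decoding
        pad_token_id=tokenizer.eos_token_id,  # avoids the missing-pad-token warning
    )
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))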

Notes

  • Ensure your hardware supports CUDA and has sufficient GPU memory for the models.
  • LLM models may require significant GPU memory depending on the model size.
  • For additional optimizations, explore the smash_config options in both model implementations.
  • The first inference request may take longer due to model loading and optimization.

Feel free to customize the repository and reach out for support or feature requests! 🚀
