A comprehensive performance benchmarking and testing framework for LlamaStackDistribution, designed to measure overhead, identify bottlenecks, and validate scalability in production environments.
This repository contains tools and configurations for performance testing LlamaStack deployments on Kubernetes/OpenShift, with a focus on:
- Quantifying LlamaStack overhead compared to direct vLLM inference
- Testing agentic workflows with PostgreSQL state management and MCP tool integration
- Validating horizontal pod autoscaling under realistic workloads
- Identifying bottlenecks in stateful operations and concurrent request handling
```
├── agentic/        # Responses API testing (PostgreSQL + MCP + Autoscaling)
├── benchmarking/   # Chat Completions endpoint testing (LlamaStack vs vLLM)
└── README.md
```
Purpose: Measure the performance overhead introduced by LlamaStack when wrapping a vLLM inference backend.
Tool: GuideLLM
Methodology:
- Baseline: Direct vLLM inference via OpenAI-compatible API
- Comparison: Same workload through LlamaStack → vLLM
- Metrics: Throughput (requests/sec), latency (TTFT, TPOT), token rates
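The same measurement logic can be pointed at either endpoint to compare the two paths. The sketch below is illustrative only; the base URL and model name are placeholders, not values from this repository, and it derives TTFT and TPOT from a single streaming chat completion using the OpenAI Python client.

```python
# Hedged sketch: measure TTFT and TPOT for one streaming request against any
# OpenAI-compatible endpoint. Point BASE_URL at the direct vLLM service or the
# LlamaStack service to compare the two paths. Both values are assumptions.
import time
from openai import OpenAI

BASE_URL = "http://vllm-or-llamastack.example.svc:8000/v1"  # assumption: adjust per target
MODEL = "meta-llama/Llama-3.1-8B-Instruct"                   # assumption: your served model

client = OpenAI(base_url=BASE_URL, api_key="not-needed-for-local")

start = time.perf_counter()
first_token_at = None
token_count = 0

stream = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Summarize the benefits of paged attention."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    # Each streamed chunk carries roughly one generated token for vLLM-style servers.
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        token_count += 1

end = time.perf_counter()
ttft = first_token_at - start
tpot = (end - first_token_at) / max(token_count - 1, 1)  # mean time per output token
print(f"TTFT: {ttft * 1000:.1f} ms, TPOT: {tpot * 1000:.1f} ms, tokens: {token_count}")
```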
Key Variables Tested:
- Concurrency levels (1, 2, 4, 8, 16, 32, 64, 128)
- Uvicorn worker counts (1, 2, 4)
- Pod replica counts (1, 2, 4)
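In practice the concurrency sweeps are driven by GuideLLM; the minimal sketch below only illustrates the shape of such a sweep with an async OpenAI-compatible client. The endpoint, model, and prompt are placeholder assumptions.

```python
# Hedged sketch: sweep concurrency levels against an OpenAI-compatible endpoint
# and report requests/sec. Endpoint and model are assumptions; the real sweeps
# in this repo are run with GuideLLM, not this script.
import asyncio
import time
from openai import AsyncOpenAI

BASE_URL = "http://vllm-or-llamastack.example.svc:8000/v1"   # assumption
MODEL = "meta-llama/Llama-3.1-8B-Instruct"                    # assumption

client = AsyncOpenAI(base_url=BASE_URL, api_key="not-needed")

async def one_request() -> None:
    await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Write one sentence about autoscaling."}],
        max_tokens=64,
    )

async def sweep() -> None:
    for concurrency in (1, 2, 4, 8, 16, 32, 64, 128):
        start = time.perf_counter()
        await asyncio.gather(*(one_request() for _ in range(concurrency)))
        elapsed = time.perf_counter() - start
        print(f"concurrency={concurrency:>3}  {concurrency / elapsed:6.2f} req/s  ({elapsed:.2f}s)")

asyncio.run(sweep())
```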
See benchmarking/README.md for detailed setup and usage.
Purpose: Test the LlamaStack Responses API with stateful operations, tool calling, and database persistence.
Tool: Locust with OpenAI extensions
Focus Areas:
- PostgreSQL backend performance (vs SQLite baseline)
- MCP (Model Context Protocol) tool integration
- Horizontal Pod Autoscaling (HPA) behavior under load
- Multi-turn conversations with state persistence
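A minimal Locust user for this kind of test might look like the sketch below. The `/v1/responses` path, payload shape, host, and model are assumptions modeled on the OpenAI Responses API; the actual locustfiles live under `agentic/`.

```python
# Hedged sketch of a Locust user that exercises a Responses-style endpoint with
# multi-turn state. Host, path, payload shape, and model are assumptions.
from locust import HttpUser, task, between


class ResponsesUser(HttpUser):
    wait_time = between(1, 3)
    host = "http://llamastack.example.svc:8321"  # assumption: LlamaStack service URL

    def on_start(self) -> None:
        self.previous_response_id = None

    @task
    def multi_turn_response(self) -> None:
        payload = {
            "model": "meta-llama/Llama-3.1-8B-Instruct",  # assumption
            "input": "What should I pack for a trip to Yellowstone?",
        }
        # Chain turns so the server must load prior conversation state
        # from its persistence backend (PostgreSQL or SQLite).
        if self.previous_response_id:
            payload["previous_response_id"] = self.previous_response_id

        with self.client.post("/v1/responses", json=payload, catch_response=True) as resp:
            if resp.status_code == 200:
                self.previous_response_id = resp.json().get("id")
                resp.success()
            else:
                resp.failure(f"unexpected status {resp.status_code}")
```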
Test Scenarios:
- Simple Responses API (no tools)
- Responses API with MCP tool calling (National Parks Service example)
- Direct vLLM comparison (baseline)
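For the MCP scenario, a single request might attach a remote MCP server as a tool. The sketch below follows the OpenAI Responses API "mcp" tool shape; the base URL, model, and National Parks MCP server URL are hypothetical placeholders, and whether LlamaStack accepts this exact shape should be confirmed against its own docs.

```python
# Hedged sketch of the MCP tool-calling scenario: one Responses API request with a
# remote MCP server attached as a tool. All URLs and the model name are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://llamastack.example.svc:8321/v1",  # assumption: OpenAI-compatible base path
    api_key="not-needed",
)

response = client.responses.create(
    model="meta-llama/Llama-3.1-8B-Instruct",          # assumption
    input="Which national parks in Montana have campgrounds open in May?",
    tools=[
        {
            "type": "mcp",
            "server_label": "national-parks",
            "server_url": "http://mcp-nps.example.svc:8000/sse",  # hypothetical MCP server
            "require_approval": "never",
        }
    ],
)
print(response.output_text)
```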
See agentic/README.md for detailed setup and usage.
- Kubernetes/OpenShift cluster with:
  - NVIDIA GPU nodes (for vLLM inference)
  - Red Hat OpenShift AI (RHOAI) or KServe installed
  - Persistent storage provisioner
- kubectl or oc CLI configured
- Python 3.9+ with pip (for local test execution)
- Inference: vLLM 0.11.x
- Orchestration: LlamaStackDistribution
- Benchmarking: GuideLLM, Locust
- Platform: OpenShift 4.x with RHOAI 2.22+
- Storage: PostgreSQL, SQLite
- Monitoring: Prometheus, DCGM (GPU metrics)