A comprehensive performance benchmarking and testing framework for LlamaStackDistribution, designed to measure overhead, identify bottlenecks, and validate scalability in production environments.
This repository contains tools and configurations for performance testing LlamaStack deployments on Kubernetes/OpenShift, with a focus on:
- Quantifying LlamaStack overhead compared to direct vLLM inference
- Testing agentic workflows with PostgreSQL state management and MCP tool integration
- Validating horizontal pod autoscaling under realistic workloads
- Identifying bottlenecks in stateful operations and concurrent request handling
```
├── agentic/        # Responses API testing (PostgreSQL + MCP + Autoscaling)
├── benchmarking/   # Chat Completions endpoint testing (LlamaStack vs vLLM)
└── README.md
```
Purpose: Measure the performance overhead introduced by LlamaStack when wrapping a vLLM inference backend.
Tool: GuideLLM
Methodology:
- Baseline: Direct vLLM inference via OpenAI-compatible API
- Comparison: Same workload through LlamaStack → vLLM
- Metrics: Throughput (requests/sec), latency (TTFT, TPOT), token rates
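The same measurement logic can be pointed at either endpoint to compare the two paths. The sketch below is illustrative only; the base URL and model name are placeholders, not values from this repository, and it derives TTFT and TPOT from a single streaming chat completion using the OpenAI Python client.

```python
# Hedged sketch: measure TTFT and TPOT for one streaming request against any
# OpenAI-compatible endpoint. Point BASE_URL at the direct vLLM service or the
# LlamaStack service to compare the two paths. Both values are assumptions.
import time
from openai import OpenAI

BASE_URL = "http://vllm-or-llamastack.example.svc:8000/v1"  # assumption: adjust per target
MODEL = "meta-llama/Llama-3.1-8B-Instruct"                   # assumption: your served model

client = OpenAI(base_url=BASE_URL, api_key="not-needed-for-local")

start = time.perf_counter()
first_token_at = None
token_count = 0

stream = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Summarize the benefits of paged attention."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    # Each streamed chunk carries roughly one generated token for vLLM-style servers.
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        token_count += 1

end = time.perf_counter()
ttft = first_token_at - start
tpot = (end - first_token_at) / max(token_count - 1, 1)  # mean time per output token
print(f"TTFT: {ttft * 1000:.1f} ms, TPOT: {tpot * 1000:.1f} ms, tokens: {token_count}")
```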
Key Variables Tested:
- Concurrency levels (1, 2, 4, 8, 16, 32, 64, 128)
- Uvicorn worker counts (1, 2, 4)
- Pod replica counts (1, 2, 4)
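In practice the concurrency sweeps are driven by GuideLLM; the minimal sketch below only illustrates the shape of such a sweep with an async OpenAI-compatible client. The endpoint, model, and prompt are placeholder assumptions.

```python
# Hedged sketch: sweep concurrency levels against an OpenAI-compatible endpoint
# and report requests/sec. Endpoint and model are assumptions; the real sweeps
# in this repo are run with GuideLLM, not this script.
import asyncio
import time
from openai import AsyncOpenAI

BASE_URL = "http://vllm-or-llamastack.example.svc:8000/v1"   # assumption
MODEL = "meta-llama/Llama-3.1-8B-Instruct"                    # assumption

client = AsyncOpenAI(base_url=BASE_URL, api_key="not-needed")

async def one_request() -> None:
    await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Write one sentence about autoscaling."}],
        max_tokens=64,
    )

async def sweep() -> None:
    for concurrency in (1, 2, 4, 8, 16, 32, 64, 128):
        start = time.perf_counter()
        await asyncio.gather(*(one_request() for _ in range(concurrency)))
        elapsed = time.perf_counter() - start
        print(f"concurrency={concurrency:>3}  {concurrency / elapsed:6.2f} req/s  ({elapsed:.2f}s)")

asyncio.run(sweep())
```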
See benchmarking/README.md for detailed setup and usage.
Purpose: Test the LlamaStack Responses API with stateful operations, tool calling, and database persistence.
Tool: Locust with OpenAI extensions
Focus Areas:
- PostgreSQL backend performance (vs SQLite baseline)
- MCP (Model Context Protocol) tool integration
- Horizontal Pod Autoscaling (HPA) behavior under load
- Multi-turn conversations with state persistence
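A minimal Locust user for this kind of test might look like the sketch below. The `/v1/responses` path, payload shape, host, and model are assumptions modeled on the OpenAI Responses API; the actual locustfiles live under `agentic/`.

```python
# Hedged sketch of a Locust user that exercises a Responses-style endpoint with
# multi-turn state. Host, path, payload shape, and model are assumptions.
from locust import HttpUser, task, between


class ResponsesUser(HttpUser):
    wait_time = between(1, 3)
    host = "http://llamastack.example.svc:8321"  # assumption: LlamaStack service URL

    def on_start(self) -> None:
        self.previous_response_id = None

    @task
    def multi_turn_response(self) -> None:
        payload = {
            "model": "meta-llama/Llama-3.1-8B-Instruct",  # assumption
            "input": "What should I pack for a trip to Yellowstone?",
        }
        # Chain turns so the server must load prior conversation state
        # from its persistence backend (PostgreSQL or SQLite).
        if self.previous_response_id:
            payload["previous_response_id"] = self.previous_response_id

        with self.client.post("/v1/responses", json=payload, catch_response=True) as resp:
            if resp.status_code == 200:
                self.previous_response_id = resp.json().get("id")
                resp.success()
            else:
                resp.failure(f"unexpected status {resp.status_code}")
```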
Test Scenarios:
- Simple Responses API (no tools)
- Responses API with MCP tool calling (National Parks Service example)
- Direct vLLM comparison (baseline)
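For the MCP scenario, a single request might attach a remote MCP server as a tool. The sketch below follows the OpenAI Responses API "mcp" tool shape; the base URL, model, and National Parks MCP server URL are hypothetical placeholders, and whether LlamaStack accepts this exact shape should be confirmed against its own docs.

```python
# Hedged sketch of the MCP tool-calling scenario: one Responses API request with a
# remote MCP server attached as a tool. All URLs and the model name are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://llamastack.example.svc:8321/v1",  # assumption: OpenAI-compatible base path
    api_key="not-needed",
)

response = client.responses.create(
    model="meta-llama/Llama-3.1-8B-Instruct",          # assumption
    input="Which national parks in Montana have campgrounds open in May?",
    tools=[
        {
            "type": "mcp",
            "server_label": "national-parks",
            "server_url": "http://mcp-nps.example.svc:8000/sse",  # hypothetical MCP server
            "require_approval": "never",
        }
    ],
)
print(response.output_text)
```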
See agentic/README.md for detailed setup and usage.
- Kubernetes/OpenShift cluster with:
  - NVIDIA GPU nodes (for vLLM inference)
  - Red Hat OpenShift AI (RHOAI) or KServe installed
  - Persistent storage provisioner
- kubectl or oc CLI configured
- Python 3.9+ with pip (for local test execution)
- Inference: vLLM 0.11.x
- Orchestration: LlamaStackDistribution
- Benchmarking: GuideLLM, Locust
- Platform: OpenShift 4.x with RHOAI 2.22+
- Storage: PostgreSQL, SQLite
- Monitoring: Prometheus, DCGM (GPU metrics)