Follow the detailed instructions [here](https://github.com/AI-Hypercomputer/inference-benchmark) to set up the benchmarking tool, then:

* Create an artifact repository:

```bash
gcloud artifacts repositories create ai-benchmark --location=us-central1 --repository-format=docker
```
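
  To confirm the repository was created, you can describe it (optional):

  ```bash
  gcloud artifacts repositories describe ai-benchmark --location=us-central1
  ```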

* Prepare datasets for [Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct) and [billsum](https://huggingface.co/datasets/FiscalNote/billsum):

```bash
pip install datasets transformers numpy pandas tqdm matplotlib
python datasets/import_dataset.py --hf_token YOUR_TOKEN
```
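
  The import script should produce the `billsum_conversations.json` and `Infinity-Instruct_conversations.json` files used by the test cases below; a quick existence check (output filenames and location are an assumption):

  ```bash
  # Assumed output files from import_dataset.py (names taken from the test cases below).
  ls -lh billsum_conversations.json Infinity-Instruct_conversations.json
  ```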

* Build the benchmark Docker image:

```bash
docker build -t inference-benchmark .
```
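
  If your build host architecture differs from the cluster nodes (for example, building on Apple silicon for x86 nodes), you may need to target the platform explicitly; a sketch:

  ```bash
  # Build for linux/amd64 when the build host is a different architecture (requires BuildKit).
  docker build --platform linux/amd64 -t inference-benchmark .
  ```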

* Push the Docker image to your artifact registry:

```bash
docker tag inference-benchmark us-central1-docker.pkg.dev/{project-name}/ai-benchmark/inference-benchmark
docker push us-central1-docker.pkg.dev/{project-name}/ai-benchmark/inference-benchmark
```
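
  If you prefer not to edit `{project-name}` by hand, a small sketch that derives it from your active gcloud configuration:

  ```bash
  # Derive the project ID from the active gcloud config, then tag and push.
  PROJECT_ID=$(gcloud config get-value project)
  docker tag inference-benchmark "us-central1-docker.pkg.dev/${PROJECT_ID}/ai-benchmark/inference-benchmark"
  docker push "us-central1-docker.pkg.dev/${PROJECT_ID}/ai-benchmark/inference-benchmark"
  ```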

## Conduct Regression Tests

Run benchmarks using the configurations below, which are optimized for NVIDIA H100 GPUs (80 GB). Adjust configurations for other hardware as necessary.

### Test Case 1: Single Workload

- **Dataset:** `billsum_conversations.json` (created from the [HuggingFace billsum dataset](https://huggingface.co/datasets/FiscalNote/billsum)).
*This dataset features long prompts, making it prefill-heavy and ideal for testing scenarios that emphasize initial token generation.*
- **Model:** [Llama 3.1 (8B)](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) (*critical*)
- **Replicas:** 10 (vLLM)
- **Request Rates:** 300–350 QPS (increments of 10)

Refer to example manifest:
`./config/manifests/regression-testing/single-workload-regression.yaml`
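
To start the run, apply the manifest to the cluster where the benchmark tool is deployed (a sketch; this assumes `kubectl` is pointed at your benchmark cluster):

```bash
kubectl apply -f ./config/manifests/regression-testing/single-workload-regression.yaml
```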

### Test Case 2: Multi-LoRA

- **Dataset:** `Infinity-Instruct_conversations.json` (created from the [HuggingFace Infinity-Instruct dataset](https://huggingface.co/datasets/BAAI/Infinity-Instruct)).
*This dataset has long outputs, making it decode-heavy and useful for testing scenarios focusing on sustained token generation.*
- **Model:** [Llama 3.1 (8B)](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- **LoRA Adapters:** 15 adapters (`nvidia/llama-3.1-nemoguard-8b-topic-control`, rank 8, critical)
- **Traffic Distribution** (simulating prod/dev/test tiers):
    - 60% on the first 5 adapters (12% each)
    - 30% on the next 5 adapters (6% each)
    - 10% on the last 5 adapters (2% each)
- **Max LoRA:** 3
- **Replicas:** 10 (vLLM)
- **Request Rates:** 20–200 QPS (increments of 20)

Optionally, you can also run benchmarks against the `ShareGPT` dataset for additional coverage.

Update deployments for multi-LoRA support:
- vLLM Deployment: `./config/manifests/regression-testing/vllm/multi-lora-deployment.yaml`
- InferenceModel: `./config/manifests/inferencemodel.yaml`

Refer to example manifest:
`./config/manifests/regression-testing/multi-lora-regression.yaml`
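
A sketch of the multi-LoRA rollout, assuming `kubectl` targets the benchmark cluster; apply the updated deployments first, then the regression manifest:

```bash
# Update the serving stack for multi-LoRA, then start the benchmark run.
kubectl apply -f ./config/manifests/regression-testing/vllm/multi-lora-deployment.yaml
kubectl apply -f ./config/manifests/inferencemodel.yaml
kubectl apply -f ./config/manifests/regression-testing/multi-lora-regression.yaml
```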

### Execute Benchmarks

Benchmark in two phases: before and after applying your changes:

- **Before changes:**

```bash
benchmark_id='regression-before' ./tools/benchmark/download-benchmark-results.bash
```

- **After changes:**

```bash
benchmark_id='regression-after' ./tools/benchmark/download-benchmark-results.bash
```

## Analyze Benchmark Results

Use the provided Jupyter notebook (`./tools/benchmark/benchmark.ipynb`) to analyze results:
- Update benchmark IDs to `regression-before` and `regression-after`.
- Compare latency and throughput metrics, performing regression analysis.
- Check R² values specifically:
    - **Prompts Attempted/Succeeded:** Expect R² ≈ 1
    - **Output Tokens per Minute, P90 per Output Token Latency, P90 Latency:** Expect R² close to 1 (allow minor variance).

Identify significant deviations, investigate causes, and confirm performance meets expected standards.
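
To open the notebook locally (assuming Jupyter is available in your environment):

```bash
pip install notebook  # if Jupyter is not already installed
jupyter notebook ./tools/benchmark/benchmark.ipynb
```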

# Nightly Benchmarking

To catch regressions early, we run a fully automated benchmark suite every night against the **latest `main` image** of the Gateway API Inference Extension. This pipeline uses LPG and the same manifests as above, running three workloads built from two standard datasets:

1. **Prefill-Heavy** (`billsum_conversations.json`)
   Emphasizes TTFT (time to first token) performance.
2. **Decode-Heavy** (`Infinity-Instruct_conversations.json`)
   Stresses sustained TPOT (time per output token) behavior.
3. **Multi-LoRA** (`billsum_conversations.json`)
   Uses 15 adapters with the traffic split defined above to capture complex adapter-loading and LoRA-affinity scenarios.

**How it works**:

- The benchmarking runs are triggered every 6 hours.
- It provisions a GKE cluster with several NVIDIA H100 (80 GB) GPUs, deploys N vLLM server replicas along with the latest Gateway API extension and monitoring manifests, then launches the benchmarking script.
- As part of provisioning, it deploys the latest Endpoint Picker image built from the `main` branch (see the pull example after this list):
```
us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:main
```
- It sequentially launches three benchmark runs (as described above) using the existing regression manifests.
- Results are uploaded to a central GCS bucket.
- A [Looker Studio dashboard](https://lookerstudio.google.com/u/0/reporting/c7ceeda6-6d5e-4688-bcad-acd076acfba6/page/6S4MF) automatically refreshes to display key metrics.
- After the benchmark runs complete, it tears down the cluster.
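
To inspect the exact image the pipeline deploys, you can pull it directly (assuming anonymous read access to the staging registry):

```bash
docker pull us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:main
```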

**Alerting**:

- If a regression is detected, an on-call rotation (internal to GKE) is alerted for further investigation.