Merged
6 changes: 3 additions & 3 deletions AGENTS.md
Original file line number Diff line number Diff line change
@@ -50,7 +50,7 @@ Dataset Manager --> Load Generator --> Endpoint Client --> External Endpoint
| ------------------- | ------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| **Load Generator** | `src/inference_endpoint/load_generator/` | Central orchestrator: `BenchmarkSession` owns the lifecycle, `Scheduler` controls timing, `LoadGenerator` issues queries |
| **Endpoint Client** | `src/inference_endpoint/endpoint_client/` | Multi-process HTTP workers communicating via ZMQ IPC. `HTTPEndpointClient` is the main entry point |
-| **Dataset Manager** | `src/inference_endpoint/dataset_manager/` | Loads pickle, HuggingFace, JSONL datasets. `Dataset` base class with `load_sample()`/`num_samples()` interface |
+| **Dataset Manager** | `src/inference_endpoint/dataset_manager/` | Loads JSONL, HuggingFace, CSV, JSON, Parquet datasets. `Dataset` base class with `load_sample()`/`num_samples()` interface |
| **Metrics** | `src/inference_endpoint/metrics/` | `EventRecorder` writes to SQLite, `MetricsReporter` reads and aggregates (QPS, latency, TTFT, TPOT) |
| **Config** | `src/inference_endpoint/config/`, `endpoint_client/config.py` | Pydantic-based YAML schema (`schema.py`), `HTTPClientConfig` (single Pydantic model for CLI/YAML/runtime), `RuntimeSettings` |
| **CLI** | `src/inference_endpoint/main.py`, `commands/benchmark/cli.py` | cyclopts-based, auto-generated from `schema.py` and `HTTPClientConfig` Pydantic models. Flat shorthands via `cyclopts.Parameter(alias=...)` |
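The `load_sample()`/`num_samples()` interface named in the Dataset Manager row is small enough to sketch. This is an illustrative rendering only; the `JSONLDataset` subclass and its constructor are assumptions, not the project's actual code:

```python
import json
from abc import ABC, abstractmethod


class Dataset(ABC):
    """Minimal sketch of the base-class interface described above."""

    @abstractmethod
    def load_sample(self, index: int) -> dict:
        """Return one sample as a dict, e.g. {"prompt": ...}."""

    @abstractmethod
    def num_samples(self) -> int:
        """Return the total number of samples."""


class JSONLDataset(Dataset):
    """Hypothetical JSONL-backed implementation, for illustration."""

    def __init__(self, path: str) -> None:
        with open(path, encoding="utf-8") as f:
            self._rows = [json.loads(line) for line in f if line.strip()]

    def load_sample(self, index: int) -> dict:
        return self._rows[index]

    def num_samples(self) -> int:
        return len(self._rows)
```

A load generator only needs these two methods, which is what lets pickle support be dropped in favor of JSONL/CSV/Parquet without touching callers.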
@@ -187,7 +187,7 @@ tests/
│ ├── endpoint_client/ # HTTP client integration tests
│ └── commands/ # CLI command integration tests
├── performance/ # Performance benchmarks (pytest-benchmark)
-└── datasets/ # Test data (dummy_1k.pkl, squad_pruned/)
+└── datasets/ # Test data (dummy_1k.jsonl, squad_pruned/)
```

## Development Standards
@@ -245,7 +245,7 @@ All of these run automatically on commit:
- `max_throughput_runtime_settings`, `poisson_runtime_settings`, `concurrency_runtime_settings` — preset configs
- `clean_sample_event_hooks` — ensures event hooks are cleared between tests

-**Test data**: `tests/datasets/dummy_1k.pkl` (1000 samples), `tests/datasets/squad_pruned/`
+**Test data**: `tests/datasets/dummy_1k.jsonl` (1000 samples), `tests/datasets/squad_pruned/`

### Performance Guidelines

8 changes: 4 additions & 4 deletions README.md
@@ -44,21 +44,21 @@ inference-endpoint probe \
inference-endpoint benchmark offline \
--endpoints http://your-endpoint:8000 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl
+--dataset tests/datasets/dummy_1k.jsonl

# Run online benchmark (sustained QPS - requires --target-qps, --load-pattern)
inference-endpoint benchmark online \
--endpoints http://your-endpoint:8000 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--load-pattern poisson \
--target-qps 100

# With explicit sample count
inference-endpoint benchmark offline \
--endpoints http://your-endpoint:8000 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--num-samples 5000
```
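The `--load-pattern poisson` flag above asks the scheduler for Poisson arrivals, i.e. exponentially distributed gaps between consecutive requests whose mean is `1 / target_qps`. A minimal sketch of that timing logic (an assumption about how such a scheduler works, not the tool's actual code):

```python
import random


def poisson_arrival_offsets(target_qps: float, num_requests: int, seed: int = 0) -> list[float]:
    """Return send times (seconds from start) for a Poisson process at target_qps.

    Each inter-arrival gap is drawn from an exponential distribution with
    mean 1 / target_qps, so the long-run request rate converges to target_qps.
    """
    rng = random.Random(seed)
    t = 0.0
    offsets = []
    for _ in range(num_requests):
        t += rng.expovariate(target_qps)
        offsets.append(t)
    return offsets
```

At `--target-qps 100`, 1000 samples would be dispatched over roughly 10 seconds, with the bursty spacing real traffic exhibits rather than a fixed 10 ms cadence.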

@@ -72,7 +72,7 @@ python -m inference_endpoint.testing.echo_server --port 8765 &
inference-endpoint benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl
+--dataset tests/datasets/dummy_1k.jsonl

# Stop echo server
pkill -f echo_server
10 changes: 5 additions & 5 deletions docs/CLI_DESIGN.md
@@ -102,7 +102,7 @@ The first segment is the file path, optionally prefixed with `perf:` or `acc:` t

```bash
# Simple
---dataset data.pkl
+--dataset data.jsonl

# Accuracy dataset
--dataset acc:eval.jsonl
@@ -111,15 +111,15 @@ The first segment is the file path, optionally prefixed with `perf:` or `acc:` t
--dataset data.csv,samples=500,parser.prompt=article

# With accuracy config
---dataset acc:eval.pkl,accuracy_config.eval_method=pass_at_1,accuracy_config.ground_truth=answer
+--dataset acc:eval.jsonl,accuracy_config.eval_method=pass_at_1,accuracy_config.ground_truth=answer

# Multiple datasets
---dataset perf:train.pkl --dataset acc:eval.pkl,accuracy_config.eval_method=pass_at_1 --mode both
+--dataset perf:train.jsonl --dataset acc:eval.jsonl,accuracy_config.eval_method=pass_at_1 --mode both
```

Parser remaps use `parser.TARGET=SOURCE` — "rename my dataset's SOURCE column to TARGET". Valid targets are derived from `MakeAdapterCompatible` (`prompt`, `system`). Invalid targets are rejected at parse time. Invalid source columns are rejected at dataset load time.

-Pydantic validates all fields: `extra="forbid"` on `Dataset` and `AccuracyConfig` catches typos like `--dataset data.pkl,samles=500`. Format is auto-detected from file extension.
+Pydantic validates all fields: `extra="forbid"` on `Dataset` and `AccuracyConfig` catches typos like `--dataset data.jsonl,samles=500`. Format is auto-detected from file extension.

The only YAML-only features are `submission_ref` and `benchmark_mode` (for official submissions).

@@ -202,5 +202,5 @@ class HTTPClientConfig(WithUpdatesMixin, BaseModel):
`BenchmarkConfig` is frozen. Use `with_updates()` to produce new instances with re-validation:

```python
-config = config.with_updates(timeout=300, datasets=["new_data.pkl"])
+config = config.with_updates(timeout=300, datasets=["new_data.jsonl"])
```
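One way such a `with_updates()` helper can be implemented on a frozen Pydantic v2 model (a sketch of the pattern, not the project's actual `WithUpdatesMixin`):

```python
from pydantic import BaseModel, ConfigDict


class WithUpdatesMixin:
    """Produce a re-validated copy with selected fields replaced."""

    def with_updates(self, **changes):
        merged = {**self.model_dump(), **changes}  # model_dump comes from BaseModel
        return type(self).model_validate(merged)


class BenchmarkConfig(WithUpdatesMixin, BaseModel):
    model_config = ConfigDict(frozen=True)  # instances are immutable

    timeout: int = 60
    datasets: list[str] = []


config = BenchmarkConfig(datasets=["old.jsonl"])
updated = config.with_updates(timeout=300, datasets=["new_data.jsonl"])
```

Because the merged dict goes back through `model_validate`, every update is re-checked against the schema, while the original frozen instance stays untouched.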
36 changes: 18 additions & 18 deletions docs/CLI_QUICK_REFERENCE.md
@@ -18,44 +18,44 @@ cyclopts. schema.py is the single source of truth for both YAML configs and CLI
inference-endpoint benchmark offline \
--endpoints URL \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl
+--dataset tests/datasets/dummy_1k.jsonl

# Online (sustained QPS - requires --load-pattern, --target-qps)
inference-endpoint benchmark online \
--endpoints URL \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--load-pattern poisson \
--target-qps 100

# Multiple datasets (--dataset is repeatable, prefix with perf: or acc:)
inference-endpoint benchmark offline \
--endpoints URL \
--model Qwen/Qwen3-8B \
---dataset perf:performance.pkl \
---dataset acc:accuracy.pkl \
+--dataset perf:performance.jsonl \
+--dataset acc:accuracy.jsonl \
--mode both

# With detailed report generation
inference-endpoint benchmark offline \
--endpoints URL \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--report-dir my_benchmark_report

# YAML-based
inference-endpoint benchmark from-config --config test.yaml
```

-**Default Test Dataset:** Use `tests/datasets/dummy_1k.pkl` (1000 samples, ~133 KB) for local testing.
+**Default Test Dataset:** Use `tests/datasets/dummy_1k.jsonl` (1000 samples) for local testing.

**Dataset format:** `--dataset [perf|acc:]<path>[,key=value...]` — TOML-style dotted paths. Type prefix is optional (defaults to `perf`):

```bash
---dataset data.pkl # simple path
+--dataset data.jsonl # simple path
--dataset acc:eval.jsonl # accuracy dataset
--dataset data.csv,samples=500,parser.prompt=article # with options
---dataset perf:data.jsonl,format=jsonl,parser.prompt=text # explicit format + remap
+--dataset perf:data.jsonl,format=.jsonl,parser.prompt=text # explicit format + remap
```
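The `--dataset` grammar above is simple to parse. This sketch (not the tool's actual parser) illustrates the optional `perf:`/`acc:` prefix and the TOML-style dotted keys; Pydantic validation of the resulting dict would happen afterwards:

```python
def parse_dataset_arg(arg: str) -> dict:
    """Parse '[perf|acc:]<path>[,key=value...]' into a nested dict."""
    head, *pairs = arg.split(",")
    if head.startswith(("perf:", "acc:")):
        dtype, path = head.split(":", 1)
    else:
        dtype, path = "perf", head  # type prefix defaults to perf
    spec: dict = {"type": dtype, "path": path}
    for pair in pairs:
        key, value = pair.split("=", 1)
        node = spec
        *parents, leaf = key.split(".")  # dotted path nests into sub-dicts
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return spec
```

For example, `acc:eval.jsonl,accuracy_config.eval_method=pass_at_1` yields `{"type": "acc", "path": "eval.jsonl", "accuracy_config": {"eval_method": "pass_at_1"}}`.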

### Accuracy Evaluation (stub - future implementation)
@@ -135,9 +135,9 @@ model_params:

## Dataset Formats

-Format is auto-detected from file extension. Override with `format:` in the dataset string.
+Format is auto-detected from file extension. Override with `format=<ext>` in the dataset string.

-**Supported:** `pkl`, `csv`, `json`, `jsonl`, `parquet`, `npy`, `pandas_pkl`, `huggingface`
+**Supported:** `.csv`, `.json`, `.jsonl`, `.parquet`, `huggingface`

## Test Modes

@@ -163,13 +163,13 @@ Accuracy config is supported in both CLI and YAML:

```bash
# CLI — accuracy config via dotted paths
---dataset acc:eval.pkl,accuracy_config.eval_method=pass_at_1,accuracy_config.ground_truth=answer,accuracy_config.extractor=boxed_math_extractor
+--dataset acc:eval.jsonl,accuracy_config.eval_method=pass_at_1,accuracy_config.ground_truth=answer,accuracy_config.extractor=boxed_math_extractor

# Combined perf + accuracy
inference-endpoint benchmark offline \
--endpoints URL --model M \
---dataset perf:perf.pkl \
---dataset acc:eval.pkl,accuracy_config.eval_method=pass_at_1,accuracy_config.ground_truth=answer \
+--dataset perf:perf.jsonl \
+--dataset acc:eval.jsonl,accuracy_config.eval_method=pass_at_1,accuracy_config.ground_truth=answer \
--mode both
```
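What `eval_method=pass_at_1` against a `ground_truth` column implies, in toy form. The real evaluators and extractors are not shown in this doc, so this is only a sketch of the metric itself:

```python
def pass_at_1(predictions: list[str], ground_truths: list[str]) -> float:
    """Fraction of samples whose single prediction matches the reference exactly."""
    assert len(predictions) == len(ground_truths)
    if not predictions:
        return 0.0
    correct = sum(p.strip() == g.strip() for p, g in zip(predictions, ground_truths))
    return correct / len(predictions)
```

An extractor such as `boxed_math_extractor` would run first, reducing each raw completion to the candidate answer string that gets compared here.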

@@ -203,7 +203,7 @@ inference-endpoint benchmark offline \
inference-endpoint benchmark offline \
--endpoints http://localhost:8000 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl
+--dataset tests/datasets/dummy_1k.jsonl
```

### Production Benchmark
@@ -213,7 +213,7 @@ inference-endpoint benchmark offline \
inference-endpoint benchmark online \
--endpoints https://api.production.com \
--model Qwen/Qwen3-8B \
---dataset prod_queries.pkl \
+--dataset prod_queries.jsonl \
--load-pattern poisson \
--target-qps 100 \
--num-samples 10000 \
@@ -225,7 +225,7 @@ inference-endpoint benchmark online \
inference-endpoint benchmark online \
--endpoints https://api.production.com \
--model Qwen/Qwen3-8B \
---dataset prod_queries.pkl \
+--dataset prod_queries.jsonl \
--load-pattern poisson \
--target-qps 100 \
--duration 5m \
@@ -278,10 +278,10 @@ model_params:
datasets:
- name: "perf"
type: "performance"
-path: "openorca.pkl"
+path: "openorca.jsonl"
- name: "gpqa"
type: "accuracy"
-path: "gpqa.pkl"
+path: "gpqa.jsonl"
eval_method: "exact_match"

settings:
2 changes: 1 addition & 1 deletion docs/DEVELOPMENT.md
@@ -94,7 +94,7 @@ pytest -v 2>&1 | tee test_results.log
- **Unit Tests** (`tests/unit/`): Test individual components in isolation
- **Integration Tests** (`tests/integration/`): Test component interactions with real servers
- **Performance Tests** (`tests/performance/`): Test performance characteristics (marked with @pytest.mark.performance, no timeout)
-- **Test Datasets** (`tests/datasets/`): Sample datasets for testing (dummy_1k.pkl, squad_pruned/)
+- **Test Datasets** (`tests/datasets/`): Sample datasets for testing (dummy_1k.jsonl, squad_pruned/)

### Writing Tests

30 changes: 15 additions & 15 deletions docs/LOCAL_TESTING.md
@@ -4,8 +4,8 @@

### 1. Prepare Test Environment

-**Dataset:** The repo includes `tests/datasets/dummy_1k.pkl` (1000 samples, ~133 KB)
-**Format:** Automatically inferred (supports: pkl, HuggingFace; coming soon: jsonl)
+**Dataset:** The repo includes `tests/datasets/dummy_1k.jsonl` (1000 samples)
+**Format:** Automatically inferred from the file extension. Common local formats include `jsonl`, `json`, `csv`, `parquet`, and HuggingFace datasets.

### 2. Start the Echo Server

@@ -72,13 +72,13 @@ Waiting for 5 responses...
inference-endpoint -v benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl
+--dataset tests/datasets/dummy_1k.jsonl

# Production test with custom params and report generation
inference-endpoint -v benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--num-samples 5000 \
--workers 4 \
--report-dir benchmark_report
@@ -90,7 +90,7 @@ inference-endpoint -v benchmark offline \
**Expected Output:**

```
-Loading: dummy_1k.pkl
+Loading: dummy_1k.jsonl
Loaded 1000 samples
Mode: TestMode.PERF, QPS: 10.0, Responses: False
Streaming: disabled (auto, offline mode)
@@ -111,7 +111,7 @@ Cleaning up...
inference-endpoint -v benchmark online \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--load-pattern poisson \
--target-qps 100 \
--report-dir online_benchmark_report
@@ -120,7 +120,7 @@ inference-endpoint -v benchmark online \
**Expected Output:**

```
-Loading: dummy_1k.pkl
+Loading: dummy_1k.jsonl
Loaded 1000 samples
Mode: TestMode.PERF, QPS: 100.0, Responses: False
Streaming: enabled (auto, online mode)
@@ -150,7 +150,7 @@ inference-endpoint validate-yaml --config offline_template.yaml
inference-endpoint benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/ds_samples.pkl \
+--dataset tests/datasets/ds_samples.jsonl \
-v
```

@@ -240,7 +240,7 @@ inference-endpoint probe --endpoints http://localhost:8000 --model Qwen/Qwen3-8B
inference-endpoint -v benchmark offline \
--endpoints http://localhost:8000 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--workers 4 \
--report-dir benchmark_report

@@ -255,14 +255,14 @@ pkill -f echo_server
inference-endpoint benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--report-dir offline_report

# Online (Poisson distribution)
inference-endpoint benchmark online \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--load-pattern poisson \
--target-qps 500 \
--report-dir online_report
@@ -271,21 +271,21 @@ inference-endpoint benchmark online \
inference-endpoint benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--num-samples 500

# Force streaming on for offline mode (to test TTFT metrics)
inference-endpoint benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--streaming on

# Concurrency mode (fixed concurrent requests)
inference-endpoint benchmark online \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--load-pattern concurrency \
--concurrency 32
```
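The `--load-pattern concurrency` mode above holds a fixed number of requests in flight rather than targeting a QPS rate; that is essentially a semaphore around the request loop. A sketch with assumed names (the real load generator is multi-process and not shown here):

```python
import asyncio


async def run_with_concurrency(num_requests: int, concurrency: int):
    """Complete num_requests tasks with at most `concurrency` in flight."""
    sem = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def one_request(i: int) -> int:
        nonlocal in_flight, peak
        async with sem:  # waits while `concurrency` requests are in flight
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # stand-in for the real HTTP call
            in_flight -= 1
            return i

    results = await asyncio.gather(*(one_request(i) for i in range(num_requests)))
    return peak, len(results)
```

As each request finishes, the semaphore releases a slot and the next queued request starts, so load stays pinned at the configured concurrency (32 in the command above).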
@@ -310,7 +310,7 @@ inference-endpoint benchmark online \
- Use `-v` for INFO logging, `-vv` for DEBUG
- Echo server mirrors prompts back - perfect for quick testing without real inference
- Press `Ctrl+C` to gracefully interrupt benchmarks
-- Default test dataset: `tests/datasets/dummy_1k.pkl` (1000 samples, ~133 KB)
+- Default test dataset: `tests/datasets/dummy_1k.jsonl` (1000 samples)

**Advanced:**

2 changes: 1 addition & 1 deletion examples/02_ServerBenchmarking/README.md
@@ -49,7 +49,7 @@ enroot start -e HF_TOKEN=$HF_TOKEN -m $HF_HOME:/root/.cache/huggingface vllm+vll
Once the server is up and running, we can send requests to the endpoint by passing in the endpoint address via `-e` as well as the model name

```
-inference-endpoint benchmark offline -e http://localhost:8000 -d tests/datasets/dummy_1k.pkl --model ${MODEL_NAME}
+inference-endpoint benchmark offline -e http://localhost:8000 -d tests/datasets/dummy_1k.jsonl --model ${MODEL_NAME}
```

# Using a config file
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -94,7 +94,7 @@ test = [
"Pympler==1.1",
"scipy==1.16.3",
# HTTP server and client for mock server fixture
-"aiohttp==3.13.3",
+"aiohttp==3.13.4",
# Plotting for benchmark sweep mode
"matplotlib==3.10.8",
]