Merged
6 changes: 3 additions & 3 deletions AGENTS.md
Original file line number Diff line number Diff line change
@@ -50,7 +50,7 @@ Dataset Manager --> Load Generator --> Endpoint Client --> External Endpoint
| ------------------- | ------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| **Load Generator** | `src/inference_endpoint/load_generator/` | Central orchestrator: `BenchmarkSession` owns the lifecycle, `Scheduler` controls timing, `LoadGenerator` issues queries |
| **Endpoint Client** | `src/inference_endpoint/endpoint_client/` | Multi-process HTTP workers communicating via ZMQ IPC. `HTTPEndpointClient` is the main entry point |
-| **Dataset Manager** | `src/inference_endpoint/dataset_manager/` | Loads pickle, HuggingFace, JSONL datasets. `Dataset` base class with `load_sample()`/`num_samples()` interface |
+| **Dataset Manager** | `src/inference_endpoint/dataset_manager/` | Loads JSONL, HuggingFace, CSV, JSON, Parquet datasets. `Dataset` base class with `load_sample()`/`num_samples()` interface |
| **Metrics** | `src/inference_endpoint/metrics/` | `EventRecorder` writes to SQLite, `MetricsReporter` reads and aggregates (QPS, latency, TTFT, TPOT) |
| **Config** | `src/inference_endpoint/config/`, `endpoint_client/config.py` | Pydantic-based YAML schema (`schema.py`), `HTTPClientConfig` (single Pydantic model for CLI/YAML/runtime), `RuntimeSettings` |
| **CLI** | `src/inference_endpoint/main.py`, `commands/benchmark/cli.py` | cyclopts-based, auto-generated from `schema.py` and `HTTPClientConfig` Pydantic models. Flat shorthands via `cyclopts.Parameter(alias=...)` |
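The `load_sample()`/`num_samples()` interface named in the Dataset Manager row is small enough to sketch. This is an illustrative rendering only; the `JSONLDataset` subclass and its constructor are assumptions, not the project's actual code:

```python
import json
from abc import ABC, abstractmethod


class Dataset(ABC):
    """Minimal sketch of the base-class interface described above."""

    @abstractmethod
    def load_sample(self, index: int) -> dict:
        """Return one sample as a dict, e.g. {"prompt": ...}."""

    @abstractmethod
    def num_samples(self) -> int:
        """Return the total number of samples."""


class JSONLDataset(Dataset):
    """Hypothetical JSONL-backed implementation, for illustration."""

    def __init__(self, path: str) -> None:
        with open(path, encoding="utf-8") as f:
            self._rows = [json.loads(line) for line in f if line.strip()]

    def load_sample(self, index: int) -> dict:
        return self._rows[index]

    def num_samples(self) -> int:
        return len(self._rows)
```

A load generator only needs these two methods, which is what lets pickle support be dropped in favor of JSONL/CSV/Parquet without touching callers.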
@@ -187,7 +187,7 @@ tests/
│ ├── endpoint_client/ # HTTP client integration tests
│ └── commands/ # CLI command integration tests
├── performance/ # Performance benchmarks (pytest-benchmark)
-└── datasets/ # Test data (dummy_1k.pkl, squad_pruned/)
+└── datasets/ # Test data (dummy_1k.jsonl, squad_pruned/)
```

## Development Standards
@@ -245,7 +245,7 @@ All of these run automatically on commit:
- `max_throughput_runtime_settings`, `poisson_runtime_settings`, `concurrency_runtime_settings` — preset configs
- `clean_sample_event_hooks` — ensures event hooks are cleared between tests

-**Test data**: `tests/datasets/dummy_1k.pkl` (1000 samples), `tests/datasets/squad_pruned/`
+**Test data**: `tests/datasets/dummy_1k.jsonl` (1000 samples), `tests/datasets/squad_pruned/`

### Performance Guidelines

8 changes: 4 additions & 4 deletions README.md
@@ -44,21 +44,21 @@ inference-endpoint probe \
inference-endpoint benchmark offline \
--endpoints http://your-endpoint:8000 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl
+--dataset tests/datasets/dummy_1k.jsonl

# Run online benchmark (sustained QPS - requires --target-qps, --load-pattern)
inference-endpoint benchmark online \
--endpoints http://your-endpoint:8000 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--load-pattern poisson \
--target-qps 100

# With explicit sample count
inference-endpoint benchmark offline \
--endpoints http://your-endpoint:8000 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--num-samples 5000
```
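The `--load-pattern poisson` flag above asks the scheduler for Poisson arrivals, i.e. exponentially distributed gaps between consecutive requests whose mean is `1 / target_qps`. A minimal sketch of that timing logic (an assumption about how such a scheduler works, not the tool's actual code):

```python
import random


def poisson_arrival_offsets(target_qps: float, num_requests: int, seed: int = 0) -> list[float]:
    """Return send times (seconds from start) for a Poisson process at target_qps.

    Each inter-arrival gap is drawn from an exponential distribution with
    mean 1 / target_qps, so the long-run request rate converges to target_qps.
    """
    rng = random.Random(seed)
    t = 0.0
    offsets = []
    for _ in range(num_requests):
        t += rng.expovariate(target_qps)
        offsets.append(t)
    return offsets
```

At `--target-qps 100`, 1000 samples would be dispatched over roughly 10 seconds, with the bursty spacing real traffic exhibits rather than a fixed 10 ms cadence.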

@@ -72,7 +72,7 @@ python -m inference_endpoint.testing.echo_server --port 8765 &
inference-endpoint benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl
+--dataset tests/datasets/dummy_1k.jsonl

# Stop echo server
pkill -f echo_server
10 changes: 5 additions & 5 deletions docs/CLI_DESIGN.md
@@ -102,7 +102,7 @@ The first segment is the file path, optionally prefixed with `perf:` or `acc:` t

```bash
# Simple
---dataset data.pkl
+--dataset data.jsonl

# Accuracy dataset
--dataset acc:eval.jsonl
@@ -111,15 +111,15 @@ The first segment is the file path, optionally prefixed with `perf:` or `acc:` t
--dataset data.csv,samples=500,parser.prompt=article

# With accuracy config
---dataset acc:eval.pkl,accuracy_config.eval_method=pass_at_1,accuracy_config.ground_truth=answer
+--dataset acc:eval.jsonl,accuracy_config.eval_method=pass_at_1,accuracy_config.ground_truth=answer

# Multiple datasets
---dataset perf:train.pkl --dataset acc:eval.pkl,accuracy_config.eval_method=pass_at_1 --mode both
+--dataset perf:train.jsonl --dataset acc:eval.jsonl,accuracy_config.eval_method=pass_at_1 --mode both
```

Parser remaps use `parser.TARGET=SOURCE` — "rename my dataset's SOURCE column to TARGET". Valid targets are derived from `MakeAdapterCompatible` (`prompt`, `system`). Invalid targets are rejected at parse time. Invalid source columns are rejected at dataset load time.

-Pydantic validates all fields: `extra="forbid"` on `Dataset` and `AccuracyConfig` catches typos like `--dataset data.pkl,samles=500`. Format is auto-detected from file extension.
+Pydantic validates all fields: `extra="forbid"` on `Dataset` and `AccuracyConfig` catches typos like `--dataset data.jsonl,samles=500`. Format is auto-detected from file extension.

The only YAML-only features are `submission_ref` and `benchmark_mode` (for official submissions).

@@ -202,5 +202,5 @@ class HTTPClientConfig(WithUpdatesMixin, BaseModel):
`BenchmarkConfig` is frozen. Use `with_updates()` to produce new instances with re-validation:

```python
-config = config.with_updates(timeout=300, datasets=["new_data.pkl"])
+config = config.with_updates(timeout=300, datasets=["new_data.jsonl"])
```
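One way such a `with_updates()` helper can be implemented on a frozen Pydantic v2 model (a sketch of the pattern, not the project's actual `WithUpdatesMixin`):

```python
from pydantic import BaseModel, ConfigDict


class WithUpdatesMixin:
    """Produce a re-validated copy with selected fields replaced."""

    def with_updates(self, **changes):
        merged = {**self.model_dump(), **changes}  # model_dump comes from BaseModel
        return type(self).model_validate(merged)


class BenchmarkConfig(WithUpdatesMixin, BaseModel):
    model_config = ConfigDict(frozen=True)  # instances are immutable

    timeout: int = 60
    datasets: list[str] = []


config = BenchmarkConfig(datasets=["old.jsonl"])
updated = config.with_updates(timeout=300, datasets=["new_data.jsonl"])
```

Because the merged dict goes back through `model_validate`, every update is re-checked against the schema, while the original frozen instance stays untouched.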
36 changes: 18 additions & 18 deletions docs/CLI_QUICK_REFERENCE.md
@@ -18,44 +18,44 @@ cyclopts. schema.py is the single source of truth for both YAML configs and CLI
inference-endpoint benchmark offline \
--endpoints URL \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl
+--dataset tests/datasets/dummy_1k.jsonl

# Online (sustained QPS - requires --load-pattern, --target-qps)
inference-endpoint benchmark online \
--endpoints URL \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--load-pattern poisson \
--target-qps 100

# Multiple datasets (--dataset is repeatable, prefix with perf: or acc:)
inference-endpoint benchmark offline \
--endpoints URL \
--model Qwen/Qwen3-8B \
---dataset perf:performance.pkl \
---dataset acc:accuracy.pkl \
+--dataset perf:performance.jsonl \
+--dataset acc:accuracy.jsonl \
--mode both

# With detailed report generation
inference-endpoint benchmark offline \
--endpoints URL \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--report-dir my_benchmark_report

# YAML-based
inference-endpoint benchmark from-config --config test.yaml
```

-**Default Test Dataset:** Use `tests/datasets/dummy_1k.pkl` (1000 samples, ~133 KB) for local testing.
+**Default Test Dataset:** Use `tests/datasets/dummy_1k.jsonl` (1000 samples) for local testing.

**Dataset format:** `--dataset [perf|acc:]<path>[,key=value...]` — TOML-style dotted paths. Type prefix is optional (defaults to `perf`):

```bash
---dataset data.pkl # simple path
+--dataset data.jsonl # simple path
--dataset acc:eval.jsonl # accuracy dataset
--dataset data.csv,samples=500,parser.prompt=article # with options
---dataset perf:data.jsonl,format=jsonl,parser.prompt=text # explicit format + remap
+--dataset perf:data.jsonl,format=.jsonl,parser.prompt=text # explicit format + remap
```
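The `--dataset` grammar above is simple to parse. This sketch (not the tool's actual parser) illustrates the optional `perf:`/`acc:` prefix and the TOML-style dotted keys; Pydantic validation of the resulting dict would happen afterwards:

```python
def parse_dataset_arg(arg: str) -> dict:
    """Parse '[perf|acc:]<path>[,key=value...]' into a nested dict."""
    head, *pairs = arg.split(",")
    if head.startswith(("perf:", "acc:")):
        dtype, path = head.split(":", 1)
    else:
        dtype, path = "perf", head  # type prefix defaults to perf
    spec: dict = {"type": dtype, "path": path}
    for pair in pairs:
        key, value = pair.split("=", 1)
        node = spec
        *parents, leaf = key.split(".")  # dotted path nests into sub-dicts
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return spec
```

For example, `acc:eval.jsonl,accuracy_config.eval_method=pass_at_1` yields `{"type": "acc", "path": "eval.jsonl", "accuracy_config": {"eval_method": "pass_at_1"}}`.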

### Accuracy Evaluation (stub - future implementation)
@@ -135,9 +135,9 @@ model_params:

## Dataset Formats

-Format is auto-detected from file extension. Override with `format:` in the dataset string.
+Format is auto-detected from file extension. Override with `format=<ext>` in the dataset string.

-**Supported:** `pkl`, `csv`, `json`, `jsonl`, `parquet`, `npy`, `pandas_pkl`, `huggingface`
+**Supported:** `.csv`, `.json`, `.jsonl`, `.parquet`, `huggingface`

## Test Modes

@@ -163,13 +163,13 @@ Accuracy config is supported in both CLI and YAML:

```bash
# CLI — accuracy config via dotted paths
---dataset acc:eval.pkl,accuracy_config.eval_method=pass_at_1,accuracy_config.ground_truth=answer,accuracy_config.extractor=boxed_math_extractor
+--dataset acc:eval.jsonl,accuracy_config.eval_method=pass_at_1,accuracy_config.ground_truth=answer,accuracy_config.extractor=boxed_math_extractor

# Combined perf + accuracy
inference-endpoint benchmark offline \
--endpoints URL --model M \
---dataset perf:perf.pkl \
---dataset acc:eval.pkl,accuracy_config.eval_method=pass_at_1,accuracy_config.ground_truth=answer \
+--dataset perf:perf.jsonl \
+--dataset acc:eval.jsonl,accuracy_config.eval_method=pass_at_1,accuracy_config.ground_truth=answer \
--mode both
```
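What `eval_method=pass_at_1` against a `ground_truth` column implies, in toy form. The real evaluators and extractors are not shown in this doc, so this is only a sketch of the metric itself:

```python
def pass_at_1(predictions: list[str], ground_truths: list[str]) -> float:
    """Fraction of samples whose single prediction matches the reference exactly."""
    assert len(predictions) == len(ground_truths)
    if not predictions:
        return 0.0
    correct = sum(p.strip() == g.strip() for p, g in zip(predictions, ground_truths))
    return correct / len(predictions)
```

An extractor such as `boxed_math_extractor` would run first, reducing each raw completion to the candidate answer string that gets compared here.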

@@ -203,7 +203,7 @@ inference-endpoint benchmark offline \
inference-endpoint benchmark offline \
--endpoints http://localhost:8000 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl
+--dataset tests/datasets/dummy_1k.jsonl
```

### Production Benchmark
@@ -213,7 +213,7 @@ inference-endpoint benchmark offline \
inference-endpoint benchmark online \
--endpoints https://api.production.com \
--model Qwen/Qwen3-8B \
---dataset prod_queries.pkl \
+--dataset prod_queries.jsonl \
--load-pattern poisson \
--target-qps 100 \
--num-samples 10000 \
@@ -225,7 +225,7 @@ inference-endpoint benchmark online \
inference-endpoint benchmark online \
--endpoints https://api.production.com \
--model Qwen/Qwen3-8B \
---dataset prod_queries.pkl \
+--dataset prod_queries.jsonl \
--load-pattern poisson \
--target-qps 100 \
--duration 5m \
@@ -278,10 +278,10 @@ model_params:
datasets:
- name: "perf"
type: "performance"
-path: "openorca.pkl"
+path: "openorca.jsonl"
- name: "gpqa"
type: "accuracy"
-path: "gpqa.pkl"
+path: "gpqa.jsonl"
eval_method: "exact_match"

settings:
2 changes: 1 addition & 1 deletion docs/DEVELOPMENT.md
@@ -94,7 +94,7 @@ pytest -v 2>&1 | tee test_results.log
- **Unit Tests** (`tests/unit/`): Test individual components in isolation
- **Integration Tests** (`tests/integration/`): Test component interactions with real servers
- **Performance Tests** (`tests/performance/`): Test performance characteristics (marked with @pytest.mark.performance, no timeout)
-- **Test Datasets** (`tests/datasets/`): Sample datasets for testing (dummy_1k.pkl, squad_pruned/)
+- **Test Datasets** (`tests/datasets/`): Sample datasets for testing (dummy_1k.jsonl, squad_pruned/)

### Writing Tests

30 changes: 15 additions & 15 deletions docs/LOCAL_TESTING.md
@@ -4,8 +4,8 @@

### 1. Prepare Test Environment

-**Dataset:** The repo includes `tests/datasets/dummy_1k.pkl` (1000 samples, ~133 KB)
-**Format:** Automatically inferred (supports: pkl, HuggingFace; coming soon: jsonl)
+**Dataset:** The repo includes `tests/datasets/dummy_1k.jsonl` (1000 samples)
+**Format:** Automatically inferred from the file extension. Common local formats include `jsonl`, `json`, `csv`, `parquet`, and HuggingFace datasets.

### 2. Start the Echo Server

@@ -72,13 +72,13 @@ Waiting for 5 responses...
inference-endpoint -v benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl
+--dataset tests/datasets/dummy_1k.jsonl

# Production test with custom params and report generation
inference-endpoint -v benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--num-samples 5000 \
--workers 4 \
--report-dir benchmark_report
@@ -90,7 +90,7 @@ inference-endpoint -v benchmark offline \
**Expected Output:**

```
-Loading: dummy_1k.pkl
+Loading: dummy_1k.jsonl
Loaded 1000 samples
Mode: TestMode.PERF, QPS: 10.0, Responses: False
Streaming: disabled (auto, offline mode)
@@ -111,7 +111,7 @@ Cleaning up...
inference-endpoint -v benchmark online \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--load-pattern poisson \
--target-qps 100 \
--report-dir online_benchmark_report
@@ -120,7 +120,7 @@ inference-endpoint -v benchmark online \
**Expected Output:**

```
-Loading: dummy_1k.pkl
+Loading: dummy_1k.jsonl
Loaded 1000 samples
Mode: TestMode.PERF, QPS: 100.0, Responses: False
Streaming: enabled (auto, online mode)
@@ -150,7 +150,7 @@ inference-endpoint validate-yaml --config offline_template.yaml
inference-endpoint benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/ds_samples.pkl \
+--dataset tests/datasets/ds_samples.jsonl \
-v
```

@@ -240,7 +240,7 @@ inference-endpoint probe --endpoints http://localhost:8000 --model Qwen/Qwen3-8B
inference-endpoint -v benchmark offline \
--endpoints http://localhost:8000 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--workers 4 \
--report-dir benchmark_report

@@ -255,14 +255,14 @@ pkill -f echo_server
inference-endpoint benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--report-dir offline_report

# Online (Poisson distribution)
inference-endpoint benchmark online \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--load-pattern poisson \
--target-qps 500 \
--report-dir online_report
@@ -271,21 +271,21 @@ inference-endpoint benchmark online \
inference-endpoint benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--num-samples 500

# Force streaming on for offline mode (to test TTFT metrics)
inference-endpoint benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--streaming on

# Concurrency mode (fixed concurrent requests)
inference-endpoint benchmark online \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
---dataset tests/datasets/dummy_1k.pkl \
+--dataset tests/datasets/dummy_1k.jsonl \
--load-pattern concurrency \
--concurrency 32
```
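The `--load-pattern concurrency` mode above holds a fixed number of requests in flight rather than targeting a QPS rate; that is essentially a semaphore around the request loop. A sketch with assumed names (the real load generator is multi-process and not shown here):

```python
import asyncio


async def run_with_concurrency(num_requests: int, concurrency: int):
    """Complete num_requests tasks with at most `concurrency` in flight."""
    sem = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def one_request(i: int) -> int:
        nonlocal in_flight, peak
        async with sem:  # waits while `concurrency` requests are in flight
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # stand-in for the real HTTP call
            in_flight -= 1
            return i

    results = await asyncio.gather(*(one_request(i) for i in range(num_requests)))
    return peak, len(results)
```

As each request finishes, the semaphore releases a slot and the next queued request starts, so load stays pinned at the configured concurrency (32 in the command above).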
@@ -310,7 +310,7 @@ inference-endpoint benchmark online \
- Use `-v` for INFO logging, `-vv` for DEBUG
- Echo server mirrors prompts back - perfect for quick testing without real inference
- Press `Ctrl+C` to gracefully interrupt benchmarks
-- Default test dataset: `tests/datasets/dummy_1k.pkl` (1000 samples, ~133 KB)
+- Default test dataset: `tests/datasets/dummy_1k.jsonl` (1000 samples)

**Advanced:**

2 changes: 1 addition & 1 deletion examples/02_ServerBenchmarking/README.md
@@ -49,7 +49,7 @@ enroot start -e HF_TOKEN=$HF_TOKEN -m $HF_HOME:/root/.cache/huggingface vllm+vll
Once the server is up and running, we can send requests to the endpoint by passing in the endpoint address via `-e` as well as the model name

```
-inference-endpoint benchmark offline -e http://localhost:8000 -d tests/datasets/dummy_1k.pkl --model ${MODEL_NAME}
+inference-endpoint benchmark offline -e http://localhost:8000 -d tests/datasets/dummy_1k.jsonl --model ${MODEL_NAME}
```

# Using a config file
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -94,7 +94,7 @@ test = [
"Pympler==1.1",
"scipy==1.16.3",
# HTTP server and client for mock server fixture
-"aiohttp==3.13.3",
+"aiohttp==3.13.4",
# Plotting for benchmark sweep mode
"matplotlib==3.10.8",
]