Changes from all commits (29 commits)
- f79e2db "Extend openaimsgspec to bridge sglang endpoint compatibility" (attafosu, Mar 4, 2026)
- d909fac "Add sglang-specific preset for llama3.1-8b" (attafosu, Mar 6, 2026)
- ccac968 "fix tcp warmup (#153)" (viraatc, Mar 4, 2026)
- 519a8ac "[fix] Handle case with string response (#155)" (arekay-nv, Mar 9, 2026)
- c2a230f "Add sglang-specific preset for llama3.1-8b" (attafosu, Mar 6, 2026)
- fb19769 "chore(http-client): cleanup types, improve coverage, remove orjson …" (viraatc, Mar 9, 2026)
- f3c6a58 "feat: optimize zmq receive (#131)" (viraatc, Mar 9, 2026)
- 479f6a8 "feat: msgspec optimizations, docs (#74)" (viraatc, Mar 9, 2026)
- 7e25f8f "chore(http-client): full test-coverage 1/2 (#162)" (viraatc, Mar 10, 2026)
- dd3cb49 "chore: optimize http-template (#165)" (viraatc, Mar 10, 2026)
- 733307d "docs: add AGENTS.md with AI coding guidelines (#166)" (nvzhihanj, Mar 12, 2026)
- a6ae983 "Fix example (#167)" (arekay-nv, Mar 12, 2026)
- 157de98 "feat: add http-client design doc (#163)" (viraatc, Mar 12, 2026)
- c980856 "[fix] fixes max_duration integration (#143)" (arekay-nv, Mar 13, 2026)
- 249d21a "Add sglang endpoint/client example" (attafosu, Mar 16, 2026)
- d25196e "fix: Fix failed request count in report (#169)" (arekay-nv, Mar 16, 2026)
- 98641f4 "Merge branch 'main' into feat/attafosu/sglang-openai-api-compatibility" (attafosu, Mar 16, 2026)
- 41e8023 "Merge branch 'main' into feat/attafosu/sglang-openai-api-compatibility" (attafosu, Mar 19, 2026)
- 89ea457 "feat: Enable unit tests for dataset presets (#194)" (attafosu, Mar 20, 2026)
- 61e8585 "fix pre-commit" (attafosu, Mar 20, 2026)
- 3e866a4 "Add unit tests for Harmonize modes and callers" (attafosu, Mar 20, 2026)
- d65fafa "Update 8B sglang endpoing example" (attafosu, Mar 20, 2026)
- 45379ae "Enhance tests for 8b sglang preset" (attafosu, Mar 20, 2026)
- ea4d4f6 "Minor fix" (attafosu, Mar 20, 2026)
- d43f8f8 "Merge branch 'main' into feat/attafosu/sglang-openai-api-compatibility" (attafosu, Mar 23, 2026)
- 4721925 "Update README.md: Remove redundant port mapping" (attafosu, Mar 24, 2026)
- 68ea45a "Merge branch 'main' into feat/attafosu/sglang-openai-api-compatibility" (attafosu, Mar 24, 2026)
- 86156dd "Merge branch 'main' into feat/attafosu/sglang-openai-api-compatibility" (attafosu, Mar 25, 2026)
- 7d8c495 "Merge branch 'main' into feat/attafosu/sglang-openai-api-compatibility" (arekay-nv, Apr 6, 2026)
96 changes: 96 additions & 0 deletions DATASET_PRESET_TESTING.md
@@ -0,0 +1,96 @@
# Dataset Preset Testing

Unit tests for dataset preset transforms. These tests verify that presets correctly transform dataset columns without requiring end-to-end benchmark runs.

## Quick Start

```bash
# Run all preset tests
pytest tests/unit/dataset_manager/test_dataset_presets.py -v

# Run tests for a specific dataset
pytest tests/unit/dataset_manager/test_dataset_presets.py::TestCNNDailyMailPresets -v

# Exclude slow tests (Harmonize transform requires transformers)
pytest tests/unit/dataset_manager/test_dataset_presets.py -m "not slow" -v
```

> **Copilot AI (Mar 23, 2026), on lines +14 to +15:** The note says slow tests are excluded because Harmonize "requires transformers", but transformers is already a core dependency in this repo; the main reason to mark these slow is usually that they can trigger tokenizer/model downloads and be network-dependent. Consider rewording to reflect that.
> **Copilot AI (Mar 24, 2026), on lines +14 to +16:** The note "Exclude slow tests (Harmonize transform requires transformers)" is a bit misleading since transformers is already a core dependency here; the main reason these tests are slow is typically tokenizer/model downloads and external network access. Consider rewording to reflect that the slow marker is about heavyweight downloads / network dependency.

## Preset Coverage

| Dataset       | Presets                         | Tests |
| ------------- | ------------------------------- | ----- |
| CNNDailyMail  | `llama3_8b`, `llama3_8b_sglang` | 6     |
| AIME25        | `gptoss`                        | 3     |
| GPQA          | `gptoss`                        | 3     |
| LiveCodeBench | `gptoss`                        | 3     |
| OpenOrca      | `llama2_70b`                    | 3     |

> **Copilot AI (Mar 23, 2026), on the CNNDailyMail row:** The CNNDailyMail row lists 6 tests, but `tests/unit/dataset_manager/test_dataset_presets.py` currently defines 5 tests for CNNDailyMail (3 for `llama3_8b` and 2 for `llama3_8b_sglang`). Suggested change: update the count to 5 to match the actual test file.

> **Copilot AI (Mar 24, 2026), on lines +20 to +26:** This table's CNNDailyMail test count appears incorrect. `test_dataset_presets.py` currently defines 5 tests under `TestCNNDailyMailPresets` (3 regular + 2 `@pytest.mark.slow`), not 6. Please update the count (or remove the numeric column) so the doc stays accurate.

## Adding Tests for New Presets

When adding a new dataset preset, add a test class to `tests/unit/dataset_manager/test_dataset_presets.py`:

```python
import pandas as pd
import pytest
from inference_endpoint.dataset_manager.transforms import apply_transforms
from inference_endpoint.dataset_manager.predefined.my_dataset import MyDataset


class TestMyDatasetPresets:
    @pytest.fixture
    def sample_data(self):
        """Minimal sample data matching dataset schema."""
        return pd.DataFrame({
            "input_col1": ["value1"],
            "input_col2": ["value2"],
        })

    @pytest.fixture
    def transformed_data(self, sample_data):
        """Apply preset transforms to sample data."""
        transforms = MyDataset.PRESETS.my_preset()
        return apply_transforms(sample_data, transforms)

    def test_my_preset_instantiation(self):
        """Verify preset can be created."""
        transforms = MyDataset.PRESETS.my_preset()
        assert transforms is not None
        assert len(transforms) > 0

    def test_my_preset_transforms_apply(self, transformed_data):
        """Verify transforms apply without errors."""
        assert transformed_data is not None
        assert "prompt" in transformed_data.columns  # Expected output column

    def test_my_preset_output_format(self, transformed_data):
        """Verify output has expected format."""
        # Validate format-specific expectations
        assert len(transformed_data["prompt"][0]) > 0
```

If the preset uses the `Harmonize` transform (which may download tokenizer/model files over the network), mark its tests slow:

```python
@pytest.mark.slow
def test_my_preset_transforms_apply(self, transformed_data):
    # Test that may download tokenizer/model files
    pass
```

## Test Scope

✅ **Tests verify:**

- Preset instantiation
- Transform application without errors
- Required output columns exist
- Data is properly transformed

❌ **Tests do NOT verify:**

- Model inference accuracy
- API endpoint compatibility
- Throughput/latency metrics
- Full benchmark runs

See `src/inference_endpoint/dataset_manager/README.md` for dataset schema and preset creation details.
65 changes: 55 additions & 10 deletions examples/05_Llama3.1-8B_Example/README.md
@@ -2,9 +2,9 @@

It is recommended to use a config file such as [online_llama3_8b_cnn.yaml](online_llama3_8b_cnn.yaml) to run the benchmark.

## [Optional] Download dataset
## Download dataset (Only needed if quantizing the model)

The Llama3.1-8B benchmark uses the [cnn/dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset (for summarization). If using a config (such as provided) to run the benchmark, the (validation) dataset is downloaded automatically by specifying dataset name as `- name: cnn_dailymail::llama3_8b` under the `dataset` entry.
The Llama3.1-8B benchmark uses the [cnn/dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset (for summarization). If using a config (such as provided) to run the benchmark, the (validation) dataset is downloaded automatically by specifying dataset name as `- name: cnn_dailymail::llama3_8b # or cnn_dailymail::llama3_8b_sglang` under the `dataset` entry.

For post-training quantization, users can use the [cnn-dailymail-calibration-list](https://github.com/mlcommons/inference/blob/v4.0/calibration/CNNDailyMail/calibration-list.txt) to select samples for the calibration.

@@ -15,6 +15,8 @@ python download_cnndm.py --save-dir data --calibration-ids-file calibration-list

## Launch the server

We provide instructions below for using either vLLM or SGLang endpoints.

The following environment variables are used by the commands below to make the scripts easier to run:

```
@@ -31,7 +33,7 @@ hf download $MODEL_NAME

The cached models can be verified with `hf cache scan`.

### [vLLM](https://github.com/vllm-project/vllm)
### [vLLM](https://github.com/vllm-project/vllm) (Using NVIDIA GPUs for demo)

**Note**: To generate same outputs as the ones produced from submissions with legacy loadgen, we need to apply a custom chat template (this is taken care of automatically by the cnn-dailymail dataset preset). The flag `--trust-request-chat-template` is also required for this behavior. **Security warning:** `--trust-request-chat-template` allows execution of request-provided chat templates and should only be used in trusted environments or when all requests are controlled by the benchmark harness/preset. Do not enable this flag on publicly exposed endpoints receiving untrusted traffic.

@@ -41,22 +43,65 @@ We can launch the latest docker image for vllm using the command below:
docker run --runtime nvidia --gpus all -v ${HF_HOME}:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model ${MODEL_NAME} --trust-request-chat-template
```

### To run Offline mode
### [SGLang](https://github.com/sgl-project/sglang)

- First, build the container and start the endpoint:

```
# Clone the SGLang repository
SGLANG_VER=3f9fc8b848365a5797a44856854e3e6f00a60dd0 # Latest tested
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker && git checkout $SGLANG_VER

# Build the docker image
docker build -t sglang-cpu:latest -f xeon.Dockerfile .

# Initiate a docker container
docker run -it --privileged --ipc=host --network=host -v /dev/shm:/dev/shm -v ~/.cache/huggingface:/root/.cache/huggingface -e "HF_TOKEN=<secret>" --name sglang-cpu-server sglang-cpu:latest /bin/bash

# Start sglang endpoint
docker exec -u root -w /workspace sglang-cpu-server /bin/bash -lc "python3 -m sglang.launch_server \
--model-path $MODEL_NAME \
--served-model-name meta-llama/Llama-3.1-8B-Instruct \
--dtype bfloat16 \
--device cpu \
--max-running-requests 64 \
--max-total-tokens 131072 \
--chunked-prefill-size 8192 \
--max-prefill-tokens 32768 \
--mem-fraction-static 0.9 \
--disable-piecewise-cuda-graph \
--disable-radix-cache \
--host 127.0.0.1 \
--port 8080 2>&1 | tee server.log"
```
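Before starting the benchmark, it can help to sanity-check that the endpoint is reachable. A minimal sketch, assuming the server exposes the OpenAI-compatible `/v1/models` route on the host/port used above (adjust if your deployment differs):

```python
# Quick reachability check for the endpoint started above.
import json
import urllib.request


def models_url(host: str = "127.0.0.1", port: int = 8080) -> str:
    """Build the OpenAI-compatible model-listing URL for the endpoint."""
    return f"http://{host}:{port}/v1/models"


def check_endpoint(url: str) -> list[str]:
    """Return the model ids reported by the server, or raise on failure."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload.get("data", [])]


if __name__ == "__main__":
    # Should list the served model name once the server has finished loading.
    print(check_endpoint(models_url()))
```

If this fails, check `server.log` from the launch command before debugging the benchmark config.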

## Start benchmark

Make sure the [`inference-endpoint`](https://github.com/mlcommons/endpoints/tree/main?tab=readme-ov-file#installation) package is installed and its environment activated.

**Note:** Double-check the config file for correct parameters, such as the model name.

- Launch the benchmark with config yaml
- Launch the benchmark with the config YAML (for a performance-only run, remove the accuracy dataset entry from `online_llama3_8b_cnn.yaml`)

### vLLM endpoint targets

- To run Offline mode

```
inference-endpoint benchmark from-config -c offline_llama3_8b_cnn.yaml --timeout 600
inference-endpoint benchmark from-config -c offline_llama3_8b_cnn.yaml
```

### To run Online mode
- To run Online mode

**Note** Double-check the config file for correct parameters
```
inference-endpoint benchmark from-config -c online_llama3_8b_cnn.yaml
```

- Launch the benchmark with config yaml (For performance only, remove the accuracy dataset entry in the `online_llama3_8b_cnn.yaml`)
### SGLang endpoint targets

- To run the offline benchmark:

```
inference-endpoint benchmark from-config -c online_llama3_8b_cnn.yaml --timeout 600
inference-endpoint benchmark from-config -c offline_llama3_8b_cnn_sglang_api.yaml
```
@@ -0,0 +1,56 @@
# Offline Throughput Benchmark
name: "offline-llama3.1-8b-cnn-benchmark"
version: "1.0"
type: "offline"

model_params:
  name: "meta-llama/Llama-3.1-8B-Instruct" # Path to the model
  temperature: 0.0
  top_p: 1.0
  max_new_tokens: 128

datasets:
  - name: cnn_dailymail::llama3_8b_sglang
    type: accuracy
    samples: 13368
    parser:
      input: prompt
    accuracy_config:
      eval_method: "rouge"
      ground_truth: "highlights"
      extractor: identity_extractor
    num_repeats: 1
> **Copilot AI (Apr 6, 2026), on lines +16 to +22:** This example config sets `parser: {input: prompt}` for a predefined dataset (`cnn_dailymail::...`). Predefined datasets ignore `parser` in `DataLoaderFactory.create_loader`, and `input` isn't a supported parser target in the CLI validation (only `prompt`/`system`). To avoid confusion, remove this `parser` section (or switch to `parser.prompt=...` only when loading custom datasets from files).
  - name: cnn_dailymail::llama3_8b_sglang
    type: "performance"
    samples: 13368
    parser:
      input: prompt

settings:
  runtime:
    min_duration_ms: 60000 # 1 minute
    max_duration_ms: 360000 # 6 minutes (Arbitrary here, and doesn't have counterpart in legacy loadgen)
    scheduler_random_seed: 137 # For Poisson/distribution sampling
    dataloader_random_seed: 111 # For dataset shuffling (Will be updated after rng seeds are finalized for submission)
    n_samples_to_issue: 13368 # Number of samples to issue (for offline, this should match the dataset samples)

  load_pattern:
    type: "max_throughput"

  client:
    workers: 4 # Number of client workers

metrics:
  collect:
    - "throughput"
    - "latency"
    - "ttft"
    - "tpot"

endpoint_config:
  endpoints:
    - "http://localhost:8080"
  api_type: "sglang"
  api_key: null

report_dir: logs/llama3_8b_cnn_sglang_offline # Directory to save the benchmark report
@@ -20,6 +20,7 @@

from inference_endpoint.dataset_manager.transforms import (
    AddStaticColumns,
    Harmonize,
    Transform,
    UserPromptFormatter,
)
@@ -48,3 +49,28 @@ def llama3_8b(
        ),
        AddStaticColumns(chat_template),
    ]


def llama3_8b_sglang(
    stream: bool = True,
    max_new_tokens: int = 128,
    temperature: float = 0.0,
    top_p: float = 1.0,
    top_k: int = 1,
    tokenizer_name: str = "meta-llama/Llama-3.1-8B-Instruct",
) -> list[Transform]:
    return [
        # Step 1: Format the prompt from "article"
        UserPromptFormatter(
            user_prompt_format=f"Summarize the following news article in {max_new_tokens} tokens. Please output the summary only, without any other text.\n\nArticle:\n{{article}}\n\nSummary:",
            output_column="prompt",
        ),
        # Step 2: Tokenize the raw prompt via Harmonize in plain mode.
        Harmonize(
            tokenizer_name=tokenizer_name,
            prompt_column="prompt",
            tokenized_column="input_tokens",
            harmonized_column=None,
            mode="plain",
        ),
    ]
> **Collaborator, on lines +69 to +76:** Can you clarify why we need Harmonize with `plain` here for a Llama model? Harmonization only works with the gpt-oss models as far as I know, so using a Harmonize transform here is a bit confusing.

> **Collaborator (author):** Here we just want to use the Harmonizer to generate the tokenized inputs (`input_tokens`, needed by the sglang API). The `plain` mode is introduced to ensure no chat templates or other processing are applied to the input prompt (as would otherwise be the case in the "harmony" mode: `src/inference_endpoint/dataset_manager/transforms.py::process_row()` -> `src/inference_endpoint/openai/harmony.py::harmony()`).
>
> I could also add a new transform, say a Tokenizer transform, to do just that (generating tokenized inputs), but I only wanted to refactor existing implementations wherever possible. If that sounds more straightforward I can leave the Harmonizer as is and instead add the tokenizing transform.
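For discussion's sake, the alternative mentioned above might look roughly like this. This is a hypothetical `Tokenize` transform, not code from this PR; the `process_row` contract and column names are assumed to match the existing transforms:

```python
from typing import Any


class Tokenize:
    """Hypothetical transform: fill `tokenized_column` from `prompt_column`
    using a plain tokenizer, with no chat template applied."""

    def __init__(
        self,
        tokenizer: Any,  # e.g. transformers.AutoTokenizer.from_pretrained(...)
        prompt_column: str = "prompt",
        tokenized_column: str = "input_tokens",
    ):
        self.tokenizer = tokenizer
        self.prompt_column = prompt_column
        self.tokenized_column = tokenized_column

    def process_row(self, row: dict[str, Any]) -> dict[str, Any]:
        # Leave pre-tokenized rows untouched, mirroring the guard in Harmonize.
        if row.get(self.tokenized_column) is not None:
            return row
        row[self.tokenized_column] = self.tokenizer.encode(
            row[self.prompt_column], add_special_tokens=False
        )
        return row
```

This would keep presets free of Harmonizer machinery when all they need is raw tokenization.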

19 changes: 18 additions & 1 deletion src/inference_endpoint/dataset_manager/transforms.py
@@ -137,6 +137,7 @@ def __init__(
        prompt_column: str = "prompt",
        tokenized_column: str = "input_tokens",
        harmonized_column: str | None = "harmonized_prompt",
        mode: str = "harmony",
    ):
        """Initialize the Harmonize transform.

@@ -149,10 +150,14 @@
            tokenized_column: The name of the column containing the tokenized prompt.
            harmonized_column: The name of the column containing the harmonized prompt. If None,
                the harmonized prompt will not be stored as text.
            mode: "harmony" to render a Harmony conversation; "plain" to tokenize the raw prompt.
        """
        self.prompt_column = prompt_column
        self.tokenized_column = tokenized_column
        self.harmonized_column = harmonized_column
        self.mode = mode
        if self.mode not in {"harmony", "plain"}:
            raise ValueError(f"Invalid harmonize mode: {self.mode}")
        self.harmonizer = Harmonizer(
            tokenizer_name=tokenizer_name,
            encoding_name=encoding_name,
> **Copilot AI (Mar 20, 2026), on lines +158 to 163:** In `mode="plain"`, `process_row()` only needs a HuggingFace tokenizer, but `__init__` still constructs a full Harmonizer, which eagerly loads the Harmony encoding and builds the system message. This adds unnecessary overhead/dependencies for the plain-tokenization path. Consider making Harmonizer lazily load the encoding/system message only when `mode=="harmony"`, or use AutoTokenizer directly in plain mode.
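The lazy construction suggested here could be sketched as below. The class and factory names are illustrative, not the repo's API; the real `Harmonizer` would be the factory's return value:

```python
from typing import Any, Callable


class LazyHarmonize:
    """Sketch: defer building the expensive harmonizer until first use,
    so mode="plain" never pays for the Harmony encoding/system message."""

    def __init__(self, harmonizer_factory: Callable[[], Any], mode: str = "harmony"):
        if mode not in {"harmony", "plain"}:
            raise ValueError(f"Invalid harmonize mode: {mode}")
        self.mode = mode
        self._factory = harmonizer_factory
        self._harmonizer = None

    @property
    def harmonizer(self) -> Any:
        # Built on first access only; "plain" rows that never touch
        # harmony-specific methods never trigger construction.
        if self._harmonizer is None:
            self._harmonizer = self._factory()
        return self._harmonizer
```

The same effect could be had with `functools.cached_property`; the explicit property just makes the deferral visible.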
@@ -175,7 +180,19 @@ def process_row(self, row: dict[str, Any]) -> dict[str, Any]:
        Returns:
            Row dictionary with the harmonized prompt added
        """
        row[self.tokenized_column] = self.harmonizer(row[self.prompt_column])
        # Guard pre-tokenized rows: the SGLang adapter adds a default Harmonize
        # (GPT-OSS tokenizer + harmony mode). When row processors are fused, the
        # dataframe-level skip is bypassed, so without this guard, adapter
        # Harmonize would overwrite input tokens. Alternative: remove Harmonize
        # from the adapter transforms and require each SGLang preset to add its
        # own Harmonize with the desired tokenizer/args.
        if self.tokenized_column in row and row[self.tokenized_column] is not None:
            return row
> **Copilot AI (Mar 9, 2026), on lines +189 to +190:** Returning early when input_tokens is present skips populating harmonized_column even when it's configured (non-None). If callers rely on the text harmonized prompt for debugging/logging, consider still computing harmonized_column from the existing tokens (without overwriting tokens), or update the docstring/behavior to make it explicit that the column may not be produced when tokens are pre-generated.
        if self.mode == "plain":
            tokens = self.harmonizer.to_tokens(row[self.prompt_column])
            row[self.tokenized_column] = tokens
        else:
            row[self.tokenized_column] = self.harmonizer(row[self.prompt_column])
> **Copilot AI (Mar 19, 2026), on lines 140 to +195:** New mode behavior and the row-level skip logic are not covered by unit tests. Since tests/unit/dataset_manager/test_transforms.py covers other transforms in this module, consider adding focused tests for: (1) invalid mode raising ValueError, (2) adapter+preset fused execution not overwriting pre-existing input_tokens, and (3) plain vs harmony mode behavior (can be done by mocking Harmonizer to avoid network/tokenizer downloads).
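The skip logic flagged above can be exercised without any tokenizer download by isolating the guard behind a callable. A sketch of such a test; the free function mirrors `process_row`'s early return and is not the repo's code:

```python
def guarded_tokenize(row, tokenize, prompt_column="prompt", tokenized_column="input_tokens"):
    """Mirror of the early-return guard: never overwrite existing tokens."""
    if tokenized_column in row and row[tokenized_column] is not None:
        return row
    row[tokenized_column] = tokenize(row[prompt_column])
    return row


def test_pretokenized_rows_are_not_overwritten():
    row = {"prompt": "abc", "input_tokens": [7, 8]}
    out = guarded_tokenize(row, tokenize=lambda s: [0])
    assert out["input_tokens"] == [7, 8]


def test_missing_tokens_are_filled():
    out = guarded_tokenize({"prompt": "abc"}, tokenize=lambda s: [1, 2, 3])
    assert out["input_tokens"] == [1, 2, 3]
```

An equivalent test against the real `Harmonize` would pass a stub in place of the tokenizer to keep it offline.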
        if self.harmonized_column is not None:
> **Copilot AI (Mar 24, 2026), on lines +183 to 196:** In process_row, the early return when input_tokens is already present also skips populating harmonized_column (when configured). This makes it impossible to keep preset-provided tokens while still emitting harmonized_prompt text. Consider skipping only the tokenization step (avoid overwriting input_tokens), but still fill harmonized_column if it's set and missing (or validate it matches the existing tokens).
            row[self.harmonized_column] = self.harmonizer.to_text(
                row[self.tokenized_column]
6 changes: 3 additions & 3 deletions src/inference_endpoint/openai/types.py
@@ -112,7 +112,7 @@ class ChatCompletionResponseMessage(

    role: str
    content: str | None
    refusal: str | None
    refusal: str | None = None


class ChatCompletionChoice(
@@ -149,5 +149,5 @@ class ChatCompletionResponse(
    created: int
    model: str
    choices: list[ChatCompletionChoice]
    usage: CompletionUsage | None
    system_fingerprint: str | None
    usage: CompletionUsage | None = None
    system_fingerprint: str | None = None