Changes from all commits (29 commits)
- f79e2db "Extend openaimsgspec to bridge sglang endpoint compatibility" (attafosu, Mar 4, 2026)
- d909fac "Add sglang-specific preset for llama3.1-8b" (attafosu, Mar 6, 2026)
- ccac968 "fix tcp warmup (#153)" (viraatc, Mar 4, 2026)
- 519a8ac "[fix] Handle case with string response (#155)" (arekay-nv, Mar 9, 2026)
- c2a230f "Add sglang-specific preset for llama3.1-8b" (attafosu, Mar 6, 2026)
- fb19769 "chore(http-client): cleanup types, improve coverage, remove orjson …" (viraatc, Mar 9, 2026)
- f3c6a58 "feat: optimize zmq receive (#131)" (viraatc, Mar 9, 2026)
- 479f6a8 "feat: msgspec optimizations, docs (#74)" (viraatc, Mar 9, 2026)
- 7e25f8f "chore(http-client): full test-coverage 1/2 (#162)" (viraatc, Mar 10, 2026)
- dd3cb49 "chore: optimize http-template (#165)" (viraatc, Mar 10, 2026)
- 733307d "docs: add AGENTS.md with AI coding guidelines (#166)" (nvzhihanj, Mar 12, 2026)
- a6ae983 "Fix example (#167)" (arekay-nv, Mar 12, 2026)
- 157de98 "feat: add http-client design doc (#163)" (viraatc, Mar 12, 2026)
- c980856 "[fix] fixes max_duration integration (#143)" (arekay-nv, Mar 13, 2026)
- 249d21a "Add sglang endpoint/client example" (attafosu, Mar 16, 2026)
- d25196e "fix: Fix failed request count in report (#169)" (arekay-nv, Mar 16, 2026)
- 98641f4 "Merge branch 'main' into feat/attafosu/sglang-openai-api-compatibility" (attafosu, Mar 16, 2026)
- 41e8023 "Merge branch 'main' into feat/attafosu/sglang-openai-api-compatibility" (attafosu, Mar 19, 2026)
- 89ea457 "feat: Enable unit tests for dataset presets (#194)" (attafosu, Mar 20, 2026)
- 61e8585 "fix pre-commit" (attafosu, Mar 20, 2026)
- 3e866a4 "Add unit tests for Harmonize modes and callers" (attafosu, Mar 20, 2026)
- d65fafa "Update 8B sglang endpoing example" (attafosu, Mar 20, 2026)
- 45379ae "Enhance tests for 8b sglang preset" (attafosu, Mar 20, 2026)
- ea4d4f6 "Minor fix" (attafosu, Mar 20, 2026)
- d43f8f8 "Merge branch 'main' into feat/attafosu/sglang-openai-api-compatibility" (attafosu, Mar 23, 2026)
- 4721925 "Update README.md: Remove redundant port mapping" (attafosu, Mar 24, 2026)
- 68ea45a "Merge branch 'main' into feat/attafosu/sglang-openai-api-compatibility" (attafosu, Mar 24, 2026)
- 86156dd "Merge branch 'main' into feat/attafosu/sglang-openai-api-compatibility" (attafosu, Mar 25, 2026)
- 7d8c495 "Merge branch 'main' into feat/attafosu/sglang-openai-api-compatibility" (arekay-nv, Apr 6, 2026)
96 changes: 96 additions & 0 deletions DATASET_PRESET_TESTING.md
@@ -0,0 +1,96 @@
# Dataset Preset Testing

Unit tests for dataset preset transforms. These tests verify that presets correctly transform dataset columns without requiring end-to-end benchmark runs.

## Quick Start

```bash
# Run all preset tests
pytest tests/unit/dataset_manager/test_dataset_presets.py -v

# Run tests for a specific dataset
pytest tests/unit/dataset_manager/test_dataset_presets.py::TestCNNDailyMailPresets -v

# Exclude slow tests (Harmonize transform requires transformers)
pytest tests/unit/dataset_manager/test_dataset_presets.py -m "not slow" -v
```

> **Copilot AI (Mar 23, 2026), on lines +14 to +15:** The note says slow tests are excluded because Harmonize "requires transformers", but transformers is already a core dependency in this repo; the main reason to mark these slow is usually that they can trigger tokenizer/model downloads and be network-dependent. Consider rewording to reflect that.
> **Copilot AI (Mar 24, 2026), on lines +14 to +16:** The note "Exclude slow tests (Harmonize transform requires transformers)" is a bit misleading since transformers is already a core dependency here; the main reason these tests are slow is typically tokenizer/model downloads and external network access. Consider rewording to reflect that the slow marker is about heavyweight downloads / network dependency.

## Preset Coverage

| Dataset       | Presets                         | Tests |
| ------------- | ------------------------------- | ----- |
| CNNDailyMail  | `llama3_8b`, `llama3_8b_sglang` | 6     |
| AIME25        | `gptoss`                        | 3     |
| GPQA          | `gptoss`                        | 3     |
| LiveCodeBench | `gptoss`                        | 3     |
| OpenOrca      | `llama2_70b`                    | 3     |

> **Copilot AI (Mar 23, 2026), on the CNNDailyMail row:** The CNNDailyMail row lists 6 tests, but `tests/unit/dataset_manager/test_dataset_presets.py` currently defines 5 tests for CNNDailyMail (3 for `llama3_8b` and 2 for `llama3_8b_sglang`). Suggested change: update the count to 5 to match the actual test file.

> **Copilot AI (Mar 24, 2026), on lines +20 to +26:** This table's CNNDailyMail test count appears incorrect. `test_dataset_presets.py` currently defines 5 tests under `TestCNNDailyMailPresets` (3 regular + 2 `@pytest.mark.slow`), not 6. Please update the count (or remove the numeric column) so the doc stays accurate.

## Adding Tests for New Presets

When adding a new dataset preset, add a test class to `tests/unit/dataset_manager/test_dataset_presets.py`:

```python
import pandas as pd
import pytest
from inference_endpoint.dataset_manager.transforms import apply_transforms
from inference_endpoint.dataset_manager.predefined.my_dataset import MyDataset


class TestMyDatasetPresets:
    @pytest.fixture
    def sample_data(self):
        """Minimal sample data matching dataset schema."""
        return pd.DataFrame({
            "input_col1": ["value1"],
            "input_col2": ["value2"],
        })

    @pytest.fixture
    def transformed_data(self, sample_data):
        """Apply preset transforms to sample data."""
        transforms = MyDataset.PRESETS.my_preset()
        return apply_transforms(sample_data, transforms)

    def test_my_preset_instantiation(self):
        """Verify preset can be created."""
        transforms = MyDataset.PRESETS.my_preset()
        assert transforms is not None
        assert len(transforms) > 0

    def test_my_preset_transforms_apply(self, transformed_data):
        """Verify transforms apply without errors."""
        assert transformed_data is not None
        assert "prompt" in transformed_data.columns  # Expected output column

    def test_my_preset_output_format(self, transformed_data):
        """Verify output has expected format."""
        # Validate format-specific expectations
        assert len(transformed_data["prompt"][0]) > 0
```

If the preset uses the `Harmonize` transform (which may download tokenizer/model files over the network), mark its tests slow:

```python
@pytest.mark.slow
def test_my_preset_transforms_apply(self, transformed_data):
    # Test that may download tokenizer/model files
    pass
```

## Test Scope

✅ **Tests verify:**

- Preset instantiation
- Transform application without errors
- Required output columns exist
- Data is properly transformed

❌ **Tests do NOT verify:**

- Model inference accuracy
- API endpoint compatibility
- Throughput/latency metrics
- Full benchmark runs

See `src/inference_endpoint/dataset_manager/README.md` for dataset schema and preset creation details.
65 changes: 55 additions & 10 deletions examples/05_Llama3.1-8B_Example/README.md
@@ -2,9 +2,9 @@

It is recommended to use a config file such as [online_llama3_8b_cnn.yaml](online_llama3_8b_cnn.yaml) to run the benchmark.

## [Optional] Download dataset
## Download dataset (Only needed if quantizing the model)

The Llama3.1-8B benchmark uses the [cnn/dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset (for summarization). If using a config (such as provided) to run the benchmark, the (validation) dataset is downloaded automatically by specifying dataset name as `- name: cnn_dailymail::llama3_8b` under the `dataset` entry.
The Llama3.1-8B benchmark uses the [cnn/dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset (for summarization). If using a config (such as provided) to run the benchmark, the (validation) dataset is downloaded automatically by specifying dataset name as `- name: cnn_dailymail::llama3_8b # or cnn_dailymail::llama3_8b_sglang` under the `dataset` entry.

For post-training quantization, users can use the [cnn-dailymail-calibration-list](https://github.com/mlcommons/inference/blob/v4.0/calibration/CNNDailyMail/calibration-list.txt) to select samples for the calibration.

@@ -15,6 +15,8 @@ python download_cnndm.py --save-dir data --calibration-ids-file calibration-list

## Launch the server

We provide instructions below for using either vLLM or SGLang endpoints.

The following environment variables are used by the commands below to make the scripts easier to run:

```
@@ -31,7 +33,7 @@ hf download $MODEL_NAME

The cached models can be verified with `hf cache scan`.

### [vLLM](https://github.com/vllm-project/vllm)
### [vLLM](https://github.com/vllm-project/vllm) (Using NVIDIA GPUs for demo)

**Note**: To generate same outputs as the ones produced from submissions with legacy loadgen, we need to apply a custom chat template (this is taken care of automatically by the cnn-dailymail dataset preset). The flag `--trust-request-chat-template` is also required for this behavior. **Security warning:** `--trust-request-chat-template` allows execution of request-provided chat templates and should only be used in trusted environments or when all requests are controlled by the benchmark harness/preset. Do not enable this flag on publicly exposed endpoints receiving untrusted traffic.

@@ -41,22 +43,65 @@ We can launch the latest docker image for vllm using the command below:
docker run --runtime nvidia --gpus all -v ${HF_HOME}:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model ${MODEL_NAME} --trust-request-chat-template
```

### To run Offline mode
### [SGLang](https://github.com/sgl-project/sglang)

- First, build the container and start the endpoint:

```
# Clone the SGLang repository
SGLANG_VER=3f9fc8b848365a5797a44856854e3e6f00a60dd0 # Latest tested
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker && git checkout $SGLANG_VER

# Build the docker image
docker build -t sglang-cpu:latest -f xeon.Dockerfile .

# Initiate a docker container
docker run -it --privileged --ipc=host --network=host -v /dev/shm:/dev/shm -v ~/.cache/huggingface:/root/.cache/huggingface -e "HF_TOKEN=<secret>" --name sglang-cpu-server sglang-cpu:latest /bin/bash

# Start sglang endpoint
docker exec -u root -w /workspace sglang-cpu-server /bin/bash -lc "python3 -m sglang.launch_server \
--model-path $MODEL_NAME \
--served-model-name meta-llama/Llama-3.1-8B-Instruct \
--dtype bfloat16 \
--device cpu \
--max-running-requests 64 \
--max-total-tokens 131072 \
--chunked-prefill-size 8192 \
--max-prefill-tokens 32768 \
--mem-fraction-static 0.9 \
--disable-piecewise-cuda-graph \
--disable-radix-cache \
--host 127.0.0.1 \
--port 8080 2>&1 | tee server.log"
```
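Before starting the benchmark, it can help to sanity-check that the endpoint is reachable. A minimal sketch, assuming the server exposes the OpenAI-compatible `/v1/models` route on the host/port used above (adjust if your deployment differs):

```python
# Quick reachability check for the endpoint started above.
import json
import urllib.request


def models_url(host: str = "127.0.0.1", port: int = 8080) -> str:
    """Build the OpenAI-compatible model-listing URL for the endpoint."""
    return f"http://{host}:{port}/v1/models"


def check_endpoint(url: str) -> list[str]:
    """Return the model ids reported by the server, or raise on failure."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload.get("data", [])]


if __name__ == "__main__":
    # Should list the served model name once the server has finished loading.
    print(check_endpoint(models_url()))
```

If this fails, check `server.log` from the launch command before debugging the benchmark config.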

## Start benchmark

Make sure the [`inference-endpoint`](https://github.com/mlcommons/endpoints/tree/main?tab=readme-ov-file#installation) package is installed and its environment activated.

**Note:** Double-check the config file for correct parameters, such as the model name.

- Launch the benchmark with config yaml
- Launch the benchmark with the config YAML (for a performance-only run, remove the accuracy dataset entry from `online_llama3_8b_cnn.yaml`)

### vLLM endpoint targets

- To run Offline mode

```
inference-endpoint benchmark from-config -c offline_llama3_8b_cnn.yaml --timeout 600
inference-endpoint benchmark from-config -c offline_llama3_8b_cnn.yaml
```

### To run Online mode
- To run Online mode

**Note** Double-check the config file for correct parameters
```
inference-endpoint benchmark from-config -c online_llama3_8b_cnn.yaml
```

- Launch the benchmark with config yaml (For performance only, remove the accuracy dataset entry in the `online_llama3_8b_cnn.yaml`)
### SGLang endpoint targets

- To run the offline benchmark:

```
inference-endpoint benchmark from-config -c online_llama3_8b_cnn.yaml --timeout 600
inference-endpoint benchmark from-config -c offline_llama3_8b_cnn_sglang_api.yaml
```
@@ -0,0 +1,56 @@
# Offline Throughput Benchmark
name: "offline-llama3.1-8b-cnn-benchmark"
version: "1.0"
type: "offline"

model_params:
  name: "meta-llama/Llama-3.1-8B-Instruct" # Path to the model
  temperature: 0.0
  top_p: 1.0
  max_new_tokens: 128

datasets:
  - name: cnn_dailymail::llama3_8b_sglang
    type: accuracy
    samples: 13368
    parser:
      input: prompt
    accuracy_config:
      eval_method: "rouge"
      ground_truth: "highlights"
      extractor: identity_extractor
    num_repeats: 1
> **Copilot AI (Apr 6, 2026), on lines +16 to +22:** This example config sets `parser: {input: prompt}` for a predefined dataset (`cnn_dailymail::...`). Predefined datasets ignore `parser` in `DataLoaderFactory.create_loader`, and `input` isn't a supported parser target in the CLI validation (only `prompt`/`system`). To avoid confusion, remove this `parser` section (or switch to `parser.prompt=...` only when loading custom datasets from files).
  - name: cnn_dailymail::llama3_8b_sglang
    type: "performance"
    samples: 13368
    parser:
      input: prompt

settings:
  runtime:
    min_duration_ms: 60000 # 1 minute
    max_duration_ms: 360000 # 6 minutes (Arbitrary here, and doesn't have counterpart in legacy loadgen)
    scheduler_random_seed: 137 # For Poisson/distribution sampling
    dataloader_random_seed: 111 # For dataset shuffling (Will be updated after rng seeds are finalized for submission)
    n_samples_to_issue: 13368 # Number of samples to issue (for offline, this should match the dataset samples)

  load_pattern:
    type: "max_throughput"

  client:
    workers: 4 # Number of client workers

metrics:
  collect:
    - "throughput"
    - "latency"
    - "ttft"
    - "tpot"

endpoint_config:
  endpoints:
    - "http://localhost:8080"
  api_type: "sglang"
  api_key: null

report_dir: logs/llama3_8b_cnn_sglang_offline # Directory to save the benchmark report
@@ -20,6 +20,7 @@

from inference_endpoint.dataset_manager.transforms import (
    AddStaticColumns,
    Harmonize,
    Transform,
    UserPromptFormatter,
)
@@ -48,3 +49,28 @@ def llama3_8b(
        ),
        AddStaticColumns(chat_template),
    ]


def llama3_8b_sglang(
    stream: bool = True,
    max_new_tokens: int = 128,
    temperature: float = 0.0,
    top_p: float = 1.0,
    top_k: int = 1,
    tokenizer_name: str = "meta-llama/Llama-3.1-8B-Instruct",
) -> list[Transform]:
    return [
        # Step 1: Format the prompt from "article"
        UserPromptFormatter(
            user_prompt_format=f"Summarize the following news article in {max_new_tokens} tokens. Please output the summary only, without any other text.\n\nArticle:\n{{article}}\n\nSummary:",
            output_column="prompt",
        ),
        # Step 2: Tokenize the raw prompt via Harmonize in plain mode.
        Harmonize(
            tokenizer_name=tokenizer_name,
            prompt_column="prompt",
            tokenized_column="input_tokens",
            harmonized_column=None,
            mode="plain",
        ),
    ]
> **Collaborator, on lines +69 to +76:** Can you clarify why we need Harmonize with `plain` here for a Llama model? Harmonization only works with the gpt-oss models as far as I know, so using a Harmonize transform here is a bit confusing.

> **Collaborator (author):** Here we just want to use the Harmonizer to generate the tokenized inputs (`input_tokens`, needed by the sglang API). The `plain` mode is introduced to ensure no chat templates or other processing are applied to the input prompt (as would otherwise be the case in the "harmony" mode: `src/inference_endpoint/dataset_manager/transforms.py::process_row()` -> `src/inference_endpoint/openai/harmony.py::harmony()`).
>
> I could also add a new transform, say a Tokenizer transform, to do just that (generating tokenized inputs), but I only wanted to refactor existing implementations wherever possible. If that sounds more straightforward I can leave the Harmonizer as is and instead add the tokenizing transform.
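For discussion's sake, the alternative mentioned above might look roughly like this. This is a hypothetical `Tokenize` transform, not code from this PR; the `process_row` contract and column names are assumed to match the existing transforms:

```python
from typing import Any


class Tokenize:
    """Hypothetical transform: fill `tokenized_column` from `prompt_column`
    using a plain tokenizer, with no chat template applied."""

    def __init__(
        self,
        tokenizer: Any,  # e.g. transformers.AutoTokenizer.from_pretrained(...)
        prompt_column: str = "prompt",
        tokenized_column: str = "input_tokens",
    ):
        self.tokenizer = tokenizer
        self.prompt_column = prompt_column
        self.tokenized_column = tokenized_column

    def process_row(self, row: dict[str, Any]) -> dict[str, Any]:
        # Leave pre-tokenized rows untouched, mirroring the guard in Harmonize.
        if row.get(self.tokenized_column) is not None:
            return row
        row[self.tokenized_column] = self.tokenizer.encode(
            row[self.prompt_column], add_special_tokens=False
        )
        return row
```

This would keep presets free of Harmonizer machinery when all they need is raw tokenization.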

19 changes: 18 additions & 1 deletion src/inference_endpoint/dataset_manager/transforms.py
@@ -137,6 +137,7 @@ def __init__(
        prompt_column: str = "prompt",
        tokenized_column: str = "input_tokens",
        harmonized_column: str | None = "harmonized_prompt",
        mode: str = "harmony",
    ):
        """Initialize the Harmonize transform.

@@ -149,10 +150,14 @@
            tokenized_column: The name of the column containing the tokenized prompt.
            harmonized_column: The name of the column containing the harmonized prompt. If None,
                the harmonized prompt will not be stored as text.
            mode: "harmony" to render a Harmony conversation; "plain" to tokenize the raw prompt.
        """
        self.prompt_column = prompt_column
        self.tokenized_column = tokenized_column
        self.harmonized_column = harmonized_column
        self.mode = mode
        if self.mode not in {"harmony", "plain"}:
            raise ValueError(f"Invalid harmonize mode: {self.mode}")
        self.harmonizer = Harmonizer(
            tokenizer_name=tokenizer_name,
            encoding_name=encoding_name,
> **Copilot AI (Mar 20, 2026), on lines +158 to 163:** In `mode="plain"`, `process_row()` only needs a HuggingFace tokenizer, but `__init__` still constructs a full Harmonizer, which eagerly loads the Harmony encoding and builds the system message. This adds unnecessary overhead/dependencies for the plain-tokenization path. Consider making Harmonizer lazily load the encoding/system message only when `mode=="harmony"`, or use AutoTokenizer directly in plain mode.
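The lazy construction suggested here could be sketched as below. The class and factory names are illustrative, not the repo's API; the real `Harmonizer` would be the factory's return value:

```python
from typing import Any, Callable


class LazyHarmonize:
    """Sketch: defer building the expensive harmonizer until first use,
    so mode="plain" never pays for the Harmony encoding/system message."""

    def __init__(self, harmonizer_factory: Callable[[], Any], mode: str = "harmony"):
        if mode not in {"harmony", "plain"}:
            raise ValueError(f"Invalid harmonize mode: {mode}")
        self.mode = mode
        self._factory = harmonizer_factory
        self._harmonizer = None

    @property
    def harmonizer(self) -> Any:
        # Built on first access only; "plain" rows that never touch
        # harmony-specific methods never trigger construction.
        if self._harmonizer is None:
            self._harmonizer = self._factory()
        return self._harmonizer
```

The same effect could be had with `functools.cached_property`; the explicit property just makes the deferral visible.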
@@ -175,7 +180,19 @@ def process_row(self, row: dict[str, Any]) -> dict[str, Any]:
        Returns:
            Row dictionary with the harmonized prompt added
        """
        row[self.tokenized_column] = self.harmonizer(row[self.prompt_column])
        # Guard pre-tokenized rows: the SGLang adapter adds a default Harmonize
        # (GPT-OSS tokenizer + harmony mode). When row processors are fused, the
        # dataframe-level skip is bypassed, so without this guard, adapter
        # Harmonize would overwrite input tokens. Alternative: remove Harmonize
        # from the adapter transforms and require each SGLang preset to add its
        # own Harmonize with the desired tokenizer/args.
        if self.tokenized_column in row and row[self.tokenized_column] is not None:
            return row
> **Copilot AI (Mar 9, 2026), on lines +189 to +190:** Returning early when input_tokens is present skips populating harmonized_column even when it's configured (non-None). If callers rely on the text harmonized prompt for debugging/logging, consider still computing harmonized_column from the existing tokens (without overwriting tokens), or update the docstring/behavior to make it explicit that the column may not be produced when tokens are pre-generated.
        if self.mode == "plain":
            tokens = self.harmonizer.to_tokens(row[self.prompt_column])
            row[self.tokenized_column] = tokens
        else:
            row[self.tokenized_column] = self.harmonizer(row[self.prompt_column])
> **Copilot AI (Mar 19, 2026), on lines 140 to +195:** New mode behavior and the row-level skip logic are not covered by unit tests. Since tests/unit/dataset_manager/test_transforms.py covers other transforms in this module, consider adding focused tests for: (1) invalid mode raising ValueError, (2) adapter+preset fused execution not overwriting pre-existing input_tokens, and (3) plain vs harmony mode behavior (can be done by mocking Harmonizer to avoid network/tokenizer downloads).
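The skip logic flagged above can be exercised without any tokenizer download by isolating the guard behind a callable. A sketch of such a test; the free function mirrors `process_row`'s early return and is not the repo's code:

```python
def guarded_tokenize(row, tokenize, prompt_column="prompt", tokenized_column="input_tokens"):
    """Mirror of the early-return guard: never overwrite existing tokens."""
    if tokenized_column in row and row[tokenized_column] is not None:
        return row
    row[tokenized_column] = tokenize(row[prompt_column])
    return row


def test_pretokenized_rows_are_not_overwritten():
    row = {"prompt": "abc", "input_tokens": [7, 8]}
    out = guarded_tokenize(row, tokenize=lambda s: [0])
    assert out["input_tokens"] == [7, 8]


def test_missing_tokens_are_filled():
    out = guarded_tokenize({"prompt": "abc"}, tokenize=lambda s: [1, 2, 3])
    assert out["input_tokens"] == [1, 2, 3]
```

An equivalent test against the real `Harmonize` would pass a stub in place of the tokenizer to keep it offline.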
        if self.harmonized_column is not None:
> **Copilot AI (Mar 24, 2026), on lines +183 to 196:** In process_row, the early return when input_tokens is already present also skips populating harmonized_column (when configured). This makes it impossible to keep preset-provided tokens while still emitting harmonized_prompt text. Consider skipping only the tokenization step (avoid overwriting input_tokens), but still fill harmonized_column if it's set and missing (or validate it matches the existing tokens).
            row[self.harmonized_column] = self.harmonizer.to_text(
                row[self.tokenized_column]
6 changes: 3 additions & 3 deletions src/inference_endpoint/openai/types.py
@@ -112,7 +112,7 @@ class ChatCompletionResponseMessage(

    role: str
    content: str | None
    refusal: str | None
    refusal: str | None = None


class ChatCompletionChoice(
@@ -149,5 +149,5 @@ class ChatCompletionResponse(
    created: int
    model: str
    choices: list[ChatCompletionChoice]
    usage: CompletionUsage | None
    system_fingerprint: str | None
    usage: CompletionUsage | None = None
    system_fingerprint: str | None = None