
Conversation

@mwxely (Collaborator) commented Jan 19, 2026

Summary

This PR adds baseline comparison capabilities to lmms-eval, enabling users to statistically compare model performance against a baseline using paired t-test analysis.

Motivation

Problem: Checking whether two models' confidence intervals overlap is a low-power way to compare them.

Solution: Paired test — compute the per-question difference $d_i = \text{score}_{A,i} - \text{score}_{B,i}$, then test whether $\text{mean}(d) \neq 0$.

Why: Pairing removes the variance due to question difficulty (the dominant noise source) and isolates the signal from the model difference.
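
For reference, a minimal sketch of such a paired test in Python. This is not the `paired_ttest()` this PR adds to `lmms_eval/api/metrics.py`; the function name, signature, and return shape below are illustrative, and the scipy-free branch mirrors the normal-approximation fallback mentioned in the commit messages.

```python
# Minimal sketch of a paired t-test over per-question scores.
# Not the paired_ttest() added in lmms_eval/api/metrics.py.
import math
from typing import List, Tuple

def paired_ttest_sketch(scores_a: List[float], scores_b: List[float]) -> Tuple[float, float, Tuple[float, float]]:
    """Return (mean difference, two-sided p-value, 95% CI) for paired per-question scores."""
    assert len(scores_a) == len(scores_b) and len(scores_a) > 1
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    se = math.sqrt(var_d / n)
    t_stat = mean_d / se if se > 0 else 0.0
    try:
        from scipy import stats
        p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
        t_crit = stats.t.ppf(0.975, df=n - 1)
    except ImportError:
        # Normal approximation when scipy is unavailable
        p_value = math.erfc(abs(t_stat) / math.sqrt(2))
        t_crit = 1.96
    ci = (mean_d - t_crit * se, mean_d + t_crit * se)
    return mean_d, p_value, ci
```

The pairing is what buys the power: per-question difficulty cancels inside each $d_i$, so only the model difference and residual noise remain.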

Baseline-Anchored Evaluation

A practical application of paired comparison: anchor evaluations to a standard baseline model.

| Approach | Report | Assessment |
|---|---|---|
| Absolute score | "Our model: 78.3%" | Meaningless without context |
| Leaderboard rank | "Top-3 on MMMU" | Rank doesn't quantify the gap |
| Paired difference | "+2.1% vs Gemini 3.0 Pro (p<0.01)" | Statistically grounded claim |

Benefits

  • Reproducible claims: "We beat baseline X by Y%" is verifiable
  • Training signal: Track improvement over baseline across checkpoints
  • Publication-ready: Statistical significance replaces hand-waving

Changes

New CLI Parameter

```bash
lmms-eval --model xxx --tasks videomme --baseline qwen25vl
```

Three ways to specify baseline:

| Format | Example | Description |
|---|---|---|
| Preset | `qwen25vl` | Auto-match task from registry |
| Local path | `/path/to/results.jsonl` | Local JSONL file |
| HF URL | `hf://user/repo/file.jsonl` | HuggingFace dataset |
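
For illustration only, a rough sketch of how these three formats could be distinguished. The actual resolution logic lives in `lmms_eval/baselines/loader.py` and may differ; `classify_baseline_arg` and its behavior here are hypothetical.

```python
# Hypothetical dispatch over the three supported baseline formats.
# Not the actual loader.py implementation.
import os

def classify_baseline_arg(arg: str, registry: dict) -> str:
    """Return which of the three baseline formats `arg` appears to be."""
    if arg.startswith("hf://"):
        return "hf_url"      # e.g. hf://user/repo/file.jsonl -> HuggingFace dataset file
    if arg in registry:
        return "preset"      # e.g. qwen25vl -> resolved per task from the registry
    if arg.endswith(".jsonl") and os.path.exists(arg):
        return "local_path"  # e.g. /path/to/results.jsonl
    raise ValueError(f"Unrecognized baseline argument: {arg}")
```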

New Output Columns

| Column | Description | Example |
|---|---|---|
| Baseline | Baseline identifier | `qwen25vl` |
| Diff | Score difference | +2.5% |
| CI | 95% confidence interval | [+0.8%, +4.2%] |
| P_Value | Statistical significance | 0.023* |

JSON Output Fields

```json
{
  "paired_baseline": "qwen25vl",
  "paired_baseline_score": 62.5,
  "paired_ci_lower": 0.8,
  "paired_ci_upper": 4.2,
  "paired_pvalue": 0.023
}
```
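
A small hypothetical example of consuming these fields downstream. It assumes the `paired_*` keys sit at the level shown above; in practice the lmms-eval results JSON is organized per task, so the lookup path may need adjusting.

```python
# Hypothetical consumer of the paired_* fields; the location of this dict
# inside the full output file is an assumption.
import json

with open("results.json") as f:
    results = json.load(f)

pvalue = results.get("paired_pvalue")
if pvalue is not None:
    baseline = results["paired_baseline"]
    lo, hi = results["paired_ci_lower"], results["paired_ci_upper"]
    marker = "*" if pvalue < 0.05 else ""
    print(f"vs {baseline}: 95% CI [{lo:+.1f}, {hi:+.1f}], p={pvalue:.3f}{marker}")
```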

Files Changed

| File | Changes |
|---|---|
| `lmms_eval/__main__.py` | Add `--baseline` CLI parameter |
| `lmms_eval/api/metrics.py` | Add `paired_ttest()` function |
| `lmms_eval/evaluator.py` | Integrate baseline comparison logic |
| `lmms_eval/evaluator_utils.py` | Add `compute_baseline_comparison()` helper |
| `lmms_eval/utils.py` | Display Baseline/Diff/CI/P_Value columns, auto-hide if N/A |
| `lmms_eval/baselines/__init__.py` | New module exports |
| `lmms_eval/baselines/registry.py` | Baseline preset registry (model × task) |
| `lmms_eval/baselines/loader.py` | Load from local/HF/registry |

Test Results

  • Smoke test with --baseline qwen25vl (preset name)
  • Smoke test with --baseline /path/to/local.jsonl (local path)
  • Smoke test with --baseline hf://user/repo/file.jsonl (HF URL)
  • Local cicd test

Terminal Table Output

(screenshot omitted)

JSON Output

(screenshot omitted)

CICD Test

(screenshot omitted)

Implement paired t-test statistical analysis:
- Calculate mean difference and standard error
- Compute 95% confidence interval
- Return t-statistic and p-value
- Fall back to normal approximation when scipy unavailable

New module for baseline management:
- registry.py: Model × task preset registry structure
- loader.py: Load baselines from local/HF/registry
- Support hf://user/repo/file.jsonl URL format

Add helper function to compute paired t-test comparison:
- Wrap paired_ttest with baseline metadata
- Calculate baseline and current mean scores

Add baseline comparison logic to simple_evaluate():
- Load baseline data from registry/local/HF
- Match samples by doc_id and extract scores
- Compute paired t-test and store results with paired_ prefix
- Add get_baseline_display_name() for short display names

Update make_table() for baseline comparison display:
- Add Baseline/Diff/CI/P_Value columns
- Auto-hide columns when all values are N/A
- Dynamically compute Diff from current score and baseline
- Format p-value with * for significance (p < 0.05)

Add CLI parameter to specify baseline for paired t-test:
- Support preset name (e.g., qwen25vl)
- Support local JSONL path
- Support HuggingFace URL (hf://user/repo/file.jsonl)
@mwxely force-pushed the feat/paired-ttest branch from fbeed7f to 850917b on January 19, 2026 17:00
@mwxely requested review from Luodian and kcz358 on January 19, 2026 17:05
Comment on lines 329 to 330:

```python
from lmms_eval.baselines import BASELINE_REGISTRY, load_baseline
from lmms_eval.evaluator_utils import compute_baseline_comparison
```

Collaborator:
Put the imports at the top of the file.

Comment on lines 333 to 354:

```python
def get_baseline_display_name(baseline_arg: str) -> str:
    """Extract a short display name from baseline argument."""
    # Handle model:task format (e.g., qwen25vl:mmbench)
    if ":" in baseline_arg and not baseline_arg.startswith("hf://"):
        model_name, task = baseline_arg.split(":", 1)
        if model_name in BASELINE_REGISTRY:
            return model_name  # Just show model name
    # Handle model preset (e.g., qwen25vl)
    if baseline_arg in BASELINE_REGISTRY:
        return baseline_arg
    # Handle HF URL
    if baseline_arg.startswith("hf://"):
        # hf://user/repo/file.jsonl -> user/repo
        parts = baseline_arg[5:].split("/")
        return "/".join(parts[:2]) if len(parts) >= 2 else baseline_arg
    # Handle local path
    if "/" in baseline_arg or "\\" in baseline_arg:
        import os

        filename = os.path.basename(baseline_arg)
        return os.path.splitext(filename)[0][:30]  # Truncate to 30 chars
    return baseline_arg
```

Collaborator:
This should go into baselines as a util, or into some file related to baselines. The inline function seems ugly.

Comment on lines 371 to 384:

```python
if "score" in key.lower():
    val = sample[key]
    if isinstance(val, (int, float)):
        current_scores.append(float(val))
        baseline_scores.append(baseline_scores_dict[doc_id])
        break
    elif isinstance(val, dict):
        pred = val.get("pred_answer") or val.get("pred")
        ans = val.get("answer") or val.get("target")
        if pred and ans:
            score = 1.0 if str(pred).strip().upper() == str(ans).strip().upper() else 0.0
            current_scores.append(score)
            baseline_scores.append(baseline_scores_dict[doc_id])
            break
```

Collaborator:
Maybe we should use score_key to extract the score? It's probably not a good idea to do another score calculation here.

@kcz358 (Collaborator) commented Jan 20, 2026

Should this PR be merged to main or to v0.6? I am not sure about that.

@kcz358 (Collaborator) commented Jan 20, 2026

The score calculation seems a bit hardcoded and quite inflexible. Maybe use the score key we created in the previous PR to retrieve the score and calculate the test. If it does not exist, we should skip printing this out.

Move baseline-related imports from inside function to module level,
following Python best practices for import organization.

Extract inline function to baselines/__init__.py for better code
organization. The function is now exported and can be imported from
lmms_eval.baselines.

- Get score_key from task config instead of hardcoded "score" lookup
- Simplify score extraction logic by using score_key directly
- Skip baseline comparison gracefully when no valid scores found
- Add debug logging when skipping tasks due to missing scores

The score extraction now falls back to searching for fields ending with
"_score" (e.g., videomme_perception_score) when the exact score_key is
not found. This handles task-specific score field naming patterns.
@mwxely (Collaborator, Author) commented Jan 22, 2026

> The score calculation seems a bit hardcoded and quite inflexible. Maybe use the score key we created in the previous PR to retrieve the score and calculate the test. If it does not exist, we should skip printing this out.

@kcz358 Thanks for the review! All issues have been addressed:

Commits added:

  1. 386a8922 - Move imports to top of file
  2. 602289ad - Move get_baseline_display_name to baselines/__init__.py
  3. 04227551 - Use score_key from task config for score extraction
  4. 3193f17e - Add fallback for *_score fields (e.g., videomme_perception_score)

Changes:

  • Imports moved from inside function to module level
  • Inline function extracted to baselines module for better organization
  • Score extraction now uses score_key from task config (default: "score")
  • Falls back to searching *_score fields for task-specific naming patterns
  • Skips baseline comparison gracefully when no valid scores found (logs debug info instead of failing)

Tested:

  • All 3 baseline input formats work: Local path, HF URL (hf://...), Preset (qwen25vl)
  • Terminal table shows Baseline/Diff/CI/P_Value columns
  • JSON output includes paired_* fields
  • Unit tests pass
  • CICD tests pass

@kcz358 merged commit f095899 into main on Jan 23, 2026 (3 checks passed).
@kcz358 deleted the feat/paired-ttest branch on January 23, 2026 01:43.
@kcz358 (Collaborator) commented Jan 23, 2026

A small feature RFC suggestion here: it might be interesting to see whether we can do this better in the future. We possibly need one score key per metric. We could either RFC this in the metric list or allow the score key to be a dict.

mwxely added a commit that referenced this pull request Jan 23, 2026
Merge origin/main into feat/power-analysis, keeping both:
- Power analysis CLI args and function (this branch)
- Baseline/num_samples args and paired_ttest function (from PR #1006)