
Conversation

@mwxely (Collaborator) commented Jan 19, 2026

Summary

This PR adds baseline comparison capabilities to lmms-eval, enabling users to statistically compare model performance against a baseline using paired t-test analysis.

Motivation

Problem: Checking whether two models' confidence intervals overlap is a low-power way to compare them.

Solution: Paired test — compute the per-question difference $d_i = \text{score}_{A,i} - \text{score}_{B,i}$, then test whether $\text{mean}(d) \neq 0$.

Why: Pairing removes the variance due to question difficulty (the dominant noise source) and isolates the signal from the model difference.
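
For reference, a minimal sketch of such a paired test in Python. This is not the `paired_ttest()` this PR adds to `lmms_eval/api/metrics.py`; the function name, signature, and return shape below are illustrative, and the scipy-free branch mirrors the normal-approximation fallback mentioned in the commit messages.

```python
# Minimal sketch of a paired t-test over per-question scores.
# Not the paired_ttest() added in lmms_eval/api/metrics.py.
import math
from typing import List, Tuple

def paired_ttest_sketch(scores_a: List[float], scores_b: List[float]) -> Tuple[float, float, Tuple[float, float]]:
    """Return (mean difference, two-sided p-value, 95% CI) for paired per-question scores."""
    assert len(scores_a) == len(scores_b) and len(scores_a) > 1
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    se = math.sqrt(var_d / n)
    t_stat = mean_d / se if se > 0 else 0.0
    try:
        from scipy import stats
        p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
        t_crit = stats.t.ppf(0.975, df=n - 1)
    except ImportError:
        # Normal approximation when scipy is unavailable
        p_value = math.erfc(abs(t_stat) / math.sqrt(2))
        t_crit = 1.96
    ci = (mean_d - t_crit * se, mean_d + t_crit * se)
    return mean_d, p_value, ci
```

The pairing is what buys the power: per-question difficulty cancels inside each $d_i$, so only the model difference and residual noise remain.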

Baseline-Anchored Evaluation

A practical application of paired comparison: anchor evaluations to a standard baseline model.

| Approach | Report | Assessment |
|---|---|---|
| Absolute score | "Our model: 78.3%" | Meaningless without context |
| Leaderboard rank | "Top-3 on MMMU" | Rank doesn't quantify the gap |
| Paired difference | "+2.1% vs Gemini 3.0 Pro (p<0.01)" | Statistically grounded claim |

Benefits

  • Reproducible claims: "We beat baseline X by Y%" is verifiable
  • Training signal: Track improvement over baseline across checkpoints
  • Publication-ready: Statistical significance replaces hand-waving

Changes

New CLI Parameter

```bash
lmms-eval --model xxx --tasks videomme --baseline qwen25vl
```

Three ways to specify baseline:

| Format | Example | Description |
|---|---|---|
| Preset | `qwen25vl` | Auto-match task from registry |
| Local path | `/path/to/results.jsonl` | Local JSONL file |
| HF URL | `hf://user/repo/file.jsonl` | HuggingFace dataset |
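
For illustration only, a rough sketch of how these three formats could be distinguished. The actual resolution logic lives in `lmms_eval/baselines/loader.py` and may differ; `classify_baseline_arg` and its behavior here are hypothetical.

```python
# Hypothetical dispatch over the three supported baseline formats.
# Not the actual loader.py implementation.
import os

def classify_baseline_arg(arg: str, registry: dict) -> str:
    """Return which of the three baseline formats `arg` appears to be."""
    if arg.startswith("hf://"):
        return "hf_url"      # e.g. hf://user/repo/file.jsonl -> HuggingFace dataset file
    if arg in registry:
        return "preset"      # e.g. qwen25vl -> resolved per task from the registry
    if arg.endswith(".jsonl") and os.path.exists(arg):
        return "local_path"  # e.g. /path/to/results.jsonl
    raise ValueError(f"Unrecognized baseline argument: {arg}")
```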

New Output Columns

| Column | Description | Example |
|---|---|---|
| Baseline | Baseline identifier | `qwen25vl` |
| Diff | Score difference | +2.5% |
| CI | 95% confidence interval | [+0.8%, +4.2%] |
| P_Value | Statistical significance | 0.023* |

JSON Output Fields

```json
{
  "paired_baseline": "qwen25vl",
  "paired_baseline_score": 62.5,
  "paired_ci_lower": 0.8,
  "paired_ci_upper": 4.2,
  "paired_pvalue": 0.023
}
```
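
A small hypothetical example of consuming these fields downstream. It assumes the `paired_*` keys sit at the level shown above; in practice the lmms-eval results JSON is organized per task, so the lookup path may need adjusting.

```python
# Hypothetical consumer of the paired_* fields; the location of this dict
# inside the full output file is an assumption.
import json

with open("results.json") as f:
    results = json.load(f)

pvalue = results.get("paired_pvalue")
if pvalue is not None:
    baseline = results["paired_baseline"]
    lo, hi = results["paired_ci_lower"], results["paired_ci_upper"]
    marker = "*" if pvalue < 0.05 else ""
    print(f"vs {baseline}: 95% CI [{lo:+.1f}, {hi:+.1f}], p={pvalue:.3f}{marker}")
```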

Files Changed

| File | Changes |
|---|---|
| `lmms_eval/__main__.py` | Add `--baseline` CLI parameter |
| `lmms_eval/api/metrics.py` | Add `paired_ttest()` function |
| `lmms_eval/evaluator.py` | Integrate baseline comparison logic |
| `lmms_eval/evaluator_utils.py` | Add `compute_baseline_comparison()` helper |
| `lmms_eval/utils.py` | Display Baseline/Diff/CI/P_Value columns, auto-hide if N/A |
| `lmms_eval/baselines/__init__.py` | New module exports |
| `lmms_eval/baselines/registry.py` | Baseline preset registry (model × task) |
| `lmms_eval/baselines/loader.py` | Load from local/HF/registry |

Test Results

  • Smoke test with --baseline qwen25vl (preset name)
  • Smoke test with --baseline /path/to/local.jsonl (local path)
  • Smoke test with --baseline hf://user/repo/file.jsonl (HF URL)
  • Local cicd test

Terminal Table Output

(screenshot omitted)

JSON Output

(screenshot omitted)

CICD Test

(screenshot omitted)

Implement paired t-test statistical analysis:
- Calculate mean difference and standard error
- Compute 95% confidence interval
- Return t-statistic and p-value
- Fall back to normal approximation when scipy unavailable

New module for baseline management:
- registry.py: Model × task preset registry structure
- loader.py: Load baselines from local/HF/registry
- Support hf://user/repo/file.jsonl URL format

Add helper function to compute paired t-test comparison:
- Wrap paired_ttest with baseline metadata
- Calculate baseline and current mean scores

Add baseline comparison logic to simple_evaluate():
- Load baseline data from registry/local/HF
- Match samples by doc_id and extract scores
- Compute paired t-test and store results with paired_ prefix
- Add get_baseline_display_name() for short display names

Update make_table() for baseline comparison display:
- Add Baseline/Diff/CI/P_Value columns
- Auto-hide columns when all values are N/A
- Dynamically compute Diff from current score and baseline
- Format p-value with * for significance (p < 0.05)

Add CLI parameter to specify baseline for paired t-test:
- Support preset name (e.g., qwen25vl)
- Support local JSONL path
- Support HuggingFace URL (hf://user/repo/file.jsonl)
@mwxely force-pushed the feat/paired-ttest branch from fbeed7f to 850917b on January 19, 2026 17:00
@mwxely requested review from Luodian and kcz358 on January 19, 2026 17:05
Comment on lines 329 to 330:

```python
from lmms_eval.baselines import BASELINE_REGISTRY, load_baseline
from lmms_eval.evaluator_utils import compute_baseline_comparison
```

Collaborator:
Put the imports at the top of the file.

Comment on lines 333 to 354:

```python
def get_baseline_display_name(baseline_arg: str) -> str:
    """Extract a short display name from baseline argument."""
    # Handle model:task format (e.g., qwen25vl:mmbench)
    if ":" in baseline_arg and not baseline_arg.startswith("hf://"):
        model_name, task = baseline_arg.split(":", 1)
        if model_name in BASELINE_REGISTRY:
            return model_name  # Just show model name
    # Handle model preset (e.g., qwen25vl)
    if baseline_arg in BASELINE_REGISTRY:
        return baseline_arg
    # Handle HF URL
    if baseline_arg.startswith("hf://"):
        # hf://user/repo/file.jsonl -> user/repo
        parts = baseline_arg[5:].split("/")
        return "/".join(parts[:2]) if len(parts) >= 2 else baseline_arg
    # Handle local path
    if "/" in baseline_arg or "\\" in baseline_arg:
        import os

        filename = os.path.basename(baseline_arg)
        return os.path.splitext(filename)[0][:30]  # Truncate to 30 chars
    return baseline_arg
```

Collaborator:
This should go into baselines as a util, or into some file related to baselines. The inline function seems ugly.

Comment on lines 371 to 384:

```python
if "score" in key.lower():
    val = sample[key]
    if isinstance(val, (int, float)):
        current_scores.append(float(val))
        baseline_scores.append(baseline_scores_dict[doc_id])
        break
    elif isinstance(val, dict):
        pred = val.get("pred_answer") or val.get("pred")
        ans = val.get("answer") or val.get("target")
        if pred and ans:
            score = 1.0 if str(pred).strip().upper() == str(ans).strip().upper() else 0.0
            current_scores.append(score)
            baseline_scores.append(baseline_scores_dict[doc_id])
            break
```

Collaborator:
Maybe we should use score_key to extract the score? It's probably not a good idea to do another score calculation here.

@kcz358 (Collaborator) commented Jan 20, 2026

Should this PR be merged to main or to v0.6? I am not sure about that.

@kcz358 (Collaborator) commented Jan 20, 2026

The score calculation seems a bit hardcoded and quite inflexible. Maybe use the score key we created in the previous PR to retrieve the score and calculate the test. If it does not exist, we should skip printing this out.

Move baseline-related imports from inside function to module level,
following Python best practices for import organization.

Extract inline function to baselines/__init__.py for better code
organization. The function is now exported and can be imported from
lmms_eval.baselines.

- Get score_key from task config instead of hardcoded "score" lookup
- Simplify score extraction logic by using score_key directly
- Skip baseline comparison gracefully when no valid scores found
- Add debug logging when skipping tasks due to missing scores

The score extraction now falls back to searching for fields ending with
"_score" (e.g., videomme_perception_score) when the exact score_key is
not found. This handles task-specific score field naming patterns.
@mwxely (Collaborator, Author) commented Jan 22, 2026

> The score calculation seems a bit hardcoded and quite inflexible. Maybe use the score key we created in the previous PR to retrieve the score and calculate the test. If it does not exist, we should skip printing this out.

@kcz358 Thanks for the review! All issues have been addressed:

Commits added:

  1. 386a8922 - Move imports to top of file
  2. 602289ad - Move get_baseline_display_name to baselines/__init__.py
  3. 04227551 - Use score_key from task config for score extraction
  4. 3193f17e - Add fallback for *_score fields (e.g., videomme_perception_score)

Changes:

  • Imports moved from inside function to module level
  • Inline function extracted to baselines module for better organization
  • Score extraction now uses score_key from task config (default: "score")
  • Falls back to searching *_score fields for task-specific naming patterns
  • Skips baseline comparison gracefully when no valid scores found (logs debug info instead of failing)

Tested:

  • All 3 baseline input formats work: Local path, HF URL (hf://...), Preset (qwen25vl)
  • Terminal table shows Baseline/Diff/CI/P_Value columns
  • JSON output includes paired_* fields
  • Unit tests pass
  • CICD tests pass

@kcz358 merged commit f095899 into main on Jan 23, 2026 (3 checks passed).
@kcz358 deleted the feat/paired-ttest branch on January 23, 2026 01:43.
@kcz358 (Collaborator) commented Jan 23, 2026

A small feature RFC suggestion here: it might be interesting to see whether we can do this better in the future. We possibly need one score key per metric. We could either RFC this in the metric list or allow the score key to be a dict.

mwxely added a commit that referenced this pull request Jan 23, 2026
Merge origin/main into feat/power-analysis, keeping both:
- Power analysis CLI args and function (this branch)
- Baseline/num_samples args and paired_ttest function (from PR #1006)