[feat] Add baseline comparison with paired t-test #1006
Conversation
Implement paired t-test statistical analysis:
- Calculate mean difference and standard error
- Compute 95% confidence interval
- Return t-statistic and p-value
- Fallback to normal approximation when scipy unavailable

New module for baseline management:
- registry.py: Model × task preset registry structure
- loader.py: Load baselines from local/HF/registry
- Support hf://user/repo/file.jsonl URL format

Add helper function to compute paired t-test comparison:
- Wrap paired_ttest with baseline metadata
- Calculate baseline and current mean scores

Add baseline comparison logic to simple_evaluate():
- Load baseline data from registry/local/HF
- Match samples by doc_id and extract scores
- Compute paired t-test and store results with paired_ prefix
- Add get_baseline_display_name() for short display names

Update make_table() for baseline comparison display:
- Add Baseline/Diff/CI/P_Value columns
- Auto-hide columns when all values are N/A
- Dynamically compute Diff from current score and baseline
- Format p-value with * for significance (p < 0.05)

Add CLI parameter to specify baseline for paired t-test:
- Support preset name (e.g., qwen25vl)
- Support local JSONL path
- Support HuggingFace URL (hf://user/repo/file.jsonl)
Force-pushed from fbeed7f to 850917b
Force-pushed from 850917b to e8313c9
lmms_eval/evaluator.py
Outdated
```python
from lmms_eval.baselines import BASELINE_REGISTRY, load_baseline
from lmms_eval.evaluator_utils import compute_baseline_comparison
```
Put the imports at the top.
lmms_eval/evaluator.py
Outdated
```python
def get_baseline_display_name(baseline_arg: str) -> str:
    """Extract a short display name from baseline argument."""
    # Handle model:task format (e.g., qwen25vl:mmbench)
    if ":" in baseline_arg and not baseline_arg.startswith("hf://"):
        model_name, task = baseline_arg.split(":", 1)
        if model_name in BASELINE_REGISTRY:
            return model_name  # Just show model name
    # Handle model preset (e.g., qwen25vl)
    if baseline_arg in BASELINE_REGISTRY:
        return baseline_arg
    # Handle HF URL
    if baseline_arg.startswith("hf://"):
        # hf://user/repo/file.jsonl -> user/repo
        parts = baseline_arg[5:].split("/")
        return "/".join(parts[:2]) if len(parts) >= 2 else baseline_arg
    # Handle local path
    if "/" in baseline_arg or "\\" in baseline_arg:
        import os

        filename = os.path.basename(baseline_arg)
        return os.path.splitext(filename)[0][:30]  # Truncate to 30 chars
    return baseline_arg
```
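For illustration only (assuming qwen25vl is a preset key in BASELINE_REGISTRY), the rules above would yield:

```python
get_baseline_display_name("qwen25vl:mmbench")           # -> "qwen25vl"
get_baseline_display_name("hf://user/repo/file.jsonl")  # -> "user/repo"
get_baseline_display_name("/path/to/results.jsonl")     # -> "results"
```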
This should be put into baselines as a util, or in some other file related to baselines; defining it inline in evaluator.py seems ugly.
lmms_eval/evaluator.py
Outdated
| if "score" in key.lower(): | ||
| val = sample[key] | ||
| if isinstance(val, (int, float)): | ||
| current_scores.append(float(val)) | ||
| baseline_scores.append(baseline_scores_dict[doc_id]) | ||
| break | ||
| elif isinstance(val, dict): | ||
| pred = val.get("pred_answer") or val.get("pred") | ||
| ans = val.get("answer") or val.get("target") | ||
| if pred and ans: | ||
| score = 1.0 if str(pred).strip().upper() == str(ans).strip().upper() else 0.0 | ||
| current_scores.append(score) | ||
| baseline_scores.append(baseline_scores_dict[doc_id]) | ||
| break |
Maybe we should use score_key to extract the score? It's probably not a good idea to do another score calculation here.
Should this PR be merged to main or v0.6? I am not sure about that.
The score calculation seems a bit hardcoded and quite inflexible. Maybe use the score_key we created in the previous PR to retrieve the score and compute the test; if it does not exist, we should skip printing this comparison.
Move baseline-related imports from inside function to module level, following Python best practices for import organization.
Extract inline function to baselines/__init__.py for better code organization. The function is now exported and can be imported from lmms_eval.baselines.
- Get score_key from task config instead of hardcoded "score" lookup
- Simplify score extraction logic by using score_key directly
- Skip baseline comparison gracefully when no valid scores found
- Add debug logging when skipping tasks due to missing scores
The score extraction now falls back to searching for fields ending with "_score" (e.g., videomme_perception_score) when the exact score_key is not found. This handles task-specific score field naming patterns.
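Based on the two commit descriptions above, the extraction step might look roughly like the sketch below; the helper name and surrounding variables are illustrative assumptions, not the exact code in lmms_eval/evaluator.py.

```python
def extract_sample_score(sample: dict, score_key: str):
    """Sketch: pull a numeric score via the task's score_key, with a *_score fallback."""
    # Preferred path: the exact score_key from the task config
    val = sample.get(score_key)
    if isinstance(val, (int, float)):
        return float(val)
    # Fallback: task-specific fields ending in "_score" (e.g., videomme_perception_score)
    for key, fallback_val in sample.items():
        if key.endswith("_score") and isinstance(fallback_val, (int, float)):
            return float(fallback_val)
    # No valid score found: the caller skips the baseline comparison for this task
    return None
```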
@kcz358 Thanks for the review! All issues have been addressed: Commits added:
Changes:
Tested:
A small feature RFC suggestion here: it might be interesting to see if we can do this better in the future. We probably need one score key per metric; we could either RFC this in the metric list or allow the score key to be a dict.
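To make the suggestion concrete, a dict-valued score key could name one score field per metric; the shape below is purely hypothetical and not an existing config option.

```python
# Hypothetical shape for a dict-valued score_key: one score field per metric.
score_key = {
    "perception": "videomme_perception_score",  # field names here are illustrative
    "overall": "score",
}
```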
Merge origin/main into feat/power-analysis, keeping both:
- Power analysis CLI args and function (this branch)
- Baseline/num_samples args and paired_ttest function (from PR #1006)
Summary
This PR adds baseline comparison capabilities to lmms-eval, enabling users to statistically compare model performance against a baseline using paired t-test analysis.
Motivation
Problem: Checking whether two models' confidence intervals overlap is a low-power way to detect a real difference.
Solution: Paired test. Compute the per-question difference $d_i = \text{score}_{A,i} - \text{score}_{B,i}$, then test whether $\text{mean}(d) \neq 0$.
Why: Pairing removes per-question difficulty variance (the dominant noise source) and isolates the signal of the model difference.
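For concreteness, a minimal sketch of such a paired t-test follows, covering the pieces named in the commits above (mean difference, standard error, 95% CI, t-statistic, p-value, and a normal-approximation fallback when scipy is unavailable). The function name and return format are illustrative and may differ from the actual paired_ttest() in lmms_eval/api/metrics.py.

```python
import math


def paired_ttest_sketch(scores_a, scores_b):
    """Two-sided paired t-test on per-question score differences (illustrative sketch)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Sample variance of the differences and standard error of their mean
    var = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    se = math.sqrt(var / n)
    t_stat = mean_diff / se  # assumes the differences are not all identical (se > 0)
    try:
        from scipy import stats

        pvalue = 2 * stats.t.sf(abs(t_stat), df=n - 1)  # two-sided p-value
        crit = stats.t.ppf(0.975, df=n - 1)             # 95% CI critical value
    except ImportError:
        # Normal approximation when scipy is unavailable
        pvalue = 2 * (1 - 0.5 * (1 + math.erf(abs(t_stat) / math.sqrt(2))))
        crit = 1.96
    return {
        "mean_diff": mean_diff,
        "t_stat": t_stat,
        "pvalue": pvalue,
        "ci_95": (mean_diff - crit * se, mean_diff + crit * se),
    }
```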
Baseline-Anchored Evaluation
A practical application of paired comparison: anchor evaluations to a standard baseline model.
Benefits
Changes
New CLI Parameter
Three ways to specify the baseline (a loading sketch follows the list):
- qwen25vl (preset name)
- /path/to/results.jsonl (local path)
- hf://user/repo/file.jsonl (HuggingFace URL)
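Dispatching over those three forms might look roughly like the sketch below. The real logic lives in lmms_eval/baselines/loader.py; the hf_hub_download call, the registry shape, and the JSONL field names here are assumptions for illustration.

```python
import json
import os

from lmms_eval.baselines import BASELINE_REGISTRY  # model x task preset registry


def load_baseline_scores(baseline_arg: str, task: str) -> dict:
    """Sketch: resolve a baseline argument to a {doc_id: score} mapping."""
    if baseline_arg.startswith("hf://"):
        # hf://user/repo/file.jsonl -> fetch file.jsonl from the user/repo dataset on the Hub
        user, repo, filename = baseline_arg[len("hf://"):].split("/", 2)
        from huggingface_hub import hf_hub_download  # assumed dependency

        path = hf_hub_download(repo_id=f"{user}/{repo}", filename=filename, repo_type="dataset")
    elif os.path.exists(baseline_arg):
        # Local JSONL with per-sample baseline results
        path = baseline_arg
    else:
        # Otherwise treat it as a preset name in the registry (per-task path, assumed shape)
        path = BASELINE_REGISTRY[baseline_arg][task]
    with open(path, "r", encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    # The field names "doc_id" and "score" are assumptions about the JSONL schema
    return {rec["doc_id"]: rec["score"] for rec in records}
```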
New Output Columns
| Baseline | Diff | CI | P_Value |
| --- | --- | --- | --- |
| qwen25vl | +2.5% | [+0.8%, +4.2%] | 0.023* |
JSON Output Fields
{ "paired_baseline": "qwen25vl", "paired_baseline_score": 62.5, "paired_ci_lower": 0.8, "paired_ci_upper": 4.2, "paired_pvalue": 0.023 }Files Changed
- lmms_eval/__main__.py: --baseline CLI parameter
- lmms_eval/api/metrics.py: paired_ttest() function
- lmms_eval/evaluator.py: baseline comparison logic in simple_evaluate()
- lmms_eval/evaluator_utils.py: compute_baseline_comparison() helper
- lmms_eval/utils.py: make_table() baseline comparison columns
- lmms_eval/baselines/__init__.py: module exports (including get_baseline_display_name())
- lmms_eval/baselines/registry.py: model × task preset registry
- lmms_eval/baselines/loader.py: baseline loading from local/HF/registry
Test Results
- --baseline qwen25vl (preset name)
- --baseline /path/to/local.jsonl (local path)
- --baseline hf://user/repo/file.jsonl (HF URL)
- cicd test
Terminal Table Output
JSON Output
CICD Test