
Conversation

@mwxely (Collaborator) commented on Jan 19, 2026

Summary

Add statistical power analysis to help users determine the minimum sample size needed to detect a given effect size before running evaluations.

  • Add power_analysis() function in api/metrics.py
  • Add --power-analysis CLI mode with parameters: --effect-size, --alpha, --power, --correlation

Motivation

Problem

Researchers often run full evaluations without knowing whether the benchmark has enough statistical power to detect meaningful differences. This wastes compute and produces unreliable conclusions.

Solution

Add a --power-analysis mode that calculates:

  1. Minimum sample size required to detect a specified effect size
  2. Current power of existing benchmarks (when --tasks is specified)
  3. Minimum detectable effect at the current sample size

Usage

```bash
# Basic: calculate the minimum n for detecting a 3% difference
lmms-eval --power-analysis --effect-size 0.03

# With task: check if a benchmark has sufficient power
lmms-eval --power-analysis --effect-size 0.03 --tasks videomme
```
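Under the hood, the minimum sample size and the minimum detectable effect are inverses of the same power calculation for a paired test. The sketch below is illustrative only: the helper names, the scipy dependency, and the normal approximation (rather than an iterative t-based solve) are assumptions of this sketch, not necessarily what the merged `power_analysis()` in `lmms_eval/api/metrics.py` does.

```python
# Illustrative sketch of the power calculation behind --power-analysis.
# Assumptions: normal approximation to the paired t-test; equal stds for
# both models (the review discussion below generalizes this); function
# names and the scipy dependency are choices made for this sketch only.
import math

from scipy.stats import norm


def min_sample_size(effect_size: float, std: float = 0.5,
                    correlation: float = 0.5, alpha: float = 0.05,
                    power: float = 0.80) -> int:
    """Minimum n needed to detect `effect_size` with a paired test."""
    # Variance of the paired difference: 2*sigma^2*(1 - rho).
    var_diff = 2 * std**2 * (1 - correlation)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(z**2 * var_diff / effect_size**2)


def min_detectable_effect(n: int, std: float = 0.5,
                          correlation: float = 0.5, alpha: float = 0.05,
                          power: float = 0.80) -> float:
    """Smallest effect detectable at sample size n (inverse of the above)."""
    var_diff = 2 * std**2 * (1 - correlation)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * math.sqrt(var_diff / n)
```

With the defaults above, `min_sample_size(0.03)` comes out to 2181 in this sketch; the merged implementation may differ slightly, e.g. in rounding or in using the t distribution.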

CLI Arguments

| Argument | Default | Description |
| --- | --- | --- |
| `--power-analysis` | `False` | Enable power analysis mode |
| `--effect-size` | `0.03` | Minimum effect size to detect (3%) |
| `--alpha` | `0.05` | Significance level |
| `--power` | `0.80` | Desired statistical power |
| `--correlation` | `0.5` | Expected correlation between paired samples |
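For orientation, here is a minimal argparse sketch of how these flags could be declared. The flag names and defaults come from the table above; the parser structure itself is illustrative, not the actual `lmms_eval/__main__.py` wiring.

```python
# Illustrative argparse declarations for the flags above (not the merged code).
import argparse

parser = argparse.ArgumentParser(prog="lmms-eval")
parser.add_argument("--power-analysis", action="store_true",
                    help="Enable power analysis mode")
parser.add_argument("--effect-size", type=float, default=0.03,
                    help="Minimum effect size to detect (3%%)")
parser.add_argument("--alpha", type=float, default=0.05,
                    help="Significance level")
parser.add_argument("--power", type=float, default=0.80,
                    help="Desired statistical power")
parser.add_argument("--correlation", type=float, default=0.5,
                    help="Expected correlation between paired samples")
args = parser.parse_args()
```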

Files Changed

| File | Changes |
| --- | --- |
| `lmms_eval/api/metrics.py` | +57 lines: add `power_analysis()` function |
| `lmms_eval/__main__.py` | +95 lines: add CLI arguments and handler |

Test Results

[screenshot: test results]

CICD Test

[screenshot: CI/CD run]

Commit: Add statistical function to calculate the minimum sample size needed to detect a given effect size using paired t-test power analysis.

Commit: Add CLI arguments and handler for power analysis:
- --power-analysis: enable power analysis mode
- --effect-size: minimum effect to detect (default 0.03)
- --alpha: significance level (default 0.05)
- --power: desired power (default 0.80)
- --correlation: expected correlation (default 0.5)

```python
def power_analysis(
    effect_size: float,
    std: float = 0.5,
```
Reviewer (Collaborator) commented:

std should be evaluated from the previous evaluation?

```python
    Calculate minimum sample size for paired t-test power analysis.
    For paired samples, the effective variance is reduced by correlation:
        Var(X - Y) = Var(X) + Var(Y) - 2*Cov(X, Y) = 2*sigma^2*(1 - rho)
```
Reviewer (Collaborator) commented:

Var(x) does not necessarily equal Var(y)

Commit: Add a note in the docstring that the std parameter should ideally be estimated from previous evaluation results rather than left at the default value. Add a reference to the Miller 2024 paper (arXiv:2411.00640).

Commit:
- Replace the single 'std' param with 'std_a' and 'std_b' for the general case
- Fix formula: var_diff = std_a^2 + std_b^2 - 2*rho*std_a*std_b
- Add --std-a and --std-b CLI arguments
- Backward compatible: defaults to 0.5 if neither is provided
@mwxely (Collaborator, Author) commented on Jan 22, 2026

Summary

Both review comments have been addressed, in two separate commits.


Comment 1: "std should be evaluated from the previous evaluation?"

Location: lmms_eval/api/metrics.py line 705

Response:

Yes, you're right. According to the paper (Section 5):

"The quantities ω², σ_A², and σ_B² may be estimated from previous eval data."

The default value 0.5 was just a rough approximation for binary (0/1) scores.

Fix: Added docstring note clarifying that std should be estimated from previous evaluation results, with reference to the source paper.

Commit: c87e7c07 docs: clarify std should be estimated from previous eval data


Comment 2: "Var(x) does not necessarily equal Var(y)"

Location: lmms_eval/api/metrics.py lines 714-715

Response:

Correct. The original implementation assumed Var(X) = Var(Y) = σ², which simplifies to 2σ²(1-ρ).

The general formula from the paper, using Cov(x_A, x_B) = ρ·σ_A·σ_B, is:

ω² = Var(x_A) + Var(x_B) - 2·Cov(x_A, x_B)
   = σ_A² + σ_B² - 2·ρ·σ_A·σ_B

Fix:

  • Replaced single std parameter with separate std_a and std_b parameters
  • Updated formula to: var_diff = std_a² + std_b² - 2*rho*std_a*std_b
  • Added --std-a and --std-b CLI arguments
  • Backward compatible: if only std_a is provided, std_b defaults to std_a; if neither is provided, both default to 0.5 (see the sketch below)

Commit: 69267ceb fix: use separate std_a/std_b params for general variance formula
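
A minimal sketch of the generalized calculation these bullets describe; the signature and fallback logic follow the description above, but this is an illustration, not the merged diff:

```python
# Sketch of the generalized power_analysis() described in this fix
# (illustrative; the merged signature may differ in details).
import math
from typing import Optional

from scipy.stats import norm


def power_analysis(effect_size: float,
                   std_a: Optional[float] = None,
                   std_b: Optional[float] = None,
                   correlation: float = 0.5,
                   alpha: float = 0.05,
                   power: float = 0.80) -> int:
    # Backward-compatible defaults, per the bullets above:
    # neither std given -> 0.5; only std_a given -> std_b = std_a.
    if std_a is None:
        std_a = 0.5
    if std_b is None:
        std_b = std_a
    # General paired-difference variance (Miller 2024, arXiv:2411.00640):
    # omega^2 = sigma_A^2 + sigma_B^2 - 2*rho*sigma_A*sigma_B
    var_diff = std_a**2 + std_b**2 - 2 * correlation * std_a * std_b
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(z**2 * var_diff / effect_size**2)
```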


Changes Summary

| File | Commit 1 (docs) | Commit 2 (fix) |
| --- | --- | --- |
| `lmms_eval/api/metrics.py` | Update docstring, add paper reference | Split `std` into `std_a`/`std_b`, fix formula |
| `lmms_eval/__main__.py` | - | Add `--std-a`, `--std-b` CLI args |

Verification

Smoke test passed:

```bash
python -m lmms_eval --power-analysis --effect-size 0.03 --std-a 0.4 --std-b 0.5
```

Output:

```text
============================================================
POWER ANALYSIS RESULTS
============================================================

Parameters:
  Effect size (delta):     3.0%
  Std (model A):           0.4
  Std (model B):           0.5
  Significance level (α):  0.05
  Desired power (1-β):     0.8
  Correlation (ρ):         0.5

Result:
  Minimum sample size:     n = 1832
```

Formula verification: when std_a = std_b, the new formula gives results identical to the old formula (checked below).
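
Both checks are easy to reproduce numerically. Assuming the normal-approximation sketch from earlier in this thread (`min_n` below is illustrative, not the CLI code):

```python
# Numeric check of the verification claims above.
import math

from scipy.stats import norm


def min_n(delta, std_a, std_b, rho=0.5, alpha=0.05, power=0.80):
    var_diff = std_a**2 + std_b**2 - 2 * rho * std_a * std_b
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(z**2 * var_diff / delta**2)


# The smoke-test inputs reproduce the reported minimum sample size:
assert min_n(0.03, 0.4, 0.5) == 1832

# When std_a == std_b == sigma, the general variance collapses to the
# old 2*sigma^2*(1 - rho), so old and new formulas agree:
sigma, rho = 0.5, 0.5
assert math.isclose(sigma**2 + sigma**2 - 2 * rho * sigma * sigma,
                    2 * sigma**2 * (1 - rho))
```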

@kcz358 (Collaborator) commented on Jan 23, 2026

Seems fine for now. Just a reminder for users who wish to use this: you currently have to calculate the std for model A and model B manually, i.e., post-process your previous results yourself (see the sketch below). Can you try to resolve the conflict? Thanks
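
For reference, a minimal sketch of that post-processing, assuming per-sample scores were saved from a previous run; the file paths and JSON layout here are hypothetical and depend on how your run stores per-sample results:

```python
# Hypothetical post-processing: estimate std_a/std_b from saved per-sample
# scores of a previous eval run, then feed them to the power-analysis flags.
import json
import statistics


def estimate_std(path: str) -> float:
    with open(path) as f:
        scores = json.load(f)  # assumed: a flat list of per-sample scores
    return statistics.stdev(scores)


std_a = estimate_std("model_a_scores.json")  # hypothetical file names
std_b = estimate_std("model_b_scores.json")
print(f"lmms-eval --power-analysis --effect-size 0.03 "
      f"--std-a {std_a:.4f} --std-b {std_b:.4f}")
```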

Commit: Merge origin/main into feat/power-analysis, keeping both:
- Power analysis CLI args and function (this branch)
- Baseline/num_samples args and paired_ttest function (from PR #1006)
@kcz358 merged commit ed55078 into main on Jan 23, 2026 (3 checks passed).
@kcz358 deleted the feat/power-analysis branch on Jan 23, 2026, at 09:44.