
Conversation

@mwxely (Collaborator) commented on Jan 19, 2026

Summary

Add statistical power analysis to help users determine the minimum sample size needed to detect a given effect size before running evaluations.

  • Add power_analysis() function in api/metrics.py
  • Add --power-analysis CLI mode with parameters: --effect-size, --alpha, --power, --correlation

Motivation

Problem

Researchers often run full evaluations without knowing whether the benchmark has enough statistical power to detect meaningful differences. This wastes compute and produces unreliable conclusions.

Solution

Add a --power-analysis mode that calculates:

  1. Minimum sample size required to detect a specified effect size
  2. Current power of existing benchmarks (when --tasks is specified)
  3. Minimum detectable effect at the current sample size

Usage

```bash
# Basic: calculate the minimum n for detecting a 3% difference
lmms-eval --power-analysis --effect-size 0.03

# With task: check if a benchmark has sufficient power
lmms-eval --power-analysis --effect-size 0.03 --tasks videomme
```
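Under the hood, the minimum sample size and the minimum detectable effect are inverses of the same power calculation for a paired test. The sketch below is illustrative only: the helper names, the scipy dependency, and the normal approximation (rather than an iterative t-based solve) are assumptions of this sketch, not necessarily what the merged `power_analysis()` in `lmms_eval/api/metrics.py` does.

```python
# Illustrative sketch of the power calculation behind --power-analysis.
# Assumptions: normal approximation to the paired t-test; equal stds for
# both models (the review discussion below generalizes this); function
# names and the scipy dependency are choices made for this sketch only.
import math

from scipy.stats import norm


def min_sample_size(effect_size: float, std: float = 0.5,
                    correlation: float = 0.5, alpha: float = 0.05,
                    power: float = 0.80) -> int:
    """Minimum n needed to detect `effect_size` with a paired test."""
    # Variance of the paired difference: 2*sigma^2*(1 - rho).
    var_diff = 2 * std**2 * (1 - correlation)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(z**2 * var_diff / effect_size**2)


def min_detectable_effect(n: int, std: float = 0.5,
                          correlation: float = 0.5, alpha: float = 0.05,
                          power: float = 0.80) -> float:
    """Smallest effect detectable at sample size n (inverse of the above)."""
    var_diff = 2 * std**2 * (1 - correlation)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * math.sqrt(var_diff / n)
```

With the defaults above, `min_sample_size(0.03)` comes out to 2181 in this sketch; the merged implementation may differ slightly, e.g. in rounding or in using the t distribution.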

CLI Arguments

| Argument | Default | Description |
| --- | --- | --- |
| `--power-analysis` | `False` | Enable power analysis mode |
| `--effect-size` | `0.03` | Minimum effect size to detect (3%) |
| `--alpha` | `0.05` | Significance level |
| `--power` | `0.80` | Desired statistical power |
| `--correlation` | `0.5` | Expected correlation between paired samples |
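For orientation, here is a minimal argparse sketch of how these flags could be declared. The flag names and defaults come from the table above; the parser structure itself is illustrative, not the actual `lmms_eval/__main__.py` wiring.

```python
# Illustrative argparse declarations for the flags above (not the merged code).
import argparse

parser = argparse.ArgumentParser(prog="lmms-eval")
parser.add_argument("--power-analysis", action="store_true",
                    help="Enable power analysis mode")
parser.add_argument("--effect-size", type=float, default=0.03,
                    help="Minimum effect size to detect (3%%)")
parser.add_argument("--alpha", type=float, default=0.05,
                    help="Significance level")
parser.add_argument("--power", type=float, default=0.80,
                    help="Desired statistical power")
parser.add_argument("--correlation", type=float, default=0.5,
                    help="Expected correlation between paired samples")
args = parser.parse_args()
```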

Files Changed

| File | Changes |
| --- | --- |
| `lmms_eval/api/metrics.py` | +57 lines: add `power_analysis()` function |
| `lmms_eval/__main__.py` | +95 lines: add CLI arguments and handler |

Test Results

[screenshot: test results]

CICD Test

[screenshot: CI/CD run]

Commit: Add statistical function to calculate the minimum sample size needed to detect a given effect size using paired t-test power analysis.

Commit: Add CLI arguments and handler for power analysis:
- --power-analysis: enable power analysis mode
- --effect-size: minimum effect to detect (default 0.03)
- --alpha: significance level (default 0.05)
- --power: desired power (default 0.80)
- --correlation: expected correlation (default 0.5)

```python
def power_analysis(
    effect_size: float,
    std: float = 0.5,
```
Reviewer (Collaborator) commented:

std should be evaluated from the previous evaluation?

```python
    Calculate minimum sample size for paired t-test power analysis.
    For paired samples, the effective variance is reduced by correlation:
        Var(X - Y) = Var(X) + Var(Y) - 2*Cov(X, Y) = 2*sigma^2*(1 - rho)
```
Reviewer (Collaborator) commented:

Var(x) does not necessarily equal Var(y)

Commit: Add a note in the docstring that the std parameter should ideally be estimated from previous evaluation results rather than left at the default value. Add a reference to the Miller 2024 paper (arXiv:2411.00640).

Commit:
- Replace the single 'std' param with 'std_a' and 'std_b' for the general case
- Fix formula: var_diff = std_a^2 + std_b^2 - 2*rho*std_a*std_b
- Add --std-a and --std-b CLI arguments
- Backward compatible: defaults to 0.5 if neither is provided
@mwxely (Collaborator, Author) commented on Jan 22, 2026

Summary

Both review comments have been addressed, in two separate commits.


Comment 1: "std should be evaluated from the previous evaluation?"

Location: lmms_eval/api/metrics.py line 705

Response:

Yes, you're right. According to the paper (Section 5):

"The quantities ω², σ_A², and σ_B² may be estimated from previous eval data."

The default value 0.5 was just a rough approximation for binary (0/1) scores.

Fix: Added docstring note clarifying that std should be estimated from previous evaluation results, with reference to the source paper.

Commit: c87e7c07 docs: clarify std should be estimated from previous eval data


Comment 2: "Var(x) does not necessarily equal Var(y)"

Location: lmms_eval/api/metrics.py lines 714-715

Response:

Correct. The original implementation assumed Var(X) = Var(Y) = σ², which simplifies to 2σ²(1-ρ).

The general formula from the paper, using Cov(x_A, x_B) = ρ·σ_A·σ_B, is:

ω² = Var(x_A) + Var(x_B) - 2·Cov(x_A, x_B)
   = σ_A² + σ_B² - 2·ρ·σ_A·σ_B

Fix:

  • Replaced single std parameter with separate std_a and std_b parameters
  • Updated formula to: var_diff = std_a² + std_b² - 2*rho*std_a*std_b
  • Added --std-a and --std-b CLI arguments
  • Backward compatible: if only std_a is provided, std_b defaults to std_a; if neither is provided, both default to 0.5 (see the sketch below)

Commit: 69267ceb fix: use separate std_a/std_b params for general variance formula
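
A minimal sketch of the generalized calculation these bullets describe; the signature and fallback logic follow the description above, but this is an illustration, not the merged diff:

```python
# Sketch of the generalized power_analysis() described in this fix
# (illustrative; the merged signature may differ in details).
import math
from typing import Optional

from scipy.stats import norm


def power_analysis(effect_size: float,
                   std_a: Optional[float] = None,
                   std_b: Optional[float] = None,
                   correlation: float = 0.5,
                   alpha: float = 0.05,
                   power: float = 0.80) -> int:
    # Backward-compatible defaults, per the bullets above:
    # neither std given -> 0.5; only std_a given -> std_b = std_a.
    if std_a is None:
        std_a = 0.5
    if std_b is None:
        std_b = std_a
    # General paired-difference variance (Miller 2024, arXiv:2411.00640):
    # omega^2 = sigma_A^2 + sigma_B^2 - 2*rho*sigma_A*sigma_B
    var_diff = std_a**2 + std_b**2 - 2 * correlation * std_a * std_b
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(z**2 * var_diff / effect_size**2)
```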


Changes Summary

| File | Commit 1 (docs) | Commit 2 (fix) |
| --- | --- | --- |
| `lmms_eval/api/metrics.py` | Update docstring, add paper reference | Split `std` into `std_a`/`std_b`, fix formula |
| `lmms_eval/__main__.py` | - | Add `--std-a`, `--std-b` CLI args |

Verification

Smoke test passed:

```bash
python -m lmms_eval --power-analysis --effect-size 0.03 --std-a 0.4 --std-b 0.5
```

Output:

```text
============================================================
POWER ANALYSIS RESULTS
============================================================

Parameters:
  Effect size (delta):     3.0%
  Std (model A):           0.4
  Std (model B):           0.5
  Significance level (α):  0.05
  Desired power (1-β):     0.8
  Correlation (ρ):         0.5

Result:
  Minimum sample size:     n = 1832
```

Formula verification: when std_a = std_b, the new formula gives results identical to the old formula (checked below).
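
Both checks are easy to reproduce numerically. Assuming the normal-approximation sketch from earlier in this thread (`min_n` below is illustrative, not the CLI code):

```python
# Numeric check of the verification claims above.
import math

from scipy.stats import norm


def min_n(delta, std_a, std_b, rho=0.5, alpha=0.05, power=0.80):
    var_diff = std_a**2 + std_b**2 - 2 * rho * std_a * std_b
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(z**2 * var_diff / delta**2)


# The smoke-test inputs reproduce the reported minimum sample size:
assert min_n(0.03, 0.4, 0.5) == 1832

# When std_a == std_b == sigma, the general variance collapses to the
# old 2*sigma^2*(1 - rho), so old and new formulas agree:
sigma, rho = 0.5, 0.5
assert math.isclose(sigma**2 + sigma**2 - 2 * rho * sigma * sigma,
                    2 * sigma**2 * (1 - rho))
```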

@kcz358 (Collaborator) commented on Jan 23, 2026

Seems fine for now. Just a reminder for users who wish to use this: you currently have to calculate the std for model A and model B manually, i.e., post-process your previous results yourself (see the sketch below). Can you try to resolve the conflict? Thanks
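
For reference, a minimal sketch of that post-processing, assuming per-sample scores were saved from a previous run; the file paths and JSON layout here are hypothetical and depend on how your run stores per-sample results:

```python
# Hypothetical post-processing: estimate std_a/std_b from saved per-sample
# scores of a previous eval run, then feed them to the power-analysis flags.
import json
import statistics


def estimate_std(path: str) -> float:
    with open(path) as f:
        scores = json.load(f)  # assumed: a flat list of per-sample scores
    return statistics.stdev(scores)


std_a = estimate_std("model_a_scores.json")  # hypothetical file names
std_b = estimate_std("model_b_scores.json")
print(f"lmms-eval --power-analysis --effect-size 0.03 "
      f"--std-a {std_a:.4f} --std-b {std_b:.4f}")
```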

Commit: Merge origin/main into feat/power-analysis, keeping both:
- Power analysis CLI args and function (this branch)
- Baseline/num_samples args and paired_ttest function (from PR #1006)
@kcz358 merged commit ed55078 into main on Jan 23, 2026 (3 checks passed).
@kcz358 deleted the feat/power-analysis branch on Jan 23, 2026, at 09:44.