[feat] Add Power Analysis for Pre-Evaluation Planning #1007
Conversation
Add statistical function to calculate minimum sample size needed to detect a given effect size using paired t-test power analysis.
Add CLI arguments and handler for power analysis:
- `--power-analysis`: enable power analysis mode
- `--effect-size`: minimum effect to detect (default 0.03)
- `--alpha`: significance level (default 0.05)
- `--power`: desired power (default 0.80)
- `--correlation`: expected correlation (default 0.5)
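For orientation, here is a minimal sketch of the kind of function these commits describe, using the standard normal-approximation sample-size formula for a paired test. The signature follows the diff excerpt below; the body is my own sketch and may differ from the actual implementation in `api/metrics.py`:

```python
from math import ceil

from scipy.stats import norm


def power_analysis(
    effect_size: float,
    std: float = 0.5,
    alpha: float = 0.05,
    power: float = 0.80,
    correlation: float = 0.5,
) -> int:
    """Minimum sample size to detect `effect_size` with a paired t-test.

    Normal approximation: the paired difference has variance
    2 * std**2 * (1 - correlation) when both models share the same std.
    """
    var_diff = 2 * std**2 * (1 - correlation)
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided test
    z_power = norm.ppf(power)
    n = (z_alpha + z_power) ** 2 * var_diff / effect_size**2
    return ceil(n)
```

With the defaults above and `effect_size=0.03`, this sketch returns roughly 2,180 samples.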
lmms_eval/api/metrics.py (Outdated)

```python
def power_analysis(
    effect_size: float,
    std: float = 0.5,
```
std should be evaluated from the previous evaluation?
lmms_eval/api/metrics.py (Outdated)

```python
    Calculate minimum sample size for paired t-test power analysis.

    For paired samples, the effective variance is reduced by correlation:
        Var(X - Y) = Var(X) + Var(Y) - 2*Cov(X,Y) = 2*sigma^2*(1-rho)
```
Var(X) does not necessarily equal Var(Y)
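For reference, the general identity the reviewer is pointing at, without the equal-variance assumption (a restatement, not text from the PR):

$$\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y) = \sigma_A^2 + \sigma_B^2 - 2\rho\,\sigma_A\sigma_B$$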
Add note in docstring that std parameter should ideally be estimated from previous evaluation results rather than using default value. Add reference to Miller 2024 paper (arXiv:2411.00640).
- Replace single `std` param with `std_a` and `std_b` for the general case
- Fix formula: `var_diff = std_a^2 + std_b^2 - 2*rho*std_a*std_b`
- Add `--std-a` and `--std-b` CLI arguments
- Backward compatible: defaults to 0.5 if neither is provided
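A sketch of how the variance line changes under this commit (illustrative; the real diff may differ):

```python
# Before: equal-variance assumption
# var_diff = 2 * std**2 * (1 - correlation)

# After: general case with separate stds for model A and model B
var_diff = std_a**2 + std_b**2 - 2 * correlation * std_a * std_b
# Reduces to 2 * std**2 * (1 - correlation) when std_a == std_b == std.
```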
Summary

Both review comments have been addressed in two separate commits.

Comment 1: "std should be evaluated from the previous evaluation?"

Response: Yes, you're right. According to the paper (Section 5), the standard deviation should be estimated from previous evaluation results; the default value of 0.5 is only a fallback.

Fix: Added a docstring note clarifying that `std` should ideally be estimated from previous evaluation results rather than relying on the default, with a reference to Miller 2024 (arXiv:2411.00640).

Comment 2: "Var(X) does not necessarily equal Var(Y)"

Response: Correct. The original implementation assumed equal variances, giving Var(X - Y) = 2*sigma^2*(1-rho). The general formula from the paper is var_diff = std_a^2 + std_b^2 - 2*rho*std_a*std_b.

Fix: Replaced the single `std` parameter with `std_a` and `std_b`, updated the formula, and added `--std-a`/`--std-b` CLI arguments.

Changes Summary

See the two commits above.

Verification

Smoke test passed:

python -m lmms_eval --power-analysis --effect-size 0.03 --std-a 0.4 --std-b 0.5

Formula verification: when std_a == std_b, the general formula reduces to the original 2*sigma^2*(1-rho).
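A quick numeric check of the reduction mentioned above (my own sketch, not the PR's test code):

```python
# When std_a == std_b, the general formula matches the original paired form.
std, rho = 0.5, 0.5
general = std**2 + std**2 - 2 * rho * std * std
original = 2 * std**2 * (1 - rho)
assert abs(general - original) < 1e-12  # both equal 0.25
```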
Seems fine for now. Just a reminder for users who wish to use this: you have to manually calculate the std for model A and model B, so you currently have to postprocess your results. Can you try resolving the conflict? Thanks
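As a sketch of the post-processing mentioned here, one could estimate `std_a` and `std_b` from the per-sample scores of a previous run; the file names and JSON layout below are hypothetical:

```python
import json
import statistics

# Hypothetical per-sample score files produced by earlier evaluation runs.
with open("results_model_a.json") as f:
    scores_a = [r["score"] for r in json.load(f)]
with open("results_model_b.json") as f:
    scores_b = [r["score"] for r in json.load(f)]

std_a = statistics.stdev(scores_a)
std_b = statistics.stdev(scores_b)
print(f"--std-a {std_a:.4f} --std-b {std_b:.4f}")
```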
Merge origin/main into feat/power-analysis, keeping both:
- Power analysis CLI args and function (this branch)
- Baseline/num_samples args and paired_ttest function (from PR #1006)
Summary
Add statistical power analysis to help users determine the minimum sample size needed to detect a given effect size before running evaluations.
- `power_analysis()` function in `api/metrics.py`
- `--power-analysis` CLI mode with parameters: `--effect-size`, `--alpha`, `--power`, `--correlation`

Motivation
Problem
Researchers often run full evaluations without knowing if the benchmark has enough statistical power to detect meaningful differences. This wastes compute and produces unreliable conclusions.
Solution
Add a `--power-analysis` mode that calculates:
- the minimum sample size needed to detect the given effect size
- whether the selected tasks have enough samples (if `--tasks` is specified)

Usage
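An example invocation using the flags listed below (values are the documented defaults; output format is not shown here):

```bash
python -m lmms_eval --power-analysis --effect-size 0.03 --alpha 0.05 --power 0.80 --correlation 0.5
```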
CLI Arguments
| Argument | Default |
| --- | --- |
| `--power-analysis` | False |
| `--effect-size` | 0.03 |
| `--alpha` | 0.05 |
| `--power` | 0.80 |
| `--correlation` | 0.5 |

Files Changed
- `lmms_eval/api/metrics.py`: added `power_analysis()` function
- `lmms_eval/__main__.py`: added CLI arguments and handler

Test Results
CICD Test