LCORE-723: Compute correct confidence interval #71
tisnik merged 1 commit into lightspeed-core:main
Conversation
Walkthrough
Introduces bootstrap-based confidence interval computation for metric scores via a new bootstrap_intervals function, integrates the CI into _finalize_metric_stats output, adds a placeholder confidence_intervals field at the conversation level, updates imports, and adds unit tests covering bootstrap_intervals use and export.
Sequence Diagram(s)
```mermaid
sequenceDiagram
    autonumber
    actor Caller
    participant Stats as _finalize_metric_stats
    participant Boot as bootstrap_intervals
    Caller->>Stats: finalize metric stats (scores)
    alt scores length > 1
        Stats->>Boot: compute CI (Series, confidence, steps)
        Boot-->>Stats: (low, mean, high)
        Stats-->>Caller: stats with score_statistics.confidence_interval
    else scores length <= 1 or failure
        Stats-->>Caller: stats with score_statistics.confidence_interval = None
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 4
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
67-88: Consider extracting repetitive bootstrap logic. The same pattern (bootstrap, convert to percentage, store in dict) is repeated three times for pass/fail/error rates. Extracting this to a helper function would improve maintainability and reduce duplication.
Example refactor:
```python
def _compute_rate_confidence_interval(series: pd.Series, rate_name: str) -> dict[str, Any]:
    """Compute confidence interval for a rate and return formatted dict."""
    ci_low, ci_mean, ci_high = bootstrap_intervals(series)
    return {
        "low": float(ci_low * 100),
        "mean": float(ci_mean * 100),
        "high": float(ci_high * 100),
    }

# Then use it:
confidence_intervals = {
    "pass_rate": _compute_rate_confidence_interval(pass_series, "pass_rate"),
    "fail_rate": _compute_rate_confidence_interval(fail_series, "fail_rate"),
    "error_rate": _compute_rate_confidence_interval(error_series, "error_rate"),
}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/lightspeed_evaluation/core/output/statistics.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
EvaluationResult(185-224)
🔇 Additional comments (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
20-35: Verify bootstrap implementation locally
Run the bootstrap tests in an environment with numpy/pandas to confirm:
- Using median vs mean for the bootstrap central estimate
- Correct percentile offsets and sign inversions for lower/upper bounds
- Rename low/high for clarity if needed
- Add input validation for empty series, NaN/inf, non-numeric types, and bootstrap_steps > 0
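For orientation, here is a minimal sketch of the percentile-style bootstrap under review, reconstructed from the diffs quoted later in this thread. Treat it as an approximation of the PR code rather than the merged implementation; the validation is shown as a ValueError per the review suggestion.

```python
import numpy as np
import pandas as pd


def bootstrap_intervals(
    s: pd.Series, confidence: int = 95, bootstrap_steps: int = 100_000
) -> tuple[float, float, float]:
    """Bootstrap confidence interval for the mean of s; returns (low, center, high)."""
    if not 0 <= confidence <= 100:
        raise ValueError("Invalid confidence, must be between 0 and 100")
    confidence_rev = 100 - confidence
    sample_n = len(s)
    sample_mean = s.mean()
    # Resample with replacement and keep the mean of each resample
    rates = np.array(
        [np.mean(s.sample(n=sample_n, replace=True)) for _ in range(bootstrap_steps)]
    )
    center = np.median(rates)  # median of the bootstrap distribution
    # Percentiles of the centered distribution; note the sign reversal below
    low = np.percentile(rates - sample_mean, confidence_rev / 2.0)
    high = np.percentile(rates - sample_mean, 100 - confidence_rev / 2.0)
    # "high" yields the lower bound and "low" the upper bound
    return sample_mean - high, center, sample_mean - low
```

Called as `bootstrap_intervals(pd.Series([0.7, 0.8, 0.65, 0.9]))`, it returns a (low, center, high) triple of the kind _finalize_metric_stats stores under score_statistics.confidence_interval.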
Actionable comments posted: 2
♻️ Duplicate comments (2)
src/lightspeed_evaluation/core/output/statistics.py (2)
224-236: No-op try/except; simplify and add TODO. This block always sets None. Remove the try/except and document the limitation.
```diff
-    # Calculate confidence intervals for conversation rates
-    if total > 1:  # Need at least 2 samples for meaningful bootstrap
-        try:
-            # Create binary series for each outcome type
-            # Note: We need to reconstruct the original results for this conversation
-            # Since we don't have access to the original results here,
-            # we'll skip CI for conversations. This could be enhanced by
-            # passing the original results to this function
-            stats["confidence_intervals"] = None
-        except (ValueError, RuntimeError):
-            stats["confidence_intervals"] = None
-    else:
-        stats["confidence_intervals"] = None
+    # TODO: Implement conversation-level confidence intervals when original
+    # results are available here to build binary series for bootstrap.
+    stats["confidence_intervals"] = None
```
16-21: Replace assert and add robust input validation. Assertions can be disabled; also validate steps and sample size.
```diff
-    assert 0 <= confidence <= 100, "Invalid confidence, must be between 0 and 100"
+    if not (0 <= confidence <= 100):
+        raise ValueError(f"confidence must be between 0 and 100, got {confidence}")
+    if bootstrap_steps <= 0:
+        raise ValueError(f"bootstrap_steps must be positive, got {bootstrap_steps}")
+    if len(s) < 2:
+        raise ValueError("bootstrap_intervals requires at least 2 samples")
```
Based on static analysis hints
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
61-90: Don’t drop all intervals if one computation fails; handle per-rate independently. Keep partial results; only omit failing entries. Set None only if all fail.
```diff
-    confidence_intervals = {}
-    if total > 1:  # Need at least 2 samples for meaningful bootstrap
-        try:
-            # Pass rate confidence interval
-            ci_low, ci_mean, ci_high = bootstrap_intervals(pass_series)
-            confidence_intervals["pass_rate"] = {
-                "low": float(ci_low * 100),  # Convert to percentage
-                "mean": float(ci_mean * 100),
-                "high": float(ci_high * 100),
-            }
-
-            # Fail rate confidence interval
-            ci_low, ci_mean, ci_high = bootstrap_intervals(fail_series)
-            confidence_intervals["fail_rate"] = {
-                "low": float(ci_low * 100),
-                "mean": float(ci_mean * 100),
-                "high": float(ci_high * 100),
-            }
-
-            # Error rate confidence interval
-            ci_low, ci_mean, ci_high = bootstrap_intervals(error_series)
-            confidence_intervals["error_rate"] = {
-                "low": float(ci_low * 100),
-                "mean": float(ci_mean * 100),
-                "high": float(ci_high * 100),
-            }
-        except (ValueError, RuntimeError):
-            confidence_intervals = None
-    else:
-        confidence_intervals = None
+    confidence_intervals: dict[str, dict[str, float]] | None = {}
+    if total > 1:  # Need at least 2 samples for meaningful bootstrap
+        try:
+            ci_low, ci_median, ci_high = bootstrap_intervals(pass_series)
+            confidence_intervals["pass_rate"] = {
+                "low": float(ci_low * 100),
+                "median": float(ci_median * 100),
+                "high": float(ci_high * 100),
+            }
+        except (ValueError, RuntimeError):
+            pass
+        try:
+            ci_low, ci_median, ci_high = bootstrap_intervals(fail_series)
+            confidence_intervals["fail_rate"] = {
+                "low": float(ci_low * 100),
+                "median": float(ci_median * 100),
+                "high": float(ci_high * 100),
+            }
+        except (ValueError, RuntimeError):
+            pass
+        try:
+            ci_low, ci_median, ci_high = bootstrap_intervals(error_series)
+            confidence_intervals["error_rate"] = {
+                "low": float(ci_low * 100),
+                "median": float(ci_median * 100),
+                "high": float(ci_high * 100),
+            }
+        except (ValueError, RuntimeError):
+            pass
+        if not confidence_intervals:
+            confidence_intervals = None
+    else:
+        confidence_intervals = None
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/lightspeed_evaluation/core/output/statistics.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
EvaluationResult(185-224)
🪛 GitHub Check: Bandit
src/lightspeed_evaluation/core/output/statistics.py
[notice] 16-16: Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: pydocstyle
- GitHub Check: mypy
- GitHub Check: tests (3.11)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.12)
7070411 to caa5200
Hi Team, PTAL. Thanks in advance.
asamal4 left a comment:
Question: Are you going to add a different script to run the evaluation with multiple models?
Also, we don't need a CI for pass/fail status, but rather for the metric scores.
Yes, I will be adding a different script to run the evaluation with multiple models.
Actionable comments posted: 1
♻️ Duplicate comments (4)
src/lightspeed_evaluation/core/output/statistics.py (4)
23-25: Optimize bootstrap performance. The current implementation uses 100,000 iterations of pandas.sample() in a Python loop, which is very slow and may cause performance issues on larger datasets. Consider vectorizing the bootstrap using NumPy operations. Apply this diff to improve performance:
```diff
-    rates = np.array(
-        [np.mean(s.sample(n=sample_n, replace=True)) for _ in range(bootstrap_steps)]
-    )
+    # Vectorized bootstrap using NumPy for better performance
+    values = s.to_numpy(copy=False)
+    rates = np.array(
+        [float(np.mean(np.random.choice(values, size=sample_n, replace=True)))
+         for _ in range(bootstrap_steps)]
+    )
```
For even better performance, consider reducing bootstrap_steps to 10,000 (still statistically sufficient) or implementing a fully vectorized approach using np.random.choice with a 2D array.
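A hedged sketch of that fully vectorized variant; the function name, rng handling, and seed parameter are illustrative and not part of the PR:

```python
import numpy as np
import pandas as pd


def bootstrap_rates_vectorized(
    s: pd.Series, bootstrap_steps: int = 10_000, seed: int | None = None
) -> np.ndarray:
    """Draw all bootstrap resamples at once and return per-resample means."""
    values = s.to_numpy(copy=False)
    rng = np.random.default_rng(seed)
    # Shape (bootstrap_steps, len(values)): one row per bootstrap resample
    resamples = rng.choice(values, size=(bootstrap_steps, values.size), replace=True)
    return resamples.mean(axis=1)
```

The resulting rates array can then be fed through the same percentile/median logic as the loop-based version.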
16-16: Replace assert with proper validation. Using assert for parameter validation is unsafe because assertions are disabled with Python's -O optimization flag, allowing invalid confidence values to pass through unchecked at runtime. This is a critical correctness issue that must be fixed regardless of the code's origin. Apply this diff to use proper validation:
```diff
-    assert 0 <= confidence <= 100, "Invalid confidence, must be between 0 and 100"
+    if not (0 <= confidence <= 100):
+        raise ValueError(f"confidence must be between 0 and 100, got {confidence}")
```
Note: This was flagged by static analysis (Bandit) and in past reviews, but remains unresolved.
12-33: Fix naming inconsistency: median labeled as mean. The function computes the median of bootstrap samples (line 28) but exposes it as "mean" throughout the API (docstring line 15, return variable name line 28, and usage in lines 151-154). This misleading naming can confuse consumers of the API. Apply these diffs to align the naming:
1. Update the function signature and docstring:
```diff
-    """Compute confidence interval using bootstraping, return low, mean, high."""
+    """Compute confidence interval using bootstrapping, return low, median, high."""
```
2. Rename the return variable:
```diff
     # Median (not mean) is correct here
-    success_rate_boot_strap = np.median(rates)
+    median_bootstrap = np.median(rates)
     low = np.percentile(rates - success_rate, (confidence_rev / 2.0))
     high = np.percentile(rates - success_rate, 100 - (confidence_rev / 2.0))
     # high represent lower bound, low represents upper bound
-    return success_rate - high, success_rate_boot_strap, success_rate - low
+    return success_rate - high, median_bootstrap, success_rate - low
```
3. Update all call sites to use the correct variable name (see lines 151, 154 in _finalize_metric_stats).
185-197: Simplify no-op try-except block. The try-except-else structure sets confidence_intervals to None in all branches, making the try-except unnecessary. The comments correctly explain that conversation-level CIs cannot be computed without access to original results. Apply this diff to simplify:
```diff
-    # Calculate confidence intervals for conversation rates
-    if total > 1:  # Need at least 2 samples for meaningful bootstrap
-        try:
-            # Create binary series for each outcome type
-            # Note: We need to reconstruct the original results for this conversation
-            # Since we don't have access to the original results here,
-            # we'll skip CI for conversations. This could be enhanced by
-            # passing the original results to this function
-            stats["confidence_intervals"] = None
-        except (ValueError, RuntimeError):
-            stats["confidence_intervals"] = None
-    else:
-        stats["confidence_intervals"] = None
+    # TODO: Implement conversation-level confidence intervals
+    # Requires passing original results to reconstruct binary series for bootstrap
+    stats["confidence_intervals"] = None
```
This was flagged in a previous review and marked as addressed, but the no-op structure remains.
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
151-156: Consider making confidence level configurable. The confidence level is hardcoded to 95. While 95% is a standard choice, consider:
- Adding a module-level constant DEFAULT_CONFIDENCE_LEVEL = 95 for clarity and maintainability
- Or, exposing it as a parameter in calculate_detailed_stats if different projects need different confidence levels
Example of introducing a constant:
```diff
+# Default confidence level for statistical reporting (95% is standard)
+DEFAULT_CONFIDENCE_LEVEL = 95
+
 def bootstrap_intervals(
-    s: pd.Series, confidence: int = 95, bootstrap_steps: int = 100000
+    s: pd.Series, confidence: int = DEFAULT_CONFIDENCE_LEVEL, bootstrap_steps: int = 100000
 ) -> tuple[np.floating, np.floating, np.floating]:
```
```diff
             score_stats["confidence_interval"] = {
                 "low": float(ci_low),
                 "mean": float(ci_mean),
                 "high": float(ci_high),
-                "confidence_level": 95,  # Default confidence level
+                "confidence_level": DEFAULT_CONFIDENCE_LEVEL,
             }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
tests/unit/core/output/test_statistics.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
EvaluationResult(185-224)
tests/unit/core/output/test_statistics.py (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
bootstrap_intervals(12-33)
🪛 GitHub Check: Bandit
src/lightspeed_evaluation/core/output/statistics.py
[notice] 16-16: Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: tests (3.11)
- GitHub Check: tests (3.12)
- GitHub Check: tests (3.13)
- GitHub Check: mypy
🔇 Additional comments (4)
tests/unit/core/output/test_statistics.py (3)
4-11: LGTM! The import additions correctly support testing the new bootstrap_intervals function.
61-75: Update test expectations when assert is replaced. These tests correctly expect AssertionError for invalid confidence values, matching the current implementation. However, past reviews (and static analysis) flagged the use of assert in bootstrap_intervals as a critical issue because assertions can be disabled with Python's -O flag. Once the implementation is updated to use proper validation (e.g., raise ValueError), these tests will need to be updated to expect ValueError instead of AssertionError. Update the test expectations after fixing the assert issue in the implementation:
```diff
-    with pytest.raises(
-        AssertionError, match="Invalid confidence, must be between 0 and 100"
-    ):
+    with pytest.raises(
+        ValueError, match="Invalid confidence, must be between 0 and 100"
+    ):
         bootstrap_intervals(data, confidence=-5)

     # Test confidence > 100
-    with pytest.raises(
-        AssertionError, match="Invalid confidence, must be between 0 and 100"
-    ):
+    with pytest.raises(
+        ValueError, match="Invalid confidence, must be between 0 and 100"
+    ):
         bootstrap_intervals(data, confidence=150)
```
77-102: LGTM! The edge case and extreme confidence level tests are well-designed:
- Single value test appropriately uses strict equality (deterministic behavior)
- All-same-values test uses tolerance for floating-point comparisons
- Testing 0% and 100% confidence levels provides good boundary coverage
src/lightspeed_evaluation/core/output/statistics.py (1)
6-7: LGTM! The NumPy and pandas imports are appropriate for the bootstrap confidence interval implementation.
We need confidence intervals especially for these.
Great work overall 💪
My comment disappeared as you probably committed something, pasting it here:
In fact we want to compute confidence intervals for PASS/FAIL/ERROR as well. We now report mean ("pass_rate" for example) only, which is misleading for comparisons.
You'll need to create a series with zeros and ones like:
- 1 for PASS/FAIL/ERROR (depends on the case)
- 0 for the rest (non PASS/FAIL/ERROR)
and pass it to the bootstrap_intervals function.
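A minimal sketch of that binary-series construction; the status values are made up for illustration, and the field the real results expose is an assumption, while bootstrap_intervals is the function added in this PR (sketched earlier in the thread):

```python
import pandas as pd

# Illustrative per-query outcomes; in the project these would come from the
# evaluation results (the exact attribute/field name is an assumption)
statuses = ["PASS", "FAIL", "PASS", "ERROR", "PASS"]

# 1 where the outcome matches, 0 otherwise
pass_series = pd.Series([1 if s == "PASS" else 0 for s in statuses])
fail_series = pd.Series([1 if s == "FAIL" else 0 for s in statuses])
error_series = pd.Series([1 if s == "ERROR" else 0 for s in statuses])

# The mean of each series is the corresponding rate (here pass rate = 0.6),
# so passing it to bootstrap_intervals yields a CI on that rate
ci_low, ci_center, ci_high = bootstrap_intervals(pass_series)
```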
pass_rate is the percentage of passes. We are not calculating a mean for pass_rate, so a CI is not required for it. It is needed for the actual metric score.
We can still keep the check: use an if condition and raise ValueError.
@asamal4 Imagine 2 datasets A and B you want to compare. On dataset A you get … Yes, we can manually compute, for example, a t-test from the reported mean, std dev, and n. But this PR is about helping the user so the results can be compared directly (for example, whether the reported intervals overlap or not). The current comparison based just on …
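To illustrate the direct comparison described above, a tiny hedged sketch with made-up interval values; overlap is only a rough heuristic, not a formal hypothesis test:

```python
# Illustrative 95% CIs for the same metric on two runs
ci_a = (0.62, 0.74)  # (low, high) for run A
ci_b = (0.70, 0.81)  # (low, high) for run B

# Two intervals overlap when each one's low end is below the other's high end
overlap = ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]
print("intervals overlap" if overlap else "intervals are disjoint")
```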
Actionable comments posted: 0
♻️ Duplicate comments (2)
src/lightspeed_evaluation/core/output/statistics.py (2)
148-162: Catch AssertionError from bootstrap_intervals. The bootstrap_intervals function can raise AssertionError for invalid confidence values (line 16), but the exception handler only catches ValueError and RuntimeError. This means an AssertionError would propagate and crash the stats calculation.
```diff
         try:
             ci_low, ci_mean, ci_high = bootstrap_intervals(scores_series)
             score_stats["confidence_interval"] = {
                 "low": float(ci_low),
                 "mean": float(ci_mean),
                 "high": float(ci_high),
                 "confidence_level": 95,  # Default confidence level
             }
-        except (ValueError, RuntimeError):
+        except (ValueError, RuntimeError, AssertionError):
             # If bootstrap fails, set confidence interval to None
             score_stats["confidence_interval"] = None
```
Note: Once you replace the assert with explicit validation (as suggested in the earlier comment), you can remove AssertionError from this handler.
16-16: Replaceassertwith explicit validation.Based on the discussion history, using
assertfor parameter validation is problematic because assertions are disabled with Python's-Oflag. Since you proposed using anifcondition withValueError, apply that approach here:- assert 0 <= confidence <= 100, "Invalid confidence, must be between 0 and 100" + if not (0 <= confidence <= 100): + raise ValueError(f"confidence must be between 0 and 100, got {confidence}")This ensures the validation always runs regardless of optimization flags.
🧹 Nitpick comments (4)
src/lightspeed_evaluation/core/output/statistics.py (4)
23-25: Consider optimizing the bootstrap loop for better performance. The current implementation performs 100,000 iterations with pandas.sample() in a Python loop, which can be slow for larger datasets. While the logic is correct, performance can be significantly improved:
Option 1: Reduce default steps
```diff
-    s: pd.Series, confidence: int = 95, bootstrap_steps: int = 100000
+    s: pd.Series, confidence: int = 95, bootstrap_steps: int = 10_000
```
Option 2: Vectorize with NumPy (more efficient)
```diff
-    rates = np.array(
-        [np.mean(s.sample(n=sample_n, replace=True)) for _ in range(bootstrap_steps)]
-    )
+    # Use NumPy for faster resampling
+    values = s.to_numpy(copy=False)
+    rates = np.array(
+        [float(np.mean(np.random.choice(values, size=sample_n, replace=True)))
+         for _ in range(bootstrap_steps)]
+    )
```
This maintains the same statistical properties while improving execution time.
28-28: Variable name suggests mean but contains median. The variable mean_boot_strap actually stores the median of the bootstrap distribution (line 28), which may confuse future maintainers. Consider renaming for clarity:
```diff
-    # Median (not mean) is correct here
-    mean_boot_strap = np.median(rates)
+    # Median of bootstrap distribution
+    median_boot_strap = np.median(rates)
     low = np.percentile(rates - sample_mean, (confidence_rev / 2.0))
     high = np.percentile(rates - sample_mean, 100 - (confidence_rev / 2.0))
     # high represent lower bound, low represents upper bound
-    return sample_mean - high, mean_boot_strap, sample_mean - low
+    return sample_mean - high, median_boot_strap, sample_mean - low
```
Note: This would require updating the API documentation and call sites (lines 151, 154) to reflect that the middle value is the median, not the mean.
151-156: API labels interval center as "mean" but it's actually the median. The variable ci_mean and the API key "mean" suggest this is the arithmetic mean, but bootstrap_intervals returns the median of the bootstrap distribution (line 28). This could mislead consumers of this API. If you address the naming in bootstrap_intervals, update the call sites to maintain consistency:
```diff
-            ci_low, ci_mean, ci_high = bootstrap_intervals(scores_series)
+            ci_low, ci_median, ci_high = bootstrap_intervals(scores_series)
             score_stats["confidence_interval"] = {
                 "low": float(ci_low),
-                "mean": float(ci_mean),
+                "median": float(ci_median),
                 "high": float(ci_high),
                 "confidence_level": 95,  # Default confidence level
             }
```
This makes the API more accurate and self-documenting.
185-197: Simplify the no-op try-except block. All branches of this try-except set confidence_intervals to None. Since conversation-level CI requires the original results (which aren't available here), you can simplify this to a direct assignment with a TODO comment:
```diff
     # Calculate confidence intervals for conversation rates
-    if total > 1:  # Need at least 2 samples for meaningful bootstrap
-        try:
-            # Create binary series for each outcome type
-            # Note: We need to reconstruct the original results for this conversation
-            # Since we don't have access to the original results here,
-            # we'll skip CI for conversations. This could be enhanced by
-            # passing the original results to this function
-            stats["confidence_intervals"] = None
-        except (ValueError, RuntimeError):
-            stats["confidence_intervals"] = None
-    else:
-        stats["confidence_intervals"] = None
+    # TODO: Implement conversation-level confidence intervals
+    # Requires passing original results to reconstruct binary series for bootstrap
+    stats["confidence_intervals"] = None
```
This maintains the same behavior while being more concise and clearly marking it as future work.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
tests/unit/core/output/test_statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/unit/core/output/test_statistics.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
EvaluationResult(185-224)
🪛 GitHub Check: Bandit
src/lightspeed_evaluation/core/output/statistics.py
[notice] 16-16: Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: tests (3.12)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.11)
Actionable comments posted: 0
🧹 Nitpick comments (3)
src/lightspeed_evaluation/core/output/statistics.py (3)
15-15: Fix typo in docstring. Apply this diff:
```diff
-    """Compute confidence interval using bootstraping, return low, mean, high."""
+    """Compute confidence interval using bootstrapping, return low, mean, high."""
```
150-163: Consider catching additional pandas exceptions. The current exception handling catches ValueError and RuntimeError. However, pandas operations in bootstrap_intervals could potentially raise TypeError (invalid data types) or IndexError (empty series edge cases) in certain scenarios. Consider broadening the exception handling:
```diff
-        except (ValueError, RuntimeError):
+        except (ValueError, RuntimeError, TypeError, IndexError):
             # If bootstrap fails, set confidence interval to None
             score_stats["confidence_interval"] = None
```
186-198: Simplify no-op try-except block. The try-except block on lines 188-196 always sets confidence_intervals to None in all paths. The comments clearly explain the limitation, but the try-except serves no purpose. Apply this diff to simplify:
```diff
-    # Calculate confidence intervals for conversation rates
-    if total > 1:  # Need at least 2 samples for meaningful bootstrap
-        try:
-            # Create binary series for each outcome type
-            # Note: We need to reconstruct the original results for this conversation
-            # Since we don't have access to the original results here,
-            # we'll skip CI for conversations. This could be enhanced by
-            # passing the original results to this function
-            stats["confidence_intervals"] = None
-        except (ValueError, RuntimeError):
-            stats["confidence_intervals"] = None
-    else:
-        stats["confidence_intervals"] = None
+    # TODO: Implement confidence intervals for conversation rates
+    # Requires passing original results to reconstruct binary series for bootstrap.
+    # See comments in previous discussions for details.
+    stats["confidence_intervals"] = None
```
Note: A previous review comment marked this as "Addressed in commit 500586d," but the no-op structure remains.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
tests/unit/core/output/test_statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/unit/core/output/test_statistics.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
EvaluationResult(185-224)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: mypy
- GitHub Check: tests (3.11)
- GitHub Check: tests (3.12)
- GitHub Check: tests (3.13)
🔇 Additional comments (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
6-7: LGTM! The numpy and pandas imports are necessary for the bootstrap confidence interval implementation.
There is a difference between the evaluation status and the score. I am referring to the current pass rate implementation: it is the percentage of evaluations that passed, calculated from the execution status (PASS, FAIL). The status is determined using a threshold; it is not the actual score. So when we compare two models with pass rates of 0.6 and 0.7, that means 6 or 7 out of 10 queries passed. This is not an average score, so it is not misleading and gives us a direct comparison. My comment is about calculating a CI for the actual score distribution (not for the status).
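To make the status-vs-score distinction concrete, a small hedged sketch; the threshold and score values are made up for illustration:

```python
import numpy as np

scores = np.array([0.42, 0.71, 0.88, 0.55, 0.93])  # per-query metric scores (illustrative)
threshold = 0.6  # illustrative pass threshold

# Pass rate: fraction of queries whose status is PASS (score >= threshold)
pass_rate = float(np.mean(scores >= threshold))  # 0.6 -> "3 out of 5 passed"

# Mean score: average of the underlying metric values, a different quantity
mean_score = float(scores.mean())                # ~0.698

# The argument above: the CI belongs on the score distribution (mean_score),
# not on the threshold-derived pass/fail status.
```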
@asamal4 @VladimirKadlec PTAL.
I understand your point; it's true only if the datasets for both runs were identical. Let's move this discussion outside the PR :)
Evaluate confidence interval using the custom procedure