
LCORE-723: Compute correct confidence interval#71

Merged
tisnik merged 1 commit into lightspeed-core:main from bsatapat-jpg:dev
Oct 8, 2025

Conversation

@bsatapat-jpg
Collaborator

@bsatapat-jpg bsatapat-jpg commented Oct 7, 2025

Evaluate confidence interval using the custom procedure

Summary by CodeRabbit

  • New Features
    • Metric score reports now include bootstrap-based confidence intervals (low, mean, high) at a 95% level when sufficient data is available; otherwise, they are shown as not available (see the example after this list).
    • Added a placeholder field for conversation-level confidence intervals (not yet computed).
  • Tests
    • Added comprehensive tests for the new confidence interval calculations, including edge cases and validation scenarios.
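
For orientation, a metric's confidence_interval entry would look roughly like this (values are illustrative; the low/mean/high/confidence_level keys come from the review discussion below):

score_statistics = {
    "confidence_interval": {
        "low": 0.71,   # illustrative value
        "mean": 0.78,  # illustrative value
        "high": 0.84,  # illustrative value
        "confidence_level": 95,
    }
}
# With fewer than two scores, confidence_interval is reported as None.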

@coderabbitai
Contributor

coderabbitai bot commented Oct 7, 2025

Walkthrough

Introduces bootstrap-based confidence interval computation for metric scores via a new bootstrap_intervals function, integrates CI into _finalize_metric_stats output, adds a placeholder confidence_intervals field at conversation level, updates imports, and adds unit tests covering bootstrap_intervals use and export.

Changes

  • Core statistics logic (src/lightspeed_evaluation/core/output/statistics.py): Added bootstrap_intervals(s: pd.Series, confidence=95, bootstrap_steps=100000) computing (low, mean, high). Updated _finalize_metric_stats to compute/attach score_statistics["confidence_interval"] when len(scores) > 1, else None. Ensured confidence_interval present even with no scores. Added conversation-level confidence_intervals field (set to None). Imported numpy as np, pandas as pd. Public API now includes CI fields.
  • Unit tests (tests/unit/core/output/test_statistics.py): Exported bootstrap_intervals from module. Added TestBootstrapIntervals with edge/valid/invalid/confidence-level tests. Kept existing tests for basic/detailed stats. Imported pandas as pd.
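
For orientation, here is a minimal sketch of a helper with that signature, reconstructed from the snippets quoted in the reviews below (resampling with replacement, the bootstrap median as the central estimate, and mirrored percentile offsets for the bounds); it is an approximation, not the exact code merged in this PR.

import numpy as np
import pandas as pd


def bootstrap_intervals(
    s: pd.Series, confidence: int = 95, bootstrap_steps: int = 100_000
) -> tuple[float, float, float]:
    """Sketch of a basic-bootstrap confidence interval: returns (low, central, high)."""
    if not 0 <= confidence <= 100:
        raise ValueError("Invalid confidence, must be between 0 and 100")
    sample_n = len(s)
    sample_mean = float(s.mean())
    # Resample with replacement and record the mean of each resample.
    rates = np.array(
        [float(np.mean(s.sample(n=sample_n, replace=True))) for _ in range(bootstrap_steps)]
    )
    confidence_rev = 100 - confidence
    # The reviews note the central estimate is the median of the bootstrap distribution.
    central = float(np.median(rates))
    # Offsets of resample means from the sample mean; the upper offset yields the
    # lower bound and vice versa (the sign inversion discussed in the reviews).
    low_off = np.percentile(rates - sample_mean, confidence_rev / 2.0)
    high_off = np.percentile(rates - sample_mean, 100 - confidence_rev / 2.0)
    return float(sample_mean - high_off), central, float(sample_mean - low_off)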

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Caller
  participant Stats as _finalize_metric_stats
  participant Boot as bootstrap_intervals

  Caller->>Stats: finalize metric stats (scores)
  alt scores length > 1
    Stats->>Boot: compute CI (Series, confidence, steps)
    Boot-->>Stats: (low, mean, high)
    Stats-->>Caller: stats with score_statistics.confidence_interval
  else scores length <= 1 or failure
    Stats-->>Caller: stats with score_statistics.confidence_interval = None
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit boots and bounds through stats,
Sampling shadows, counting pats;
Low, mean, high—carrots align,
Confidence snug at ninety-five fine.
Tests burrow deep, tails aloft—
Intervals crisp, whiskers soft.
Hop! The numbers sing aloft.

Pre-merge checks

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title succinctly captures the main change of implementing correct confidence interval computation, which directly corresponds to the new bootstrap_intervals function and associated updates in metric statistics; it is concise, clear, and avoids unnecessary detail.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/output/statistics.py (1)

67-88: Consider extracting repetitive bootstrap logic.

The same pattern (bootstrap, convert to percentage, store in dict) is repeated three times for pass/fail/error rates. Extracting this to a helper function would improve maintainability and reduce duplication.

Example refactor:

def _compute_rate_confidence_interval(series: pd.Series) -> dict[str, Any]:
    """Compute confidence interval for a rate and return formatted dict."""
    ci_low, ci_mean, ci_high = bootstrap_intervals(series)
    return {
        "low": float(ci_low * 100),  # Convert to percentage
        "mean": float(ci_mean * 100),
        "high": float(ci_high * 100),
    }

# Then use it:
confidence_intervals = {
    "pass_rate": _compute_rate_confidence_interval(pass_series),
    "fail_rate": _compute_rate_confidence_interval(fail_series),
    "error_rate": _compute_rate_confidence_interval(error_series),
}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 94c8236 and 2e8c4dd.

📒 Files selected for processing (1)
  • src/lightspeed_evaluation/core/output/statistics.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationResult (185-224)
🔇 Additional comments (1)
src/lightspeed_evaluation/core/output/statistics.py (1)

20-35: Verify bootstrap implementation locally
Run the bootstrap tests in an environment with numpy/pandas to confirm:

  • Using median vs mean for the bootstrap central estimate
  • Correct percentile offsets and sign inversions for lower/upper bounds
  • Rename low/high for clarity if needed
  • Add input validation for empty series, NaN/inf, non-numeric types, and bootstrap_steps > 0

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
src/lightspeed_evaluation/core/output/statistics.py (2)

224-236: No-op try/except; simplify and add TODO.

This block always sets None. Remove the try/except and document the limitation.

-        # Calculate confidence intervals for conversation rates
-        if total > 1:  # Need at least 2 samples for meaningful bootstrap
-            try:
-                # Create binary series for each outcome type
-                # Note: We need to reconstruct the original results for this conversation
-                # Since we don't have access to the original results here,
-                # we'll skip CI for conversations. This could be enhanced by
-                # passing the original results to this function
-                stats["confidence_intervals"] = None
-            except (ValueError, RuntimeError):
-                stats["confidence_intervals"] = None
-        else:
-            stats["confidence_intervals"] = None
+        # TODO: Implement conversation-level confidence intervals when original
+        # results are available here to build binary series for bootstrap.
+        stats["confidence_intervals"] = None

16-21: Replace assert and add robust input validation.

Assertions can be disabled; also validate steps and sample size.

-    assert 0 <= confidence <= 100, "Invalid confidence, must be between 0 and 100"
+    if not (0 <= confidence <= 100):
+        raise ValueError(f"confidence must be between 0 and 100, got {confidence}")
+    if bootstrap_steps <= 0:
+        raise ValueError(f"bootstrap_steps must be positive, got {bootstrap_steps}")
+    if len(s) < 2:
+        raise ValueError("bootstrap_intervals requires at least 2 samples")

Based on static analysis hints

🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/output/statistics.py (1)

61-90: Don’t drop all intervals if one computation fails; handle per-rate independently.

Keep partial results; only omit failing entries. Set None only if all fail.

-    confidence_intervals = {}
-    if total > 1:  # Need at least 2 samples for meaningful bootstrap
-        try:
-            # Pass rate confidence interval
-            ci_low, ci_mean, ci_high = bootstrap_intervals(pass_series)
-            confidence_intervals["pass_rate"] = {
-                "low": float(ci_low * 100),  # Convert to percentage
-                "mean": float(ci_mean * 100),
-                "high": float(ci_high * 100),
-            }
-            
-            # Fail rate confidence interval
-            ci_low, ci_mean, ci_high = bootstrap_intervals(fail_series)
-            confidence_intervals["fail_rate"] = {
-                "low": float(ci_low * 100),
-                "mean": float(ci_mean * 100),
-                "high": float(ci_high * 100),
-            }
-            
-            # Error rate confidence interval
-            ci_low, ci_mean, ci_high = bootstrap_intervals(error_series)
-            confidence_intervals["error_rate"] = {
-                "low": float(ci_low * 100),
-                "mean": float(ci_mean * 100),
-                "high": float(ci_high * 100),
-            }
-        except (ValueError, RuntimeError):
-            confidence_intervals = None
-    else:
-        confidence_intervals = None
+    confidence_intervals: dict[str, dict[str, float]] | None = {}
+    if total > 1:  # Need at least 2 samples for meaningful bootstrap
+        try:
+            ci_low, ci_median, ci_high = bootstrap_intervals(pass_series)
+            confidence_intervals["pass_rate"] = {
+                "low": float(ci_low * 100),
+                "median": float(ci_median * 100),
+                "high": float(ci_high * 100),
+            }
+        except (ValueError, RuntimeError):
+            pass
+        try:
+            ci_low, ci_median, ci_high = bootstrap_intervals(fail_series)
+            confidence_intervals["fail_rate"] = {
+                "low": float(ci_low * 100),
+                "median": float(ci_median * 100),
+                "high": float(ci_high * 100),
+            }
+        except (ValueError, RuntimeError):
+            pass
+        try:
+            ci_low, ci_median, ci_high = bootstrap_intervals(error_series)
+            confidence_intervals["error_rate"] = {
+                "low": float(ci_low * 100),
+                "median": float(ci_median * 100),
+                "high": float(ci_high * 100),
+            }
+        except (ValueError, RuntimeError):
+            pass
+        if not confidence_intervals:
+            confidence_intervals = None
+    else:
+        confidence_intervals = None
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2e8c4dd and 500586d.

📒 Files selected for processing (1)
  • src/lightspeed_evaluation/core/output/statistics.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationResult (185-224)
🪛 GitHub Check: Bandit
src/lightspeed_evaluation/core/output/statistics.py

[notice] 16-16:
Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: pydocstyle
  • GitHub Check: mypy
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.12)

@bsatapat-jpg bsatapat-jpg force-pushed the dev branch 3 times, most recently from 7070411 to caa5200 on October 7, 2025 at 18:40
@bsatapat-jpg bsatapat-jpg changed the title from "LCORE-723: Improve reporting and compute correct confidence interval" to "LCORE-723: Compute correct confidence interval" on Oct 7, 2025
@bsatapat-jpg
Collaborator Author

Hi Team, PTAL. Thanks in advance.
cc: @VladimirKadlec @asamal4 @Anxhela21 @tisnik

Collaborator

@asamal4 asamal4 left a comment


Question: Are you going to add a different script to run the evaluation with multiple models?

Also, we don't need a CI for the pass/fail status; rather, we need it for the metric scores.

@bsatapat-jpg
Collaborator Author

Yes, I will be adding a different script to run the evaluation with multiple models.
Modified as per the feedback.
PTAL @asamal4
Thanks

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (4)
src/lightspeed_evaluation/core/output/statistics.py (4)

23-25: Optimize bootstrap performance.

The current implementation uses 100,000 iterations of pandas.sample() in a Python loop, which is very slow and may cause performance issues on larger datasets. Consider vectorizing the bootstrap using NumPy operations.

Apply this diff to improve performance:

-    rates = np.array(
-        [np.mean(s.sample(n=sample_n, replace=True)) for _ in range(bootstrap_steps)]
-    )
+    # Vectorized bootstrap using NumPy for better performance
+    values = s.to_numpy(copy=False)
+    rates = np.array(
+        [float(np.mean(np.random.choice(values, size=sample_n, replace=True)))
+         for _ in range(bootstrap_steps)]
+    )

For even better performance, consider reducing bootstrap_steps to 10,000 (still statistically sufficient) or implementing a fully vectorized approach using np.random.choice with a 2D array.
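
A minimal sketch of that fully vectorized variant (the function and variable names here are illustrative, not from the PR); it returns the array of bootstrap means, which the existing percentile logic could then consume:

import numpy as np
import pandas as pd


def bootstrap_means_vectorized(
    s: pd.Series, bootstrap_steps: int = 10_000, seed: int | None = None
) -> np.ndarray:
    """Draw all bootstrap resamples at once and return their means."""
    rng = np.random.default_rng(seed)
    values = s.to_numpy(copy=False)
    # One (bootstrap_steps, len(values)) matrix of resampled values,
    # then a single mean along axis 1 instead of a Python-level loop.
    resamples = rng.choice(values, size=(bootstrap_steps, len(values)), replace=True)
    return resamples.mean(axis=1)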


16-16: Replace assert with proper validation.

Using assert for parameter validation is unsafe because assertions are disabled with Python's -O optimization flag, allowing invalid confidence values to pass through unchecked at runtime. This is a critical correctness issue that must be fixed regardless of the code's origin.

Apply this diff to use proper validation:

-    assert 0 <= confidence <= 100, "Invalid confidence, must be between 0 and 100"
+    if not (0 <= confidence <= 100):
+        raise ValueError(f"confidence must be between 0 and 100, got {confidence}")

Note: This was flagged by static analysis (Bandit) and in past reviews, but remains unresolved.


12-33: Fix naming inconsistency: median labeled as mean.

The function computes the median of bootstrap samples (line 28) but exposes it as "mean" throughout the API (docstring line 15, return variable name line 28, and usage in lines 151-154). This misleading naming can confuse consumers of the API.

Apply these diffs to align the naming:

1. Update the function signature and docstring:

-    """Compute confidence interval using bootstraping, return low, mean, high."""
+    """Compute confidence interval using bootstrapping, return low, median, high."""

2. Rename the return variable:

     # Median (not mean) is correct here
-    success_rate_boot_strap = np.median(rates)
+    median_bootstrap = np.median(rates)
     low = np.percentile(rates - success_rate, (confidence_rev / 2.0))
     high = np.percentile(rates - success_rate, 100 - (confidence_rev / 2.0))

     # high represent lower bound, low represents upper bound
-    return success_rate - high, success_rate_boot_strap, success_rate - low
+    return success_rate - high, median_bootstrap, success_rate - low

3. Update all call sites to use the correct variable name (see lines 151, 154 in _finalize_metric_stats).


185-197: Simplify no-op try-except block.

The try-except-else structure sets confidence_intervals to None in all branches, making the try-except unnecessary. The comments correctly explain that conversation-level CIs cannot be computed without access to original results.

Apply this diff to simplify:

-        # Calculate confidence intervals for conversation rates
-        if total > 1:  # Need at least 2 samples for meaningful bootstrap
-            try:
-                # Create binary series for each outcome type
-                # Note: We need to reconstruct the original results for this conversation
-                # Since we don't have access to the original results here,
-                # we'll skip CI for conversations. This could be enhanced by
-                # passing the original results to this function
-                stats["confidence_intervals"] = None
-            except (ValueError, RuntimeError):
-                stats["confidence_intervals"] = None
-        else:
-            stats["confidence_intervals"] = None
+        # TODO: Implement conversation-level confidence intervals
+        # Requires passing original results to reconstruct binary series for bootstrap
+        # See: Lines 185-197 for context on why this is not yet implemented
+        stats["confidence_intervals"] = None

This was flagged in a previous review and marked as addressed, but the no-op structure remains.

🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/output/statistics.py (1)

151-156: Consider making confidence level configurable.

The confidence level is hardcoded to 95. While 95% is a standard choice, consider:

  1. Adding a module-level constant DEFAULT_CONFIDENCE_LEVEL = 95 for clarity and maintainability
  2. Or, exposing it as a parameter in calculate_detailed_stats if different projects need different confidence levels

Example of introducing a constant:

+# Default confidence level for statistical reporting (95% is standard)
+DEFAULT_CONFIDENCE_LEVEL = 95
+

 def bootstrap_intervals(
-    s: pd.Series, confidence: int = 95, bootstrap_steps: int = 100000
+    s: pd.Series, confidence: int = DEFAULT_CONFIDENCE_LEVEL, bootstrap_steps: int = 100000
 ) -> tuple[np.floating, np.floating, np.floating]:
                 score_stats["confidence_interval"] = {
                     "low": float(ci_low),
                     "mean": float(ci_mean),
                     "high": float(ci_high),
-                    "confidence_level": 95,  # Default confidence level
+                    "confidence_level": DEFAULT_CONFIDENCE_LEVEL,
                 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cf75bdc and f11a7d0.

📒 Files selected for processing (2)
  • src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
  • tests/unit/core/output/test_statistics.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationResult (185-224)
tests/unit/core/output/test_statistics.py (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
  • bootstrap_intervals (12-33)
🪛 GitHub Check: Bandit
src/lightspeed_evaluation/core/output/statistics.py

[notice] 16-16:
Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.13)
  • GitHub Check: mypy
🔇 Additional comments (4)
tests/unit/core/output/test_statistics.py (3)

4-11: LGTM!

The import additions correctly support testing the new bootstrap_intervals function.


61-75: Update test expectations when assert is replaced.

These tests correctly expect AssertionError for invalid confidence values, matching the current implementation. However, past reviews (and static analysis) flagged the use of assert in bootstrap_intervals as a critical issue because assertions can be disabled with Python's -O flag.

Once the implementation is updated to use proper validation (e.g., raise ValueError), these tests will need to be updated to expect ValueError instead of AssertionError.

Update the test expectations after fixing the assert issue in the implementation:

-        with pytest.raises(
-            AssertionError, match="Invalid confidence, must be between 0 and 100"
-        ):
+        with pytest.raises(
+            ValueError, match="Invalid confidence, must be between 0 and 100"
+        ):
             bootstrap_intervals(data, confidence=-5)

         # Test confidence > 100
-        with pytest.raises(
-            AssertionError, match="Invalid confidence, must be between 0 and 100"
-        ):
+        with pytest.raises(
+            ValueError, match="Invalid confidence, must be between 0 and 100"
+        ):
             bootstrap_intervals(data, confidence=150)

77-102: LGTM!

The edge case and extreme confidence level tests are well-designed:

  • Single value test appropriately uses strict equality (deterministic behavior)
  • All-same-values test uses tolerance for floating-point comparisons
  • Testing 0% and 100% confidence levels provides good boundary coverage
src/lightspeed_evaluation/core/output/statistics.py (1)

6-7: LGTM!

The NumPy and pandas imports are appropriate for the bootstrap confidence interval implementation.

@VladimirKadlec
Contributor

Also, we don't need a CI for the pass/fail status; rather, we need it for the metric scores.

We need confidence intervals especially for these.

Contributor

@VladimirKadlec VladimirKadlec left a comment


Great work overall 💪

My comment disappeared as you probably committed something, so I'm pasting it here:

In fact, we want to compute confidence intervals for PASS/FAIL/ERROR as well. We now report only the mean ("pass_rate", for example), which is misleading for comparisons.

You'll need to create a series of zeros and ones:

  • 1 for PASS/FAIL/ERROR (depending on the case)
  • 0 for the rest (non-PASS/FAIL/ERROR)

and pass it to the bootstrap_intervals function, as in the sketch below.
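
A minimal sketch of that idea, assuming a list of per-result status strings (the statuses list is hypothetical; bootstrap_intervals is the function added in this PR):

import pandas as pd

from lightspeed_evaluation.core.output.statistics import bootstrap_intervals

# Hypothetical per-result statuses for one run.
statuses = ["PASS", "FAIL", "PASS", "ERROR", "PASS", "PASS", "FAIL"]

# 1 where the result is PASS, 0 otherwise; build the FAIL/ERROR series the same way.
pass_series = pd.Series([1 if status == "PASS" else 0 for status in statuses])

# The mean of this binary series is the pass rate; the bootstrap gives its interval.
ci_low, ci_mid, ci_high = bootstrap_intervals(pass_series)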

@asamal4
Collaborator

asamal4 commented Oct 8, 2025

@VladimirKadlec

We now report only the mean ("pass_rate", for example), which is misleading for comparisons

pass_rate is the percentage of passes. We are not calculating a mean for pass_rate, so a CI is not required for it; it is needed for the actual metric score.

@asamal4
Collaborator

asamal4 commented Oct 8, 2025

If we don't want an assert here, we can just remove the line and add a comment about the valid range of confidence.

Alternatively, we can keep the check by using an if condition and raising ValueError.

@VladimirKadlec
Contributor

VladimirKadlec commented Oct 8, 2025

pass_rate is the percentage of passes. We are not calculating a mean for pass_rate, so a CI is not required for it; it is needed for the actual metric score.

@asamal4 pass_rate here is the mean of an integer array containing ones (for PASS) and zeros (for non-PASS).

Imagine two datasets, A and B, that you want to compare. On dataset A you get pass_rate = 0.6, on dataset B you get pass_rate = 0.7. Which system is better, A or B? From these numbers alone one cannot tell whether the difference in results is statistically significant.

Yes, we can manually compute, for example, a t-test from the reported mean, std dev, and n. But this PR is about helping the user compare results directly (for example, by checking whether the reported intervals overlap). The current comparison based only on pass_rate is incorrect and misleading.
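
A rough illustration of that argument with purely hypothetical numbers (bootstrap_intervals is the function added in this PR):

import pandas as pd

from lightspeed_evaluation.core.output.statistics import bootstrap_intervals

# Hypothetical binary outcomes (1 = PASS) for two runs.
run_a = pd.Series([1] * 6 + [0] * 4)  # pass_rate = 0.6
run_b = pd.Series([1] * 7 + [0] * 3)  # pass_rate = 0.7

low_a, _, high_a = bootstrap_intervals(run_a)
low_b, _, high_b = bootstrap_intervals(run_b)

# If the two intervals overlap substantially, the 0.6 vs 0.7 gap alone
# does not demonstrate a statistically meaningful difference.
print(f"A: [{low_a:.2f}, {high_a:.2f}]  B: [{low_b:.2f}, {high_b:.2f}]")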

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
src/lightspeed_evaluation/core/output/statistics.py (2)

148-162: Catch AssertionError from bootstrap_intervals.

The bootstrap_intervals function can raise AssertionError for invalid confidence values (line 16), but the exception handler only catches ValueError and RuntimeError. This means an AssertionError would propagate and crash the stats calculation.

             try:
                 ci_low, ci_mean, ci_high = bootstrap_intervals(scores_series)
                 score_stats["confidence_interval"] = {
                     "low": float(ci_low),
                     "mean": float(ci_mean),
                     "high": float(ci_high),
                     "confidence_level": 95,  # Default confidence level
                 }
-            except (ValueError, RuntimeError):
+            except (ValueError, RuntimeError, AssertionError):
                 # If bootstrap fails, set confidence interval to None
                 score_stats["confidence_interval"] = None

Note: Once you replace the assert with explicit validation (as suggested in the earlier comment), you can remove AssertionError from this handler.


16-16: Replace assert with explicit validation.

Based on the discussion history, using assert for parameter validation is problematic because assertions are disabled with Python's -O flag. Since you proposed using an if condition with ValueError, apply that approach here:

-    assert 0 <= confidence <= 100, "Invalid confidence, must be between 0 and 100"
+    if not (0 <= confidence <= 100):
+        raise ValueError(f"confidence must be between 0 and 100, got {confidence}")

This ensures the validation always runs regardless of optimization flags.

🧹 Nitpick comments (4)
src/lightspeed_evaluation/core/output/statistics.py (4)

23-25: Consider optimizing the bootstrap loop for better performance.

The current implementation performs 100,000 iterations with pandas.sample() in a Python loop, which can be slow for larger datasets. While the logic is correct, performance can be significantly improved:

Option 1: Reduce default steps

-    s: pd.Series, confidence: int = 95, bootstrap_steps: int = 100000
+    s: pd.Series, confidence: int = 95, bootstrap_steps: int = 10_000

Option 2: Vectorize with NumPy (more efficient)

-    rates = np.array(
-        [np.mean(s.sample(n=sample_n, replace=True)) for _ in range(bootstrap_steps)]
-    )
+    # Use NumPy for faster resampling
+    values = s.to_numpy(copy=False)
+    rates = np.array(
+        [float(np.mean(np.random.choice(values, size=sample_n, replace=True)))
+         for _ in range(bootstrap_steps)]
+    )

This maintains the same statistical properties while improving execution time.


28-28: Variable name suggests mean but contains median.

The variable mean_boot_strap actually stores the median of the bootstrap distribution (line 28), which may confuse future maintainers. Consider renaming for clarity:

-    # Median (not mean) is correct here
-    mean_boot_strap = np.median(rates)
+    # Median of bootstrap distribution
+    median_boot_strap = np.median(rates)
     low = np.percentile(rates - sample_mean, (confidence_rev / 2.0))
     high = np.percentile(rates - sample_mean, 100 - (confidence_rev / 2.0))
 
     # high represent lower bound, low represents upper bound
-    return sample_mean - high, mean_boot_strap, sample_mean - low
+    return sample_mean - high, median_boot_strap, sample_mean - low

Note: This would require updating the API documentation and call sites (lines 151, 154) to reflect that the middle value is the median, not the mean.


151-156: API labels interval center as "mean" but it's actually the median.

The variable ci_mean and the API key "mean" suggest this is the arithmetic mean, but bootstrap_intervals returns the median of the bootstrap distribution (line 28). This could mislead consumers of this API.

If you address the naming in bootstrap_intervals, update the call sites to maintain consistency:

-                ci_low, ci_mean, ci_high = bootstrap_intervals(scores_series)
+                ci_low, ci_median, ci_high = bootstrap_intervals(scores_series)
                 score_stats["confidence_interval"] = {
                     "low": float(ci_low),
-                    "mean": float(ci_mean),
+                    "median": float(ci_median),
                     "high": float(ci_high),
                     "confidence_level": 95,  # Default confidence level
                 }

This makes the API more accurate and self-documenting.


185-197: Simplify the no-op try-except block.

All branches of this try-except set confidence_intervals to None. Since conversation-level CI requires the original results (which aren't available here), you can simplify this to a direct assignment with a TODO comment:

         # Calculate confidence intervals for conversation rates
-        if total > 1:  # Need at least 2 samples for meaningful bootstrap
-            try:
-                # Create binary series for each outcome type
-                # Note: We need to reconstruct the original results for this conversation
-                # Since we don't have access to the original results here,
-                # we'll skip CI for conversations. This could be enhanced by
-                # passing the original results to this function
-                stats["confidence_intervals"] = None
-            except (ValueError, RuntimeError):
-                stats["confidence_intervals"] = None
-        else:
-            stats["confidence_intervals"] = None
+        # TODO: Implement conversation-level confidence intervals
+        # Requires passing original results to reconstruct binary series for bootstrap
+        stats["confidence_intervals"] = None

This maintains the same behavior while being more concise and clearly marking it as future work.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f11a7d0 and 48bdfa8.

📒 Files selected for processing (2)
  • src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
  • tests/unit/core/output/test_statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit/core/output/test_statistics.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationResult (185-224)
🪛 GitHub Check: Bandit
src/lightspeed_evaluation/core/output/statistics.py

[notice] 16-16:
Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.11)

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
src/lightspeed_evaluation/core/output/statistics.py (3)

15-15: Fix typo in docstring.

Apply this diff:

-    """Compute confidence interval using bootstraping, return low, mean, high."""
+    """Compute confidence interval using bootstrapping, return low, mean, high."""

150-163: Consider catching additional pandas exceptions.

The current exception handling catches ValueError and RuntimeError. However, pandas operations in bootstrap_intervals could potentially raise TypeError (invalid data types) or IndexError (empty series edge cases) in certain scenarios.

Consider broadening the exception handling:

-            except (ValueError, RuntimeError):
+            except (ValueError, RuntimeError, TypeError, IndexError):
                 # If bootstrap fails, set confidence interval to None
                 score_stats["confidence_interval"] = None

186-198: Simplify no-op try-except block.

The try-except block on lines 188-196 always sets confidence_intervals to None in all paths. The comments clearly explain the limitation, but the try-except serves no purpose.

Apply this diff to simplify:

-        # Calculate confidence intervals for conversation rates
-        if total > 1:  # Need at least 2 samples for meaningful bootstrap
-            try:
-                # Create binary series for each outcome type
-                # Note: We need to reconstruct the original results for this conversation
-                # Since we don't have access to the original results here,
-                # we'll skip CI for conversations. This could be enhanced by
-                # passing the original results to this function
-                stats["confidence_intervals"] = None
-            except (ValueError, RuntimeError):
-                stats["confidence_intervals"] = None
-        else:
-            stats["confidence_intervals"] = None
+        # TODO: Implement confidence intervals for conversation rates
+        # Requires passing original results to reconstruct binary series for bootstrap.
+        # See comments in previous discussions for details.
+        stats["confidence_intervals"] = None

Note: A previous review comment marked this as "Addressed in commit 500586d," but the no-op structure remains.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 48bdfa8 and fb5ac84.

📒 Files selected for processing (2)
  • src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
  • tests/unit/core/output/test_statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit/core/output/test_statistics.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationResult (185-224)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: mypy
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.13)
🔇 Additional comments (1)
src/lightspeed_evaluation/core/output/statistics.py (1)

6-7: LGTM!

The numpy and pandas imports are necessary for the bootstrap confidence interval implementation.

@asamal4
Collaborator

asamal4 commented Oct 8, 2025

@VladimirKadlec

@asamal4 pass_rate here is the mean of an integer array containing ones (for PASS) and zeros (for non-PASS).

There is a difference between evaluation status and score. I am referring to the current pass rate implementation: it is the percentage of evaluations that passed, calculated from the execution status (PASS, FAIL). The status is determined using a threshold; it is not the actual score. So when we compare two models with pass rates of 0.6 and 0.7, that means 6 or 7 out of 10 queries passed. This is not an average score, so it is not misleading and gives us a direct comparison.

My comment is about calculating CI for the actual score distribution (not on the status).
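
A small sketch of the distinction (the scores and threshold are hypothetical; bootstrap_intervals is the function added in this PR):

import pandas as pd

from lightspeed_evaluation.core.output.statistics import bootstrap_intervals

scores = pd.Series([0.62, 0.81, 0.74, 0.55, 0.90, 0.68])
threshold = 0.7  # hypothetical pass threshold

# Status-based view: pass_rate is the fraction of results at or above the threshold.
pass_rate = float((scores >= threshold).mean()) * 100  # 50.0 here

# Score-based view: the CI describes the metric score distribution itself.
ci_low, ci_mid, ci_high = bootstrap_intervals(scores)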

@bsatapat-jpg
Collaborator Author

@asamal4 @VladimirKadlec PTAL.
Thanks

Collaborator

@asamal4 asamal4 left a comment


LGTM

@VladimirKadlec
Contributor

There is a difference between evaluation status and score. I am referring to the current pass rate implementation: it is the percentage of evaluations that passed, calculated from the execution status (PASS, FAIL). The status is determined using a threshold; it is not the actual score. So when we compare two models with pass rates of 0.6 and 0.7, that means 6 or 7 out of 10 queries passed. This is not an average score, so it is not misleading and gives us a direct comparison.

I understand your point; it's true only if the datasets for both runs were identical. Let's move this discussion outside the PR :)

Contributor

@VladimirKadlec VladimirKadlec left a comment


LGTM

Contributor

@tisnik tisnik left a comment


it looks nice, LGTM

@tisnik tisnik merged commit 3a7b865 into lightspeed-core:main Oct 8, 2025
15 checks passed
@bsatapat-jpg bsatapat-jpg deleted the dev branch on October 8, 2025 at 17:22