
LCORE-723: Compute correct confidence interval#71

Merged
tisnik merged 1 commit into lightspeed-core:main from bsatapat-jpg:dev
Oct 8, 2025

Conversation

@bsatapat-jpg
Collaborator

@bsatapat-jpg bsatapat-jpg commented Oct 7, 2025

Evaluate confidence interval using the custom procedure

Summary by CodeRabbit

  • New Features
    • Metric score reports now include bootstrap-based confidence intervals (low, mean, high) at a 95% level when sufficient data is available; otherwise, they are shown as not available (see the example after this list).
    • Added a placeholder field for conversation-level confidence intervals (not yet computed).
  • Tests
    • Added comprehensive tests for the new confidence interval calculations, including edge cases and validation scenarios.
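
For orientation, a metric's confidence_interval entry would look roughly like this (values are illustrative; the low/mean/high/confidence_level keys come from the review discussion below):

score_statistics = {
    "confidence_interval": {
        "low": 0.71,   # illustrative value
        "mean": 0.78,  # illustrative value
        "high": 0.84,  # illustrative value
        "confidence_level": 95,
    }
}
# With fewer than two scores, confidence_interval is reported as None.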

@coderabbitai
Contributor

coderabbitai bot commented Oct 7, 2025

Walkthrough

Introduces bootstrap-based confidence interval computation for metric scores via a new bootstrap_intervals function, integrates CI into _finalize_metric_stats output, adds a placeholder confidence_intervals field at conversation level, updates imports, and adds unit tests covering bootstrap_intervals use and export.

Changes

  • Core statistics logic (src/lightspeed_evaluation/core/output/statistics.py): Added bootstrap_intervals(s: pd.Series, confidence=95, bootstrap_steps=100000) computing (low, mean, high). Updated _finalize_metric_stats to compute/attach score_statistics["confidence_interval"] when len(scores) > 1, else None. Ensured confidence_interval present even with no scores. Added conversation-level confidence_intervals field (set to None). Imported numpy as np, pandas as pd. Public API now includes CI fields.
  • Unit tests (tests/unit/core/output/test_statistics.py): Exported bootstrap_intervals from module. Added TestBootstrapIntervals with edge/valid/invalid/confidence-level tests. Kept existing tests for basic/detailed stats. Imported pandas as pd.
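
For orientation, here is a minimal sketch of a helper with that signature, reconstructed from the snippets quoted in the reviews below (resampling with replacement, the bootstrap median as the central estimate, and mirrored percentile offsets for the bounds); it is an approximation, not the exact code merged in this PR.

import numpy as np
import pandas as pd


def bootstrap_intervals(
    s: pd.Series, confidence: int = 95, bootstrap_steps: int = 100_000
) -> tuple[float, float, float]:
    """Sketch of a basic-bootstrap confidence interval: returns (low, central, high)."""
    if not 0 <= confidence <= 100:
        raise ValueError("Invalid confidence, must be between 0 and 100")
    sample_n = len(s)
    sample_mean = float(s.mean())
    # Resample with replacement and record the mean of each resample.
    rates = np.array(
        [float(np.mean(s.sample(n=sample_n, replace=True))) for _ in range(bootstrap_steps)]
    )
    confidence_rev = 100 - confidence
    # The reviews note the central estimate is the median of the bootstrap distribution.
    central = float(np.median(rates))
    # Offsets of resample means from the sample mean; the upper offset yields the
    # lower bound and vice versa (the sign inversion discussed in the reviews).
    low_off = np.percentile(rates - sample_mean, confidence_rev / 2.0)
    high_off = np.percentile(rates - sample_mean, 100 - confidence_rev / 2.0)
    return float(sample_mean - high_off), central, float(sample_mean - low_off)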

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Caller
  participant Stats as _finalize_metric_stats
  participant Boot as bootstrap_intervals

  Caller->>Stats: finalize metric stats (scores)
  alt scores length > 1
    Stats->>Boot: compute CI (Series, confidence, steps)
    Boot-->>Stats: (low, mean, high)
    Stats-->>Caller: stats with score_statistics.confidence_interval
  else scores length <= 1 or failure
    Stats-->>Caller: stats with score_statistics.confidence_interval = None
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit boots and bounds through stats,
Sampling shadows, counting pats;
Low, mean, high—carrots align,
Confidence snug at ninety-five fine.
Tests burrow deep, tails aloft—
Intervals crisp, whiskers soft.
Hop! The numbers sing aloft.

Pre-merge checks

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title succinctly captures the main change of implementing correct confidence interval computation, which directly corresponds to the new bootstrap_intervals function and associated updates in metric statistics; it is concise, clear, and avoids unnecessary detail.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/output/statistics.py (1)

67-88: Consider extracting repetitive bootstrap logic.

The same pattern (bootstrap, convert to percentage, store in dict) is repeated three times for pass/fail/error rates. Extracting this to a helper function would improve maintainability and reduce duplication.

Example refactor:

def _compute_rate_confidence_interval(series: pd.Series) -> dict[str, Any]:
    """Compute confidence interval for a rate and return formatted dict."""
    ci_low, ci_mean, ci_high = bootstrap_intervals(series)
    return {
        "low": float(ci_low * 100),  # Convert to percentage
        "mean": float(ci_mean * 100),
        "high": float(ci_high * 100),
    }

# Then use it:
confidence_intervals = {
    "pass_rate": _compute_rate_confidence_interval(pass_series),
    "fail_rate": _compute_rate_confidence_interval(fail_series),
    "error_rate": _compute_rate_confidence_interval(error_series),
}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 94c8236 and 2e8c4dd.

📒 Files selected for processing (1)
  • src/lightspeed_evaluation/core/output/statistics.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationResult (185-224)
🔇 Additional comments (1)
src/lightspeed_evaluation/core/output/statistics.py (1)

20-35: Verify bootstrap implementation locally
Run the bootstrap tests in an environment with numpy/pandas to confirm:

  • Using median vs mean for the bootstrap central estimate
  • Correct percentile offsets and sign inversions for lower/upper bounds
  • Rename low/high for clarity if needed
  • Add input validation for empty series, NaN/inf, non-numeric types, and bootstrap_steps > 0

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
src/lightspeed_evaluation/core/output/statistics.py (2)

224-236: No-op try/except; simplify and add TODO.

This block always sets None. Remove the try/except and document the limitation.

-        # Calculate confidence intervals for conversation rates
-        if total > 1:  # Need at least 2 samples for meaningful bootstrap
-            try:
-                # Create binary series for each outcome type
-                # Note: We need to reconstruct the original results for this conversation
-                # Since we don't have access to the original results here,
-                # we'll skip CI for conversations. This could be enhanced by
-                # passing the original results to this function
-                stats["confidence_intervals"] = None
-            except (ValueError, RuntimeError):
-                stats["confidence_intervals"] = None
-        else:
-            stats["confidence_intervals"] = None
+        # TODO: Implement conversation-level confidence intervals when original
+        # results are available here to build binary series for bootstrap.
+        stats["confidence_intervals"] = None

16-21: Replace assert and add robust input validation.

Assertions can be disabled; also validate steps and sample size.

-    assert 0 <= confidence <= 100, "Invalid confidence, must be between 0 and 100"
+    if not (0 <= confidence <= 100):
+        raise ValueError(f"confidence must be between 0 and 100, got {confidence}")
+    if bootstrap_steps <= 0:
+        raise ValueError(f"bootstrap_steps must be positive, got {bootstrap_steps}")
+    if len(s) < 2:
+        raise ValueError("bootstrap_intervals requires at least 2 samples")

Based on static analysis hints

🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/output/statistics.py (1)

61-90: Don’t drop all intervals if one computation fails; handle per-rate independently.

Keep partial results; only omit failing entries. Set None only if all fail.

-    confidence_intervals = {}
-    if total > 1:  # Need at least 2 samples for meaningful bootstrap
-        try:
-            # Pass rate confidence interval
-            ci_low, ci_mean, ci_high = bootstrap_intervals(pass_series)
-            confidence_intervals["pass_rate"] = {
-                "low": float(ci_low * 100),  # Convert to percentage
-                "mean": float(ci_mean * 100),
-                "high": float(ci_high * 100),
-            }
-            
-            # Fail rate confidence interval
-            ci_low, ci_mean, ci_high = bootstrap_intervals(fail_series)
-            confidence_intervals["fail_rate"] = {
-                "low": float(ci_low * 100),
-                "mean": float(ci_mean * 100),
-                "high": float(ci_high * 100),
-            }
-            
-            # Error rate confidence interval
-            ci_low, ci_mean, ci_high = bootstrap_intervals(error_series)
-            confidence_intervals["error_rate"] = {
-                "low": float(ci_low * 100),
-                "mean": float(ci_mean * 100),
-                "high": float(ci_high * 100),
-            }
-        except (ValueError, RuntimeError):
-            confidence_intervals = None
-    else:
-        confidence_intervals = None
+    confidence_intervals: dict[str, dict[str, float]] | None = {}
+    if total > 1:  # Need at least 2 samples for meaningful bootstrap
+        try:
+            ci_low, ci_median, ci_high = bootstrap_intervals(pass_series)
+            confidence_intervals["pass_rate"] = {
+                "low": float(ci_low * 100),
+                "median": float(ci_median * 100),
+                "high": float(ci_high * 100),
+            }
+        except (ValueError, RuntimeError):
+            pass
+        try:
+            ci_low, ci_median, ci_high = bootstrap_intervals(fail_series)
+            confidence_intervals["fail_rate"] = {
+                "low": float(ci_low * 100),
+                "median": float(ci_median * 100),
+                "high": float(ci_high * 100),
+            }
+        except (ValueError, RuntimeError):
+            pass
+        try:
+            ci_low, ci_median, ci_high = bootstrap_intervals(error_series)
+            confidence_intervals["error_rate"] = {
+                "low": float(ci_low * 100),
+                "median": float(ci_median * 100),
+                "high": float(ci_high * 100),
+            }
+        except (ValueError, RuntimeError):
+            pass
+        if not confidence_intervals:
+            confidence_intervals = None
+    else:
+        confidence_intervals = None
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2e8c4dd and 500586d.

📒 Files selected for processing (1)
  • src/lightspeed_evaluation/core/output/statistics.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationResult (185-224)
🪛 GitHub Check: Bandit
src/lightspeed_evaluation/core/output/statistics.py

[notice] 16-16:
Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: pydocstyle
  • GitHub Check: mypy
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.12)

@bsatapat-jpg bsatapat-jpg force-pushed the dev branch 3 times, most recently from 7070411 to caa5200 on October 7, 2025 at 18:40
@bsatapat-jpg bsatapat-jpg changed the title from "LCORE-723: Improve reporting and compute correct confidence interval" to "LCORE-723: Compute correct confidence interval" on Oct 7, 2025
@bsatapat-jpg
Collaborator Author

Hi Team, PTAL. Thanks in advance.
cc: @VladimirKadlec @asamal4 @Anxhela21 @tisnik

Collaborator

@asamal4 asamal4 left a comment


Question: Are you going to add a different script to run the evaluation with multiple models?

Also, we don't need a CI for the pass/fail status; rather, we need it for the metric scores.

@bsatapat-jpg
Collaborator Author

Yes, I will be adding a different script to run the evaluation with multiple models.
Modified as per the feedback.
PTAL @asamal4
Thanks

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (4)
src/lightspeed_evaluation/core/output/statistics.py (4)

23-25: Optimize bootstrap performance.

The current implementation uses 100,000 iterations of pandas.sample() in a Python loop, which is very slow and may cause performance issues on larger datasets. Consider vectorizing the bootstrap using NumPy operations.

Apply this diff to improve performance:

-    rates = np.array(
-        [np.mean(s.sample(n=sample_n, replace=True)) for _ in range(bootstrap_steps)]
-    )
+    # Vectorized bootstrap using NumPy for better performance
+    values = s.to_numpy(copy=False)
+    rates = np.array(
+        [float(np.mean(np.random.choice(values, size=sample_n, replace=True)))
+         for _ in range(bootstrap_steps)]
+    )

For even better performance, consider reducing bootstrap_steps to 10,000 (still statistically sufficient) or implementing a fully vectorized approach using np.random.choice with a 2D array.
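
A minimal sketch of that fully vectorized variant (the function and variable names here are illustrative, not from the PR); it returns the array of bootstrap means, which the existing percentile logic could then consume:

import numpy as np
import pandas as pd


def bootstrap_means_vectorized(
    s: pd.Series, bootstrap_steps: int = 10_000, seed: int | None = None
) -> np.ndarray:
    """Draw all bootstrap resamples at once and return their means."""
    rng = np.random.default_rng(seed)
    values = s.to_numpy(copy=False)
    # One (bootstrap_steps, len(values)) matrix of resampled values,
    # then a single mean along axis 1 instead of a Python-level loop.
    resamples = rng.choice(values, size=(bootstrap_steps, len(values)), replace=True)
    return resamples.mean(axis=1)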


16-16: Replace assert with proper validation.

Using assert for parameter validation is unsafe because assertions are disabled with Python's -O optimization flag, allowing invalid confidence values to pass through unchecked at runtime. This is a critical correctness issue that must be fixed regardless of the code's origin.

Apply this diff to use proper validation:

-    assert 0 <= confidence <= 100, "Invalid confidence, must be between 0 and 100"
+    if not (0 <= confidence <= 100):
+        raise ValueError(f"confidence must be between 0 and 100, got {confidence}")

Note: This was flagged by static analysis (Bandit) and in past reviews, but remains unresolved.


12-33: Fix naming inconsistency: median labeled as mean.

The function computes the median of bootstrap samples (line 28) but exposes it as "mean" throughout the API (docstring line 15, return variable name line 28, and usage in lines 151-154). This misleading naming can confuse consumers of the API.

Apply these diffs to align the naming:

1. Update the function signature and docstring:

-    """Compute confidence interval using bootstraping, return low, mean, high."""
+    """Compute confidence interval using bootstrapping, return low, median, high."""

2. Rename the return variable:

     # Median (not mean) is correct here
-    success_rate_boot_strap = np.median(rates)
+    median_bootstrap = np.median(rates)
     low = np.percentile(rates - success_rate, (confidence_rev / 2.0))
     high = np.percentile(rates - success_rate, 100 - (confidence_rev / 2.0))

     # high represent lower bound, low represents upper bound
-    return success_rate - high, success_rate_boot_strap, success_rate - low
+    return success_rate - high, median_bootstrap, success_rate - low

3. Update all call sites to use the correct variable name (see lines 151, 154 in _finalize_metric_stats).


185-197: Simplify no-op try-except block.

The try-except-else structure sets confidence_intervals to None in all branches, making the try-except unnecessary. The comments correctly explain that conversation-level CIs cannot be computed without access to original results.

Apply this diff to simplify:

-        # Calculate confidence intervals for conversation rates
-        if total > 1:  # Need at least 2 samples for meaningful bootstrap
-            try:
-                # Create binary series for each outcome type
-                # Note: We need to reconstruct the original results for this conversation
-                # Since we don't have access to the original results here,
-                # we'll skip CI for conversations. This could be enhanced by
-                # passing the original results to this function
-                stats["confidence_intervals"] = None
-            except (ValueError, RuntimeError):
-                stats["confidence_intervals"] = None
-        else:
-            stats["confidence_intervals"] = None
+        # TODO: Implement conversation-level confidence intervals
+        # Requires passing original results to reconstruct binary series for bootstrap
+        # See: Lines 185-197 for context on why this is not yet implemented
+        stats["confidence_intervals"] = None

This was flagged in a previous review and marked as addressed, but the no-op structure remains.

🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/output/statistics.py (1)

151-156: Consider making confidence level configurable.

The confidence level is hardcoded to 95. While 95% is a standard choice, consider:

  1. Adding a module-level constant DEFAULT_CONFIDENCE_LEVEL = 95 for clarity and maintainability
  2. Or, exposing it as a parameter in calculate_detailed_stats if different projects need different confidence levels

Example of introducing a constant:

+# Default confidence level for statistical reporting (95% is standard)
+DEFAULT_CONFIDENCE_LEVEL = 95
+

 def bootstrap_intervals(
-    s: pd.Series, confidence: int = 95, bootstrap_steps: int = 100000
+    s: pd.Series, confidence: int = DEFAULT_CONFIDENCE_LEVEL, bootstrap_steps: int = 100000
 ) -> tuple[np.floating, np.floating, np.floating]:
                 score_stats["confidence_interval"] = {
                     "low": float(ci_low),
                     "mean": float(ci_mean),
                     "high": float(ci_high),
-                    "confidence_level": 95,  # Default confidence level
+                    "confidence_level": DEFAULT_CONFIDENCE_LEVEL,
                 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cf75bdc and f11a7d0.

📒 Files selected for processing (2)
  • src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
  • tests/unit/core/output/test_statistics.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationResult (185-224)
tests/unit/core/output/test_statistics.py (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
  • bootstrap_intervals (12-33)
🪛 GitHub Check: Bandit
src/lightspeed_evaluation/core/output/statistics.py

[notice] 16-16:
Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.13)
  • GitHub Check: mypy
🔇 Additional comments (4)
tests/unit/core/output/test_statistics.py (3)

4-11: LGTM!

The import additions correctly support testing the new bootstrap_intervals function.


61-75: Update test expectations when assert is replaced.

These tests correctly expect AssertionError for invalid confidence values, matching the current implementation. However, past reviews (and static analysis) flagged the use of assert in bootstrap_intervals as a critical issue because assertions can be disabled with Python's -O flag.

Once the implementation is updated to use proper validation (e.g., raise ValueError), these tests will need to be updated to expect ValueError instead of AssertionError.

Update the test expectations after fixing the assert issue in the implementation:

-        with pytest.raises(
-            AssertionError, match="Invalid confidence, must be between 0 and 100"
-        ):
+        with pytest.raises(
+            ValueError, match="Invalid confidence, must be between 0 and 100"
+        ):
             bootstrap_intervals(data, confidence=-5)

         # Test confidence > 100
-        with pytest.raises(
-            AssertionError, match="Invalid confidence, must be between 0 and 100"
-        ):
+        with pytest.raises(
+            ValueError, match="Invalid confidence, must be between 0 and 100"
+        ):
             bootstrap_intervals(data, confidence=150)

77-102: LGTM!

The edge case and extreme confidence level tests are well-designed:

  • Single value test appropriately uses strict equality (deterministic behavior)
  • All-same-values test uses tolerance for floating-point comparisons
  • Testing 0% and 100% confidence levels provides good boundary coverage
src/lightspeed_evaluation/core/output/statistics.py (1)

6-7: LGTM!

The NumPy and pandas imports are appropriate for the bootstrap confidence interval implementation.

@VladimirKadlec
Contributor

Also, we don't need a CI for the pass/fail status; rather, we need it for the metric scores.

We need confidence intervals especially for these.

Contributor

@VladimirKadlec VladimirKadlec left a comment


Great work overall 💪

My comment disappeared as you probably committed something, so I'm pasting it here:

In fact, we want to compute confidence intervals for PASS/FAIL/ERROR as well. We now report only the mean ("pass_rate", for example), which is misleading for comparisons.

You'll need to create a series of zeros and ones:

  • 1 for PASS/FAIL/ERROR (depending on the case)
  • 0 for the rest (non-PASS/FAIL/ERROR)

and pass it to the bootstrap_intervals function, as in the sketch below.
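
A minimal sketch of that idea, assuming a list of per-result status strings (the statuses list is hypothetical; bootstrap_intervals is the function added in this PR):

import pandas as pd

from lightspeed_evaluation.core.output.statistics import bootstrap_intervals

# Hypothetical per-result statuses for one run.
statuses = ["PASS", "FAIL", "PASS", "ERROR", "PASS", "PASS", "FAIL"]

# 1 where the result is PASS, 0 otherwise; build the FAIL/ERROR series the same way.
pass_series = pd.Series([1 if status == "PASS" else 0 for status in statuses])

# The mean of this binary series is the pass rate; the bootstrap gives its interval.
ci_low, ci_mid, ci_high = bootstrap_intervals(pass_series)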

@asamal4
Collaborator

asamal4 commented Oct 8, 2025

@VladimirKadlec

We now report only the mean ("pass_rate", for example), which is misleading for comparisons

pass_rate is the percentage of passes. We are not calculating a mean for pass_rate, so a CI is not required for it; it is needed for the actual metric score.

@asamal4
Collaborator

asamal4 commented Oct 8, 2025

If we don't want an assert here, we can just remove the line and add a comment about the valid range of confidence.

Alternatively, we can keep the check by using an if condition and raising ValueError.

@VladimirKadlec
Contributor

VladimirKadlec commented Oct 8, 2025

pass_rate is the percentage of passes. We are not calculating a mean for pass_rate, so a CI is not required for it; it is needed for the actual metric score.

@asamal4 pass_rate here is the mean of an integer array containing ones (for PASS) and zeros (for non-PASS).

Imagine two datasets, A and B, that you want to compare. On dataset A you get pass_rate = 0.6, on dataset B you get pass_rate = 0.7. Which system is better, A or B? From these numbers alone one cannot tell whether the difference in results is statistically significant.

Yes, we can manually compute, for example, a t-test from the reported mean, std dev, and n. But this PR is about helping the user compare results directly (for example, by checking whether the reported intervals overlap). The current comparison based only on pass_rate is incorrect and misleading.
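
A rough illustration of that argument with purely hypothetical numbers (bootstrap_intervals is the function added in this PR):

import pandas as pd

from lightspeed_evaluation.core.output.statistics import bootstrap_intervals

# Hypothetical binary outcomes (1 = PASS) for two runs.
run_a = pd.Series([1] * 6 + [0] * 4)  # pass_rate = 0.6
run_b = pd.Series([1] * 7 + [0] * 3)  # pass_rate = 0.7

low_a, _, high_a = bootstrap_intervals(run_a)
low_b, _, high_b = bootstrap_intervals(run_b)

# If the two intervals overlap substantially, the 0.6 vs 0.7 gap alone
# does not demonstrate a statistically meaningful difference.
print(f"A: [{low_a:.2f}, {high_a:.2f}]  B: [{low_b:.2f}, {high_b:.2f}]")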

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
src/lightspeed_evaluation/core/output/statistics.py (2)

148-162: Catch AssertionError from bootstrap_intervals.

The bootstrap_intervals function can raise AssertionError for invalid confidence values (line 16), but the exception handler only catches ValueError and RuntimeError. This means an AssertionError would propagate and crash the stats calculation.

             try:
                 ci_low, ci_mean, ci_high = bootstrap_intervals(scores_series)
                 score_stats["confidence_interval"] = {
                     "low": float(ci_low),
                     "mean": float(ci_mean),
                     "high": float(ci_high),
                     "confidence_level": 95,  # Default confidence level
                 }
-            except (ValueError, RuntimeError):
+            except (ValueError, RuntimeError, AssertionError):
                 # If bootstrap fails, set confidence interval to None
                 score_stats["confidence_interval"] = None

Note: Once you replace the assert with explicit validation (as suggested in the earlier comment), you can remove AssertionError from this handler.


16-16: Replace assert with explicit validation.

Based on the discussion history, using assert for parameter validation is problematic because assertions are disabled with Python's -O flag. Since you proposed using an if condition with ValueError, apply that approach here:

-    assert 0 <= confidence <= 100, "Invalid confidence, must be between 0 and 100"
+    if not (0 <= confidence <= 100):
+        raise ValueError(f"confidence must be between 0 and 100, got {confidence}")

This ensures the validation always runs regardless of optimization flags.

🧹 Nitpick comments (4)
src/lightspeed_evaluation/core/output/statistics.py (4)

23-25: Consider optimizing the bootstrap loop for better performance.

The current implementation performs 100,000 iterations with pandas.sample() in a Python loop, which can be slow for larger datasets. While the logic is correct, performance can be significantly improved:

Option 1: Reduce default steps

-    s: pd.Series, confidence: int = 95, bootstrap_steps: int = 100000
+    s: pd.Series, confidence: int = 95, bootstrap_steps: int = 10_000

Option 2: Vectorize with NumPy (more efficient)

-    rates = np.array(
-        [np.mean(s.sample(n=sample_n, replace=True)) for _ in range(bootstrap_steps)]
-    )
+    # Use NumPy for faster resampling
+    values = s.to_numpy(copy=False)
+    rates = np.array(
+        [float(np.mean(np.random.choice(values, size=sample_n, replace=True)))
+         for _ in range(bootstrap_steps)]
+    )

This maintains the same statistical properties while improving execution time.


28-28: Variable name suggests mean but contains median.

The variable mean_boot_strap actually stores the median of the bootstrap distribution (line 28), which may confuse future maintainers. Consider renaming for clarity:

-    # Median (not mean) is correct here
-    mean_boot_strap = np.median(rates)
+    # Median of bootstrap distribution
+    median_boot_strap = np.median(rates)
     low = np.percentile(rates - sample_mean, (confidence_rev / 2.0))
     high = np.percentile(rates - sample_mean, 100 - (confidence_rev / 2.0))
 
     # high represent lower bound, low represents upper bound
-    return sample_mean - high, mean_boot_strap, sample_mean - low
+    return sample_mean - high, median_boot_strap, sample_mean - low

Note: This would require updating the API documentation and call sites (lines 151, 154) to reflect that the middle value is the median, not the mean.


151-156: API labels interval center as "mean" but it's actually the median.

The variable ci_mean and the API key "mean" suggest this is the arithmetic mean, but bootstrap_intervals returns the median of the bootstrap distribution (line 28). This could mislead consumers of this API.

If you address the naming in bootstrap_intervals, update the call sites to maintain consistency:

-                ci_low, ci_mean, ci_high = bootstrap_intervals(scores_series)
+                ci_low, ci_median, ci_high = bootstrap_intervals(scores_series)
                 score_stats["confidence_interval"] = {
                     "low": float(ci_low),
-                    "mean": float(ci_mean),
+                    "median": float(ci_median),
                     "high": float(ci_high),
                     "confidence_level": 95,  # Default confidence level
                 }

This makes the API more accurate and self-documenting.


185-197: Simplify the no-op try-except block.

All branches of this try-except set confidence_intervals to None. Since conversation-level CI requires the original results (which aren't available here), you can simplify this to a direct assignment with a TODO comment:

         # Calculate confidence intervals for conversation rates
-        if total > 1:  # Need at least 2 samples for meaningful bootstrap
-            try:
-                # Create binary series for each outcome type
-                # Note: We need to reconstruct the original results for this conversation
-                # Since we don't have access to the original results here,
-                # we'll skip CI for conversations. This could be enhanced by
-                # passing the original results to this function
-                stats["confidence_intervals"] = None
-            except (ValueError, RuntimeError):
-                stats["confidence_intervals"] = None
-        else:
-            stats["confidence_intervals"] = None
+        # TODO: Implement conversation-level confidence intervals
+        # Requires passing original results to reconstruct binary series for bootstrap
+        stats["confidence_intervals"] = None

This maintains the same behavior while being more concise and clearly marking it as future work.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f11a7d0 and 48bdfa8.

📒 Files selected for processing (2)
  • src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
  • tests/unit/core/output/test_statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit/core/output/test_statistics.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationResult (185-224)
🪛 GitHub Check: Bandit
src/lightspeed_evaluation/core/output/statistics.py

[notice] 16-16:
Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.11)

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
src/lightspeed_evaluation/core/output/statistics.py (3)

15-15: Fix typo in docstring.

Apply this diff:

-    """Compute confidence interval using bootstraping, return low, mean, high."""
+    """Compute confidence interval using bootstrapping, return low, mean, high."""

150-163: Consider catching additional pandas exceptions.

The current exception handling catches ValueError and RuntimeError. However, pandas operations in bootstrap_intervals could potentially raise TypeError (invalid data types) or IndexError (empty series edge cases) in certain scenarios.

Consider broadening the exception handling:

-            except (ValueError, RuntimeError):
+            except (ValueError, RuntimeError, TypeError, IndexError):
                 # If bootstrap fails, set confidence interval to None
                 score_stats["confidence_interval"] = None

186-198: Simplify no-op try-except block.

The try-except block on lines 188-196 always sets confidence_intervals to None in all paths. The comments clearly explain the limitation, but the try-except serves no purpose.

Apply this diff to simplify:

-        # Calculate confidence intervals for conversation rates
-        if total > 1:  # Need at least 2 samples for meaningful bootstrap
-            try:
-                # Create binary series for each outcome type
-                # Note: We need to reconstruct the original results for this conversation
-                # Since we don't have access to the original results here,
-                # we'll skip CI for conversations. This could be enhanced by
-                # passing the original results to this function
-                stats["confidence_intervals"] = None
-            except (ValueError, RuntimeError):
-                stats["confidence_intervals"] = None
-        else:
-            stats["confidence_intervals"] = None
+        # TODO: Implement confidence intervals for conversation rates
+        # Requires passing original results to reconstruct binary series for bootstrap.
+        # See comments in previous discussions for details.
+        stats["confidence_intervals"] = None

Note: A previous review comment marked this as "Addressed in commit 500586d," but the no-op structure remains.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 48bdfa8 and fb5ac84.

📒 Files selected for processing (2)
  • src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
  • tests/unit/core/output/test_statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit/core/output/test_statistics.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationResult (185-224)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: mypy
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.13)
🔇 Additional comments (1)
src/lightspeed_evaluation/core/output/statistics.py (1)

6-7: LGTM!

The numpy and pandas imports are necessary for the bootstrap confidence interval implementation.

@asamal4
Collaborator

asamal4 commented Oct 8, 2025

@VladimirKadlec

@asamal4 pass_rate here is the mean of an integer array containing ones (for PASS) and zeros (for non-PASS).

There is a difference between evaluation status and score. I am referring to the current pass rate implementation: it is the percentage of evaluations that passed, calculated from the execution status (PASS, FAIL). The status is determined using a threshold; it is not the actual score. So when we compare two models with pass rates of 0.6 and 0.7, that means 6 or 7 out of 10 queries passed. This is not an average score, so it is not misleading and gives us a direct comparison.

My comment is about calculating CI for the actual score distribution (not on the status).
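
A small sketch of the distinction (the scores and threshold are hypothetical; bootstrap_intervals is the function added in this PR):

import pandas as pd

from lightspeed_evaluation.core.output.statistics import bootstrap_intervals

scores = pd.Series([0.62, 0.81, 0.74, 0.55, 0.90, 0.68])
threshold = 0.7  # hypothetical pass threshold

# Status-based view: pass_rate is the fraction of results at or above the threshold.
pass_rate = float((scores >= threshold).mean()) * 100  # 50.0 here

# Score-based view: the CI describes the metric score distribution itself.
ci_low, ci_mid, ci_high = bootstrap_intervals(scores)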

@bsatapat-jpg
Collaborator Author

@asamal4 @VladimirKadlec PTAL.
Thanks

Collaborator

@asamal4 asamal4 left a comment


LGTM

@VladimirKadlec
Contributor

There is a difference between evaluation status and score. I am referring to the current pass rate implementation: it is the percentage of evaluations that passed, calculated from the execution status (PASS, FAIL). The status is determined using a threshold; it is not the actual score. So when we compare two models with pass rates of 0.6 and 0.7, that means 6 or 7 out of 10 queries passed. This is not an average score, so it is not misleading and gives us a direct comparison.

I understand your point; it's true only if the datasets for both runs were identical. Let's move this discussion outside the PR :)

Contributor

@VladimirKadlec VladimirKadlec left a comment


LGTM

Contributor

@tisnik tisnik left a comment


it looks nice, LGTM

@tisnik tisnik merged commit 3a7b865 into lightspeed-core:main Oct 8, 2025
15 checks passed
@bsatapat-jpg bsatapat-jpg deleted the dev branch on October 8, 2025 at 17:22