Statistical approach to test outcomes
| Test Category | Description | Speed ↑ | Confidence ↓ | Reliability ↑ |
|---|---|---|---|---|
| Unit Tests | Test individual components | Fast | Low | High |
| Integration Tests | Test component interactions | Medium | Medium | Medium |
| End-to-End / Acceptance | Test full system functionality | Slow | High | Low |
We see that larger tests give us higher confidence that the whole system works, but they are slower and less reliable. This is also a move from certainty to probability in test outcomes. Generating LLM output makes tests slow and can also cause low reliability.
AI systems, especially LLMs, produce non-deterministic outputs. This requires a probabilistic testing approach:
- Outputs can be evaluated as boolean (pass/fail).
- Tests can be repeated N times.
- Statistical analysis helps determine confidence in results.
This is based on the assumption that we can build software that sorts LLM output into valid and invalid responses. An invalid output triggers a retry to generate a new, valid response, as in the sketch below. NOTE: In online chat systems the LLM can generate N responses up front, allowing a faster reply without waiting for a retry.
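A minimal sketch of that validate-and-retry idea in Python. The `generate` and `is_valid` callables are hypothetical placeholders for the LLM call and the pass/fail validator:

```python
from typing import Callable

def get_valid_response(
    generate: Callable[[str], str],   # the LLM call (placeholder)
    is_valid: Callable[[str], bool],  # boolean valid/invalid validator (placeholder)
    prompt: str,
    max_retries: int = 3,
) -> str:
    """Return the first generated output the validator accepts; retry otherwise."""
    for _ in range(max_retries):
        candidate = generate(prompt)
        if is_valid(candidate):
            return candidate
    raise RuntimeError("no valid response within the retry budget")
```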
We use binomial testing to evaluate LLM reliability. This is similar to coin flips or quality control scenarios.
Scenario: You flip a coin 20 times and get 15 heads. Is this surprising enough to say the coin is biased?
- H₀ (Null): Coin is fair (50% heads).
- H₁ (Alt): Coin is biased (>50% heads).
- Probability of getting ≥ 15 heads by chance: p ≈ 0.02 (2%).
- Since p = 0.02 < 0.05, the result is statistically significant.
- Conclusion: Coin is likely biased.
More on Binomial Distribution →
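The same calculation in Python, as a sketch using scipy's exact one-sided binomial test (assuming scipy is available):

```python
from scipy.stats import binomtest

# One-sided test: 15 heads in 20 flips against the fair-coin null (p = 0.5).
result = binomtest(k=15, n=20, p=0.5, alternative="greater")
print(result.pvalue)  # ≈ 0.021, below 0.05, so the excess of heads is statistically significant
```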
Imagine flaky UI tests with a 5% failure rate (i.e., 95% reliability). You introduce a code change and observe 1 failure in 47 runs (46 passes).
Question: Did the change improve reliability?
- Observed success rate: 97.9% (46/47)
- Expected (baseline) success: 95%
- Confidence interval (90%): ~3–7 failures in 100 runs
Conclusion:
Even with a higher observed success rate, we cannot claim improvement unless the difference is statistically significant.
There's a 36% chance this is a statistical fluke. More runs are needed to reduce uncertainty.
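The same check can be sketched with scipy's one-sided exact binomial test; the exact probability depends on how the tail is computed, but it is far above 0.05 either way:

```python
from scipy.stats import binomtest

# 46 passes out of 47 runs, tested against the 95% baseline success rate.
result = binomtest(k=46, n=47, p=0.95, alternative="greater")
print(result.pvalue)  # on the order of 0.3 — far above 0.05, so no evidence of improvement yet
```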
The resulting test procedure (sketched in code below):
- Generate a response.
- Evaluate the output with a boolean (pass/fail) validator.
- Repeat the test N times.
- Analyze the results statistically.
- Compare the observed success rate to the expected success rate.
- Mark the result GREEN (pass) or RED (fail) based on a significance threshold.
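A minimal sketch of this loop. The `generate_and_validate` callable is a hypothetical placeholder that performs one generation and returns True/False, and the run count, expected success rate, and significance threshold are example values:

```python
from typing import Callable
from scipy.stats import binomtest

def run_llm_test(
    generate_and_validate: Callable[[], bool],  # one generation + boolean validation (placeholder)
    n_runs: int = 50,
    expected_success: float = 0.95,
    alpha: float = 0.10,
) -> str:
    """Repeat the pass/fail check n_runs times and mark the result GREEN or RED."""
    successes = sum(generate_and_validate() for _ in range(n_runs))

    # One-sided test: is the observed success count significantly *below*
    # the expected success rate?
    p_value = binomtest(successes, n_runs, expected_success, alternative="less").pvalue

    return "RED" if p_value < alpha else "GREEN"
```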
Example Goal: Reduce refusal rate for gory images from 5% to 1%.
Question: Did the fix help?
- Say we got 1 failure in 47 runs → 97.9% success.
- This does not clear the statistical significance threshold; the improvement could still be chance.
Question: How many runs do we need to be 90% confident of improvement?
Answer:
You need about 225 runs with 0 failures to be 90% confident that your success rate is ≥ 98.98% (since 0.9898^225 ≈ 0.10, zero failures in 225 runs would be surprising at the 10% level if the true rate were any lower).
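A sketch of the underlying sample-size calculation, assuming we require zero failures and use the one-sided bound `target_success ** n <= 1 - confidence`:

```python
import math

def runs_needed(target_success: float, confidence: float = 0.90) -> int:
    """Smallest n such that n consecutive passes make us `confidence`-sure
    the true success rate is at least `target_success`
    (solves target_success ** n <= 1 - confidence for n)."""
    return math.ceil(math.log(1 - confidence) / math.log(target_success))

print(runs_needed(0.9898))  # ≈ 225 failure-free runs for a 98.98% lower bound
print(runs_needed(0.99))    # ≈ 230 failure-free runs to support the 1%-refusal goal
```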