
Statistical approach to test outcomes


Traditional Testing

| Test Category | Description | Speed ↑ | Confidence ↓ | Reliability ↑ |
| --- | --- | --- | --- | --- |
| Unit Tests | Test individual components | Fast | Low | High |
| Integration Tests | Test component interactions | Medium | Medium | Medium |
| End-to-End / Acceptance | Test full system functionality | Slow | High | Low |

Larger tests give us higher confidence that the whole system works, but they are slower and less reliable. This is also a move from certainty toward probability in test outcomes. Generating LLM output makes tests slow and can also make them unreliable.


AI Testing and Non-Determinism

AI systems, especially LLMs, produce non-deterministic outputs. This requires a probabilistic testing approach:

  • Outputs can be evaluated as boolean (pass/fail).
  • Tests can be repeated N times.
  • Statistical analysis helps determine confidence in results.

This is based on the assumption that we can build software that sorts LLM output into valid and invalid. An invalid output triggers a retry to generate a new, valid response. NOTE: In online chat systems, the LLM can produce N responses up front to keep response time low without waiting for a retry.
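
A minimal sketch of this sort-and-retry loop is shown below; `generate()`, `is_valid()`, and the simulated 90% validity rate are hypothetical stand-ins, not part of any real system here:

```python
import random

# Hypothetical stand-ins: generate() simulates an LLM call whose output
# is valid 90% of the time; is_valid() is the domain-specific sorter.
def generate() -> str:
    return "valid answer" if random.random() < 0.9 else "garbled output"

def is_valid(output: str) -> bool:
    return output == "valid answer"

def generate_valid(max_retries: int = 3) -> str:
    """Retry generation until the validator accepts an output."""
    for _ in range(max_retries):
        output = generate()
        if is_valid(output):
            return output
    raise RuntimeError(f"no valid output in {max_retries} attempts")

print(generate_valid())
```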


Binomial Approach

We use binomial testing to evaluate LLM reliability. This is similar to coin flips or quality control scenarios.

Example: Biased Coin Test

Scenario: You flip a coin 20 times and get 15 heads. Is this surprising enough to say the coin is biased?

Step 1: Define Hypotheses

  • H₀ (Null): Coin is fair (50% heads).
  • H₁ (Alt): Coin is biased (>50% heads).

Step 2: Calculate Probability

  • Probability of getting ≥15 heads by chance ≈ 0.02 (2%)

Step 3: Interpret Results

  • Since p ≈ 0.02 < 0.05, the result is statistically significant.
  • Conclusion: The coin is likely biased.
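
The tail probability above can be checked with a few lines of Python (a sketch using only the standard library):

```python
from math import comb

# P(>= 15 heads in 20 flips of a fair coin): sum the binomial tail.
p_value = sum(comb(20, k) for k in range(15, 21)) / 2**20
print(f"p = {p_value:.4f}")  # p = 0.0207, i.e. about 2%
```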

More on Binomial Distribution →


Flaky Test Case Example

Imagine flaky UI tests with a 5% failure rate (i.e., 95% reliability). You introduce a code change, run the suite 47 times, and see 46 passes and 1 failure.

Question: Did the change improve reliability?

  • Observed success rate: 97.9% (46/47)
  • Expected (baseline) success: 95%
  • Expected spread at baseline: about 2–9 failures per 100 runs at 90% confidence (~3–7 within one standard deviation)

Conclusion:
Even with a higher observed success rate, we cannot claim improvement unless the difference is statistically significant.
A suite that still fails 5% of the time would show 1 or fewer failures in 47 runs about 31% of the time (one-sided binomial test), so this could easily be a statistical fluke. More runs are needed to reduce uncertainty.
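
That p-value can be computed directly (a standard-library sketch of the one-sided binomial test):

```python
from math import comb

# How often would a suite with the old 5% failure rate produce
# at most 1 failure in 47 runs purely by chance?
n, max_failures, fail_rate = 47, 1, 0.05
p_value = sum(
    comb(n, k) * fail_rate**k * (1 - fail_rate) ** (n - k)
    for k in range(max_failures + 1)
)
print(f"p = {p_value:.2f}")  # p ≈ 0.31 -- well above 0.05, not significant
```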


Testing Process

  1. Generate a response.
  2. Evaluate the output with a binary (pass/fail) validator.
  3. Run the test N times.
  4. Analyze the results statistically.
  5. Compare to the expected success rate.
  6. Mark the result as GREEN (pass) or RED (fail) based on the threshold (see the sketch after this list).
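
A minimal sketch of this process, where `run_once()` is a hypothetical stand-in for steps 1–2 and the 97% simulated success rate and 45-of-47 threshold are illustrative assumptions:

```python
import random

# Hypothetical stand-in for steps 1-2: generate a response and
# validate it; here the system is simulated to succeed 97% of the time.
def run_once() -> bool:
    return random.random() < 0.97

def statistical_test(n_runs: int, required_successes: int) -> str:
    """Steps 3-6: repeat N times, count passes, compare to threshold."""
    successes = sum(run_once() for _ in range(n_runs))
    return "GREEN" if successes >= required_successes else "RED"

# Example: require at least 45 passes out of 47 runs.
print(statistical_test(n_runs=47, required_successes=45))
```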

Goal-Oriented Testing

Example Goal: Reduce refusal rate for gory images from 5% to 1%.

Question: Did the fix help?

  • Say we got 1 failure in 47 runs → 97.9% success.
  • This does not clear the statistical significance threshold; 47 runs cannot even distinguish the old 5% rate from the 1% goal (see the sketch below).
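
A quick check of both hypotheses (a standard-library sketch; the numbers follow from the binomial distribution):

```python
from math import comb

def prob_at_most(n: int, k: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# 1 failure in 47 runs is unsurprising under the OLD 5% refusal rate...
print(f"P(<=1 failure | 5% rate) = {prob_at_most(47, 1, 0.05):.2f}")  # ~0.31
# ...and at least 1 failure is also likely under the 1% GOAL rate.
print(f"P(>=1 failure | 1% rate) = {1 - prob_at_most(47, 0, 0.01):.2f}")  # ~0.38
```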

Sample Size Calculation

Question: How many runs do we need to be 90% confident of improvement?

Answer:
With 0 failures in n runs, the one-sided 90% lower confidence bound on the success rate is 0.1^(1/n), so 97 failure-free runs only establish a success rate ≥ 97.65%. To be 90% confident that the failure rate is at most 1% (success rate ≥ 99%), you need 0.99^n ≤ 0.1, i.e. about 230 runs with 0 failures.
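
The calculation is a one-liner (a sketch; the 90% confidence level and 1% target come from the goal above):

```python
from math import ceil, log

# Smallest n with (1 - 0.01)^n <= 1 - 0.90: if the true failure rate
# were above 1%, n straight passes would happen less than 10% of the time.
confidence, max_failure_rate = 0.90, 0.01
n = ceil(log(1 - confidence) / log(1 - max_failure_rate))
print(n)  # 230
```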

Example code showing this →

Null Hypothesis →
