
Statistical approach to test outcomes


Traditional Testing

| Test Category | Description | Speed ↑ | Confidence ↓ | Reliability ↑ |
| --- | --- | --- | --- | --- |
| Unit Tests | Test individual components | Fast | Low | High |
| Integration Tests | Test component interactions | Medium | Medium | Medium |
| End-to-End / Acceptance | Test full system functionality | Slow | High | Low |

Larger tests give us higher confidence that the whole system works, but they are slower and less reliable. This is also a move from certainty toward probability in test outcomes. Generating LLM output makes tests slow and can also make them unreliable.


AI Testing and Non-Determinism

AI systems, especially LLMs, produce non-deterministic outputs. This requires a probabilistic testing approach:

  • Outputs can be evaluated as boolean (pass/fail).
  • Tests can be repeated N times.
  • Statistical analysis helps determine confidence in results.

This is based on the assumption that we can build software that sorts LLM output into valid and invalid. An invalid output triggers a retry to generate a new, valid response. NOTE: In online chat systems, the LLM can produce N responses up front to keep response time low without waiting for a retry.
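
A minimal sketch of this sort-and-retry loop is shown below; `generate()`, `is_valid()`, and the simulated 90% validity rate are hypothetical stand-ins, not part of any real system here:

```python
import random

# Hypothetical stand-ins: generate() simulates an LLM call whose output
# is valid 90% of the time; is_valid() is the domain-specific sorter.
def generate() -> str:
    return "valid answer" if random.random() < 0.9 else "garbled output"

def is_valid(output: str) -> bool:
    return output == "valid answer"

def generate_valid(max_retries: int = 3) -> str:
    """Retry generation until the validator accepts an output."""
    for _ in range(max_retries):
        output = generate()
        if is_valid(output):
            return output
    raise RuntimeError(f"no valid output in {max_retries} attempts")

print(generate_valid())
```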


Binomial Approach

We use binomial testing to evaluate LLM reliability. This is similar to coin flips or quality control scenarios.

Example: Biased Coin Test

Scenario: You flip a coin 20 times and get 15 heads. Is this surprising enough to say the coin is biased?

Step 1: Define Hypotheses

  • H₀ (Null): Coin is fair (50% heads).
  • H₁ (Alt): Coin is biased (>50% heads).

Step 2: Calculate Probability

  • Probability of getting ≥15 heads by chance ≈ 0.02 (2%)

Step 3: Interpret Results

  • Since p ≈ 0.02 < 0.05, the result is statistically significant.
  • Conclusion: The coin is likely biased.
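
The tail probability above can be checked with a few lines of Python (a sketch using only the standard library):

```python
from math import comb

# P(>= 15 heads in 20 flips of a fair coin): sum the binomial tail.
p_value = sum(comb(20, k) for k in range(15, 21)) / 2**20
print(f"p = {p_value:.4f}")  # p = 0.0207, i.e. about 2%
```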

More on Binomial Distribution →


Flaky Test Case Example

Imagine flaky UI tests with a 5% failure rate (i.e., 95% reliability). You introduce a code change, run the suite 47 times, and see 46 passes and 1 failure.

Question: Did the change improve reliability?

  • Observed success rate: 97.9% (46/47)
  • Expected (baseline) success: 95%
  • Expected spread at baseline: about 2–9 failures per 100 runs at 90% confidence (~3–7 within one standard deviation)

Conclusion:
Even with a higher observed success rate, we cannot claim improvement unless the difference is statistically significant.
A suite that still fails 5% of the time would show 1 or fewer failures in 47 runs about 31% of the time (one-sided binomial test), so this could easily be a statistical fluke. More runs are needed to reduce uncertainty.
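
That p-value can be computed directly (a standard-library sketch of the one-sided binomial test):

```python
from math import comb

# How often would a suite with the old 5% failure rate produce
# at most 1 failure in 47 runs purely by chance?
n, max_failures, fail_rate = 47, 1, 0.05
p_value = sum(
    comb(n, k) * fail_rate**k * (1 - fail_rate) ** (n - k)
    for k in range(max_failures + 1)
)
print(f"p = {p_value:.2f}")  # p ≈ 0.31 -- well above 0.05, not significant
```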


Testing Process

  1. Generate a response.
  2. Evaluate the output with a binary (pass/fail) validator.
  3. Run the test N times.
  4. Analyze the results statistically.
  5. Compare to the expected success rate.
  6. Mark the result as GREEN (pass) or RED (fail) based on the threshold (see the sketch after this list).
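
A minimal sketch of this process, where `run_once()` is a hypothetical stand-in for steps 1–2 and the 97% simulated success rate and 45-of-47 threshold are illustrative assumptions:

```python
import random

# Hypothetical stand-in for steps 1-2: generate a response and
# validate it; here the system is simulated to succeed 97% of the time.
def run_once() -> bool:
    return random.random() < 0.97

def statistical_test(n_runs: int, required_successes: int) -> str:
    """Steps 3-6: repeat N times, count passes, compare to threshold."""
    successes = sum(run_once() for _ in range(n_runs))
    return "GREEN" if successes >= required_successes else "RED"

# Example: require at least 45 passes out of 47 runs.
print(statistical_test(n_runs=47, required_successes=45))
```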

Goal-Oriented Testing

Example Goal: Reduce refusal rate for gory images from 5% to 1%.

Question: Did the fix help?

  • Say we got 1 failure in 47 runs → 97.9% success.
  • This does not clear the statistical significance threshold; 47 runs cannot even distinguish the old 5% rate from the 1% goal (see the sketch below).
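
A quick check of both hypotheses (a standard-library sketch; the numbers follow from the binomial distribution):

```python
from math import comb

def prob_at_most(n: int, k: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# 1 failure in 47 runs is unsurprising under the OLD 5% refusal rate...
print(f"P(<=1 failure | 5% rate) = {prob_at_most(47, 1, 0.05):.2f}")  # ~0.31
# ...and at least 1 failure is also likely under the 1% GOAL rate.
print(f"P(>=1 failure | 1% rate) = {1 - prob_at_most(47, 0, 0.01):.2f}")  # ~0.38
```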

Sample Size Calculation

Question: How many runs do we need to be 90% confident of improvement?

Answer:
With 0 failures in n runs, the one-sided 90% lower confidence bound on the success rate is 0.1^(1/n), so 97 failure-free runs only establish a success rate ≥ 97.65%. To be 90% confident that the failure rate is at most 1% (success rate ≥ 99%), you need 0.99^n ≤ 0.1, i.e. about 230 runs with 0 failures.
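
The calculation is a one-liner (a sketch; the 90% confidence level and 1% target come from the goal above):

```python
from math import ceil, log

# Smallest n with (1 - 0.01)^n <= 1 - 0.90: if the true failure rate
# were above 1%, n straight passes would happen less than 10% of the time.
confidence, max_failure_rate = 0.90, 0.01
n = ceil(log(1 - confidence) / log(1 - max_failure_rate))
print(n)  # 230
```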

Example code showing this →

Null Hypothesis →
