AI Application Journey


AI applications using LLMs have to deal with non-deterministic behavior.

First we write a Proof-of-Concept to validate that AI can be useful. Then we usually see some inconsistent behavior from the LLM. We capture that behavior in an automated test. Because the LLM output is different each time we run the test, we expose difficult cases by running the test multiple times, and we observe that the test passes a certain percentage of runs.
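A minimal sketch of such a repeated-run measurement, where `call_llm` and `is_acceptable` are hypothetical stand-ins for the application and its success check:

```python
def passes_once() -> bool:
    # One run of the non-deterministic test; call_llm and is_acceptable
    # are hypothetical placeholders for the application under test.
    answer = call_llm("Summarize the ticket in one sentence.")
    return is_acceptable(answer)

def estimated_pass_rate(runs: int = 100) -> float:
    # Run the same test many times and report the fraction that passed.
    successes = sum(passes_once() for _ in range(runs))
    return successes / runs
```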

Over time, we need to be able to iterate on our application and know, reliably, whether we are improving it. In other words, we need to know that our application remains aligned with our goals. Let's look at an example of this in action.

Example

Now we have, for example, a test that passes 70% of the time. Since we have a test, our code can detect a failure and retry in production, as sketched below. At the same time, we want to iterate on the prompt and try to increase the success rate beyond 70%.
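A minimal retry sketch, reusing the same hypothetical `call_llm` and `is_acceptable` helpers as the success check:

```python
def call_llm_with_retry(prompt: str, max_attempts: int = 3) -> str:
    # Retry until the output passes the same check the automated test uses.
    for _ in range(max_attempts):
        answer = call_llm(prompt)   # hypothetical LLM call
        if is_acceptable(answer):   # hypothetical success check
            return answer
    raise RuntimeError(f"LLM output failed validation after {max_attempts} attempts")
```

With a 70% per-attempt success rate and independent attempts, three attempts succeed with probability 1 − 0.3³ ≈ 97.3%, which is why retrying against the test's own check is worthwhile in production.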

After making some changes to the prompt, the success rate is 72%. How do we know whether this improvement is statistically significant? Is the prompt actually better, or is this random chance?
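One standard way to decide, assuming each run is independent, is a pooled two-proportion z-test (a standard statistical tool, not a formula quoted from this page):

$$
z = \frac{\hat{p}_\text{new} - \hat{p}_\text{old}}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_\text{old}} + \frac{1}{n_\text{new}}\right)}},
\qquad
\hat{p} = \frac{x_\text{old} + x_\text{new}}{n_\text{old} + n_\text{new}}
$$

Here $\hat{p}_\text{old}$ and $\hat{p}_\text{new}$ are the observed pass rates, $x$ and $n$ are the pass and run counts for each variant, and $\hat{p}$ is the pooled pass rate. At a 90% confidence level (two-sided), the difference counts as significant only when $|z| > 1.645$.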

Statistical Mathematics

CAT Hypothesis

  1. We measure reliability as the percentage of successes over a large enough number of runs
  2. Each experiment runs in a "clean room"
  3. Experiment outcomes: number of failures and total number of runs
  4. The new success rate is compared using a z-score at a 90% confidence level. See Statistical approach to test outcomes, and the worked sketch after this list
  5. Possible conclusions:
  • not statistically significantly different
  • reliability improved
  • reliability worse
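To make this concrete, here is a sketch of the z-test applied to the 70% vs. 72% example. The sample size of 200 runs per variant is an assumption for illustration, not a number from this page:

```python
import math

def two_proportion_z(successes_old: int, runs_old: int,
                     successes_new: int, runs_new: int) -> float:
    # Pooled two-proportion z-statistic comparing old and new pass rates.
    p_old = successes_old / runs_old
    p_new = successes_new / runs_new
    pooled = (successes_old + successes_new) / (runs_old + runs_new)
    se = math.sqrt(pooled * (1 - pooled) * (1 / runs_old + 1 / runs_new))
    return (p_new - p_old) / se

# Assumed example: 140/200 passes (70%) before, 144/200 (72%) after.
z = two_proportion_z(140, 200, 144, 200)
CRITICAL = 1.645  # two-sided critical value at 90% confidence

if abs(z) < CRITICAL:
    verdict = "not statistically significantly different"
elif z > 0:
    verdict = "reliability improved"
else:
    verdict = "reliability worse"

print(f"z = {z:.2f}: {verdict}")
# z = 0.44: not statistically significantly different
```

At this assumed sample size, a two-point jump is well inside the noise; detecting a real 70% → 72% improvement at this confidence level would take on the order of several thousand runs per variant.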