AI Application Journey


AI applications using LLMs have to deal with non-deterministic behavior.

First we write a Proof-of-Concept to validate that AI can be useful. Then we usually see some inconsistent behavior from the LLM. We capture that behavior in an automated test. Because the LLM output is different each time we run the test, we expose difficult cases by running the test multiple times, and we observe that the test passes a certain percentage of runs.
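A minimal sketch of such a repeated-run measurement, where `call_llm` and `is_acceptable` are hypothetical stand-ins for the application and its success check:

```python
def passes_once() -> bool:
    # One run of the non-deterministic test; call_llm and is_acceptable
    # are hypothetical placeholders for the application under test.
    answer = call_llm("Summarize the ticket in one sentence.")
    return is_acceptable(answer)

def estimated_pass_rate(runs: int = 100) -> float:
    # Run the same test many times and report the fraction that passed.
    successes = sum(passes_once() for _ in range(runs))
    return successes / runs
```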

Over time, we need to be able to iterate on our application and know, reliably, whether we are improving it. In other words, we need to know that our application remains aligned with our goals. Let's look at an example of this in action.

Example

Now we have, for example, a test that passes 70% of the time. Since we have a test, our code can detect a failure and retry in production, as sketched below. At the same time, we want to iterate on the prompt and try to increase the success rate beyond 70%.
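A minimal retry sketch, reusing the same hypothetical `call_llm` and `is_acceptable` helpers as the success check:

```python
def call_llm_with_retry(prompt: str, max_attempts: int = 3) -> str:
    # Retry until the output passes the same check the automated test uses.
    for _ in range(max_attempts):
        answer = call_llm(prompt)   # hypothetical LLM call
        if is_acceptable(answer):   # hypothetical success check
            return answer
    raise RuntimeError(f"LLM output failed validation after {max_attempts} attempts")
```

With a 70% per-attempt success rate and independent attempts, three attempts succeed with probability 1 − 0.3³ ≈ 97.3%, which is why retrying against the test's own check is worthwhile in production.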

After making some changes to the prompt, the success rate is 72%. How do we know whether this improvement is statistically significant? Is the prompt actually better, or is this random chance?
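One standard way to decide, assuming each run is independent, is a pooled two-proportion z-test (a standard statistical tool, not a formula quoted from this page):

$$
z = \frac{\hat{p}_\text{new} - \hat{p}_\text{old}}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_\text{old}} + \frac{1}{n_\text{new}}\right)}},
\qquad
\hat{p} = \frac{x_\text{old} + x_\text{new}}{n_\text{old} + n_\text{new}}
$$

Here $\hat{p}_\text{old}$ and $\hat{p}_\text{new}$ are the observed pass rates, $x$ and $n$ are the pass and run counts for each variant, and $\hat{p}$ is the pooled pass rate. At a 90% confidence level (two-sided), the difference counts as significant only when $|z| > 1.645$.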

Statistical Mathematics

CAT Hypothesis

  1. We measure reliability as the percentage of successes over a large enough number of runs
  2. Each experiment runs in a "clean room"
  3. Experiment outcomes: number of failures and total number of runs
  4. The new success rate is compared using a z-score at a 90% confidence level. See Statistical approach to test outcomes, and the worked sketch after this list
  5. Possible conclusions:
  • not statistically significantly different
  • reliability improved
  • reliability worse
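To make this concrete, here is a sketch of the z-test applied to the 70% vs. 72% example. The sample size of 200 runs per variant is an assumption for illustration, not a number from this page:

```python
import math

def two_proportion_z(successes_old: int, runs_old: int,
                     successes_new: int, runs_new: int) -> float:
    # Pooled two-proportion z-statistic comparing old and new pass rates.
    p_old = successes_old / runs_old
    p_new = successes_new / runs_new
    pooled = (successes_old + successes_new) / (runs_old + runs_new)
    se = math.sqrt(pooled * (1 - pooled) * (1 / runs_old + 1 / runs_new))
    return (p_new - p_old) / se

# Assumed example: 140/200 passes (70%) before, 144/200 (72%) after.
z = two_proportion_z(140, 200, 144, 200)
CRITICAL = 1.645  # two-sided critical value at 90% confidence

if abs(z) < CRITICAL:
    verdict = "not statistically significantly different"
elif z > 0:
    verdict = "reliability improved"
else:
    verdict = "reliability worse"

print(f"z = {z:.2f}: {verdict}")
# z = 0.44: not statistically significantly different
```

At this assumed sample size, a two-point jump is well inside the noise; detecting a real 70% → 72% improvement at this confidence level would take on the order of several thousand runs per variant.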