I noticed that the Ground Truth feedback function doesn't always return scores between 0 and 1, which skews the overall average score (see image below; the experiment was run on a labeled question/answer evaluation set).
Looking through the utils for the GroundTruthAgreement class, there appear to be two main causes:
- The AGREEMENT_SYSTEM_PROMPT does not reliably get the LLM to return a 0-10 score at the end of its response. Notably, in the cases where the score came back above 10, the generated answer was not necessarily correct, so the number is likely not even an attempt at a score by the LLM.
- There is no logical catch in the re_0_10_rating() util function (trulens_eval/utils/generated.py) for when the matched score falls outside the expected bounds.
IMO the prompt can be tweaked to more explicitly instruct the LLM to return the score at the end of its response, and if the system still returns a score outside the 0-10 range, it should be treated as a failed match and return -10 so it doesn't bias the results in either direction. Let me know what you think!
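For illustration, here's a minimal sketch of the kind of bounds check I have in mind. The function name, regex, and -10 sentinel here are hypothetical and don't mirror the actual re_0_10_rating() implementation in trulens_eval/utils/generated.py:

```python
import re
import warnings

# Sentinel for "failed to parse a valid score" (per the proposal above).
PARSE_FAILURE = -10


def parse_0_10_rating(response: str) -> int:
    """Extract a 0-10 rating from an LLM response, treating any value
    outside the expected bounds as a failed match.

    Hypothetical sketch, not the library's actual parser.
    """
    matches = re.findall(r"\b(\d+)\b", response)
    if not matches:
        warnings.warn(f"No numeric score found in response: {response!r}")
        return PARSE_FAILURE

    # Take the last number in the response, since the prompt asks the
    # model to put the score at the end of its answer.
    score = int(matches[-1])

    # Logical catch for out-of-bounds values: treat them as a parse
    # failure instead of letting them skew the averaged feedback score.
    if not 0 <= score <= 10:
        warnings.warn(f"Parsed score {score} is outside the 0-10 range.")
        return PARSE_FAILURE

    return score
```

The -10 sentinel just follows the proposal above; it could equally be None or a raised exception, depending on how downstream aggregation is expected to handle parse failures.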