I noticed that the Ground Truth feedback function doesn't always return scores between 0 and 1, which skews the overall average score (see image below; the experiment was run on a labeled question/answer evaluation set).
Looking through the utils for the GroundTruthAgreement class, there appear to be two main causes:
- The AGREEMENT_SYSTEM_PROMPT does not reliably get the LLM to return a 0-10 score at the end of its response. Notably, in the cases where the score came back above 10, the generated answer was not necessarily correct, so the number is likely not even an attempt at a score by the LLM.
- There is no logical catch in the re_0_10_rating() util function (trulens_eval/utils/generated.py) for when the matched score falls outside the expected bounds.
IMO the prompt can be tweaked to more explicitly instruct the LLM to return the score at the end of its response, and if the system still returns a score outside the 0-10 range, it should be treated as a failed match and return -10 so it doesn't bias the results in either direction. Let me know what you think!
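For illustration, here's a minimal sketch of the kind of bounds check I have in mind. The function name, regex, and -10 sentinel here are hypothetical and don't mirror the actual re_0_10_rating() implementation in trulens_eval/utils/generated.py:

```python
import re
import warnings

# Sentinel for "failed to parse a valid score" (per the proposal above).
PARSE_FAILURE = -10


def parse_0_10_rating(response: str) -> int:
    """Extract a 0-10 rating from an LLM response, treating any value
    outside the expected bounds as a failed match.

    Hypothetical sketch, not the library's actual parser.
    """
    matches = re.findall(r"\b(\d+)\b", response)
    if not matches:
        warnings.warn(f"No numeric score found in response: {response!r}")
        return PARSE_FAILURE

    # Take the last number in the response, since the prompt asks the
    # model to put the score at the end of its answer.
    score = int(matches[-1])

    # Logical catch for out-of-bounds values: treat them as a parse
    # failure instead of letting them skew the averaged feedback score.
    if not 0 <= score <= 10:
        warnings.warn(f"Parsed score {score} is outside the 0-10 range.")
        return PARSE_FAILURE

    return score
```

The -10 sentinel just follows the proposal above; it could equally be None or a raised exception, depending on how downstream aggregation is expected to handle parse failures.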