Ground truth score does not always range from 0 to 1 #658

Closed
@andrec279

Description

I noticed that the Ground Truth feedback function doesn't always return scores between 0 and 1, which throws off the overall average score (see the attached screenshot; the experiment was run on a labeled question/answer evaluation set).

Looking through the utils for the GroundTruthAgreement class, it looks like there are two main causes:

  1. The AGREEMENT_SYSTEM_PROMPT does not reliably get the LLM to return the 0-10 score at the end of its response. Importantly, I noticed that in the cases where the score is above 10, the generated answer is not necessarily correct, so the number is likely not an attempted score from the LLM.
  2. There is no logical catch in the re_0_10_rating() util function (trulens_eval/utils/generated.py) for when the matched score falls outside the expected bounds (see the sketch after this list).

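For illustration, here is a minimal stand-in for the parsing step, not the library's actual re_0_10_rating implementation: a last-number extraction with no bounds check accepts values above 10, and (assuming the 0-10 rating is normalized to 0-1 downstream by dividing by 10) that surfaces as a score above 1.

```python
import re

def naive_0_10_rating(response: str) -> int:
    """Simplified stand-in for the parsing step (not the library's code):
    grab the last integer in the LLM response, with no bounds check."""
    matches = re.findall(r"\d+", response)
    if not matches:
        return -10  # nothing number-like found at all
    return int(matches[-1])

# An off-script LLM response where the trailing number is not a rating:
response = "The answer covers 3 of the 12 required points, so roughly 25 percent."
print(naive_0_10_rating(response))       # 25 -- well outside 0-10
print(naive_0_10_rating(response) / 10)  # 2.5 -- reported as a "0-1" score
```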
IMO the prompt can be tweaked to instruct the LLM more explicitly to return the score at the end of its response, and if the system still returns a score outside the 0-10 range, it should be treated as a failure to match and return -10, to avoid biasing the results in either direction (sketched below). Let me know what you think!
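
The second half of that proposal could look roughly like this. It is a minimal bounds-checked variant, not the library's re_0_10_rating itself; the function name and the simplified regex parse are placeholders, and it assumes callers treat -10 as the existing "failed to match" sentinel rather than averaging it in.

```python
import re

def checked_0_10_rating(response: str) -> int:
    """Hypothetical bounds-checked parse: any extracted number outside
    0-10 is treated the same as a failed match and returns -10."""
    matches = re.findall(r"\d+", response)
    if not matches:
        return -10  # no number found: failed match
    rating = int(matches[-1])
    if not 0 <= rating <= 10:
        return -10  # out-of-range number is likely not a rating at all
    return rating

print(checked_0_10_rating("The answer is fully correct. Rating: 9"))   # 9
print(checked_0_10_rating("...covers 3 of 12 points, so roughly 25"))  # -10
```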
