Description
Are there any references for recommended thresholds for each of the RAG evaluation metrics listed below?
I applied these metrics to a RAG project; each metric produces a score in the range 0-1 (0%-100%):
RAGAS metrics:
1. Correctness
2. Semantic Similarity
3. Faithfulness
4. Response Relevancy
5. LLM Context Recall
Nvidia metrics:
1. Answer Accuracy
2. Context Relevance
3. Groundedness
For example, I used 4o-mini to synthesize 100 questions about company policy and evaluated them with 4o-mini, anthropic.claude-3-5-haiku, and gemini-2.0. The summary scores are shown in the table below.
| metric | mean_score |
|---|---|
| nv_accuracy | 0.74 |
| nv_context_relevance | 0.96 |
| nv_response_groundedness | 0.98 |
| answer_correctness | 0.65 |
| semantic_similarity | 0.95 |
| faithfulness | 0.93 |
| answer_relevancy | 0.97 |
| context_recall | 0.95 |
| llm_context_precision_with_reference | 0.95 |
| llm_context_precision_without_reference | 0.98 |
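
For context, this is a minimal sketch of how scores like these can be produced, assuming ragas >= 0.2 (which ships the Nvidia-style metrics) and an OPENAI_API_KEY in the environment so the default judge LLM and embeddings can be instantiated; the sample record is hypothetical and stands in for the 100 synthesized policy questions:

```python
# Minimal sketch, assuming ragas >= 0.2 and an OPENAI_API_KEY so the
# default judge LLM/embeddings can be created. The sample is hypothetical.
from ragas import evaluate, EvaluationDataset
from ragas.metrics import (
    AnswerCorrectness,
    SemanticSimilarity,
    Faithfulness,
    ResponseRelevancy,
    LLMContextRecall,
    AnswerAccuracy,          # Nvidia-style metrics
    ContextRelevance,
    ResponseGroundedness,
)

# In practice this list would hold the synthesized policy questions,
# each with the RAG pipeline's retrieved contexts, answer, and reference.
samples = [
    {
        "user_input": "How many vacation days do new employees get?",
        "retrieved_contexts": ["New employees accrue 15 vacation days per year."],
        "response": "New employees get 15 vacation days per year.",
        "reference": "New employees receive 15 vacation days annually.",
    },
]

dataset = EvaluationDataset.from_list(samples)

result = evaluate(
    dataset=dataset,
    metrics=[
        AnswerCorrectness(),
        SemanticSimilarity(),
        Faithfulness(),
        ResponseRelevancy(),
        LLMContextRecall(),
        AnswerAccuracy(),
        ContextRelevance(),
        ResponseGroundedness(),
    ],
)
print(result)  # per-metric mean scores, e.g. {'faithfulness': 0.93, ...}
```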