Description
Are there any references for recommended thresholds for each of the RAG evaluation metrics listed below?
I applied these metrics to a RAG project; each metric produces a score in the range 0-1 (0%-100%):
RAGAS metrics:
1. Correctness
2. Semantic Similarity
3. Faithfulness
4. Response Relevancy
5. LLM Context Recall
Nvidia metrics:
1. Answer Accuracy
2. Context Relevance
3. Groundedness
For example, I used 4o-mini to synthesize 100 questions about company policy and evaluated them with 4o-mini, anthropic.claude-3-5-haiku, and gemini-2.0. The summary scores are shown in the table below.
| metric | mean_score |
|---|---|
| nv_accuracy | 0.74 |
| nv_context_relevance | 0.96 |
| nv_response_groundedness | 0.98 |
| answer_correctness | 0.65 |
| semantic_similarity | 0.95 |
| faithfulness | 0.93 |
| answer_relevancy | 0.97 |
| context_recall | 0.95 |
| llm_context_precision_with_reference | 0.95 |
| llm_context_precision_without_reference | 0.98 |
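
For context, this is a minimal sketch of how scores like these can be produced, assuming ragas >= 0.2 (which ships the Nvidia-style metrics) and an OPENAI_API_KEY in the environment so the default judge LLM and embeddings can be instantiated; the sample record is hypothetical and stands in for the 100 synthesized policy questions:

```python
# Minimal sketch, assuming ragas >= 0.2 and an OPENAI_API_KEY so the
# default judge LLM/embeddings can be created. The sample is hypothetical.
from ragas import evaluate, EvaluationDataset
from ragas.metrics import (
    AnswerCorrectness,
    SemanticSimilarity,
    Faithfulness,
    ResponseRelevancy,
    LLMContextRecall,
    AnswerAccuracy,          # Nvidia-style metrics
    ContextRelevance,
    ResponseGroundedness,
)

# In practice this list would hold the synthesized policy questions,
# each with the RAG pipeline's retrieved contexts, answer, and reference.
samples = [
    {
        "user_input": "How many vacation days do new employees get?",
        "retrieved_contexts": ["New employees accrue 15 vacation days per year."],
        "response": "New employees get 15 vacation days per year.",
        "reference": "New employees receive 15 vacation days annually.",
    },
]

dataset = EvaluationDataset.from_list(samples)

result = evaluate(
    dataset=dataset,
    metrics=[
        AnswerCorrectness(),
        SemanticSimilarity(),
        Faithfulness(),
        ResponseRelevancy(),
        LLMContextRecall(),
        AnswerAccuracy(),
        ContextRelevance(),
        ResponseGroundedness(),
    ],
)
print(result)  # per-metric mean scores, e.g. {'faithfulness': 0.93, ...}
```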