
RAG Evaluation Metrics and Recommended Thresholds #2043

Open
@technqvi

Description

Are there any links or references for recommended thresholds for each of the metrics listed below for RAG evaluation?

I applied these metrics to a RAG project. Each metric produces a score in the range 0-1 (0%-100%); a minimal setup sketch follows the lists.

RAGAS metrics:
1. Correctness
2. Semantic Similarity
3. Faithfulness
4. Response Relevancy
5. LLM Context Recall

NVIDIA metrics:
1. Answer Accuracy
2. Context Relevance
3. Groundedness
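
For reference, here is a minimal sketch of how these metrics can be combined in a single ragas run. It assumes ragas >= 0.2 and langchain-openai; the exact metric class names (especially the NVIDIA ones) and the LLM/embeddings wrapper wiring may differ between versions, and the sample record is invented for illustration.

```python
# Minimal sketch, assuming ragas >= 0.2 and langchain-openai; class names and
# wiring can differ by version, and the sample below is invented.
from ragas import evaluate, EvaluationDataset
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.metrics import (
    AnswerAccuracy,        # reported as nv_accuracy
    ContextRelevance,      # reported as nv_context_relevance
    ResponseGroundedness,  # reported as nv_response_groundedness
    AnswerCorrectness,     # reported as answer_correctness
    SemanticSimilarity,    # reported as semantic_similarity
    Faithfulness,          # reported as faithfulness
    ResponseRelevancy,     # reported as answer_relevancy
    LLMContextRecall,      # reported as context_recall
)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Judge LLM and embeddings; swap the model name to compare judges.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# One hypothetical sample: question, retrieved chunks, generated answer, reference answer.
dataset = EvaluationDataset.from_list([
    {
        "user_input": "How many vacation days does a new employee get?",
        "retrieved_contexts": ["New employees accrue 15 vacation days per year."],
        "response": "A new employee gets 15 vacation days per year.",
        "reference": "15 vacation days per year.",
    },
])

result = evaluate(
    dataset=dataset,
    metrics=[
        AnswerAccuracy(), ContextRelevance(), ResponseGroundedness(),
        AnswerCorrectness(), SemanticSimilarity(), Faithfulness(),
        ResponseRelevancy(), LLMContextRecall(),
    ],
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
)
print(result)            # aggregate score per metric, each in [0, 1]
df = result.to_pandas()  # per-sample scores, useful when picking thresholds
```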

For example, I used 4o-mini to synthesize 100 questions about company policy and evaluated them using 4o-mini, anthropic.claude-3-5-haiku, and gemini-2.0. The summary scores are shown in the table below.

| metric | mean_score |
|---|---|
| nv_accuracy | 0.74 |
| nv_context_relevance | 0.96 |
| nv_response_groundedness | 0.98 |
| answer_correctness | 0.65 |
| semantic_similarity | 0.95 |
| faithfulness | 0.93 |
| answer_relevancy | 0.97 |
| context_recall | 0.95 |
| llm_context_precision_with_reference | 0.95 |
| llm_context_precision_without_reference | 0.98 |

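The mean_score values above are just per-metric averages over the samples. Here is a minimal sketch of that aggregation plus an illustrative threshold check, assuming `result.to_pandas()` returns one row per sample and one column per metric; the 0.7 cut-off is purely illustrative, not a recommendation.

```python
import pandas as pd

# Stand-in for result.to_pandas(); real columns depend on the metrics you ran.
scores = pd.DataFrame({
    "nv_accuracy": [0.80, 0.60, 0.90],
    "answer_correctness": [0.70, 0.50, 0.80],
    "faithfulness": [1.00, 0.90, 0.95],
})

# Per-metric mean, i.e. the mean_score column in the table above.
summary = scores.mean().rename("mean_score")
print(summary)

# Flag metrics whose mean falls under an illustrative cut-off (not a recommendation).
threshold = 0.7
print(summary[summary < threshold])
```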
 
