Added DocumentRetrievalEvaluator to Azure AI Evaluation to support evaluation of document search #39929
base: main
Conversation
API change check: API changes are not detected in this pull request.
…r input and output schemas
Pull Request Overview
This PR introduces a new evaluator class, DocumentRetrievalEvaluator, to compute document retrieval metrics such as NDCG, XDCG, fidelity, and top-K relevance for document search queries. Key changes include:
- Adding an __init__.py file that exposes DocumentRetrievalEvaluator.
- Implementing DocumentRetrievalEvaluator with methods to compute metrics and perform input validation (a brief usage sketch follows below).
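For orientation, here is a minimal usage sketch pieced together from the constructor and __call__ signatures visible in this diff. The top-level import path and the serialization of the two label arguments are assumptions; the real input format is defined by the (unreviewed) input.schema file.

```python
# Hypothetical usage of the new evaluator; only the class name and signatures
# come from this diff, everything else is an assumption.
from azure.ai.evaluation import DocumentRetrievalEvaluator  # import path assumed

evaluator = DocumentRetrievalEvaluator(groundtruth_min=0, groundtruth_max=4, groundtruth_step=1)

metrics = evaluator(
    groundtruth_documents_labels="<ground-truth labels serialized per input.schema>",
    retrieved_documents_labels="<retrieved labels serialized per input.schema>",
)
print(metrics)  # expected to include ndcg, xdcg, fidelity, top-k relevance, holes_ratio
```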
Reviewed Changes
Copilot reviewed 2 out of 4 changed files in this pull request and generated no comments.
File | Description
---|---
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_document_retrieval/__init__.py | Exposes the DocumentRetrievalEvaluator class
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_document_retrieval/_document_retrieval.py | Implements the evaluator methods and metric computations
Files not reviewed (2)
- sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_document_retrieval/input.schema: Language not supported
- sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_document_retrieval/metrics.schema: Language not supported
Comments suppressed due to low confidence (3)
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_document_retrieval/_document_retrieval.py:42
- The call to super().__init__() is unnecessary since DocumentRetrievalEvaluator does not extend a base class. Consider removing it to avoid confusion.
super().__init__()
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_document_retrieval/_document_retrieval.py:166
- The output key 'ratioholes' does not match the TypedDict definition which specifies 'holes_ratio'. Consider updating it for consistency.
"ratioholes": 0,
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_document_retrieval/_document_retrieval.py:211
- The key 'ratioholes' is inconsistent with the TypedDict definition that uses 'holes_ratio'. Update it accordingly.
"ratioholes": ratioholes,
""" | ||
Calculate document retrieval metrics, such as NDCG, XDCG, Fidelity and Top K Relevance. | ||
""" | ||
def __init__(self, groundtruth_min: int = 0, groundtruth_max: int = 4, groundtruth_step: int = 1, threshold: Optional[dict] = None): |
Can we make these params keyword args instead of positional?
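A sketch of what that would look like; a bare * after self forces all configuration parameters to be passed by keyword:

```python
from typing import Optional

class DocumentRetrievalEvaluator:
    def __init__(
        self,
        *,  # everything after this marker is keyword-only
        groundtruth_min: int = 0,
        groundtruth_max: int = 4,
        groundtruth_step: int = 1,
        threshold: Optional[dict] = None,
    ) -> None:
        ...
```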
threshold = {}
elif not isinstance(threshold, dict):
    raise TypeError(
Please use the EvaluationException class for this.
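A hedged sketch of the suggested change, assuming the SDK's internal EvaluationException and its error enums; the exact import path, constructor parameters, and enum members should be verified against azure.ai.evaluation._exceptions before use.

```python
from azure.ai.evaluation._exceptions import (  # assumed internal module
    ErrorBlame,
    ErrorCategory,
    ErrorTarget,
    EvaluationException,
)

if threshold is not None and not isinstance(threshold, dict):
    raise EvaluationException(
        message="threshold must be a dictionary mapping metric names to threshold values.",
        category=ErrorCategory.INVALID_VALUE,
        blame=ErrorBlame.USER_ERROR,
        target=ErrorTarget.EVALUATE,  # a more specific target may exist; assumption
    )
```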
return weighted_sum_by_rating_results / float(weighted_sum_by_rating_index)
def _compute_fidelity_old( |
This is the formulation depending on a static 0-4 scale for qrels. I'll remove it later, since I've replaced it with a more generalized formula above.
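To make the distinction concrete, here is a sketch of the kind of generalized fidelity the comment refers to: the weighted relevance captured by the retrieved documents divided by the weighted relevance available in the full ground-truth set. The weighting function shown is an illustrative assumption, not necessarily the one used in this PR.

```python
def fidelity_sketch(retrieved_labels, groundtruth_labels, weight=lambda r: 2 ** r - 1):
    # Illustrative only: any monotone weighting of the relevance labels works,
    # which is what removes the dependence on a fixed 0-4 qrel scale.
    weighted_sum_by_rating_results = sum(weight(r) for r in retrieved_labels)
    weighted_sum_by_rating_index = sum(weight(r) for r in groundtruth_labels)
    if weighted_sum_by_rating_index == 0:
        return 0.0
    return weighted_sum_by_rating_results / float(weighted_sum_by_rating_index)
```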
)
for key in default_threshold.keys():
    if key not in threshold:
I would do a default_threshold.update(threshold) and store that as self._threshold. Both more efficient and avoids mutating the passed-in dictionary. Mutating parameters can be a source of subtle bugs unless it is very clear that this is the intended outcome.
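A minimal sketch of the suggested pattern inside __init__ (the default keys and values shown are placeholders, not the ones in the PR):

```python
# Locally built dict of default per-metric thresholds; placeholder values.
default_threshold = {"ndcg": 0.5, "xdcg": 50.0, "fidelity": 0.5}

# Overlay any caller-supplied values onto the local defaults; the caller's
# `threshold` dictionary is never mutated.
default_threshold.update(threshold or {})
self._threshold = default_threshold
```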
return result

def __call__(self, *, groundtruth_documents_labels: str, retrieved_documents_labels: str) -> DocumentRetrievalMetrics:
    # input validation
Missing docstring.
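For reference, a docstring sketch in the reST style used by other evaluators in this SDK; the parameter descriptions are inferred from the parameter names and may need adjusting.

```python
def __call__(self, *, groundtruth_documents_labels: str, retrieved_documents_labels: str) -> DocumentRetrievalMetrics:
    """Compute document retrieval metrics (NDCG, XDCG, fidelity, top-K relevance, holes ratio) for a query.

    :keyword groundtruth_documents_labels: Relevance labels for the ground-truth documents,
        serialized as described by input.schema.
    :paramtype groundtruth_documents_labels: str
    :keyword retrieved_documents_labels: Relevance labels for the retrieved documents,
        serialized as described by input.schema.
    :paramtype retrieved_documents_labels: str
    :return: The computed document retrieval metrics.
    :rtype: DocumentRetrievalMetrics
    """
```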
Description
This PR includes a new class, DocumentRetrievalEvaluator, to produce document retrieval metrics over a set of input documents, measured against a set of input ground-truth documents.

All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines