
Added DocumentRetrievalEvaluator to Azure AI Evaluation to support evaluation of document search #39929

Open · wants to merge 6 commits into base: main
Conversation

@abhahn (Member) commented on Mar 4, 2025

Description

This PR adds a new class, DocumentRetrievalEvaluator, which computes document retrieval metrics over a set of retrieved documents, measured against a set of ground-truth documents.
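
For illustration, a hypothetical invocation of the new evaluator using the __call__ signature shown later in this review; the payload format here (JSON-encoded label lists with document_id keys) is an assumption, not taken from the PR:

    from azure.ai.evaluation import DocumentRetrievalEvaluator

    evaluator = DocumentRetrievalEvaluator()
    metrics = evaluator(
        groundtruth_documents_labels='[{"document_id": "doc1", "label": 4}]',
        retrieved_documents_labels='[{"document_id": "doc1", "label": 4}]',
    )
    print(metrics)  # e.g. NDCG, XDCG, fidelity, top-K relevance, holes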

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@github-actions github-actions bot added the Evaluation label (Issues related to the client library for Azure AI Evaluation) on Mar 4, 2025
@azure-sdk (Collaborator) commented:

API change check

API changes are not detected in this pull request.

@abhahn abhahn marked this pull request as ready for review March 18, 2025 20:41
@Copilot Copilot bot review requested due to automatic review settings March 18, 2025 20:41
@abhahn abhahn requested a review from a team as a code owner March 18, 2025 20:41
@Copilot Copilot AI (Contributor) left a comment:

Pull Request Overview

This PR introduces a new evaluator class, DocumentRetrievalEvaluator, to compute document retrieval metrics such as NDCG, XDCG, fidelity, and top-K relevance for document search queries. Key changes include:

  • Adding an __init__.py file that exposes DocumentRetrievalEvaluator.
  • Implementing DocumentRetrievalEvaluator with methods to compute metrics and perform input validation.
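
For reference, the textbook NDCG@k formulation in plain Python; this sketch shows how the metric is defined, not necessarily the exact implementation in this PR:

    import math

    def ndcg_at_k(retrieved_labels, groundtruth_labels, k):
        # DCG discounts each relevance gain by the log of its rank.
        def dcg(labels):
            return sum(
                (2 ** label - 1) / math.log2(rank + 2)
                for rank, label in enumerate(labels[:k])
            )

        # The ideal DCG ranks the ground-truth labels from best to worst.
        ideal = dcg(sorted(groundtruth_labels, reverse=True))
        return dcg(retrieved_labels) / ideal if ideal > 0 else 0.0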

Reviewed Changes

Copilot reviewed 2 out of 4 changed files in this pull request and generated no comments.

Reviewed files:
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_document_retrieval/__init__.py: exposes the DocumentRetrievalEvaluator class
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_document_retrieval/_document_retrieval.py: implements the evaluator methods and metric computations
Files not reviewed (2)
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_document_retrieval/input.schema: Language not supported
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_document_retrieval/metrics.schema: Language not supported
Comments suppressed due to low confidence (3)

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_document_retrieval/_document_retrieval.py:42

  • The call to super().__init__() is unnecessary since DocumentRetrievalEvaluator does not extend a base class. Consider removing it to avoid confusion.
super().__init__()

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_document_retrieval/_document_retrieval.py:166

  • The output key 'ratioholes' does not match the TypedDict definition which specifies 'holes_ratio'. Consider updating it for consistency.
"ratioholes": 0,

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_document_retrieval/_document_retrieval.py:211

  • The key 'ratioholes' is inconsistent with the TypedDict definition that uses 'holes_ratio'. Update it accordingly.
"ratioholes": ratioholes,

"""
Calculate document retrieval metrics, such as NDCG, XDCG, Fidelity and Top K Relevance.
"""
def __init__(self, groundtruth_min: int = 0, groundtruth_max: int = 4, groundtruth_step: int = 1, threshold: Optional[dict] = None):
Contributor commented:
Can we make these params keyword args instead of positional?
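
A minimal sketch of the suggested signature, reusing the parameter names from the diff; the bare * makes everything after it keyword-only:

    from typing import Optional

    def __init__(
        self,
        *,  # callers must now pass these arguments by keyword
        groundtruth_min: int = 0,
        groundtruth_max: int = 4,
        groundtruth_step: int = 1,
        threshold: Optional[dict] = None,
    ):
        ...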

threshold = {}

elif not isinstance(threshold, dict):
raise TypeError(
Contributor commented:
Please use the Evaluation Exception class for it
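
A rough sketch of that change, assuming the package's internal EvaluationException and its ErrorCategory/ErrorBlame/ErrorTarget enums; the exact constructor arguments may differ:

    from azure.ai.evaluation._exceptions import (
        ErrorBlame,
        ErrorCategory,
        ErrorTarget,
        EvaluationException,
    )

    if not isinstance(threshold, dict):
        # Raise the SDK's evaluation error type instead of a bare TypeError.
        raise EvaluationException(
            message=f"Expected 'threshold' to be a dict, got {type(threshold)}.",
            category=ErrorCategory.INVALID_VALUE,
            blame=ErrorBlame.USER_ERROR,
            target=ErrorTarget.EVALUATE,
        )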

return weighted_sum_by_rating_results / float(weighted_sum_by_rating_index)


def _compute_fidelity_old(
@abhahn (Member, Author) commented:
This is the formulation that depends on a static 0-4 scale for qrels. I'll remove it later, since I've replaced it with a more generalized formula above.

)

for key in default_threshold.keys():
if key not in threshold:
Member commented:
I would do a default_threshold.update(threshold) and store that as self._threshold. Both more efficient and avoids mutating the passed-in dictionary. Mutating parameters can be a source of subtle bugs unless it is very clear that this is the intended outcome.
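
For instance, a sketch reusing the names from the diff; default_threshold is a fresh local dict, so updating it in place never touches the caller's dictionary:

    # Overlay the user-supplied thresholds onto the defaults,
    # leaving the passed-in `threshold` dict unmodified.
    default_threshold.update(threshold or {})
    self._threshold = default_threshold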

return result

def __call__(self, *, groundtruth_documents_labels: str, retrieved_documents_labels: str) -> DocumentRetrievalMetrics:
# input validation
Member commented:
Missing docstring.
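
For example, a hypothetical docstring in the Sphinx style used elsewhere in the azure-ai-evaluation package; the wording is illustrative, not from the PR:

    def __call__(self, *, groundtruth_documents_labels: str, retrieved_documents_labels: str) -> DocumentRetrievalMetrics:
        """Compute document retrieval metrics for a single query.

        :keyword groundtruth_documents_labels: Ground-truth relevance labels, serialized as a string.
        :paramtype groundtruth_documents_labels: str
        :keyword retrieved_documents_labels: Relevance labels of the retrieved documents, serialized as a string.
        :paramtype retrieved_documents_labels: str
        :return: The computed metrics (NDCG, XDCG, fidelity, top-K relevance, holes).
        :rtype: DocumentRetrievalMetrics
        """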

Labels: Evaluation (Issues related to the client library for Azure AI Evaluation)
4 participants