add async version of validate method #74

Merged 2 commits on Apr 14, 2025

7 changes: 6 additions & 1 deletion CHANGELOG.md
@@ -7,6 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [1.0.10] - 2025-04-15

- Add async support to `Validator` API.

## [1.0.9] - 2025-04-10

- Refactor threshold validation in the `Validator` class to only check user-provided metrics.
@@ -51,7 +55,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- Initial release of the `cleanlab-codex` client library.

[Unreleased]: https://github.com/cleanlab/cleanlab-codex/compare/v1.0.9...HEAD
[Unreleased]: https://github.com/cleanlab/cleanlab-codex/compare/v1.0.10...HEAD
[1.0.10]: https://github.com/cleanlab/cleanlab-codex/compare/v1.0.9...v1.0.10
[1.0.9]: https://github.com/cleanlab/cleanlab-codex/compare/v1.0.8...v1.0.9
[1.0.8]: https://github.com/cleanlab/cleanlab-codex/compare/v1.0.7...v1.0.8
[1.0.7]: https://github.com/cleanlab/cleanlab-codex/compare/v1.0.6...v1.0.7
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -25,7 +25,7 @@ classifiers = [
"Programming Language :: Python :: Implementation :: PyPy",
]
dependencies = [
"cleanlab-tlm~=1.0.12",
"cleanlab-tlm~=1.0.18",
"codex-sdk==0.1.0-alpha.14",
"pydantic>=2.0.0, <3",
]
2 changes: 1 addition & 1 deletion src/cleanlab_codex/__about__.py
@@ -1,2 +1,2 @@
# SPDX-License-Identifier: MIT
__version__ = "1.0.9"
__version__ = "1.0.10"
84 changes: 83 additions & 1 deletion src/cleanlab_codex/validator.py
@@ -13,7 +13,9 @@
get_default_evaluations,
get_default_trustworthyrag_config,
)
from cleanlab_codex.internal.validator import update_scores_based_on_thresholds as _update_scores_based_on_thresholds
from cleanlab_codex.internal.validator import (
update_scores_based_on_thresholds as _update_scores_based_on_thresholds,
)
from cleanlab_codex.project import Project

if TYPE_CHECKING:
@@ -131,6 +133,41 @@ def validate(
**scores,
}

async def validate_async(
self,
query: str,
context: str,
response: str,
prompt: Optional[str] = None,
form_prompt: Optional[Callable[[str, str], str]] = None,
) -> dict[str, Any]:
"""Evaluate whether the AI-generated response is bad, and if so, request an alternate expert answer.
If no expert answer is available, this query is still logged for SMEs to answer.

Args:
query (str): The user query that was used to generate the response.
context (str): The context that was retrieved from the RAG Knowledge Base and used to generate the response.
response (str): A response from your LLM/RAG system.
prompt (str, optional): Optional prompt representing the actual inputs (combining query, context, and system instructions into one string) to the LLM that generated the response.
form_prompt (Callable[[str, str], str], optional): Optional function to format the prompt based on query and context. Cannot be provided together with prompt; provide one or the other. This function should take query and context as parameters and return a formatted prompt string. If not provided, a default prompt formatter will be used. To include a system prompt or any other special instructions for your LLM, incorporate them directly in your custom form_prompt() function definition.

Returns:
dict[str, Any]: A dictionary containing:
- 'expert_answer': Alternate SME-provided answer from Codex if the response was flagged as bad and an answer was found in the Codex Project, or None otherwise.
- 'is_bad_response': True if the response is flagged as potentially bad, False otherwise. When True, a Codex lookup is performed, which logs this query into the Codex Project for SMEs to answer.
- Additional keys from a [`ThresholdedTrustworthyRAGScore`](/codex/api/python/types.validator/#class-thresholdedtrustworthyragscore) dictionary: each corresponds to a [TrustworthyRAG](/tlm/api/python/utils.rag/#class-trustworthyrag) evaluation metric, and points to the score for this evaluation as well as a boolean `is_bad` flagging whether the score falls below the corresponding threshold.
"""
scores, is_bad_response = await self.detect_async(query, context, response, prompt, form_prompt)
expert_answer = None
if is_bad_response:
expert_answer = self._remediate(query)

return {
"expert_answer": expert_answer,
"is_bad_response": is_bad_response,
**scores,
}

def detect(
self,
query: str,
@@ -176,6 +213,51 @@ def detect(
is_bad_response = any(score_dict["is_bad"] for score_dict in thresholded_scores.values())
return thresholded_scores, is_bad_response

async def detect_async(
self,
query: str,
context: str,
response: str,
prompt: Optional[str] = None,
form_prompt: Optional[Callable[[str, str], str]] = None,
) -> tuple[ThresholdedTrustworthyRAGScore, bool]:
"""Score response quality using TrustworthyRAG and flag bad responses based on configured thresholds.

Note:
Use this method instead of `validate_async()` to test/tune detection configurations like score thresholds and TrustworthyRAG settings.
This `detect_async()` method will not affect your Codex Project, whereas `validate_async()` will log queries whose response was detected as bad into the Codex Project and is thus only suitable for production, not testing.
Both this method and `validate_async()` rely on the same detection logic, so you can use this method to first optimize detections and then switch to using `validate_async()`.

Args:
query (str): The user query that was used to generate the response.
context (str): The context that was retrieved from the RAG Knowledge Base and used to generate the response.
response (str): A response from your LLM/RAG system.
prompt (str, optional): Optional prompt representing the actual inputs (combining query, context, and system instructions into one string) to the LLM that generated the response.
form_prompt (Callable[[str, str], str], optional): Optional function to format the prompt based on query and context. Cannot be provided together with prompt; provide one or the other. This function should take query and context as parameters and return a formatted prompt string. If not provided, a default prompt formatter will be used. To include a system prompt or any other special instructions for your LLM, incorporate them directly in your custom form_prompt() function definition.

Returns:
tuple[ThresholdedTrustworthyRAGScore, bool]: A tuple containing:
- ThresholdedTrustworthyRAGScore: Quality scores for different evaluation metrics like trustworthiness
and response helpfulness. Each metric has a score between 0 and 1, along with a boolean flag, `is_bad`, indicating whether the score falls below the corresponding threshold.
- bool: True if the response is determined to be bad based on the evaluation scores
and configured thresholds, False otherwise.
"""
scores = await self._tlm_rag.score_async(
response=response,
query=query,
context=context,
prompt=prompt,
form_prompt=form_prompt,
)

thresholded_scores = _update_scores_based_on_thresholds(
scores=scores,
thresholds=self._bad_response_thresholds,
)

is_bad_response = any(score_dict["is_bad"] for score_dict in thresholded_scores.values())
return thresholded_scores, is_bad_response

def _remediate(self, query: str) -> str | None:
"""Request a SME-provided answer for this query, if one is available in Codex.

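
For reviewers, a minimal usage sketch of the async methods added in this PR. This is not part of the diff: the constructor argument (`codex_access_key`) and the example inputs are assumptions for illustration and may not match the actual `Validator` signature.

```python
# Hedged usage sketch for the new async Validator API (not part of this PR's diff).
# Assumes Validator is constructed with a Codex access key; adjust to the real signature.
import asyncio

from cleanlab_codex.validator import Validator


async def main() -> None:
    validator = Validator(codex_access_key="YOUR_CODEX_ACCESS_KEY")  # hypothetical config

    query = "What is the return policy?"
    context = "Returns are accepted within 30 days of purchase."
    response = "You can return items within 30 days."

    # detect_async scores the response without logging anything to the Codex Project,
    # so it is suitable for tuning thresholds during development.
    scores, is_bad = await validator.detect_async(query=query, context=context, response=response)
    print("is_bad_response:", is_bad)

    # validate_async additionally looks up (and logs) the query in Codex when the
    # response is flagged as bad, returning an expert answer if one exists.
    result = await validator.validate_async(query=query, context=context, response=response)
    print("expert_answer:", result["expert_answer"])


asyncio.run(main())
```

Note that in this diff only the TrustworthyRAG scoring is awaited (via `score_async`); `validate_async` still calls the synchronous `_remediate()` for the Codex lookup.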