adds basic ragas eval #193
base: main
Conversation
requirements.txt (Outdated)

```diff
@@ -10,3 +10,5 @@ pandas
 pandas-stubs
 lm-eval>=0.4.4
 httpx
+
+ragas
```
missing newline at EOF
```python
def __init__(self):
    pass

def run(
```
So for this, a user is expected to bring a list of `Sample` objects, which hold the input, prediction, and ground truth? Are we going to provide a way to build this list of `Sample`s from given files or lists of each category, or is this more so just for use with self-built scripts that import the `Sample` object and build the list themselves?
Updated it such that `dataset` is now either a `pathlib.Path` object or a list of samples, and we read what we need accordingly.
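For illustration, a minimal sketch of how that dual input handling could look. The import path for `Sample`, the JSONL file format, and the assumption that `Sample` is a pydantic model are mine, not necessarily what this PR does:

```python
from pathlib import Path
from typing import List, Union

import pandas as pd

# `Sample` is the PR's sample type; the import path is an assumption.
from instructlab.eval.ragas import Sample


def _load_dataset(dataset: Union[Path, List[Sample]]) -> pd.DataFrame:
    """Accept either a path to a JSONL file on disk or an in-memory list of samples."""
    if isinstance(dataset, Path):
        return pd.read_json(dataset, orient="records", lines=True)
    # assumes Sample is a pydantic model exposing model_dump()
    return pd.DataFrame([sample.model_dump() for sample in dataset])
```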
Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
We want ragas to read from both a file as well as a list of samples

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
Force-pushed from f12fbd7 to f0db4a4
@RobotSail let me know once you are done with testing the code. Other than that LGTM.
src/instructlab/eval/ragas.py (Outdated)

```python
max_tokens: int = 768

# Random seed for reproducibility. This is not supported everywhere and therefore is unreliable.
```
We had discussed this earlier, I think you're going to want to remove this comment.
@alimaredia I'll update this comment because I believe you are confusing its meaning with our earlier conversation. This comment relates to the seed not being supported by every model-serving framework.
When a dataset is provided and is missing the `response` field, we will need to generate these responses. This commit ensures that when this case happens, we will error out when a student model is not configured. Otherwise, we will always generate these responses if the student model exists, regardless of whether `response` is in the dataframe or not.

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
@abhi1092 I've updated the code to have unit tests, please let me know if there was anything I missed.
```python
    api_key="test-api-key",
)
evaluator = RagasEvaluator()
result_df = evaluator._generate_answers_from_model(questions, student_model)
```
@RobotSail shouldn't we hit a mock client here too?
Instead of using gpt-3.5?
@abhi1092 We do test it with a mock client; GPT-3.5 was just the first model that came to mind as a fill-in.
Updated this to use fake values to eliminate the possibility of it actually calling out to the real OpenAI API.
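For reference, a sketch of that pattern: fake credentials plus a mocked client, so the test can never reach the real OpenAI API. The `ModelConfig` field names and the post-update signature of `_generate_answers_from_model` (accepting the client directly) are assumptions based on this thread:

```python
from unittest.mock import MagicMock

import pandas as pd

# import path mirrors this PR; exact ModelConfig fields are assumptions
from instructlab.eval.ragas import ModelConfig, RagasEvaluator

student_model = ModelConfig(
    model_name="granite-7b-lab",        # placeholder model name
    base_url="http://localhost:8000",   # fake endpoint, never contacted
    api_key="fake-api-key",             # fake value, cannot reach the real API
)

# The mocked client replaces the OpenAI SDK entirely, so no network I/O occurs.
mock_client = MagicMock()
mock_client.chat.completions.create.return_value = MagicMock(
    choices=[MagicMock(message=MagicMock(content="mocked answer"))]
)

questions = pd.DataFrame({"user_input": ["What is the capital of France?"]})
evaluator = RagasEvaluator()
result_df = evaluator._generate_answers_from_model(
    questions, student_model, mock_client
)
assert (result_df["response"] == "mocked answer").all()
```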
```python
    Given a DataFrame containing a `user_input` column, generates responses from the given model
    and returns a new DataFrame containing its answers in the `response` column.
    """
    client = get_openai_client(
```
@RobotSail maybe this method and also the class can instead take the client as input?
@abhi1092 That's a good idea. Although the way I've done it here is how it happens in MT-Bench as well. Would it make sense to make a follow-up issue for this?
@abhi1092 Actually I will just make this change in this PR since it's small.
Updated the PR to include this.
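For context, a sketch of the dependency-injection shape being agreed on here. The names mirror this thread (`ModelConfig` is the PR's config type), but the bodies are illustrative only:

```python
import openai
import pandas as pd


class RagasEvaluator:
    def __init__(
        self,
        student_model: ModelConfig | None = None,
        openai_client: openai.Client | None = None,
    ):
        # The client is injected rather than constructed internally,
        # so tests can pass a mock and callers can reuse a configured client.
        self.student_model = student_model
        self.openai_client = openai_client

    def _generate_answers_from_model(
        self,
        questions: pd.DataFrame,
        student_model: ModelConfig,
        openai_client: openai.Client,
    ) -> pd.DataFrame:
        ...
```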
```python
########################################################################
# Test case: directly passing a dataset
########################################################################
result = evaluator.run(
```
Here too, maybe we can call a mock gpt-4o with a pre-defined evaluation result. This way we only test the evaluation given the student model response, reference, and judge model output.
Overall, just a couple of questions re: function and testing simplification.
```python
system_prompt: str = _DEFAULT_SYSTEM_PROMPT

# "model randomness" aka likelihood of sampling something other than the likeliest token
temperature: float = 0.0
```
Very minor nit: Pydantic has an inline validation mechanism with `Field`s:

```diff
- temperature: float = 0.0
+ temperature: float = Field(ge=0.0, le=1.0, default=0.0)
```
Thanks for pointing this out, this is way simpler.
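To spell out the suggestion, a small self-contained example of how `Field` constraints behave; the surrounding `ModelConfig` fields come from this PR, while the `_DEFAULT_SYSTEM_PROMPT` value is a placeholder:

```python
from pydantic import BaseModel, Field

_DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant."  # placeholder value


class ModelConfig(BaseModel):
    system_prompt: str = _DEFAULT_SYSTEM_PROMPT
    # Field enforces the range at construction time; no custom validator needed.
    temperature: float = Field(ge=0.0, le=1.0, default=0.0)
    max_tokens: int = Field(gt=0, default=768)


ModelConfig(temperature=0.5)   # fine
ModelConfig(temperature=1.5)   # raises pydantic.ValidationError
```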
```python
def __init__(
    self,
    student_model: ModelConfig | None = None,
```
Can any of these actually be `None`?
Yes, because the user isn't required to pass them in at initialization time.
It might streamline the code if these are required in the constructor so you don't have to do checks below. Although you might be implementing to the standard of the `Evaluator` class, in which case a refactor might make sense in a subsequent PR.
I see your point, but it'd be fine to leave it as-is. This is also how a lot of other libraries, including Ragas, implement similar functionality.
```python
run_config = run_config if run_config else self.run_config
openai_client = openai_client if openai_client else self.openai_client

if not dataset:
```
`dataset` is non-optional, so this check might be redundant.
Good point, I'll fix this.
src/instructlab/eval/ragas.py (Outdated)

```python
# if the student model was provided then we always generate regardless
if student_model:
    if not openai_client:
```
could move this assertion up to the start of the function to fail earlier.
Good point
No, actually we can't do this. We have to first compute whether or not we'll need to generate responses, so there's nothing that can be skipped here.
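A sketch of the ordering constraint being described; names are illustrative, and the error conditions follow the commit message earlier in this thread:

```python
# Whether generation is needed depends on the loaded dataset's columns,
# so the client check cannot be hoisted above the dataset handling.
input_df = _load_dataset(dataset)

# if the student model was provided then we always generate, regardless
# of whether `response` is already present
if student_model is not None:
    if openai_client is None:
        raise ValueError("a student model requires an OpenAI-compatible client")
    input_df = self._generate_answers_from_model(input_df, student_model, openai_client)
elif "response" not in input_df.columns:
    raise ValueError("dataset is missing `response` and no student model is configured")
```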
src/instructlab/eval/ragas.py (Outdated)

```python
        timeouts. This is because by default, OpenAI tier-1 usage accounts have very low
        rate limits, resulting in heavy throttling during evaluations.
    openai_client (openai.Client | None, optional):
        The client to use when generating responses from the student model; must be compatible with the OpenAI API.
```
Is this the client for the student model or the judge model?
This is for the student model. In this PR we are taking the opinionated stance that the judge model needs to be GPT-4o for consistent results.
@patch("instructlab.eval.ragas.evaluate") | ||
@patch.object(RagasEvaluator, "_generate_answers_from_model") | ||
@patch.object(RagasEvaluator, "_get_metrics") | ||
def test_run( |
Could this test that checks multiple things be broken down into separate tests?
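One possible split, as a sketch — the test names and bodies are hypothetical, and the mock arguments follow `unittest.mock`'s bottom-up decorator ordering:

```python
from unittest.mock import patch

from instructlab.eval.ragas import RagasEvaluator


@patch("instructlab.eval.ragas.evaluate")
@patch.object(RagasEvaluator, "_get_metrics")
def test_run_with_dataset_containing_responses(mock_get_metrics, mock_evaluate):
    # exercises only the evaluation path; no answer generation involved
    ...


@patch("instructlab.eval.ragas.evaluate")
@patch.object(RagasEvaluator, "_generate_answers_from_model")
@patch.object(RagasEvaluator, "_get_metrics")
def test_run_generates_responses_when_missing(
    mock_get_metrics, mock_generate, mock_evaluate
):
    # exercises the student-model generation path in isolation
    ...
```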
updated_df["response"] = "" | ||
|
||
for i, qna in updated_df.iterrows(): | ||
messages = [ |
Should this be an array of dictionaries, `{"role": ..., "content": ...}`? e.g.

```python
messages = [
    {"role": "system", "content": student_model.system_prompt},
    {"role": "user", "content": qna["user_input"]},
]
```
Good point, updating
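For completeness, a sketch of what the updated loop could look like with that message shape; the `ModelConfig` attribute names (`model_name`, `seed`, etc.) and the `.at` write-back are assumptions, not the PR's exact code:

```python
for i, qna in updated_df.iterrows():
    messages = [
        {"role": "system", "content": student_model.system_prompt},
        {"role": "user", "content": qna["user_input"]},
    ]
    response = openai_client.chat.completions.create(
        messages=messages,
        model=student_model.model_name,
        # sampling settings come from the ModelConfig discussed above
        temperature=student_model.temperature,
        max_tokens=student_model.max_tokens,
        seed=student_model.seed,
    )
    updated_df.at[i, "response"] = response.choices[0].message.content
```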
Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
This PR introduces Rubric-based evaluation through Ragas using the default with-reference rubric that they provide.

The current evaluation supports the two following modes:

1. `dataset` contains records which hold `user_input` (question), `reference` (golden answer), and `response` (model answer)
2. `dataset` contains the `user_input` and `reference`, and additionally a `ModelConfiguration` of a model is provided, which will be used to generate the `response` for each question. This can be any model running on an OpenAI-compatible endpoint (see the usage sketch below)

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
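For illustration, a hedged usage sketch of the two modes described above; the file names, the `ModelConfig` fields, and the exact `run()` keyword arguments are placeholders rather than the PR's confirmed API:

```python
from pathlib import Path

from openai import OpenAI

# import path mirrors this PR; field values below are placeholders
from instructlab.eval.ragas import ModelConfig, RagasEvaluator

evaluator = RagasEvaluator()

# Mode 1: the dataset already holds user_input, reference, and response.
result = evaluator.run(dataset=Path("qa_with_responses.jsonl"))

# Mode 2: the dataset holds only user_input and reference; responses are
# generated by a student model served on an OpenAI-compatible endpoint.
student = ModelConfig(model_name="granite-7b-lab")  # placeholder name
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
result = evaluator.run(
    dataset=Path("qa_without_responses.jsonl"),
    student_model=student,
    openai_client=client,
)
print(result)
```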