SODA Dataset for Training #35
Merged
Changes from 14 commits
17 commits
f2b0d6d soda dataset for training (farzadab)
ee93d20 soda: alt_last_turn + convo eval (farzadab)
ea2e9c6 Merge remote-tracking branch 'origin/main' into farzad-soda-train (farzadab)
be8083e bugfix: 6 -> 64 max_new_tokens for soda (farzadab)
aa146e7 add soda to stage2 configs (farzadab)
b0e83cc fix formatting (farzadab)
c209dda change prompt so truncation won't affect evaluation (farzadab)
06b45ff soda prompt fix (farzadab)
d9f8c23 fix soda text-only (farzadab)
3c8cbd7 fix t_end not defined (farzadab)
59a5742 rename audio_one_but_last -> audio_second_last_turn (farzadab)
20c27cb allowing multiple messages in inference (farzadab)
87aaf19 eval Sample history: list of str to list of dict (farzadab)
e4ec48a separate gpt_evals + test for conv eval (farzadab)
075760d make evaluate_answer_gpt public (farzadab)
31c2207 dataset sample prompt: % idx -> RNG (farzadab)
99c1409 add check to make sure audio column is resampled (farzadab)
@@ -0,0 +1,23 @@
from ultravox.evaluation import eval_types
from ultravox.evaluation import gpt_eval

BOOLQ_SYSTEM_PROMPT = f"""
You are an expert evaluator of AI systems.
Given a question with a known true/false answer, you will be rating the correctness of an AI model's answer to that same question.
Based on the supplied question, answer, and expected (correct) answer, you will rate the model's answer as either correct or incorrect.
Award 1 point if the model's answer matches the correct answer, and 0 points if the model's answer does not match, or cannot be converted to a true/false verdict.
Model answers of the form "True", "Yes", "Yeah", etc., should be considered to match a True answer.
Model answers of the form "False", "No", "Incorrect", etc., should be considered to match a False answer.
Only use the supplied correct answer to make your decision; DO NOT use your own knowledge to determine correctness.
Your response MUST start with either 0 or 1, followed by a space, and then a brief explanation for why you awarded that score.
"""
BOOLQ_USER_PROMPT = """
Using the supplied correct answer as ground truth, evaluate the model's answer to the question below:
Question: {{ question }}
Model answer: {{ generated_answer }}
Correct answer: {{ expected_answer }}
"""


def evaluate_answer_boolq(sample: eval_types.Sample) -> eval_types.InstructResult:
    return gpt_eval._evaluate_answer_gpt(BOOLQ_SYSTEM_PROMPT, BOOLQ_USER_PROMPT, sample)
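The system prompt above requires the judge's reply to start with 0 or 1, followed by a space and an explanation. A minimal sketch of how such a reply could be split into a score and a reason follows. `JudgeVerdict` and `parse_judge_reply` are hypothetical names used only for illustration; the parsing that `gpt_eval` actually performs is not shown in this diff and may differ.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeVerdict:
    """Hypothetical container for a parsed judge reply."""

    score: Optional[int]  # None when the reply does not follow the required format
    reason: str


# A single 0 or 1, optionally followed by whitespace and free-form text.
_REPLY_RE = re.compile(r"\s*([01])(?:\s+(.*))?$", re.DOTALL)


def parse_judge_reply(reply: str) -> JudgeVerdict:
    """Split a reply of the form '<0|1> <explanation>' into score and reason."""
    match = _REPLY_RE.match(reply)
    if match is None:
        # The reply broke the format; keep the raw text so it can be inspected.
        return JudgeVerdict(score=None, reason=reply.strip())
    return JudgeVerdict(
        score=int(match.group(1)),
        reason=(match.group(2) or "").strip(),
    )
```

Returning `None` for a malformed reply, rather than defaulting to 0, keeps format failures distinguishable from genuinely incorrect answers.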
@@ -0,0 +1,29 @@
from ultravox.evaluation import eval_types
from ultravox.evaluation import gpt_eval

CONVO_SYSTEM_PROMPT = f"""
You are an expert evaluator of conversational AI systems.
Given a conversation between two parties, the role of the AI system was to follow the flow of the conversation and respond appropriately.
You are given the conversation, the AI model's response, and an exemplary (correct) response.
The AI model response might be truncated, but that should not affect your evaluation.
You should award 1 point if the model's response is appropriate and follows the conversation, and 0 points if it does not, such as being off-topic or nonsensical.
Your response MUST start with either 0 or 1, followed by a space, and then an explanation for why you awarded that score.
"""

CONVO_USER_PROMPT = """
Using the supplied example of a correct answer, evaluate the model's ability to follow the flow of the conversation in the last message:

Conversation:
{%- for turn in history + [ {"role": "user", "content": question} ] %}
{% if turn["role"] == "user" %}A{% else %}B{% endif %}: {{ turn["content"] }}
{% endfor %}
Model (as B): {{ generated_answer }}
Correct: {{ expected_answer }}
"""


def evaluate_conversation_response(
    sample: eval_types.Sample,
) -> eval_types.InstructResult:
    sample.history = [msg for msg in sample.history if msg["role"] != "system"]
    return gpt_eval._evaluate_answer_gpt(CONVO_SYSTEM_PROMPT, CONVO_USER_PROMPT, sample)
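To make the speaker labelling in the template easier to follow, here is a plain-Python sketch of the same idea: system messages are dropped, the current question is appended as a final user turn, and user turns are labelled A while all other turns are labelled B. `render_conversation` is a hypothetical helper, not part of `ultravox.evaluation`, and it only approximates the labels; the exact whitespace produced by the real template is not reproduced.

```python
from typing import Dict, List

Message = Dict[str, str]


def render_conversation(history: List[Message], question: str) -> str:
    """Label user turns as A and every other turn as B, one turn per line."""
    # Drop system messages, mirroring the filter in evaluate_conversation_response.
    turns = [msg for msg in history if msg["role"] != "system"]
    # The question being answered is treated as the last user turn.
    turns.append({"role": "user", "content": question})

    lines = []
    for turn in turns:
        speaker = "A" if turn["role"] == "user" else "B"
        lines.append(f"{speaker}: {turn['content']}")
    return "\n".join(lines)
```

Building a new `turns` list leaves the caller's `history` untouched, whereas the function in the diff reassigns `sample.history` in place.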
@@ -0,0 +1,26 @@
from ultravox.evaluation import eval_types
from ultravox.evaluation import gpt_eval

INSTRUCT_SYSTEM_PROMPT = f"""
You are an expert evaluator of AI systems.
Given a question with a specified instruction, you will be rating the correctness of an AI model's ability to follow that instruction.
Based on the supplied answer, and exemplary (correct) answer, you will rate the model's answer as either correct or incorrect.
Award 1 point if the model followed the instruction, and 0 points if it did not.
For example, given a question with an instruction of "Write a sentence about pickleball",
- if the model responds "Pickleball is a tennis-like game played with a wiffle ball.", you should award 1 point.
- if the model responds "Pickleball is a type of fruit", you should award 0 points.
- if the model responds with something off-topic or nonsensical, you should award 0 points.
Your response MUST start with either 0 or 1, followed by a space, and then an explanation for why you awarded that score.
"""
INSTRUCT_USER_PROMPT = """
Using the supplied correct answer as an example, evaluate the model's ability to follow the instructions in the question below:
Question: {{ question }}
Model answer: {{ generated_answer }}
Correct answer: {{ expected_answer }}
"""


def evaluate_answer_instruct(sample: eval_types.Sample) -> eval_types.InstructResult:
    return gpt_eval._evaluate_answer_gpt(
        INSTRUCT_SYSTEM_PROMPT, INSTRUCT_USER_PROMPT, sample
    )
@@ -0,0 +1,31 @@
import re
from unittest import mock

from ultravox.evaluation import eval_types
from ultravox.evaluation import gpt_eval
from ultravox.evaluation import gpt_eval_conv


def test_evaluate_conversation():
    gpt_eval.client = mock.MagicMock()
    sample = eval_types.Sample(
        history=[
            {"role": "system", "content": "Blah blah blah"},
            {"role": "user", "content": "T1"},
            {"role": "assistant", "content": "T2"},
        ],
        question="T3",
        generated_answer="T4",
        expected_answer="EXP",
    )
    expected_turns = "A: T1\n\nB: T2\n\nA: T3\n\nModel (as B): T4\nCorrect: EXP"

    gpt_eval_conv.evaluate_conversation_response(sample)

    completion_args = gpt_eval.client.chat.completions.create.call_args[1]
    assert len(completion_args["messages"]) == 2
    assert completion_args["messages"][0]["role"] == "system"
    assert completion_args["messages"][1]["role"] == "user"
    gpt_question = re.sub("\n *", "\n", completion_args["messages"][1]["content"])
    assert expected_turns in gpt_question
    assert "Blah blah blah" not in gpt_question
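The test relies on two mechanics that are easy to miss: `call_args[1]` on a `MagicMock` holds the keyword arguments of the most recent call, and `re.sub("\n *", "\n", ...)` strips leading spaces after each newline so that indentation in the rendered prompt does not affect the comparison. The standalone sketch below shows both using only the standard library. The model name and message contents are placeholders chosen for illustration, not values taken from `gpt_eval`.

```python
import re
from unittest import mock

# Stand-in for a chat client; attribute chains on MagicMock are created on demand.
client = mock.MagicMock()

# A made-up call shaped like the one the test inspects.
client.chat.completions.create(
    model="judge-model",  # placeholder name, assumed for this sketch
    messages=[
        {"role": "system", "content": "SYS"},
        {"role": "user", "content": "Conversation:\n    A: T1\n    B: T2"},
    ],
)

# call_args is an (args, kwargs) pair; index 1 selects the keyword arguments.
kwargs = client.chat.completions.create.call_args[1]
roles = [message["role"] for message in kwargs["messages"]]

# Remove any run of spaces that follows a newline.
normalized = re.sub("\n *", "\n", kwargs["messages"][1]["content"])
```

Because the mock records the call rather than performing it, the test can check the prompt sent to the judge without any network access.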
Did we end up using an RNG for this sort of thing rather than the index?
(I forget where but we discussed adding a private RNG to datasets to allow them to simply pull a value from the RNG rather than using the index counter and various moduli)
Yes, the idea was that we do that in the next PR.
Might as well just do it now I guess since I have the code.
Done.
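For readers who were not part of that earlier discussion, the difference between the two approaches can be sketched roughly as below. This is an illustration under assumptions, not the code from commit 31c2207: `PromptPicker`, its method names, and the default seed are all hypothetical.

```python
import random
from typing import Sequence


class PromptPicker:
    """Hypothetical helper contrasting index-based and RNG-based prompt choice."""

    def __init__(self, prompts: Sequence[str], seed: int = 42) -> None:
        self._prompts = list(prompts)
        # A private generator: seeding it does not touch the global random state.
        self._rng = random.Random(seed)

    def pick_by_index(self, idx: int) -> str:
        # Index-based approach: the choice is tied to the sample counter.
        return self._prompts[idx % len(self._prompts)]

    def pick_by_rng(self) -> str:
        # RNG-based approach: reproducible for a fixed seed, independent of idx.
        return self._rng.choice(self._prompts)
```

With the index-based version, several choices derived from the same counter can become correlated through their moduli; drawing each one from a private RNG avoids that while staying reproducible for a given seed.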