expect json response from the LLM #53
Conversation
assert len(result) == len(examples_set)

for i, example in enumerate(result):
    assert example[0] == examples_set[i][0]
    assert example[1] == examples_set[i][1]

def test_should_open_example_js_and_json_files(self, mocker, code_generator, rubric, examples):
    examples_set = examples(rubric)
    print(examples_set)
I think this is a stray print.
thank you! removed here and above
Wow this is great stuff! I like how you've added the response type as another option... JSON will be so much better to work with overall.
I think I found one stray print, but other than that it looks great.
just wanted to add, thank you @wilkie for all the great test coverage in here! as a python newbie, not having to figure out how to test my code changes from scratch has been a huge help 😁
Description
Adds an option to handle a JSON response from the LLM instead of a TSV response. This improves accuracy and reliability for some LLMs, including GPT 4 Turbo.
The code changes in this PR enable running both the 'json' and 'reason' scenarios in the LLM comparison spreadsheet, for GPT 4 classic and turbo models.
Code changes
- new response type option: `'tsv'` (default) or `'json'`
- new `parse_json_response` method (sketched below)
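To make the new option concrete, here is a minimal sketch of what the JSON path could look like. `parse_json_response` is the method name from this PR, but the response schema (the `"Key Concept"`/`"Grade"` keys) and the fence-stripping are assumptions for illustration, not the actual implementation:

```python
import json

def parse_json_response(response_text):
    """Hedged sketch of the JSON path; the list-of-objects schema and
    key names are assumptions, not this PR's actual implementation."""
    # Some models wrap JSON output in a markdown code fence; strip any
    # surrounding backticks and a leading "json" language tag.
    text = response_text.strip().strip("`").strip()
    if text.startswith("json"):
        text = text[len("json"):].lstrip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(f"LLM response is not valid JSON: {err}")
    # Normalize to the same rows the TSV parsing path produces.
    return [(item["Key Concept"], item["Grade"]) for item in data]
```

Compared with TSV, a failed parse here raises immediately rather than silently misaligning columns, which is part of why JSON is more reliable for some models.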
S3 updates
alongside this PR, the following data has been uploaded to `s3://cdo-ai/teaching_assistant/experiments/`:
- `ai-rubrics-json`: copy of `ai-rubrics-pilot-baseline`, with the `'json'` response type
- `ai-rubrics-json-reason`: copy of `ai-rubrics-json`, with the `'reason'` scenario enabled
- `ai-rubrics-json-gpt-4-turbo`: copy of `ai-rubrics-json`, with the `gpt-4-0125-preview` model
- `ai-rubrics-json-reason-gpt-4-turbo`: copy of `ai-rubrics-json-reason`, with the `gpt-4-0125-preview` model
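if you want to pull one of those experiments down locally, here's a rough boto3 sketch (the bucket and prefix come from this PR; the local `experiments/` destination is an arbitrary choice, and `aws s3 sync` would work just as well):

```python
import os

import boto3

# Sketch: mirror one experiment prefix locally so reports can be
# regenerated without new LLM calls. Bucket and prefix come from this
# PR; the local destination directory is an assumption.
BUCKET = "cdo-ai"
PREFIX = "teaching_assistant/experiments/ai-rubrics-json/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):  # skip folder placeholder keys
            continue
        dest = os.path.join("experiments", os.path.relpath(key, "teaching_assistant/experiments/"))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        s3.download_file(BUCKET, key, dest)
```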
to help keep costs down and options open, I've included the `output/report-exact-match.html` files as well as the `cached_responses` directory. this will allow you, for example, to regenerate a pass-fail version of any report using rubric tester's `-c` option without incurring further LLM costs.

regenerating examples
as part of configuring `ai-rubrics-json-reason`, I used rubric tester to regenerate the examples (see steps added to the readme in this PR). Since I was regenerating existing examples rather than creating them from scratch, I chose the labels for the `actual_labels.csv` file in the temp dataset based on the `examples/*.tsv` files I was replacing, rather than having to determine those values from scratch.
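for illustration, a hedged sketch of that label carry-over; the `Grade` column name and the two-column `actual_labels.csv` layout are assumptions here, not rubric tester's real file formats:

```python
import csv
import glob
import os

# Sketch: seed actual_labels.csv from the labels in the existing
# examples/*.tsv files being replaced. Column names and output layout
# are assumptions for illustration.
rows = []
for path in sorted(glob.glob("examples/*.tsv")):
    with open(path, newline="") as f:
        for record in csv.DictReader(f, delimiter="\t"):
            rows.append({
                "example": os.path.splitext(os.path.basename(path))[0],
                "label": record.get("Grade", ""),
            })

with open("actual_labels.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["example", "label"])
    writer.writeheader()
    writer.writerows(rows)
```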
rubric tester commands
The commands run to produce the results in LLM comparison are:
before repeating these commands, please note that each GPT 4 classic run costs about $12 and each GPT 4 Turbo run costs about $5, so the above 4 commands cost about $35 to run (!!). see previous note about how report outputs and cached responses have been uploaded to S3 to hopefully avoid the cost of redundant test runs, or shrink the cost of your test runs with the `-s` and `--lesson-names` rubric tester params.

Testing story