expect json response from the LLM #53

Merged: davidsbailey merged 22 commits into main on Feb 26, 2024

Conversation

davidsbailey (Member) commented Feb 15, 2024

Description

Adds an option to handle a JSON response from the LLM instead of a TSV response. This improves accuracy and reliability for some LLMs, including GPT 4 Turbo.

The code changes in this PR enable running both the 'json' and 'reason' scenarios in the LLM comparison spreadsheet, for GPT 4 classic and turbo models.
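
For context on the two formats: the TSV flavor asks the model to return one tab-separated row per rubric key concept (roughly Key Concept, Grade, Reason columns), while the JSON flavor asks for the same information as a list of objects, e.g. something like [{"Key Concept": "...", "Grade": "...", "Reason": "..."}]. The exact field names and ordering are defined by the prompts in the S3 experiments described below, so treat this as an illustration rather than the precise format.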

Code changes

  • add response type param to params.json, which can either be 'tsv' (default) or 'json'.
  • use response type to determine which filename suffix to read example LLM responses from
  • add parse_json_response method (a rough sketch follows this list)
  • use response type to decide whether to parse the LLM response as json
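
For illustration only, here is a minimal sketch of what the JSON parsing path could look like; the actual parse_json_response added in this PR may differ, and the fence stripping and field names below are assumptions rather than the real implementation:

import json

def parse_json_response(response_text):
    # Sketch, not the PR's code: some models wrap their JSON in a markdown
    # code fence, so strip that before parsing (an assumption).
    text = response_text.strip()
    if text.startswith("```"):
        text = text.strip("`")
        if text.lstrip().lower().startswith("json"):
            text = text.lstrip()[4:]
    # Expect a list of objects, one per rubric key concept.
    rows = json.loads(text)
    # Normalize to the same per-concept rows the TSV path produces
    # (field names here are illustrative).
    return [(row.get("Key Concept"), row.get("Grade"), row.get("Reason")) for row in rows]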

S3 updates

alongside this PR, the following data has been uploaded to s3://cdo-ai/teaching_assistant/experiments/:

  • ai-rubrics-json: copy of ai-rubrics-pilot-baseline, with:
    • system prompts updated to request JSON instead of TSV
    • L14 and L18 updated to provide JSON instead of TSV example responses
  • ai-rubrics-json-reason: copy of ai-rubrics-json, with:
    • system prompt modified to request Reason before Grade, to improve chain-of-thought reasoning
    • L14 and L18 examples regenerated using GPT 4 classic and then hand-tuned for correctness (see below)
  • ai-rubrics-json-gpt-4-turbo: copy of ai-rubrics-json, with gpt-4-0125-preview model
  • ai-rubrics-json-reason-gpt-4-turbo: copy of ai-rubrics-json-reason, with gpt-4-0125-preview model

to help keep costs down and options open, I've included the output/report-exact-match.html files as well as the cached_responses directory. this will allow you, for example, to regenerate a pass-fail version of any report using rubric tester's -c option without incurring further LLM costs.
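
For example, assuming an experiment's cached responses have been pulled down locally, rerunning its report from cache (without new LLM calls) should look roughly like:

python ./lib/assessment/rubric_tester.py --experiment_name ai-rubrics-json -c

Any pass-fail reporting option would be layered on top of this; check the rubric tester readme or --help output for the exact flag names.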

regenerating examples

as part of configuring ai-rubrics-json-reason, I used rubric tester to regenerate the examples (see steps added to the readme in this PR). Since I was regenerating existing examples rather than creating them from scratch, I chose the labels for the actual_labels.csv file in the temp dataset based on the examples/*.tsv files I was replacing, rather than having to determine those values by hand.

rubric tester commands

The commands run to produce the results in the LLM comparison spreadsheet are:

python ./lib/assessment/rubric_tester.py --experiment_name ai-rubrics-json
python ./lib/assessment/rubric_tester.py --experiment_name ai-rubrics-json-gpt-4-turbo
python ./lib/assessment/rubric_tester.py --experiment_name ai-rubrics-json-reason
python ./lib/assessment/rubric_tester.py --experiment_name ai-rubrics-json-reason-gpt-4-turbo

before repeating these commands, please note that each GPT 4 classic run costs about $12 and each GPT 4 Turbo run costs about $5, so the above 4 commands cost about $35 to run (!!). see the previous note about how report outputs and cached responses have been uploaded to S3, which should help avoid the cost of redundant test runs; you can also shrink the cost of a test run with the -s and --lesson-names rubric tester params (see the example below).
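
For a cheaper spot check, one option is to limit a run to a single lesson, e.g. something along these lines (the lesson name below is a placeholder, and the exact semantics of -s are described in the readme rather than here):

python ./lib/assessment/rubric_tester.py --experiment_name ai-rubrics-json --lesson-names <lesson-name>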

Testing story

  • updated existing unit tests
  • new unit test for reading json examples
  • new unit test for parsing json response

Base automatically changed from rename-tsv-response-data to main February 16, 2024 17:35
davidsbailey marked this pull request as ready for review February 17, 2024 00:18
davidsbailey requested review from wilkie and a team February 17, 2024 00:18

        assert len(result) == len(examples_set)

        for i, example in enumerate(result):
            assert example[0] == examples_set[i][0]
            assert example[1] == examples_set[i][1]

    def test_should_open_example_js_and_json_files(self, mocker, code_generator, rubric, examples):
        examples_set = examples(rubric)
        print(examples_set)
Contributor

I think this is a stray print.

davidsbailey (Member, Author) replied

thank you! removed here and above

@wilkie (Contributor) left a comment

Wow this is great stuff! I like how you've added the response type as another option... JSON will be so much better to work with overall.

I think I found one stray print, but other than that it looks great.

davidsbailey merged commit c011928 into main Feb 26, 2024
davidsbailey deleted the handle-json-response branch February 26, 2024 18:13
davidsbailey (Member, Author) commented

just wanted to add, thank you @wilkie for all the great test coverage in here! as a python newbie, not having to figure out how to test my code changes from scratch has been a huge help 😁
