o1 eval results #22
Comments
Preliminarily: 61%. Found a bug in two of the test cases because of it, so will fix those and then upload results this weekend.
(One caveat, though, that I don't know how to correct for: I can't control the temperature of o1. I think 4o and 3.5 Sonnet do slightly better at a lower temperature than the one I run at, so I don't know how best to handle this. It'll require some thought.)
IIRC you can't set system prompts or temperature for the o1 series. There's no workaround as far as I know.
Yeah. I'm trying to decide whether to show the evaluation grid at a higher temperature, so you can see more diversity in outputs, but then report the "best accuracy" at temperature=0 for the other models too, or something like that.
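For what it's worth, here is a minimal sketch of what that two-number reporting could look like. It is not the benchmark's actual code: the `Results` layout, the `summarize` function, and the convention of using `None` as the key for a model whose temperature can't be set (like o1) are all assumptions made up for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical data layout: model name -> temperature -> pass/fail per test case.
# A key of None marks a model whose temperature cannot be set (e.g. o1).
Results = Dict[str, Dict[Optional[float], List[bool]]]


def accuracy(outcomes: List[bool]) -> float:
    """Fraction of test cases that passed (0.0 for an empty run)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def summarize(results: Results, grid_temperature: float) -> Dict[str, Dict[str, float]]:
    """Report, per model, accuracy at the grid temperature and at temperature 0.

    Models with only a fixed-temperature run (key None) get that run's
    accuracy for both numbers, since nothing else can be measured.
    """
    summary: Dict[str, Dict[str, float]] = {}
    for model, by_temp in results.items():
        fixed = by_temp.get(None)
        grid = by_temp.get(grid_temperature, fixed)
        greedy = by_temp.get(0.0, fixed)
        if grid is None or greedy is None:
            raise ValueError(f"missing runs for {model}")
        summary[model] = {"grid": accuracy(grid), "best": accuracy(greedy)}
    return summary
```

This keeps the fixed-temperature model's two numbers identical rather than pretending a temperature=0 run exists, so the comparison is still not apples to apples and would need a note next to the table.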
Hey, really curious how the new OAI models (either mini or preview) perform here. Looking forward to checking the updated leaderboard 🙌🙌