
o1 eval results #22
Open
stalkermustang opened this issue Sep 15, 2024 · 4 comments

stalkermustang commented Sep 15, 2024

Hey, I'm really curious how the new OpenAI models (either o1-mini or o1-preview) perform here. Looking forward to checking the updated leaderboard 🙌🙌

carlini (Owner) commented Sep 15, 2024

Preliminarily: 61%. Running the eval surfaced a bug in two of the test cases, so I'll fix those and then upload the results this weekend.

carlini (Owner) commented Sep 15, 2024

(One caveat I don't know how to correct for: I can't control the temperature of o1. I think 4o/3.5 Sonnet do slightly better at a lower temperature than the one I run at, so I'm not sure how best to handle this. It'll require some thought.)

stalkermustang (Author) commented Sep 15, 2024

IIRC you can't set system prompts or temperature for the o1 series. There's no workaround as far as I know.
The idea behind fixing the temperature is that the model needs to be "creative" during its thought process: if it struggles to make progress on the chosen path, it will generate new ideas and try new things.
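For reference, a minimal sketch of how a harness might have to special-case the o1 series when calling the OpenAI chat completions API. The helper name `query` and the fallback system prompt are illustrative assumptions, not this repo's actual code:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query(model: str, prompt: str, temperature: float = 0.7) -> str:
    """Illustrative helper: o1-series models reject a non-default
    temperature and system messages, so both are omitted for them."""
    messages = [{"role": "user", "content": prompt}]
    kwargs = {}
    if not model.startswith("o1"):
        # Only non-o1 models accept a system prompt and a custom temperature.
        messages.insert(0, {"role": "system",
                            "content": "You are a careful coding assistant."})
        kwargs["temperature"] = temperature
    resp = client.chat.completions.create(model=model, messages=messages, **kwargs)
    return resp.choices[0].message.content
```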

carlini (Owner) commented Sep 15, 2024

Yeah. I'm trying to decide whether to show the evaluation grid at a higher temperature so you can see more diversity in the outputs, but then report the "best accuracy" at temperature=0 for the other models too, or something like that.
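As a rough illustration of that trade-off (the hook `run_task(task, temperature) -> bool` is hypothetical, not the benchmark's real interface): show several high-temperature samples per task in the grid, but compute the reported score from a single deterministic temperature=0 run so all models are compared under the same setting.

```python
def grid_and_score(tasks, run_task, n_samples=5, grid_temp=0.7):
    # Grid view: several high-temperature samples per task, so readers
    # can see the diversity in the model's outputs.
    grid = {task: [run_task(task, grid_temp) for _ in range(n_samples)]
            for task in tasks}
    # Reported number: one temperature=0 attempt per task, so every
    # model's headline accuracy is computed the same way.
    score = sum(run_task(task, 0.0) for task in tasks) / len(tasks)
    return grid, score
```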
