
o1 eval results #22
Open
stalkermustang opened this issue Sep 15, 2024 · 4 comments

stalkermustang commented Sep 15, 2024

Hey, I'm really curious how the new OpenAI models (either o1-mini or o1-preview) perform here. Looking forward to checking the updated leaderboard 🙌🙌

carlini (Owner) commented Sep 15, 2024

Preliminarily: 61%. Running the eval surfaced a bug in two of the test cases, so I'll fix those and then upload the results this weekend.

carlini (Owner) commented Sep 15, 2024

(One caveat I don't know how to correct for: I can't control the temperature of o1. I think 4o/3.5 Sonnet do slightly better at a lower temperature than the one I run at, so I'm not sure how best to handle this. It'll require some thought.)

stalkermustang (Author) commented Sep 15, 2024

IIRC you can't set system prompts or temperature for the o1 series. There's no workaround as far as I know.
The idea behind fixing the temperature is that the model needs to be "creative" during its thought process: if it struggles to make progress on the chosen path, it will generate new ideas and try new things.
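For reference, a minimal sketch of how a harness might have to special-case the o1 series when calling the OpenAI chat completions API. The helper name `query` and the fallback system prompt are illustrative assumptions, not this repo's actual code:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query(model: str, prompt: str, temperature: float = 0.7) -> str:
    """Illustrative helper: o1-series models reject a non-default
    temperature and system messages, so both are omitted for them."""
    messages = [{"role": "user", "content": prompt}]
    kwargs = {}
    if not model.startswith("o1"):
        # Only non-o1 models accept a system prompt and a custom temperature.
        messages.insert(0, {"role": "system",
                            "content": "You are a careful coding assistant."})
        kwargs["temperature"] = temperature
    resp = client.chat.completions.create(model=model, messages=messages, **kwargs)
    return resp.choices[0].message.content
```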

carlini (Owner) commented Sep 15, 2024

Yeah. I'm trying to decide whether to show the evaluation grid at a higher temperature so you can see more diversity in the outputs, but then report the "best accuracy" at temperature=0 for the other models too, or something like that.
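As a rough illustration of that trade-off (the hook `run_task(task, temperature) -> bool` is hypothetical, not the benchmark's real interface): show several high-temperature samples per task in the grid, but compute the reported score from a single deterministic temperature=0 run so all models are compared under the same setting.

```python
def grid_and_score(tasks, run_task, n_samples=5, grid_temp=0.7):
    # Grid view: several high-temperature samples per task, so readers
    # can see the diversity in the model's outputs.
    grid = {task: [run_task(task, grid_temp) for _ in range(n_samples)]
            for task in tasks}
    # Reported number: one temperature=0 attempt per task, so every
    # model's headline accuracy is computed the same way.
    score = sum(run_task(task, 0.0) for task in tasks) / len(tasks)
    return grid, score
```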
