Update with Sonnet 3.5 and Gemini 1.5 Pro results

carlini · Jun 23, 2024 · d0ecd8c
1 parent dfea228
Showing 1 changed file with 9 additions and 7 deletions.
diff --git a/README.md b/README.md
@@ -49,13 +49,15 @@ This is helpful for determining whether or not models are capable of performing
 ## Results
 
 I've evaluated a few models on this benchmark. Here's how they perform:
-* GPT-4o: 49% passed
-* Claude 3 Opus: 43% passed
-* Claude 3 Sonnet: 33% passed
-* Mistral Large: 29% passed
-* GPT-3.5: 27% passed
-* Mistral Medium: 24% passed
-* Gemini Pro 1.0: 18% passed
+* Claude 3.5 Sonnet: 48% passed
+* GPT 4o: 47% passed
+* Claude 3 Opus: 42% passed
+* Claude 3 Sonnet: 32% passed
+* Gemini 1.5 Pro: 32% passed
+* Mistral Large: 28% passed
+* GPT 3.5: 26% passed
+* Mistral Medium: 23% passed
+* Gemini 1.0 Pro: 17% passed
 
 A complete evaluation grid is available [here](https://nicholas.carlini.com/writing/2024/evaluation_examples/index.html).