Open-ended divergent thinking tests, which ask individuals to list words as distinct from one another as possible, are often used to evaluate originality and fluency. This benchmark presents a more challenging variation of the divergent thinking test, where LLMs are provided with an initial list of 50 random words. The task requires the LLMs to generate 25 words that are not only highly distinct from each other but also unrelated to the initial 50 words, with the additional constraint that each word must start with a specified letter.
- Each LLM generates 25 words that are as distinct as possible from one another and from the initial 50 words, with each word beginning with a specified letter
- The first letters used are: a, b, c, d, e, f, g, h, i, k, l, m, n, o, p, r, s, t, u, w, y, and one of v, x, z, j, or q.
- Each LLM is prompted 88 times to generate 25 words, resulting in a total of 2,200 words generated per LLM.
- Each pair of potentially related words (1,209,932 unique combinations) is evaluated by four LLMs: GPT-4o, Claude 3.5 Sonnet (2024-10-22), Grok 2 (12-12), and Gemini 1.5 Pro on a of scale 0 to 10. For each generated word, the average LLM score of minimum divergences between this word and other words was used.
- Each generated word is also evaluated by these four LLMs to determine how well it follows the rules (e.g., no proper nouns, real English words, no hyphens).
Higher scores indicate better performance.
For completeness, a chart with the Y-axis starting at 0 is also provided:
Model | Score |
---|---|
o1-preview | 4.79 |
Gemini 2.0 Flash Exp | 4.65 |
Claude 3 Opus | 4.47 |
Grok 2 12-12 | 4.45 |
Llama 3.3 70B | 4.44 |
Gemini 2.0 Flash Thinking Exp | 4.41 |
Claude 3.5 Sonnet 2024-10-22 | 4.41 |
Gemma 2 27B | 4.37 |
o1-mini | 4.20 |
Claude 3.5 Haiku | 4.16 |
Mistral Large 2 | 4.14 |
GPT-4o mini | 4.12 |
Gemini 1.5 Flash | 4.09 |
Gemini 1.5 Pro (Sept) | 4.07 |
Claude 3 Haiku | 3.98 |
Qwen 2.5 72B | 3.89 |
Llama 3.1 405B | 3.83 |
DeepSeek-V2.5 | 3.76 |
GPT-4o | 3.73 |
The table below highlights the percentage of repeated words, which helps explain GPT-4o's poor performance:
Model Name | % Repeats |
---|---|
Llama 3.3 70B | 0.00% |
o1-preview | 0.00% |
Gemini 2.0 Flash Thinking Exp | 0.18% |
Claude 3.5 Sonnet 2024-10-22 | 0.50% |
Gemini 1.5 Flash | 0.68% |
Gemini 1.5 Pro (Sept) | 0.95% |
Llama 3.1 405B | 1.00% |
Gemma 2 27B | 1.14% |
o1-mini | 2.23% |
Gemini 2.0 Flash Exp | 2.64% |
Claude 3 Opus | 4.00% |
Mistral Large 2 | 5.05% |
Claude 3.5 Haiku | 5.45% |
Grok 2 12-12 | 5.64% |
GPT-4o mini | 8.45% |
Qwen 2.5 72B | 11.23% |
Claude 3 Haiku | 16.36% |
GPT-4o | 23.68% |
DeepSeek-V2.5 | 25.09% |
- Follow @lechmazur on X (Twitter) for updates
- Also check out LLM Confabulation/Hallucination Benchmark, NYT Connections Benchmark, LLM Deceptiveness and Gullibility Benchmark