Skip to content

LLM Divergent Thinking Creativity Benchmark. LLMs generate 25 unique words that start with a given letter with no connections to each other or to 50 initial random words.

Notifications You must be signed in to change notification settings

lechmazur/divergent

Repository files navigation

LLM Divergent Thinking Creativity Benchmark

Open-ended divergent thinking tests, which ask individuals to list words as distinct from one another as possible, are often used to evaluate originality and fluency. This benchmark presents a more challenging variation of the divergent thinking test, where LLMs are provided with an initial list of 50 random words. The task requires the LLMs to generate 25 words that are not only highly distinct from each other but also unrelated to the initial 50 words, with the additional constraint that each word must start with a specified letter.

Method

  • Each LLM generates 25 words that are as distinct as possible from one another and from the initial 50 words, with each word beginning with a specified letter
  • The first letters used are: a, b, c, d, e, f, g, h, i, k, l, m, n, o, p, r, s, t, u, w, y, and one of v, x, z, j, or q.
  • Each LLM is prompted 88 times to generate 25 words, resulting in a total of 2,200 words generated per LLM.
  • Each pair of potentially related words (1,209,932 unique combinations) is evaluated by four LLMs: GPT-4o, Claude 3.5 Sonnet (2024-10-22), Grok 2 (12-12), and Gemini 1.5 Pro on a of scale 0 to 10. For each generated word, the average LLM score of minimum divergences between this word and other words was used.
  • Each generated word is also evaluated by these four LLMs to determine how well it follows the rules (e.g., no proper nouns, real English words, no hyphens).

Results

Higher scores indicate better performance.

scores

For completeness, a chart with the Y-axis starting at 0 is also provided:

graph-0

Model Score
o1-preview 4.79
Gemini 2.0 Flash Exp 4.65
Claude 3 Opus 4.47
Grok 2 12-12 4.45
Llama 3.3 70B 4.44
Gemini 2.0 Flash Thinking Exp 4.41
Claude 3.5 Sonnet 2024-10-22 4.41
Gemma 2 27B 4.37
o1-mini 4.20
Claude 3.5 Haiku 4.16
Mistral Large 2 4.14
GPT-4o mini 4.12
Gemini 1.5 Flash 4.09
Gemini 1.5 Pro (Sept) 4.07
Claude 3 Haiku 3.98
Qwen 2.5 72B 3.89
Llama 3.1 405B 3.83
DeepSeek-V2.5 3.76
GPT-4o 3.73

The table below highlights the percentage of repeated words, which helps explain GPT-4o's poor performance:

chart2

Model Name % Repeats
Llama 3.3 70B 0.00%
o1-preview 0.00%
Gemini 2.0 Flash Thinking Exp 0.18%
Claude 3.5 Sonnet 2024-10-22 0.50%
Gemini 1.5 Flash 0.68%
Gemini 1.5 Pro (Sept) 0.95%
Llama 3.1 405B 1.00%
Gemma 2 27B 1.14%
o1-mini 2.23%
Gemini 2.0 Flash Exp 2.64%
Claude 3 Opus 4.00%
Mistral Large 2 5.05%
Claude 3.5 Haiku 5.45%
Grok 2 12-12 5.64%
GPT-4o mini 8.45%
Qwen 2.5 72B 11.23%
Claude 3 Haiku 16.36%
GPT-4o 23.68%
DeepSeek-V2.5 25.09%

Updates and Contact

About

LLM Divergent Thinking Creativity Benchmark. LLMs generate 25 unique words that start with a given letter with no connections to each other or to 50 initial random words.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published