LLM Divergent Thinking Creativity Benchmark

Open-ended divergent thinking tests, which ask individuals to list words as distinct from one another as possible, are often used to evaluate originality and fluency. This benchmark presents a more challenging variation of the divergent thinking test, where LLMs are provided with an initial list of 50 random words. The task requires the LLMs to generate 25 words that are not only highly distinct from each other but also unrelated to the initial 50 words, with the additional constraint that each word must start with a specified letter.

Method

Each LLM generates 25 words that are as distinct as possible from one another and from the initial 50 words, with each word beginning with a specified letter.
The first letters used are: a, b, c, d, e, f, g, h, i, k, l, m, n, o, p, r, s, t, u, w, y, and one of v, x, z, j, or q.
Each LLM is prompted 88 times to generate 25 words, resulting in a total of 2,200 words generated per LLM.
Each pair of potentially related words (1,209,932 unique combinations) is evaluated by four LLMs: GPT-4o, Claude 3.5 Sonnet (2024-10-22), Grok 2 (12-12), and Gemini 1.5 Pro, on a scale of 0 to 10. For each generated word, the score is the average, across the grading LLMs, of the minimum divergence between that word and the other words.
Each generated word is also evaluated by these four LLMs to determine how well it follows the rules (e.g., no proper nouns, real English words, no hyphens).
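
The per-word scoring described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the function name `word_score`, the `ratings` layout, and the sample numbers are all hypothetical. It also assumes that the 0 to 10 scale measures divergence (higher means less related), that the minimum is taken per grading LLM before averaging, and that pairs a grader did not rate are simply skipped.

```python
from statistics import mean


def word_score(word, other_words, ratings):
    """Average, over grading LLMs, of the word's minimum pairwise divergence.

    ratings maps a grader name to {frozenset({a, b}): divergence on 0-10}.
    Pairs that a grader did not rate are skipped (an assumption).
    """
    per_grader_minimums = []
    for pair_scores in ratings.values():
        divergences = [
            pair_scores[frozenset((word, other))]
            for other in other_words
            if other != word and frozenset((word, other)) in pair_scores
        ]
        if divergences:
            per_grader_minimums.append(min(divergences))
    if not per_grader_minimums:
        raise ValueError(f"no rated pairs for {word!r}")
    return mean(per_grader_minimums)


# Hypothetical ratings from two graders, for illustration only.
ratings = {
    "grader_a": {
        frozenset(("apple", "anvil")): 8,
        frozenset(("apple", "axiom")): 6,
        frozenset(("anvil", "axiom")): 9,
    },
    "grader_b": {
        frozenset(("apple", "anvil")): 7,
        frozenset(("apple", "axiom")): 9,
        frozenset(("anvil", "axiom")): 5,
    },
}

print(word_score("apple", ["anvil", "axiom"], ratings))
```

Taking the minimum means a word is penalized by its single closest neighbor, so one near-duplicate or repeated word can pull the score down sharply even if every other pair is unrelated.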

Results

Higher scores indicate better performance.

For completeness, a chart with the Y-axis starting at 0 is also provided:

Model	Score
o1-preview	4.79
Gemini 2.0 Flash Exp	4.65
Claude 3 Opus	4.47
Grok 2 12-12	4.45
Llama 3.3 70B	4.44
Gemini 2.0 Flash Thinking Exp	4.41
Claude 3.5 Sonnet 2024-10-22	4.41
Gemma 2 27B	4.37
o1-mini	4.20
Claude 3.5 Haiku	4.16
Mistral Large 2	4.14
GPT-4o mini	4.12
Gemini 1.5 Flash	4.09
Gemini 1.5 Pro (Sept)	4.07
Claude 3 Haiku	3.98
Qwen 2.5 72B	3.89
Llama 3.1 405B	3.83
DeepSeek-V2.5	3.76
GPT-4o	3.73

The table below shows the percentage of repeated words, which helps explain the low scores of GPT-4o and DeepSeek-V2.5, the two models with the most repeats:

Model Name	% Repeats
Llama 3.3 70B	0.00%
o1-preview	0.00%
Gemini 2.0 Flash Thinking Exp	0.18%
Claude 3.5 Sonnet 2024-10-22	0.50%
Gemini 1.5 Flash	0.68%
Gemini 1.5 Pro (Sept)	0.95%
Llama 3.1 405B	1.00%
Gemma 2 27B	1.14%
o1-mini	2.23%
Gemini 2.0 Flash Exp	2.64%
Claude 3 Opus	4.00%
Mistral Large 2	5.05%
Claude 3.5 Haiku	5.45%
Grok 2 12-12	5.64%
GPT-4o mini	8.45%
Qwen 2.5 72B	11.23%
Claude 3 Haiku	16.36%
GPT-4o	23.68%
DeepSeek-V2.5	25.09%
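
A repeat percentage of this kind could be computed along the following lines. This is a sketch under stated assumptions, not the benchmark's own script: the function name `repeat_percentage` is hypothetical, and it assumes a repeat is any case-insensitive reoccurrence of a word already seen in the same model's output. The published figures are at least consistent with counts over 2,200 words per model (for example, 4 / 2,200 ≈ 0.18%).

```python
def repeat_percentage(words):
    """Percentage of words that repeat an earlier word in the list.

    Comparison is case-insensitive and ignores surrounding whitespace
    (an assumption about how repeats are detected).
    """
    seen = set()
    repeats = 0
    for word in words:
        key = word.strip().lower()
        if key in seen:
            repeats += 1
        else:
            seen.add(key)
    return 100.0 * repeats / len(words) if words else 0.0


# Hypothetical output from one model, for illustration only.
sample = ["apple", "anvil", "Apple", "axiom"]
print(f"{repeat_percentage(sample):.2f}%")
```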

Updates and Contact

Follow @lechmazur on X (Twitter) for updates.
Also check out the LLM Confabulation/Hallucination Benchmark, the NYT Connections Benchmark, and the LLM Deceptiveness and Gullibility Benchmark.

