Benchmark LLMs on real professional tasks, not academic puzzles. YAML-driven experiment pipeline + live React dashboard for GDPVal Gold Subset (220 tasks across 11 industries).
Updated Apr 10, 2026 - Python
Benchmark large language models on real expert tasks using a YAML-driven pipeline and live dashboard for the GDPVal Gold Subset.