FluxCodeBench — coding benchmark for evaluating LLM agents on multi-phase programming tasks with hidden requirements.
Updated Jan 18, 2026 - Python
🔍 Evaluate LLM agents on multi-phase programming tasks with FluxCodeBench, focusing on hidden requirements, long-context retention, and iterative refinement.