A comprehensive review of code-domain benchmarks for LLM research.
-
🔥🔥 [2025-07-13] Featured Benchmarks:
🔥 CORE: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks from Purdue University
🔥 ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation from Tencent Hunyuan Team
🔥 CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark from Shanghai Jiao Tong University
🔥 Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs from Provable Responsible AI and Data Analytics (PRADA) Lab
🔥 Model Editing for LLMs4Code: How Far are We? from National University of Defense Technology
🔥 VeriBench: Benchmarking Large Language Models for Verilog Code Generation and Design Synthesis from Indian Institute of Technology Gandhinagar
🔥 ResBench: Benchmarking LLM-Generated FPGA Designs with Resource Awareness from Imperial College London, United Kingdom
🔥 Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation from Chinese Academy of Sciences
-
🔥🔥 [2025-07-05] Featured Benchmarks:
🔥 JsDeObsBench: Measuring and Benchmarking LLMs for JavaScript Deobfuscation from The Ohio State University
🔥 From Reproduction to Replication: Evaluating Research Agents with Progressive Code Masking from CMU
🔥 Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs from Skywork AI
-
🔥🔥 [2025-06-27] Featured Benchmarks:
🔥 FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation from ByteDance
🔥 CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval from Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
🔥 CodeMirage: A Multi-Lingual Benchmark for Detecting AI-Generated and Paraphrased Source Code from Production-Level LLMs from Purdue University
🔥 SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks from University of Illinois Urbana-Champaign
🔥 RAS-Eval: A Comprehensive Benchmark for Security Evaluation of LLM Agents in Real-World Environments from Zhejiang University
🔥 MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios from Central South University
🔥 OJBench: A Competition Level Code Benchmark For Large Language Models from Beijing University of Posts and Telecommunications
🔥 TeXpert: A Multi-Level Benchmark for Evaluating LaTeX Code Generation by LLMs from Knowledgeverse AI
-
🔥🔥 [2025-06-14] Featured Benchmarks:
🔥 DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination from Columbia University
🔥 PPM: Automated Generation of Diverse Programming Problems for Benchmarking Code Generation Models from University of Texas at Dallas
🔥 ELABORATION: A Comprehensive Benchmark on Human-LLM Competitive Programming from Sichuan University
🔥 OSS-Bench: Benchmark Generator for Coding LLMs from National University of Singapore
🔥 VERINA: Benchmarking Verifiable Code Generation from University of California, Berkeley
🔥 ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code from Stanford University
🔥 EFFIBENCH-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code from HKU
🔥 Synthesizing Performance Constraints for Evaluating and Improving Code Efficiency from The University of Chicago
🔥 Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents from MIT
🔥 LongCodeBench: Evaluating Coding LLMs at 1M Context Windows from Panasonic AI Research
🔥 Success is in the Details: Evaluate and Enhance Details Sensitivity of Code from Harbin Institute of Technology
🔥 CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning from Iowa State University
🔥 Flow2Code: Evaluating Large Language Models for Flowchart-based Code Generation Capability from East China Normal University
🔥 Draw with Thought: Unleashing Multimodal Reasoning for Scientific Diagram Generation from Nanjing University of Information Science & Technology
🔥 CODEMENV: Benchmarking Large Language Models on Code Migration from Provable Responsible AI and Data Analytics (PRADA) Lab
🔥 DecompileBench: A Comprehensive Benchmark for Evaluating Decompilers in Real-World Scenarios from Tsinghua University
- [2025-04-18] We added GitHub stars for each benchmark.
- [2025-04-13] We added Code Security & Robustness benchmarks.
- [2025-04-06] We added Code Hallucination benchmarks.
- [2025-03-29] We have crawled all articles related to code benchmarks from the past five years.
- [2025-03-17] We added Code Version (version-specific code generation) benchmarks.
- [2025-03-16] A thorough review of code-domain benchmarks for LLM research has been released.
- Code Completion & Code Generation
- Code Efficiency
- CodeFix & Bug-Fix
- Code Reasoning & Understanding
- Code Hallucination
- Data Science
- Text2SQL
- MultiModal Code Tasks
- Code Security & Robustness
- Code Translation
- Code Version
- Multi & Other Dimension
- Industry Code Generation
-
Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents from Xi'an Jiaotong University
-
Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks from Zhejiang University
| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|---|---|---|---|---|
| HALLUCODE | Exploring and Evaluating Hallucinations in LLM-Powered Code Generation | Arxiv 2024/04 | | |
| Collu-Bench | Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code | Arxiv 2024/10 | | 🤗Dataset |
| CodeHalu | CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification | AAAI 2025 | Github | 🤗Dataset |
| APIHulBench | Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware | FSE 2025 | Github | |
| THINK | THINK: Tackling API Hallucinations in LLMs via Injecting Knowledge | SANER 2025 | Github | 🤗Dataset |
| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|---|---|---|---|---|
| DS-1000 | DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation | ICML 2023 | Github | 🤗Dataset 🏠HomePage |
| ARCADE | Natural Language to Code Generation in Interactive Data Science Notebooks | ACL 2023 | Github | Dataset |
| DA-Code | DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models | EMNLP 2024 | Github | 🤗Dataset 🌐Website |
| MatPlotBench | MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization | ACL 2024 Findings | Github | 🤗Dataset |
| DataSciBench | DataSciBench: An LLM Agent Benchmark for Data Science | Arxiv 2025/02 | Github | |
| DSBench | DSBench: How Far Are Data Science Agents from Becoming Data Science Experts? | ICLR 2025 | Github | 🤗Dataset |
| DS-Bench | DS-Bench: A Realistic Benchmark for Data Science Code Generation | Arxiv 2025/05 | Github | |