[COLING25] CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

🎉 Our paper has been accepted by the 31st International Conference on Computational Linguistics (COLING 2025).
If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏

Introduction

Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities.
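
To make the judging task concrete, below is a minimal Python sketch of what a single judging query and its scoring might look like. The verdict labels, prompt wording, and helper names (build_judge_prompt, is_correct) are illustrative assumptions modeled on common online-judge verdicts, not CJ-Eval's exact format; see the paper and the evaluation code in this repository for the actual prompts and fine-grained verdict set.

# Hypothetical sketch of a CJ-Eval-style judging query (illustrative only).
# The verdict set and prompt template are assumptions, not the benchmark's
# exact format.

VERDICTS = [
    "Accepted",
    "Wrong Answer",
    "Compile Error",
    "Runtime Error",
    "Time Limit Exceeded",
]

JUDGE_PROMPT = """You are a code judge. Given a programming problem and a
candidate solution, answer with exactly one verdict from: {verdicts}.

Problem:
{problem}

Candidate solution:
{solution}

Verdict:"""


def build_judge_prompt(problem: str, solution: str) -> str:
    # Format one judging query to send to the LLM under evaluation.
    return JUDGE_PROMPT.format(
        verdicts=", ".join(VERDICTS), problem=problem, solution=solution
    )


def is_correct(predicted: str, gold: str) -> bool:
    # A judgment counts as correct only if the predicted verdict matches
    # the gold verdict for that candidate solution.
    return predicted.strip().lower() == gold.strip().lower()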

Experiment Results

Full results for the 12 evaluated LLMs are presented in our paper.

More Details

More details can be found in our paper.

📑 Citation

If you find CodeJudge-Eval useful for your research and applications, please cite using this BibTeX:

@misc{zhao2024codejudgeevallargelanguagemodels,
      title={CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?}, 
      author={Yuwei Zhao and Ziyang Luo and Yuchen Tian and Hongzhan Lin and Weixiang Yan and Annan Li and Jing Ma},
      year={2024},
      eprint={2408.10718},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2408.10718}, 
}
