[COLING25] CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

🎉 Our paper has been accepted by the 31st International Conference on Computational Linguistics (COLING 2025).
If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏

Introduction

Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities.
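
To make the judging task concrete, below is a minimal Python sketch of what a single judging query and its scoring might look like. The verdict labels, prompt wording, and helper names (build_judge_prompt, is_correct) are illustrative assumptions modeled on common online-judge verdicts, not CJ-Eval's exact format; see the paper and the evaluation code in this repository for the actual prompts and fine-grained verdict set.

# Hypothetical sketch of a CJ-Eval-style judging query (illustrative only).
# The verdict set and prompt template are assumptions, not the benchmark's
# exact format.

VERDICTS = [
    "Accepted",
    "Wrong Answer",
    "Compile Error",
    "Runtime Error",
    "Time Limit Exceeded",
]

JUDGE_PROMPT = """You are a code judge. Given a programming problem and a
candidate solution, answer with exactly one verdict from: {verdicts}.

Problem:
{problem}

Candidate solution:
{solution}

Verdict:"""


def build_judge_prompt(problem: str, solution: str) -> str:
    # Format one judging query to send to the LLM under evaluation.
    return JUDGE_PROMPT.format(
        verdicts=", ".join(VERDICTS), problem=problem, solution=solution
    )


def is_correct(predicted: str, gold: str) -> bool:
    # A judgment counts as correct only if the predicted verdict matches
    # the gold verdict for that candidate solution.
    return predicted.strip().lower() == gold.strip().lower()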

Experiment Results

Full results for the 12 evaluated LLMs are presented in our paper.

More Details

More details can be found in our paper.

📑 Citation

If you find CodeJudge-Eval useful for your research and applications, please cite using this BibTeX:

@misc{zhao2024codejudgeevallargelanguagemodels,
      title={CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?}, 
      author={Yuwei Zhao and Ziyang Luo and Yuchen Tian and Hongzhan Lin and Weixiang Yan and Annan Li and Jing Ma},
      year={2024},
      eprint={2408.10718},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2408.10718}, 
}
