Source code for our EMNLP 2025 paper: "CodeRAG: Finding Relevant and Necessary Knowledge for Retrieval-Augmented Repository-Level Code Completion" [arXiv].
1. Install uv
2. Sync dependencies and activate the virtual environment:

```
uv sync
source .venv/bin/activate
```

Before running the scripts, download the benchmarks (ReccEval and CCEval) and edit the configuration file:

```
config/config.toml
```

Then execute the Python scripts sequentially:
```
python scripts/build_query.py
```
- Generates query strings from the benchmark dataset.
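What a query looks like depends on the configuration. As a rough, hypothetical illustration (not the repository's actual logic), the sketch below takes the last few non-empty lines before the completion point as the query; the function name and window size are assumptions:

```python
# Hypothetical sketch of query construction; scripts/build_query.py may use a
# different strategy (identifiers, docstrings, model-generated queries, etc.).

def build_query(unfinished_code: str, window: int = 10) -> str:
    """Use the last `window` non-empty lines before the cursor as the query."""
    lines = [ln for ln in unfinished_code.splitlines() if ln.strip()]
    return "\n".join(lines[-window:])

if __name__ == "__main__":
    prefix = "import math\n\ndef area(r):\n    return math.pi * "
    print(build_query(prefix))
```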
```
python scripts/retrieve.py
```
- Retrieves top-k relevant code blocks using the configured retriever.
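As a generic illustration of top-k retrieval (the configured retriever may be sparse or dense and will differ in detail), the sketch below ranks pre-computed code-block embeddings by cosine similarity to the query embedding; all names are assumptions:

```python
# Hypothetical top-k retrieval by cosine similarity; the retriever set in
# config/config.toml may instead be sparse (e.g., BM25) or a different model.
import numpy as np

def top_k(query_vec: np.ndarray, block_vecs: np.ndarray, k: int = 5) -> list[int]:
    """Return indices of the k code blocks most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    b = block_vecs / np.linalg.norm(block_vecs, axis=1, keepdims=True)
    scores = b @ q
    return np.argsort(-scores)[:k].tolist()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(top_k(rng.normal(size=8), rng.normal(size=(100, 8)), k=3))
```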
```
python scripts/rerank.py
```
- Reranks retrieved code blocks based on their estimated importance.
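A reranking step generally re-scores each (query, block) pair with a stronger model and sorts by the new score. The sketch below shows only that shape; the paper's importance estimator is not reproduced here, and `score` is a hypothetical stand-in:

```python
# Hypothetical reranking step: score each (query, block) pair and sort by
# descending score. The toy scorer below is token overlap, for demonstration.
from typing import Callable

def rerank(query: str, blocks: list[str], score: Callable[[str, str], float]) -> list[str]:
    """Sort retrieved blocks by an estimated importance score (highest first)."""
    return sorted(blocks, key=lambda b: score(query, b), reverse=True)

if __name__ == "__main__":
    def overlap(q: str, b: str) -> float:
        return len(set(q.split()) & set(b.split()))

    print(rerank("def area radius", ["def area(r): ...", "class Shape: ..."], overlap))
```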
```
python scripts/build_prompt.py
```
- Constructs prompts from retrieved code blocks for the code completion generator.
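Prompt construction typically prepends the selected blocks, as context, to the unfinished code. The template below is a hypothetical example, not the repository's actual format:

```python
# Hypothetical prompt template: retrieved blocks as commented context followed
# by the unfinished code. The template used by scripts/build_prompt.py may differ.

def build_prompt(blocks: list[str], unfinished_code: str) -> str:
    context = "\n\n".join(f"# Retrieved context:\n{b}" for b in blocks)
    return f"{context}\n\n{unfinished_code}"

if __name__ == "__main__":
    print(build_prompt(["def area(r):\n    return 3.14 * r * r"],
                       "def circumference(r):\n    return "))
```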
```
python scripts/inference.py
```
- Feeds prompts to the generator model.
- You can replace this step with your own inference code (a minimal skeleton is sketched below).
  - Input: JSON file containing an array of prompt strings.
  - Output: JSON file containing an array of generated completions.
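If you swap in your own inference code, it only needs to honor the JSON contract above. A minimal skeleton, with placeholder file paths and a `generate` stub you would replace with your model call:

```python
# Minimal skeleton for replacing scripts/inference.py with custom code.
# File paths and the `generate` stub are placeholders; only the JSON contract
# (array of prompt strings in, array of completion strings out) is fixed.
import json

def generate(prompt: str) -> str:
    """Stand-in for your model call (API request, local LLM, etc.)."""
    raise NotImplementedError("plug in your generator here")

def main(in_path: str = "prompts.json", out_path: str = "completions.json") -> None:
    with open(in_path) as f:
        prompts: list[str] = json.load(f)       # JSON array of strings
    completions = [generate(p) for p in prompts]
    with open(out_path, "w") as f:
        json.dump(completions, f, indent=2)     # JSON array of strings

if __name__ == "__main__":
    main()
```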
```
python scripts/evaluation.py
```
- Evaluates code completion performance using the inference results.
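For reference, repository-level completion is commonly scored with exact match and edit similarity. The sketch below computes both, though the exact metrics reported by scripts/evaluation.py are not confirmed here:

```python
# Hypothetical evaluation sketch over (prediction, reference) pairs.
import difflib

def exact_match(preds: list[str], refs: list[str]) -> float:
    """Fraction of predictions that match the reference exactly (whitespace-trimmed)."""
    return sum(p.strip() == r.strip() for p, r in zip(preds, refs)) / len(refs)

def edit_similarity(preds: list[str], refs: list[str]) -> float:
    """Mean character-level similarity ratio between predictions and references."""
    return sum(difflib.SequenceMatcher(None, p, r).ratio()
               for p, r in zip(preds, refs)) / len(refs)

if __name__ == "__main__":
    print(exact_match(["return x + 1"], ["return x + 1"]))
    print(edit_similarity(["return x + 1"], ["return x + 2"]))
```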
If you find this work helpful, please consider citing our paper:
```
@inproceedings{coderag2025,
  title={CodeRAG: Finding Relevant and Necessary Knowledge for Retrieval-Augmented Repository-Level Code Completion},
  author={Sheng Zhang and Yifan Ding and Shuquan Lian and Shun Song and Hui Li},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2025}
}
```

For questions, please open an issue or contact dingyf@stu.xmu.edu.cn.