diff --git a/README.md b/README.md
index 58d74c8..4a970fa 100644
--- a/README.md
+++ b/README.md
@@ -18,16 +18,14 @@
-## 🤔 Am I in RepoBench?
+## 🔥 News
 
-We are always working on the next generation of RepoBench by crawling the most recent GitHub repositories! 🚀
+- *Feb 5th, 2024*: **RepoBench v1.1** (with newest code data) is now available on the 🤗 HuggingFace Hub. You can access the datasets for Python and Java using the following links:
+  - For Python: [🤗 Repobench Python V1.1](https://huggingface.co/datasets/tianyang/repobench_python_v1.1)
+  - For Java: [🤗 Repobench Java V1.1](https://huggingface.co/datasets/tianyang/repobench_java_v1.1)
+  > **For more details of RepoBench v1.1, please refer to the [data directory](./data/README.md).**
 
-> [!IMPORTANT]
-> We are very open to any collaborations! If you want to test your model on the data with customised cut-off date or date range, please feel free to [drop us an email](mailto:til040@ucsd.edu?subject=[RepoBench]%20Collaborations) or raise an issue. We will try our best to help you out!
-
-If you would like to have your code excluded from RepoBench, you can check if your data is in RepoBench and follow the link to **opt-out**:
-
-[🤗 Am I in RepoBech 🤗](https://huggingface.co/spaces/tianyang/in-the-repobench)
+- *Jan 16th, 2024*: RepoBench is accepted to ICLR 2024! 🎉
 
 ## 🛠️ Installation
 
@@ -135,10 +133,9 @@ If you use RepoBench in your research, please consider citing us:
 @misc{liu2023repobench,
       title={RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems},
       author={Tianyang Liu and Canwen Xu and Julian McAuley},
-      year={2023},
-      eprint={2306.03091},
-      archivePrefix={arXiv},
-      primaryClass={cs.CL}
+      year={2024},
+      url={https://arxiv.org/abs/2306.03091},
+      booktitle={International Conference on Learning Representations}
 }
 ```
 
diff --git a/data/README.md b/data/README.md
new file mode 100644
index 0000000..74d59fc
--- /dev/null
+++ b/data/README.md
@@ -0,0 +1,79 @@
+

+[repobench logo]
+
+RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
+
+ICLR 2024
+
+This directory hosts the datasets for subsequent versions of RepoBench. We are committed to updating RepoBench regularly, with updates scheduled **every 3 months**.
+
+## 🌇 Overview
+
+- Our primary focus is on **next-line prediction** tasks to aid in code auto-completion. If your research requires retrieval data, please don't hesitate to reach out to us for collaboration.
+- Our datasets will be hosted on 🤗 HuggingFace, making them easily accessible for everyone.
+- Each data point within our datasets is categorized based on the prompt length (number of tokens), which is determined by OpenAI's GPT-4 tokenizer using [tiktoken](https://github.com/openai/tiktoken). Here's a detailed table illustrating the levels we've defined:
+
+  | Level | Prompt Length (Number of Tokens) |
+  |-------|------------------------|
+  | 2k    | 640 - 1,600            |
+  | 4k    | 1,600 - 3,600          |
+  | 8k    | 3,600 - 7,200          |
+  | 12k   | 7,200 - 10,800         |
+  | 16k   | 10,800 - 14,400        |
+  | 24k   | 14,400 - 21,600        |
+  | 32k   | 21,600 - 28,800        |
+  | 64k   | 28,800 - 57,600        |
+  | 128k  | 57,600 - 100,000       |
+
+## 📚 Versions
+
+### RepoBench v1.1
+
+RepoBench v1.1 includes data collected from GitHub between **October 6, 2023**, and **November 31, 2023**. To mitigate data leakage and memorization issues, we conducted a deduplication process on the Stack v2 (coming soon) based on the file content.
+
+You can access RepoBench v1.1 at the following links:
+- For Python: [🤗 Repobench Python V1.1](https://huggingface.co/datasets/tianyang/repobench_python_v1.1)
+- For Java: [🤗 Repobench Java V1.1](https://huggingface.co/datasets/tianyang/repobench_java_v1.1)
+
+Or, you can load the data directly from the HuggingFace Hub using the following code:
+
+```python
+from datasets import load_dataset
+
+# Load the Python dataset
+python_dataset = load_dataset("tianyang/repobench_python_v1.1")
+
+# Load the Java dataset
+java_dataset = load_dataset("tianyang/repobench_java_v1.1")
+```
+
+### RepoBench v1.2
+
+*Coming soon...*
+
+## 📝 Citation
+
+If you use RepoBench in your research, please cite the following paper:
+
+```bibtex
+@misc{liu2023repobench,
+      title={RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems},
+      author={Tianyang Liu and Canwen Xu and Julian McAuley},
+      year={2024},
+      url={https://arxiv.org/abs/2306.03091},
+      booktitle={International Conference on Learning Representations}
+}
+```
\ No newline at end of file
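
Reviewer note on the prompt-length levels table in `data/README.md`: the ranges share their endpoints (for example, 1,600 appears in both `2k` and `4k`), so it may help to state which side is inclusive. The sketch below shows one possible reading, assuming half-open ranges `[low, high)`. The names `LEVEL_BOUNDS`, `level_for_token_count`, and `count_tokens` are hypothetical and do not come from the RepoBench code. The use of `tiktoken.encoding_for_model("gpt-4")` follows the README's description of the tokenizer but is not confirmed as the authors' exact call.

```python
from bisect import bisect_right

# Lower bounds and level names copied from the table in data/README.md.
# Assumption: each range is half-open [low, high), so 1,600 tokens maps to "4k".
LEVEL_BOUNDS = [640, 1_600, 3_600, 7_200, 10_800, 14_400, 21_600, 28_800, 57_600]
LEVEL_NAMES = ["2k", "4k", "8k", "12k", "16k", "24k", "32k", "64k", "128k"]
MAX_TOKENS = 100_000  # upper edge of the "128k" level, treated as exclusive here


def level_for_token_count(num_tokens):
    """Return the level name for a prompt length, or None if it is outside the table."""
    if num_tokens < LEVEL_BOUNDS[0] or num_tokens >= MAX_TOKENS:
        return None
    # bisect_right counts how many lower bounds are <= num_tokens.
    return LEVEL_NAMES[bisect_right(LEVEL_BOUNDS, num_tokens) - 1]


def count_tokens(prompt):
    """Count tokens with tiktoken's GPT-4 encoding (imported lazily; needs tiktoken installed)."""
    import tiktoken

    encoding = tiktoken.encoding_for_model("gpt-4")
    # disallowed_special=() makes special-token text encode as plain text instead of raising.
    return len(encoding.encode(prompt, disallowed_special=()))
```

With this reading, `level_for_token_count(count_tokens(prompt))` would give the level of a prompt string. If the authors intended inclusive upper bounds instead, only the boundary values would change level.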