🔮 Update README.md with RepoBench v1.1 information
Leolty committed Feb 6, 2024
1 parent 476c063 commit bf99f30
Showing 2 changed files with 88 additions and 12 deletions.
21 changes: 9 additions & 12 deletions README.md
@@ -18,16 +18,14 @@

<hr>

## 🤔 Am I in RepoBench?
## 🔥 News

We are always working on the next generation of RepoBench by crawling the most recent GitHub repositories! 🚀
- *Feb 5th, 2024*: **RepoBench v1.1** (with the newest code data) is now available on the 🤗 HuggingFace Hub. You can access the datasets for Python and Java using the following links:
  - For Python: [🤗 Repobench Python V1.1](https://huggingface.co/datasets/tianyang/repobench_python_v1.1)
  - For Java: [🤗 Repobench Java V1.1](https://huggingface.co/datasets/tianyang/repobench_java_v1.1)
  > **For more details on RepoBench v1.1, please refer to the [data directory](./data/README.md).**
> [!IMPORTANT]
> We are very open to collaborations! If you want to test your model on data with a customised cut-off date or date range, please feel free to [drop us an email](mailto:til040@ucsd.edu?subject=[RepoBench]%20Collaborations) or raise an issue. We will try our best to help you out!
If you would like to have your code excluded from RepoBench, you can check if your data is in RepoBench and follow the link to **opt-out**:

[🤗 Am I in RepoBench 🤗](https://huggingface.co/spaces/tianyang/in-the-repobench)
- *Jan 16th, 2024*: RepoBench is accepted to ICLR 2024! 🎉


## 🛠️ Installation
@@ -135,10 +133,9 @@ If you use RepoBench in your research, please consider citing us:
```bibtex
@misc{liu2023repobench,
title={RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems},
author={Tianyang Liu and Canwen Xu and Julian McAuley},
year={2023},
eprint={2306.03091},
archivePrefix={arXiv},
primaryClass={cs.CL}
year={2024},
url={https://arxiv.org/abs/2306.03091},
booktitle={International Conference on Learning Representations}
}
```

79 changes: 79 additions & 0 deletions data/README.md
@@ -0,0 +1,79 @@
<p align="center">
<a href="https://github.com/Leolty/repobench#gh-light-mode-only">
<img src="../assets/repobench_dark.png" width="318px" alt="repobench logo" />
</a>
<a href="https://github.com/Leolty/repobench#gh-dark-mode-only">
<img src="../assets/repobench_light.png" width="318px" alt="repobench logo" />
</a>
</p>

<p align="center">
<a href="https://arxiv.org/abs/2306.03091">
RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
</a>
<br></br>
<a>
<b>ICLR 2024</b>
</a>
</p>

<hr>

This directory hosts the datasets for subsequent versions of RepoBench. We are committed to updating RepoBench regularly, with updates scheduled **every 3 months**.

## 🌇 Overview

- Our primary focus is on **next-line prediction** tasks to aid in code auto-completion. If your research requires retrieval data, please don't hesitate to reach out to us for collaboration.
- Our datasets will be hosted on 🤗 HuggingFace, making them easily accessible to everyone.
- Each data point is categorized by prompt length (number of tokens), measured with OpenAI's GPT-4 tokenizer via [tiktoken](https://github.com/openai/tiktoken). The levels are defined as follows:

| Level | Prompt Length (Number of Tokens) |
|-------|------------------------|
| 2k | 640 - 1,600 |
| 4k | 1,600 - 3,600 |
| 8k | 3,600 - 7,200 |
| 12k | 7,200 - 10,800 |
| 16k | 10,800 - 14,400 |
| 24k | 14,400 - 21,600 |
| 32k | 21,600 - 28,800 |
| 64k | 28,800 - 57,600 |
| 128k | 57,600 - 100,000 |
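
The level table can be expressed as a small lookup. The following is a minimal sketch, not part of the RepoBench tooling: the names `level_for_token_count` and `count_gpt4_tokens` are hypothetical, and because adjacent ranges in the table share an endpoint, the sketch assumes each range includes its lower bound and excludes its upper bound.

```python
from bisect import bisect_right
from typing import Optional

# Lower bound (in tokens) of each level, copied from the table above.
_LOWER_BOUNDS = [640, 1_600, 3_600, 7_200, 10_800, 14_400, 21_600, 28_800, 57_600]
_LEVELS = ["2k", "4k", "8k", "12k", "16k", "24k", "32k", "64k", "128k"]
_UPPER_LIMIT = 100_000  # upper bound of the 128k level


def level_for_token_count(num_tokens: int) -> Optional[str]:
    """Return the level name for a prompt length, or None if it is out of range.

    Assumption: every range is [lower, upper), since the table's endpoints overlap.
    """
    if num_tokens < _LOWER_BOUNDS[0] or num_tokens >= _UPPER_LIMIT:
        return None
    # bisect_right counts how many lower bounds are <= num_tokens.
    return _LEVELS[bisect_right(_LOWER_BOUNDS, num_tokens) - 1]


def count_gpt4_tokens(text: str) -> int:
    """Count tokens with the GPT-4 encoding (requires `pip install tiktoken`)."""
    import tiktoken  # imported lazily so the lookup above works without it

    encoding = tiktoken.encoding_for_model("gpt-4")
    return len(encoding.encode(text))
```

With tiktoken installed, `level_for_token_count(count_gpt4_tokens(prompt))` would give the level of a prompt string under these assumptions.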

## 📚 Versions

### RepoBench v1.1

RepoBench v1.1 includes data collected from GitHub between **October 6, 2023**, and **November 31, 2023**. To mitigate data leakage and memorization issues, we deduplicated the data against the Stack v2 (coming soon) based on file content.
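
As an illustration of content-based deduplication in general, here is a minimal sketch that drops any candidate file whose exact content already appears in a reference corpus. It is not the procedure used to build RepoBench v1.1, which is not described here; the function names are hypothetical, and it matches only byte-identical text, with no normalization or near-duplicate detection.

```python
import hashlib
from typing import Dict, Iterable


def content_hash(source: str) -> str:
    """Return the SHA-256 hex digest of a file's text."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def drop_known_files(
    candidates: Dict[str, str], reference_sources: Iterable[str]
) -> Dict[str, str]:
    """Keep only candidate files (path -> text) absent from the reference corpus."""
    # Hash the reference corpus once so each candidate is a set lookup.
    seen = {content_hash(src) for src in reference_sources}
    return {
        path: src for path, src in candidates.items() if content_hash(src) not in seen
    }
```

Hashing keeps memory use proportional to the number of reference files rather than their total size, which matters when the reference corpus is large.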

You can access RepoBench v1.1 at the following links:
- For Python: [🤗 Repobench Python V1.1](https://huggingface.co/datasets/tianyang/repobench_python_v1.1)
- For Java: [🤗 Repobench Java V1.1](https://huggingface.co/datasets/tianyang/repobench_java_v1.1)

Or, you can load the data directly from the HuggingFace Hub using the following code:

```python
from datasets import load_dataset

# Load the Python dataset
python_dataset = load_dataset("tianyang/repobench_python_v1.1")

# Load the Java dataset
java_dataset = load_dataset("tianyang/repobench_java_v1.1")
```
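
Since the split and column names are not listed here, it can help to inspect them after loading. The helper below is a small sketch (the name `summarize_splits` is hypothetical) that relies only on the `num_rows` and `column_names` attributes of each split in the object returned by `load_dataset`.

```python
from typing import Any, Dict, Mapping


def summarize_splits(dataset_dict: Mapping[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Collect the row count and column names of every split."""
    return {
        name: {"num_rows": split.num_rows, "columns": list(split.column_names)}
        for name, split in dataset_dict.items()
    }
```

For example, `summarize_splits(python_dataset)` returns one entry per split of the Python dataset loaded above.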

### RepoBench v1.2

*Coming soon...*

## 📝 Citation

If you use RepoBench in your research, please cite the following paper:

```bibtex
@misc{liu2023repobench,
title={RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems},
author={Tianyang Liu and Canwen Xu and Julian McAuley},
year={2024},
url={https://arxiv.org/abs/2306.03091},
booktitle={International Conference on Learning Representations}
}
```
