Paper: PromptSet: A Programmer’s Prompting Dataset
from datasets import load_dataset
promptset = load_dataset("pisterlabs/promptset")
# iterate all prompts
for prompt_list in promptset["train"]["prompts"]:
    for prompt in prompt_list:
        pass
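As a minimal sketch of what the loop above can be used for, the snippet below flattens the nested prompt lists and counts duplicates. It assumes each row of the `prompts` column is a list of strings, as the loop implies. The helper name `flatten_prompts` and the `sample_rows` data are hypothetical and exist only so the example runs without downloading the dataset.

```python
from collections import Counter
from typing import Iterable, Iterator


def flatten_prompts(prompt_lists: Iterable[Iterable[str]]) -> Iterator[str]:
    """Yield every prompt from an iterable of per-row prompt lists."""
    for prompt_list in prompt_lists:
        for prompt in prompt_list:
            yield prompt


# Hypothetical stand-in rows (not real PromptSet data).
# With the real dataset, pass promptset["train"]["prompts"] instead.
sample_rows = [
    ["Summarize this text: {text}", "Translate to French: {text}"],
    [],  # a row with no prompts
    ["Summarize this text: {text}"],
]

prompts = list(flatten_prompts(sample_rows))
counts = Counter(prompts)

print(f"total prompts: {len(prompts)}")
print(f"unique prompts: {len(counts)}")
print(f"most common: {counts.most_common(1)}")
```

Keeping the flattening in a generator avoids materializing every prompt at once, which may matter when iterating the full dataset.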
- data: contains all the raw data collected from GitHub.
- devGPT: contains all the processed data collected from DevGPT's Zenodo repository. Check the directory for more details.
- gen_prompts: contains code to process and collect prompt data.
- analytics: contains code to analyze the data collected.
- Download and unzip the repository snapshot as of January 10, 2024 (repos.zip).
- Clone tree-sitter-python:
git clone https://github.com/tree-sitter/tree-sitter-python
- Run
python -m gen_prompts.find_prompts --run_id 0 --repo_dir {path_to_unzipped_repos} --threads 8
This parses all the repository contents to find likely prompt areas.
- Run
python -m gen_prompts.reader --run_id 0
This formats and cleans the parsed values.
- Run
python -m gen_prompts.upload_ds --run_id 0
This creates a PR against the pisterlabs/promptset HF repo.
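The three `gen_prompts` steps above could be chained from a small driver script. The following is a sketch only: it assumes it is run from the repository root so that `python -m gen_prompts...` resolves, and it uses only the module names and flags shown above. The function names `build_pipeline_commands` and `run_pipeline` are hypothetical and not part of the repository.

```python
import subprocess
import sys
from typing import List


def build_pipeline_commands(run_id: int, repo_dir: str, threads: int = 8) -> List[List[str]]:
    """Return the three pipeline commands, in order, as argv lists."""
    run = str(run_id)
    return [
        # Step 1: parse the repository contents to find likely prompt areas.
        [sys.executable, "-m", "gen_prompts.find_prompts",
         "--run_id", run, "--repo_dir", repo_dir, "--threads", str(threads)],
        # Step 2: format and clean the parsed values.
        [sys.executable, "-m", "gen_prompts.reader", "--run_id", run],
        # Step 3: open a PR against the pisterlabs/promptset HF repo.
        [sys.executable, "-m", "gen_prompts.upload_ds", "--run_id", run],
    ]


def run_pipeline(run_id: int, repo_dir: str, threads: int = 8) -> None:
    """Run each step in sequence, stopping at the first failure."""
    for cmd in build_pipeline_commands(run_id, repo_dir, threads):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    # "path/to/unzipped_repos" is a placeholder for {path_to_unzipped_repos}.
    for cmd in build_pipeline_commands(0, "path/to/unzipped_repos"):
        print(" ".join(cmd))
```

Using the same `run_id` for all three steps mirrors the commands above. `check=True` makes a failed step raise `subprocess.CalledProcessError`, so later steps do not run on incomplete output.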