A dataset of 976 Java source code files for code classification: 666 original files from 11 authors' GitHub pages, plus 310 versions rewritten by ChatGPT-3.5 and Bing GPT-4.
With the release of OpenAI's ChatGPT, GPT-written code is becoming increasingly common in everyday usage. However, students often use generated code to cheat on exams and homework. Detecting GPT-written code could therefore be useful to organizations and schools, framed as either a classification or an anomaly detection task. I wasn't able to find a publicly available dataset of GPT-written Java source code suitable for training and research, so I created my own.
Here's the general idea:
- 666 Java source code files from 11 different authors' GitHub pages were acquired via another public dataset.
- The files of 5 of the 11 authors were passed through either ChatGPT-3.5 or Bing GPT-4 in a rewriting task.
- The prompt: "The messages I send you will be in Java code. I want you to rewrite all of it while maintaining functionality."
- Each file was passed in its entirety through ChatGPT (no cutoff) or BingGPT (4000-character limit) without additional prompting. The resulting code was then pasted into a new file.
- The resulting files were saved either without additional formatting or after being formatted by VSCode's format-on-save setting.
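The rewriting step above can be sketched roughly as follows. This is only an illustration, not the procedure actually used: the helper names `build_messages` and `fits_bing_limit` are hypothetical, and it assumes the instruction is sent first and the file follows as its own message. Only the prompt text and the 4000-character figure come from the list above; how files longer than the limit were handled is not stated here, so the sketch only checks the length.

```python
# Prompt text quoted from the list above.
REWRITE_PROMPT = (
    "The messages I send you will be in Java code. "
    "I want you to rewrite all of it while maintaining functionality."
)

# BingGPT character limit stated above; ChatGPT had no cutoff.
BING_CHAR_LIMIT = 4000


def build_messages(java_source: str) -> list[str]:
    """Hypothetical helper: the instruction, then the whole file, unmodified."""
    return [REWRITE_PROMPT, java_source]


def fits_bing_limit(java_source: str, limit: int = BING_CHAR_LIMIT) -> bool:
    """Hypothetical helper: True if the file fits in a single BingGPT message."""
    return len(java_source) <= limit
```

A check like `fits_bing_limit` could be used to decide which model a given file can be sent to in one piece.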
Of course, this dataset has limitations, since classifying LLM-written code is still a new problem. However, it could be a reasonable starting point for those who want to detect GPT-generated code. Feel free to use this dataset for research or training.
Here's a breakdown of the files in this dataset:
- 976 total files
- 666 files by the original authors
- 108 rewritten files using Bing GPT-4 (61 formatted, 47 non-formatted)
- 202 rewritten files using ChatGPT-3.5 (59 formatted, 143 non-formatted)
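As a quick sanity check on the breakdown above, the counts can be added up in a few lines of Python. The variable names are illustrative only; the numbers are the ones listed above.

```python
# Counts copied from the breakdown above.
ORIGINAL = 666
REWRITTEN = {
    "Bing GPT-4": {"formatted": 61, "non_formatted": 47},
    "ChatGPT-3.5": {"formatted": 59, "non_formatted": 143},
}

# Files per model, total rewritten files, and the overall total.
per_model = {model: sum(parts.values()) for model, parts in REWRITTEN.items()}
rewritten = sum(per_model.values())
total = ORIGINAL + rewritten

# Fraction of the dataset that is GPT-rewritten.
gpt_share = rewritten / total

print(per_model)            # {'Bing GPT-4': 108, 'ChatGPT-3.5': 202}
print(rewritten, total)     # 310 976
print(round(gpt_share, 3))  # 0.318
```

Roughly 32% of the files are GPT-rewritten, so the two classes are imbalanced. That is worth keeping in mind when splitting the data or choosing an evaluation metric.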
If you use this dataset, please cite:
@misc{P24_Java,
author = {Paek, Timothy},
title = {GPT Java Dataset: A Dataset for LLM-Generated Code Detection},
year = {2024},
howpublished = {GitHub Repository},
url = {https://github.com/tipaek/GPT-Java-Dataset}
}
Timothy Paek - LinkedIn - tipaek@syr.edu
What I used in making this dataset: