GPTJavaDataset

GPT Java Source Code Dataset

A dataset of 976 Java source code files for code classification: original files from 11 authors' GitHub pages, plus rewrites of some of those files produced by ChatGPT-3.5 and Bing GPT-4.

Table of Contents

About The Project
Getting Started
- Dataset Structure
- Citation
Contact
Acknowledgments

About The Project

With the release of OpenAI's ChatGPT, GPT-written code is becoming increasingly common in everyday use. However, students often use generated code to cheat on exams and homework. Detecting GPT-written code, framed as a classification or anomaly detection task, could therefore be useful for schools and other organizations. I wasn't able to find a publicly available dataset of GPT-written Java source code to train on for research purposes, so I created my own.

Here's the general idea:

- 666 Java source code files from 11 different authors' GitHub pages were acquired via another public dataset.
- Files from 5 of the 11 authors were passed through either ChatGPT-3.5 or Bing GPT-4 in a rewriting task.
- The prompt: "The messages I send you will be in Java code. I want you to rewrite all of it while maintaining functionality."
- Each file was passed in its entirety to ChatGPT (no cutoff) or BingGPT (4000-character limit) without additional prompting. The resulting code was then pasted into a new file.
- The resulting files were either saved without additional formatting or formatted by VS Code's Format On Save setting.
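As a rough illustration of the rewriting setup above, here is a minimal Python sketch. It only checks which chat interface could take a whole file in one message. The README does not say how files longer than the BingGPT limit were handled, so treating them as ChatGPT-only is an assumption, and the names `accepting_interfaces` and `BING_CHAR_LIMIT` are hypothetical, not part of the dataset.

```python
# The rewriting prompt quoted in the README.
REWRITE_PROMPT = (
    "The messages I send you will be in Java code. "
    "I want you to rewrite all of it while maintaining functionality."
)

# Character limit the README reports for BingGPT.
BING_CHAR_LIMIT = 4000


def accepting_interfaces(java_source):
    """Return the chat interfaces that could take the whole file at once.

    Assumption (not stated in the README): a file longer than the
    BingGPT limit is only eligible for ChatGPT, which had no cutoff.
    """
    interfaces = ["ChatGPT-3.5"]
    if len(java_source) <= BING_CHAR_LIMIT:
        interfaces.append("Bing GPT-4")
    return interfaces


if __name__ == "__main__":
    sample = "public class Hello { }"
    print(accepting_interfaces(sample))
```

The limit is applied to characters rather than tokens because that is the unit the README gives.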

Of course, this dataset has limitations, since detecting LLM-written code is a new classification problem. Still, it could be a reasonable starting point for anyone who wants to detect GPT-generated code. Feel free to use this dataset for research or training.


Getting Started

Dataset Structure

Here's a breakdown of the files in this dataset:

- 976 total files
- 666 original files from the 11 authors
- 108 files rewritten by Bing GPT-4 (61 formatted, 47 non-formatted)
- 202 files rewritten by ChatGPT-3.5 (59 formatted, 143 non-formatted)
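For anyone planning a classifier, the class balance matters. The short Python sketch below simply re-adds the counts listed above and reports the share of GPT-rewritten files. The dictionary keys are labels chosen for illustration and do not correspond to folder names in the repository.

```python
# File counts as reported in the breakdown above.
# Keys are illustrative labels, not actual folder names.
COUNTS = {
    "original": {"total": 666},
    "bing_gpt4": {"formatted": 61, "non_formatted": 47},
    "chatgpt35": {"formatted": 59, "non_formatted": 143},
}


def class_totals(counts):
    """Sum the subtotals of each source into a single file count."""
    return {source: sum(parts.values()) for source, parts in counts.items()}


totals = class_totals(COUNTS)
total_files = sum(totals.values())
gpt_share = (totals["bing_gpt4"] + totals["chatgpt35"]) / total_files

print(totals)                 # per-source file counts
print(total_files)            # 976
print(f"{gpt_share:.1%}")     # 31.8%
```

Roughly one file in three is GPT-rewritten, so a binary original-vs-GPT classifier trained on this data would face a moderate class imbalance.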


Citation

If you use this dataset, please cite:

@misc{P24_Java,
  author = {Paek, Timothy},
  title = {GPT Java Dataset: A Dataset for LLM-Generated Code Detection},
  year = {2024},
  howpublished = {GitHub Repository},
  url = {https://github.com/tipaek/GPT-Java-Dataset}
}

Contact

Timothy Paek - LinkedIn - tipaek@syr.edu


Acknowledgments

What I used in making this dataset:
