GPTJavaDataset


GPT Java Source Code Dataset

A dataset of 976 Java source code files for code classification: original files from 11 authors' GitHub pages, plus rewrites of that code produced by ChatGPT-3.5 and Bing GPT-4.
Explore the files »

Table of Contents
  1. About The Project
  2. Getting Started
  3. Contact
  4. Acknowledgments

About The Project

With the release of OpenAI's ChatGPT, GPT-written code is becoming increasingly common in everyday use. However, students often use generated code to cheat on exams and homework. Detecting GPT-written code, framed as a classification or anomaly detection task, could be useful for organizations and schools. I wasn't able to find a publicly available dataset of GPT-written Java source code to train on for research purposes, so I created my own.

Here's the general idea:

  • 666 Java source code files from 11 different authors' GitHub pages were taken from another public dataset.
  • The files of 5 of the 11 authors were passed through either ChatGPT-3.5 or Bing GPT-4 in a rewriting task.
  • The prompt: "The messages I send you will be in Java code. I want you to rewrite all of it while maintaining functionality."
  • Each file was passed in its entirety to ChatGPT (no cutoff) or to BingGPT (4000-character limit) without additional prompting. The resulting code was then pasted into a new file.
  • The resulting files were saved either without additional formatting or after being formatted by VS Code's Format On Save setting.
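
The 4000-character BingGPT limit matters when reproducing or extending the rewriting step. The Python sketch below only flags which source files would fit in a single message. It is an illustration, not the procedure used to build the dataset: the helper names `load_java_sources` and `split_by_limit` are hypothetical, the limit is assumed to apply to the file text alone (not the prompt) and to be inclusive, and how over-limit files were handled is not stated here.

```python
from pathlib import Path

# Prompt quoted from the list above.
REWRITE_PROMPT = (
    "The messages I send you will be in Java code. "
    "I want you to rewrite all of it while maintaining functionality."
)

# Character limit stated above for BingGPT; ChatGPT had no cutoff.
BING_CHAR_LIMIT = 4000


def load_java_sources(root):
    """Read every .java file under root into a {relative path: text} dict."""
    return {
        str(path.relative_to(root)): path.read_text(encoding="utf-8", errors="replace")
        for path in Path(root).rglob("*.java")
    }


def split_by_limit(sources, limit=BING_CHAR_LIMIT):
    """Split file names into those within the limit and those over it.

    Assumption: the limit counts only the file's characters and is inclusive.
    """
    fits, too_long = [], []
    for name, text in sources.items():
        (fits if len(text) <= limit else too_long).append(name)
    return sorted(fits), sorted(too_long)
```

Calling `split_by_limit(load_java_sources("path/to/originals"))` (the path is a placeholder) returns two sorted lists of file names, which could help decide which files are candidates for a character-limited model.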

Of course, this dataset has limitations, since classifying LLM-written code is still a new problem. However, it could be a reasonable starting point for those who want to detect GPT-written code. Feel free to use this dataset for research or training.

(back to top)

Getting Started

Dataset Structure

Here's a breakdown of the files in this dataset:

  • 976 total files
  • 666 files of original authors
  • 108 rewritten files using Bing GPT-4 (61 formatted, 47 non-formatted)
  • 202 rewritten files using ChatGPT-3.5 (59 formatted, 143 non-formatted)
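
For classification work, the class balance implied by these counts is worth checking before training. The short Python sketch below recomputes the totals from the numbers listed above; the `summarize` helper and the dictionary layout are illustrative choices, not part of the dataset.

```python
# File counts copied from the breakdown above.
ORIGINAL = 666
REWRITTEN = {
    "Bing GPT-4": {"formatted": 61, "non-formatted": 47},
    "ChatGPT-3.5": {"formatted": 59, "non-formatted": 143},
}


def summarize(original=ORIGINAL, rewritten=REWRITTEN):
    """Return per-model totals, the overall total, and the class shares."""
    per_model = {model: sum(parts.values()) for model, parts in rewritten.items()}
    rewritten_total = sum(per_model.values())
    total = original + rewritten_total
    return {
        "per_model": per_model,
        "rewritten_total": rewritten_total,
        "total": total,
        "original_share": original / total,
        "rewritten_share": rewritten_total / total,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(key, value)
```

The split is roughly 68% original files to 32% rewritten files, so a classifier trained on the full set may benefit from class weighting or stratified sampling.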

(back to top)

Citation

If you use this dataset, please cite:

@misc{P24_Java,
  author = {Paek, Timothy},
  title = {GPT Java Dataset: A Dataset for LLM-Generated Code Detection},
  year = {2024},
  howpublished = {GitHub Repository},
  url = {https://github.com/tipaek/GPT-Java-Dataset}
}

Contact

Timothy Paek - LinkedIn - tipaek@syr.edu

(back to top)

Acknowledgments

What I used in making this dataset:

(back to top)
