Human-Written vs. AI-Generated Code: A Large-Scale Study of Defects, Vulnerabilities, and Complexity
This repository enables the replication of our study *Human-Written vs. AI-Generated Code: A Large-Scale Study of Defects, Vulnerabilities, and Complexity*, accepted for publication at the 36th IEEE International Symposium on Software Reliability Engineering (ISSRE 2025).
It contains the datasets, code, and analyses used in our experiments on code defects, security vulnerabilities, and code complexity, as well as the ODC (Orthogonal Defect Classification) mappings and the final experimental results.
Raw and processed Python and Java datasets used in our study are available on Zenodo at the following link: datasets. The Python dataset (`python_dataset.jsonl`) contains 285,249 samples, while the Java dataset (`java_dataset.jsonl`) contains 221,795 samples. Each sample has the following structure: `<index, Human-code, ChatGPT-code, DeepSeek-Coder-code, Qwen-Coder-code>`.
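As a quick sanity check, the snippet below is a minimal sketch of how a sample can be loaded from one of the JSONL files. The file path and the exact key names are assumptions (the code fields follow the naming used in the replication notes at the end of this README).

```python
import json

# Minimal sketch: inspect the first sample of the Python dataset.
# Assumes python_dataset.jsonl (downloaded from Zenodo) is in the working
# directory and that each line holds one JSON object per sample.
with open("python_dataset.jsonl", "r", encoding="utf-8") as f:
    first_sample = json.loads(next(f))

print(sorted(first_sample.keys()))                # index plus the four code fields
print(first_sample.get("human_code", "")[:200])   # peek at the human-written snippet
```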
Contains the mapping of each rule to its ODC defect type for both Pylint (Python) and PMD (Java), used to support our classification of defects.
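For illustration only, a rule-to-ODC lookup could be applied as in the sketch below; the file name, format, and column names are hypothetical and may differ from the actual mapping files in this folder.

```python
import csv

# Hypothetical sketch: load a rule -> ODC defect type mapping from a CSV file
# with columns "rule" and "odc_type" (actual file names/columns may differ).
def load_odc_mapping(path: str) -> dict:
    with open(path, newline="", encoding="utf-8") as f:
        return {row["rule"]: row["odc_type"] for row in csv.DictReader(f)}

# Example: classify a Pylint message ID reported for a sample.
# mapping = load_odc_mapping("pylint_odc_mapping.csv")   # hypothetical file name
# print(mapping.get("W0612", "Unmapped"))
```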
This folder contains all the code to perform the defects analysis on both Python and Java; a minimal usage sketch follows the list below.

- Python
  - use `pylint_ODC.py` to run Pylint on the Python dataset. Modify the JSONL field to analyze (i.e., `[modelname]_code`).
  - use `process_pylint_results.py` to process the results of the previous analysis. This will output a complete report of defective samples, syntax errors, and the distribution of ODC defect types.
- Java
  - use `wrap_java_functions.py` to wrap all Java samples in minimal dummy classes prior to analysis. This ensures compatibility with PMD, which requires valid Java class structures to work.
  - use `run_PMD_analysis.sh` to run PMD on the Java dataset. Modify the JSONL field to analyze (i.e., `[modelname]_code`).
  - use `process_PMD_results.py` to process the results of the previous analysis. This will output a complete report of defective samples, syntax errors, and the distribution of ODC defect types.
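For orientation, the sketch below reproduces the core of this step by hand: write a sample's code field to a temporary file, run Pylint on it with JSON output, and collect the reported message IDs (which are then mapped to ODC defect types). The analyzed field name is an assumption; `pylint_ODC.py` and `process_pylint_results.py` remain the reference implementation.

```python
import json
import os
import subprocess
import tempfile

def pylint_message_ids(code: str) -> list:
    """Run Pylint on a single code snippet and return the reported message IDs."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
        tmp.write(code)
        path = tmp.name
    try:
        # --output-format=json makes the report machine-readable; Pylint exits
        # non-zero whenever it emits messages, so the return code is not checked.
        result = subprocess.run(
            ["pylint", "--output-format=json", path],
            capture_output=True, text=True,
        )
        return [msg["message-id"] for msg in json.loads(result.stdout or "[]")]
    finally:
        os.unlink(path)

# Example (assumed field name): defects reported for the human-written code of a sample.
# ids = pylint_message_ids(sample["human_code"])
```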
This folder contains all the code to perform the security vulnerability analysis on both Python and Java using Semgrep; a minimal usage sketch follows the list below.

- Python
  - use `run_semgrep_python.py` to run Semgrep on the Python dataset. Modify the JSONL field to analyze (i.e., `[modelname]_code`).
  - use `process_semgrep_results_python.py` to process the results of the previous analysis. This will output a complete report of vulnerable samples, errors, and the distribution of CWEs.
- Java
  - use `run_semgrep_java.py` to run Semgrep on the Java dataset. Modify the JSONL field to analyze (i.e., `[modelname]_code`).
  - use `process_semgrep_results_java.py` to process the results of the previous analysis. This will output a complete report of vulnerable samples, errors, and the distribution of CWEs.
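Similarly, the sketch below illustrates the underlying Semgrep step: scan a file with a registry ruleset and pull the CWE tags out of each finding's metadata. The chosen ruleset and the exact metadata layout are assumptions; `run_semgrep_python.py` and `run_semgrep_java.py` are the reference implementations.

```python
import json
import subprocess

def semgrep_cwes(path: str, ruleset: str = "p/python") -> list:
    """Scan a file with Semgrep and return the CWE tags attached to its findings."""
    # --json emits machine-readable findings; the ruleset here is an assumption
    # and may differ from the configuration used by the repository scripts.
    result = subprocess.run(
        ["semgrep", "scan", "--config", ruleset, "--json", path],
        capture_output=True, text=True,
    )
    cwes = []
    for finding in json.loads(result.stdout).get("results", []):
        cwe = finding.get("extra", {}).get("metadata", {}).get("cwe", [])
        cwes.extend([cwe] if isinstance(cwe, str) else cwe)
    return cwes

# Example: cwes = semgrep_cwes("sample_42.py")
```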
Contains scripts, metrics, and results related to the code complexity analysis for Python (`complexity_stats_python.py`) and Java (`complexity_stats_java.py`). This includes measures such as NLOC, cyclomatic complexity, and token counts, computed using Lizard and Tiktoken.
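As a rough illustration of the metrics involved, the sketch below computes NLOC, cyclomatic complexity, and a token count for a single snippet; the encoding name is an assumption, and the repository scripts aggregate these metrics over the full datasets.

```python
import lizard
import tiktoken

def complexity_metrics(code: str, filename: str = "sample.py") -> dict:
    """Compute per-snippet NLOC, cyclomatic complexity, and token count."""
    analysis = lizard.analyze_file.analyze_source_code(filename, code)
    # Encoding choice is an assumption; any tiktoken encoding yields a token count.
    encoder = tiktoken.get_encoding("cl100k_base")
    return {
        "nloc": analysis.nloc,  # non-comment lines of code in the snippet
        "cyclomatic_complexity": sum(f.cyclomatic_complexity for f in analysis.function_list),
        "tokens": len(encoder.encode(code)),
    }

# Example: complexity_metrics(sample["human_code"])
```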
Contains the complete reports and results obtained from our experimental evaluation, including reports on defects, security vulnerabilities, and complexity metrics for both Python and Java.
The file `run_instructions.txt` contains a detailed list of commands to replicate the full experimental evaluation on any Unix-based OS (e.g., Linux, macOS). It sets up a conda environment with all required packages, runs all analyses on the Python and Java datasets, and removes the environment if necessary.
To replicate the full evaluation, please make sure to download the complete datasets from the provided Zenodo link, to modify the required paths and fields so that each script executes correctly, and to run the analyses for each code field: `human_code`, `chatgpt_code`, `dsc_code`, and `qwen_code`.