mono: Is Your "Clean" Vulnerability Dataset Really Solvable? Exposing and Trapping Undecidable Patches and Beyond
This document describes the artifacts accompanying our paper: "mono: Is Your "Clean" Vulnerability Dataset Really Solvable? Exposing and Trapping Undecidable Patches and Beyond". The name mono stands for Multi-agent Operated Noise Outfilter. The artifacts are organized in the following directories:
This directory contains the source code of our project.
This subfolder contains the final dataset, MonoLens, generated and analyzed by our framework.
The subfolders within MonoLens are organized as follows:
This directory provides a sample of 8 data entries in a CSV file, together with overall statistics for these samples. Each entry includes the original CVE metadata, the root cause analysis performed by our agent, and other relevant information. Each entry also references a corresponding folder inside the other_context folder, which holds the complete analysis results and the step-by-step process undertaken by the agent.
This directory contains the subset of CVEs for which our agent's final confidence score in its analysis was greater than 0.9. The other_context subfolder is omitted due to the large size of the data.
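As a minimal sketch of how such a high-confidence subset could be reproduced from a CSV of analysis results, the snippet below keeps only rows whose score is strictly greater than 0.9. The column names `cve_id` and `confidence`, the helper `filter_high_confidence`, and the in-memory sample rows are assumptions made for illustration; the actual header of the released CSV files may differ.

```python
import csv
import io

# Hypothetical column name: the real CSV header may differ.
CONFIDENCE_COLUMN = "confidence"
THRESHOLD = 0.9


def filter_high_confidence(rows, column=CONFIDENCE_COLUMN, threshold=THRESHOLD):
    """Keep rows whose confidence score is strictly greater than the threshold."""
    kept = []
    for row in rows:
        try:
            score = float(row[column])
        except (KeyError, ValueError, TypeError):
            # Skip rows with a missing or non-numeric score.
            continue
        if score > threshold:
            kept.append(row)
    return kept


# Tiny in-memory stand-in for the CSV file; the IDs are placeholders, not real entries.
SAMPLE_CSV = """cve_id,confidence
CVE-AAAA-0001,0.95
CVE-AAAA-0002,0.90
CVE-AAAA-0003,0.42
CVE-AAAA-0004,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
high_confidence = filter_high_confidence(rows)
print([row["cve_id"] for row in high_confidence])
```

To run this against a real file, replace the `io.StringIO(SAMPLE_CSV)` stand-in with an opened file handle and adjust the column name to match the CSV header.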
This directory includes the results for all CVEs that our agent was able to process and analyze. The other_context subfolder is omitted due to the large size of the data.
This directory showcases the complete analysis process of our mono framework for four specific cases, each with a ReadMe.md. It details the entire pipeline:
- Stage1. Patch Pre-filtering and Classification: Filtering of security-related patches.
- Stage2. Data Acquisition and Preprocessing: Preprocessing using Joern to generate Code Property Graphs (CPGs). The binary files (cpg.bin) and the whole repository are excluded due to their large size.
- Stage3. Iterative Contextual Analysis, including:
  - The agent's analysis of the CVEs.
  - The contextual information gathered to understand the root cause of the CVE.
  - The context as understood and summarized by the agent.
This directory is dedicated to the research questions (RQs) addressed in our paper. Each RQ has its own subfolder, which contains:
- The specific code used for that RQ.
- The data relevant to that RQ.
- The final results obtained for that RQ.
Each RQ subfolder also includes its own ReadMe.md file providing more detailed information specific to that research question.
If this work is helpful for your research, please consider citing the following BibTeX entry.
@misc{gao2025monocleanvulnerabilitydataset,
title={mono: Is Your "Clean" Vulnerability Dataset Really Solvable? Exposing and Trapping Undecidable Patches and Beyond},
author={Zeyu Gao and Junlin Zhou and Bolun Zhang and Yi He and Chao Zhang and Yuxin Cui and Hao Wang},
year={2025},
eprint={2506.03651},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2506.03651},
}