This project leverages BERT (Bidirectional Encoder Representations from Transformers) to classify CVE descriptions into high-risk or low-risk categories based on vulnerability keywords. It automates parsing of large-scale CVE JSON records from CVE.org, generates labels, and fine-tunes a BERT model for high-accuracy binary classification.
- Official CVE JSON Records: https://github.com/CVEProject/cvelistV5
- Total JSON files processed: 296,491+
- Valid English CVE descriptions extracted: 275,202
- Parallel parsing of JSON files for scalable processing (see the parsing and labeling sketch after this list)
- Automatic binary labeling using critical vulnerability keywords
- Balanced dataset creation for unbiased training
- Custom PyTorch dataset integration for BERT
- Fine-tuning BERT-Base Uncased for text classification (see the fine-tuning sketch below)
- Comprehensive evaluation with Accuracy, Precision, Recall, F1, Confusion Matrix, and ROC-AUC
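The parsing, labeling, and balancing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's exact code: the keyword list, the `cve_json_rootpath` value, and the output CSV name are assumptions, and the description lookup follows the CVE JSON 5.x `containers.cna.descriptions` layout.

```python
import json
from multiprocessing import Pool
from pathlib import Path

import pandas as pd

# Assumed keyword list for binary labeling; the project's actual list may differ.
HIGH_RISK_KEYWORDS = [
    "remote code execution", "buffer overflow", "privilege escalation",
    "sql injection", "arbitrary code",
]

def parse_cve_file(path):
    """Extract the first English description from a CVE JSON 5.x record."""
    try:
        record = json.loads(Path(path).read_text(encoding="utf-8"))
        for d in record["containers"]["cna"]["descriptions"]:
            if d.get("lang", "").lower().startswith("en"):
                return d["value"]
    except (KeyError, json.JSONDecodeError, OSError):
        pass
    return None

def label_description(text):
    """1 if any critical vulnerability keyword appears, else 0."""
    lowered = text.lower()
    return int(any(keyword in lowered for keyword in HIGH_RISK_KEYWORDS))

if __name__ == "__main__":
    cve_json_rootpath = "cvelistV5/cves"  # assumed location of the cloned CVE records
    files = list(Path(cve_json_rootpath).rglob("CVE-*.json"))

    with Pool() as pool:  # parse JSON files in parallel across CPU cores
        descriptions = pool.map(parse_cve_file, files)

    df = pd.DataFrame({"text": [d for d in descriptions if d]})
    df["label"] = df["text"].apply(label_description)

    # Balance the classes by downsampling the majority class.
    per_class = df["label"].value_counts().min()
    balanced = pd.concat(
        df[df["label"] == c].sample(per_class, random_state=42) for c in (0, 1)
    )
    balanced.to_csv("cve_balanced.csv", index=False)
```

Downsampling the majority class is one straightforward way to arrive at a balanced set like the 8,000-train / 2,000-test split reported below; the project may use a different sampling strategy.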
- Language: Python 3.x
- Libraries: PyTorch, HuggingFace Transformers, Scikit-learn, Pandas, Seaborn, Matplotlib
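A compact sketch of the custom PyTorch dataset and the BERT-Base Uncased fine-tuning step using the HuggingFace `Trainer` API. The CSV name, train/test split, maximum sequence length, and hyperparameters are illustrative assumptions rather than the project's exact settings.

```python
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          Trainer, TrainingArguments)

class CVEDataset(Dataset):
    """Wraps tokenized CVE descriptions and binary labels for BERT."""
    def __init__(self, texts, labels, tokenizer, max_length=256):
        self.encodings = tokenizer(texts, truncation=True,
                                   padding="max_length", max_length=max_length)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Assumed to be the balanced CSV produced by the parsing sketch above.
df = pd.read_csv("cve_balanced.csv")
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df["text"].tolist(), df["label"].tolist(),
    test_size=0.2, stratify=df["label"], random_state=42)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

train_ds = CVEDataset(train_texts, train_labels, tokenizer)
test_ds = CVEDataset(test_texts, test_labels, tokenizer)

# Illustrative hyperparameters; the project's exact settings may differ.
args = TrainingArguments(
    output_dir="bert-cve",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=test_ds)
trainer.train()
```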
- Train Set: 8,000 Samples (Balanced)
- Test Set: 2,000 Samples (Balanced)
- Final Test Accuracy: 99%
- F1-Score: 0.99
- ROC-AUC Score: high (see the ROC-AUC curve plot below)
Classification Report:

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.99 | 0.99 | 0.99 | 988 |
| 1 | 1.00 | 0.99 | 0.99 | 1012 |
| accuracy | | | 0.99 | 2000 |
| macro avg | 0.99 | 0.99 | 0.99 | 2000 |
| weighted avg | 0.99 | 0.99 | 0.99 | 2000 |
Confusion Matrix and ROC-AUC Curve: the corresponding plots are generated during evaluation (see the evaluation sketch below).
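A rough sketch of how the metrics, confusion matrix, and ROC-AUC curve above can be produced with scikit-learn, seaborn, and matplotlib, assuming the `trainer`, `test_ds`, and `test_labels` objects from the fine-tuning sketch earlier.

```python
import matplotlib.pyplot as plt
import seaborn as sns
import torch
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)

# Predicted logits for the test split from the fine-tuned model.
pred_output = trainer.predict(test_ds)
logits = pred_output.predictions
probs = torch.softmax(torch.tensor(logits), dim=1)[:, 1].numpy()
preds = logits.argmax(axis=1)

print(classification_report(test_labels, preds))
print("ROC-AUC:", roc_auc_score(test_labels, probs))

# Confusion matrix heatmap.
cm = confusion_matrix(test_labels, preds)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted"); plt.ylabel("Actual"); plt.title("Confusion Matrix")
plt.savefig("confusion_matrix.png"); plt.close()

# ROC curve.
fpr, tpr, _ = roc_curve(test_labels, probs)
plt.plot(fpr, tpr, label="BERT classifier")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate")
plt.title("ROC-AUC Curve"); plt.legend()
plt.savefig("roc_curve.png"); plt.close()
```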
Required libraries: torch, transformers, scikit-learn, pandas, seaborn, matplotlib, and tqdm (installable with pip).
- Download the CVE JSON dataset from the CVEProject `cvelistV5` repository
- Set the correct `cve_json_rootpath` in the script
- Run the script for training and evaluation

