ML-DDOS-Detection-Project

A machine learning-based approach for detecting Distributed Denial of Service (DDoS) attacks. This repository was developed by Team Olympians as a group project for the Data Science for Cybersecurity course.

Overview

This project aims to demonstrate how data science and machine learning techniques can be applied to detect DDoS attacks in network traffic. By performing data preprocessing, feature engineering, model training, and evaluation, we showcase a pipeline that identifies malicious patterns indicative of DDoS behavior.

Key Objectives:

Understand and analyze network traffic data through Exploratory Data Analysis (EDA).
Clean and preprocess the data to handle missing values and outliers.
Select and engineer features relevant to DDoS attack detection.
Address class imbalance using techniques such as SMOTE.
Train, tune, and evaluate multiple machine learning models.
Propose future improvements, including advanced deep learning methods.
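
The class-imbalance objective mentions SMOTE. The core idea can be sketched in plain Python: pick a minority-class sample, pick one of its nearest minority neighbours, and create a synthetic point somewhere on the line segment between them. This is a simplified illustration, not the project's implementation; the function name `smote_like_oversample` and the toy data are assumptions made for the example. In practice a library implementation such as `SMOTE` from imbalanced-learn would normally be used.

```python
import math
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (hypothetical helper)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority class: four attack flows described by two made-up features.
minority_flows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_flows = smote_like_oversample(minority_flows, n_new=6)
```

Because every synthetic point lies between two existing minority samples, oversampling of this kind is usually applied to the training data only, so the test set keeps the original class proportions.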

Dataset

Name: DDoS2020 (Prairie View A&M University)
Source: PVAMU-DDoS-2020.csv

For this project, we utilize the DDoS2020 dataset from Prairie View A&M University, introduced and described in the paper:

S. Alam, Y. Alam, S. Cui, C. Akujuobi, and M. Chouikha, “Toward Developing a Realistic DDoS Dataset for Anomaly-based Intrusion Detection,” 2021 IEEE International Conference on Consumer Electronics (ICCE), 2021.

This dataset was developed using Spirent’s CyberFlood CF20 emulator to capture realistic volumetric and protocol-based DDoS attacks (ICMP Flood, UDP Flood, SYN Flood, XMAS Tree Flood) alongside benign network traffic. The authors extracted and labeled over 4.5 million flow records with up to 80 features per flow, offering a comprehensive view of normal and malicious network behavior. Because the traffic was emulated at high bandwidth (up to 10 Gbps), the dataset approximates large-scale attack scenarios and is suitable for training and evaluating anomaly-based intrusion detection models.

Download Link: PVAMU-DDoS-2020.csv

Project Timeline & Tasks

Below is a summary of our project plan, tasks, responsible team members, and completion dates.

Sr No	Task	Team Member	Timeline
1	Exploratory Data Analysis (EDA)	1	3/30/2025
2	Data Cleaning	2	3/30/2025
3	Feature Selection & Extraction	3	4/2/2025
4	Handling Data Imbalance	4	4/5/2025
5	Splitting Data & Normalizing Data	4	4/5/2025
6	Training 4 Basic Models	All	4/8/2025
7	Training 4 Advanced Models	All	4/8/2025
8	Hyperparameter Tuning & Cross-Validation	3	4/10/2025
9	Model Evaluation on Test Data	All	4/12/2025
10	Conclusion & Future Enhancements	All	4/12/2025
11	Presentation	1 & 2	4/19/2025
12	Report	3 & 4	4/19/2025

Task Descriptions

Exploratory Data Analysis (EDA)
- Load dataset, visualize distributions, identify outliers, correlations, and initial insights.
Data Cleaning
- Handle missing values (drop or impute), remove duplicates, and address outliers using appropriate methods.
Feature Selection & Extraction
- Use techniques like correlation matrices, SelectKBest, or Lasso to identify and retain the most informative features.
Handling Data Imbalance
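- Use SMOTE or a similar resampling technique to address imbalance in the class labels. Resample only the training data so that synthetic samples do not leak into the test set.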
Splitting & Normalizing Data
- Split the data into training and test sets (80/20). Normalize or standardize numerical features, fitting the scaler on the training set only and then applying it to the test set.
Training 4 Basic Models
- Logistic Regression, Decision Tree, K-Nearest Neighbors (KNN), and Naïve Bayes.
Training 4 Advanced Models
- Random Forest, Gradient Boosting (XGBoost), Support Vector Machine (SVM), and Neural Networks (MLP).
Hyperparameter Tuning & Cross-Validation
- Run GridSearchCV or RandomizedSearchCV with 5-fold cross-validation to tune model hyperparameters and reduce the risk of overfitting.
Model Evaluation on Test Data
- Evaluate performance using Accuracy, Precision, Recall, F1-score, ROC-AUC, and Confusion Matrix.
- Optionally, use SHAP or LIME for model explainability.
Conclusion & Future Enhancements
- Summarize findings, compare model performances, and propose improvements (e.g., deep learning approaches, anomaly detection with autoencoders, or text-based analysis with BERT/GPT models).
Presentation
- Prepare a comprehensive project presentation showcasing methodologies, results, and key insights.
Report
- Compile a detailed report summarizing the entire project, methodologies, outcomes, and lessons learned.
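
The splitting and normalizing task above can be illustrated with a short standard-library sketch: a stratified 80/20 split followed by standardization that uses statistics from the training rows only. The helper names (`stratified_split`, `fit_scaler`, `transform`) and the toy data are hypothetical and are not taken from the repository's `src/` scripts.

```python
import random
import statistics

def stratified_split(rows, labels, test_fraction=0.2, seed=42):
    """Split per class so both sets keep roughly the original class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def pick(idxs):
        return [rows[i] for i in idxs], [labels[i] for i in idxs]

    return pick(train_idx), pick(test_idx)

def fit_scaler(train_rows):
    """Column-wise mean and standard deviation computed from training rows only."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(c) for c in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, stds

def transform(rows, means, stds):
    """Standardize rows with previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Toy data: 100 flows with two made-up features; 20 are labeled as attacks (1).
rows = [[float(i), float(i % 7)] for i in range(100)]
labels = [1 if i < 20 else 0 for i in range(100)]

(train_X, train_y), (test_X, test_y) = stratified_split(rows, labels)
means, stds = fit_scaler(train_X)
train_scaled = transform(train_X, means, stds)
test_scaled = transform(test_X, means, stds)  # reuses the training statistics
```

Fitting the scaler on the training rows and reusing it for the test rows keeps test-set information out of preprocessing, which is the same reason resampling is normally restricted to the training split.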

Project Structure

ML-DDOS-Detection-Project/
├── data/
│   ├── raw/                 # Original dataset
│   └── processed/           # Cleaned and transformed data
├── notebooks/               # Jupyter notebooks for EDA, experiments, analysis
├── src/                     # Source code for data processing and model training
│   ├── preprocessing.py     # Scripts for cleaning and preparing data
│   ├── feature_engineering.py
│   ├── model_training.py    # Scripts to train both basic and advanced models
│   ├── hyperparameter.py    # Hyperparameter tuning and cross-validation
│   └── utils.py             # Utility functions
├── results/                 # Model outputs, evaluation metrics, plots
├── requirements.txt         # Python dependencies
└── README.md                # Project documentation

Installation

1. Clone the repository:
   git clone https://github.com/ibrahimsaleem/ML-DDOS-Detection-Project.git
2. Navigate to the project directory:
   cd ML-DDOS-Detection-Project
3. Create a virtual environment:
   python -m venv venv
4. Activate the virtual environment:
   - Windows: venv\Scripts\activate
   - macOS/Unix: source venv/bin/activate
5. Install dependencies:
   pip install -r requirements.txt

Usage

Data Preprocessing
- Run the preprocessing script to clean and prepare the dataset:
  python src/preprocessing.py
Feature Engineering
- Generate new features or select the top features:
  python src/feature_engineering.py
Model Training
- Train the basic and advanced models:
  python src/model_training.py
Hyperparameter Tuning
- Run hyperparameter tuning and cross-validation:
  python src/hyperparameter.py
Evaluation & Visualization
- Evaluate the trained models on the test set and generate plots. Refer to the notebooks/ directory or results/ folder for any additional scripts or detailed analysis.
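
As a reference for the evaluation step, the sketch below shows how accuracy, precision, recall, and F1-score follow from the confusion-matrix counts. It is a minimal standard-library illustration with made-up labels; the function names and the label encoding (1 = attack, 0 = benign) are assumptions, not the repository's code. ROC-AUC is not covered because it requires predicted scores rather than hard labels.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the attack class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up example: 3 attack flows and 7 benign flows.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

On imbalanced traffic data, accuracy alone can look high even when many attacks are missed, which is why precision and recall for the attack class are reported alongside it.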

Future Enhancements

Deep Learning Approaches: Implement LSTM or CNN models for time-series-based detection.
Anomaly Detection: Use autoencoders for unsupervised anomaly detection.
Text-Based Analysis: Integrate BERT/GPT models for analyzing text-based attack logs.

Team

Team Name: Team Olympians
Total Members: 4

Members

Danindu Gammanpilage
Mohammad Ibrahim Saleem
Simran Khaparde
Suvarna Aglave (Team Leader)

License

This project is licensed under the MIT License.

Acknowledgements

Course: Data Science for Cybersecurity
Instructors & Mentors: For guidance and feedback
Open-Source Community: For providing essential libraries and resources
Prairie View A&M University: For the DDoS2020 dataset

Feel free to open an issue or pull request for any improvements or clarifications!
