🌊 MalDataGen - v.1.0.0 (Jellyfish 🪼)

MalDataGen is an advanced Python framework for generating and evaluating synthetic tabular datasets using modern generative models, including diffusion and adversarial architectures. Designed for researchers and practitioners, it provides reproducible pipelines, fine-grained control over model configuration, and integrated evaluation metrics for realistic data synthesis.

📚 Table of Contents

📖 Overview (Basic information)
Video
Security worries
Stamps considered
🚀 Getting Started
⚙️ Installation
🧠 Architectures
🛠 Features
📊 Evaluation Strategy
📈 Metrics
📋 Architecture Diagrams
🔧 Technologies Used (Dependencies)
🔗 References

📖 Overview (Basic information)

MalDataGen is a modular and extensible library for generating synthetic tabular data for malware detection. It aims to:

Support state-of-the-art generative models (GANs, VAEs, Diffusion, etc.)
Improve model generalization by augmenting training data
Enable fair benchmarking via reproducible evaluations (TS-TR and TR-TS)
Provide publication-ready metrics and visualizations

It supports GPU acceleration, CSV/XLS ingestion, custom CLI scripts, and integration with academic pipelines.

Model architecture overview

We provide a visual overview of the internal architecture of each model's building blocks through five detailed figures, highlighting the main structural changes across the models. These diagrams are documented and explained in the Overview.md file (https://github.com/SBSeg25/MalDataGen/blob/2dd9eaad74da7726c130e50dbc35f95a463cbd00/Docs/Overview.md).

📋 Architecture Documentation

We provide a comprehensive visual overview (8 diagrams) at Docs/Diagrams/ of the MalDataGen framework, covering its architecture, design principles, data processing flow, and evaluation strategies. Developed using Mermaid notation, these diagrams support understanding of both the structural and functional aspects of the system. They include high-level system architecture, object-oriented class relationships, evaluation workflows, training pipelines, metric frameworks, and data flow. Together, they offer a detailed and cohesive view of how MalDataGen enables the generation and assessment of synthetic data in cybersecurity contexts.

📖 Video

A video demonstration of the tool is available at: https://drive.google.com/file/d/1sbPZ1x5Np6zolhFvCBWoMzqNqrthlUe3/view?usp=sharing

If that link does not work, a backup copy is available at: https://youtu.be/t-AZtsLJUlQ

Stamps considered

We, the authors, consider the following stamps:

Available artifacts (Stamp D)
Functional artifacts (Stamp F)
Sustainable artifacts (Stamp S)
Reproducible experiments (Stamp R)

We provide instructions for the installation, execution, and reproduction of the experiments presented in the paper, along with information about the execution environment and dependencies.

🚀 Getting Started

Prerequisites

Python 3.8+
pip
(Optional) CUDA 11+ for GPU acceleration

Optional: Create a virtual environment

pip install virtualenv
python3 -m venv ~/Python3venv/MalDataGen
source ~/Python3venv/MalDataGen/bin/activate

⚙️ Installation

git clone https://github.com/SBSeg25/MalDataGen.git
cd MalDataGen
pip install --upgrade pip
pip install -r requirements.txt
# or
pip install .

Security worries

We declare that running the experiments locally poses no security concerns. However, the Docker-based execution requires sudo permissions to be available to the Docker engine.

🚀 Run Tests

Demo (Minimal test)

To run a demo of the tool, use the command below. This reduced demo takes around 3 minutes on an AMD Ryzen 7 5800X machine with 8 cores and 64 GB of RAM.

# Run the basic demo
python3 run_campaign_sbseg.py -c sf

Alternatively, you can run the demo in a Docker container with the following command:

# Run the basic demo
./run_demo_docker.sh

Reproduction (Experiments)

To reproduce the results from the paper, run the command below. The experiments take around 7 hours on an AMD Ryzen 7 5800X machine with 8 cores and 64 GB of RAM.

# Run all experiments from the paper
python3 run_campaign_sbseg.py

Or to execute with docker:

# Run all experiments from the paper
./run_experiments_docker.sh

🧠 Architectures Supported

🔨 Native Models

Model	Description	Use Case
`CGAN`	Conditional GANs conditioned on labels or attributes	Class balancing, controlled generation
`WGAN`	Wasserstein GAN with Earth-Mover distance for improved stability	Imbalanced datasets, stable training
`WGAN-GP`	Wasserstein GAN with gradient penalty for stable training	Imbalanced datasets, complex distributions
`Autoencoder`	Latent-space learning through compression-reconstruction	Feature extraction, denoising
`VAE`	Probabilistic Autoencoder with latent sampling	Probabilistic generation and imputation
`Denoising Diffusion`	Progressive noise-based generative model	Robust generation with high-quality samples
`Latent Diffusion`	Diffusion model operating in compressed latent space	High-resolution generation, efficiency
`VQ-VAE`	Discrete latent-space via quantization	Categorical and mixed-type data
`SMOTE`	Synthetic Minority Over-sampling Technique (interpolation-based)	Class imbalance in tabular data
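
As an illustration of the interpolation idea behind the SMOTE entry in the table, the sketch below generates synthetic minority rows by moving a random fraction of the way from a sample towards one of its nearest neighbours. It is a minimal standard-library sketch, not MalDataGen's implementation: the function name `smote_sample`, its parameters, and the toy data are all assumptions made for this example.

```python
import math
import random


def smote_sample(minority, k=3, n_new=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours.

    Hypothetical helper for illustration only; not part of MalDataGen.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the remaining rows by Euclidean distance to the base row.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority-class rows (made-up values).
minority_rows = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_rows = smote_sample(minority_rows, k=2, n_new=3)
```

Because each synthetic row lies on the segment between two real minority rows, the generated values stay inside the range already covered by the minority class.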

📦 Third-Party Supported (SDV)

Model	Description	Use Case
`TVAE`	Variational Autoencoder optimized for tabular data	Structured/tabular data synthesis
`Copula`	Statistical model based on dependency (copula) functions	Synthetic data with correlations
`CTGAN`	GAN with mode-specific normalization for tabular data	Mixed-type/categorical synthesis

Legend:

SDV: Integration with the Synthetic Data Vault library.

🛠 Features

📊 Cross-validation (stratified k-fold)
⚙️ Fully customizable model configuration
📈 Built-in metrics for data quality
🔁 Persistent models & experiment saving
📉 Graphing utilities for visual reports
📉 Clustering visualization of datasets
📉 Heat maps between the synthetic and real samples
🧪 Automated experiment pipelines
💾 Data export to CSV/XLS formats
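
To make the stratified k-fold feature concrete, here is a minimal sketch of how such a split can be built: the indices of each class are dealt round-robin across the folds so that every fold keeps roughly the same class ratio. The function `stratified_k_fold` and the toy labels are assumptions for this example and do not reflect the framework's actual API; this version also does not shuffle the data.

```python
from collections import defaultdict


def stratified_k_fold(labels, n_splits=5):
    """Yield (train_idx, test_idx) pairs whose test folds keep the class ratio.

    Hypothetical helper for illustration only; not part of MalDataGen.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    # Deal the indices of each class round-robin across the folds.
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)

    for fold_id in range(n_splits):
        test_idx = sorted(folds[fold_id])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != fold_id for i in fold
        )
        yield train_idx, test_idx


# Toy labels: 8 benign (0) and 4 malware (1) samples.
labels = [0] * 8 + [1] * 4
splits = list(stratified_k_fold(labels, n_splits=4))
```

With these toy labels, each of the four test folds holds two benign samples and one malware sample, matching the 2:1 ratio of the full set.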

📊 Evaluation Strategy

Two validation approaches are supported:

TS-TR (Train Synthetic – Test Real)
Measures generalization ability by training on synthetic data and testing on real data.
TR-TS (Train Real – Test Synthetic)
Assesses generative realism by training on real and testing on synthetic samples.
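
The two protocols differ only in which dataset plays the training role and which plays the test role. The sketch below shows this with a toy nearest-centroid classifier written in plain Python. The classifier, the helper names `fit_centroids` and `accuracy`, and the data are assumptions chosen to keep the example self-contained; the framework's own classifiers and metrics may differ.

```python
import math


def fit_centroids(rows, labels):
    """Toy classifier: compute one mean vector (centroid) per class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def accuracy(centroids, rows, labels):
    """Share of rows whose nearest centroid carries the true label."""
    hits = 0
    for row, label in zip(rows, labels):
        predicted = min(centroids, key=lambda c: math.dist(row, centroids[c]))
        hits += predicted == label
    return hits / len(rows)


# Hypothetical toy data: label 0 = benign, label 1 = malware.
real_x, real_y = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]], [0, 0, 1, 1]
synth_x, synth_y = [[0.15, 0.15], [0.0, 0.3], [0.85, 0.95], [1.0, 0.7]], [0, 0, 1, 1]

# TS-TR: train on synthetic data, test on real data.
ts_tr = accuracy(fit_centroids(synth_x, synth_y), real_x, real_y)
# TR-TS: train on real data, test on synthetic data.
tr_ts = accuracy(fit_centroids(real_x, real_y), synth_x, synth_y)
```

A large gap between the two scores would suggest that the synthetic data either fails to cover the real distribution (low TS-TR) or contains samples a real-data model does not recognise (low TR-TS).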

📈 Metrics Tracked

Primary

Accuracy, Precision, Recall, F1-score, Specificity
ROC-AUC, MSE, MAE, FNR, TNR
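
Most of the primary classification metrics can be derived from the four confusion-matrix counts. The sketch below computes them for a binary problem, assuming label 1 marks the positive (malware) class. Note that Specificity and TNR are the same quantity, and FNR equals 1 minus Recall. The function `binary_metrics` is a hypothetical helper for this example, not the framework's API.

```python
def binary_metrics(y_true, y_pred):
    """Derive common binary-classification metrics from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # Return 0.0 instead of dividing by zero when a class is absent.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "specificity": ratio(tn, tn + fp),  # identical to TNR
        "fnr": ratio(fn, fn + tp),
    }


# Toy predictions: 3 true positives, 1 false negative, 3 true negatives, 1 false positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
scores = binary_metrics(y_true, y_pred)
```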

Secondary

Euclidean Distance, Hellinger Distance
Log-Likelihood, Manhattan Distance
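
Three of the secondary metrics are simple vector distances. The sketch below gives their textbook definitions in plain Python. How MalDataGen aggregates them over features or samples is not stated here, so the inputs are assumptions: `euclidean` and `manhattan` take two numeric vectors of equal length, and `hellinger` takes two discrete probability distributions (non-negative values summing to 1).

```python
import math


def euclidean(a, b):
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    return math.sqrt(
        sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(p, q)) / 2
    )


# Toy feature vectors (made-up values).
real_row, synthetic_row = [0.0, 3.0], [4.0, 0.0]
d_euclidean = euclidean(real_row, synthetic_row)
d_manhattan = manhattan(real_row, synthetic_row)

# Toy histograms over two bins, already normalised to sum to 1.
d_hellinger_same = hellinger([0.5, 0.5], [0.5, 0.5])
d_hellinger_disjoint = hellinger([1.0, 0.0], [0.0, 1.0])
```

Lower values indicate that the synthetic data lies closer to the real data. The Hellinger distance is 0 for identical distributions and 1 for distributions with no overlap.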

📋 Architecture Diagrams

Comprehensive architecture documentation is available in the Docs/Diagrams/ directory, including:

System Architecture: High-level framework overview and component relationships
Core Class Hierarchy: Object-oriented design and inheritance structure
Evaluation Strategy: TS-TR and TR-TS evaluation flow diagrams
Model Training Pipeline: Complete workflow sequence from data to results
Metrics Framework: Comprehensive evaluation metrics overview
Data Flow Architecture: End-to-end data processing pipeline
Generative Models Comparison: Model categories and characteristics
Deployment Architecture: Docker and execution mode options

All diagrams are created using Mermaid format for easy maintenance and version control. They can be viewed directly in GitHub or exported for academic publications.

🧰 Technologies Used

Tool	Purpose
Python 3.8+	Core language
NumPy, Pandas	Data processing
TensorFlow	Model building
Matplotlib, Plotly	Visualization
PyTorch (planned)	Future multi-backend support
Docker	Containerization
Git	Version control

🔬 System Requirements

Hardware

Component	Minimum	Recommended
CPU	Any x86_64	Multi-core (i5/Ryzen 5+)
RAM	4 GB	8 GB+
Storage	10 GB	20 GB SSD
GPU	Optional	NVIDIA with CUDA 11+

Software

Component	Version	Notes
OS	Ubuntu 22.04+	Linux preferred
Python	≥ 3.8.10	Virtualenv recommended
Docker	≥ 27.2.1	Optional but supported
Git	Latest	Required
CUDA	≥ 11.0	Optional for GPU execution
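
The software requirements above can be partially checked before installation. The following sketch, which is not shipped with the repository, verifies the Python version against the 3.8.10 minimum from the table and looks for `git` and `docker` on the PATH. It does not check the Docker or CUDA versions; the function name `check_environment` is an assumption for this example.

```python
import shutil
import sys

# Minimum Python version taken from the Software table.
MIN_PYTHON = (3, 8, 10)


def check_environment():
    """Return a dict mapping each requirement name to True (met) or False."""
    return {
        "python": sys.version_info[:3] >= MIN_PYTHON,
        "git": shutil.which("git") is not None,
        "docker": shutil.which("docker") is not None,  # optional per the table
    }


report = check_environment()
for name, ok in report.items():
    print(f"{name}: {'found' if ok else 'missing'}")
```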
