A comprehensive web-based framework for synthetic data generation and automated model performance benchmarking.

leandrenash/DataSynthBench

DataSynthBench

License: MIT Node.js Version React TypeScript

A comprehensive web-based framework for synthetic data generation and automated model performance benchmarking. Upload your tabular dataset, generate synthetic variants using multiple methods, and automatically evaluate model performance drift.

🚀 Features

Core Functionality

  • Multi-Method Synthetic Data Generation

    • SMOTE (Synthetic Minority Oversampling Technique)
    • GAN-based generation simulation
    • Gaussian noise injection
    • Bootstrap resampling
  • Automated Model Benchmarking

    • Random Forest, Logistic Regression, SVM, XGBoost
    • Cross-validation with configurable folds
    • Performance drift detection
    • Comprehensive metric evaluation
  • Interactive Dashboard

    • Real-time progress tracking
    • Visual performance comparisons
    • Drift analysis and insights
    • Export capabilities for CI/CD
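
As a rough illustration of what performance drift detection can mean here, the sketch below compares one metric per model before and after training on a synthetic variant. The `MetricPair` shape, the `computeDrift` name, and the 0.05 threshold are assumptions made for this example, not the project's actual types or defaults.

```typescript
// Hypothetical shape: one metric value per model, measured twice.
interface MetricPair {
  model: string;
  baseline: number;  // metric on the original dataset, e.g. accuracy
  synthetic: number; // same metric on a synthetic variant
}

interface DriftResult extends MetricPair {
  absoluteDrift: number;        // synthetic - baseline
  relativeDrift: number | null; // null when the baseline is 0
  flagged: boolean;             // true when |absoluteDrift| exceeds the threshold
}

// Assumed rule: flag a model when the metric moves by more than `threshold`.
function computeDrift(pairs: MetricPair[], threshold = 0.05): DriftResult[] {
  return pairs.map((p) => {
    const absoluteDrift = p.synthetic - p.baseline;
    const relativeDrift = p.baseline === 0 ? null : absoluteDrift / p.baseline;
    return {
      ...p,
      absoluteDrift,
      relativeDrift,
      flagged: Math.abs(absoluteDrift) > threshold,
    };
  });
}

const driftReport = computeDrift([
  { model: "random_forest", baseline: 0.9, synthetic: 0.81 },
  { model: "svm", baseline: 0.8, synthetic: 0.8 },
]);
```

A CI job could fail the build when any entry in `driftReport` has `flagged` set, which is one way the structured export might be consumed.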

Technical Features

  • Modern Web Interface: React 18 + TypeScript + Tailwind CSS
  • Responsive Design: Optimized for desktop and tablet workflows
  • File Processing: CSV upload with automatic column type detection
  • Export Formats: JSON, YAML, CSV for different use cases
  • CI/CD Ready: Structured output for automated pipelines

📋 Prerequisites

Before running DataSynthBench locally, ensure you have:

  • Node.js (version 16.0.0 or higher)
  • npm (version 7.0.0 or higher) or yarn
  • Git for cloning the repository

Check your versions:

node --version
npm --version

🛠️ Local Development Setup

1. Clone the Repository

git clone https://github.com/leandrenash/datasynthbench.git
cd datasynthbench

2. Install Dependencies

Using npm:

npm install

Using yarn:

yarn install

3. Start Development Server

npm run dev

The application will be available at http://localhost:5173

4. Build for Production

npm run build

Built files will be in the dist/ directory.

📁 Project Structure

datasynthbench/
├── public/                 # Static assets
├── src/
│   ├── components/         # React components
│   │   ├── Dashboard.tsx
│   │   ├── DatasetUpload.tsx
│   │   ├── ConfigurationPanel.tsx
│   │   ├── ResultsViewer.tsx
│   │   └── ExportPanel.tsx
│   ├── App.tsx            # Main application component
│   ├── main.tsx           # Application entry point
│   └── index.css          # Global styles
├── docs/                  # Documentation and images
├── package.json
├── README.md
└── ...config files

🎯 Quick Start Guide

1. Upload Your Dataset

  • Navigate to the "Upload Data" tab
  • Drag and drop a CSV file or click to browse
  • Review the automatic column analysis and data preview

2. Configure Generation

  • Go to the "Configure" tab
  • Select synthetic data generation methods (SMOTE, GAN, Noise, Resample)
  • Choose models for benchmarking
  • Adjust parameters as needed

3. Run Benchmark

  • Click "Run Benchmark" to start the process
  • Monitor real-time progress
  • View detailed results in the "Results" tab

4. Export Results

  • Navigate to the "Export" tab
  • Choose export format (JSON, YAML, CSV)
  • Download complete results or summary reports

📊 Supported Data Formats

Input Requirements

  • File Format: CSV files only
  • Size Limit: Up to 100MB
  • Column Types: Automatic detection of numeric and categorical columns
  • Missing Values: Handled automatically during processing

Example Dataset Structure

feature1,feature2,feature3,target
1.2,category_a,0.5,class_1
2.1,category_b,0.8,class_2
1.8,category_a,0.3,class_1
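
To make the automatic column type detection concrete, here is a minimal sketch applied to the example dataset above. It assumes a column is numeric when every non-empty value parses as a finite number and categorical otherwise. The `parseCsv` and `detectColumnTypes` helpers are hypothetical names, and the naive comma split does not handle quoted fields.

```typescript
type ColumnType = "numeric" | "categorical";

// Naive CSV parsing: splits on commas and ignores quoting (an assumption for brevity).
function parseCsv(text: string): { header: string[]; rows: string[][] } {
  const lines = text.trim().split(/\r?\n/).filter((l) => l.length > 0);
  if (lines.length === 0) throw new Error("CSV is empty");
  const [headerLine, ...body] = lines;
  return {
    header: headerLine.split(",").map((h) => h.trim()),
    rows: body.map((l) => l.split(",")),
  };
}

// A column counts as numeric only if all of its non-empty values are finite numbers.
function detectColumnTypes(header: string[], rows: string[][]): Record<string, ColumnType> {
  const types: Record<string, ColumnType> = {};
  header.forEach((name, i) => {
    const values = rows.map((r) => (r[i] ?? "").trim()).filter((v) => v !== "");
    const allNumeric = values.length > 0 && values.every((v) => Number.isFinite(Number(v)));
    types[name] = allNumeric ? "numeric" : "categorical";
  });
  return types;
}

const sampleCsv = [
  "feature1,feature2,feature3,target",
  "1.2,category_a,0.5,class_1",
  "2.1,category_b,0.8,class_2",
  "1.8,category_a,0.3,class_1",
].join("\n");

const parsed = parseCsv(sampleCsv);
const columnTypes = detectColumnTypes(parsed.header, parsed.rows);
// columnTypes: feature1 and feature3 are "numeric"; feature2 and target are "categorical"
```

Empty cells are skipped during detection, so a numeric column with missing values is still classified as numeric under this rule.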

⚙️ Configuration Options

Synthetic Data Generators

SMOTE

smote:
  enabled: true
  k_neighbors: 5
  sampling_strategy: "auto"
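
For readers unfamiliar with SMOTE, the core idea is to create a new minority-class point by interpolating between an existing minority sample and one of its `k_neighbors` nearest minority neighbours. The sketch below shows that step for numeric features only. The `smoteSketch` function is a hypothetical illustration and does not reproduce this project's simulation or the `sampling_strategy` option.

```typescript
type Vec = number[];

function squaredDistance(a: Vec, b: Vec): number {
  return a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0);
}

// Generates `nSynthetic` points from minority-class samples (numeric features only).
function smoteSketch(
  minority: Vec[],
  nSynthetic: number,
  kNeighbors = 5,
  rng: () => number = Math.random,
): Vec[] {
  if (minority.length < 2) throw new Error("need at least two minority samples");
  const k = Math.min(kNeighbors, minority.length - 1);
  const synthetic: Vec[] = [];
  for (let n = 0; n < nSynthetic; n++) {
    // Pick a random base sample.
    const baseIdx = Math.floor(rng() * minority.length);
    const base = minority[baseIdx];
    // Find its k nearest neighbours within the minority class.
    const neighbours = minority
      .map((v, i) => ({ i, d: squaredDistance(base, v) }))
      .filter((e) => e.i !== baseIdx)
      .sort((a, b) => a.d - b.d)
      .slice(0, k);
    const chosen = neighbours[Math.floor(rng() * neighbours.length)];
    const neighbour = minority[chosen.i];
    // Interpolate at a random position on the segment base -> neighbour.
    const gap = rng();
    synthetic.push(base.map((x, j) => x + gap * (neighbour[j] - x)));
  }
  return synthetic;
}

const minoritySamples: Vec[] = [[0, 0], [1, 1], [2, 0], [1, -1]];
const smotePoints = smoteSketch(minoritySamples, 10, 2);
```

Because every synthetic point lies on a segment between two real minority samples, the output stays inside the region the minority class already occupies.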

GAN Simulation

gan:
  enabled: true
  epochs: 100
  batch_size: 32

Noise Injection

noise:
  enabled: true
  noise_level: 0.1
  noise_type: "gaussian"
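
One plausible reading of `noise_level` is a fraction of each column's standard deviation, so that a value of 0.1 perturbs every numeric feature by roughly 10% of its spread. The sketch below implements that interpretation with a Box-Muller transform. Both the interpretation and the `injectGaussianNoise` name are assumptions, not confirmed behaviour of the tool.

```typescript
// Standard normal sample via the Box-Muller transform.
function gaussianSample(rng: () => number = Math.random): number {
  const u1 = 1 - rng(); // shift to (0, 1] so Math.log never sees 0
  const u2 = rng();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Population standard deviation of one column.
function columnStd(rows: number[][], col: number): number {
  const values = rows.map((r) => r[col]);
  const mean = values.reduce((s, v) => s + v, 0) / values.length;
  const variance = values.reduce((s, v) => s + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}

// Returns a noisy copy; the input rows are not modified.
function injectGaussianNoise(
  rows: number[][],
  noiseLevel = 0.1,
  rng: () => number = Math.random,
): number[][] {
  if (rows.length === 0) return [];
  const stds = rows[0].map((_, c) => columnStd(rows, c));
  return rows.map((r) => r.map((x, c) => x + noiseLevel * stds[c] * gaussianSample(rng)));
}

const cleanRows = [[1, 5], [2, 5], [3, 5]];
const noisyRows = injectGaussianNoise(cleanRows, 0.1);
```

Scaling by the column standard deviation keeps the perturbation comparable across features with different units, and a constant column (standard deviation 0) is left unchanged.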

Resampling

resample:
  enabled: true
  strategy: "random"
  ratio: 1.0
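
Bootstrap resampling with `strategy: "random"` can be read as drawing rows uniformly at random with replacement. The sketch below assumes `ratio` is the size of the output relative to the input, which the configuration above does not state explicitly. The `bootstrapResample` name is hypothetical.

```typescript
// Draws round(rows.length * ratio) rows uniformly at random, with replacement.
function bootstrapResample<T>(
  rows: T[],
  ratio = 1.0,
  rng: () => number = Math.random,
): T[] {
  const n = Math.round(rows.length * ratio);
  const out: T[] = [];
  for (let i = 0; i < n; i++) {
    out.push(rows[Math.floor(rng() * rows.length)]);
  }
  return out;
}

const originalRows = ["a", "b", "c", "d"];
const resampledRows = bootstrapResample(originalRows, 1.5);
```

Since sampling is with replacement, some rows appear several times and others not at all, which is what distinguishes a bootstrap sample from a shuffle.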

Model Configuration

models:
  random_forest:
    enabled: true
    n_estimators: 100
  logistic_regression:
    enabled: true
    C: 1.0
  svm:
    enabled: true
    kernel: "rbf"
  xgboost:
    enabled: true
    n_estimators: 100
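
The feature list mentions cross-validation with configurable folds. As a minimal sketch of how fold assignment might work, the helper below shuffles row indices and deals them into `nFolds` groups, each serving once as the test set. The `kFoldIndices` name and `Fold` type are hypothetical, and stratification is left out.

```typescript
interface Fold {
  train: number[];
  test: number[];
}

// Splits indices 0..nSamples-1 into nFolds train/test partitions.
function kFoldIndices(
  nSamples: number,
  nFolds = 5,
  rng: () => number = Math.random,
): Fold[] {
  if (nFolds < 2 || nFolds > nSamples) {
    throw new Error("nFolds must be between 2 and nSamples");
  }
  // Fisher-Yates shuffle so folds do not depend on row order.
  const idx = Array.from({ length: nSamples }, (_, i) => i);
  for (let i = idx.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [idx[i], idx[j]] = [idx[j], idx[i]];
  }
  // Deal shuffled positions round-robin into folds.
  const folds: Fold[] = [];
  for (let f = 0; f < nFolds; f++) {
    folds.push({
      test: idx.filter((_, pos) => pos % nFolds === f),
      train: idx.filter((_, pos) => pos % nFolds !== f),
    });
  }
  return folds;
}

const folds = kFoldIndices(10, 3);
```

Each row lands in exactly one test set, so averaging a metric over the folds uses every row for evaluation once.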

🔧 Development

Available Scripts

  • npm run dev - Start development server
  • npm run build - Build for production
  • npm run preview - Preview production build
  • npm run lint - Run ESLint

Code Style

  • TypeScript: Strict mode enabled
  • ESLint: Configured with React and TypeScript rules
  • Prettier: Code formatting (recommended)

Adding New Features

  1. New Synthetic Data Method:

    • Add configuration options in ConfigurationPanel.tsx
    • Implement generation logic simulation
    • Update results processing
  2. New Model Type:

    • Extend model configuration interface
    • Add to benchmarking simulation
    • Update results visualization
  3. New Export Format:

    • Add format option in ExportPanel.tsx
    • Implement conversion function
    • Test with sample data
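
For the "implement conversion function" step of a new export format, a conversion can be as small as the CSV sketch below. The `BenchmarkRow` shape is an assumption for illustration. The actual result objects handled by ExportPanel.tsx may differ.

```typescript
// Hypothetical flat result row; the real result structure may differ.
interface BenchmarkRow {
  generator: string;
  model: string;
  metric: string;
  value: number;
}

// Quotes a field when it contains a comma, quote, or line break.
function escapeCsvField(value: string | number): string {
  const s = String(value);
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

function toCsv(rows: BenchmarkRow[]): string {
  const header: (keyof BenchmarkRow)[] = ["generator", "model", "metric", "value"];
  const lines = rows.map((r) => header.map((k) => escapeCsvField(r[k])).join(","));
  return [header.join(","), ...lines].join("\n");
}

const exportedCsv = toCsv([
  { generator: "smote", model: "svm", metric: "accuracy", value: 0.9 },
]);
// exportedCsv is "generator,model,metric,value\nsmote,svm,accuracy,0.9"
```

Keeping the conversion a pure function of the result rows makes it straightforward to test with sample data, as the last step in the list suggests.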

🚀 Deployment

Netlify (Recommended)

  1. Connect your GitHub repository to Netlify
  2. Set build command: npm run build
  3. Set publish directory: dist
  4. Deploy automatically on push

Vercel

  1. Import project from GitHub
  2. Framework preset: Vite
  3. Build command: npm run build
  4. Output directory: dist

Docker

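FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
# Install all dependencies; the build step needs devDependencies such as Vite
RUN npm ci
COPY . .
RUN npm run build
# vite preview listens on port 4173 by default
EXPOSE 4173
# Bind to 0.0.0.0 so the server is reachable from outside the container
CMD ["npm", "run", "preview", "--", "--host", "0.0.0.0"]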

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Workflow

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes
  4. Run the linter: npm run lint
  5. Commit changes: git commit -m 'Add amazing feature'
  6. Push to branch: git push origin feature/amazing-feature
  7. Open a Pull Request

Reporting Issues

  • Use the GitHub Issues page
  • Include detailed reproduction steps
  • Provide sample data if possible (anonymized)

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🗺️ Roadmap

  • Python backend integration for real synthetic data generation
  • Advanced visualization with D3.js
  • Model explainability features
  • Automated hyperparameter tuning
  • Integration with MLflow and other ML platforms
  • Support for time series data
  • Advanced drift detection algorithms

Made with ❤️ for the data science community
