Flask based ML platform for automatic EDA, Preprocessing, and Hyperparameter Finetuning

msr8/Synapse

Index

  1. Abstract
  2. Usage
  3. Endpoints
  4. Search Space
  5. Screenshots



Abstract

Problem Statement: Extracting meaningful insights from raw datasets and selecting the most effective machine learning models remains a significant challenge for both non-technical and technical users. Non-technical users and analysts often face steep learning curves due to complex tools and the need for coding expertise. Technical users, on the other hand, struggle with fragmented workflows that lack intuitive interfaces for rapid experimentation, hyperparameter tuning, and performance comparison. This disconnect hinders efficient model development and slows down decision-making across teams.

Context and Background: In the current data-driven era, organizations and individuals increasingly rely on data analysis to guide strategic decisions. However, most available tools require programming knowledge or familiarity with data science workflows. This creates a barrier for non-technical users and business professionals who need to make sense of data without specialized skills. Even technical users encounter inefficiencies due to disjointed tools and unintuitive interfaces, making it harder to iterate quickly, fine-tune models, and compare results effectively.

Purpose and Contribution: Synapse aims to democratize data analysis by developing a no-code, web-based platform that enables users to upload datasets, perform exploratory data analysis (EDA), and select the most appropriate machine learning model through a simple, conversational interface. The system bridges the gap between usability and advanced analytics by combining automation with natural language interaction.

Methods and Approach: Synapse includes a user-friendly web interface with two modes: a visual dashboard for EDA and Bayesian optimization, and a chatbot for natural language queries. When a dataset file is uploaded, the system automatically handles data preprocessing such as cleaning, encoding, and scaling. Users can then visualize the dataset and interact with the chatbot. For model selection, Bayesian optimization is used to identify the best-fit classification algorithm and its hyperparameters.

Results and Conclusion: Synapse simplifies complex data tasks, enabling users to analyze and interpret their datasets without writing code. It demonstrates that combining automation, natural language processing, and model optimization can make machine learning more accessible, thereby enhancing decision-making for users across technical and non-technical domains.


Features

  1. Engineered a full-stack real-time ML platform enabling the automation of ML workflows
  2. Integrated Bayesian Optimization to autonomously tune hyperparameters for diverse models
  3. Implemented an automated EDA pipeline generating insightful and interactive visualizations
  4. Developed a robust customisable preprocessing engine with intelligent missing value handling, feature selection, and scaling
  5. Embedded an AI chatbot to provide data-driven insights and statistical interpretations for technical and non-technical users
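
The preprocessing steps named above (missing value handling, encoding, and scaling) can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not Synapse's actual engine: the helper `build_preprocessor`, the column names, and the choice of median and most-frequent imputation are all hypothetical, and feature selection is left out.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(numeric_cols, categorical_cols):
    """Hypothetical helper: impute, encode, and scale a mixed-type table."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),   # fill numeric gaps
        ("scale", StandardScaler()),                    # zero mean, unit variance
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),  # fill with the mode
        ("encode", OneHotEncoder(handle_unknown="ignore")),   # one column per category
    ])
    return ColumnTransformer(
        [("num", numeric, numeric_cols), ("cat", categorical, categorical_cols)],
        sparse_threshold=0.0,  # always return a dense array
    )


# Toy dataset (made up for illustration) with one gap in each column.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "city": ["Pune", "Delhi", np.nan, "Pune"],
})
features = build_preprocessor(["age"], ["city"]).fit_transform(df)
```

Keeping the steps inside a single `ColumnTransformer` means the same fitted transformations can be reapplied to validation data, which matters when the result is fed into cross-validated model tuning.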

Tech Stack

  1. Backend & Frameworks
    • Python (Flask): The core web framework used to build the application
    • Flask-SocketIO: Enables real-time, bi-directional communication for the EDA and training logs
    • Flask-SQLAlchemy: ORM for database management
    • Flask-Dance: Handles Google OAuth 2.0 authentication
  2. Machine Learning & AI
    • Scikit-Learn: Used for standard algorithms (SVM, KNN, Random Forest, etc.) and metrics
    • Scikit-Optimize (skopt): Powers the Bayesian Optimization engine for hyperparameter tuning
    • XGBoost & LightGBM: Advanced gradient boosting frameworks integrated into the pipeline
  3. Visualization
    • Matplotlib & Seaborn: Generates static charts like correlation heatmaps and pairplots (rendered to Base64)
    • Pygal: Used for interactive vector-based (SVG) visualizations
  4. Frontend
    • HTML5 / CSS3 / JavaScript: Core technologies for the user interface
    • GSAP (GreenSock): Used for advanced animations and scroll triggers
    • Motion (Motion One): A modern animation library for UI transitions
    • JSZip & FileSaver.js: Allows users to zip and download generated charts directly from the browser
  5. Database
    • SQLite: A fast and simple database used for storing user data and task information
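
The Visualization entry above mentions static charts rendered to Base64. One common way to do this with Matplotlib is sketched below; the function name `figure_to_base64` and the sample plot are assumptions for illustration and are not taken from the Synapse codebase.

```python
import base64
import io

import matplotlib

matplotlib.use("Agg")  # headless backend, so no display is required on a server
import matplotlib.pyplot as plt


def figure_to_base64(fig):
    """Hypothetical helper: serialise a Matplotlib figure to a Base64 PNG string."""
    buffer = io.BytesIO()
    fig.savefig(buffer, format="png", bbox_inches="tight")
    plt.close(fig)  # release the figure's memory once it has been written
    return base64.b64encode(buffer.getvalue()).decode("ascii")


fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
encoded = figure_to_base64(fig)

# A data URI like this can be placed directly in an <img src="..."> attribute.
data_uri = f"data:image/png;base64,{encoded}"
```

Encoding the image as a string lets a chart travel inside a JSON or WebSocket payload without writing a file to disk.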



Usage

To run the application locally, first ensure that git and Python 3.8+ are installed on your machine. Then run the following commands:

# Clone the repository
git clone https://github.com/msr8/synapse
cd synapse/src
# Install the required dependencies
pip install -r requirements.txt
# Run the flask application
python run.py

The application will be accessible at http://127.0.0.1:5000 in your web browser.

Warning

These instructions are intended for local deployment only. For production deployment, use a production-ready server like Gunicorn or uWSGI, and consider using a reverse proxy like Nginx.



Endpoints

| URL Path | Description |
|---|---|
| / | Landing page |
| /learn-more | Information page about the project |
| /dashboard | User dashboard displaying tasks |
| /login | User login page |
| /signup | User registration page |
| /logout | Logs the user out |
| /login/google-authorised/ | Google OAuth callback URL |
| /task/<int:task_id> | Main interface for a specific task |
| /api/auth/login | API to handle user login |
| /api/auth/signup | API to handle user registration |
| /api/auth/change-username | API to update the current user's username |
| /api/auth/change-password | API to update the current user's password |
| /api/upload | API to handle dataset uploads |
| /api/task/set-target | API to set the target column for a task |
| /api/task/change-taskname | API to rename a specific task |
| /api/task/delete-task | API to delete a task |
| /api/task/chatbot/initialise | API to start the LLM chat session |
| /api/task/chatbot/chat | API to send a message to the chatbot |
| /api/task/chatbot/reset | API to clear chat history |
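
The `/task/<int:task_id>` path uses Flask's integer URL converter. The sketch below shows how such a route behaves in a standalone Flask app. The handler name `task_page` and its JSON response are hypothetical; the real Synapse handler renders the task interface and is not reproduced here.

```python
from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/task/<int:task_id>")
def task_page(task_id):
    # Hypothetical handler: Flask has already converted task_id to an int.
    return jsonify({"task_id": task_id})


# Flask's built-in test client exercises the route without starting a server.
client = app.test_client()
ok = client.get("/task/42")        # matches the route, task_id == 42
missing = client.get("/task/abc")  # not an integer, so no route matches
```

Because the converter only matches digits, a non-numeric ID never reaches the handler and Flask answers with a 404 on its own.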



Search Space

We optimise over the following classification models using Bayesian optimization to find the best model and hyperparameters for a given dataset:

1) K-Nearest-Neighbours

| Hyperparameter | Description | Type | Range / Values |
|---|---|---|---|
| n_neighbors | Number of neighbors to use | Integer | 1 to 30 |
| weights | Weight function used in prediction | Categorical | uniform, distance |
| metric | Distance metric to use | Categorical | chebyshev, cosine, euclidean, manhattan, minkowski, sqeuclidean |

2) Support Vector Machine

| Hyperparameter | Description | Type | Range / Values |
|---|---|---|---|
| C | Regularization parameter | Float | 1e-4 to 1e+4 (log-uniform) |
| kernel | Kernel type to be used | Categorical | rbf, sigmoid, poly |
| degree | Degree of the polynomial kernel | Integer | 1 to 3 |
| gamma | Kernel coefficient | Categorical | scale |

3) Logistic Regression

| Hyperparameter | Description | Type | Range / Values |
|---|---|---|---|
| C | Inverse of regularization strength | Float | 1e-6 to 1e+6 (log-uniform) |
| penalty | Norm used in penalization | Categorical | l1, l2 |
| solver | Optimization algorithm | Categorical | liblinear, saga |

4) Decision Tree

| Hyperparameter | Description | Type | Range / Values |
|---|---|---|---|
| criterion | Function to measure split quality | Categorical | gini, entropy |
| splitter | Strategy used to choose split | Categorical | best, random |
| max_depth | Maximum depth of the tree | Integer | 1 to 10 |
| min_samples_split | Min samples required to split node | Integer | 2 to 10 |
| min_samples_leaf | Min samples required at leaf node | Integer | 1 to 10 |
| max_features | Number of features to consider | Categorical | None, sqrt, log2 |

5) Random Forest

| Hyperparameter | Description | Type | Range / Values |
|---|---|---|---|
| n_estimators | Number of trees in the forest | Integer | 10 to 100 |
| criterion | Function to measure split quality | Categorical | gini, entropy |
| max_depth | Maximum depth of the tree | Integer | 1 to 10 |
| min_samples_split | Min samples required to split node | Integer | 2 to 10 |
| min_samples_leaf | Min samples required at leaf node | Integer | 1 to 10 |
| max_features | Number of features to consider | Categorical | None, sqrt, log2 |

6) Extra Trees

| Hyperparameter | Description | Type | Range / Values |
|---|---|---|---|
| n_estimators | Number of trees in the forest | Integer | 10 to 100 |
| criterion | Function to measure split quality | Categorical | gini, entropy |
| max_depth | Maximum depth of the tree | Integer | 1 to 10 |
| min_samples_split | Min samples required to split node | Integer | 2 to 10 |
| min_samples_leaf | Min samples required at leaf node | Integer | 1 to 10 |
| max_features | Number of features to consider | Categorical | None, sqrt, log2 |

7) Gradient Boosting

| Hyperparameter | Description | Type | Range / Values |
|---|---|---|---|
| n_estimators | Number of boosting stages | Integer | 10 to 100 |
| learning_rate | Shrinks contribution of each tree | Float | 1e-6 to 1 (log-uniform) |
| max_depth | Maximum depth of estimators | Integer | 1 to 10 |
| min_samples_split | Min samples required to split node | Integer | 2 to 10 |
| min_samples_leaf | Min samples required at leaf node | Integer | 1 to 10 |
| max_features | Number of features to consider | Categorical | None, sqrt, log2 |

8) Light Gradient Boosting Machine (LGBM)

| Hyperparameter | Description | Type | Range / Values |
|---|---|---|---|
| n_estimators | Number of boosted trees | Integer | 10 to 100 |
| learning_rate | Boosting learning rate | Float | 1e-6 to 1 (log-uniform) |
| max_depth | Maximum tree depth | Integer | -1 to 15 |
| num_leaves | Max tree leaves for base learners | Integer | 10 to 50 |
| min_child_samples | Min data needed in a leaf | Integer | 5 to 20 |
| subsample | Subsample ratio of training instance | Float | 0.5 to 1.0 |
| colsample_bytree | Subsample ratio of columns per tree | Float | 0.5 to 1.0 |
| reg_alpha | L1 regularization term | Float | 0.0 to 5.0 |
| reg_lambda | L2 regularization term | Float | 0.0 to 5.0 |

9) Ada Boost

| Hyperparameter | Description | Type | Range / Values |
|---|---|---|---|
| n_estimators | Maximum number of estimators | Integer | 10 to 100 |
| learning_rate | Weight applied to each classifier | Float | 1e-6 to 1 (log-uniform) |

10) Bagging

| Hyperparameter | Description | Type | Range / Values |
|---|---|---|---|
| n_estimators | Number of base estimators | Integer | 10 to 100 |
| max_samples | Number of samples to draw | Float | 0.1 to 1.0 |
| max_features | Number of features to draw | Float | 0.1 to 1.0 |
| bootstrap | Draw samples with replacement | Boolean | True, False |
| bootstrap_features | Draw features with replacement | Boolean | True, False |
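
Several ranges above are marked log-uniform. The sketch below illustrates what that prior means, using only the Python standard library: the exponent is sampled uniformly, so each order of magnitude is equally likely. It is an illustration of the sampling idea, not the scikit-optimize implementation and not Synapse's code; the helper `sample_log_uniform` is hypothetical.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the run is repeatable

# The SVM range for C from the table above: 1e-4 to 1e+4.
samples = [sample_log_uniform(1e-4, 1e4, rng) for _ in range(10_000)]

# About half of the draws fall below 1.0, the geometric midpoint of the range.
# A plain uniform draw on the same range would almost never land below 1.0.
share_below_one = sum(s < 1.0 for s in samples) / len(samples)
```

This is why log-uniform priors suit parameters such as `C` and `learning_rate`, where a change from 0.001 to 0.01 matters as much as a change from 100 to 1000.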



Screenshots

Figure 1: Landing Page

Figure 2: Configurable Options for EDA, Preprocessing, and Bayesian Optimization

Figure 3: Target Column Selection

Figure 4: Feature Columns Distributions

Figure 5: Correlation Heatmap and Mutual Information Heatmap

Figure 6: Pairplot Visualization

Figure 7: Real-time Bayesian Optimization Results

Figure 8: Chatbot Interface

Figure 9: User Dashboard

Figure 10: Learn More Page

Figure 11: FAQs
