Index

- Abstract
- Features
- Tech Stack
- Usage
- Endpoints
- Search Space
- Screenshots

Abstract

Problem Statement: Extracting meaningful insights from raw datasets and selecting the most effective machine learning models remain significant challenges for both non-technical and technical users. Non-technical users and analysts often face steep learning curves due to complex tools and the need for coding expertise. Technical users, on the other hand, struggle with fragmented workflows that lack intuitive interfaces for rapid experimentation, hyperparameter tuning, and performance comparison. This disconnect hinders efficient model development and slows down decision-making across teams.

Context and Background: Organizations and individuals increasingly rely on data analysis to guide strategic decisions. However, most available tools require programming knowledge or familiarity with data science workflows. This creates a barrier for non-technical users and business professionals who need to make sense of data without specialized skills. Even technical users encounter inefficiencies due to disjointed tools and unintuitive interfaces, making it harder to iterate quickly, fine-tune models, and compare results effectively.

Purpose and Contribution: Synapse aims to democratize data analysis through a no-code, web-based platform that enables users to upload datasets, perform exploratory data analysis (EDA), and select the most appropriate machine learning model through a simple, conversational interface. The system bridges the gap between usability and advanced analytics by combining automation with natural language interaction.

Methods and Approach: Synapse includes a user-friendly web interface with two modes: a visual dashboard for EDA and Bayesian optimization, and a chatbot for natural language queries. When a dataset file is uploaded, the system automatically handles data preprocessing steps such as cleaning, encoding, and scaling. Users can then visualize the dataset and interact with the chatbot. For model selection, Bayesian optimization is used to identify the best-fit classification algorithm and its hyperparameters.

Results and Conclusion: Synapse simplifies complex data tasks, enabling users to analyze and interpret their datasets without writing code. It demonstrates that combining automation, natural language processing, and model optimization can make machine learning more accessible, thereby enhancing decision-making for both technical and non-technical users.

Features

- Engineered a full-stack, real-time ML platform that automates ML workflows
- Integrated Bayesian Optimization to autonomously tune hyperparameters for diverse models
- Implemented an automated EDA pipeline that generates interactive visualizations
- Developed a robust, customisable preprocessing engine with intelligent missing value handling, feature selection, and scaling
- Embedded an AI chatbot to provide data-driven insights and statistical interpretations for technical and non-technical users
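
This README does not show how the preprocessing engine is implemented. As a rough illustration only, the sketch below assumes a scikit-learn `ColumnTransformer` that imputes missing values, one-hot encodes categorical columns, and scales numeric ones. The column names, the toy data, and the choice of imputation strategies are hypothetical and are not taken from Synapse's code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy dataset with one missing value in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Pune", "Delhi", np.nan, "Pune"],
})
numeric_columns = ["age"]
categorical_columns = ["city"]

# Numeric columns: fill gaps with the median, then standardise.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_columns),
    ("cat", categorical_steps, categorical_columns),
])

# One scaled numeric column plus one column per city category.
features = preprocess.fit_transform(df)
```

Keeping the steps in a single transformer means the same fitted object can be reused on new data, which avoids leaking statistics from a test split into training.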

Tech Stack

Backend & Frameworks
- Python (Flask): The core web framework used to build the application
- Flask-SocketIO: Enables real-time, bi-directional communication for the EDA and training logs
- Flask-SQLAlchemy: ORM for database management
- Flask-Dance: Handles Google OAuth 2.0 authentication
Machine Learning & AI
- Scikit-Learn: Used for standard algorithms (SVM, KNN, Random Forest, etc.) and metrics
- Scikit-Optimize (skopt): Powers the Bayesian Optimization engine for hyperparameter tuning
- XGBoost & LightGBM: Advanced gradient boosting frameworks integrated into the pipeline
Visualization
- Matplotlib & Seaborn: Generates static charts like correlation heatmaps and pairplots (rendered to Base64)
- Pygal: Used for interactive vector-based (SVG) visualizations
Frontend
- HTML5 / CSS3 / JavaScript: Core technologies for the user interface
- GSAP (GreenSock): Used for advanced animations and scroll triggers
- Motion (Motion One): A modern animation library for UI transitions
- JSZip & FileSaver.js: Allows users to zip and download generated charts directly from the browser
Database
- SQLite: A fast and simple database used for storing user data and task information
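
The Visualization entry above mentions static charts rendered to Base64. A minimal sketch of that idea, assuming Matplotlib with the non-interactive `Agg` backend, is shown below. The helper name `figure_to_base64` and the random data are hypothetical, and Synapse's own rendering code may differ.

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless backend, suitable for a web server
import matplotlib.pyplot as plt
import numpy as np


def figure_to_base64(fig):
    """Serialise a Matplotlib figure to a Base64-encoded PNG string (hypothetical helper)."""
    buffer = io.BytesIO()
    fig.savefig(buffer, format="png", bbox_inches="tight")
    plt.close(fig)  # release the figure's memory once it has been saved
    return base64.b64encode(buffer.getvalue()).decode("ascii")


# Random stand-in data: 100 rows, 4 numeric features.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 4))
correlation = np.corrcoef(data, rowvar=False)

# Draw a simple correlation heatmap.
fig, ax = plt.subplots()
image = ax.imshow(correlation, vmin=-1, vmax=1, cmap="coolwarm")
fig.colorbar(image, ax=ax)

encoded = figure_to_base64(fig)
data_uri = f"data:image/png;base64,{encoded}"  # usable as an <img> src attribute
```

Embedding the chart as a data URI lets the server return the image inside a JSON or Socket.IO message without writing a file to disk.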

Usage

To run the application locally, follow these steps:

First, ensure that Git and Python 3.8+ are installed on your machine. Then, run the following commands:

# Clone the repository
git clone https://github.com/msr8/synapse
cd synapse/src
# Install the required dependencies
pip install -r requirements.txt
# Run the flask application
python run.py

The application will be accessible at http://127.0.0.1:5000 in your web browser.

Warning

These instructions are intended for local deployment only. For production deployment, use a production-ready server such as Gunicorn or uWSGI, and consider placing it behind a reverse proxy such as Nginx.

Endpoints

URL Path	Description
`/`	Landing page
`/learn-more`	Information page about the project
`/dashboard`	User dashboard displaying tasks
`/login`	User login page
`/signup`	User registration page
`/logout`	Logs the user out
`/login/google-authorised/`	Google OAuth callback URL
`/task/<int:task_id>`	Main interface for a specific task
`/api/auth/login`	API to handle user login
`/api/auth/signup`	API to handle user registration
`/api/auth/change-username`	API to update the current user's username
`/api/auth/change-password`	API to update the current user's password
`/api/upload`	API to handle dataset uploads
`/api/task/set-target`	API to set the target column for a task
`/api/task/change-taskname`	API to rename a specific task
`/api/task/delete-task`	API to delete a task
`/api/task/chatbot/initialise`	API to start the LLM chat session
`/api/task/chatbot/chat`	API to send a message to the chatbot
`/api/task/chatbot/reset`	API to clear chat history

Search Space

We optimise over the following classification models using Bayesian optimization to find the best model and hyperparameters for a given dataset:

1) K-Nearest Neighbours

Hyperparameter	Description	Type	Range / Values
`n_neighbors`	Number of neighbors to use	Integer	1 to 30
`weights`	Weight function used in prediction	Categorical	`uniform`, `distance`
`metric`	Distance metric to use	Categorical	`chebyshev`, `cosine`, `euclidean`, `manhattan`, `minkowski`, `sqeuclidean`

2) Support Vector Machine

Hyperparameter	Description	Type	Range / Values
`C`	Regularization parameter	Float	1e-4 to 1e+4 (log-uniform)
`kernel`	Kernel type to be used	Categorical	`rbf`, `sigmoid`, `poly`
`degree`	Degree of the polynomial kernel	Integer	1 to 3
`gamma`	Kernel coefficient	Categorical	`scale`
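
Several ranges in this section are marked "log-uniform", meaning the logarithm of the value is sampled uniformly, so each order of magnitude is equally likely. The standard-library sketch below illustrates this for the SVM `C` range of 1e-4 to 1e+4. The function name `sample_log_uniform` is hypothetical, and the sketch only illustrates the distribution, not how Scikit-Optimize proposes points.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose base-10 logarithm is uniform on [log10(low), log10(high)]."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** exponent


rng = random.Random(0)
samples = [sample_log_uniform(1e-4, 1e4, rng) for _ in range(10_000)]

# Roughly half of the draws fall below 1.0, since 1.0 is the midpoint on a log scale.
share_below_one = sum(value < 1.0 for value in samples) / len(samples)
```

Under a plain uniform prior on the same range, almost every draw would exceed 1.0, so small regularization values would rarely be tried.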

3) Logistic Regression

Hyperparameter	Description	Type	Range / Values
`C`	Inverse of regularization strength	Float	1e-6 to 1e+6 (log-uniform)
`penalty`	Norm used in penalization	Categorical	`l1`, `l2`
`solver`	Optimization algorithm	Categorical	`liblinear`, `saga`

4) Decision Tree

Hyperparameter	Description	Type	Range / Values
`criterion`	Function to measure split quality	Categorical	`gini`, `entropy`
`splitter`	Strategy used to choose split	Categorical	`best`, `random`
`max_depth`	Maximum depth of the tree	Integer	1 to 10
`min_samples_split`	Min samples required to split node	Integer	2 to 10
`min_samples_leaf`	Min samples required at leaf node	Integer	1 to 10
`max_features`	Number of features to consider	Categorical	`None`, `sqrt`, `log2`

5) Random Forest

Hyperparameter	Description	Type	Range / Values
`n_estimators`	Number of trees in the forest	Integer	10 to 100
`criterion`	Function to measure split quality	Categorical	`gini`, `entropy`
`max_depth`	Maximum depth of the tree	Integer	1 to 10
`min_samples_split`	Min samples required to split node	Integer	2 to 10
`min_samples_leaf`	Min samples required at leaf node	Integer	1 to 10
`max_features`	Number of features to consider	Categorical	`None`, `sqrt`, `log2`

6) Extra Trees

Hyperparameter	Description	Type	Range / Values
`n_estimators`	Number of trees in the forest	Integer	10 to 100
`criterion`	Function to measure split quality	Categorical	`gini`, `entropy`
`max_depth`	Maximum depth of the tree	Integer	1 to 10
`min_samples_split`	Min samples required to split node	Integer	2 to 10
`min_samples_leaf`	Min samples required at leaf node	Integer	1 to 10
`max_features`	Number of features to consider	Categorical	`None`, `sqrt`, `log2`

7) Gradient Boosting

Hyperparameter	Description	Type	Range / Values
`n_estimators`	Number of boosting stages	Integer	10 to 100
`learning_rate`	Shrinks contribution of each tree	Float	1e-6 to 1 (log-uniform)
`max_depth`	Maximum depth of estimators	Integer	1 to 10
`min_samples_split`	Min samples required to split node	Integer	2 to 10
`min_samples_leaf`	Min samples required at leaf node	Integer	1 to 10
`max_features`	Number of features to consider	Categorical	`None`, `sqrt`, `log2`

8) Light Gradient Boosting Machine (LGBM)

Hyperparameter	Description	Type	Range / Values
`n_estimators`	Number of boosted trees	Integer	10 to 100
`learning_rate`	Boosting learning rate	Float	1e-6 to 1 (log-uniform)
`max_depth`	Maximum tree depth	Integer	-1 to 15
`num_leaves`	Max tree leaves for base learners	Integer	10 to 50
`min_child_samples`	Min data needed in a leaf	Integer	5 to 20
`subsample`	Subsample ratio of training instance	Float	0.5 to 1.0
`colsample_bytree`	Subsample ratio of columns per tree	Float	0.5 to 1.0
`reg_alpha`	L1 regularization term	Float	0.0 to 5.0
`reg_lambda`	L2 regularization term	Float	0.0 to 5.0

9) AdaBoost

Hyperparameter	Description	Type	Range / Values
`n_estimators`	Maximum number of estimators	Integer	10 to 100
`learning_rate`	Weight applied to each classifier	Float	1e-6 to 1 (log-uniform)

10) Bagging

Hyperparameter	Description	Type	Range / Values
`n_estimators`	Number of base estimators	Integer	10 to 100
`max_samples`	Number of samples to draw	Float	0.1 to 1.0
`max_features`	Number of features to draw	Float	0.1 to 1.0
`bootstrap`	Draw samples with replacement	Boolean	`True`, `False`
`bootstrap_features`	Draw features with replacement	Boolean	`True`, `False`