This repository is dedicated to machine learning projects with the primary goal of learning and practicing ML concepts.
The projects included here are meant to:
- Explore different areas of machine learning.
- Practice the full workflow: data preprocessing, exploratory analysis, model development, and evaluation.
- Build both theoretical understanding and practical coding skills.
The main purpose of this repository is to serve as a personal learning space while also providing others with examples and references for their own ML journey.
Each project is a step toward improving skills in data science and machine learning through hands-on implementation.
## Task Description
Given a product and its details, the task is to predict its sales from the available features.
- Dataset Size: ~8k rows
- Training Dataset Link: Google Sheets Dataset
## Dataset Columns

- `Item_Identifier`: Unique identity number for a product
- `Item_Weight`: Weight of the product
- `Item_Fat_Content`: Fat content (Low Fat / Regular)
- `Item_Visibility`: Percentage of store display allocated to the product
- `Item_Type`: Category of the product
- `Item_MRP`: Maximum Retail Price (list price)
- `Outlet_Identifier`: Unique store ID
- `Outlet_Establishment_Year`: Year the store was established
- `Outlet_Size`: Store size in terms of ground area
- `Outlet_Location_Type`: Type of city in which the store is located
- `Outlet_Type`: Type of outlet (grocery store or supermarket)
- `Item_Outlet_Sales`: Target variable, product sales in the store
## Recommended Approaches & Models
- Data Preprocessing: Handle missing values, encode categorical features, and scale numeric values.
- Exploratory Analysis: Identify relationships between `Item_MRP`, `Outlet_Type`, and `Item_Outlet_Sales`.
- Baseline Models: Linear Regression, Decision Tree Regressor
- Advanced Models: Random Forest, XGBoost, LightGBM, CatBoost, Neural Networks (MLP)
- Evaluation Metric: Root Mean Squared Error (RMSE)
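The steps above can be sketched end-to-end with scikit-learn. The frame below is synthetic (random values under a few of the task's column names), so treat it as a template for the real sheet rather than a reference result:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame standing in for the real data (column names from the task).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "Item_Weight": rng.normal(12, 4, n),
    "Item_MRP": rng.uniform(30, 270, n),
    "Outlet_Type": rng.choice(["Grocery Store", "Supermarket Type1"], n),
})
df.loc[rng.choice(n, 50, replace=False), "Item_Weight"] = np.nan  # simulate missing weights
y = 3.5 * df["Item_MRP"] + rng.normal(0, 100, n)  # synthetic target

# Impute + scale numerics, one-hot encode categoricals, then fit a baseline model.
numeric = ["Item_Weight", "Item_MRP"]
categorical = ["Outlet_Type"]
pre = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([("pre", pre), ("rf", RandomForestRegressor(n_estimators=100, random_state=0))])

X_tr, X_te, y_tr, y_te = train_test_split(df, y, random_state=0)
model.fit(X_tr, y_tr)
rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
print(f"RMSE: {rmse:.1f}")
```

Keeping the preprocessing inside the `Pipeline` means the same imputation and encoding are applied consistently at train and predict time.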
## Task Description
Given an image as input, the task is to classify the image into one of four disaster categories:
- CYCLONE
- EARTHQUAKE
- FLOOD
- WILDFIRE
## Dataset Information
- Training Samples: 400 per category
- Validation Samples: 100 per category
- Test Samples: 100 per category
- Dataset Link: Google Drive Dataset
Note: The dataset does not include .csv/.tsv/.txt annotation files. Images are organized in category-named subfolders; this folder structure can be used for labeling.
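Because labels come from the folder layout rather than annotation files, a small standard-library helper can pair each image path with its category name. The directory below is a throwaway mock of the expected layout; real filenames will differ:

```python
import tempfile
from pathlib import Path

CATEGORIES = ["CYCLONE", "EARTHQUAKE", "FLOOD", "WILDFIRE"]

def labeled_images(root: Path) -> list[tuple[Path, str]]:
    """Collect (image_path, label) pairs, using the subfolder name as the label."""
    pairs = []
    for category in CATEGORIES:
        for img in sorted((root / category).glob("*.jpg")):
            pairs.append((img, category))
    return pairs

# Demo on a throwaway directory mimicking the category-named subfolders.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for cat in CATEGORIES:
        (root / cat).mkdir()
        (root / cat / "sample_001.jpg").touch()
    pairs = labeled_images(root)
    print(len(pairs), pairs[0][1])  # 4 CYCLONE
```

The same (path, label) pairs can then feed any framework's dataset loader.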
## Recommended Approaches & Models
- Data Preparation: Image augmentation (rotation, flipping, scaling) and normalization.
- Baseline Models: Custom CNN architectures.
- Pretrained Models (Transfer Learning): ResNet50, VGG16/VGG19, EfficientNet, MobileNetV2.
- Evaluation Metrics: Precision, Recall, and F1-score per category; Macro Precision/Recall/F1 overall.
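With scikit-learn, the per-category and macro metrics above fall out of a couple of calls; the labels and predictions below are invented purely to show the computation:

```python
from sklearn.metrics import classification_report, f1_score

labels = ["CYCLONE", "EARTHQUAKE", "FLOOD", "WILDFIRE"]
# Hypothetical ground truth and predictions for illustration only.
y_true = ["CYCLONE", "FLOOD", "FLOOD", "WILDFIRE", "EARTHQUAKE", "CYCLONE"]
y_pred = ["CYCLONE", "FLOOD", "WILDFIRE", "WILDFIRE", "EARTHQUAKE", "CYCLONE"]

# Per-category precision/recall/F1 plus the macro averages in one report.
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro", zero_division=0)
print(f"Macro F1: {macro_f1:.3f}")
```

`average="macro"` weights every category equally, which matters here since each class has the same sample count anyway but protects against imbalance in general.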
## Objective
Build and evaluate a model that recognizes handwritten text using labeled character images.
## Dataset Information

- Dataset: Kaggle - Handwriting Recognition
- Structure: `train.csv`, `validation.csv`, `test.csv` + image folders
- CSV columns:
  - `filename`: image file path
  - `identity`: label (text/character)
## Workflow Overview
- EDA: Visualize character distribution and sample images with labels.
- Preprocessing: Normalize images, split into train/val/test, apply augmentation.
- Model Selection:
- Baselines: Logistic Regression, SVM (on flattened features)
- Deep Learning: CNNs (e.g., simple CNN stacks/LeNet)
- Advanced: CNN + RNN hybrids (CRNN) for sequence modeling
- Training: Track accuracy/loss, use dropout, early stopping, regularization.
- Optimization: Tune hyperparameters; compare architectures.
- Evaluation: Character-wise F1 scores, confusion matrix, analyze misclassifications.
## Recommended Approaches & Models
- Baseline: Logistic Regression, SVM
- Deep Learning: CNNs; CNN+RNN for sequences
- Modern: Transformer-based OCR approaches
- Metrics: Character-level F1, accuracy, confusion matrix
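As a quick stand-in for the Kaggle character crops, scikit-learn's bundled 8x8 digits exercise the whole baseline recipe: flatten the images, fit Logistic Regression, and compute character-wise F1 plus a confusion matrix:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# 8x8 digit images as a stand-in for the real character crops.
digits = load_digits()
X = digits.images.reshape(len(digits.images), -1)  # flatten 8x8 -> 64 features
X_tr, X_te, y_tr, y_te = train_test_split(
    X, digits.target, random_state=0, stratify=digits.target)

clf = LogisticRegression(max_iter=2000)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

per_class_f1 = f1_score(y_te, y_pred, average=None)  # character-wise F1
cm = confusion_matrix(y_te, y_pred)                  # rows = true, cols = predicted
print(f"Macro F1: {per_class_f1.mean():.3f}")
```

Off-diagonal peaks in `cm` point directly at the misclassification pairs worth inspecting.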
## Description
Generate new poems inspired by Robert Frost's style using next-word prediction.
## Dataset Information

- Source: Project Gutenberg - Robert Frost's poems
- Preprocessing: tokenize words, build vocabulary, convert to sequences
- Split: 80% training / 20% testing
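The preprocessing and split can be sketched with the standard library alone; the one-line corpus here is a stand-in for the Gutenberg text:

```python
import re

corpus = "whose woods these are i think i know his house is in the village though"

# Tokenize, build a word -> id vocabulary, and convert the text to an id sequence.
tokens = re.findall(r"[a-z']+", corpus.lower())
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
sequence = [vocab[w] for w in tokens]

# Next-word training pairs: a fixed-length context window predicts the following word.
WINDOW = 4
pairs = [(sequence[i:i + WINDOW], sequence[i + WINDOW])
         for i in range(len(sequence) - WINDOW)]

split = int(0.8 * len(pairs))  # 80% training / 20% testing, as above
train_pairs, test_pairs = pairs[:split], pairs[split:]
print(len(vocab), len(train_pairs), len(test_pairs))
```

These (context, next-word) pairs are exactly what the BiLSTM below consumes, with the context padded or truncated to a fixed sequence length.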
## Model Architecture
- Baseline: BiLSTM
- Input: sequence length & vocab size
- Hidden: BiLSTM units + dropout
- Output: softmax for next-word prediction
- Training: categorical cross-entropy, Adam, batch size & epochs configurable
## Recommended Approaches & Models
- Baseline: BiLSTM next-word model
- Advanced (Optional): GPT/T5 fine-tuning; experiment with diffusion-style text models
## Evaluation
- Perplexity (next-word prediction quality)
- Accuracy (coherence proxy)
- Generate example poems and qualitatively assess style/fluency
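Perplexity is just the exponential of the average negative log-likelihood the model assigns to each held-out next word; the probabilities below are made up for illustration:

```python
import math

# Probabilities the model assigned to each actual next word (invented values).
token_probs = [0.25, 0.10, 0.50, 0.05, 0.20]

# Perplexity = exp(mean negative log-likelihood); lower is better.
nll = [-math.log(p) for p in token_probs]
perplexity = math.exp(sum(nll) / len(nll))
print(f"Perplexity: {perplexity:.2f}")
```

A model that assigned probability 1 to every next word would score a perplexity of exactly 1, the lower bound.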
## Description
Predict the cuisine of a recipe from its list of ingredients (e.g., Indian, Mexican, Moroccan, Korean, Greek).
## Dataset Information

- Dataset: Google Drive - Recipe Dataset
- Format: JSON
- Fields:
  - `id`: unique recipe identifier
  - `cuisine`: target label (train only)
  - `ingredients`: list of ingredients
## Example (train.json)

```json
{
  "id": 24717,
  "cuisine": "indian",
  "ingredients": [
    "turmeric",
    "vegetable stock",
    "tomatoes",
    "garam masala",
    "naan",
    "red lentils",
    "red chili peppers",
    "onions",
    "spinach",
    "sweet potatoes"
  ]
}
```

## Recommended Approaches & Models
- Baseline: ANN / simple feed-forward neural network with bag-of-words or TF-IDF features
- Intermediate: Embedding layers with CNN or BiLSTM for ingredient sequences
- Advanced (Optional): Transformer-based models
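A minimal TF-IDF baseline can be sketched as below, with Logistic Regression standing in for the feed-forward net and four invented recipes standing in for `train.json`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented recipes; the real data comes from train.json's "ingredients" lists.
recipes = [
    ["turmeric", "garam masala", "naan", "red lentils"],
    ["tortillas", "salsa", "black beans", "cilantro"],
    ["soy sauce", "gochujang", "kimchi", "rice"],
    ["feta", "olives", "oregano", "cucumber"],
]
cuisines = ["indian", "mexican", "korean", "greek"]

# Join each ingredient list into one string so TF-IDF can treat it as a document.
docs = [" ".join(r) for r in recipes]
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(docs, cuisines)

print(model.predict(["garam masala red lentils naan"]))
```

Swapping the classifier for `MLPClassifier` turns this into the feed-forward baseline named above without changing the feature extraction.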
## Evaluation
- Accuracy (for simplicity)
## Description

The goal of this project is to predict whether a flight will be delayed based on historical and contextual flight data.
This is a binary classification problem (Delayed vs On-time).
## Dataset Information

- Dataset Link: Google Drive Dataset
- Dataset: Flight records with the following key fields:
  - `Year`, `Month`, `Day`: flight date
  - `DayOfWeek`: numeric day of the week
  - `Airline`: airline carrier code
  - `FlightNum`: flight number
  - `Origin`: origin airport
  - `Dest`: destination airport
  - `DepTime`: actual departure time
  - `ArrTime`: actual arrival time
  - `DepDelay`: departure delay in minutes
  - `ArrDelay`: arrival delay in minutes (target variable)
- Target Variable: `ArrDelay`
- Convert into binary classification:
  - Delayed if `ArrDelay > 15` minutes
  - On-time otherwise
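The binarization rule above is one line in pandas; the rows here are invented examples, not real flight records:

```python
import pandas as pd

# Toy rows mimicking the flight fields; real data comes from the Drive link.
df = pd.DataFrame({
    "Airline": ["AA", "DL", "UA", "WN"],
    "ArrDelay": [3, 42, 16, -5],
})

# Binarize the target: Delayed (1) if the arrival delay exceeds 15 minutes.
df["Delayed"] = (df["ArrDelay"] > 15).astype(int)
print(df["Delayed"].tolist())  # [0, 1, 1, 0]
```

Note that a delay of exactly 15 minutes counts as On-time under the strict `>` comparison.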
## Evaluation

- Metrics: Accuracy, Precision, Recall, F1-score
- ROC-AUC is also useful, since it handles the likely class imbalance better than accuracy.
## Recommended Approaches & Models

- Baseline: Logistic Regression / Random Forest
- Intermediate: Gradient Boosted Trees (XGBoost, LightGBM, CatBoost)
- Advanced:
- Neural Networks with embedding layers for categorical features
- Sequence models (RNNs/LSTMs) for temporal flight patterns



