Predicting the age of abalones based on physical measurements using PySpark and Gradient Descent variations.
- Description
- Objective
- Dataset Information
- Methodology
- Results and Impact
- Project Structure
- Usage
- Contributing
- License
Welcome to the Abalone Age Prediction project! This repository hosts the code and documentation for a machine learning project that predicts the age of abalones using PySpark and several gradient descent variants.
The primary objective of this project is to develop a robust predictive model capable of accurately estimating the age of abalones based on their physical measurements. The model's application extends to conservation, sustainable resource management, and marine biology research.
Dataset Name: Abalone Data Set
Source: Abalone Data Set on Kaggle by Rodolfo Mendes
Format: Tabular (CSV)
- Rigorous examination for missing values and outliers.
- Visualizations and summary statistics for data distribution and correlations.
- Identified relationships between features and age (Rings).
- Categorical variable encoding (One-Hot Encoding for 'Sex').
- Data scaling and encoding for machine learning models.
- Dataset split into training and testing sets.
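The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the repository's PySpark pipeline: the helper names (`one_hot_sex`, `fit_scaler`, `scale`, `train_test_split`), the 80/20 split, and the toy records are all assumptions made for the example. The three 'Sex' categories (F, I, M) follow the standard Abalone data set.

```python
import random

# Category order is an assumption; any fixed order works.
SEX_CATEGORIES = ["F", "I", "M"]

def one_hot_sex(value):
    """One-hot encode the 'Sex' column as three 0/1 indicators."""
    return [1.0 if value == c else 0.0 for c in SEX_CATEGORIES]

def fit_scaler(rows):
    """Compute per-column mean and standard deviation from training rows."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = []
    for c, m in zip(cols, means):
        var = sum((v - m) ** 2 for v in c) / n
        stds.append(var ** 0.5 or 1.0)  # fall back to 1.0 for constant columns
    return means, stds

def scale(row, means, stds):
    """Standardize one row with statistics fitted on the training set."""
    return [(v - m) / s for v, m, s in zip(row, means, stds)]

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split off a test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Made-up toy records: (sex, length, diameter, rings). Not real data.
records = [
    ("M", 0.40, 0.30, 10),
    ("F", 0.50, 0.40, 12),
    ("I", 0.30, 0.20, 6),
    ("M", 0.45, 0.35, 11),
    ("F", 0.55, 0.45, 14),
]

train_rows, test_rows = train_test_split(records, test_fraction=0.2)

# Fit the scaler on training rows only, so no test information leaks in.
means, stds = fit_scaler([r[1:3] for r in train_rows])
features = [one_hot_sex(r[0]) + scale(r[1:3], means, stds) for r in train_rows]
targets = [r[3] for r in train_rows]
```

Fitting the scaler on the training split only, then reusing its statistics for the test split, keeps the evaluation honest.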
Explored several gradient descent variants for optimization:
- Bold Driver Approach
- Full Batch Gradient Descent
- Stochastic Gradient Descent
- Mini-Batch Gradient Descent
- Adagrad Approach
- RMSprop Approach
- Adam Optimization
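Full Batch, Stochastic, and Mini-Batch Gradient Descent differ only in how many samples feed each parameter update. The sketch below shows this for linear regression with a mean-squared-error loss in plain Python. It is an illustration under assumptions, not the project's PySpark code: the function names, learning rate, epoch count, and noiseless toy data are all hypothetical. Adaptive methods such as Adagrad, RMSprop, and Adam would replace the fixed-learning-rate update step and are not shown here.

```python
import random

def predict(w, b, x):
    """Linear model: dot(w, x) + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def gradient(w, b, batch):
    """Gradient of the mean squared error over a batch of (x, y) pairs."""
    n = len(batch)
    gw = [0.0] * len(w)
    gb = 0.0
    for x, y in batch:
        err = predict(w, b, x) - y
        for j, xj in enumerate(x):
            gw[j] += 2.0 * err * xj / n
        gb += 2.0 * err / n
    return gw, gb

def train(data, batch_size, lr=0.3, epochs=500, seed=0):
    """Gradient descent where batch_size selects the variant:
    len(data) -> full batch, 1 -> stochastic, anything between -> mini-batch.
    """
    rng = random.Random(seed)
    w, b = [0.0] * len(data[0][0]), 0.0
    rows = list(data)
    for _ in range(epochs):
        rng.shuffle(rows)  # new sample order each epoch
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            gw, gb = gradient(w, b, batch)
            w = [wi - lr * gi for wi, gi in zip(w, gw)]
            b -= lr * gb
    return w, b

# Noiseless toy data generated from y = 3x + 1 (made up for the example).
data = [([i / 10.0], 3.0 * (i / 10.0) + 1.0) for i in range(10)]

w_full, b_full = train(data, batch_size=len(data))  # full batch
w_mini, b_mini = train(data, batch_size=4)          # mini-batch
w_sgd, b_sgd = train(data, batch_size=1)            # stochastic
```

On real, noisy data the three variants trade update cost against gradient noise, which is why their results can differ in practice.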
Evaluated each method's performance using MSE, RMSE, and R-squared scores.
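For reference, the three metrics can be computed as in the sketch below. The function name `regression_metrics` and the sample values are illustrative assumptions, not the project's evaluation code or results.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, and R-squared for paired true and predicted values."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mse = ss_res / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot  # undefined if all true values are equal
    return {"mse": mse, "rmse": rmse, "r2": r2}

# Made-up ring counts, for illustration only.
metrics = regression_metrics([9, 10, 11, 12], [9, 10, 11, 14])
```

RMSE is in the same unit as the target (rings), which makes it easier to interpret than MSE. R-squared compares the model against always predicting the mean.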
- Comprehensive comparison table for optimization methods.
- Visualization of model predictions against actual values.
- Identified Mini-Batch Gradient Descent as the best-performing method among those compared.
This project has achieved:
- Accurate age prediction of abalones.
- Robust model adaptability to variations.
- Comprehensive comparison of Gradient Descent optimization methods.
The impact of this work extends to marine conservation, resource management, and research in marine biology.
The repository structure is organized as follows:
- src/: Contains the source code for data preprocessing, model training, and evaluation.
- data/: Stores the dataset used in the project.
- notebooks/: Jupyter notebooks for exploratory data analysis and model development.
- images/: Image files related to the project.
Feel free to explore each directory for detailed information.
To run the project locally, follow these steps:
- Clone the repository (replace "your-username" with your GitHub username):
    git clone https://github.com/your-username/abalone-age-prediction.git
- Navigate to the project directory:
    cd abalone-age-prediction
- Install dependencies:
    pip install -r requirements.txt
- Run the main script:
    python src/main.py
If you'd like to contribute to this project, please follow the guidelines outlined in CONTRIBUTING.md.
This project is licensed under the MIT License.