🎶 Song Genre Classification with PySpark

This project implements a song genre classification system using PySpark MLlib. The objective is to predict the genre of a song based on its lyrics. The pipeline includes data cleaning, preprocessing using TF-IDF, and model training with Logistic Regression. Trained models are saved for reuse, and a Streamlit web app is provided for user interaction. The app allows users to input lyrics, predict the corresponding genre, and visualize the genre compatibility across all 8 genres using a bar chart.

📁 Project Structure

├── clean_data.py                      # Cleans raw dataset into a uniform format
├── merge_dataset.py                   # Merges datasets into a single file (Merged_dataset.csv)
├── classify_logistic.py               # PySpark ML pipeline with Logistic Regression
├── app.py                             # # Streamlit app for genre prediction and visualization UI
├── run.bat                            # Batch file to run the full pipeline
├── genre_classifier_model_logistic/   # Saved PySpark model
├── label_indexer_model_logistic/      # Saved label indexer model
├── vectorizer_model_logistic/         # Saved vectorizer model
├── idf_model_logistic/                # Saved IDF model
├── Merged_dataset.csv                 # Final dataset used for training
├── student_dataset.csv                # Dataset of student-specific lyrics (pre-cleaning)
├── ska_dataset_raw.csv                # Original Ska dataset before cleaning
├── tcc_ceds_music.csv                 # Dataset with 7 genres before merging
├── requirements.txt                   # Project dependencies
└── README.md                          # Project documentation

⚙️ Technologies Used

Python 3.x
PySpark (MLlib)
NLTK for preprocessing
Matplotlib for visualization
Batch scripting (.bat)
Python environment
Streamlit

⚙️ How It Works

Data Cleaning: clean_data.py script standardizes raw datasets, handling missing values and unifying formats across different datasets.
Dataset Merging: The merge_dataset.py script merges multiple datasets into one unified dataset (Merged_dataset.csv) for model training.
Model Training: The classify_logistic.py script:
- Preprocesses the lyrics (e.g., tokenization, stopword removal, lemmatization)
- Converts lyrics into TF-IDF vectors
- Trains a Logistic Regression model
- Saves the trained models for future use (saved in respective directories)
treamlit App for Genre Prediction: The app.py script:
- Uses Streamlit to create an interactive web app where users can input lyrics for song genre prediction.
- The app loads the saved model (genre_classifier_model_logistic), applies the necessary transformations to the input lyrics (using the saved vectorizer and IDF model), and predicts the genre.
- It displays the predicted genre along with a bar chart visualizing the model's compatibility score with all 8 genres, offering a clear and interactive user experience.

🚀 Getting Started

1. Clone the repository

Clone this project to your local machine using the following command:

git clone https://github.com/Shabthana123/genre-classifier-pyspark.git
cd Song-Genre-Classification-with-PySpark

🚀 Quick Start

To run the project:

.\run.bat

🧰 Manual Setup (Alternative to run.bat)

Step 1: Create and activate virtual environment

python -m venv venv venv\Scripts\activate # On Windows
source venv/bin/activate # On macOS/Linux

Step 2: Install dependencies

pip install -r requirements.txt

Step 3: Run the Streamlit app

streamlit run app.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🎶 Song Genre Classification with PySpark

📁 Project Structure

⚙️ Technologies Used

⚙️ How It Works

🚀 Getting Started

1. Clone the repository

🚀 Quick Start

🧰 Manual Setup (Alternative to run.bat)

Step 1: Create and activate virtual environment

Step 2: Install dependencies

Step 3: Run the Streamlit app

About

Uh oh!

Releases

Packages

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
genre_classifier_model_logistic		genre_classifier_model_logistic
idf_model_logistic		idf_model_logistic
label_indexer_model_logistic		label_indexer_model_logistic
vectorizer_model_logistic		vectorizer_model_logistic
Merged_dataset.csv		Merged_dataset.csv
README.md		README.md
Student_dataset.csv		Student_dataset.csv
app.py		app.py
classify_logistic.py		classify_logistic.py
clean_data.py		clean_data.py
merge_dataset.py		merge_dataset.py
requirements.txt		requirements.txt
run.bat		run.bat
ska.csv		ska.csv
tcc_ceds_music.csv		tcc_ceds_music.csv

Shabthana123/Song-Genre-Classification-with-PySpark

Folders and files

Latest commit

History

Repository files navigation

🎶 Song Genre Classification with PySpark

📁 Project Structure

⚙️ Technologies Used

⚙️ How It Works

🚀 Getting Started

1. Clone the repository

🚀 Quick Start

🧰 Manual Setup (Alternative to run.bat)

Step 1: Create and activate virtual environment

Step 2: Install dependencies

Step 3: Run the Streamlit app

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages