This project implements a song genre classification system using PySpark MLlib. The objective is to predict the genre of a song based on its lyrics. The pipeline includes data cleaning, preprocessing using TF-IDF, and model training with Logistic Regression. Trained models are saved for reuse, and a Streamlit web app is provided for user interaction. The app allows users to input lyrics, predict the corresponding genre, and visualize the genre compatibility across all 8 genres using a bar chart.
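To make the TF-IDF preprocessing step concrete, here is a minimal pure-Python sketch. It is not the project's code, which relies on PySpark MLlib. It only mirrors the smoothed formula documented for MLlib's IDF, idf(t) = log((N + 1) / (df(t) + 1)), multiplied by the raw term count. The stopword list, the function names, and the toy lyrics are made up for illustration, and lemmatization is left out.

```python
import math
from collections import Counter
from typing import Dict, List

# Tiny illustrative stopword list (an assumption; the project uses NLTK).
STOPWORDS = {"the", "a", "and", "in", "of"}


def preprocess(text: str) -> List[str]:
    """Lowercase, split on whitespace, and drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]


def fit_idf(docs: List[List[str]]) -> Dict[str, float]:
    """Compute smoothed IDF weights: log((N + 1) / (df + 1))."""
    n_docs = len(docs)
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    return {
        term: math.log((n_docs + 1) / (df + 1))
        for term, df in doc_freq.items()
    }


def tfidf(doc: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Weight raw term counts by IDF; terms unseen at fit time are skipped."""
    counts = Counter(doc)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


# Toy lyrics, invented for this example.
lyrics = ["Love the blues", "Love rock and roll"]
tokenized = [preprocess(text) for text in lyrics]
idf_weights = fit_idf(tokenized)
first_vector = tfidf(tokenized[0], idf_weights)
print(first_vector)
```

A term that occurs in every document ("love" here) gets a weight of zero, so it contributes nothing to the classifier's input, while rarer terms are weighted up.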
├── clean_data.py # Cleans raw dataset into a uniform format
├── merge_dataset.py # Merges datasets into a single file (Merged_dataset.csv)
├── classify_logistic.py # PySpark ML pipeline with Logistic Regression
├── app.py # Streamlit app for genre prediction and visualization UI
├── run.bat # Batch file to run the full pipeline
├── genre_classifier_model_logistic/ # Saved PySpark model
├── label_indexer_model_logistic/ # Saved label indexer model
├── vectorizer_model_logistic/ # Saved vectorizer model
├── idf_model_logistic/ # Saved IDF model
├── Merged_dataset.csv # Final dataset used for training
├── student_dataset.csv # Dataset of student-specific lyrics (pre-cleaning)
├── ska_dataset_raw.csv # Original Ska dataset before cleaning
├── tcc_ceds_music.csv # Dataset with 7 genres before merging
├── requirements.txt # Project dependencies
└── README.md # Project documentation

- Python 3.x
- PySpark (MLlib)
- NLTK for preprocessing
- Matplotlib for visualization
- Batch scripting (.bat)
- Python environment
- Streamlit
- Data Cleaning: The clean_data.py script standardizes raw datasets, handling missing values and unifying formats across different datasets.
- Dataset Merging: The merge_dataset.py script merges multiple datasets into one unified dataset (Merged_dataset.csv) for model training.
- Model Training: The classify_logistic.py script:
  - Preprocesses the lyrics (e.g., tokenization, stopword removal, lemmatization)
  - Converts lyrics into TF-IDF vectors
  - Trains a Logistic Regression model
  - Saves the trained models for future use (each in its own directory)
- Streamlit App for Genre Prediction: The app.py script:
  - Uses Streamlit to create an interactive web app where users can input lyrics for song genre prediction.
  - Loads the saved model (genre_classifier_model_logistic), applies the necessary transformations to the input lyrics (using the saved vectorizer and IDF model), and predicts the genre.
  - Displays the predicted genre along with a bar chart of the model's compatibility score for each of the 8 genres.
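The final display step can be sketched in plain Python. This assumes the "compatibility score" is a per-genre probability vector produced by the classifier and that genre names come from the saved label indexer; neither is confirmed here. The genre names, the scores, and the helper functions below are placeholders, not the project's actual values or code.

```python
from typing import List, Tuple

# Placeholder label order (an assumption). In the real app this would be read
# from the saved label indexer rather than hard-coded.
GENRES: List[str] = [
    "pop", "country", "blues", "jazz", "reggae", "rock", "hip hop", "ska",
]


def rank_genres(scores: List[float], labels: List[str]) -> List[Tuple[str, float]]:
    """Pair each score with its genre and sort from highest to lowest."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)


def text_bar_chart(ranked: List[Tuple[str, float]], width: int = 40) -> str:
    """Render a plain-text bar chart, one line per genre."""
    lines = []
    for genre, score in ranked:
        bar = "#" * round(score * width)
        lines.append(f"{genre:<8} {bar} {score:.2f}")
    return "\n".join(lines)


# Made-up scores for one input lyric, in the same order as GENRES.
example_scores = [0.05, 0.10, 0.02, 0.03, 0.04, 0.60, 0.06, 0.10]

ranked = rank_genres(example_scores, GENRES)
predicted_genre = ranked[0][0]  # the highest-scoring genre is the prediction

print(f"Predicted genre: {predicted_genre}")
print(text_bar_chart(ranked))
```

In a Streamlit app, the same (genre, score) pairs could be drawn as a Matplotlib bar chart and rendered with `st.pyplot` instead of being printed as text.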
Clone this project to your local machine using the following command:
git clone https://github.com/Shabthana123/genre-classifier-pyspark.git
cd genre-classifier-pyspark
To run the project:
.\run.bat
To set up a virtual environment, install the dependencies, and start the Streamlit app:
python -m venv venv
venv\Scripts\activate # On Windows
source venv/bin/activate # On macOS/Linux
pip install -r requirements.txt
streamlit run app.py