This project trains a machine learning model to classify YouTube comments into three categories: Positive, Negative, or Neutral. It includes scripts for training a sentiment analysis model from scratch and for predicting the sentiment of new, unseen comments.
A pre-trained model is included in the /models directory, so you can immediately run predictions without training it yourself.
- Data Processing: Cleans and preprocesses text data using techniques like stopword removal, lemmatization, and negation handling.
- Model Training: Trains and tunes multiple classifiers (Logistic Regression, Naive Bayes) and an ensemble Voting Classifier.
- Model Evaluation: Generates classification reports and confusion matrices to evaluate model performance.
- Inference: Provides a simple command-line interface to predict the sentiment of your own sentences using the best-performing trained model.
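To make the data processing step more concrete, here is a minimal sketch of stopword removal combined with negation handling. It is not the code from this repository: the function name `preprocess`, the tiny stopword list, and the `not_` prefix convention are all assumptions chosen for illustration, and lemmatization is left out to keep the example free of external dependencies.

```python
import re
from typing import List

# Tiny illustrative stopword list (assumption); a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "this", "that", "it", "to", "of", "and"}
NEGATIONS = {"not", "no", "never"}


def preprocess(comment: str) -> List[str]:
    """Lowercase, tokenize, drop stopwords, and tag the word that follows a negation."""
    # Expand contractions such as "don't" into "do not" so the negation is visible.
    text = re.sub(r"n't\b", " not", comment.lower())
    tokens = re.findall(r"[a-z]+", text)

    result = []
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True  # remember the negation and attach it to the next kept word
            continue
        if token in STOPWORDS:
            continue
        result.append(f"not_{token}" if negate else token)
        negate = False
    return result
```

With this convention, "This is not good" becomes the single token `not_good`, so a bag-of-words classifier can tell it apart from "good".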
- Python 3.8+
- Git
First, clone the repository to your local machine:
git clone https://github.com/DavidTalevski/youtube-comment-sentiment-classifier
cd youtube-comment-sentiment-classifier
Next, install the required Python packages using requirements.txt:
pip install -r requirements.txt
The model included in this repository is already trained. However, if you wish to train the model yourself, you must download the dataset.
- Download the dataset from Kaggle: YouTube Comments Sentiment Dataset.
- Place the downloaded file into the /data directory.
- Ensure the file is named exactly youtube_comments.csv.
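Before starting a training run, it can help to confirm that the dataset is where the training script expects it. The sketch below is one way to do that using only the standard library. The helper name `read_header` is hypothetical, and because the column names of the Kaggle file are not stated here, the function only returns whatever header row it finds.

```python
import csv
from pathlib import Path


def read_header(csv_path):
    """Return the CSV header row, or raise FileNotFoundError if the file is missing."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"Expected dataset at {path}")
    with path.open(newline="", encoding="utf-8") as handle:
        # An empty file yields an empty header instead of raising StopIteration.
        return next(csv.reader(handle), [])
```

From the repository root, calling `read_header("data/youtube_comments.csv")` should list the column names if the file was placed and named correctly.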
There are two main ways to use this project.
To test the included model with your own sentences, run the interactive prediction script:
python src/predict_comment.py
The script will load the saved model from the /models folder and prompt you to enter a comment. Type your sentence and press Enter to see the predicted sentiment. To exit the script, type quit or exit.
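The prompt-and-exit behavior described above can be sketched as a small loop. This is an illustration only, not the contents of src/predict_comment.py: the function name `run_prompt`, the prompt text, and the `predict` callable (a stand-in for the loaded model) are assumptions.

```python
def run_prompt(predict, read=input, write=print):
    """Keep asking for comments until the user types quit or exit."""
    while True:
        comment = read("Enter a comment: ").strip()
        if comment.lower() in {"quit", "exit"}:
            break
        if comment:  # ignore empty lines instead of classifying them
            write(f"Predicted sentiment: {predict(comment)}")
```

Passing `read` and `write` as parameters keeps the loop easy to test without a terminal; in normal use the defaults `input` and `print` apply.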
If you have downloaded the dataset as described above, you can train the model from scratch. This will overwrite the existing model files in the /models directory.
python src/youtube_comment_sentiment.py
The script will preprocess the data, train the classifiers, evaluate their performance, and save the best-performing model for future use.
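To show what the evaluation output means, the following library-free sketch builds a confusion matrix and derives per-class precision and recall from it. The training script most likely relies on a library for this, so treat the function names `confusion_matrix` and `precision_recall` and the label order as assumptions made for the example.

```python
from collections import Counter

LABELS = ["Positive", "Negative", "Neutral"]  # assumed label order


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def precision_recall(matrix, index):
    """Precision and recall for the class at the given row/column index."""
    true_positives = matrix[index][index]
    predicted = sum(row[index] for row in matrix)  # column total
    actual = sum(matrix[index])                    # row total
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall
```

For example, if two comments are truly Positive but only one is predicted as Positive (and no other comment is), the Positive class has precision 1.0 and recall 0.5.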