This project focuses on sentiment analysis of movie reviews using various machine learning models. It includes data preprocessing, model training, hyperparameter tuning, and a web interface for users to input their reviews and receive sentiment predictions.
The goal of this project is to classify movie reviews as positive or negative. Five classifiers are trained and evaluated: Naive Bayes, Support Vector Machine (SVM), Random Forest, Logistic Regression, and Gradient Boosting. The best models are saved, and a voting ensemble of these classifiers is built to improve accuracy.
The dataset used in this project is a subset of the IMDB movie reviews dataset, which includes both positive and negative reviews. The data is cleaned, tokenized, lemmatized, and vectorized using methods like Bag of Words, TF-IDF, and Word2Vec.
Dataset: IMDB Dataset of Movie Reviews
To run this project locally, follow these steps:
- Clone the repository:

      git clone https://github.com/your-username/sentiment-analysis.git
      cd sentiment-analysis

- Create and activate a virtual environment (optional but recommended):

      python3 -m venv env
      source env/bin/activate  # On Windows, use `env\Scripts\activate`

- Install the required packages:

      pip install -r requirements.txt

- Download NLTK data:

      import nltk
      nltk.download('punkt')
      nltk.download('stopwords')
      nltk.download('wordnet')
      nltk.download('averaged_perceptron_tagger')
The project is organized as follows:
sentiment_analysis/
│
├── saved_models/ # Directory containing the trained machine learning models
├── IMDB.csv # Dataset file with movie reviews for sentiment analysis
├── sentiment_analysis.ipynb # Jupyter Notebook with the complete sentiment analysis code
├── README.md # Project overview and instructions
├── app.py # Flask application script for web interface
├── templates/ # Directory containing HTML templates for the web app
└── static/ # Directory containing static files like CSS for the web app
The following models were trained and evaluated:
- Naive Bayes
- Support Vector Machine (SVM)
- Logistic Regression
- Random Forest
- Gradient Boosting
Additionally, a Voting Classifier (Ensemble Model) was created by combining the predictions of the above models.
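As a rough illustration of how such an ensemble can be assembled, the sketch below wraps the five model types in scikit-learn's `VotingClassifier` behind a TF-IDF vectorizer. The toy reviews, the choice of hard (majority) voting, and every hyperparameter are assumptions made for the example; the notebook's actual configuration may differ.

```python
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data standing in for the IMDB reviews (hypothetical examples).
reviews = [
    "a wonderful, moving film with great acting",
    "great story and a wonderful cast",
    "loved it, brilliant and moving",
    "terrible plot and awful acting",
    "boring, awful, a complete waste of time",
    "hated it, terrible and boring",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

# Hard voting: each model casts one vote and the majority label wins.
ensemble = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("svm", LinearSVC()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ],
    voting="hard",
)

# The pipeline vectorizes raw text before it reaches the ensemble.
model = make_pipeline(TfidfVectorizer(), ensemble)
model.fit(reviews, labels)

predictions = model.predict(["a wonderful and moving story", "awful and boring"])
print(predictions)
```

Hard voting is used here because `LinearSVC` does not expose class probabilities; soft voting would require estimators that all implement `predict_proba`.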
The dataset is preprocessed using techniques like:
- Text cleaning: Removing HTML tags, punctuation, and stopwords.
- Tokenization: Splitting text into individual words.
- Lemmatization: Reducing words to their base forms.
- Vectorization: Converting text into numerical features using Bag of Words, TF-IDF, and Word2Vec.
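The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the notebook's code: the stopword set is a small hypothetical stand-in for NLTK's list, the `preprocess` and `bag_of_words` names are invented for the example, and lemmatization is left as an optional callable so the sketch runs without downloading WordNet data.

```python
import html
import re
from collections import Counter

# Small illustrative stopword set (assumption); NLTK's list is much longer.
STOPWORDS = {"a", "an", "and", "the", "is", "was", "it", "this", "of", "to", "in"}


def preprocess(review, lemmatize=None):
    """Clean a raw review and return a list of tokens."""
    text = re.sub(r"<[^>]+>", " ", review)  # strip HTML tags such as <br />
    text = html.unescape(text).lower()      # decode entities like &amp;
    tokens = re.findall(r"[a-z]+", text)    # tokenize; drops punctuation and digits
    tokens = [t for t in tokens if t not in STOPWORDS]
    if lemmatize is not None:
        # For example nltk.stem.WordNetLemmatizer().lemmatize,
        # once the 'wordnet' data has been downloaded.
        tokens = [lemmatize(t) for t in tokens]
    return tokens


def bag_of_words(tokens):
    """Map each token to its count in the review."""
    return Counter(tokens)


tokens = preprocess("This movie was <br /> GREAT, and the acting was great!")
counts = bag_of_words(tokens)
print(tokens)
print(counts)
```

The resulting counts are the Bag of Words representation of a single review; TF-IDF reweights the same counts by how rare each word is across the whole corpus.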
The performance of each model is assessed on the test set using metrics such as accuracy, precision, recall, and F1-score.
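For reference, all four metrics follow from the counts of true and false positives and negatives. The sketch below computes them in plain Python for a binary task; the function name and the example labels are hypothetical and are not results from this project.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # Precision: of the reviews predicted positive, how many really are.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the truly positive reviews, how many were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With these made-up labels there are 3 true positives, 2 false positives, 1 false negative, and 2 true negatives, so accuracy is 0.625, precision 0.6, and recall 0.75.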
A Flask web app is available for users to input their reviews and view the sentiment predictions from each model.
To run the web application:
- Navigate to the project directory.

- Start the Flask app by running the following command:

      python app.py

- Open a browser and go to http://127.0.0.1:5000/.

- Enter a review in the text box and submit it to see the sentiment predictions.
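To give a sense of how a form submission can reach a model, here is a minimal Flask sketch. It is not the project's `app.py`: the `/predict` route, the `review` form field, and the keyword-based `predict_sentiment` stub are all assumptions, and the sketch returns JSON where the real application renders HTML templates.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_sentiment(review):
    """Hypothetical stand-in for a trained model loaded from saved_models/."""
    positive_words = {"great", "good", "wonderful", "loved"}
    hits = sum(word in positive_words for word in review.lower().split())
    return "positive" if hits > 0 else "negative"


@app.route("/predict", methods=["POST"])
def predict():
    # Read the review text from the submitted form (field name is assumed).
    review = request.form.get("review", "").strip()
    if not review:
        return jsonify(error="Please enter a review."), 400
    return jsonify(review=review, sentiment=predict_sentiment(review))
```

Calling `app.run()` at the end of such a file would serve it on Flask's default address, http://127.0.0.1:5000/. In a real deployment the stub would be replaced by the saved preprocessing step and model.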
The project includes the following visualizations:
- Word Cloud: Displays common words in the dataset.
- Bar Plot: Shows the most frequent words.
- Confusion Matrices: Visualize model performance.
The ensemble model achieved the highest accuracy and is used as the default model in the web application.