This project builds a Sentiment Analysis Model using the IMDb dataset. The main goal is to classify movie reviews as positive or negative. We use Natural Language Processing (NLP) techniques along with Machine Learning models to analyze sentiment.
Marimo is a Python-based interactive computing framework for creating and sharing computational notebooks. Unlike traditional notebooks such as Jupyter, Marimo emphasizes modularity and reactivity.
- Download & Extract the IMDb dataset.
- Load the dataset into pandas DataFrames.
- Perform Exploratory Data Analysis (EDA) to understand the data.
- Build & Train a sentiment analysis model using a scikit-learn pipeline.
- Evaluate the model on the test data.
- Improve the model using hyperparameter tuning.
- Explain model predictions using LIME.
- Source: Stanford IMDb dataset
- Structure:
- 50,000 movie reviews (25,000 for training, 25,000 for testing)
- Balanced dataset (equal positive and negative reviews)
- Reviews stored as raw text files
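Since the reviews ship as individual text files, loading them into a DataFrame means walking the extracted folders. The sketch below assumes the standard `aclImdb/<split>/<pos|neg>/*.txt` layout of the Stanford archive; the function name `load_split` and the column names `review` and `sentiment` are illustrative choices, not necessarily the ones used in the notebook.

```python
from pathlib import Path

import pandas as pd


def load_split(root, split):
    """Read one split ("train" or "test") of the extracted dataset folder.

    Assumes reviews live in <root>/<split>/neg and <root>/<split>/pos.
    """
    rows = []
    for folder, label in (("neg", 0), ("pos", 1)):
        # Sort the paths so the row order is reproducible across runs.
        for path in sorted(Path(root, split, folder).glob("*.txt")):
            rows.append(
                {"review": path.read_text(encoding="utf-8"), "sentiment": label}
            )
    return pd.DataFrame(rows, columns=["review", "sentiment"])
```

With the archive extracted next to the notebook, `load_split("aclImdb", "train")` and `load_split("aclImdb", "test")` would give the two DataFrames.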
To install the required dependencies, use the following command:
pip install -r requirements.txt
- HTML Cleaning: Remove unwanted HTML tags from the text using `BeautifulSoup`.
- TF-IDF Vectorization: Transform text into numerical features for machine learning.
We implement a scikit-learn pipeline consisting of:
- HTMLCleaner (Custom Transformer) – Removes HTML tags.
- TF-IDF Vectorizer – Converts text into numerical features.
- Logistic Regression – Classifies reviews as positive or negative.
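A minimal sketch of how these three steps might be wired together. The `HTMLCleaner` body, the step names, and `max_iter=1000` are assumptions made for illustration; the notebook's actual implementation may differ.

```python
from bs4 import BeautifulSoup
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class HTMLCleaner(BaseEstimator, TransformerMixin):
    """Stateless transformer that strips HTML tags from each document."""

    def fit(self, X, y=None):
        # Nothing to learn; present only to satisfy the transformer API.
        return self

    def transform(self, X):
        # Joining text nodes with a space keeps words on either side of
        # a tag such as <br /> from running together.
        return [
            BeautifulSoup(text, "html.parser").get_text(separator=" ")
            for text in X
        ]


def build_pipeline():
    """Assemble the cleaner, vectorizer and classifier (default settings)."""
    return Pipeline(
        [
            ("clean", HTMLCleaner()),
            ("tfidf", TfidfVectorizer()),
            ("clf", LogisticRegression(max_iter=1000)),
        ]
    )
```

Calling `build_pipeline().fit(train_texts, train_labels)` trains all three steps at once, and the same cleaning and vectorization are then applied automatically at prediction time.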
We use Optuna to optimize the following parameters:
- `max_df`, `min_df`, `ngram_range` (TF-IDF Vectorizer)
- `C` (inverse regularization strength for Logistic Regression)
- Accuracy Score – Overall model performance.
- Classification Report – Precision, Recall, F1-score.
- Confusion Matrix – Breakdown of correct and incorrect classifications.
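All three metrics come from `sklearn.metrics`. The helper below is a small sketch that assumes labels are encoded as 0 (negative) and 1 (positive); the name `evaluate` and the label encoding are illustrative.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
)


def evaluate(y_true, y_pred):
    """Collect the three evaluation outputs for a set of predictions."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "report": classification_report(
            y_true,
            y_pred,
            labels=[0, 1],
            target_names=["negative", "positive"],
        ),
        # Rows are true labels, columns are predicted labels.
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=[0, 1]),
    }
```

For a fitted pipeline this would typically be called as `evaluate(test_labels, pipeline.predict(test_texts))`.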
- We use LIME (Local Interpretable Model-agnostic Explanations) to understand the model’s predictions.
- Generates word importance scores for individual reviews.
- Baseline Model Accuracy: ~88%
- After Hyperparameter Tuning: ~90%
- LIME Interpretation: Model correctly identifies sentiment-heavy words.
- Machine Learning-based sentiment analysis is effective on movie reviews.
- Hyperparameter tuning improves accuracy, but only marginally (from ~88% to ~90%).
- LIME helps explain model decisions, making the classifier more interpretable.
- Future improvements: Try deep learning (e.g., LSTMs, Transformers) for more advanced NLP modeling.
To execute the project:
- Install dependencies:
pip install -r requirements.txt
- Run the Marimo notebook:
marimo run "IMDB Sentiment Analysis.py"