IMDb Sentiment Analysis

Project Overview

This project builds a sentiment analysis model on the IMDb dataset. The goal is to classify movie reviews as positive or negative, using Natural Language Processing (NLP) techniques together with machine learning models.

What is Marimo?

Marimo is a Python-based interactive computing framework for creating and sharing computational notebooks. Unlike traditional notebooks such as Jupyter, Marimo emphasizes modularity, reactivity, and a seamless development experience.

Project Workflow

1. Download and extract the IMDb dataset.
2. Load the dataset into pandas DataFrames.
3. Perform Exploratory Data Analysis (EDA) to understand the data.
4. Build and train a sentiment analysis model using a scikit-learn pipeline.
5. Evaluate the model on the test data.
6. Improve the model using hyperparameter tuning.
7. Explain model predictions using LIME.

Dataset

Source: Stanford Large Movie Review Dataset (IMDb)
Structure:
- 50,000 movie reviews (25,000 for training, 25,000 for testing)
- Balanced dataset (equal positive and negative reviews)
- Reviews stored as raw text files
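Because the reviews are stored as individual text files, loading them into a DataFrame means walking the folders and reading each file. The following is a minimal sketch, not the notebook's actual loader: the helper name `load_reviews`, the column names, and the `pos`/`neg` subfolder layout are assumptions for illustration (the extracted archive is commonly laid out as `aclImdb/train/pos`, `aclImdb/train/neg`, and likewise for `test`).

```python
from pathlib import Path

import pandas as pd


def load_reviews(split_dir):
    """Read every .txt file under split_dir/neg and split_dir/pos into a DataFrame.

    Assumes one review per file; the label comes from the folder name
    (0 = negative, 1 = positive). Hypothetical helper for illustration.
    """
    rows = []
    for folder, label in (("neg", 0), ("pos", 1)):
        for path in sorted(Path(split_dir, folder).glob("*.txt")):
            rows.append({"review": path.read_text(encoding="utf-8"), "label": label})
    return pd.DataFrame(rows, columns=["review", "label"])
```

With the archive extracted in the working directory, `load_reviews("aclImdb/train")` and `load_reviews("aclImdb/test")` would give the two DataFrames used in the rest of the workflow.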

Dependencies

To install the required dependencies, use the following command:

pip install -r requirements.txt

Implementation Details

Data Preprocessing

HTML Cleaning: Remove unwanted HTML tags from the text using BeautifulSoup.
TF-IDF Vectorization: Transform text into numerical features for machine learning.
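The HTML cleaning step can be sketched as below. IMDb reviews often contain `<br />` line breaks, so the example passes a separator to `get_text` to keep the words on either side of a tag apart. The function name `strip_html` is an assumption for illustration, not a name taken from the notebook.

```python
from bs4 import BeautifulSoup


def strip_html(text):
    """Return text with HTML tags removed and runs of whitespace collapsed."""
    # The separator stops "film.<br />Would" from becoming "film.Would"
    plain = BeautifulSoup(text, "html.parser").get_text(separator=" ")
    return " ".join(plain.split())


print(strip_html("Great film.<br /><br />Would watch again."))
# prints: Great film. Would watch again.
```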

Model Pipeline

We implement a scikit-learn pipeline consisting of:

1. HTMLCleaner (Custom Transformer) – Removes HTML tags.
2. TF-IDF Vectorizer – Converts text into numerical features.
3. Logistic Regression – Classifies reviews as positive or negative.
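A three-step pipeline of this shape might look like the sketch below. The body of `HTMLCleaner`, the step names (`clean`, `tfidf`, `clf`), the `max_iter` value, and the toy reviews are all assumptions made for illustration; the notebook's own implementation may differ.

```python
from bs4 import BeautifulSoup
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class HTMLCleaner(BaseEstimator, TransformerMixin):
    """Stateless transformer that strips HTML tags from each document."""

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        return [BeautifulSoup(doc, "html.parser").get_text(separator=" ") for doc in X]


pipeline = Pipeline([
    ("clean", HTMLCleaner()),
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Toy data only, to show the fit/predict flow (not the IMDb dataset)
texts = [
    "A wonderful, moving film.<br />Loved it.",
    "Terrible plot and awful acting.",
    "Loved the cast, wonderful score.",
    "Awful. <br />A terrible waste of time.",
]
labels = [1, 0, 1, 0]

pipeline.fit(texts, labels)
predictions = pipeline.predict(["wonderful film, loved it", "terrible and awful"])
```

Keeping the cleaner inside the pipeline means the same preprocessing is applied at training and prediction time, and the whole object can be passed to cross-validation or a tuner as a single estimator.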

Hyperparameter Tuning

We use Optuna to optimize parameters:

max_df, min_df, ngram_range (TF-IDF Vectorizer)
C (Regularization strength for Logistic Regression)

Model Evaluation

Accuracy Score – Overall model performance.
Classification Report – Precision, Recall, F1-score.
Confusion Matrix – Breakdown of correct and incorrect classifications.
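These three metrics are all available in `sklearn.metrics`. The sketch below uses a handful of made-up labels (not results from this project) to show what each call returns.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels for six reviews: 1 = positive, 0 = negative
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

accuracy = accuracy_score(y_true, y_pred)  # 4 of 6 predictions are correct
matrix = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
report = classification_report(y_true, y_pred, target_names=["negative", "positive"])

print(f"Accuracy: {accuracy:.3f}")  # prints: Accuracy: 0.667
print(matrix)                       # [[2 1]
                                    #  [1 2]]
print(report)
```

In the confusion matrix, the off-diagonal entries are the misclassifications: one negative review predicted as positive and one positive review predicted as negative.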

Explainability with LIME

We use LIME (Local Interpretable Model-agnostic Explanations) to understand the model's predictions. For each individual review, LIME generates word importance scores that show which words pushed the prediction toward positive or negative.

Results

Baseline Model Accuracy: ~88%
After Hyperparameter Tuning: ~90%
LIME Interpretation: Model correctly identifies sentiment-heavy words.

Conclusion

Machine Learning-based sentiment analysis is effective on movie reviews.
Hyperparameter tuning improves accuracy, but only marginally (from ~88% to ~90%).
LIME helps explain model decisions, making the classifier more interpretable.
Future improvements: Try deep learning (e.g., LSTMs, Transformers) for more advanced NLP modeling.

Running the Project

To execute the project:

Install dependencies:
```
pip install -r requirements.txt
```
Run the Marimo notebook:
```
marimo run "IMDB Sentiment Analysis.py"
```

edoardodraetta/sentiment_analysis_marimo