This project classifies the sentiment of movie reviews using machine learning models. It was implemented as part of a term-break assignment.
The dataset comprises:
- train.csv: Training data with labeled sentiments.
- test.csv: Test data with unlabeled reviews.
- movies.csv: Metadata related to the movies (possibly for enrichment).
- sample.csv: A sample submission format for Kaggle.
Predict the sentiment of movie reviews as a classification problem. The task involves:
- Data preprocessing and feature engineering
- Exploratory Data Analysis (EDA)
- Model training and evaluation
- Generating predictions for test data
Technologies used:
- Python
- Pandas, NumPy
- Matplotlib, Seaborn
- Scikit-learn
Import Libraries
- All necessary data science libraries are imported: numpy, pandas, sklearn, matplotlib, seaborn.
Load and Inspect Data
- Data is read from CSV files using Pandas.
- Null values are checked with .isnull(), and basic summary statistics are produced with .describe().
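The load-and-inspect step can be sketched as follows. The column names `reviewText` and `sentiment` and the inline CSV text are assumptions made for illustration, since the actual schema of train.csv is not described here; in the notebook the data would come from `pd.read_csv("train.csv")`.

```python
import io

import pandas as pd

# Stand-in for train.csv; column names and rows are hypothetical.
SAMPLE_CSV = """reviewText,sentiment
"A moving, well acted film",POSITIVE
"Dull and far too long",NEGATIVE
,POSITIVE
"""

# In the notebook this would be pd.read_csv("train.csv").
train = pd.read_csv(io.StringIO(SAMPLE_CSV))

null_counts = train.isnull().sum()        # missing values per column
summary = train.describe(include="all")   # basic statistics for every column

print(null_counts)
print(summary)
```

Passing `include="all"` makes `.describe()` report on text columns as well as numeric ones, which matters when most columns are strings.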
Exploratory Data Analysis (EDA)
- Visualization of sentiment frequencies.
- Possibly investigates the distribution of sentiments and reviews.
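A sentiment frequency check might look like the sketch below. The `sentiment` column name and the label values are assumptions, and `plot_sentiment_counts` is a hypothetical helper rather than a function from the notebook.

```python
import pandas as pd

# Hypothetical labels standing in for the sentiment column of train.csv.
train = pd.DataFrame(
    {"sentiment": ["POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE", "NEGATIVE"]}
)

counts = train["sentiment"].value_counts()                # absolute frequencies
shares = train["sentiment"].value_counts(normalize=True)  # relative frequencies
print(counts)


def plot_sentiment_counts(counts):
    """Draw the class frequencies as a bar chart (requires matplotlib)."""
    import matplotlib.pyplot as plt

    ax = counts.plot(kind="bar")
    ax.set_xlabel("sentiment")
    ax.set_ylabel("number of reviews")
    plt.tight_layout()
    plt.show()
```

Looking at the relative frequencies first helps decide whether accuracy alone is a fair metric or whether precision and recall are needed because the classes are imbalanced.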
Preprocessing
- Categorical encoding using LabelEncoder, OneHotEncoder, and OrdinalEncoder.
- Scaling with StandardScaler and MinMaxScaler.
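One way to combine these encoders and scalers is a `ColumnTransformer`, sketched below. The `genre` and `runtime` columns are hypothetical stand-ins for whatever features the notebook actually uses, and the pairing of encoder to column is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Hypothetical feature columns; the real ones come from train.csv / movies.csv.
df = pd.DataFrame(
    {
        "genre": ["drama", "comedy", "drama", "horror"],
        "runtime": [120.0, 95.0, 150.0, 88.0],
        "sentiment": ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"],
    }
)

# LabelEncoder is meant for the target: it maps class names to integers.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df["sentiment"])

# One-hot encode the categorical column and standardize the numeric one.
preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["genre"]),
        ("num", StandardScaler(), ["runtime"]),
    ]
)
X = preprocess.fit_transform(df[["genre", "runtime"]])
X = X.toarray() if hasattr(X, "toarray") else np.asarray(X)

print(X.shape, y)
```

`OrdinalEncoder` and `MinMaxScaler` can be swapped in for `OneHotEncoder` and `StandardScaler` in the same positions; fitting the transformer on the training data only and reusing it on the test data avoids leaking test statistics.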
Model Training
- Models used: SGDClassifier, RidgeClassifier, LogisticRegression.
- RandomizedSearchCV is used for hyperparameter tuning, and cross_val_predict produces out-of-fold predictions for evaluation.
- Evaluated with metrics like precision, recall, confusion matrix, and classification report.
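The tuning and evaluation steps could be wired together roughly as in the sketch below. The synthetic data, the search range for `C`, and the choice of `LogisticRegression` as the tuned model are assumptions for illustration; the notebook's actual search space is not documented here.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, cross_val_predict

# Synthetic stand-in for the preprocessed training features and labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Randomized search over the regularization strength C (range is illustrative).
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=10,
    cv=5,
    scoring="f1",
    random_state=0,
)
search.fit(X, y)

# Out-of-fold predictions: each sample is predicted by a model that never saw it.
y_pred = cross_val_predict(search.best_estimator_, X, y, cv=5)

cm = confusion_matrix(y, y_pred)
print(search.best_params_)
print(cm)
print(classification_report(y, y_pred))
```

Because the hyperparameters were chosen on the same data, the resulting scores are slightly optimistic; a held-out split or nested cross-validation gives a cleaner estimate. `SGDClassifier` or `RidgeClassifier` can replace `LogisticRegression` with their own parameter distributions.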
Prediction
- Predictions are made on the test dataset using the trained model.
- Output prepared in submission format.
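The prediction step can be sketched as below. The random feature matrices and the `id` and `sentiment` column names are assumptions; in practice the columns should mirror whatever sample.csv specifies.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the preprocessed train and test feature matrices.
X_train = rng.normal(size=(40, 3))
labels = np.where(X_train[:, 0] > 0, "POSITIVE", "NEGATIVE")
X_test = rng.normal(size=(5, 3))

encoder = LabelEncoder()
y_train = encoder.fit_transform(labels)

model = LogisticRegression().fit(X_train, y_train)

# Map integer predictions back to the original class names.
predicted = encoder.inverse_transform(model.predict(X_test))

# Column names "id" and "sentiment" are assumptions; mirror sample.csv in practice.
submission = pd.DataFrame({"id": np.arange(len(X_test)), "sentiment": predicted})
# submission.to_csv("submission.csv", index=False)  # uncomment to write the file
print(submission)
```

Writing with `index=False` keeps the pandas row index out of the file, which submission checkers typically reject as an extra column.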
- The notebook includes metrics such as precision, recall, and confusion matrix to evaluate model performance.
- Visual tools like ConfusionMatrixDisplay and precision_recall_curve are used for performance analysis.
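As an illustration of these two diagnostics, the sketch below computes a precision-recall curve and a confusion matrix on a small set of made-up labels and scores. The data and the `show_plots` helper are hypothetical and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_curve

# Hypothetical true labels and decision scores for the positive class.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5])

# Precision and recall at every distinct score threshold.
precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Confusion matrix at a fixed threshold of 0.5 (rows: true, columns: predicted).
y_pred = (scores >= 0.5).astype(int)
cm = confusion_matrix(y_true, y_pred)
print(cm)


def show_plots():
    """Render both diagnostics (requires matplotlib)."""
    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay

    ConfusionMatrixDisplay(confusion_matrix=cm).plot()
    plt.figure()
    plt.plot(recall, precision)
    plt.xlabel("recall")
    plt.ylabel("precision")
    plt.show()
```

`precision_recall_curve` returns one more precision/recall pair than thresholds, because it appends the end point where recall is 0 and precision is 1.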
To run the project:
- Clone the repository or download the notebook.
- Ensure you have the required datasets (train.csv, test.csv, etc.) in the correct folder structure.
- Install dependencies:
pip install numpy pandas scikit-learn matplotlib seaborn
- Run the notebook using Jupyter or any IDE that supports .ipynb.
.
├── train.csv
├── test.csv
├── movies.csv
├── sample.csv
├── 21f3000953-notebook-t22023.ipynb
└── README.md
Name: Shreya Garg
Assignment: Term Break 1 — Sentiment Prediction on Movie Reviews
Platform: Kaggle
- This notebook uses traditional ML models rather than deep learning architectures such as LSTMs or Transformers.
- Label encoding and standard ML preprocessing (categorical encoding and feature scaling) are applied.
- Could be further improved by including NLP-based features like TF-IDF or word embeddings.
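A TF-IDF extension along these lines might look like the sketch below. This is not part of the current notebook: the tiny corpus, the label names, and the bigram setting are all assumptions chosen to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus standing in for the review text in train.csv.
reviews = [
    "a wonderful and moving film",
    "wonderful acting and a great story",
    "great fun from start to finish",
    "a dull and boring film",
    "boring plot and terrible acting",
    "terrible pacing and a dull story",
]
sentiments = ["POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]

# TF-IDF turns each review into a sparse weighted bag-of-words vector.
text_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
text_model.fit(reviews, sentiments)

preds = text_model.predict(["a wonderful story", "terrible and boring"])
print(preds)
```

Wrapping the vectorizer and classifier in one pipeline means the vocabulary is learned only from the training folds during cross-validation, so the same object can be passed directly to `cross_val_predict` or `RandomizedSearchCV`.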