This project analyzes movie data to predict box office success and perform sentiment analysis on movie metadata using IMDB and TMDB datasets. The entire analysis is performed within a Jupyter notebook environment.
- merged_movie_dataset.csv: Cleaned and merged dataset created by data_clean_merge.py
- movie_analysis.ipynb: Main Jupyter notebook containing the full analysis pipeline
- data_clean_merge.py: Script used to clean and merge the original IMDB and TMDB datasets
- Sentiment Analysis
  - Uses VADER sentiment analyzer to compute sentiment scores from movie titles and genres
  - Analyzes sentiment distribution across different genres
  - Visualizes sentiment trends
- Box Office Success Prediction
  - Trains regression models to predict movie revenue
  - Uses features such as budget, vote score, runtime, genre, and sentiment
  - Evaluates model performance using R², RMSE, and MAE
  - Displays feature importance using visualizations
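The three evaluation metrics named above can be computed with scikit-learn as in the following sketch. It uses synthetic data and a plain linear regression as a stand-in model; the feature layout and coefficients are assumptions for illustration, not values from the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: the three columns play the role of
# budget, vote score, and runtime (all scaled to [0, 1]).
X = rng.uniform(0, 1, size=(200, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)                             # share of variance explained
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))  # penalizes large errors more
mae = mean_absolute_error(y_test, pred)                  # average absolute error

print(f"R2={r2:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}")
```

RMSE and MAE are in the same units as the target, so on real revenue data they read as currency amounts, while R² is unitless.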
Make sure the following Python libraries are installed in your environment:
pandas
numpy
scikit-learn
nltk
vaderSentiment
matplotlib
seaborn
- Data Exploration
  - Statistical summaries and quality checks
  - Correlation heatmaps of key features
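A rough sketch of the exploration step with pandas follows. The column names (`budget`, `revenue`, `runtime`, `vote_average`) and the synthetic values are assumptions; `plot_correlation_heatmap` is a hypothetical helper, not a function from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100

# Synthetic stand-in for the merged dataset (assumed column names).
budget = rng.uniform(1e6, 2e8, n)
df = pd.DataFrame({
    "budget": budget,
    "revenue": budget * 2.5 + rng.normal(0, 2e7, n),
    "runtime": rng.normal(110, 15, n),
    "vote_average": rng.uniform(3, 9, n),
})
df.loc[0, "runtime"] = np.nan  # simulate one missing value

summary = df.describe()                  # count, mean, std, quartiles per column
missing = df.isna().sum()                # missing values per column
duplicates = int(df.duplicated().sum())  # fully duplicated rows
corr = df.corr()                         # pairwise Pearson correlations


def plot_correlation_heatmap(corr_matrix: pd.DataFrame):
    """Draw the correlation matrix as an annotated heatmap and return the figure."""
    # Imported here so the numeric checks above run without a plotting backend.
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(6, 5))
    sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm",
                vmin=-1, vmax=1, ax=ax)
    ax.set_title("Correlation of key features")
    fig.tight_layout()
    return fig
```

In a notebook cell, calling `plot_correlation_heatmap(corr)` renders the figure inline.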
- Sentiment Analysis
  - Sentiment score computation using VADER
  - Sentiment breakdown by genre
  - Trend visualization
- Success Prediction
  - Model training using Random Forest Regressor
  - Feature selection and engineering
  - Model evaluation and result visualization
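The prediction step could look roughly like the sketch below, which one-hot encodes genre, fits a `RandomForestRegressor`, and ranks feature importances. The data is synthetic, and the column names, hyperparameters, and revenue formula are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 300

# Synthetic stand-in data with assumed column names.
df = pd.DataFrame({
    "budget": rng.uniform(1, 200, n),
    "vote_average": rng.uniform(3, 9, n),
    "runtime": rng.uniform(80, 180, n),
    "sentiment": rng.uniform(-1, 1, n),
    "genre": rng.choice(["Action", "Comedy", "Drama"], n),
})
# Revenue is driven mostly by budget here, purely so the example has signal.
df["revenue"] = 2.5 * df["budget"] + 5.0 * df["vote_average"] + rng.normal(0, 10, n)

# Feature engineering: one-hot encode the categorical genre column.
X = pd.get_dummies(df.drop(columns="revenue"), columns=["genre"])
y = df["revenue"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
forest = RandomForestRegressor(n_estimators=200, random_state=1)
forest.fit(X_train, y_train)

# Impurity-based importances; they sum to 1 across all features.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
test_r2 = forest.score(X_test, y_test)  # R² on the held-out split

print(importances)
print(f"Test R2: {test_r2:.3f}")
```

`importances.plot.barh()` gives a quick bar chart of the ranking. Impurity-based importances can favor high-cardinality or continuous features, so they are best read as a rough guide.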
The analysis provides:
- Insight into sentiment trends across movie genres
- Predictive models for estimating box office success
- Feature importance metrics influencing success
- Clear, visual understanding of sentiment and financial outcomes