This is an NLP project built using Python. The goal was to find patterns in the text of movie scripts, to test whether movies with similar plots can be clustered together, and to build a movie recommender trained on the script data.

fatimaazfar/Movie-Script-Analysis

Movie Plots Analysis

Figure: Clusters Formed

Table of Contents

  1. Introduction
  2. Data Cleaning
  3. Classification
  4. Similar Movie Finder
  5. Clustering
  6. Keyword Extraction
  7. Analyzing the Scripts Classified Plot-wise
  8. Analyzing the Scripts Decade-wise
  9. Exploratory Data Analysis
  10. References

Introduction

Movies have long been a captivating medium, weaving stories that resonate with audiences across generations. This project delves into the essence of cinematic narratives, aiming to identify the fundamental plots that have shaped our understanding of storytelling. By analyzing the scripts of the top 3 IMDb hits from the 1970s to the 2020s, we embark on a journey to uncover the underlying patterns that define these cinematic masterpieces.

Data Cleaning

The data is sourced from a JSON file containing movie scripts. The following cleaning steps were performed:

  • Lowercasing all text
  • Removing symbols and line breaks
  • Eliminating punctuation
  • Removing stop words
  • Lemmatizing the data for better contextual analysis
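The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `clean_script`, the tiny stop-word list, and the pluggable `lemmatize` argument are all assumptions made for the example. The README does not say which lemmatizer was used, so the default here simply leaves each word unchanged.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it", "on"}


def clean_script(text, stop_words=STOP_WORDS, lemmatize=lambda word: word):
    """Apply the listed cleaning steps to a single script (hypothetical helper)."""
    text = text.lower()                                   # lowercase everything
    text = re.sub(r"[\r\n\t]+", " ", text)                # replace line breaks with spaces
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    text = re.sub(r"[^a-z0-9\s]", " ", text)              # drop any remaining symbols
    tokens = [word for word in text.split() if word not in stop_words]
    # The lemmatizer is passed in; the identity default means "no lemmatization".
    return " ".join(lemmatize(word) for word in tokens)


print(clean_script("The Godfather:\nAn offer he can't refuse!"))
```

A real lemmatizer (for example, one from an NLP library) could be supplied through the `lemmatize` argument without changing the rest of the function.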

Classification

A TF-IDF matrix of the movie scripts is created, and pairwise distances between scripts are derived from their cosine similarity. Hierarchical clustering is then applied to these distances to identify basic plots, and the results are visualized using dendrograms.
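As a rough sketch of this pipeline, the example below builds a TF-IDF matrix for a toy corpus, converts cosine similarity to a distance (1 minus similarity), and runs hierarchical clustering with SciPy. The toy scripts, the choice of average linkage, and the cut into two clusters are assumptions made for illustration; the README does not state which linkage method or cut the project used.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for cleaned scripts (hypothetical data).
titles = ["Heist A", "Heist B", "Space A", "Space B"]
texts = [
    "bank robbery vault crew getaway money",
    "crew plans vault robbery money escape",
    "spaceship alien planet galaxy crew",
    "alien invasion galaxy spaceship war",
]

tfidf = TfidfVectorizer().fit_transform(texts)
similarity = cosine_similarity(tfidf)

# Turn similarity into a distance and clean up floating-point noise.
distance = np.clip(1.0 - similarity, 0.0, None)
np.fill_diagonal(distance, 0.0)

# linkage() expects the condensed (upper-triangle) form of the distance matrix.
condensed = squareform(distance, checks=False)
tree = linkage(condensed, method="average")  # linkage method is an assumption

# Cut the tree into two flat clusters and compute the dendrogram layout.
labels = fcluster(tree, t=2, criterion="maxclust")
layout = dendrogram(tree, labels=titles, no_plot=True)  # set no_plot=False to draw

print(dict(zip(titles, labels)))
print(layout["ivl"])  # leaf order as it would appear on the dendrogram
```

Passing `no_plot=True` keeps the example free of a plotting dependency; removing it draws the dendrogram with matplotlib.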

Similar Movie Finder

Cosine similarity is used to find similar movies based on the TF-IDF vectors of their scripts. A function, find_similar, is provided to identify the most similar movie for a given title.
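The README names `find_similar` but does not show its signature, so the version below is only a guess at how such a function might look. The toy corpus and the `titles`/`similarity` parameters are assumptions introduced for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for cleaned scripts (hypothetical data).
titles = ["Heist A", "Heist B", "Space A", "Space B"]
texts = [
    "bank robbery vault crew getaway money",
    "crew plans vault robbery money escape",
    "spaceship alien planet galaxy crew",
    "alien invasion galaxy spaceship war",
]

similarity = cosine_similarity(TfidfVectorizer().fit_transform(texts))


def find_similar(title, titles=titles, similarity=similarity):
    """Return the title whose script is most similar to the given one (sketch)."""
    idx = titles.index(title)          # raises ValueError for unknown titles
    scores = similarity[idx].copy()
    scores[idx] = -np.inf              # a movie should not match itself
    return titles[int(np.argmax(scores))]


print(find_similar("Heist A"))
```

Masking the movie's own score matters because every script has a cosine similarity of 1 with itself and would otherwise always win.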

Clustering

Clustering is the building block of this project. It lets us test whether movies from the last 50 years fall naturally into a small set of recurring plots or themes based purely on their scripts, which would support our initial hypothesis. For this purpose, scripts are clustered by similarity using the KMeans algorithm, and the optimal number of clusters is determined using the Elbow Method.
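The Elbow Method can be sketched as follows: fit KMeans for a range of cluster counts, record the inertia (within-cluster sum of squared distances) for each, and look for the point where adding clusters stops helping much. The toy corpus, the range of `k`, and the final choice of three clusters are assumptions for this example only.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus with three obvious themes (hypothetical data).
texts = [
    "bank robbery vault getaway money",
    "vault robbery money escape plan",
    "spaceship alien planet galaxy war",
    "alien invasion galaxy spaceship fleet",
    "wedding romance kiss letter summer",
    "romance kiss wedding dance paris",
]

tfidf = TfidfVectorizer().fit_transform(texts)

# Record inertia for each candidate number of clusters.
inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(tfidf)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.3f}")

# After inspecting the curve, refit with the chosen k (3 is assumed here).
final = KMeans(n_clusters=3, n_init=10, random_state=0).fit(tfidf)
labels = final.labels_
print(labels)
```

In practice the inertia values are usually plotted against `k`, and the "elbow" is read off the chart by eye.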

Keyword Extraction

Keywords act as the building blocks of narratives, encapsulating the essence of each story. Through the extraction of keywords using TF-IDF vectors, we gain a nuanced understanding of the pivotal elements that define these cinematic experiences. The top 10 keywords for each movie serve as beacons, illuminating the thematic landscape.

Analyzing the Scripts Classified Plot-wise

Plot-wise classification opens a portal to the diverse worlds crafted by filmmakers. The subsequent generation of word clouds for each cluster transforms textual data into visually engaging insights. By visually representing common themes, these word clouds offer a unique perspective on the shared narratives within each cluster.

Analyzing the Scripts Decade-wise

Decades are chapters in the cinematic timeline, each marked by unique storytelling trends. By categorizing scripts based on their respective decades, we embark on a journey through the evolving landscape of movie genres. The resulting word clouds serve as time capsules, preserving the essence of storytelling trends across different eras.
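Grouping by decade comes down to rounding each release year down to a multiple of ten and pooling word counts per group. The sketch below assumes each movie is a dictionary with `year` and `text` keys, which is an illustrative layout rather than the project's actual schema.

```python
from collections import Counter, defaultdict

# Toy records (hypothetical data and field names).
movies = [
    {"title": "Movie A", "year": 1975, "text": "shark beach boat shark"},
    {"title": "Movie B", "year": 1979, "text": "alien ship crew alien"},
    {"title": "Movie C", "year": 2010, "text": "dream heist dream layer"},
]


def word_frequencies_by_decade(movies):
    """Pool word counts of cleaned scripts per decade (hypothetical helper)."""
    counts = defaultdict(Counter)
    for movie in movies:
        decade = movie["year"] // 10 * 10   # e.g. 1975 -> 1970
        counts[decade].update(movie["text"].split())
    return counts


freqs = word_frequencies_by_decade(movies)
for decade in sorted(freqs):
    print(decade, freqs[decade].most_common(3))
```

Each per-decade `Counter` is a word-to-frequency mapping, which is the kind of input a word cloud library typically accepts when generating a cloud from frequencies.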

Exploratory Data Analysis

Beyond the narrative nuances, exploratory data analysis brings quantitative insights to the forefront. The conversion of running time to minutes and the subsequent visualization of movie durations over the years offer a comprehensive view of temporal trends. This analytical journey bridges the gap between storytelling and data, providing a holistic understanding of cinematic evolution.
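The running-time conversion might look like the sketch below. The README does not show the raw format, so strings such as "2h 55min" and "148 min" are assumed here, and `runtime_to_minutes` is a hypothetical helper name.

```python
import re
from collections import defaultdict
from statistics import mean


def runtime_to_minutes(runtime):
    """Convert strings like '2h 55min' or '148 min' to minutes (assumed formats)."""
    hours = re.search(r"(\d+)\s*h", runtime)
    minutes = re.search(r"(\d+)\s*m", runtime)
    if not hours and not minutes:
        raise ValueError(f"Unrecognized running time: {runtime!r}")
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total


# Toy (year, running time) pairs (hypothetical data).
movies = [(1972, "2h 55min"), (1972, "1h 45min"), (2010, "148 min")]

by_year = defaultdict(list)
for year, runtime in movies:
    by_year[year].append(runtime_to_minutes(runtime))

# Average duration per year, ready to be plotted against time.
average_minutes = {year: mean(values) for year, values in sorted(by_year.items())}
print(average_minutes)
```

The resulting year-to-minutes mapping can be passed to any plotting library to visualize how durations change over the years.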

References

This work was informed by the research paper "The emotional arcs of stories are dominated by six basic shapes" by Andrew J. Reagan, Lewis Mitchell, Dilan Kiley, Christopher M. Danforth, and Peter Sheridan Dodds, published on July 8, 2016. The paper investigates emotional arcs in stories and provided additional context and inspiration for the clustering approach used in this project.


Note: This readme provides an overview and explanations of the code. For specific code details, refer to the source code file.
