Vector Search for Visualizing Movie Script Relationships

Interactive Flask App Visualizations of 3 Combined Movie Datasets

Project Diagram & Vector Search Plot

Visualizes three similarity measures commonly used in vector search: distance, dot product, and cosine similarity.

The Dataset

data/out/movie-script-dataset.parquet contains data from 3 sources.

Kaggle Movie Scripts Dataset
Movie scripts downloaded manually via Google search
Kaggle IMDB Movie Dataset (for year and genre)

See the .ipynb notebooks for the regex data cleaning and the Polars data joins.

Embedding

After cleaning the scripts with regex, we feed them into bge-large-en-v1.5 in chunks, then mean-pool the output to collapse the (n_chunks, n_tokens, hidden_size) array into a single vector of length hidden_size = 1024.
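The pooling step can be sketched with NumPy alone. This is a minimal illustration, not the code in embed_scripts.py: the function name mean_pool_chunks, the attention mask argument, and the choice to average over all real tokens across all chunks (rather than averaging per-chunk means) are assumptions, and the random array stands in for real model output.

```python
import numpy as np


def mean_pool_chunks(token_embeddings, attention_mask):
    """Collapse (n_chunks, n_tokens, hidden_size) into one (hidden_size,) vector.

    attention_mask has shape (n_chunks, n_tokens): 1 for real tokens, 0 for padding.
    Assumption: every real token in every chunk gets equal weight.
    """
    # Broadcast the mask over the hidden dimension so padded tokens contribute zero.
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=(0, 1))
    n_real_tokens = mask.sum()
    # Guard against an all-padding input to avoid dividing by zero.
    return summed / max(n_real_tokens, 1.0)


# Synthetic stand-in for model output: 3 chunks, 8 tokens each, hidden_size 1024.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(3, 8, 1024))
attention_mask = np.ones((3, 8), dtype=int)
attention_mask[2, 5:] = 0  # the last chunk is shorter, so its tail is padding

script_vector = mean_pool_chunks(token_embeddings, attention_mask)
print(script_vector.shape)  # (1024,)
```

Masking matters mostly for the final chunk of a script, which is usually shorter than the rest; without it, padding positions would pull the average toward whatever the model emits for padding.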

UMAP

UMAP is a dimensionality reduction technique that learns a low-dimensional projection whose fuzzy topological structure matches that of the original data as closely as possible. It is a non-linear alternative to PCA and is most comparable to t-SNE.

We use it to project the 1024-dimensional embeddings down to 2 dimensions so we can plot them and confirm that similar movies are embedded into similar vectors.

Visualizations

Visualizations are made with Plotly Express and served by a Flask app.

The visualization titled "Nearest Neighbors to Fight Club" depicts the k nearest neighbors (KNN) of a given movie, measured by the distance between the embedded vectors of the movie scripts. The y-axis and marker size show the dot product, which was not as strongly correlated with distance as I had expected. The color shows the cosine similarity, which is very strongly correlated with distance, likely because the embedding values are distributed roughly like a standard normal.
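To make the relationship between the three measures concrete, here is a minimal NumPy sketch. It is not the code in vector_search.py: the function name similarity_measures is hypothetical, Euclidean distance is assumed for the KNN "distance", and the embeddings are random standard normal vectors rather than real script embeddings. The measures are tied together by the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2(a . b), so how closely distance tracks the dot product depends on how much the vector norms vary.

```python
import numpy as np


def similarity_measures(query, candidates):
    """Return (euclidean_distance, dot_product, cosine_similarity) per candidate row."""
    distance = np.linalg.norm(candidates - query, axis=1)
    dot = candidates @ query
    cosine = dot / (np.linalg.norm(candidates, axis=1) * np.linalg.norm(query))
    return distance, dot, cosine


# Synthetic stand-in for the script embeddings (assumption: 500 movies, 1024 dims).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 1024))
query, candidates = embeddings[0], embeddings[1:]

distance, dot, cosine = similarity_measures(query, candidates)

# Indices of the k nearest neighbors by distance, closest first.
k = 5
nearest = np.argsort(distance)[:k]

# The identity linking distance and dot product holds for any vectors.
squared_norms = np.sum(candidates**2, axis=1) + query @ query
identity_holds = np.allclose(distance**2, squared_norms - 2 * dot)

# After unit-normalizing, dot product equals cosine similarity and
# squared distance becomes 2 - 2 * cosine, so all three rank neighbors identically.
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
distance_u, dot_u, cosine_u = similarity_measures(unit[0], unit[1:])
unit_identity_holds = np.allclose(distance_u**2, 2 - 2 * cosine_u)

print(identity_holds, unit_identity_holds)
```

If the real embeddings have norms that differ noticeably from script to script, the norm terms in the identity vary between candidates, which is one possible reason the dot product tracks distance less closely than cosine similarity does.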

Notes

Feel free to reach out at linkedin.com/in/anders-ward/

tinyurl.com/movie-vector-search

award96/Movie-Script-Vector-Search
