This project visualizes three vector similarity measures commonly used in vector search: distance, dot product, and cosine similarity.

data/out/movie-script-dataset.parquet combines data from three sources:
- Kaggle Movie Scripts Dataset
- Movie scripts downloaded manually via Google search
- Kaggle IMDB Movie Dataset (for year and genre)
See the .ipynb notebooks for the regex-based data cleaning and the Polars joins.
After cleaning the scripts with regex, we feed them into bge-large-en-v1.5 in chunks, then mean-pool the resulting embeddings to collapse the (n_chunks, n_tokens, hidden_size) output into a single vector of length hidden_size = 1024.
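The pooling step can be sketched roughly as follows. This is a minimal illustration, not the notebook code: the function name `mean_pool_script` is hypothetical, the input is random data standing in for real model output, and it assumes every chunk has the same number of tokens with no padding to mask out.

```python
import numpy as np


def mean_pool_script(chunk_embeddings: np.ndarray) -> np.ndarray:
    """Collapse (n_chunks, n_tokens, hidden_size) into one (hidden_size,) vector."""
    if chunk_embeddings.ndim != 3:
        raise ValueError("expected shape (n_chunks, n_tokens, hidden_size)")
    # Average over the chunk axis and the token axis in one step.
    # Assumption: no padding tokens, so no attention mask is applied.
    return chunk_embeddings.mean(axis=(0, 1))


# Random stand-in for model output: 5 chunks, 512 tokens each, hidden_size = 1024.
rng = np.random.default_rng(0)
fake_output = rng.normal(size=(5, 512, 1024)).astype(np.float32)

script_vector = mean_pool_script(fake_output)
print(script_vector.shape)  # (1024,)
```

If chunks are padded to a common length, the padded positions would need to be excluded (for example with the tokenizer's attention mask) before averaging.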
UMAP is a dimensionality reduction technique that learns a low-dimensional projection whose fuzzy topological structure is as close as possible to that of the original data. It is a non-linear alternative to PCA and is most comparable to t-SNE.
We use it to project the 1024-dimensional embeddings down to 2 dimensions so we can plot them and confirm that similar movies are embedded into similar vectors.

Visualizations are made with Plotly Express and served from a Flask app.
The visualization titled "Nearest Neighbors to Fight Club" shows the nearest neighbors (KNN) of a given movie, measured by the distance between the embedded vectors of the movie scripts. The y-axis and marker size show the dot product, which was less strongly correlated with distance than I had expected. The color shows the cosine similarity, which is extremely correlated with distance, likely because the embedding values are distributed roughly like a standard normal distribution.
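For reference, the three measures can be computed side by side as in the sketch below. It assumes that "distance" means Euclidean distance, which the description above does not state explicitly; the function name `similarity_to_query` is hypothetical and the embeddings are random placeholders.

```python
import numpy as np


def similarity_to_query(embeddings: np.ndarray, query_index: int):
    """Return (distance, dot product, cosine similarity) of every row to one query row."""
    query = embeddings[query_index]
    # Euclidean distance (assumed to be the KNN distance).
    distance = np.linalg.norm(embeddings - query, axis=1)
    # Dot product: sensitive to both direction and vector length.
    dot = embeddings @ query
    # Cosine similarity: dot product divided by both vector lengths.
    norms = np.linalg.norm(embeddings, axis=1)
    cosine = dot / (norms * norms[query_index])
    return distance, dot, cosine


# Random stand-in for 100 script embeddings of size 1024.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 1024))

distance, dot, cosine = similarity_to_query(embeddings, query_index=0)

# The query is at distance 0 from itself, so skip the first sorted index.
nearest = np.argsort(distance)[1:6]
print(nearest)
print(np.corrcoef(distance, cosine)[0, 1])
```

The three measures are linked by the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2(a . b). When all vectors have nearly the same norm, distance becomes a monotone function of both the dot product and the cosine similarity, so a weaker link between dot product and distance would be consistent with vector norms that vary across movies.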
Notes
Feel free to reach out at linkedin.com/in/anders-ward/


