Visualizing Word Embeddings

In a previous project, I compared word embeddings from models trained on a Wikipedia corpus and a Twitter corpus. The same words had different vectors, different nearest neighbors in vector space, and therefore different learned meanings. One problem I encountered during that project was information loss from dimensionality reduction: projecting 200-dimensional vectors down to 2 or 3 dimensions discarded important information and patterns. Many of the resulting visualizations looked like random dots in space, with no visible pattern.

I had two goals in mind when I created this app. I wanted to let users interact with word embeddings and machine-learned vocabulary by entering their own words. I also wanted to see whether certain words showed specific patterns in vector space.

Design

In the Streamlit app, users can input a word or click on the “Random Word” button. An interactive 3D graph shows that word plus the nearest words based on cosine similarity. A slider determines how many nearest words are presented, from 5 to 50. Below the graph is a table with the nearest words ranked by cosine similarity to the input word.
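The nearest-word lookup described above can be sketched as follows. This is a minimal illustration, not the app's actual code: the function name `nearest_words`, the toy vocabulary, and the 3-dimensional vectors are all assumptions standing in for the real 50-dimensional GloVe data, and it assumes no vector has zero length.

```python
import numpy as np

def nearest_words(query, vocab, vectors, top_n=5):
    """Return the top_n (word, cosine similarity) pairs closest to `query`.

    Hypothetical helper for illustration; not taken from the app.
    """
    if query not in vocab:
        raise KeyError(f"{query!r} is not in the vocabulary")
    # Normalize each row to unit length so a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    idx = vocab.index(query)
    sims = unit @ unit[idx]
    # Sort by descending similarity and drop the query word itself.
    order = [i for i in np.argsort(-sims) if i != idx][:top_n]
    return [(vocab[i], float(sims[i])) for i in order]

# Toy 3-dimensional vectors standing in for the 50-dimensional GloVe vectors.
vocab = ["king", "queen", "apple", "pear"]
vectors = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
])

try:
    neighbors = nearest_words("king", vocab, vectors, top_n=2)
except KeyError as err:
    # A word outside the vocabulary ends up here instead of crashing.
    neighbors = []
    print(err)
```

In this sketch `top_n` plays the role of the slider value, and the returned list corresponds to the ranked table shown below the graph.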

Data

The pre-trained word embeddings used in this project are 50-dimensional GloVe embeddings trained on a 2014 Wikipedia dump and the Gigaword 5 corpus. I chose the 50-dimensional version to reduce computation time. The model was trained on 6 billion tokens and has a vocabulary of 400,000 words, each with a corresponding 50-dimensional vector.

Algorithms

Caching was used to reduce the application's load time. Errors were handled with try/except clauses. PCA was used to reduce the 50-dimensional vectors to the three dimensions shown in the graph.
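As a rough sketch of the PCA step, the snippet below projects a small batch of 50-dimensional vectors to 3 dimensions with scikit-learn. The random matrix is only a stand-in for an input word plus its 20 nearest neighbors, and the variable names are assumptions rather than the app's own.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in data: 21 vectors (one input word plus 20 neighbors),
# each with 50 dimensions. Real GloVe vectors would replace this.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(21, 50))

# Fit PCA on this batch and keep the first three principal components.
pca = PCA(n_components=3)
coords = pca.fit_transform(embeddings)

# Share of the batch's total variance that survives the projection.
retained = float(pca.explained_variance_ratio_.sum())
print(coords.shape, round(retained, 3))
```

Each row of `coords` gives the x, y, and z position of one word in the 3D plot. The `retained` value is a quick way to gauge how much information the projection discards for a given set of words.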

Tools

Streamlit for app deployment
Plotly for 3D visualization
Gensim for data acquisition
Scikit-learn for PCA
Pandas and NumPy for analysis

Communication

The Streamlit app is available for use. It takes a few minutes to load.
