Understanding UMAP

Dimensionality reduction is a powerful tool for machine learning practitioners to visualize and understand large, high-dimensional datasets. One of the most widely used techniques for visualization is t-SNE, but its performance suffers with large datasets, and using it correctly can be challenging.

UMAP is a new technique by McInnes et al. that offers a number of advantages over t-SNE, most notably increased speed and better preservation of the data's global structure. In this article, we'll take a look at the theory behind UMAP in order to better understand how the algorithm works, how to use it effectively, and how its performance compares with t-SNE.

yarn
yarn dev

Publishing to GitHub Pages

yarn pub

To develop figures individually

yarn dev:cech
yarn dev:hyperparameters
yarn dev:mammoth-umap
yarn dev:mammoth-tsne
yarn dev:supplement
yarn dev:toy
yarn dev:toy_comparison

Data preprocessing

For the mammoth figures, the raw 3D data was downsampled to 50,000 points before being projected with UMAP / t-SNE. These 50,000 points were then randomly subsampled to 10,000 points in order to minimize the payload size.
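The random subsampling step can be sketched as follows. This is a minimal illustration, not the repository's actual preprocessing script: the function name `subsample_pairs`, its parameters, and the fixed seed are all hypothetical. It assumes the 3D points and their 2D projections are stored as two parallel lists, and it draws the indices once so that each 3D point stays paired with its own projection.

```python
import random


def subsample_pairs(points_3d, points_2d, k, seed=0):
    """Randomly keep k matching entries from two parallel lists.

    Hypothetical helper: the indices are drawn once and applied to both
    lists, so the i-th 3D point still lines up with the i-th projection.
    """
    if len(points_3d) != len(points_2d):
        raise ValueError("both lists must have the same length")
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    # Sample without replacement, then sort to preserve the original order.
    indices = sorted(rng.sample(range(len(points_3d)), k))
    kept_3d = [points_3d[i] for i in indices]
    kept_2d = [points_2d[i] for i in indices]
    return kept_3d, kept_2d


if __name__ == "__main__":
    # Toy stand-in for the 50,000 -> 10,000 step described above.
    pts3 = [(float(i), float(i) * 2, float(i) * 3) for i in range(50)]
    pts2 = [(float(i), -float(i)) for i in range(50)]
    small3, small2 = subsample_pairs(pts3, pts2, k=10)
    print(len(small3), len(small2))
```

With the real data, `k` would be 10,000 and the inputs would hold the 50,000 downsampled points and their UMAP or t-SNE projections.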

Understanding UMAP uses a few tricks to make the data payloads for some of the interactive figures small enough to download in a reasonable time. The mammoth figures use a 10-bit encoding scheme to compress the 10,000 data points into a significantly smaller payload. The hyperparameters and toy_comparison figures precompute UMAP embeddings for all of their different combinations, then use the same 10-bit encoding scheme to compress the data.
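To make the idea of a 10-bit encoding concrete, here is a minimal sketch of one way such a scheme could work. It is an assumption about the general approach, not the encoder used in this repository: each coordinate is rescaled to the range of its input, rounded to one of 1,024 integer levels, and the 10-bit codes are packed back to back into bytes. The names `quantize`, `dequantize`, `pack`, and `unpack` are hypothetical.

```python
def quantize(values, bits=10):
    """Map floats to integer codes in [0, 2**bits - 1] using the input's min/max."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    codes = [round((v - lo) / span * levels) for v in values]
    return codes, lo, hi


def dequantize(codes, lo, hi, bits=10):
    """Approximate inverse of quantize; the result is lossy."""
    levels = (1 << bits) - 1
    return [lo + c / levels * (hi - lo) for c in codes]


def pack(codes, bits=10):
    """Concatenate fixed-width codes into bytes, first code in the highest bits."""
    acc = 0
    for c in codes:
        acc = (acc << bits) | c
    total_bits = bits * len(codes)
    pad = (-total_bits) % 8  # zero bits appended to reach a whole byte
    return (acc << pad).to_bytes((total_bits + pad) // 8, "big")


def unpack(data, count, bits=10):
    """Recover `count` fixed-width codes from bytes produced by pack."""
    pad = len(data) * 8 - bits * count
    acc = int.from_bytes(data, "big") >> pad
    mask = (1 << bits) - 1
    return [(acc >> (bits * (count - 1 - i))) & mask for i in range(count)]


if __name__ == "__main__":
    xs = [0.0, 0.25, 0.5, 1.0]
    codes, lo, hi = quantize(xs)
    payload = pack(codes)
    print(len(payload), unpack(payload, len(codes)) == codes)
```

Under this scheme, 20,000 coordinates take 25,000 bytes, compared with 160,000 bytes as 64-bit floats, at the cost of limiting each axis to 1,024 distinct positions. That precision is usually enough for a scatter plot rendered on screen.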

yarn preprocess:hyperparameters
yarn preprocess:mammoth
yarn preprocess:toy_comparison
