High-performance dimensionality reduction for Python
Squeeze is a fast, CPU-optimized library for dimensionality reduction techniques including UMAP, t-SNE, PCA, and more. Built with a Rust backend and SIMD vectorization for maximum performance.
Dimensionality reduction "squeezes" high-dimensional data into lower dimensions while preserving structure. Squeeze provides:
- Multiple Algorithms: UMAP, t-SNE, PCA, Isomap, LLE, MDS, PHATE, TriMap, and PaCMAP
- Fast: 27x faster k-NN construction than PyNNDescent via HNSW with SIMD
- CPU-Optimized: No GPU required - runs anywhere
- Production Ready: Scikit-learn compatible API
I've used my own fork of Shinka Evolve (https://github.com/GeorgePearse/Genesis) to iteratively optimize the implementations.
```bash
pip install squeeze
```

Or with uv (recommended):

```bash
uv pip install squeeze
```

```python
import squeeze
from sklearn.datasets import load_digits

digits = load_digits()

# UMAP embedding
umap_embedding = squeeze.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(digits.data)

# t-SNE embedding
tsne_embedding = squeeze.TSNE(perplexity=30, n_iter=1000).fit_transform(digits.data)

# PCA embedding
pca_embedding = squeeze.PCA(n_components=2).fit_transform(digits.data)

# All algorithms share a consistent API: fit_transform(X)
```

All algorithms are implemented in Rust for maximum performance. See the Algorithm Guide for detailed documentation on each algorithm.
| Algorithm | Status | Description | Docs |
|---|---|---|---|
| UMAP | ✅ Implemented | Uniform Manifold Approximation and Projection | Guide |
| t-SNE | ✅ Implemented | t-Distributed Stochastic Neighbor Embedding | Guide |
| PCA | ✅ Implemented | Principal Component Analysis (eigendecomposition) | Guide |
| Isomap | ✅ Implemented | Isometric Mapping (geodesic distances + MDS) | Guide |
| LLE | ✅ Implemented | Locally Linear Embedding | Guide |
| MDS | ✅ Implemented | Multidimensional Scaling (classical + metric SMACOF) | Guide |
| PHATE | ✅ Implemented | Potential of Heat-diffusion for Affinity-based Trajectory Embedding | Guide |
| TriMap | ✅ Implemented | Large-scale Dimensionality Reduction Using Triplets | Guide |
| PaCMAP | ✅ Implemented | Pairwise Controlled Manifold Approximation | Guide |
| Use Case | Recommended | Why |
|---|---|---|
| Quick exploration | PCA | Fast, interpretable |
| Cluster visualization | t-SNE, UMAP | Best local structure |
| Large datasets (>100k) | PaCMAP, TriMap | Fast, scalable |
| Biological trajectories | PHATE | Designed for this |
| Best speed/quality | PaCMAP | Sub-second runtime with competitive quality |
All algorithms benchmarked on the sklearn Digits dataset (1,797 samples, 64 features):
| Algorithm | Trust. k=15 | Spearman | Global | Silhouette | Classif. Acc | Time |
|---|---|---|---|---|---|---|
| t-SNE | 0.59 | 0.41 | 0.63 | 0.65 | 0.97 | 24.2s |
| UMAP | 0.51 | 0.36 | 0.62 | 0.78 | 0.98 | 9.2s |
| PaCMAP | 0.48 | 0.24 | 0.30 | 0.71 | 0.97 | 0.2s |
| MDS | 0.20 | 0.73 | 0.83 | 0.39 | 0.72 | 10.6s |
| PHATE | 0.16 | 0.55 | 0.79 | 0.40 | 0.61 | 7.4s |
| PCA | 0.15 | 0.58 | 0.82 | 0.39 | 0.61 | <0.01s |
| Isomap | 0.09 | 0.29 | 0.55 | 0.39 | 0.41 | 6.3s |
| LLE | 0.02 | -0.03 | 0.05 | 0.32 | 0.13 | 13.7s |
| TriMap | 0.01 | -0.06 | 0.05 | 0.33 | 0.09 | 0.6s |
Legend:
- Trust. k=15: Trustworthiness - local neighborhood preservation (higher = better)
- Spearman: Distance correlation - global structure preservation (higher = better)
- Global: Inter-cluster distance preservation (higher = better)
- Silhouette: Cluster separation quality (higher = better)
- Classif. Acc: 5-fold CV classification accuracy (higher = better)
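For intuition on what the first two metrics measure, here is a standalone sketch using scikit-learn's built-in `trustworthiness` and SciPy's Spearman correlation over pairwise distances, evaluated on a PCA embedding of the same Digits dataset. Note this is an illustration only: `squeeze.evaluation` ships its own implementations, which may use a different scaling than the generic versions below.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

X = load_digits().data
X_2d = PCA(n_components=2).fit_transform(X)

# Trustworthiness: are the k nearest neighbours in the embedding
# also near neighbours in the original space? (1.0 = perfect)
trust = trustworthiness(X, X_2d, n_neighbors=15)

# Spearman rank correlation between original and embedded pairwise
# distances: a global structure preservation score
rho, _ = spearmanr(pdist(X), pdist(X_2d))

print(f"Trustworthiness (k=15): {trust:.3f}")
print(f"Spearman distance correlation: {rho:.3f}")
```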
| Rank | Algorithm | Best For |
|---|---|---|
| 1 | t-SNE | Local structure, cluster visualization |
| 2 | UMAP | Balanced local/global, fast |
| 3 | MDS | Global structure preservation |
| 4 | PaCMAP | Speed + quality tradeoff |
| 5 | PHATE | Biological trajectories |
Intelligent combinations of algorithms can outperform individual methods:
| Hybrid Technique | Trust. k=15 | Silhouette | Classif. Acc | Time | Key Benefit |
|---|---|---|---|---|---|
| PCA(50)→t-SNE | 0.59 | 0.66 | 0.97 | 32.3s | Best local structure |
| PCA(30)→UMAP | 0.52 | 0.80 | 0.98 | 6.5s | Best silhouette, 35% faster |
| Multi-scale UMAP | 0.51 | 0.77 | 0.98 | 16.4s | Captures multiple scales |
| MDS+UMAP Ensemble | 0.34 | 0.63 | 0.91 | 18.8s | Best global structure |
| Progressive PaCMAP→UMAP | 0.51 | 0.76 | 0.98 | 8.7s | Fast + refined |
Key Findings:
- PCA(50)→t-SNE achieves the best trustworthiness (0.59), beating pure t-SNE
- PCA(30)→UMAP is Pareto optimal: 35% faster than UMAP with better silhouette
- Hybrid pipelines inherit PCA's noise reduction + manifold method's structure preservation
```python
from squeeze.composition import DRPipeline, EnsembleDR, ProgressiveDR
from sklearn.decomposition import PCA
import squeeze

# Best overall: PCA → t-SNE
pipeline = DRPipeline([
    ('pca', PCA(n_components=50)),
    ('tsne', squeeze.TSNE(n_components=2))
])

# Fastest high-quality: PCA → UMAP
pipeline = DRPipeline([
    ('pca', PCA(n_components=30)),
    ('umap', squeeze.UMAP(n_components=2))
])

# Multi-scale structure
ensemble = EnsembleDR([
    ('local', squeeze.UMAP(n_neighbors=5), 0.5),
    ('global', squeeze.UMAP(n_neighbors=30), 0.5)
], blend_mode='procrustes')
```

Run the hybrid benchmark:

```bash
python benchmark_hybrid_techniques.py
```

Run the base benchmark:

```bash
just benchmark                       # Quick benchmark
python benchmark_metrics_heatmap.py  # Full metrics heatmap
```

Squeeze includes a Rust-based HNSW (Hierarchical Navigable Small World) backend with SIMD-accelerated distance computations:
k-NN backend comparison (sklearn Digits dataset):

| Backend | Build Time | Recall | Speedup |
|---|---|---|---|
| PyNNDescent | 6.512s | 93.3% | 1.00x |
| HNSW Simple | 0.242s | 95.1% | 26.96x |
| HNSW Robust α=1.2 | 0.256s | 97.6% | 25.46x |
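Recall here is the fraction of the true k nearest neighbours that the approximate index recovers. A minimal sketch of how to measure it yourself, with a hypothetical `knn_recall` helper; for illustration the "approximate" index is simply a second exact index built with a different algorithm, so in practice you would substitute the neighbour lists produced by Squeeze's HNSW backend or PyNNDescent:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import NearestNeighbors

def knn_recall(approx_idx, exact_idx):
    """Mean fraction of true k-NN recovered per query point."""
    k = exact_idx.shape[1]
    hits = [len(set(a) & set(e)) for a, e in zip(approx_idx, exact_idx)]
    return np.mean(hits) / k

X = load_digits().data
k = 15

# Ground truth: exact brute-force k-NN (dropping each point itself)
_, exact_idx = NearestNeighbors(n_neighbors=k + 1, algorithm='brute').fit(X).kneighbors(X)
exact_idx = exact_idx[:, 1:]

# Stand-in "approximate" neighbours; replace with a real ANN index
_, approx_idx = NearestNeighbors(n_neighbors=k + 1, algorithm='kd_tree').fit(X).kneighbors(X)
approx_idx = approx_idx[:, 1:]

recall = knn_recall(approx_idx, exact_idx)
print(f"Recall@{k}: {recall:.3f}")
```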
```python
import squeeze
import numpy as np

# Load your data
X = np.random.randn(1000, 50)  # 1000 samples, 50 features

# PCA - fast linear projection
pca = squeeze.PCA(n_components=2)
X_pca = pca.fit_transform(X)

# t-SNE - preserves local structure
tsne = squeeze.TSNE(n_components=2, perplexity=30, n_iter=1000)
X_tsne = tsne.fit_transform(X)

# MDS - preserves pairwise distances
mds = squeeze.MDS(n_components=2, metric=True, n_iter=300)
X_mds = mds.fit_transform(X)

# Isomap - geodesic distances on manifold
isomap = squeeze.Isomap(n_components=2, n_neighbors=10)
X_isomap = isomap.fit_transform(X)

# LLE - local linear relationships
lle = squeeze.LLE(n_components=2, n_neighbors=10)
X_lle = lle.fit_transform(X)

# PHATE - diffusion-based embedding
phate = squeeze.PHATE(n_components=2, k=15, t=10)
X_phate = phate.fit_transform(X)

# TriMap - triplet-based embedding
trimap = squeeze.TriMap(n_components=2, n_inliers=10, n_outliers=5)
X_trimap = trimap.fit_transform(X)

# PaCMAP - pair-based embedding
pacmap = squeeze.PaCMAP(n_components=2, n_neighbors=10)
X_pacmap = pacmap.fit_transform(X)
```

```python
import squeeze

# Use the fast HNSW backend (default)
reducer = squeeze.UMAP(
    n_neighbors=15,
    min_dist=0.1,
    use_hnsw=True,                 # Default
    hnsw_prune_strategy="robust",  # Better graph quality
    hnsw_alpha=1.2
)
embedding = reducer.fit_transform(data)
```

Chain multiple reduction techniques:
```python
from squeeze.composition import DRPipeline
from sklearn.decomposition import PCA
import squeeze

# 2048D → 100D → 2D
pipeline = DRPipeline([
    ('pca', PCA(n_components=100)),
    ('umap', squeeze.UMAP(n_components=2))
])
embedding = pipeline.fit_transform(high_dim_data)
```

Blend multiple algorithms:
```python
from squeeze.composition import EnsembleDR
from sklearn.decomposition import PCA
import squeeze

ensemble = EnsembleDR([
    ('pca', PCA(n_components=2), 0.3),
    ('umap', squeeze.UMAP(n_components=2), 0.7)
], alignment='procrustes')
blended = ensemble.fit_transform(data)
```

Efficient handling of sparse matrices:

```python
from squeeze.sparse_ops import SparseUMAP
import scipy.sparse as sp

sparse_data = sp.random(10000, 5000, density=0.05, format='csr')
embedding = SparseUMAP(n_components=2).fit_transform(sparse_data)
```

Comprehensive metrics for evaluating DR quality:
```python
from squeeze.evaluation import trustworthiness, continuity, DREvaluator, quick_evaluate

# Quick evaluation (3 core metrics)
metrics = quick_evaluate(X_original, X_embedded, k=15)
print(f"Trustworthiness: {metrics['trustworthiness']:.3f}")
print(f"Continuity: {metrics['continuity']:.3f}")
print(f"Spearman: {metrics['spearman_correlation']:.3f}")

# Comprehensive evaluation
evaluator = DREvaluator(X_original, X_embedded, labels=y, method_name='UMAP')
report = evaluator.evaluate_all()
print(report)

# Individual metrics
from squeeze.evaluation import (
    spearman_distance_correlation,
    global_structure_preservation,
    local_density_preservation,
    clustering_quality,
    classification_accuracy,
)
```

See the Evaluation Metrics Guide for full documentation.
```bash
# Clone the repo
git clone https://github.com/georgepearse/squeeze
cd squeeze

# Install with uv
uv sync --extra dev

# Build Rust extension
uv run maturin develop --release

# Run tests
uv run pytest squeeze/tests/ -v

# Run benchmarks
uv run python benchmark_optimizations.py
```

Or use the justfile:

```bash
just install    # Install deps + build
just test       # Run tests
just benchmark  # Run benchmarks
just lint       # Check code style
```

- Algorithm Agnostic: One library for all DR techniques
- Performance First: SIMD, Rust backend, optimized algorithms
- CPU-Focused: No GPU dependencies - runs everywhere
- Research Platform: Easy experimentation with techniques and parameters
- Production Ready: Reliable, tested, well-documented
If you use Squeeze in your research, please cite the original UMAP paper:
```bibtex
@article{mcinnes2018umap,
  title={UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction},
  author={McInnes, Leland and Healy, John and Melville, James},
  journal={arXiv preprint arXiv:1802.03426},
  year={2018}
}
```

Apache License 2.0
Squeeze builds on the excellent work of:
- UMAP by Leland McInnes
- PyNNDescent for approximate nearest neighbors
- The scientific Python ecosystem (NumPy, SciPy, scikit-learn, Numba)
- https://github.com/JelmerBot/fast_plscan