Squeeze

High-performance dimensionality reduction for Python

Squeeze is a fast, CPU-optimized library for dimensionality reduction techniques including UMAP, t-SNE, PCA, and more. Built with a Rust backend and SIMD vectorization for maximum performance.

Why Squeeze?

Dimensionality reduction "squeezes" high-dimensional data into lower dimensions while preserving structure. Squeeze provides:

Multiple Algorithms: UMAP today, t-SNE, PCA, Isomap, and more coming soon
Fast: 27x faster k-NN construction than PyNNDescent via HNSW with SIMD
CPU-Optimized: No GPU required - runs anywhere
Production Ready: Scikit-learn compatible API

I've used my own fork of Shinka Evolve (https://github.com/GeorgePearse/Genesis) in order to iteratively optimize the solutions

Installation

pip install squeeze

Or with uv (recommended):

uv pip install squeeze

Quick Start

import squeeze
from sklearn.datasets import load_digits

digits = load_digits()

# UMAP embedding
umap_embedding = squeeze.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(digits.data)

# t-SNE embedding
tsne_embedding = squeeze.TSNE(perplexity=30, n_iter=1000).fit_transform(digits.data)

# PCA embedding
pca_embedding = squeeze.PCA(n_components=2).fit_transform(digits.data)

# All algorithms share a consistent API: fit_transform(X)

Supported Algorithms

All algorithms are implemented in Rust for maximum performance. See the Algorithm Guide for detailed documentation on each algorithm.

Algorithm	Status	Description	Docs
UMAP	✅ Implemented	Uniform Manifold Approximation and Projection	Guide
t-SNE	✅ Implemented	t-Distributed Stochastic Neighbor Embedding	Guide
PCA	✅ Implemented	Principal Component Analysis (eigendecomposition)	Guide
Isomap	✅ Implemented	Isometric Mapping (geodesic distances + MDS)	Guide
LLE	✅ Implemented	Locally Linear Embedding	Guide
MDS	✅ Implemented	Multidimensional Scaling (classical + metric SMACOF)	Guide
PHATE	✅ Implemented	Potential of Heat-diffusion for Affinity-based Trajectory Embedding	Guide
TriMap	✅ Implemented	Large-scale Dimensionality Reduction Using Triplets	Guide
PaCMAP	✅ Implemented	Pairwise Controlled Manifold Approximation	Guide

Choosing an Algorithm

Use Case	Recommended	Why
Quick exploration	PCA	Fast, interpretable
Cluster visualization	t-SNE, UMAP	Best local structure
Large datasets (>100k)	PaCMAP, TriMap	Fast, scalable
Biological trajectories	PHATE	Designed for this
Best speed/quality	PaCMAP	0.13s, 0.978 trustworthiness

Benchmark Results

All algorithms benchmarked on the sklearn Digits dataset (1,797 samples, 64 features):

Algorithm vs Metrics Heatmap

Green = best performance, Red = worst performance (relative within each metric)

Metrics Comparison Table

Algorithm	Trust. k=15	Spearman	Global	Silhouette	Classif. Acc	Time
t-SNE	0.59	0.41	0.63	0.65	0.97	24.2s
UMAP	0.51	0.36	0.62	0.78	0.98	9.2s
PaCMAP	0.48	0.24	0.30	0.71	0.97	0.2s
MDS	0.20	0.73	0.83	0.39	0.72	10.6s
PHATE	0.16	0.55	0.79	0.40	0.61	7.4s
PCA	0.15	0.58	0.82	0.39	0.61	<0.01s
Isomap	0.09	0.29	0.55	0.39	0.41	6.3s
LLE	0.02	-0.03	0.05	0.32	0.13	13.7s
TriMap	0.01	-0.06	0.05	0.33	0.09	0.6s

Legend:

Trust. k=15: Trustworthiness - local neighborhood preservation (higher = better)
Spearman: Distance correlation - global structure preservation (higher = better)
Global: Inter-cluster distance preservation (higher = better)
Silhouette: Cluster separation quality (higher = better)
Classif. Acc: 5-fold CV classification accuracy (higher = better)

Overall Rankings

Rank	Algorithm	Best For
1	t-SNE	Local structure, cluster visualization
2	UMAP	Balanced local/global, fast
3	MDS	Global structure preservation
4	PaCMAP	Speed + quality tradeoff
5	PHATE	Biological trajectories

Hybrid Techniques

Intelligent combinations of algorithms can outperform individual methods:

Hybrid Technique	Trust. k=15	Silhouette	Classif. Acc	Time	Key Benefit
PCA(50)→t-SNE	0.59	0.66	0.97	32.3s	Best local structure
PCA(30)→UMAP	0.52	0.80	0.98	6.5s	Best silhouette, 35% faster
Multi-scale UMAP	0.51	0.77	0.98	16.4s	Captures multiple scales
MDS+UMAP Ensemble	0.34	0.63	0.91	18.8s	Best global structure
Progressive PaCMAP→UMAP	0.51	0.76	0.98	8.7s	Fast + refined

Key Findings:

PCA(50)→t-SNE achieves the best trustworthiness (0.59), beating pure t-SNE
PCA(30)→UMAP is Pareto optimal: 35% faster than UMAP with better silhouette
Hybrid pipelines inherit PCA's noise reduction + manifold method's structure preservation

from squeeze.composition import DRPipeline, EnsembleDR, ProgressiveDR
from sklearn.decomposition import PCA
import squeeze

# Best overall: PCA → t-SNE
pipeline = DRPipeline([
    ('pca', PCA(n_components=50)),
    ('tsne', squeeze.TSNE(n_components=2))
])

# Fastest high-quality: PCA → UMAP  
pipeline = DRPipeline([
    ('pca', PCA(n_components=30)),
    ('umap', squeeze.UMAP(n_components=2))
])

# Multi-scale structure
ensemble = EnsembleDR([
    ('local', squeeze.UMAP(n_neighbors=5), 0.5),
    ('global', squeeze.UMAP(n_neighbors=30), 0.5)
], blend_mode='procrustes')

Run the hybrid benchmark:

python benchmark_hybrid_techniques.py

Run the base benchmark:

just benchmark              # Quick benchmark
python benchmark_metrics_heatmap.py  # Full metrics heatmap

k-NN Backend Performance

Squeeze includes a Rust-based HNSW (Hierarchical Navigable Small World) backend with SIMD-accelerated distance computations:

k-NN Backend Comparison (sklearn digits dataset)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Backend              Build Time   Recall    Speedup
─────────────────────────────────────────────────
PyNNDescent          6.512s       93.3%     1.00x
HNSW Simple          0.242s       95.1%     26.96x
HNSW Robust α=1.2    0.256s       97.6%     25.46x

Features

All Algorithms

import squeeze
import numpy as np

# Load your data
X = np.random.randn(1000, 50)  # 1000 samples, 50 features

# PCA - fast linear projection
pca = squeeze.PCA(n_components=2)
X_pca = pca.fit_transform(X)

# t-SNE - preserves local structure
tsne = squeeze.TSNE(n_components=2, perplexity=30, n_iter=1000)
X_tsne = tsne.fit_transform(X)

# MDS - preserves pairwise distances
mds = squeeze.MDS(n_components=2, metric=True, n_iter=300)
X_mds = mds.fit_transform(X)

# Isomap - geodesic distances on manifold
isomap = squeeze.Isomap(n_components=2, n_neighbors=10)
X_isomap = isomap.fit_transform(X)

# LLE - local linear relationships
lle = squeeze.LLE(n_components=2, n_neighbors=10)
X_lle = lle.fit_transform(X)

# PHATE - diffusion-based embedding
phate = squeeze.PHATE(n_components=2, k=15, t=10)
X_phate = phate.fit_transform(X)

# TriMap - triplet-based embedding
trimap = squeeze.TriMap(n_components=2, n_inliers=10, n_outliers=5)
X_trimap = trimap.fit_transform(X)

# PaCMAP - pair-based embedding
pacmap = squeeze.PaCMAP(n_components=2, n_neighbors=10)
X_pacmap = pacmap.fit_transform(X)

UMAP with HNSW Backend

import squeeze

# Use the fast HNSW backend (default)
reducer = squeeze.UMAP(
    n_neighbors=15,
    min_dist=0.1,
    use_hnsw=True,  # Default
    hnsw_prune_strategy="robust",  # Better graph quality
    hnsw_alpha=1.2
)
embedding = reducer.fit_transform(data)

Composition Pipeline

Chain multiple reduction techniques:

from squeeze.composition import DRPipeline
from sklearn.decomposition import PCA
import squeeze

# 2048D → 100D → 2D
pipeline = DRPipeline([
    ('pca', PCA(n_components=100)),
    ('umap', squeeze.UMAP(n_components=2))
])
embedding = pipeline.fit_transform(high_dim_data)

Ensemble Methods

Blend multiple algorithms:

from squeeze.composition import EnsembleDR
from sklearn.decomposition import PCA
import squeeze

ensemble = EnsembleDR([
    ('pca', PCA(n_components=2), 0.3),
    ('umap', squeeze.UMAP(n_components=2), 0.7)
], alignment='procrustes')

blended = ensemble.fit_transform(data)

Sparse Data Support

Efficient handling of sparse matrices:

from squeeze.sparse_ops import SparseUMAP
import scipy.sparse as sp

sparse_data = sp.random(10000, 5000, density=0.05, format='csr')
embedding = SparseUMAP(n_components=2).fit_transform(sparse_data)

Evaluation Metrics

Comprehensive metrics for evaluating DR quality:

from squeeze.evaluation import trustworthiness, continuity, DREvaluator, quick_evaluate

# Quick evaluation (3 core metrics)
metrics = quick_evaluate(X_original, X_embedded, k=15)
print(f"Trustworthiness: {metrics['trustworthiness']:.3f}")
print(f"Continuity: {metrics['continuity']:.3f}")
print(f"Spearman: {metrics['spearman_correlation']:.3f}")

# Comprehensive evaluation
evaluator = DREvaluator(X_original, X_embedded, labels=y, method_name='UMAP')
report = evaluator.evaluate_all()
print(report)

# Individual metrics
from squeeze.evaluation import (
    spearman_distance_correlation,
    global_structure_preservation,
    local_density_preservation,
    clustering_quality,
    classification_accuracy,
)

See Evaluation Metrics Guide for full documentation.

Development

# Clone the repo
git clone https://github.com/georgepearse/squeeze
cd squeeze

# Install with uv
uv sync --extra dev

# Build Rust extension
uv run maturin develop --release

# Run tests
uv run pytest squeeze/tests/ -v

# Run benchmarks
uv run python benchmark_optimizations.py

Or use the justfile:

just install    # Install deps + build
just test       # Run tests
just benchmark  # Run benchmarks
just lint       # Check code style

Project Philosophy

Algorithm Agnostic: One library for all DR techniques
Performance First: SIMD, Rust backend, optimized algorithms
CPU-Focused: No GPU dependencies - runs everywhere
Research Platform: Easy experimentation with techniques and parameters
Production Ready: Reliable, tested, well-documented

Citation

If you use Squeeze in your research, please cite the original UMAP paper:

@article{mcinnes2018umap,
  title={UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction},
  author={McInnes, Leland and Healy, John and Melville, James},
  journal={arXiv preprint arXiv:1802.03426},
  year={2018}
}

License

Apache License 2.0

Acknowledgments

Squeeze builds on the excellent work of:

UMAP by Leland McInnes
PyNNDescent for approximate nearest neighbors
The scientific Python ecosystem (NumPy, SciPy, scikit-learn, Numba)
https://github.com/JelmerBot/fast_plscan

Name		Name	Last commit message	Last commit date
Latest commit History 2,004 Commits
.code/agents		.code/agents
.github/workflows		.github/workflows
.idea		.idea
benches		benches
ci_scripts		ci_scripts
demos		demos
doc		doc
docs		docs
examples		examples
images		images
notebooks		notebooks
scripts		scripts
site		site
squeeze		squeeze
src		src
working_docs		working_docs
.gitattributes		.gitattributes
.gitignore		.gitignore
.pep8speaks.yml		.pep8speaks.yml
.pre-commit-config.yaml		.pre-commit-config.yaml
.readthedocs.yaml		.readthedocs.yaml
AGENTS.md		AGENTS.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE.txt		LICENSE.txt
Makefile		Makefile
README.md		README.md
azure-pipelines.yml		azure-pipelines.yml
benchmark_hybrid_techniques.py		benchmark_hybrid_techniques.py
benchmark_metrics_heatmap.py		benchmark_metrics_heatmap.py
benchmark_optimizations.py		benchmark_optimizations.py
benchmark_results.png		benchmark_results.png
convert_rst_to_md.py		convert_rst_to_md.py
docs_requirements.txt		docs_requirements.txt
embeddings_comparison.png		embeddings_comparison.png
fix_all_ruff.py		fix_all_ruff.py
fix_ruff_errors.py		fix_ruff_errors.py
hybrid_comparison.png		hybrid_comparison.png
hybrid_metrics_heatmap.png		hybrid_metrics_heatmap.png
hybrid_results.csv		hybrid_results.csv
justfile		justfile
metrics_heatmap.png		metrics_heatmap.png
metrics_results.csv		metrics_results.csv
mkdocs.yml		mkdocs.yml
paper.bib		paper.bib
paper.md		paper.md
pre-commit-output.log		pre-commit-output.log
pyproject.toml		pyproject.toml
setup.py		setup.py
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Squeeze

Why Squeeze?

Installation

Quick Start

Supported Algorithms

Choosing an Algorithm

Benchmark Results

Algorithm vs Metrics Heatmap

Metrics Comparison Table

Overall Rankings

Hybrid Techniques

k-NN Backend Performance

Features

All Algorithms

UMAP with HNSW Backend

Composition Pipeline

Ensemble Methods

Sparse Data Support

Evaluation Metrics

Development

Project Philosophy

Citation

License

Acknowledgments

About

Uh oh!

Releases

Packages

Contributors 130

Uh oh!

Languages

License

GeorgePearse/squeeze

Folders and files

Latest commit

History

Repository files navigation

Squeeze

Why Squeeze?

Installation

Quick Start

Supported Algorithms

Choosing an Algorithm

Benchmark Results

Algorithm vs Metrics Heatmap

Metrics Comparison Table

Overall Rankings

Hybrid Techniques

k-NN Backend Performance

Features

All Algorithms

UMAP with HNSW Backend

Composition Pipeline

Ensemble Methods

Sparse Data Support

Evaluation Metrics

Development

Project Philosophy

Citation

License

Acknowledgments

About

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 130

Uh oh!

Languages

Packages