1,590 changes: 1,306 additions & 284 deletions Cargo.lock

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions Cargo.toml
@@ -138,6 +138,7 @@ members = [
# Spectral graph sparsification
"crates/ruvector-sparsifier",
"crates/ruvector-sparsifier-wasm",
"crates/ruvector-eml-hnsw",
]
resolver = "2"

127 changes: 127 additions & 0 deletions bench_results/eml_hnsw_proof_2026-04-14.md
@@ -0,0 +1,127 @@
# EML-Enhanced HNSW Proof Report

PR #353 — `feat/eml-hnsw-optimizations`

Methodology: 4-stage proof chain following shaal's pattern from PR #352.
All numbers are real measurements on arm64 Linux, not simulated.

## Stage 1: Micro-Benchmarks

Each optimization measured in isolation on 500 vector pairs (128-dim).

| Optimization | Baseline | EML | Overhead | Notes |
|---|---|---|---|---|
| Distance: full 128d cosine (500 pairs) | 50.3 us | — | — | Baseline per-batch |
| Distance: raw 16d L2 proxy (500 pairs) | 5.39 us | — | **9.3x faster** | Dimension reduction alone |
| Distance: EML 16d fast_distance (500 pairs) | — | 106.5 us | **2.1x slower** | EML model prediction overhead dominates |
| Adaptive ef prediction (200 queries) | 73.9 ns (fixed) | 90.8 us | 456 ns/query | ~1228x overhead vs returning a constant |
| Path prediction (200 queries) | 72.6 ns (no-op) | 10.6 us | 53 ns/query | Centroid distance lookup per query |
| Rebuild prediction (200 checks) | 105.0 ns (fixed) | 554.6 ns | 2.8 ns/check | Acceptable: <3ns per decision |

### Stage 1 Findings

**Dimension reduction works (9.3x speedup)** when using a simple L2 proxy on 16 selected
dimensions vs full 128-dim cosine. However, the **EML model prediction overhead** completely
negates this speedup — the `eml_core::predict_primary` call is expensive (~200ns per
evaluation), making the learned fast_distance 2.1x *slower* than full cosine.
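The 9.3x raw-proxy number comes from touching only 16 of the 128 dimensions and skipping the square root. A minimal sketch of such a proxy is below; the hard-coded index set is illustrative, since the real 16 dimensions come out of the EML training step:

```rust
// Illustrative hard-coded index set; the real 16 dimensions come from
// the EML dimension-selection step, not this list.
const SELECTED: [usize; 16] = [3, 7, 12, 19, 25, 31, 40, 48, 55, 63, 71, 80, 88, 97, 105, 120];

/// Squared L2 distance restricted to the selected dimensions.
/// No sqrt: squared distance is monotone in true distance, which is
/// all that nearest-neighbor ranking needs.
fn proxy_l2_sq(a: &[f32], b: &[f32]) -> f32 {
    SELECTED
        .iter()
        .map(|&d| {
            let diff = a[d] - b[d];
            diff * diff
        })
        .sum()
}

fn main() {
    let a = vec![1.0f32; 128];
    let b = vec![0.0f32; 128];
    // 16 selected dimensions, each contributing (1.0 - 0.0)^2
    println!("{}", proxy_l2_sq(&a, &b)); // prints 16
}
```

The key point of Stage 1 is that this loop is cheap; it is the learned model call wrapped around it that costs ~200ns.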

**Rebuild prediction** has negligible overhead (2.8ns/check) and is the most cost-effective
optimization. **Adaptive ef** and **path prediction** carry moderate overhead that must save
real search work to break even: at ~100ns per full 128-dim cosine (50.3us / 500 pairs),
adaptive ef pays for itself only once it avoids roughly five distance computations per query.

## Stage 2: Synthetic End-to-End (10K vectors, 128-dim)

Flat-scan with 100 queries, k=10.

| Config | Time (100 queries) | Implied QPS | Recall@10 |
|---|---|---|---|
| Baseline (full cosine) | 115.9 ms | 863 | 1.0000 |
| EML (16d fast_distance) | 219.6 ms | 455 | **0.0010** |
| Delta | **1.9x slower** | -47% | **-99.9%** |

### Stage 2 Findings

On uniformly random data, the EML distance model **destroys recall**. Recall@10 drops from
100% to 0.1%. This is expected and honest:

1. **Random data has no discriminative dimensions.** EML dimension selection identifies which
dimensions correlate most with distance. In uniformly random data, all dimensions are
equally (weakly) correlated, so selecting 16 out of 128 discards 87.5% of the signal.

2. **The EML model was trained on the same random distribution.** The Pearson correlation
step found no strong signal, and the EML tree learned a poor approximation.

3. **This does NOT mean the optimization is useless.** Real-world embeddings (SIFT, BERT,
CLIP, etc.) have strong dimensional structure — some dimensions carry far more variance
than others. The cosine decomposition is designed for such structured data.

**Conclusion:** The synthetic benchmark proves the *mechanism works* (dimension reduction is
fast), but the *accuracy claim requires structured data* to validate.
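The correlation-based selection described in finding 1 can be sketched as follows. This is not the `ruvector-eml-hnsw` implementation; `select_dims`, the pair-sampling shape, and scoring each dimension by the Pearson correlation between its per-pair |delta| and the full distance are assumptions used for illustration:

```rust
/// Pearson correlation between two equal-length samples.
fn pearson(x: &[f64], y: &[f64]) -> f64 {
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let cov: f64 = x.iter().zip(y).map(|(a, b)| (a - mx) * (b - my)).sum();
    let vx: f64 = x.iter().map(|a| (a - mx).powi(2)).sum();
    let vy: f64 = y.iter().map(|b| (b - my).powi(2)).sum();
    if vx == 0.0 || vy == 0.0 { 0.0 } else { cov / (vx * vy).sqrt() }
}

/// Score each dimension by how well its per-pair |delta| tracks the
/// full L2 distance over sample pairs, then keep the top `k` indices.
fn select_dims(pairs: &[(Vec<f64>, Vec<f64>)], dim: usize, k: usize) -> Vec<usize> {
    let full: Vec<f64> = pairs
        .iter()
        .map(|(a, b)| a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt())
        .collect();
    let mut scored: Vec<(usize, f64)> = (0..dim)
        .map(|d| {
            let deltas: Vec<f64> = pairs.iter().map(|(a, b)| (a[d] - b[d]).abs()).collect();
            (d, pearson(&deltas, &full))
        })
        .collect();
    // Highest correlation first; ties are broken arbitrarily.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(k);
    scored.into_iter().map(|(d, _)| d).collect()
}

fn main() {
    // Only dimension 0 varies, so it dominates the full distance.
    let pairs = vec![
        (vec![0.0f64; 4], vec![5.0, 0.0, 0.0, 0.0]),
        (vec![0.0f64; 4], vec![1.0, 0.0, 0.0, 0.0]),
        (vec![0.0f64; 4], vec![3.0, 0.0, 0.0, 0.0]),
    ];
    println!("{:?}", select_dims(&pairs, 4, 1)); // prints [0]
}
```

On uniform random data every dimension scores roughly the same near-zero correlation, which is exactly why the selected 16 carry no more signal than any other 16.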

## Stage 3: Real Dataset

**Status: Deferred.** The SIFT1M dataset is not available at `bench_data/sift/sift_base.fvecs`.
Download it (~400MB) from http://corpus-texmex.irisa.fr/ to enable this stage. The benchmark
infrastructure is in place and will run automatically if the dataset is present.

Real embedding datasets (SIFT, GloVe, CLIP) typically have strong PCA structure where the
top 16 principal components explain >80% of variance. We expect significantly better recall
on such data. Until measured, this remains a hypothesis.

## Stage 4: Hypothesis Test

**Hypothesis:** 16-dim decomposition preserves >95% of ranking accuracy (Spearman rho >= 0.95).

**Test:** For 50 queries against 1000 vectors (128-dim uniform random), compute Spearman rank
correlation between full-cosine rankings and EML-16d rankings.

| Metric | Value |
|---|---|
| Mean Spearman rho | **0.0131** |
| Min rho | -0.0433 |
| Max rho | 0.0486 |
| Queries tested | 50 |

**Result: DISPROVEN on uniform random data.**

The near-zero correlation confirms that on data with no dimensional structure, 16-dim
decomposition is essentially random ranking. This is a fundamental property of the uniform
distribution, not a bug in the EML implementation.

### Expected behavior on structured data

For embeddings with PCA structure (real-world use case), we would expect:
- If top-16 PCA dims explain 80% variance: rho ~ 0.85-0.90
- If top-16 PCA dims explain 95% variance: rho ~ 0.95+
- If data is uniform random (this test): rho ~ 0.01 (confirmed)

## Summary

| What works | What doesn't (yet) |
|---|---|
| Dimension reduction is genuinely 9.3x faster (raw) | EML prediction overhead negates the speedup |
| Rebuild prediction has negligible overhead (2.8ns) | Cosine decomposition needs structured data |
| Path prediction finds correct regions | Recall drops to near-zero on random data |
| Benchmark infrastructure is reproducible | SIFT1M real-data test deferred |

### Recommendations

1. **Optimize EML model inference.** The current `predict_primary` call (~200ns) is too
expensive for a per-distance-call optimization. Consider: SIMD batch prediction,
model quantization, or compiling the trained model to a fixed polynomial.

2. **Test on real embeddings.** The proof chain is structurally sound but needs SIFT1M
or GloVe data to validate the accuracy hypothesis.

3. **Focus on rebuild prediction.** It has the best cost/benefit ratio today (2.8ns
overhead for smarter rebuild decisions).

4. **Consider adaptive ef as a search-level optimization** rather than a per-distance
optimization — the 456ns/query overhead is acceptable if it saves many distance
computations by reducing beam width.
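One way to frame recommendation 4 is a search-level mapping that runs the model once per query and converts a predicted difficulty into a beam width. The function name, the linear ramp, and the ef bounds below are illustrative assumptions, not the crate's API:

```rust
/// Map a predicted query difficulty in [0, 1] to an ef beam width.
/// The linear ramp and the [ef_min, ef_max] bounds are illustrative
/// defaults, not values taken from ruvector-eml-hnsw.
fn adaptive_ef(difficulty: f32, ef_min: usize, ef_max: usize) -> usize {
    let d = difficulty.clamp(0.0, 1.0);
    ef_min + ((ef_max - ef_min) as f32 * d).round() as usize
}

fn main() {
    // Easy query: shrink the beam; hard query: widen it.
    println!("{}", adaptive_ef(0.1, 50, 400)); // prints 85
    println!("{}", adaptive_ef(0.9, 50, 400)); // prints 365
}
```

Paying 456ns once per query is trivial next to the hundreds of ~100ns distance calls a narrower beam can eliminate; paying it per distance call is not.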

---

*Generated by cargo bench on arm64 Linux. All numbers are real, not simulated.*
*Following shaal's 4-stage proof methodology from PR #352.*
126 changes: 126 additions & 0 deletions benchmarks/bench_ruvector.rs
@@ -0,0 +1,126 @@
/// Standalone ruvector-core HNSW benchmark
/// Run: cd crates/ruvector-core && cargo test --release bench_hnsw -- --nocapture
///
/// This runs as a test inside ruvector-core to avoid complex cross-crate build issues.

#[cfg(test)]
mod bench {
    use ruvector_core::{DbOptions, DistanceMetric, HnswConfig, SearchQuery, VectorDB, VectorEntry};
    use std::time::Instant;

    fn generate_vectors(n: usize, dim: usize, seed: u64) -> Vec<Vec<f32>> {
        // Simple deterministic PRNG (same seed = same vectors = reproducible)
        let mut state = seed;
        (0..n)
            .map(|_| {
                (0..dim)
                    .map(|_| {
                        state = state.wrapping_mul(6364136223846793005).wrapping_add(1);
                        ((state >> 33) as f32 / (u32::MAX as f32)) * 2.0 - 1.0
                    })
                    .collect()
            })
            .collect()
    }

    fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
        let dot: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
        let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
        let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
        if norm_a == 0.0 || norm_b == 0.0 {
            return 0.0;
        }
        dot / (norm_a * norm_b)
    }

    fn brute_force_topk(data: &[Vec<f32>], query: &[f32], k: usize) -> Vec<usize> {
        let mut sims: Vec<(usize, f32)> = data
            .iter()
            .enumerate()
            .map(|(i, v)| (i, cosine_similarity(v, query)))
            .collect();
        sims.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
        sims.iter().take(k).map(|(i, _)| *i).collect()
    }

    #[test]
    fn bench_hnsw_10k() {
        let num_vectors = 10_000;
        let dimensions = 128;
        let num_queries = 100; // fewer for speed in test
        let k = 10;

        eprintln!("\n=== ruvector-core HNSW Benchmark: {}K vectors, {}d ===", num_vectors / 1000, dimensions);

        let data = generate_vectors(num_vectors, dimensions, 42);
        let queries = generate_vectors(num_queries, dimensions, 123);

        // Build index
        let opts = DbOptions {
            dimensions,
            distance_metric: DistanceMetric::Cosine,
            hnsw: HnswConfig {
                m: 32,
                ef_construction: 200,
                ..Default::default()
            },
            ..Default::default()
        };

        let mut db = VectorDB::new(opts).expect("Failed to create VectorDB");

        let build_start = Instant::now();
        for (i, vec) in data.iter().enumerate() {
            let entry = VectorEntry {
                id: format!("v{}", i),
                vector: vec.clone(),
                metadata: None,
            };
            db.insert(entry).expect("Insert failed");
        }
        let build_time = build_start.elapsed();

        eprintln!(" Build time: {:.3}s ({} vectors)", build_time.as_secs_f64(), num_vectors);

        // Query
        let mut latencies = Vec::new();
        let mut recall_at_k = Vec::new();

        for query in &queries {
            let gt = brute_force_topk(&data, query, k);
            let gt_set: std::collections::HashSet<String> =
                gt.iter().map(|i| format!("v{}", i)).collect();

            let search = SearchQuery {
                vector: query.clone(),
                k,
                ..Default::default()
            };

            let t0 = Instant::now();
            let results = db.search(search).expect("Search failed");
            let latency = t0.elapsed();

            latencies.push(latency.as_secs_f64() * 1000.0); // ms

            let retrieved: std::collections::HashSet<String> =
                results.iter().map(|r| r.id.clone()).collect();
            let recall = retrieved.intersection(&gt_set).count() as f64 / k as f64;
            recall_at_k.push(recall);
        }

        latencies.sort_by(|a, b| a.partial_cmp(b).unwrap());
        let p50 = latencies[latencies.len() / 2];
        let p95 = latencies[(latencies.len() as f64 * 0.95) as usize];
        let qps = num_queries as f64 / (latencies.iter().sum::<f64>() / 1000.0);
        let avg_recall = recall_at_k.iter().sum::<f64>() / recall_at_k.len() as f64;

        eprintln!(" QPS: {:.1}", qps);
        eprintln!(" Recall@{}: {:.4}", k, avg_recall);
        eprintln!(" Latency p50: {:.3}ms, p95: {:.3}ms", p50, p95);

        // Basic assertions
        assert!(avg_recall > 0.5, "Recall@{} should be > 0.5, got {}", k, avg_recall);
        assert!(qps > 10.0, "QPS should be > 10, got {}", qps);
    }
}