1,590 changes: 1,306 additions & 284 deletions Cargo.lock

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions Cargo.toml
@@ -138,6 +138,7 @@ members = [
# Spectral graph sparsification
"crates/ruvector-sparsifier",
"crates/ruvector-sparsifier-wasm",
"crates/ruvector-eml-hnsw",
]
resolver = "2"

127 changes: 127 additions & 0 deletions bench_results/eml_hnsw_proof_2026-04-14.md
@@ -0,0 +1,127 @@
# EML-Enhanced HNSW Proof Report

PR #353 — `feat/eml-hnsw-optimizations`

Methodology: 4-stage proof chain following shaal's pattern from PR #352.
All numbers are real measurements on arm64 Linux, not simulated.

## Stage 1: Micro-Benchmarks

Each optimization measured in isolation on 500 vector pairs (128-dim).

| Optimization | Baseline | EML | Overhead | Notes |
|---|---|---|---|---|
| Distance: full 128d cosine (500 pairs) | 50.3 us | — | — | Baseline per-batch |
| Distance: raw 16d L2 proxy (500 pairs) | 5.39 us | — | **9.3x faster** | Dimension reduction alone |
| Distance: EML 16d fast_distance (500 pairs) | — | 106.5 us | **2.1x slower** | EML model prediction overhead dominates |
| Adaptive ef prediction (200 queries) | 73.9 ns (fixed) | 90.8 us | 456 ns/query | ~1228x overhead vs returning a constant |
| Path prediction (200 queries) | 72.6 ns (no-op) | 10.6 us | 53 ns/query | Centroid distance lookup per query |
| Rebuild prediction (200 checks) | 105.0 ns (fixed) | 554.6 ns | 2.8 ns/check | Acceptable: <3ns per decision |

### Stage 1 Findings

**Dimension reduction works (9.3x speedup)** when using a simple L2 proxy on 16 selected
dimensions vs full 128-dim cosine. However, the **EML model prediction overhead** completely
negates this speedup — the `eml_core::predict_primary` call is expensive (~200ns per
evaluation), making the learned fast_distance 2.1x *slower* than full cosine.
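The 9.3x raw-proxy number comes from touching only 16 of the 128 dimensions and skipping the square root. A minimal sketch of such a proxy is below; the hard-coded index set is illustrative, since the real 16 dimensions come out of the EML training step:

```rust
// Illustrative hard-coded index set; the real 16 dimensions come from
// the EML dimension-selection step, not this list.
const SELECTED: [usize; 16] = [3, 7, 12, 19, 25, 31, 40, 48, 55, 63, 71, 80, 88, 97, 105, 120];

/// Squared L2 distance restricted to the selected dimensions.
/// No sqrt: squared distance is monotone in true distance, which is
/// all that nearest-neighbor ranking needs.
fn proxy_l2_sq(a: &[f32], b: &[f32]) -> f32 {
    SELECTED
        .iter()
        .map(|&d| {
            let diff = a[d] - b[d];
            diff * diff
        })
        .sum()
}

fn main() {
    let a = vec![1.0f32; 128];
    let b = vec![0.0f32; 128];
    // 16 selected dimensions, each contributing (1.0 - 0.0)^2
    println!("{}", proxy_l2_sq(&a, &b)); // prints 16
}
```

The key point of Stage 1 is that this loop is cheap; it is the learned model call wrapped around it that costs ~200ns.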

**Rebuild prediction** has negligible overhead (2.8ns/check) and is the most cost-effective
optimization. **Adaptive ef** and **path prediction** carry moderate overhead that must save
real search work to break even: at ~100ns per full 128-dim cosine (50.3us / 500 pairs),
adaptive ef pays for itself only once it avoids roughly five distance computations per query.

## Stage 2: Synthetic End-to-End (10K vectors, 128-dim)

Flat-scan with 100 queries, k=10.

| Config | Time (100 queries) | Implied QPS | Recall@10 |
|---|---|---|---|
| Baseline (full cosine) | 115.9 ms | 863 | 1.0000 |
| EML (16d fast_distance) | 219.6 ms | 455 | **0.0010** |
| Delta | **1.9x slower** | -47% | **-99.9%** |

### Stage 2 Findings

On uniformly random data, the EML distance model **destroys recall**. Recall@10 drops from
100% to 0.1%. This is expected and honest:

1. **Random data has no discriminative dimensions.** EML dimension selection identifies which
dimensions correlate most with distance. In uniformly random data, all dimensions are
equally (weakly) correlated, so selecting 16 out of 128 discards 87.5% of the signal.

2. **The EML model was trained on the same random distribution.** The Pearson correlation
step found no strong signal, and the EML tree learned a poor approximation.

3. **This does NOT mean the optimization is useless.** Real-world embeddings (SIFT, BERT,
CLIP, etc.) have strong dimensional structure — some dimensions carry far more variance
than others. The cosine decomposition is designed for such structured data.

**Conclusion:** The synthetic benchmark proves the *mechanism works* (dimension reduction is
fast), but the *accuracy claim requires structured data* to validate.
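The correlation-based selection described in finding 1 can be sketched as follows. This is not the `ruvector-eml-hnsw` implementation; `select_dims`, the pair-sampling shape, and scoring each dimension by the Pearson correlation between its per-pair |delta| and the full distance are assumptions used for illustration:

```rust
/// Pearson correlation between two equal-length samples.
fn pearson(x: &[f64], y: &[f64]) -> f64 {
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let cov: f64 = x.iter().zip(y).map(|(a, b)| (a - mx) * (b - my)).sum();
    let vx: f64 = x.iter().map(|a| (a - mx).powi(2)).sum();
    let vy: f64 = y.iter().map(|b| (b - my).powi(2)).sum();
    if vx == 0.0 || vy == 0.0 { 0.0 } else { cov / (vx * vy).sqrt() }
}

/// Score each dimension by how well its per-pair |delta| tracks the
/// full L2 distance over sample pairs, then keep the top `k` indices.
fn select_dims(pairs: &[(Vec<f64>, Vec<f64>)], dim: usize, k: usize) -> Vec<usize> {
    let full: Vec<f64> = pairs
        .iter()
        .map(|(a, b)| a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt())
        .collect();
    let mut scored: Vec<(usize, f64)> = (0..dim)
        .map(|d| {
            let deltas: Vec<f64> = pairs.iter().map(|(a, b)| (a[d] - b[d]).abs()).collect();
            (d, pearson(&deltas, &full))
        })
        .collect();
    // Highest correlation first; ties are broken arbitrarily.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(k);
    scored.into_iter().map(|(d, _)| d).collect()
}

fn main() {
    // Only dimension 0 varies, so it dominates the full distance.
    let pairs = vec![
        (vec![0.0f64; 4], vec![5.0, 0.0, 0.0, 0.0]),
        (vec![0.0f64; 4], vec![1.0, 0.0, 0.0, 0.0]),
        (vec![0.0f64; 4], vec![3.0, 0.0, 0.0, 0.0]),
    ];
    println!("{:?}", select_dims(&pairs, 4, 1)); // prints [0]
}
```

On uniform random data every dimension scores roughly the same near-zero correlation, which is exactly why the selected 16 carry no more signal than any other 16.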

## Stage 3: Real Dataset

**Status: Deferred.** The SIFT1M dataset is not available at `bench_data/sift/sift_base.fvecs`.
Download it (~400MB) from http://corpus-texmex.irisa.fr/ to enable this stage. The benchmark
infrastructure is in place and will run automatically if the dataset is present.

Real embedding datasets (SIFT, GloVe, CLIP) typically have strong PCA structure where the
top 16 principal components explain >80% of variance. We expect significantly better recall
on such data. Until measured, this remains a hypothesis.

## Stage 4: Hypothesis Test

**Hypothesis:** 16-dim decomposition preserves >95% of ranking accuracy (Spearman rho >= 0.95).

**Test:** For 50 queries against 1000 vectors (128-dim uniform random), compute Spearman rank
correlation between full-cosine rankings and EML-16d rankings.

| Metric | Value |
|---|---|
| Mean Spearman rho | **0.0131** |
| Min rho | -0.0433 |
| Max rho | 0.0486 |
| Queries tested | 50 |

**Result: DISPROVEN on uniform random data.**

The near-zero correlation confirms that on data with no dimensional structure, 16-dim
decomposition is essentially random ranking. This is a fundamental property of the uniform
distribution, not a bug in the EML implementation.

### Expected behavior on structured data

For embeddings with PCA structure (real-world use case), we would expect:
- If top-16 PCA dims explain 80% variance: rho ~ 0.85-0.90
- If top-16 PCA dims explain 95% variance: rho ~ 0.95+
- If data is uniform random (this test): rho ~ 0.01 (confirmed)

## Summary

| What works | What doesn't (yet) |
|---|---|
| Dimension reduction is genuinely 9.3x faster (raw) | EML prediction overhead negates the speedup |
| Rebuild prediction has negligible overhead (2.8ns) | Cosine decomposition needs structured data |
| Path prediction finds correct regions | Recall drops to near-zero on random data |
| Benchmark infrastructure is reproducible | SIFT1M real-data test deferred |

### Recommendations

1. **Optimize EML model inference.** The current `predict_primary` call (~200ns) is too
expensive for a per-distance-call optimization. Consider: SIMD batch prediction,
model quantization, or compiling the trained model to a fixed polynomial.

2. **Test on real embeddings.** The proof chain is structurally sound but needs SIFT1M
or GloVe data to validate the accuracy hypothesis.

3. **Focus on rebuild prediction.** It has the best cost/benefit ratio today (2.8ns
overhead for smarter rebuild decisions).

4. **Consider adaptive ef as a search-level optimization** rather than a per-distance
optimization — the 456ns/query overhead is acceptable if it saves many distance
computations by reducing beam width.
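One way to frame recommendation 4 is a search-level mapping that runs the model once per query and converts a predicted difficulty into a beam width. The function name, the linear ramp, and the ef bounds below are illustrative assumptions, not the crate's API:

```rust
/// Map a predicted query difficulty in [0, 1] to an ef beam width.
/// The linear ramp and the [ef_min, ef_max] bounds are illustrative
/// defaults, not values taken from ruvector-eml-hnsw.
fn adaptive_ef(difficulty: f32, ef_min: usize, ef_max: usize) -> usize {
    let d = difficulty.clamp(0.0, 1.0);
    ef_min + ((ef_max - ef_min) as f32 * d).round() as usize
}

fn main() {
    // Easy query: shrink the beam; hard query: widen it.
    println!("{}", adaptive_ef(0.1, 50, 400)); // prints 85
    println!("{}", adaptive_ef(0.9, 50, 400)); // prints 365
}
```

Paying 456ns once per query is trivial next to the hundreds of ~100ns distance calls a narrower beam can eliminate; paying it per distance call is not.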

---

*Generated by cargo bench on arm64 Linux. All numbers are real, not simulated.*
*Following shaal's 4-stage proof methodology from PR #352.*
126 changes: 126 additions & 0 deletions benchmarks/bench_ruvector.rs
@@ -0,0 +1,126 @@
/// Standalone ruvector-core HNSW benchmark
/// Run: cd crates/ruvector-core && cargo test --release bench_hnsw -- --nocapture
///
/// This runs as a test inside ruvector-core to avoid complex cross-crate build issues.

#[cfg(test)]
mod bench {
    use ruvector_core::{DbOptions, DistanceMetric, HnswConfig, SearchQuery, VectorDB, VectorEntry};
    use std::time::Instant;

    fn generate_vectors(n: usize, dim: usize, seed: u64) -> Vec<Vec<f32>> {
        // Simple deterministic PRNG (same seed = same vectors = reproducible)
        let mut state = seed;
        (0..n)
            .map(|_| {
                (0..dim)
                    .map(|_| {
                        state = state.wrapping_mul(6364136223846793005).wrapping_add(1);
                        ((state >> 33) as f32 / (u32::MAX as f32)) * 2.0 - 1.0
                    })
                    .collect()
            })
            .collect()
    }

    fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
        let dot: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
        let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
        let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
        if norm_a == 0.0 || norm_b == 0.0 {
            return 0.0;
        }
        dot / (norm_a * norm_b)
    }

    fn brute_force_topk(data: &[Vec<f32>], query: &[f32], k: usize) -> Vec<usize> {
        let mut sims: Vec<(usize, f32)> = data
            .iter()
            .enumerate()
            .map(|(i, v)| (i, cosine_similarity(v, query)))
            .collect();
        sims.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
        sims.iter().take(k).map(|(i, _)| *i).collect()
    }

    #[test]
    fn bench_hnsw_10k() {
        let num_vectors = 10_000;
        let dimensions = 128;
        let num_queries = 100; // fewer for speed in test
        let k = 10;

        eprintln!("\n=== ruvector-core HNSW Benchmark: {}K vectors, {}d ===", num_vectors / 1000, dimensions);

        let data = generate_vectors(num_vectors, dimensions, 42);
        let queries = generate_vectors(num_queries, dimensions, 123);

        // Build index
        let opts = DbOptions {
            dimensions,
            distance_metric: DistanceMetric::Cosine,
            hnsw: HnswConfig {
                m: 32,
                ef_construction: 200,
                ..Default::default()
            },
            ..Default::default()
        };

        let mut db = VectorDB::new(opts).expect("Failed to create VectorDB");

        let build_start = Instant::now();
        for (i, vec) in data.iter().enumerate() {
            let entry = VectorEntry {
                id: format!("v{}", i),
                vector: vec.clone(),
                metadata: None,
            };
            db.insert(entry).expect("Insert failed");
        }
        let build_time = build_start.elapsed();

        eprintln!(" Build time: {:.3}s ({} vectors)", build_time.as_secs_f64(), num_vectors);

        // Query
        let mut latencies = Vec::new();
        let mut recall_at_k = Vec::new();

        for query in &queries {
            let gt = brute_force_topk(&data, query, k);
            let gt_set: std::collections::HashSet<String> =
                gt.iter().map(|i| format!("v{}", i)).collect();

            let search = SearchQuery {
                vector: query.clone(),
                k,
                ..Default::default()
            };

            let t0 = Instant::now();
            let results = db.search(search).expect("Search failed");
            let latency = t0.elapsed();

            latencies.push(latency.as_secs_f64() * 1000.0); // ms

            let retrieved: std::collections::HashSet<String> =
                results.iter().map(|r| r.id.clone()).collect();
            let recall = retrieved.intersection(&gt_set).count() as f64 / k as f64;
            recall_at_k.push(recall);
        }

        latencies.sort_by(|a, b| a.partial_cmp(b).unwrap());
        let p50 = latencies[latencies.len() / 2];
        let p95 = latencies[(latencies.len() as f64 * 0.95) as usize];
        let qps = num_queries as f64 / (latencies.iter().sum::<f64>() / 1000.0);
        let avg_recall = recall_at_k.iter().sum::<f64>() / recall_at_k.len() as f64;

        eprintln!(" QPS: {:.1}", qps);
        eprintln!(" Recall@{}: {:.4}", k, avg_recall);
        eprintln!(" Latency p50: {:.3}ms, p95: {:.3}ms", p50, p95);

        // Basic assertions
        assert!(avg_recall > 0.5, "Recall@{} should be > 0.5, got {}", k, avg_recall);
        assert!(qps > 10.0, "QPS should be > 10, got {}", qps);
    }
}