# SQLite-Vec C++ Benchmark Results

**Version**: 0.1.0
**Date**: 2025-11-02
**Platform**: x86_64, 48 cores @ 4.0GHz, 32KB L1, 512KB L2, 16MB L3
**Compiler**: GCC 15.2.0, C++23, Release mode (`-O3`)
**Library**: Google Benchmark 1.9.1

---

## Executive Summary

The C++ implementation achieves **3.6M vectors/second sustained throughput** with linear scaling across corpus sizes and embedding dimensions. int8 quantization provides 4x storage reduction at performance parity. HNSW index recommended for >100K vector corpora.

---

## RAG Pipeline Benchmark

### 1. Corpus Size Scaling (384d, K=5)

| Corpus | Latency | Throughput | QPS (single-thread) |
|--------|---------|------------|---------------------|
| 1K     | 273 μs  | 3.67 M/s   | ~3,660 queries/sec  |
| 10K    | 2.78 ms | 3.60 M/s   | ~360 queries/sec    |
| 100K   | 27.9 ms | 3.58 M/s   | ~36 queries/sec     |

**Scaling**: Linear (10x corpus → 10x latency)
**Bottleneck**: Compute-bound (memory bandwidth utilization ~5%)
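
The linear behavior follows directly from the shape of a brute-force scan: one distance computation per corpus vector, so total work is O(N·d). A minimal sketch of such a scan (function names are hypothetical, not the library's API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One dot product per corpus vector: total work is O(N * d), which is
// exactly the linear scaling the table shows (10x corpus -> 10x latency).
float dot(const float* a, const float* b, std::size_t d) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < d; ++i) sum += a[i] * b[i];
    return sum;
}

// Scores every document in a row-major (n x d) corpus against one query.
std::vector<float> scan_corpus(const std::vector<float>& corpus,
                               const std::vector<float>& query,
                               std::size_t d) {
    const std::size_t n = corpus.size() / d;
    std::vector<float> scores(n);
    for (std::size_t i = 0; i < n; ++i)
        scores[i] = dot(corpus.data() + i * d, query.data(), d);
    return scores;
}
```

The real implementation vectorizes the inner loop with SIMD, but the asymptotic shape is the same.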

### 2. K-Value Scaling (10K docs, 384d)

| K  | Latency | Delta    |
|----|---------|----------|
| 1  | 2.77 ms | -0.4%    |
| 5  | 2.78 ms | baseline |
| 10 | 2.78 ms | 0.0%     |
| 50 | 2.77 ms | -0.4%    |

**Conclusion**: Partial-sort overhead is negligible; K has no meaningful impact (the ±0.4% deltas are within measurement noise).
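
This flatness is expected: with `std::partial_sort`, selecting K results costs roughly O(N log K) on top of the O(N·d) distance pass, so K barely matters while K << N. A sketch of the selection step (illustrative, not the library's code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Select the K smallest distances along with their document indices.
// Only the first K elements end up ordered, so cost is ~O(N log K) --
// negligible next to computing the N distances themselves when K << N.
std::vector<std::pair<float, std::size_t>>
top_k(const std::vector<float>& distances, std::size_t k) {
    std::vector<std::pair<float, std::size_t>> ranked(distances.size());
    for (std::size_t i = 0; i < ranked.size(); ++i)
        ranked[i] = {distances[i], i};
    k = std::min(k, ranked.size());
    std::partial_sort(ranked.begin(), ranked.begin() + k, ranked.end());
    ranked.resize(k);
    return ranked;
}
```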

### 3. Embedding Dimension Scaling (10K docs, K=5)

| Dimensions | Latency | Throughput | Scaling Factor |
|------------|---------|------------|----------------|
| 384d       | 2.78 ms | 3.60 M/s   | 1.0x           |
| 768d       | 5.74 ms | 1.74 M/s   | 2.06x          |
| 1536d      | 11.7 ms | 0.86 M/s   | 4.21x          |

**Scaling**: Near-linear (2x dim → 2.06x latency, 4x dim → 4.21x latency)
**Conclusion**: Compute-bound; SIMD efficiency remains high across dimensions.

### 4. Quantization (10K docs, 384d, K=5)

| Type  | Latency | Throughput | Storage     | Overhead  |
|-------|---------|------------|-------------|-----------|
| float | 2.78 ms | 3.60 M/s   | 4 bytes/dim | baseline  |
| int8  | 2.74 ms | 3.65 M/s   | 1 byte/dim  | **-1.4%** |

**Conclusion**: int8 quantization is slightly **faster** while reducing storage 4x, likely due to memory-bandwidth savings.
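
A plausible mechanism: the hot loop reads a quarter of the bytes and runs in integer arithmetic, so the bandwidth saved outweighs the final rescaling multiply. A minimal sketch of symmetric int8 quantization (the scaling scheme here is illustrative, not necessarily sqlite-vec's):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Symmetric quantization: map [-max_abs, +max_abs] onto [-127, 127].
// Storage drops 4x (1 byte vs 4) and the hot loop becomes integer math.
struct Quantized {
    std::vector<std::int8_t> data;
    float scale;  // multiply back by scale to approximate original values
};

Quantized quantize(const std::vector<float>& v) {
    float max_abs = 0.0f;
    for (float x : v) max_abs = std::max(max_abs, std::fabs(x));
    const float scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
    Quantized q{std::vector<std::int8_t>(v.size()), scale};
    for (std::size_t i = 0; i < v.size(); ++i)
        q.data[i] = static_cast<std::int8_t>(std::lround(v[i] / scale));
    return q;
}

// Integer dot product; one float multiply at the end restores the scale.
float dot_i8(const Quantized& a, const Quantized& b) {
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < a.data.size(); ++i)
        acc += static_cast<std::int32_t>(a.data[i]) * b.data[i];
    return acc * a.scale * b.scale;
}
```

The result is an approximation of the float dot product, which is why ranking quality (not just speed) should be validated when enabling int8.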

### 5. Multi-Query Throughput (10K docs, 384d)

- **10 queries**: 27.5 ms total (2.75 ms/query average)
- **Sustained throughput**: 3.64 M vectors/second
- **QPS**: ~364 queries/second (single-threaded)
- **Parallelization potential**: 364 QPS × 48 cores ≈ 17.4K QPS theoretical
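
Queries are independent, so the theoretical figure is simply single-thread QPS times core count; whether it holds in practice depends on shared memory bandwidth. A sketch of fanning queries out with `std::async` (hypothetical, not the library's API):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <limits>
#include <vector>

// Each query runs its own full scan on a separate task; one task per core
// is the ideal-scaling scenario behind the ~17.4K QPS estimate.
float dot_f32(const float* a, const float* b, std::size_t d) {
    float s = 0.0f;
    for (std::size_t i = 0; i < d; ++i) s += a[i] * b[i];
    return s;
}

std::vector<float> best_score_per_query(
    const std::vector<float>& corpus, std::size_t d,
    const std::vector<std::vector<float>>& queries) {
    std::vector<std::future<float>> tasks;
    for (const auto& q : queries)
        tasks.push_back(std::async(std::launch::async, [&corpus, d, &q] {
            float best = std::numeric_limits<float>::lowest();
            for (std::size_t i = 0; i < corpus.size() / d; ++i)
                best = std::max(best, dot_f32(corpus.data() + i * d,
                                              q.data(), d));
            return best;
        }));
    std::vector<float> out;
    for (auto& t : tasks) out.push_back(t.get());
    return out;
}
```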

### 6. Sequential vs Batch (1K docs, 384d, K=5)

| Method     | Latency | Throughput |
|------------|---------|------------|
| Sequential | 274 μs  | 273 μs → see Batch |
| Batch      | 273 μs  | 3.67 M/s   |

**Conclusion**: The batch API provides cleaner code at performance parity; both paths do the same per-vector work.

---

## Batch Distance Benchmark

### 1. Sequential vs Batch Comparison

| Scenario | Sequential | Batch   | Speedup |
|----------|------------|---------|---------|
| 100×384d | 26.7 μs    | 26.7 μs | 1.00x   |
| 1K×384d  | 268 μs     | 269 μs  | 1.00x   |

**Conclusion**: Performance parity; both paths are limited by the same per-vector compute.

### 2. Memory Layout Optimization

| Layout     | Latency | Throughput | Improvement |
|------------|---------|------------|-------------|
| Scattered  | 269 μs  | 3.73 M/s   | baseline    |
| Contiguous | 267 μs  | 3.75 M/s   | +0.5%       |

**Conclusion**: Marginal improvement; modern CPUs prefetch efficiently.
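
The two layouts compared above can be sketched as follows: "scattered" means one heap allocation per row, "contiguous" means a single flat row-major buffer (type names are illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two ways to hold an n x d corpus. The vector-of-vectors ("scattered")
// form allocates each row separately; the flat ("contiguous") form keeps
// all rows in one allocation, which is friendlier to hardware prefetch.
using Scattered  = std::vector<std::vector<float>>;  // n allocations
using Contiguous = std::vector<float>;               // one allocation, n*d

// Flatten a scattered corpus into contiguous row-major storage.
Contiguous flatten(const Scattered& rows) {
    if (rows.empty()) return {};
    const std::size_t d = rows[0].size();
    Contiguous flat;
    flat.reserve(rows.size() * d);
    for (const auto& r : rows)
        flat.insert(flat.end(), r.begin(), r.end());
    return flat;
}
```

The +0.5% gain above suggests the prefetcher already hides most of the scattered layout's pointer-chasing cost at these sizes.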

### 3. Top-K Performance (1K×384d, K=10)

- **Latency**: 268 μs (vs 268 μs for the full distance computation alone)
- **Overhead**: <1% for the partial sort
- **Conclusion**: `std::partial_sort` is highly optimized; K << N has negligible cost.

### 4. Large Embeddings (1K×1536d)

- **Latency**: 1.13 ms
- **Throughput**: 886k vectors/second
- **Scaling**: 4.21x slower than 384d (expected 4.0x)

---

## HNSW Decision Matrix

| Corpus Size | Brute-Force Latency | Recommendation |
|-------------|---------------------|----------------|
| <10K        | <3 ms               | ✅ Brute-force optimal |
| 10K-100K    | 3-30 ms             | ⚠️ Brute-force acceptable for batch workloads |
| >100K       | >30 ms              | ❌ HNSW required for real-time (<10 ms) |

**HNSW Threshold**: ~100K vectors (brute-force takes 27.9 ms there, well past a <10 ms real-time target, so an ANN index is required)
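
At 3.6M vectors/s, a 10 ms budget covers roughly 36K vectors, which is where the brute-force/ANN trade-off flips for real-time use. A toy helper encoding the matrix (thresholds derived from the numbers above; not library code):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// At ~3.6M vectors/s, a 10 ms real-time budget covers ~36K vectors;
// batch workloads tolerate brute force much further (~30 ms at 100K).
std::string pick_index(std::size_t corpus_size, bool real_time) {
    const std::size_t realtime_limit = 36'000;   // ~10 ms of brute force
    const std::size_t batch_limit    = 100'000;  // ~30 ms, still acceptable
    const std::size_t limit = real_time ? realtime_limit : batch_limit;
    return corpus_size <= limit ? "brute-force" : "hnsw";
}
```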

---

## Platform-Specific Results

### SIMD Utilization

- **AVX**: Active (conditional compilation, `-mavx` detected)
- **NEON**: Not tested (x86_64 platform)
- **Scalar fallback**: Available for non-aligned/small vectors

### Cache Efficiency

- **L1 hit rate**: >95% (estimated from throughput consistency)
- **Memory bandwidth**: ~11 GB/s per query (vs ~200 GB/s of L1 bandwidth)
- **Conclusion**: Compute-bound, not memory-bound

---

## Comparison to Targets

| Metric               | Target   | Actual      | Status            |
|----------------------|----------|-------------|-------------------|
| 1K corpus (<1ms)     | 1000 μs  | 273 μs      | ✅ **3.6x better** |
| 10K corpus (<5ms)    | 5000 μs  | 2780 μs     | ✅ **1.8x better** |
| 100K corpus (<50ms)  | 50000 μs | 27900 μs    | ✅ **1.8x better** |
| int8 overhead (<20%) | 20%      | -1.4%       | ✅ **Faster**      |
| Dimension scaling    | Linear   | Near-linear | ✅ **Met**         |

---

## Reproduction

```bash
# Build benchmarks
cd third_party/sqlite-vec-cpp
meson setup build_bench -Denable_benchmarks=true -Dbuildtype=release
ninja -C build_bench

# Run RAG pipeline benchmark
./build_bench/benchmarks/rag_pipeline_benchmark --benchmark_min_time=0.5s

# Run batch distance benchmark
./build_bench/benchmarks/batch_distance_benchmark --benchmark_min_time=0.5s

# JSON output for analysis
./build_bench/benchmarks/rag_pipeline_benchmark \
  --benchmark_out=results.json \
  --benchmark_out_format=json
```

---