A Vector Search benchmark comparing USearch HNSW across precisions (`f32`, `f16`, `bf16`, `i8`) against Lucene in-memory HNSW (`f32`, `i8`) baselines, leveraging Apache Spark for distributed indexing and search.
gradle build --warning-mode all # ensure all dependencies are resolved
gradle run --args="-h" # print supported arguments & available options
gradle run --args="unum-wiki-1m" # index & search `f32` vectors across all cores
gradle run --args="yandex-deep-10m --max-vectors 100000" # limit vectors for testing
gradle run --args="msft-spacev-100m --precision f32,i8" # test specific precisionsWhen running on larger machines, consider overriding JVM settings based on your hardware. Auto-detect optimal settings for your machine:
# For 750+ GB machines with custom heap size
JAVA_OPTS="-Xmx512g -Xms64g -XX:ParallelGCThreads=$(nproc)" gradle run --args="msft-spacev-100m"
# For development with limited resources
JAVA_OPTS="-Xmx8g -Xms2g" gradle run --args="unum-wiki-1m --max-vectors 100000"For small test runs comparing the impact of multi-threading you may run:
JAVA_OPTS="-Xms2g -Xmx8g" gradle run --args="unum-wiki-1m --max-vectors 10000 --queries 10000 --batch-size 100 --threads 1"
JAVA_OPTS="-Xms2g -Xmx8g" gradle run --args="unum-wiki-1m --max-vectors 10000 --queries 10000 --batch-size 100 --threads 8"To test Spark distributed execution locally, you may run:
JAVA_OPTS="-Xmx8g -Xms2g" gradle run --args="unum-wiki-1m --shards 2 --max-vectors 100000"
JAVA_OPTS="-Xmx8g -Xms2g" gradle run --args="unum-wiki-1m --shards 32 --batch-size 32768 --engines lucene"
JAVA_OPTS="-Xmx512g -Xms64g -XX:ParallelGCThreads=$(nproc)" gradle run --args="msft-spacev-100m --shards 32 --batch-size 32768 --engines lucene"Benchmarks USearch (f32, f16, bf16, i8) against Lucene (f32) on Wiki dataset locally, producing clean output like, the following results obtained for the 100M msft-spacev-100m subset of Microsoft SpaceV:
PERFORMANCE METRICS

┌───────────┬───────────┬─────────┬─────────┬─────────┐
│ Engine    │ Precision │ IPS     │ QPS     │ Memory  │
├───────────┼───────────┼─────────┼─────────┼─────────┤
│ Lucene    │ F32       │  20,665 │     864 │ 49.0 GB │
│ Lucene    │ I8        │  26,408 │   1,218 │ 20.8 GB │
│ USearch   │ F32       │  96,119 │ 126,582 │ 96.0 GB │
│ USearch   │ BF16      │ 113,090 │ 129,870 │ 64.0 GB │
│ USearch   │ F16       │ 124,297 │ 144,928 │ 64.0 GB │
│ USearch   │ I8        │ 137,329 │ 166,667 │ 48.0 GB │
└───────────┴───────────┴─────────┴─────────┴─────────┘

RECALL & NDCG METRICS

┌─────────┬───────────┬───────────┬─────────┬────────────┬──────────┐
│ Engine  │ Precision │ Recall@10 │ NDCG@10 │ Recall@100 │ NDCG@100 │
├─────────┼───────────┼───────────┼─────────┼────────────┼──────────┤
│ Lucene  │ F32       │ 90.00%    │ 88.17%  │ 94.70%     │ 93.34%   │
│ Lucene  │ I8        │ 90.00%    │ 87.75%  │ 94.97%     │ 93.30%   │
│ USearch │ F32       │ 90.03%    │ 88.63%  │ 95.76%     │ 95.21%   │
│ USearch │ BF16      │ 90.12%    │ 88.46%  │ 95.62%     │ 95.21%   │
│ USearch │ F16       │ 90.27%    │ 88.59%  │ 95.78%     │ 95.31%   │
│ USearch │ I8        │ 90.34%    │ 88.66%  │ 95.81%     │ 95.31%   │
└─────────┴───────────┴───────────┴─────────┴────────────┴──────────┘
Hardware: AWS `m7i.metal-48xl` instances (96 physical cores, 192 threads, 2 sockets). OS: Linux 6.8.0-1024-aws (Ubuntu 22.04.5 LTS). CPU: Intel Xeon Platinum 8488C @ 2.4GHz. Memory: 768 GB RAM total. Java: OpenJDK 21.0.5 with Java Vector API (`--add-modules=jdk.incubator.vector`). JVM: 128GB heap (`-Xmx128g`) with ZGC garbage collector (`-XX:+UseZGC`) for sub-10ms pauses. Library versions: USearch v2.20.8, Lucene v9.12.0.
IPS stands for Insertions Per Second, and QPS for Queries Per Second. Recall@K is computed as the fraction of search queries for which the known "ground-truth" Top-1 result appears among the Top-K approximate results. NDCG@K stands for Normalized Discounted Cumulative Gain at K, which measures the effectiveness of the search results by taking the positions of the relevant documents into account.
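To make these definitions concrete, here is a minimal Java sketch of per-query Recall@K and NDCG@K, assuming integer vector ids and binary relevance against the ground-truth Top-K; the class and method names are illustrative and not part of this benchmark's code.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative metric helpers; not necessarily how this benchmark computes them. */
final class EvalMetrics {

    /** Recall@K as defined above: 1 if the ground-truth Top-1 id appears among the approximate Top-K. */
    static double recallAtK(int groundTruthTop1, int[] approximateIds, int k) {
        for (int rank = 0; rank < Math.min(k, approximateIds.length); rank++)
            if (approximateIds[rank] == groundTruthTop1)
                return 1.0;
        return 0.0;
    }

    /** NDCG@K with binary relevance: a hit is any id present in the ground-truth Top-K. */
    static double ndcgAtK(int[] groundTruthIds, int[] approximateIds, int k) {
        Set<Integer> relevant = new HashSet<>();
        for (int i = 0; i < Math.min(k, groundTruthIds.length); i++)
            relevant.add(groundTruthIds[i]);

        double dcg = 0.0, idealDcg = 0.0;
        for (int rank = 0; rank < k; rank++) {
            double discount = 1.0 / (Math.log(rank + 2) / Math.log(2)); // 1 / log2(rank + 2)
            if (rank < approximateIds.length && relevant.contains(approximateIds[rank]))
                dcg += discount;
            if (rank < relevant.size()) // ideal ranking puts all relevant ids first
                idealDcg += discount;
        }
        return idealDcg == 0.0 ? 0.0 : dcg / idealDcg;
    }
}
```

Averaging these per-query values over all queries yields percentages like those in the tables above.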
The BigANN benchmark is a good starting point if you are searching large collections of high-dimensional vectors. Still, it rarely considers datasets with more than tens of millions of entries. In the era of 100+ core CPUs and petabyte-scale storage in 2U servers, larger datasets are needed.
| Dataset | Codename | DType | NDim | Metric | Size | |
|---|---|---|---|---|---|---|
| Unum UForm Creative Captions | `unum-cc-3m` | `f32` | 256 | IP | 3 GB | 100 |
| Unum UForm Wiki | `unum-wiki-1m` | `f32` | 256 | IP | 1 GB | 10 |
| Yandex Text-to-Image subset | `yandex-t2i-1m` | `f32` | 200 | Cos | 1 GB | 100 |
| Yandex Deep10M subset | `yandex-deep-10m` | `f32` | 96 | L2 | 4 GB | 100 |
| Microsoft SpaceV-100M subset | `msft-spacev-100m` | `i8` | 100 | L2 | 9 GB | 100 |
| Microsoft SpaceV-1B | `msft-spacev-1b` | `i8` | 100 | L2 | 131 GB | 100 |
| Microsoft Turing-ANNS | `msft-turing-1b` | `f32` | 100 | L2 | 373 GB | 100 |
| Yandex Deep1B | `yandex-deep-1b` | `f32` | 96 | L2 | 358 GB | 100 |
| Yandex Text-to-Image | `yandex-t2i-1b` | `f32` | 200 | Cos | 750 GB | 100 |
| ViT-L/12 LAION | `laion-5b` | `f32` | 2048 | Cos | 10 TB | - |
These often come with `f32` or `i8` scalars for broad compatibility, as opposed to the less common `f16` and `bf16`.
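The `.fbin`/`.ibin` files follow the common BigANN binary layout: a 4-byte little-endian vector count, a 4-byte dimension count, and then the vectors in row-major order (`f32` for `.fbin`, integers for the `.ibin` variants). Below is a minimal Java sketch of reading a small `.fbin` file under that assumption; the class name and eager in-memory loading are illustrative only and not suitable for the billion-scale files.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative `.fbin` reader, assuming the BigANN layout: int32 count, int32 dims, then f32 data. */
final class FbinReader {

    static float[][] readAll(Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            // Header: number of vectors and dimensionality, both little-endian int32.
            ByteBuffer header = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
            channel.read(header);
            header.flip();
            int count = header.getInt();
            int dims = header.getInt();

            // Body: count * dims f32 values, one row per vector.
            float[][] vectors = new float[count][dims];
            ByteBuffer row = ByteBuffer.allocate(dims * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
            for (int i = 0; i < count; i++) {
                row.clear();
                while (row.hasRemaining() && channel.read(row) >= 0) { /* fill one row */ }
                row.flip();
                row.asFloatBuffer().get(vectors[i]);
            }
            return vectors;
        }
    }
}
```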
mkdir -p datasets/unum-wiki-1m/ && \
wget -nc https://huggingface.co/datasets/unum-cloud/ann-wiki-1m/resolve/main/base.1M.fbin -P datasets/unum-wiki-1m/ && \
wget -nc https://huggingface.co/datasets/unum-cloud/ann-wiki-1m/resolve/main/query.public.100K.fbin -P datasets/unum-wiki-1m/ && \
wget -nc https://huggingface.co/datasets/unum-cloud/ann-wiki-1m/resolve/main/groundtruth.public.100K.ibin -P datasets/unum-wiki-1m/

mkdir -p datasets/yandex-t2i-1m/ && \
wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/base.1B.fbin -P datasets/yandex-t2i-1m/ && \
wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/base.1M.fbin -P datasets/yandex-t2i-1m/ && \
wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/query.public.100K.fbin -P datasets/yandex-t2i-1m/ && \
wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/groundtruth.public.100K.ibin -P datasets/yandex-t2i-1m/

mkdir -p datasets/yandex-deep-1b/ && \
wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/DEEP/base.1B.fbin -P datasets/yandex-deep-1b/ && \
wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/DEEP/base.10M.fbin -P datasets/yandex-deep-1b/ && \
wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/DEEP/query.public.10K.fbin -P datasets/yandex-deep-1b/ && \
wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/DEEP/groundtruth.public.10K.ibin -P datasets/yandex-deep-1b/

The original dataset can be pulled in a USearch-compatible form from AWS S3:
mkdir -p datasets/msft-spacev-1b/ && \
aws s3 cp s3://bigger-ann/spacev-1b/ datasets/msft-spacev-1b/ --recursive

A smaller 100M dataset can be pulled from Hugging Face:
mkdir -p datasets/msft-spacev-100m/ && \
wget -nc https://huggingface.co/datasets/unum-cloud/ann-spacev-100m/resolve/main/ids.100m.i32bin -P datasets/msft-spacev-100m/ && \
wget -nc https://huggingface.co/datasets/unum-cloud/ann-spacev-100m/resolve/main/base.100m.i8bin -P datasets/msft-spacev-100m/ && \
wget -nc https://huggingface.co/datasets/unum-cloud/ann-spacev-100m/resolve/main/query.30K.i8bin -P datasets/msft-spacev-100m/ && \
wget -nc https://huggingface.co/datasets/unum-cloud/ann-spacev-100m/resolve/main/groundtruth.30K.i32bin -P datasets/msft-spacev-100m/ && \
wget -nc https://huggingface.co/datasets/unum-cloud/ann-spacev-100m/resolve/main/groundtruth.30K.f32bin -P datasets/msft-spacev-100m/

For true distributed execution, create a local Kubernetes cluster using KIND (Kubernetes in Docker):
# Install KIND (if not already installed)
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.25.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
# Create multi-node cluster
kind create cluster --config=k8s/kind-cluster.yaml --name=spark-benchmark
# Install Spark Operator
kubectl create namespace spark-operator
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm install spark-operator spark-operator/spark-operator --namespace spark-operator
# Build and load your application image
docker build -t usearch-spark:latest .
kind load docker-image usearch-spark:latest --name=spark-benchmark
# Submit Spark job
kubectl apply -f k8s/spark-job.yaml
kubectl logs -f spark-benchmark-driver

This creates a 3-node Kubernetes cluster locally and runs the benchmark in a truly distributed manner across the nodes.
- Code style: 120 chars/line using Eclipse JDT via Spotless. Config lives in `java-format.xml`.
- Formatting: run `gradle spotlessApply` before pushing changes.

gradle spotlessApply

Many attempts have been made to tune Lucene and squeeze better numbers out of it. In every case, an improvement along one metric (IPS, QPS, Recall) has come at the cost of another. That's common for search engines and suggests that we are approaching the limitations of the engine.
Here are some of the ideas considered and sometimes accepted (a sketch of a few of these knobs follows the list):

- Using `StoredField` instead of `NumericDocValuesField` for IDs.
- Tuning the size and number of index segments to allow query-level parallelism.
- Fixed 4 GB segment size instead of dynamic calculation based on CPU cores.
- Force-merging segments when their count exceeds the optimal threshold.
- Eliminated thread explosion where 191 threads were used per individual query.
- Batch query processing using the configurable `--batch-size` parameter.
- `TieredMergePolicy` resulted in extremely slow single-threaded reductions by the end.
- Replaced `ByteBuffersDirectory` with `DynamicByteBuffersDirectory` to handle >4GB indexes.
- Aggressive concurrent merge scheduling with progress tracking.
- Custom HNSW codec selection with fallback to the best available implementation.
- Using `forceMerge` before search queries to compact & optimize the index.
- Committing the `IndexWriter` before search queries to trigger merges.
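As a rough illustration of how a few of these knobs map onto stock Lucene 9.x APIs, here is a hedged sketch using only standard classes (`ByteBuffersDirectory`, `TieredMergePolicy`, `ConcurrentMergeScheduler`); this project's custom `DynamicByteBuffersDirectory`, codec selection, and field names are not reproduced, and all sizes are placeholders.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.ByteBuffersDirectory;

// Illustrative sketch only: stock Lucene 9.x classes, placeholder sizes and field names.
public class LuceneTuningSketch {
    public static void main(String[] args) throws Exception {
        ByteBuffersDirectory directory = new ByteBuffersDirectory(); // in-memory; >4GB needs a custom variant

        TieredMergePolicy mergePolicy = new TieredMergePolicy();
        mergePolicy.setMaxMergedSegmentMB(4 * 1024); // fixed ~4 GB segments instead of CPU-derived sizing

        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer())
                .setMergePolicy(mergePolicy)
                .setMergeScheduler(new ConcurrentMergeScheduler()) // concurrent background merges
                .setRAMBufferSizeMB(1024);

        try (IndexWriter writer = new IndexWriter(directory, config)) {
            float[] vector = new float[256]; // placeholder embedding
            Document doc = new Document();
            doc.add(new StoredField("id", 42L)); // StoredField instead of NumericDocValuesField for IDs
            doc.add(new KnnFloatVectorField("embedding", vector, VectorSimilarityFunction.EUCLIDEAN));
            writer.addDocument(doc);

            writer.commit();      // commit before searching to trigger merges
            writer.forceMerge(1); // compact the index into a single segment before queries
        }

        try (DirectoryReader reader = DirectoryReader.open(directory)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // run KnnFloatVectorQuery-based searches against `searcher` here
        }
    }
}
```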