USearchBench.java

A Vector Search benchmark comparing USearch HNSW across precisions (f32, f16, bf16, i8) against Lucene's in-memory HNSW baselines (f32, i8), leveraging Apache Spark for distributed indexing and search.

Quick Start

gradle build --warning-mode all # ensure all dependencies are resolved
gradle run --args="-h" # print supported arguments & available options
gradle run --args="unum-wiki-1m" # index & search `f32` vectors across all cores
gradle run --args="yandex-deep-10m --max-vectors 100000" # limit vectors for testing
gradle run --args="msft-spacev-100m --precision f32,i8" # test specific precisions

When running on larger machines, consider overriding the JVM settings to match your hardware, e.g. sizing the heap explicitly and letting $(nproc) pick the number of GC threads:

# For 750+ GB machines with custom heap size
JAVA_OPTS="-Xmx512g -Xms64g -XX:ParallelGCThreads=$(nproc)" gradle run --args="msft-spacev-100m"

# For development with limited resources
JAVA_OPTS="-Xmx8g -Xms2g" gradle run --args="unum-wiki-1m --max-vectors 100000"

For small test runs comparing the impact of multi-threading, you may run:

JAVA_OPTS="-Xms2g -Xmx8g" gradle run --args="unum-wiki-1m --max-vectors 10000 --queries 10000 --batch-size 100 --threads 1"
JAVA_OPTS="-Xms2g -Xmx8g" gradle run --args="unum-wiki-1m --max-vectors 10000 --queries 10000 --batch-size 100 --threads 8"

To test Spark distributed execution locally, you may run:

JAVA_OPTS="-Xmx8g -Xms2g" gradle run --args="unum-wiki-1m --shards 2 --max-vectors 100000"
JAVA_OPTS="-Xmx8g -Xms2g" gradle run --args="unum-wiki-1m --shards 32 --batch-size 32768 --engines lucene"
JAVA_OPTS="-Xmx512g -Xms64g -XX:ParallelGCThreads=$(nproc)" gradle run --args="msft-spacev-100m --shards 32 --batch-size 32768 --engines lucene"

This benchmarks USearch (f32, f16, bf16, i8) against Lucene baselines on the Wiki dataset locally, producing clean output. The following results were obtained for the 100M msft-spacev-100m subset of Microsoft SpaceV:

πŸš€ PERFORMANCE METRICS
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Engine       β”‚ Precision    β”‚ IPS          β”‚ QPS          β”‚ Memory      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Lucene       β”‚ F32          β”‚ 20,665       β”‚ 864          β”‚ 49.0 GB     β”‚
β”‚ Lucene       β”‚ I8           β”‚ 26,408       β”‚ 1,218        β”‚ 20.8 GB     β”‚
β”‚ USearch      β”‚ F32          β”‚ 96,119       β”‚ 126,582      β”‚ 96.0 GB     β”‚
β”‚ USearch      β”‚ BF16         β”‚ 113,090      β”‚ 129,870      β”‚ 64.0 GB     β”‚
β”‚ USearch      β”‚ F16          β”‚ 124,297      β”‚ 144,928      β”‚ 64.0 GB     β”‚
β”‚ USearch      β”‚ I8           β”‚ 137,329      β”‚ 166,667      β”‚ 48.0 GB     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
🎯 RECALL & NDCG METRICS
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Engine      β”‚ Precision   β”‚ Recall@10   β”‚ NDCG@10     β”‚ Recall@100  β”‚ NDCG@100    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Lucene      β”‚ F32         β”‚ 90.00%      β”‚ 88.17%      β”‚ 94.70%      β”‚ 93.34%      β”‚
β”‚ Lucene      β”‚ I8          β”‚ 90.00%      β”‚ 87.75%      β”‚ 94.97%      β”‚ 93.30%      β”‚
β”‚ USearch     β”‚ F32         β”‚ 90.03%      β”‚ 88.63%      β”‚ 95.76%      β”‚ 95.21%      β”‚
β”‚ USearch     β”‚ BF16        β”‚ 90.12%      β”‚ 88.46%      β”‚ 95.62%      β”‚ 95.21%      β”‚
β”‚ USearch     β”‚ F16         β”‚ 90.27%      β”‚ 88.59%      β”‚ 95.78%      β”‚ 95.31%      β”‚
β”‚ USearch     β”‚ I8          β”‚ 90.34%      β”‚ 88.66%      β”‚ 95.81%      β”‚ 95.31%      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  β€’ Hardware: AWS m7i.metal-48xl instances (92 physical cores, 192 threads, 2 sockets).
  β€’ OS: Linux 6.8.0-1024-aws (Ubuntu 22.04.5 LTS).
  β€’ CPU: Intel Xeon Platinum 8488C @ 2.4GHz.
  β€’ Memory: 768 GB RAM total.
  β€’ Java: OpenJDK 21.0.5 with the Java Vector API (--add-modules=jdk.incubator.vector).
  β€’ JVM: 128GB heap (-Xmx128g) with the ZGC garbage collector (-XX:+UseZGC) for sub-10ms pauses.
  β€’ Library versions: USearch v2.20.8, Lucene v9.12.0.

IPS stands for Insertions Per Second, and QPS for Queries Per Second. Recall@K is the fraction of search queries for which the known "ground-truth" Top-1 result appears among the Top-K approximate results. NDCG@K stands for Normalized Discounted Cumulative Gain at K, which measures ranking quality by rewarding relevant results that appear closer to the top of the result list.
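
For reference, here is a minimal sketch of how these two metrics can be computed from ground-truth neighbor lists. The class and method names (Metrics, recallAtK, ndcgAtK) and array shapes are illustrative, not the benchmark's actual internals:

import java.util.Arrays;

final class Metrics {

    // Fraction of queries whose ground-truth Top-1 neighbor appears in the approximate Top-K.
    static double recallAtK(int[] truthTop1, int[][] approxTopK, int k) {
        int hits = 0;
        for (int q = 0; q < truthTop1.length; q++) {
            for (int i = 0; i < Math.min(k, approxTopK[q].length); i++) {
                if (approxTopK[q][i] == truthTop1[q]) {
                    hits++;
                    break;
                }
            }
        }
        return (double) hits / truthTop1.length;
    }

    // NDCG@K with binary relevance: a result gains 1 if it is among the true Top-K neighbors,
    // discounted by log2(rank + 1) and normalized by the ideal DCG.
    static double ndcgAtK(int[][] truthTopK, int[][] approxTopK, int k) {
        double sum = 0;
        for (int q = 0; q < truthTopK.length; q++) {
            double dcg = 0, idealDcg = 0;
            for (int i = 0; i < k; i++) {
                double discount = Math.log(i + 2) / Math.log(2);
                idealDcg += 1.0 / discount;
                if (i < approxTopK[q].length) {
                    final int candidate = approxTopK[q][i];
                    boolean relevant = Arrays.stream(truthTopK[q], 0, Math.min(k, truthTopK[q].length))
                            .anyMatch(id -> id == candidate);
                    if (relevant) dcg += 1.0 / discount;
                }
            }
            sum += dcg / idealDcg;
        }
        return sum / truthTopK.length;
    }
}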

Datasets

The BigANN benchmark is a good starting point if you are looking for large collections of high-dimensional vectors. Still, it rarely considers datasets with more than tens of millions of entries. In the era of 100+ core CPUs and Petabyte-scale storage in 2U servers, larger datasets are required.

| Dataset | Codename | DType | NDim | Metric | Size | $N$ |
| --- | --- | --- | --- | --- | --- | --- |
| Unum UForm Creative Captions | unum-cc-3m | f32 | 256 | IP | 3 GB | 100 |
| Unum UForm Wiki | unum-wiki-1m | f32 | 256 | IP | 1 GB | 10 |
| Yandex Text-to-Image subset | yandex-t2i-1m | f32 | 200 | Cos | 1 GB | 100 |
| Yandex Deep10M subset | yandex-deep-10m | f32 | 96 | L2 | 4 GB | 100 |
| Microsoft SpaceV-100M subset | msft-spacev-100m | i8 | 100 | L2 | 9 GB | 100 |
| Microsoft SpaceV-1B | msft-spacev-1b | i8 | 100 | L2 | 131 GB | 100 |
| Microsoft Turing-ANNS | msft-turing-1b | f32 | 100 | L2 | 373 GB | 100 |
| Yandex Deep1B | yandex-deep-1b | f32 | 96 | L2 | 358 GB | 100 |
| Yandex Text-to-Image | yandex-t2i-1b | f32 | 200 | Cos | 750 GB | 100 |
| ViT-L/12 LAION | laion-5b | f32 | 2048 | Cos | 10 TB | - |

These datasets often come with $N$ precomputed ground-truth neighbors per query, which is handy for recall evaluation. The ground-truth neighbors are computed with respect to some "metric", such as Inner Product (IP), Cosine similarity (Cos), or Euclidean distance (L2). They are generally distributed as binary matrix files packing either f32 or i8 scalars for broad compatibility, as opposed to the less common f16 and bf16.
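
Those matrix files follow the common BigANN binary layout: a 4-byte little-endian row count, a 4-byte little-endian dimensionality, and then the row-major payload. Below is a minimal reader sketch assuming that layout; the FbinReader class and maxRows parameter are illustrative and not part of this repository:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Minimal `.fbin` reader sketch, assuming the BigANN layout described above.
// `.ibin` / `.i8bin` files use the same header, with i32 IDs or i8 scalars as the payload.
final class FbinReader {

    static float[][] readFbin(String path, int maxRows) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            // 8-byte header: row count, then dimensionality, both little-endian int32
            ByteBuffer header = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
            file.getChannel().read(header);
            header.flip();
            int rows = Math.min(header.getInt(), maxRows);
            int dims = header.getInt();

            // Read the first `rows` vectors, one row of `dims` f32 scalars at a time
            float[][] vectors = new float[rows][dims];
            ByteBuffer row = ByteBuffer.allocate(dims * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
            for (int i = 0; i < rows; i++) {
                row.clear();
                file.getChannel().read(row);
                row.flip();
                row.asFloatBuffer().get(vectors[i]);
            }
            return vectors;
        }
    }
}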

Unum UForm Wiki

mkdir -p datasets/unum-wiki-1m/ && \
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-wiki-1m/resolve/main/base.1M.fbin -P datasets/unum-wiki-1m/ &&
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-wiki-1m/resolve/main/query.public.100K.fbin -P datasets/unum-wiki-1m/ &&
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-wiki-1m/resolve/main/groundtruth.public.100K.ibin -P datasets/unum-wiki-1m/

Yandex Text-to-Image

mkdir -p datasets/yandex-t2i-1m/ && \
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/base.1B.fbin -P datasets/yandex-t2i-1m/ &&
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/base.1M.fbin -P datasets/yandex-t2i-1m/ &&
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/query.public.100K.fbin -P datasets/yandex-t2i-1m/ &&
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/groundtruth.public.100K.ibin -P datasets/yandex-t2i-1m/

Yandex Deep1B

mkdir -p datasets/yandex-deep-1b/ && \
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/DEEP/base.1B.fbin -P datasets/yandex-deep-1b/ &&
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/DEEP/base.10M.fbin -P datasets/yandex-deep-1b/ &&
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/DEEP/query.public.10K.fbin -P datasets/yandex-deep-1b/ &&
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/DEEP/groundtruth.public.10K.ibin -P datasets/yandex-deep-1b/

Microsoft SpaceV

The original dataset can be pulled in a USearch-compatible form from AWS S3:

mkdir -p datasets/msft-spacev-1b/ && \
    aws s3 cp s3://bigger-ann/spacev-1b/ datasets/msft-spacev-1b/ --recursive

A smaller 100M dataset can be pulled from Hugging Face:

mkdir -p datasets/msft-spacev-100m/ && \
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-spacev-100m/resolve/main/ids.100m.i32bin -P datasets/msft-spacev-100m/ &&
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-spacev-100m/resolve/main/base.100m.i8bin -P datasets/msft-spacev-100m/ &&
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-spacev-100m/resolve/main/query.30K.i8bin -P datasets/msft-spacev-100m/ &&
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-spacev-100m/resolve/main/groundtruth.30K.i32bin -P datasets/msft-spacev-100m/ &&
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-spacev-100m/resolve/main/groundtruth.30K.f32bin -P datasets/msft-spacev-100m/

TODO: Kubernetes Cluster Setup

For true distributed execution, create a local Kubernetes cluster using KIND (Kubernetes in Docker):

# Install KIND (if not already installed)
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.25.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind

# Create multi-node cluster
kind create cluster --config=k8s/kind-cluster.yaml --name=spark-benchmark

# Install Spark Operator
kubectl create namespace spark-operator
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm install spark-operator spark-operator/spark-operator --namespace spark-operator

# Build and load your application image
docker build -t usearch-spark:latest .
kind load docker-image usearch-spark:latest --name=spark-benchmark

# Submit Spark job
kubectl apply -f k8s/spark-job.yaml
kubectl logs -f spark-benchmark-driver

This creates a 3-node Kubernetes cluster locally and runs the benchmark in a truly distributed manner across the nodes.

Contributing

  • Code style: 120 chars/line using Eclipse JDT via Spotless. Config lives in java-format.xml.
  • Formatting: run gradle spotlessApply before pushing changes.
gradle spotlessApply

Tuning Lucene

Many attempts have been made to tune Lucene and squeeze better numbers out of it. In every case, an improvement along one metric (IPS, QPS, Recall) has come at the cost of another. That is common for search engines and suggests we are approaching the limits of the engine.

Here are some of the ideas that were considered, and in some cases adopted (a small illustrative sketch follows the list):

  • Using StoredField instead of NumericDocValuesField for IDs.
  • Tuning the size and number of index segments to allow query-level parallelism.
  • Fixed 4GB segment size instead of dynamic calculation based on CPU cores.
  • Force merging segments when count exceeds optimal threshold.
  • Eliminated thread explosion where 191 threads were used per individual query.
  • Batch query processing using configurable --batch-size parameter.
  • TieredMergePolicy resulted in extremely slow single-threaded reductions by the end.
  • Replaced ByteBuffersDirectory with DynamicByteBuffersDirectory to handle >4GB indexes.
  • Aggressive concurrent merge scheduling with progress tracking.
  • Custom HNSW codec selection with fallback to best available implementation.
  • Using forceMerge before search queries to compact & optimize the index.
  • Committing the IndexWriter before search queries to trigger mergers.
