Add JVector Codec to Lucene for ANN Searches #14892

Open · wants to merge 3 commits into base: main

Conversation

RKSPD commented Jul 2, 2025

Motivation

Lucene’s built‑in HNSW KnnVectorsFormat delivers strong recall and latency, but its index must reside entirely in RAM. As demand grows for vector datasets of higher dimensionality and larger index size, the cost of scaling HNSW-based systems becomes prohibitive.

JVector is a pure‑Java ANN engine that ultimately aims to merge DiskANN’s disk‑resident search with HNSW’s navigable‑small‑world graph. OpenSearch has successfully integrated JVector through the OpenSearch-JVector repository, but the current implementation contains several OpenSearch-specific dependencies.

Today, OpenSearch's implementation, and by extension this implementation, still loads the whole graph into RAM (like plain HNSW), but JVector's public roadmap is moving toward split‑layer storage, where only the upper graph levels live in memory and the deeper layers plus raw vectors remain on disk. As OpenSearch continues to develop new features and optimizations for its codec, this implementation allows those features to be developed and tested continually in Lucene itself. With this PR, I will also include a link to a luceneutil-jvector repository that works with the proposed JVector codec without significant modifications.

Dependency Information

  • io.github.jbellis:jvector:4.0.0-beta.6 – the ANN engine (automatic module jvector)
  • org.agrona:agrona:1.20.0 – off-heap buffer utilities
  • org.apache.commons:commons-math3:3.6.1 – PQ math helpers
  • org.yaml:snakeyaml:2.4 – only needed if you load YAML tuning files
  • org.slf4j:slf4j-api:2.0.17 – logging façade (overrides JVector’s 2.0.16 to match the rest of Lucene)
  • All jars have matching LICENSE/NOTICE entries added under lucene/licenses/

Vector Codec – design highlights

Per-segment, per-field indexes
Each Lucene segment owns its own JVector graph index. The graph payloads live in a single *.data-jvector file and the per-field metadata lives in a companion *.meta-jvector file, mirroring Lucene's existing *.vec / *.vex layout.
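
As an illustration only, the per-field entry recorded in *.meta-jvector could look roughly like the record below; the names and layout are placeholders, not the format actually written by this PR.

// Illustrative sketch of one per-field entry in *.meta-jvector; names are placeholders.
record JVectorFieldMetadataSketch(
    int fieldNumber,   // Lucene field number
    long graphOffset,  // where this field's graph starts inside *.data-jvector
    long graphLength,  // how many bytes the graph occupies
    long pqOffset,     // start of the optional PQ codebooks, or -1 if absent
    long pqLength,
    int maxConn,       // build parameters, recorded so the reader can interpret the slice
    int beamWidth) {}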

Bulk build at flush time
Vectors are streamed into the ordinary flat-vector writer while an in-memory OnHeapGraphIndex is built.
When the segment flushes, the whole graph (and optional Product-Quantization codebooks) is handed to OnDiskSequentialGraphIndexWriter and serialized to disk in one pass.
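
A rough outline of that flush path, expressed against Lucene's KnnVectorsWriter hook; the JVector-specific steps are only summarized in comments, since their exact signatures live in the jvector library and are not reproduced here.

import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.index.Sorter;

// Hypothetical outline, not the PR's actual writer class.
abstract class JVectorWriterSketch extends KnnVectorsWriter {
  @Override
  public void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException {
    // For each field that received vectors during indexing:
    //  1. the flat-vector writer has already buffered the raw vectors;
    //  2. the OnHeapGraphIndex built incrementally while adding vectors is frozen;
    //  3. the graph (plus optional PQ codebooks) is handed to
    //     OnDiskSequentialGraphIndexWriter and serialized in one pass into *.data-jvector;
    //  4. the slice's offset, length, and build parameters go into *.meta-jvector.
  }
}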

Single data file, concatenated fields
All field-specific graphs (and PQ blobs) are appended one after another inside *.data-jvector; their start offsets, lengths, and build parameters are recorded in *.meta-jvector so the reader can jump straight to the right slice.
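
For illustration, the write-side bookkeeping for one field's slice can be as simple as remembering the file pointer before and after the append; the helper below is a stand-in, not code from the PR.

import java.io.IOException;
import org.apache.lucene.store.IndexOutput;

final class SliceBookkeepingSketch {
  // Records where one field's serialized graph lands inside *.data-jvector.
  // The byte[] stands in for whatever the real graph writer emits.
  static long[] appendFieldGraph(IndexOutput dataOut, byte[] serializedGraph) throws IOException {
    long startOffset = dataOut.getFilePointer();
    dataOut.writeBytes(serializedGraph, serializedGraph.length);
    long length = dataOut.getFilePointer() - startOffset;
    // startOffset and length are exactly what *.meta-jvector stores per field,
    // so the reader can seek straight to this slice.
    return new long[] {startOffset, length};
  }
}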

Zero-copy loading on open
JVectorReader memory-maps the data file and spawns a lightweight OnDiskGraphIndex for each field via a ReaderSupplier. No temp files are created; the mmap'd bytes are shared across threads and searches.
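
On the read side, only standard Lucene calls are shown in the sketch below; how the resulting IndexInput slice is adapted into JVector's ReaderSupplier is omitted, and the file and slice names are illustrative.

import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

final class JVectorOpenSketch {
  // Opens the shared data file and hands out a per-field view; with MMapDirectory the
  // slice is a window over the mmap'd bytes, so nothing is copied and no temp files are
  // created. The caller owns (and eventually closes) the outer IndexInput.
  static IndexInput openFieldSlice(Directory dir, String dataFileName, long offset, long length)
      throws IOException {
    IndexInput data = dir.openInput(dataFileName, IOContext.DEFAULT);
    return data.slice("jvector-graph", offset, length);
  }
}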

Pure-Java search path
At query time the float vector is passed directly to GraphSearcher (DiskANN-style). Results are optionally re-ranked with an exact scorer, then surfaced through a thin JVectorKnnCollector wrapper so the rest of Lucene sees a normal TopDocs.
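
The Lucene-side entry point has the shape sketched below; the actual GraphSearcher call and the re-ranking step are summarized as comments, because their signatures belong to the jvector library and are not reproduced here.

import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.search.KnnCollector;
import org.apache.lucene.util.Bits;

// Hypothetical shape of the reader's search entry point, not the PR's actual class.
abstract class JVectorReaderSketch extends KnnVectorsReader {
  @Override
  public void search(String field, float[] target, KnnCollector knnCollector, Bits acceptDocs)
      throws IOException {
    // 1. run GraphSearcher over this field's OnDiskGraphIndex with the raw float[] query;
    // 2. optionally re-rank the approximate hits with an exact scorer;
    // 3. map ordinals to doc IDs and feed them to knnCollector (sketched in the next item),
    //    so callers receive a normal TopDocs.
  }
}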

Ordinal → doc-ID mapping still in Lucene
JVector returns internal ordinals; we convert them to docIDs using Lucene’s existing ordinal map during collection.
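
A minimal sketch of that conversion; ordToDoc here stands in for the ordinal-to-docID mapping Lucene already keeps for the flat vectors.

import org.apache.lucene.search.KnnCollector;

final class OrdinalCollectionSketch {
  // ordinals/scores are the graph search results; ordToDoc maps JVector ordinals to doc IDs.
  static void collect(int[] ordinals, float[] scores, int[] ordToDoc, KnnCollector collector) {
    for (int i = 0; i < ordinals.length && !collector.earlyTerminated(); i++) {
      collector.collect(ordToDoc[ordinals[i]], scores[i]);
    }
  }
}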

Long-Term Considerations

Split-layer storage roadmap

  • JVector aims to keep only the upper graph levels in RAM while deeper layers and raw vectors live on disk. Plan for API changes and configuration knobs as this feature stabilizes.

Backwards compatibility with previous JVector implementations

  • As the codec evolves, there is no guarantee that indexes generated by past JVectorCodec implementations will work with newer versions of JVector.

RKSPD (Author) commented Jul 2, 2025

Initial Benchmark Results

Small Corpus Testing (Wikipedia Cohere 768, 200k docs)

Results: Lucene
recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  overSample  vec_disk(MB)  vec_RAM(MB)  indexType
 0.803        2.800   2.343        0.837  200000   100     300       12         16     7 bits      8.46      23646.25            7.58             1          736.49       1.000       733.185      147.247       HNSW
 0.822        2.486   2.286        0.920  200000   100     300       12         20     7 bits      7.33      27273.97            7.45             1          736.76       1.000       733.185      147.247       HNSW
 0.857        2.657   2.429        0.914  200000   100     300       12         28     7 bits     13.64      14658.46            8.97             1          737.15       1.000       733.185      147.247       HNSW
 0.831        2.771   2.514        0.907  200000   100     300       16         16     7 bits      6.44      31075.20            7.42             1          736.61       1.000       733.185      147.247       HNSW
 0.846        2.857   2.571        0.900  200000   100     300       16         20     7 bits      7.19      27812.54            8.42             1          736.86       1.000       733.185      147.247       HNSW
 0.869        3.029   2.657        0.877  200000   100     300       16         28     7 bits      8.47      23626.70           10.04             1          737.17       1.000       733.185      147.247       HNSW
 0.847        2.829   2.486        0.879  200000   100     300       20         16     7 bits      6.11      32717.16            7.05             1          736.68       1.000       733.185      147.247       HNSW
 0.862        2.743   2.429        0.885  200000   100     300       20         20     7 bits      6.92      28893.38            8.13             1          736.88       1.000       733.185      147.247       HNSW
 0.883        3.086   2.743        0.889  200000   100     300       20         28     7 bits      7.94      25176.23            8.90             1          737.26       1.000       733.185      147.247       HNSW
 0.860        2.943   2.657        0.903  200000   100     300       24         16     7 bits      9.37      21342.44            7.21             1          736.69       1.000       733.185      147.247       HNSW
 0.880        3.371   3.143        0.932  200000   100     300       24         20     7 bits      7.77      25749.97            8.38             1          736.92       1.000       733.185      147.247       HNSW
 0.900        3.086   2.886        0.935  200000   100     300       24         28     7 bits      8.37      23900.57            9.70             1          737.29       1.000       733.185      147.247       HNSW
Results: JVector
recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.877        3.943   3.714        0.942  200000   100     300       12         16     7 bits     12.94      15458.34          101.28             1         1197.28       733.185      147.247       HNSW
 0.901        3.771   3.629        0.962  200000   100     300       12         20     7 bits     13.89      14394.70          123.37             1         1197.28       733.185      147.247       HNSW
 0.913        3.457   3.314        0.959  200000   100     300       12         28     7 bits     18.52      10802.05          136.17             1         1197.28       733.185      147.247       HNSW
 0.915        3.743   3.571        0.954  200000   100     300       16         16     7 bits     15.16      13193.48          118.83             1         1200.28       733.185      147.247       HNSW
 0.921        4.029   3.857        0.957  200000   100     300       16         20     7 bits     18.83      10620.22          134.91             1         1200.28       733.185      147.247       HNSW
 0.931        3.886   3.714        0.956  200000   100     300       16         28     7 bits     22.87       8746.61          174.35             1         1200.28       733.185      147.247       HNSW
 0.921        5.400   5.257        0.974  200000   100     300       20         16     7 bits     15.68      12758.36          126.82             1         1203.30       733.185      147.247       HNSW
 0.929        4.229   4.057        0.959  200000   100     300       20         20     7 bits     19.68      10161.57          152.86             1         1203.30       733.185      147.247       HNSW
 0.942        4.343   4.171        0.961  200000   100     300       20         28     7 bits     27.79       7197.35          212.50             1         1203.30       733.185      147.247       HNSW
 0.930        4.257   4.086        0.960  200000   100     300       24         16     7 bits     17.47      11449.51          131.11             1         1206.33       733.185      147.247       HNSW
 0.943        4.314   4.143        0.960  200000   100     300       24         20     7 bits     21.34       9371.63          162.54             1         1206.33       733.185      147.247       HNSW
 0.940        4.914   4.743        0.965  200000   100     300       24         28     7 bits     29.75       6722.24          235.78             1         1206.33       733.185      147.247       HNSW

Link to dataset straight from luceneutil.

Test Specifications

Hardware:

  • Amazon EC2 m8gd.2xlarge
  • 8 core / 8 thread Graviton4 ARM processor
  • 32GB RAM + 474GB SSD

Lucene Format Tested:

  • Lucene99HnswScalarQuantizedVectorsFormat

JVector Config:

  • QUERY_OVER_FACTOR = 3
  • compressedBytes = 64
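
My reading of these two knobs, for illustration only and not verified against the PR code: QUERY_OVER_FACTOR over-fetches candidates from the graph before exact re-ranking, and compressedBytes is the per-vector PQ budget.

final class ConfigSketch {
  // Illustrative arithmetic only; the actual constants live in the JVector codec.
  static void explain() {
    int topK = 100;
    int queryOverFactor = 3;                  // QUERY_OVER_FACTOR
    int candidates = topK * queryOverFactor;  // 300 approximate hits fetched, re-ranked down to 100
    int compressedBytes = 64;                 // each 768-dim float vector (3072 bytes) stored as 64 PQ bytes
    System.out.println(candidates + " candidates, " + compressedBytes + " bytes/vector");
  }
}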

Benchmark Config:

  • numMergeWorker = 16
  • numMergeThread = 8
  • numSearchThread = 8
  • numIndexingThread = 8

Merge Policy:

  • ForceMergesOnlyMergePolicy (bundled with the OpenSearch-JVector project and used for testing)
  • Found that for small-corpus testing (< 10M docs) it provided more consistent indexing-throughput results than TieredMergePolicy
  • However, overall indexing time plus merge time was slower, since with knnPerfTest merge time counts against indexing throughput

Notes:
In testing, I found these values gave high recall and indexing throughput while keeping latency similar to Lucene's. For future testing, I need guidance on whether oversampling should be enabled when benchmarking JVector, since knnPerfTest exposes codec-agnostic oversampling as a parameter.


github-actions bot commented Jul 2, 2025

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

RKSPD (Author) commented Jul 2, 2025

Testing JVectorCodec Using luceneutil-jvector

This guide provides step-by-step instructions for benchmarking and testing JVectorCodec performance using the luceneutil-jvector testing framework.

Prerequisites

  • Java development environment with Gradle support
  • Python 3.x installed
  • Git installed
  • SSD storage recommended for optimal performance

Setup Instructions

1. Environment Preparation

Create a benchmark directory on an SSD for optimal I/O performance:

mkdir LUCENE_BENCH_HOME
cd LUCENE_BENCH_HOME

2. Repository Cloning

Clone the required repositories:

git clone https://github.com/RKSPD/lucene-jvector lucene_candidate
git clone https://github.com/RKSPD/luceneutil-jvector util

Note: The lucene-jvector repository contains the same code as the PR under review.

3. Initial Setup and Data Download

Navigate to the utilities directory and run the initial setup:

cd util
python3 src/python/initial_setup.py -d

This command will download the necessary test datasets. The download process may take some time depending on your internet connection.

4. Lucene Build

While the data is downloading, open a new terminal session and build Lucene:

cd LUCENE_BENCH_HOME/lucene_candidate
./gradlew build

Running Performance Tests

5. Initial Test Run

Once both the build and download processes are complete, navigate back to the utilities directory:

cd LUCENE_BENCH_HOME/util

Run the KNN performance test:

./gradlew runKnnPerfTest

Important: The first execution is expected to fail. This initial run generates the path definitions for your Lucene repository and determines the Lucene version.

6. Successful Test Execution

Run the performance test a second time:

./gradlew runKnnPerfTest

This execution should complete successfully and provide performance metrics.

Configuration and Tuning

7. Parameter Customization

To customize the testing parameters for your specific benchmarking needs:

Merge Policy Configuration

  • File: util/src/main/knn/KnnIndexer.java
  • Purpose: Configure the merge policy for index optimization
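
For orientation, swapping the merge policy in KnnIndexer ultimately goes through the standard IndexWriterConfig hook; a minimal sketch follows. ForceMergesOnlyMergePolicy ships with the OpenSearch-JVector project, so the stock TieredMergePolicy stands in for it here.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

final class MergePolicySketch {
  static IndexWriterConfig config() {
    // Replace TieredMergePolicy with ForceMergesOnlyMergePolicy (from OpenSearch-JVector)
    // to reproduce the benchmark setup described earlier in this PR.
    return new IndexWriterConfig(new StandardAnalyzer()).setMergePolicy(new TieredMergePolicy());
  }
}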

Codec Configuration

  • File: util/src/main/knn/KnnGraphTester.java
  • Method: getCodec()
  • Purpose: Specify which codec implementation to test
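
A hedged sketch of the kind of change getCodec() needs: wrap the default codec and route vectors to the JVector format. The format's construction is left as a parameter because its exact constructor belongs to this PR's code and is not reproduced here.

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.KnnVectorsFormat;

final class CodecSelectionSketch {
  // jvectorFormat would be the KnnVectorsFormat added by this PR.
  static Codec jvectorCodec(KnnVectorsFormat jvectorFormat) {
    return new FilterCodec("JVectorBench", Codec.getDefault()) {
      @Override
      public KnnVectorsFormat knnVectorsFormat() {
        return jvectorFormat;
      }
    };
  }
}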

Performance Test Parameters

  • File: src/python/knnPerfTest.py
  • Section: params block
  • Purpose: Adjust various performance testing parameters including:
    • Vector dimensions
    • Index size
    • Query parameters
    • Recall targets
    • Other algorithm-specific settings

Expected Outcomes

Upon successful completion, you will have:

  • A fully configured benchmarking environment
  • Performance metrics comparing JVectorCodec against baseline implementations
  • Configurable parameters for comprehensive testing scenarios

Troubleshooting

  • Ensure sufficient disk space for dataset downloads and index generation
  • Verify Java and Python environments are properly configured
  • Check network connectivity if initial setup fails during download phase
  • Confirm SSD usage for optimal I/O performance during benchmarking

benwtrent (Member) commented

@RKSPD Those benchmarking results are interesting. Is the Lucene clause just regular Lucene HNSW or the Lucene integration of JVector?

RKSPD (Author) commented Jul 2, 2025

@benwtrent It's Lucene99HnswScalarQuantizedVectorsFormat. The benchmark supports different codecs depending on the parameters passed to knnPerfTest.py, so it's not immediately clear without checking the code... I updated my benchmark comment with the benchmark and machine specs!

benwtrent (Member) commented

Thank you @RKSPD !

Wrestling with Lucene Util can be frustrating at times but it's useful once you get the hang of it :)

msokolov (Contributor) commented Jul 3, 2025

BTW if you use a current checkout of luceneutil knnPerfTest.py will produce an HTML file with a graph of the test run - would love to see that here if possible?

RKSPD (Author) commented Jul 7, 2025

BTW if you use a current checkout of luceneutil knnPerfTest.py will produce an HTML file with a graph of the test run - would love to see that here if possible?

@msokolov Just updated my luceneutil, currently rerunning tests and will get those results to you asap!

Edit: I added the result HTML files as a GitHub Gist. Here's a preview of what they look like:

Lucene HNSW: [result graph image]

JVector: [result graph image]

sam-herman left a comment

Thank you very much for the contribution @RKSPD !

import org.apache.lucene.store.IndexOutput;

/**
* JVectorRandomAccessWriter is a wrapper around IndexOutput that implements RandomAccessWriter.


nit: one tiny thing to change in the comment (note to self: this should be fixed in the plugin too): it implements jVector's IndexWriter class. We now have a split between RandomAccessWriter and IndexWriter
