Add JVector Codec to Lucene for ANN Searches #14892

Open · wants to merge 3 commits into base: main

Conversation

RKSPD commented Jul 2, 2025

Motivation

Lucene’s built‑in HNSW KnnVectorsFormat delivers strong recall and latency, but its index must reside entirely in RAM. As demand grows for vector datasets of higher dimensionality and larger index size, the cost of scaling HNSW-based systems becomes prohibitive.

JVector is a pure‑Java ANN engine that ultimately aims to merge DiskANN’s disk‑resident search with HNSW’s navigable‑small‑world graph. OpenSearch has successfully integrated JVector through the OpenSearch-JVector repository, but the current implementation contains several OpenSearch-specific dependencies.

Today, OpenSearch's implementation, and by extension this implementation, still loads the whole graph into RAM (like plain HNSW), but JVector's public roadmap is moving toward split‑layer storage, where only the upper graph levels live in memory and the deeper layers plus raw vectors remain on disk. As OpenSearch continues to develop new features and optimizations for its codec, this implementation allows those features to be developed and tested continually in Lucene itself. With this PR, I will also include a link to a luceneutil-jvector repository that works with the proposed JVector codec without significant modifications.

Dependency Information

  • io.github.jbellis:jvector:4.0.0-beta.6 – the ANN engine (automatic module jvector)
  • org.agrona:agrona:1.20.0 – off-heap buffer utilities
  • org.apache.commons:commons-math3:3.6.1 – PQ math helpers
  • org.yaml:snakeyaml:2.4 – only needed if you load YAML tuning files
  • org.slf4j:slf4j-api:2.0.17 – logging façade (overrides JVector’s 2.0.16 to match the rest of Lucene)
  • All jars have matching LICENSE/NOTICE entries added under lucene/licenses/

Vector Codec – design highlights

Per-segment, per-field indexes
Each Lucene segment owns its own JVector graph index. The graph payloads live in a single *.data-jvector file and the per-field metadata lives in a companion *.meta-jvector file, mirroring Lucene's existing *.vec / *.vex layout.
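
As an illustration only, the per-field entry recorded in *.meta-jvector could look roughly like the record below; the names and layout are placeholders, not the format actually written by this PR.

// Illustrative sketch of one per-field entry in *.meta-jvector; names are placeholders.
record JVectorFieldMetadataSketch(
    int fieldNumber,   // Lucene field number
    long graphOffset,  // where this field's graph starts inside *.data-jvector
    long graphLength,  // how many bytes the graph occupies
    long pqOffset,     // start of the optional PQ codebooks, or -1 if absent
    long pqLength,
    int maxConn,       // build parameters, recorded so the reader can interpret the slice
    int beamWidth) {}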

Bulk build at flush time
Vectors are streamed into the ordinary flat-vector writer while an in-memory OnHeapGraphIndex is built.
When the segment flushes, the whole graph (and optional Product-Quantization codebooks) is handed to OnDiskSequentialGraphIndexWriter and serialized to disk in one pass.
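
A rough outline of that flush path, expressed against Lucene's KnnVectorsWriter hook; the JVector-specific steps are only summarized in comments, since their exact signatures live in the jvector library and are not reproduced here.

import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.index.Sorter;

// Hypothetical outline, not the PR's actual writer class.
abstract class JVectorWriterSketch extends KnnVectorsWriter {
  @Override
  public void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException {
    // For each field that received vectors during indexing:
    //  1. the flat-vector writer has already buffered the raw vectors;
    //  2. the OnHeapGraphIndex built incrementally while adding vectors is frozen;
    //  3. the graph (plus optional PQ codebooks) is handed to
    //     OnDiskSequentialGraphIndexWriter and serialized in one pass into *.data-jvector;
    //  4. the slice's offset, length, and build parameters go into *.meta-jvector.
  }
}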

Single data file, concatenated fields
All field-specific graphs (and PQ blobs) are appended one after another inside *.data-jvector; their start offsets, lengths, and build parameters are recorded in *.meta-jvector so the reader can jump straight to the right slice.
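
For illustration, the write-side bookkeeping for one field's slice can be as simple as remembering the file pointer before and after the append; the helper below is a stand-in, not code from the PR.

import java.io.IOException;
import org.apache.lucene.store.IndexOutput;

final class SliceBookkeepingSketch {
  // Records where one field's serialized graph lands inside *.data-jvector.
  // The byte[] stands in for whatever the real graph writer emits.
  static long[] appendFieldGraph(IndexOutput dataOut, byte[] serializedGraph) throws IOException {
    long startOffset = dataOut.getFilePointer();
    dataOut.writeBytes(serializedGraph, serializedGraph.length);
    long length = dataOut.getFilePointer() - startOffset;
    // startOffset and length are exactly what *.meta-jvector stores per field,
    // so the reader can seek straight to this slice.
    return new long[] {startOffset, length};
  }
}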

Zero-copy loading on open
JVectorReader memory-maps the data file and spawns a lightweight OnDiskGraphIndex for each field via a ReaderSupplier. No temp files are created; the mmap'd bytes are shared across threads and searches.
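
On the read side, only standard Lucene calls are shown in the sketch below; how the resulting IndexInput slice is adapted into JVector's ReaderSupplier is omitted, and the file and slice names are illustrative.

import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

final class JVectorOpenSketch {
  // Opens the shared data file and hands out a per-field view; with MMapDirectory the
  // slice is a window over the mmap'd bytes, so nothing is copied and no temp files are
  // created. The caller owns (and eventually closes) the outer IndexInput.
  static IndexInput openFieldSlice(Directory dir, String dataFileName, long offset, long length)
      throws IOException {
    IndexInput data = dir.openInput(dataFileName, IOContext.DEFAULT);
    return data.slice("jvector-graph", offset, length);
  }
}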

Pure-Java search path
At query time the float vector is passed directly to GraphSearcher (DiskANN-style). Results are optionally re-ranked with an exact scorer, then surfaced through a thin JVectorKnnCollector wrapper so the rest of Lucene sees a normal TopDocs.
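
The Lucene-side entry point has the shape sketched below; the actual GraphSearcher call and the re-ranking step are summarized as comments, because their signatures belong to the jvector library and are not reproduced here.

import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.search.KnnCollector;
import org.apache.lucene.util.Bits;

// Hypothetical shape of the reader's search entry point, not the PR's actual class.
abstract class JVectorReaderSketch extends KnnVectorsReader {
  @Override
  public void search(String field, float[] target, KnnCollector knnCollector, Bits acceptDocs)
      throws IOException {
    // 1. run GraphSearcher over this field's OnDiskGraphIndex with the raw float[] query;
    // 2. optionally re-rank the approximate hits with an exact scorer;
    // 3. map ordinals to doc IDs and feed them to knnCollector (sketched in the next item),
    //    so callers receive a normal TopDocs.
  }
}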

Ordinal → doc-ID mapping still in Lucene
JVector returns internal ordinals; we convert them to docIDs using Lucene’s existing ordinal map during collection.
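
A minimal sketch of that conversion; ordToDoc here stands in for the ordinal-to-docID mapping Lucene already keeps for the flat vectors.

import org.apache.lucene.search.KnnCollector;

final class OrdinalCollectionSketch {
  // ordinals/scores are the graph search results; ordToDoc maps JVector ordinals to doc IDs.
  static void collect(int[] ordinals, float[] scores, int[] ordToDoc, KnnCollector collector) {
    for (int i = 0; i < ordinals.length && !collector.earlyTerminated(); i++) {
      collector.collect(ordToDoc[ordinals[i]], scores[i]);
    }
  }
}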

Long-Term Considerations

Split-layer storage roadmap

  • JVector aims to keep only the upper graph levels in RAM while deeper layers and raw vectors live on disk. Plan for API changes and configuration knobs as this feature stabilizes.

Backwards compatibility with previous JVector implementations

  • As the codec evolves, there is no guarantee that indexes generated by past JVectorCodec implementations will work with newer versions of JVector.

RKSPD (Author) commented Jul 2, 2025

Initial Benchmark Results

Small Corpus Testing (Wikipedia Cohere 768, 200k docs)

Results: Lucene
recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  overSample  vec_disk(MB)  vec_RAM(MB)  indexType
 0.803        2.800   2.343        0.837  200000   100     300       12         16     7 bits      8.46      23646.25            7.58             1          736.49       1.000       733.185      147.247       HNSW
 0.822        2.486   2.286        0.920  200000   100     300       12         20     7 bits      7.33      27273.97            7.45             1          736.76       1.000       733.185      147.247       HNSW
 0.857        2.657   2.429        0.914  200000   100     300       12         28     7 bits     13.64      14658.46            8.97             1          737.15       1.000       733.185      147.247       HNSW
 0.831        2.771   2.514        0.907  200000   100     300       16         16     7 bits      6.44      31075.20            7.42             1          736.61       1.000       733.185      147.247       HNSW
 0.846        2.857   2.571        0.900  200000   100     300       16         20     7 bits      7.19      27812.54            8.42             1          736.86       1.000       733.185      147.247       HNSW
 0.869        3.029   2.657        0.877  200000   100     300       16         28     7 bits      8.47      23626.70           10.04             1          737.17       1.000       733.185      147.247       HNSW
 0.847        2.829   2.486        0.879  200000   100     300       20         16     7 bits      6.11      32717.16            7.05             1          736.68       1.000       733.185      147.247       HNSW
 0.862        2.743   2.429        0.885  200000   100     300       20         20     7 bits      6.92      28893.38            8.13             1          736.88       1.000       733.185      147.247       HNSW
 0.883        3.086   2.743        0.889  200000   100     300       20         28     7 bits      7.94      25176.23            8.90             1          737.26       1.000       733.185      147.247       HNSW
 0.860        2.943   2.657        0.903  200000   100     300       24         16     7 bits      9.37      21342.44            7.21             1          736.69       1.000       733.185      147.247       HNSW
 0.880        3.371   3.143        0.932  200000   100     300       24         20     7 bits      7.77      25749.97            8.38             1          736.92       1.000       733.185      147.247       HNSW
 0.900        3.086   2.886        0.935  200000   100     300       24         28     7 bits      8.37      23900.57            9.70             1          737.29       1.000       733.185      147.247       HNSW
Results: JVector
recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.877        3.943   3.714        0.942  200000   100     300       12         16     7 bits     12.94      15458.34          101.28             1         1197.28       733.185      147.247       HNSW
 0.901        3.771   3.629        0.962  200000   100     300       12         20     7 bits     13.89      14394.70          123.37             1         1197.28       733.185      147.247       HNSW
 0.913        3.457   3.314        0.959  200000   100     300       12         28     7 bits     18.52      10802.05          136.17             1         1197.28       733.185      147.247       HNSW
 0.915        3.743   3.571        0.954  200000   100     300       16         16     7 bits     15.16      13193.48          118.83             1         1200.28       733.185      147.247       HNSW
 0.921        4.029   3.857        0.957  200000   100     300       16         20     7 bits     18.83      10620.22          134.91             1         1200.28       733.185      147.247       HNSW
 0.931        3.886   3.714        0.956  200000   100     300       16         28     7 bits     22.87       8746.61          174.35             1         1200.28       733.185      147.247       HNSW
 0.921        5.400   5.257        0.974  200000   100     300       20         16     7 bits     15.68      12758.36          126.82             1         1203.30       733.185      147.247       HNSW
 0.929        4.229   4.057        0.959  200000   100     300       20         20     7 bits     19.68      10161.57          152.86             1         1203.30       733.185      147.247       HNSW
 0.942        4.343   4.171        0.961  200000   100     300       20         28     7 bits     27.79       7197.35          212.50             1         1203.30       733.185      147.247       HNSW
 0.930        4.257   4.086        0.960  200000   100     300       24         16     7 bits     17.47      11449.51          131.11             1         1206.33       733.185      147.247       HNSW
 0.943        4.314   4.143        0.960  200000   100     300       24         20     7 bits     21.34       9371.63          162.54             1         1206.33       733.185      147.247       HNSW
 0.940        4.914   4.743        0.965  200000   100     300       24         28     7 bits     29.75       6722.24          235.78             1         1206.33       733.185      147.247       HNSW

Link to dataset straight from luceneutil.

Test Specifications

Hardware:

  • Amazon EC2 m8gd.2xlarge
  • 8 core / 8 thread Graviton4 ARM processor
  • 32GB RAM + 474GB SSD

Lucene Format Tested:

  • Lucene99HnswScalarQuantizedVectorsFormat

JVector Config:

  • QUERY_OVER_FACTOR = 3
  • compressedBytes = 64
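
My reading of these two knobs, for illustration only and not verified against the PR code: QUERY_OVER_FACTOR over-fetches candidates from the graph before exact re-ranking, and compressedBytes is the per-vector PQ budget.

final class ConfigSketch {
  // Illustrative arithmetic only; the actual constants live in the JVector codec.
  static void explain() {
    int topK = 100;
    int queryOverFactor = 3;                  // QUERY_OVER_FACTOR
    int candidates = topK * queryOverFactor;  // 300 approximate hits fetched, re-ranked down to 100
    int compressedBytes = 64;                 // each 768-dim float vector (3072 bytes) stored as 64 PQ bytes
    System.out.println(candidates + " candidates, " + compressedBytes + " bytes/vector");
  }
}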

Benchmark Config:

  • numMergeWorker = 16
  • numMergeThread = 8
  • numSearchThread = 8
  • numIndexingThread = 8

Merge Policy:

  • ForceMergesOnlyMergePolicy (bundled with the OpenSearch-JVector project and used for testing)
  • Found that for small-corpus testing (< 10M docs) it provided more consistent indexing-throughput results than TieredMergePolicy
  • However, overall indexing time plus merge time was slower, since with knnPerfTest merge time counts against indexing throughput

Notes:
In testing, I found these values gave high recall and indexing throughput while keeping latency similar to Lucene's. For future testing, I need guidance on whether oversampling should be enabled when benchmarking JVector, since knnPerfTest exposes codec-agnostic oversampling as a parameter.


github-actions bot commented Jul 2, 2025

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

RKSPD (Author) commented Jul 2, 2025

Testing JVectorCodec Using luceneutil-jvector

This guide provides step-by-step instructions for benchmarking and testing JVectorCodec performance using the luceneutil-jvector testing framework.

Prerequisites

  • Java development environment with Gradle support
  • Python 3.x installed
  • Git installed
  • SSD storage recommended for optimal performance

Setup Instructions

1. Environment Preparation

Create a benchmark directory on an SSD for optimal I/O performance:

mkdir LUCENE_BENCH_HOME
cd LUCENE_BENCH_HOME

2. Repository Cloning

Clone the required repositories:

git clone https://github.com/RKSPD/lucene-jvector lucene_candidate
git clone https://github.com/RKSPD/luceneutil-jvector util

Note: The lucene-jvector repository contains the same code as the PR under review.

3. Initial Setup and Data Download

Navigate to the utilities directory and run the initial setup:

cd util
python3 src/python/initial_setup.py -d

This command will download the necessary test datasets. The download process may take some time depending on your internet connection.

4. Lucene Build

While the data is downloading, open a new terminal session and build Lucene:

cd LUCENE_BENCH_HOME/lucene_candidate
./gradlew build

Running Performance Tests

5. Initial Test Run

Once both the build and download processes are complete, navigate back to the utilities directory:

cd LUCENE_BENCH_HOME/util

Run the KNN performance test:

./gradlew runKnnPerfTest

Important: The first execution is expected to fail. This initial run generates the path definitions for your Lucene repository and determines the Lucene version.

6. Successful Test Execution

Run the performance test a second time:

./gradlew runKnnPerfTest

This execution should complete successfully and provide performance metrics.

Configuration and Tuning

7. Parameter Customization

To customize the testing parameters for your specific benchmarking needs:

Merge Policy Configuration

  • File: util/src/main/knn/KnnIndexer.java
  • Purpose: Configure the merge policy for index optimization
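
For orientation, swapping the merge policy in KnnIndexer ultimately goes through the standard IndexWriterConfig hook; a minimal sketch follows. ForceMergesOnlyMergePolicy ships with the OpenSearch-JVector project, so the stock TieredMergePolicy stands in for it here.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

final class MergePolicySketch {
  static IndexWriterConfig config() {
    // Replace TieredMergePolicy with ForceMergesOnlyMergePolicy (from OpenSearch-JVector)
    // to reproduce the benchmark setup described earlier in this PR.
    return new IndexWriterConfig(new StandardAnalyzer()).setMergePolicy(new TieredMergePolicy());
  }
}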

Codec Configuration

  • File: util/src/main/knn/KnnGraphTester.java
  • Method: getCodec()
  • Purpose: Specify which codec implementation to test
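
A hedged sketch of the kind of change getCodec() needs: wrap the default codec and route vectors to the JVector format. The format's construction is left as a parameter because its exact constructor belongs to this PR's code and is not reproduced here.

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.KnnVectorsFormat;

final class CodecSelectionSketch {
  // jvectorFormat would be the KnnVectorsFormat added by this PR.
  static Codec jvectorCodec(KnnVectorsFormat jvectorFormat) {
    return new FilterCodec("JVectorBench", Codec.getDefault()) {
      @Override
      public KnnVectorsFormat knnVectorsFormat() {
        return jvectorFormat;
      }
    };
  }
}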

Performance Test Parameters

  • File: src/python/knnPerfTest.py
  • Section: params block
  • Purpose: Adjust various performance testing parameters including:
    • Vector dimensions
    • Index size
    • Query parameters
    • Recall targets
    • Other algorithm-specific settings

Expected Outcomes

Upon successful completion, you will have:

  • A fully configured benchmarking environment
  • Performance metrics comparing JVectorCodec against baseline implementations
  • Configurable parameters for comprehensive testing scenarios

Troubleshooting

  • Ensure sufficient disk space for dataset downloads and index generation
  • Verify Java and Python environments are properly configured
  • Check network connectivity if initial setup fails during download phase
  • Confirm SSD usage for optimal I/O performance during benchmarking

benwtrent (Member) commented

@RKSPD Those benchmarking results are interesting. Is the Lucene clause just regular Lucene HNSW or the Lucene integration of JVector?

RKSPD (Author) commented Jul 2, 2025

@benwtrent It's Lucene99HnswScalarQuantizedVectorsFormat. The benchmark supports different codecs depending on the parameters passed to knnPerfTest.py, so it's not immediately clear without checking the code... I updated my benchmark comment with the benchmark and machine specs!

benwtrent (Member) commented

Thank you @RKSPD !

Wrestling with Lucene Util can be frustrating at times but it's useful once you get the hang of it :)

msokolov (Contributor) commented Jul 3, 2025

BTW if you use a current checkout of luceneutil knnPerfTest.py will produce an HTML file with a graph of the test run - would love to see that here if possible?

RKSPD (Author) commented Jul 7, 2025

BTW if you use a current checkout of luceneutil knnPerfTest.py will produce an HTML file with a graph of the test run - would love to see that here if possible?

@msokolov Just updated my luceneutil, currently rerunning tests and will get those results to you asap!

Edit: I added the result HTML files as a GitHub Gist. Here's a preview of what they look like:

Lucene HNSW: [result graph image]

JVector: [result graph image]

sam-herman left a comment

Thank you very much for the contribution @RKSPD !

import org.apache.lucene.store.IndexOutput;

/**
* JVectorRandomAccessWriter is a wrapper around IndexOutput that implements RandomAccessWriter.


nit: one tiny thing to change in the comment (note to self: this should be fixed in the plugin too): it implements jVector's IndexWriter class. We now have a split between RandomAccessWriter and IndexWriter
