Add JVector Codec to Lucene for ANN Searches #14892
Initial Benchmark Results

Small Corpus Testing (Wikipedia Cohere 768, 200k docs)
Link to dataset straight from luceneutil.

Test Specifications
- Hardware:
- Lucene Version Tested:
- JVector Config:
- Benchmark Config:
- Merge Policy:
- Notes:
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.
Testing JVectorCodec Using luceneutil-jvector

This guide provides step-by-step instructions for benchmarking and testing JVectorCodec performance using the luceneutil-jvector testing framework.

Prerequisites
Setup Instructions

1. Environment Preparation
Create a benchmark directory on an SSD for optimal I/O performance:
2. Repository Cloning
Clone the required repositories:

3. Initial Setup and Data Download
Navigate to the utilities directory and run the initial setup:
This command will download the necessary test datasets. The download process may take some time depending on your internet connection.

4. Lucene Build
While the data is downloading, open a new terminal session and build Lucene:
Running Performance Tests

5. Initial Test Run
Once both the build and download processes are complete, navigate back to the utilities directory:
Run the KNN performance test:
Important: The first execution is expected to fail. This initial run generates the path definitions for your Lucene repository and determines the Lucene version.

6. Successful Test Execution
Run the performance test a second time:
This execution should complete successfully and provide performance metrics.

Configuration and Tuning

7. Parameter Customization
To customize the testing parameters for your specific benchmarking needs, adjust the following (a plain-Lucene sketch of the merge-policy and codec settings follows this list):
- Merge Policy Configuration
- Codec Configuration
- Performance Test Parameters
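For orientation, here is a minimal, hedged sketch of what the merge-policy and codec settings exercised by the benchmark look like in plain Lucene. The real knobs live in luceneutil's knnPerfTest.py; the JVectorCodec constructor and the TieredMergePolicy values below are assumptions for illustration, not the benchmark's actual configuration.

```java
import java.nio.file.Path;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.FSDirectory;

// import of JVectorCodec omitted: its package is defined by this PR

public class JVectorBenchmarkSetupSketch {
  public static IndexWriter openWriter(Path indexPath) throws Exception {
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());

    // Codec configuration: swap in the codec under test.
    // A no-arg constructor is assumed here; the real class may expose
    // tuning parameters (graph degree, beam width, ...).
    iwc.setCodec(new JVectorCodec());

    // Merge policy configuration: the benchmark drives this from knnPerfTest.py;
    // this is the plain-Lucene equivalent of picking a merge policy.
    TieredMergePolicy mergePolicy = new TieredMergePolicy();
    mergePolicy.setSegmentsPerTier(10);
    iwc.setMergePolicy(mergePolicy);

    return new IndexWriter(FSDirectory.open(indexPath), iwc);
  }
}
```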
Expected Outcomes
Upon successful completion, you will have:
Troubleshooting
@RKSPD Those benchmarking results are interesting. Is the baseline one of Lucene's HNSW vector formats?
@benwtrent It's a Lucene99HnswScalarQuantizedVectorsFormat. Depending on the parameters passed to knnPerfTest.py, the benchmark supports different codecs, so it's not immediately clear without checking the code... I updated my benchmark comment with the benchmark and machine specs!
Thank you @RKSPD! Wrestling with Lucene Util can be frustrating at times but it's useful once you get the hang of it :)
BTW if you use a current checkout of luceneutil, it can generate HTML result reports.
@msokolov Just updated my luceneutil, currently rerunning tests and will get those results to you asap! Edit: I added the result html files as a GitHub Gist. Here's a preview of what they look like:
Thank you very much for the contribution @RKSPD !
import org.apache.lucene.store.IndexOutput;

/**
 * JVectorRandomAccessWriter is a wrapper around IndexOutput that implements RandomAccessWriter.
nit: one tiny nit to change in the comment (note to self, should be fixed in the plugin too): it implements jVector's IndexWriter class. We now have a split between RandomAccessWriter and IndexWriter.
Motivation
Lucene's built-in HNSW KnnVectorsFormat delivers strong recall/latency, but its index must reside entirely in RAM. As demand grows for vector datasets of larger dimensionality and greater index size, the cost of scaling systems like HNSW becomes prohibitively expensive.
JVector is a pure‑Java ANN engine that ultimately aims to merge DiskANN’s disk‑resident search with HNSW’s navigable‑small‑world graph. OpenSearch has successfully integrated JVector through the OpenSearch-JVector repository, but the current implementation contains several OpenSearch-specific dependencies.
Today, OpenSearch's implementation, and by extension this implementation, still loads the whole graph into RAM (like plain HNSW), but its public roadmap is moving toward split-layer storage where only the upper graph levels live in memory and deeper layers plus raw vectors remain on disk. As OpenSearch continues to develop new features and optimizations for its codec, this implementation allows those features to be developed and tested in Lucene itself on an ongoing basis. With this PR, I will also include a link to a luceneutil-jvector repository that works with the proposed JVector codec without significant modifications.
Dependency Information
- io.github.jbellis:jvector:4.0.0-beta.6 – the ANN engine (automatic module jvector)
- org.agrona:agrona:1.20.0 – off-heap buffer utilities
- org.apache.commons:commons-math3:3.6.1 – PQ math helpers
- org.yaml:snakeyaml:2.4 – only needed if you load YAML tuning files
- org.slf4j:slf4j-api:2.0.17 – logging façade (overrides JVector's 2.0.16 to match the rest of Lucene)

License files for these dependencies are included under lucene/licenses/.
Vector Codec – design highlights
Per-segment, per-field indexes
Each Lucene segment owns its own JVector graph index. The graph payloads live in a single *.data-jvector file and the per-field metadata lives in a companion *.meta-jvector file, mirroring Lucene’s existing *.vec/ *.vex layout
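As a point of reference, the following is a minimal sketch of how such per-segment file names are typically derived in a Lucene vectors format. The extension strings come from the description above; the class and method names are illustrative, not the PR's actual code.

```java
import org.apache.lucene.index.IndexFileNames;
import org.apache.lucene.index.SegmentWriteState;

// Illustrative helper: derives the per-segment file names described above.
final class JVectorFileNamesSketch {
  static final String DATA_EXTENSION = "data-jvector";
  static final String META_EXTENSION = "meta-jvector";

  // e.g. "_0.data-jvector" for segment "_0" with an empty per-field suffix
  static String dataFileName(SegmentWriteState state) {
    return IndexFileNames.segmentFileName(
        state.segmentInfo.name, state.segmentSuffix, DATA_EXTENSION);
  }

  static String metaFileName(SegmentWriteState state) {
    return IndexFileNames.segmentFileName(
        state.segmentInfo.name, state.segmentSuffix, META_EXTENSION);
  }
}
```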
Bulk build at flush time
Vectors are streamed into the ordinary flat-vector writer while an in-memory OnHeapGraphIndex is built.
When the segment flushes, the whole graph (and optional Product-Quantization code-books) is handed to OnDiskSequentialGraphIndexWriter and serialized to disk in one pass
Single data file, concatenated fields
All field-specific graphs (and PQ blobs) are appended one after another inside *.data-jvector; their start-offsets, lengths and build parameters are recorded in *.meta-jvector so the reader can jump straight to the right slice
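To make the bookkeeping concrete, here is a hypothetical sketch of the kind of per-field entry such a metadata file records, written with Lucene's IndexOutput. The field list and encoding are assumptions for illustration; the actual *.meta-jvector format in this PR may differ.

```java
import java.io.IOException;

import org.apache.lucene.store.IndexOutput;

// Illustration only: one metadata entry per field, pointing at that field's
// slice of the concatenated *.data-jvector file.
final class JVectorMetaEntrySketch {
  static void writeFieldEntry(IndexOutput meta, int fieldNumber,
                              long dataOffset, long dataLength,
                              int maxDegree, int beamWidth) throws IOException {
    meta.writeVInt(fieldNumber); // which field this graph slice belongs to
    meta.writeVLong(dataOffset); // where the slice starts inside *.data-jvector
    meta.writeVLong(dataLength); // how many bytes the slice occupies
    meta.writeVInt(maxDegree);   // build parameters the reader needs to
    meta.writeVInt(beamWidth);   // interpret and search the graph
  }
}
```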
Zero-copy loading on open
JVectorReader memory-maps the data file and spawns a lightweight OnDiskGraphIndex for each field via ReaderSupplier. No temp files are created; the mmap’d bytes are shared across threads and searches
Pure-Java search path
At query time the float vector is passed directly to GraphSearcher (DiskANN-style). Results are optionally re-ranked with an exact scorer, then surfaced through a thin JVectorKnnCollector wrapper so the rest of Lucene sees a normal TopDocs
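From the application's point of view, none of this machinery is visible: searching an index written with the JVector codec goes through Lucene's standard KNN query API. A minimal sketch, assuming an existing index whose "vector" field (an illustrative name) was indexed with this codec:

```java
import java.nio.file.Path;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class JVectorSearchSketch {
  public static void main(String[] args) throws Exception {
    Path indexPath = Path.of(args[0]);
    float[] query = new float[768]; // fill with the query embedding

    try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(indexPath))) {
      IndexSearcher searcher = new IndexSearcher(reader);
      // The query routes into the codec's reader, i.e. the GraphSearcher-backed
      // path described above, and comes back as ordinary TopDocs.
      TopDocs hits = searcher.search(new KnnFloatVectorQuery("vector", query, 10), 10);
      for (ScoreDoc sd : hits.scoreDocs) {
        System.out.println(sd.doc + " score=" + sd.score);
      }
    }
  }
}
```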
Ordinal → doc-ID mapping still in Lucene
JVector returns internal ordinals; we convert them to docIDs using Lucene’s existing ordinal map during collection.
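The conversion itself is straightforward; the sketch below illustrates the idea with a plain int[] mapping standing in for Lucene's ordinal map (dense fields can map 1:1, sparse fields need the stored mapping). It is an illustration of the concept, not the PR's actual collection code.

```java
import org.apache.lucene.search.KnnCollector;

// Illustration only: translate JVector's graph-internal ordinals into
// segment doc IDs before handing results to Lucene.
final class OrdinalToDocSketch {
  static void collect(KnnCollector collector, int[] ordinals, float[] scores, int[] ordToDoc) {
    for (int i = 0; i < ordinals.length; i++) {
      int docId = ordToDoc[ordinals[i]];   // graph ordinal -> segment doc ID
      collector.collect(docId, scores[i]); // Lucene sees a normal (doc, score) pair
    }
  }
}
```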
Long-Term Considerations
Split-layer storage roadmap
Backwards compatibility with previous JVector implementations