
Implement off-heap quantized scoring #14863


Open · wants to merge 1 commit into main

Conversation

kaivalnp
Contributor

Description

Off-heap scoring for quantized vectors! Related to #13515

This scorer is in line with Lucene99MemorySegmentFlatVectorsScorer, and will automatically be used with PanamaVectorizationProvider (i.e. when jdk.incubator.vector is added). Note that the computations are already vectorized; this change avoids the unnecessary copy to heap.

I added off-heap dot product functions for two compressed 4-bit int vectors (i.e. no need to "decompress" them first). I can try to come up with similar ones for Euclidean distance if this approach seems fine.
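
For reference, a minimal scalar sketch of the idea, assuming each byte packs two 4-bit values in its low and high nibbles (the vectorized MemorySegment code in this PR is more involved; class and method names here are illustrative only):

import java.lang.foreign.MemorySegment;
import static java.lang.foreign.ValueLayout.JAVA_BYTE;

class PackedInt4DotProduct {
  // Dot product of two packed 4-bit vectors, read nibble by nibble, so the packed
  // bytes never need to be expanded into a separate "decompressed" array first.
  static int dotProduct(MemorySegment a, MemorySegment b) {
    int total = 0;
    long length = a.byteSize(); // each segment holds dim/2 packed bytes
    for (long i = 0; i < length; i++) {
      byte x = a.get(JAVA_BYTE, i);
      byte y = b.get(JAVA_BYTE, i);
      total += (x & 0x0F) * (y & 0x0F);               // low nibbles
      total += ((x >> 4) & 0x0F) * ((y >> 4) & 0x0F); // high nibbles
    }
    return total;
  }
}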


This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@kaivalnp
Contributor Author

I ran some benchmarks on Cohere vectors (768d) for 7-bit and 4-bit (compressed) quantization.

main without jdk.incubator.vector:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.860        2.815   2.806        0.997  100000   100      50       64        250     7 bits     44.07       2269.17           46.79             1          373.72       366.592       73.624       HNSW
 0.545        3.193   3.185        0.997  100000   100      50       64        250     4 bits     47.26       2115.95           50.04             1          338.13       329.971       37.003       HNSW

main with jdk.incubator.vector:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.863        1.904   1.886        0.991  100000   100      50       64        250     7 bits     28.65       3490.65           29.66             1          373.69       366.592       73.624       HNSW
 0.545        1.313   1.305        0.994  100000   100      50       64        250     4 bits     22.86       4373.88           17.84             1          338.13       329.971       37.003       HNSW

This PR without jdk.incubator.vector:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.861        2.774   2.765        0.997  100000   100      50       64        250     7 bits     44.60       2242.00           46.71             1          373.73       366.592       73.624       HNSW
 0.545        3.147   3.139        0.997  100000   100      50       64        250     4 bits     47.93       2086.51           50.20             1          338.11       329.971       37.003       HNSW

This PR with jdk.incubator.vector:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.861        1.612   1.603        0.994  100000   100      50       64        250     7 bits     22.99       4349.53           24.78             1          373.70       366.592       73.624       HNSW
 0.545        1.277   1.269        0.994  100000   100      50       64        250     4 bits     21.60       4630.49           17.41             1          338.11       329.971       37.003       HNSW

I did see slight fluctuation across runs, but search time was ~10% faster for 7-bit and very slightly faster for 4-bit (compressed). Indexing and force-merge times improved by ~15%.

@kaivalnp
Contributor Author

FYI, I observed a strange phenomenon: if the query vector is kept on heap, like:

this.query = MemorySegment.ofArray(targetBytes);

instead of the current off-heap implementation in this PR:

this.query = Arena.ofAuto().allocateFrom(JAVA_BYTE, targetBytes);

...then we see a performance regression:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.862        3.043   3.034        0.997  100000   100      50       64        250     7 bits     23.25       4301.82           25.29             1          373.70       366.592       73.624       HNSW
 0.545        2.060   2.049        0.995  100000   100      50       64        250     4 bits     22.19       4506.33           17.99             1          338.17       329.971       37.003       HNSW

Maybe I'm missing something obvious, but I haven't found the root cause yet.
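
To make the two fragments above concrete, here is a self-contained sketch of both allocation paths (class and method names are mine, not from the PR):

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import static java.lang.foreign.ValueLayout.JAVA_BYTE;

class QuerySegments {
  // Heap-backed view: zero-copy, but the segment's memory stays on the Java heap
  static MemorySegment heapQuery(byte[] targetBytes) {
    return MemorySegment.ofArray(targetBytes);
  }

  // Native copy: a one-time copy into off-heap memory owned by an automatic arena,
  // freed once the arena becomes unreachable
  static MemorySegment offHeapQuery(byte[] targetBytes) {
    return Arena.ofAuto().allocateFrom(JAVA_BYTE, targetBytes);
  }
}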

@ChrisHegarty
Contributor

> ...then we see a performance regression:
> ...
> Maybe I'm missing something obvious, but I haven't found the root cause yet.

Yeah, I've seen something similar before. You might be hitting a problem with the loop bound not being hoisted. I will try to take a look.
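
(For illustration, one generic form of this pattern, assuming the bound would otherwise be re-evaluated in the loop condition; a sketch only, not necessarily the exact cause here:)

import java.lang.foreign.MemorySegment;
import static java.lang.foreign.ValueLayout.JAVA_BYTE;

class LoopBoundExample {
  static long sum(MemorySegment seg) {
    long total = 0;
    long bound = seg.byteSize();        // read the bound once into a local...
    for (long i = 0; i < bound; i++) {  // ...instead of re-evaluating it on every iteration
      total += seg.get(JAVA_BYTE, i);
    }
    return total;
  }
}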

@kaivalnp
Contributor Author

Thanks @ChrisHegarty! I saw that we use a heap-backed MemorySegment while scoring byte vectors, so I opened #14874 to investigate whether we can improve performance by moving to an off-heap query.


This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jul 15, 2025