
Fix off-heap byte vector scoring at query time #14874

Open · wants to merge 1 commit into base: main

Conversation

@kaivalnp
Contributor

kaivalnp commented Jun 30, 2025

Description

In #14863, I noticed a regression in computing vector scores when the query vector was on heap (i.e. wrapped using MemorySegment.ofArray) but the doc vector was off-heap (mmap-ed from the index).

We do have a JMH benchmark for off-heap scoring performance -- but it only covers the indexing case (using an UpdateableRandomVectorScorer), where both vectors come from the mmap-ed (off-heap) input.

I also added benchmark functions to cover the searching case (using a RandomVectorScorer).
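For context, a minimal sketch of the allocation difference at play (method names are mine, not the PR's exact code; the off-heap copy is the Arena.ofAuto().allocateFrom approach discussed further down):

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import static java.lang.foreign.ValueLayout.JAVA_BYTE;

class QuerySegmentSketch {
  // Before: wrap the on-heap byte[] directly -- zero-copy, but the resulting
  // heap-backed segment is then scored against an off-heap (mmap-ed) doc segment.
  static MemorySegment wrapOnHeap(byte[] query) {
    return MemorySegment.ofArray(query);
  }

  // After: copy the query once per search into a small native segment, so both
  // sides of the distance computation are off-heap segments.
  static MemorySegment copyOffHeap(byte[] query) {
    return Arena.ofAuto().allocateFrom(JAVA_BYTE, query);
  }
}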

main:

Benchmark                                               (size)   Mode  Cnt  Score   Error   Units
VectorScorerBenchmark.binaryDotProductIndexingDefault     1024  thrpt   15  2.299 ± 0.017  ops/us
VectorScorerBenchmark.binaryDotProductIndexingMemSeg      1024  thrpt   15  7.533 ± 0.107  ops/us
VectorScorerBenchmark.binaryDotProductSearchingDefault    1024  thrpt   15  2.332 ± 0.017  ops/us
VectorScorerBenchmark.binaryDotProductSearchingMemSeg     1024  thrpt   15  2.069 ± 0.069  ops/us

This PR:

Benchmark                                               (size)   Mode  Cnt  Score   Error   Units
VectorScorerBenchmark.binaryDotProductIndexingDefault     1024  thrpt   15  2.295 ± 0.012  ops/us
VectorScorerBenchmark.binaryDotProductIndexingMemSeg      1024  thrpt   15  7.551 ± 0.027  ops/us
VectorScorerBenchmark.binaryDotProductSearchingDefault    1024  thrpt   15  2.341 ± 0.019  ops/us
VectorScorerBenchmark.binaryDotProductSearchingMemSeg     1024  thrpt   15  4.241 ± 0.064  ops/us

We see a ~2x improvement in vector scoring throughput! (And on main, the off-heap searching case is actually slower than the on-heap one?)
I'm not sure if this is specific to my machine or a more general issue.

Pasting CPU details in case it is relevant:

Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  1
Core(s) per socket:  64
Socket(s):           1
NUMA node(s):        1
Vendor ID:           ARM
Model:               1
Stepping:            r1p1
BogoMIPS:            2100.00
L1d cache:           64K
L1i cache:           64K
L2 cache:            1024K
L3 cache:            32768K
NUMA node0 CPU(s):   0-63
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng

and java --version:

openjdk 24.0.1 2025-04-15
OpenJDK Runtime Environment Corretto-24.0.1.9.1 (build 24.0.1+9-FR)
OpenJDK 64-Bit Server VM Corretto-24.0.1.9.1 (build 24.0.1+9-FR, mixed mode, sharing)


This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@kaivalnp
Contributor Author

kaivalnp commented Jun 30, 2025

Here's another set of benchmarks on an x86 machine (the results are similar to the aarch64 results above)

main:

Benchmark                                               (size)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.binaryDotProductIndexingDefault     1024  thrpt   15   2.129 ± 0.006  ops/us
VectorScorerBenchmark.binaryDotProductIndexingMemSeg      1024  thrpt   15  10.411 ± 0.065  ops/us
VectorScorerBenchmark.binaryDotProductSearchingDefault    1024  thrpt   15   2.130 ± 0.006  ops/us
VectorScorerBenchmark.binaryDotProductSearchingMemSeg     1024  thrpt   15   2.215 ± 0.071  ops/us

This PR:

Benchmark                                               (size)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.binaryDotProductIndexingDefault     1024  thrpt   15   2.133 ± 0.004  ops/us
VectorScorerBenchmark.binaryDotProductIndexingMemSeg      1024  thrpt   15  10.659 ± 0.106  ops/us
VectorScorerBenchmark.binaryDotProductSearchingDefault    1024  thrpt   15   2.130 ± 0.008  ops/us
VectorScorerBenchmark.binaryDotProductSearchingMemSeg     1024  thrpt   15   4.317 ± 0.317  ops/us

CPU details:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping:            7
CPU MHz:             3043.199
BogoMIPS:            4999.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-47
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke

and java --version:

openjdk 24.0.1 2025-04-15
OpenJDK Runtime Environment Corretto-24.0.1.9.1 (build 24.0.1+9-FR)
OpenJDK 64-Bit Server VM Corretto-24.0.1.9.1 (build 24.0.1+9-FR, mixed mode, sharing)

@msokolov
Contributor

Wow, interesting, nice find! If you have time, it would also be great to see some knnPerfTest.py results comparing before/after this change, in addition to the microbenchmarks, just to make extra sure.

@kaivalnp
Contributor Author

kaivalnp commented Jul 1, 2025

see some knnPerfTest.py results comparing before/after

I tried this on 300d byte vectors generated using:

./gradlew vectors-300

..and the results are strange!


main without jdk.incubator.vector:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.834        0.507   0.498        0.982  100000   100      50       64        250         no     23.20       4310.16            0.00             1           31.91       114.441      114.441       HNSW

main with jdk.incubator.vector:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.835        0.449   0.441        0.982  100000   100      50       64        250         no     15.36       6511.69            0.00             1           31.91       114.441      114.441       HNSW

This PR without jdk.incubator.vector:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.839        0.524   0.516        0.985  100000   100      50       64        250         no     22.71       4402.57            0.00             1           31.93       114.441      114.441       HNSW

This PR with jdk.incubator.vector:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.834        0.832   0.825        0.992  100000   100      50       64        250         no     15.43       6479.20            0.00             1           31.92       114.441      114.441       HNSW

We see a ~85% increase in query latency for the vectorized implementation after these changes.


I understand there is a cost to allocating a native MemorySegment, copying the query vector, and the extra GC overhead, but doesn't the regression seem a bit too high?

Attaching the flame graph comparison as well:
[Screenshot: flame graph comparison]

Any leads to debug would be appreciated!

@kaivalnp
Contributor Author

kaivalnp commented Jul 1, 2025

One possible explanation (from the memory profiler) for the dot products being expensive is that the widening conversion from bytes to integers is costly:
[Screenshot: memory profiler output]
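For reference on what that conversion involves, here's a rough, hand-rolled sketch of a widened byte dot product with the Panama Vector API -- my own illustration, not Lucene's actual dotProductBody256 (it also naively reduces every iteration, which the real kernels avoid); it needs --add-modules jdk.incubator.vector:

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

class WideningDotProductSketch {
  private static final VectorSpecies<Byte> BYTE_64 = ByteVector.SPECIES_64;    // 8 byte lanes
  private static final VectorSpecies<Integer> INT_256 = IntVector.SPECIES_256; // 8 int lanes

  static int dotProduct(byte[] a, byte[] b) {
    int i = 0, res = 0;
    for (; i < BYTE_64.loopBound(a.length); i += BYTE_64.length()) {
      ByteVector va = ByteVector.fromArray(BYTE_64, a, i);
      ByteVector vb = ByteVector.fromArray(BYTE_64, b, i);
      // every 8-bit lane has to be widened to a 32-bit lane before multiplying
      IntVector wa = (IntVector) va.convertShape(VectorOperators.B2I, INT_256, 0);
      IntVector wb = (IntVector) vb.convertShape(VectorOperators.B2I, INT_256, 0);
      res += wa.mul(wb).reduceLanes(VectorOperators.ADD);
    }
    for (; i < a.length; i++) {
      res += a[i] * b[i]; // scalar tail
    }
    return res;
  }
}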

@msokolov
Contributor

msokolov commented Jul 3, 2025

I would be curious to see if the results hold up for 768d vectors. Maybe 300 not being divisible by 8 is problematic, although I don't believe it. But ... the results look so weird, I just want to try some other random thing and see if it holds up?

@kaivalnp
Contributor Author

kaivalnp commented Jul 3, 2025

if the results hold up for 768d vectors

Thanks @msokolov! I quantized the Cohere 768d vectors by:

  1. normalizing
  2. scaling each dimension by 256
  3. clipping between [-128, 127]

i.e. basically replicating this snippet
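In Java terms, roughly this (a sketch of the three steps above, not the actual luceneutil snippet):

static byte[] quantize(float[] vector) {
  // 1. normalize to unit length
  double norm = 0;
  for (float v : vector) {
    norm += v * v;
  }
  norm = Math.sqrt(norm);
  byte[] quantized = new byte[vector.length];
  for (int i = 0; i < vector.length; i++) {
    // 2. scale each dimension by 256, 3. clip to [-128, 127]
    long scaled = Math.round(vector[i] / norm * 256);
    quantized[i] = (byte) Math.max(-128, Math.min(127, scaled));
  }
  return quantized;
}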

But something really strange is happening!

I first ran a set of benchmarks using -reindex to create fresh indices and ensure that the indexing times are not adversely affected:

main (run 1)

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.940        1.568   1.567        1.000  200000   100      50       64        250         no     63.52       3148.47           23.21             1          152.11       585.938      585.938       HNSW

This PR (run 2)

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.939        2.894   2.894        1.000  200000   100      50       64        250         no     37.69       5306.87           12.99             1          152.09       585.938      585.938       HNSW

This felt a bit strange because of a higher latency at query time (conflicting with JMH results) + lower indexing time in this PR -- so I ran a follow-up without -reindex to search from a built index (basically re-using the index created in run 2 for runs 3 and 4):

main (run 3)

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.939        1.602   1.601        0.999  200000   100      50       64        250         no      0.00      Infinity            0.12             1          152.09       585.938      585.938       HNSW

This PR (run 4)

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.939        1.090   1.090        1.000  200000   100      50       64        250         no      0.00      Infinity            0.12             1          152.09       585.938      585.938       HNSW

The performance in main is unchanged -- but we see significant improvement after moving the query off-heap!
I don't have a good explanation for this, I'll try to replicate on 300d vectors to see if this holds true!

Meanwhile, it'd be great if you could reproduce the benchmarks like I did :)

@kaivalnp
Contributor Author

kaivalnp commented Jul 3, 2025

Okay I ran things slightly differently for 300d vectors. All runs are without -reindex, but I'm deleting the index between runs of main and this PR to create a fresh one

main (run 1, no pre-existing index)

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.808        0.635   0.635        1.000  200000   100      50       64        250         no     38.55       5188.07            0.00             1           64.00       228.882      228.882       HNSW

main (run 2, use same index as run 1)

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.808        0.635   0.634        0.999  200000   100      50       64        250         no      0.00      Infinity            0.12             1           64.00       228.882      228.882       HNSW

This PR (run 3, no pre-existing index)

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.809        0.977   0.978        1.001  200000   100      50       64        250         no     40.11       4985.91            0.00             1           63.95       228.882      228.882       HNSW

This PR (run 4, use same index as run 3)

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.809        0.491   0.491        1.000  200000   100      50       64        250         no      0.00      Infinity            0.12             1           63.95       228.882      228.882       HNSW

The results seem consistent with what we saw above: the changes in this PR cause a regression when searching immediately after indexing, but a speedup when searching an existing index?


Edit: The indexing time did not improve when starting fresh -- so I kept the old index (from run 4) around and ran another benchmark (run 5) on this PR using -reindex:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.809        1.626   1.626        1.000  200000   100      50       64        250         no     24.59       8133.39            0.00             1           64.00       228.882      228.882       HNSW

..we see faster indexing times! (but the slow query-time issue still persists)

I ran a similar setup on main as well -- kept the old index (from run 5) around and ran another benchmark (run 6) using -reindex:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.806        0.477   0.477        0.999  200000   100      50       64        250         no     24.50       8163.60            0.00             1           63.98       228.882      228.882       HNSW

..and both query / indexing times have improved!

Maybe it's something related to OS caching; I'm still confused :)

@msokolov
Contributor

msokolov commented Jul 9, 2025

I was able to reproduce the weird behavior where the first run produces higher latency than subsequent runs. I also noticed netCPU and avgCpuCount were both negative in that first run?? I'm suspecting a bug in KnnGraphTester that may be related to having run the indexing (or maybe having computed the exact knn results) in the same run?

@kaivalnp
Contributor Author

kaivalnp commented Jul 9, 2025

I also noticed netCPU and avgCpuCount were both negative in that first run??

Yeah, I noticed this too! I think it was because of computing exact KNN results; I didn't see the same issue when re-indexing.

@msokolov
Contributor

msokolov commented Jul 9, 2025

My weird results (w/ 768d Cohere vectors). In sum, it looks to me as if we have some noise immediately after reindexing. Either this is a measurement artifact in luceneutil or it is a startup transient related to page caching / preloading / madvise / something. But after things settle down this is clearly a big improvement and we should merge, while also trying to understand this measurement problem -- which is pre-existing and should not block this change.

run 1

baseline

The first line does a reindex -- and then latency and netCPU increase? This has nothing to do with this PR.

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited
 0.987       17.861  17.850        0.999  500000   100      50       32        250         no    25663
 0.987       25.272  25.256        0.999  500000   100      50       32        250         no    25663
 0.987       24.355  24.340        0.999  500000   100      50       32        250         no    25663

candidate

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited
 0.987       13.261  13.251        0.999  500000   100      50       32        250         no    25663
 0.987       13.366  13.356        0.999  500000   100      50       32        250         no    25663

run 2

candidate (reindex each time)

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited
 0.983       57.972  57.959        1.000  500000   100      50       32        250         no    25648
 0.980       55.219  55.205        1.000  500000   100      50       32        250         no    25802
 0.986       93.766  93.752        1.000  500000   100      50       32        250         no    25630

no reindex

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited
 0.986       13.170  13.159        0.999  500000   100      50       32        250         no    25630
 0.986       13.411  13.400        0.999  500000   100      50       32        250         no    25630
 0.986       13.535  13.523        0.999  500000   100      50       32        250         no    25630

baseline (no reindex)

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited
 0.986       26.048  26.033        0.999  500000   100      50       32        250         no    25630
 0.986       25.954  25.938        0.999  500000   100      50       32        250         no    25630

@msokolov
Contributor

msokolov commented Jul 9, 2025

I tried another test:

  1. knnGraphTest.py with -reindex and -stats (no searching)
  2. knnGraphTest.py with no -reindex and -search

The idea was that if there were some kind of OS paging effect, it would affect this in the same way as the normal -reindex + -search run, but it did not: I saw the latency improvement (~13 ms). So this makes me think something happens in the JVM that does the indexing that predisposes it to have worse search performance? Some kind of hotspot something or other? Next I want to try looking at a profiler, but it is a bit tricky since the index + search profile is going to be muddied by the indexing part.

@msokolov
Contributor

I did some deep-diving with a profiler and I realized that when indexing, we call these dotProduct methods in a different context in which all of the vectors are on-heap. I'm surmising that this induces the hotspot compiler to make some choices that end up being bad later, when half of the vectors (the document ones) are off-heap; i.e. while searching. To test this theory I cloned dotProduct(MemorySegment, MemorySegment) and called the new dotProductWTF(MemorySegment, MemorySegment) from dotProduct(byte[], byte[]). This did indeed lead to even greater speedups (another 2x, astoundingly), although strangely, a small but significant drop in recall?? I'm now trying again with single-threaded merges to see if somehow maybe increased lock contention during merging is leading to the recall drop. Hmm, actually I wonder if it's simply that there are a different number of segments being produced.

Separately, this whole line of attack made me wonder if maybe we ought to be storing our vectors differently while indexing ... instead of a List<byte[]> or List<float[]> we might prefer to preallocate a larger array and write to it, advancing a pointer as we go. If we did this we could also maybe preallocate a MemorySegment along with it to use for scoring? I don't know what impact, if any, this might have on the whole hotspot discovery, but it is becoming clear to me that the way we manage memory can have a major impact on performance here in non-obvious ways.
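Purely as a sketch of that idea (hypothetical, not code in this PR or in Lucene):

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import static java.lang.foreign.ValueLayout.JAVA_BYTE;

// One flat preallocated native buffer instead of a List<byte[]>, with a write
// pointer that advances as vectors are added.
final class FlatVectorBuffer {
  private final MemorySegment segment;
  private final int dims;
  private long count = 0;

  FlatVectorBuffer(int dims, int maxVectors) {
    this.dims = dims;
    this.segment = Arena.ofAuto().allocate((long) dims * maxVectors);
  }

  // Copies one vector in and returns its ordinal.
  long add(byte[] vector) {
    MemorySegment.copy(vector, 0, segment, JAVA_BYTE, count * dims, dims);
    return count++;
  }

  // Zero-copy view of a stored vector, directly usable by MemorySegment-based scorers.
  MemorySegment vector(long ord) {
    return segment.asSlice(ord * dims, dims);
  }
}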

I'm also still somewhat confused about the status of the whole Panama vectorization code. I guess it is still incubating, so we have decided not to require it, and our code paths are conditional on whether it is enabled or not. I think this conditionalization itself may be problematic from a performance perspective, although this is more a suspicion than anything, but I wonder if it would be possible to just require the incubating API to be enabled at this point? It is in its umpteenth incubation cycle; it seems like it should be safe.

@msokolov
Contributor

Separately, I tried using the Arena.ofAuto().allocateFrom() construct in the on-heap case that is used during indexing, and this made indexing incredibly slow. I guess it is because we force the allocation of many, many small memory segments that have to be cleaned up by the garbage collector. Possibly during search, when the memory is off-heap, this is faster because the allocation is more lightweight? In any case it points to the need to have separate code paths for the on-heap and off-heap cases, or else more explicit management of the lifecycle of these memory segments, rather than leaving it up to GC as happens with ofAuto().
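To illustrate the lifecycle difference (names are mine; this just contrasts the GC-managed and explicit options):

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import static java.lang.foreign.ValueLayout.JAVA_BYTE;

class SegmentLifecycleSketch {
  // GC-managed: each call creates a small native segment that only the garbage
  // collector can reclaim -- fine once per query, costly once per indexed vector.
  static MemorySegment autoManaged(byte[] vector) {
    return Arena.ofAuto().allocateFrom(JAVA_BYTE, vector);
  }

  // Explicit lifecycle: the native memory is released deterministically when the
  // confined arena closes, with no GC involvement.
  static void withExplicitLifecycle(byte[] vector) {
    try (Arena arena = Arena.ofConfined()) {
      MemorySegment segment = arena.allocateFrom(JAVA_BYTE, vector);
      // ... use `segment` for scoring here ...
    } // off-heap memory freed here
  }
}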

@msokolov
Contributor

OK I discovered the loss of recall was due to a silly bug. After fixing that, these are the results I'm seeing with the addition of a separate code path for dotProduct(byte[], byte[]):

candidate (reindex)

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  num_segments
0.979        7.794   7.795        1.000  500000   100      50       32        250         no    25766    497.73             3
0.979       13.193  13.194        1.000  500000   100      50       32        250         no    25766      0.00             3
0.979       13.487  13.489        1.000  500000   100      50       32        250         no    25766      0.00             3

baseline

0.979       24.282  24.280        1.000  500000   100      50       32        250         no    25766      0.00             3
0.979       25.850  25.848        1.000  500000   100      50       32        250         no    25766      0.00             3

baseline (reindex)

 0.986       16.265  16.265        1.000  500000   100      50       32        250         no    25702    441.32             3
 0.986       24.319  24.316        1.000  500000   100      50       32        250         no    25702      0.00             3
 0.986       24.319  24.316        1.000  500000   100      50       32        250         no    25702      0.00             3

candidate

 0.986       13.406  13.407        1.000  500000   100      50       32        250         no    25702      0.00             3
 0.986       13.088  13.089        1.000  500000   100      50       32        250         no    25702      0.00             3
 0.986       13.319  13.320        1.000  500000   100      50       32        250         no    25702      0.00             3

so now the situation is reversed and we see faster searching after indexing in the same JVM.

@msokolov
Contributor

what I did:

@@ -305,7 +306,36 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
 
   @Override
   public int dotProduct(byte[] a, byte[] b) {
-    return dotProduct(MemorySegment.ofArray(a), MemorySegment.ofArray(b));
+    return dotProductWTF(MemorySegment.ofArray(a), MemorySegment.ofArray(b));
+  }
+
+  static int dotProductWTF(MemorySegment a, MemorySegment b) {
+    assert a.byteSize() == b.byteSize();
+    int i = 0;
+    int res = 0;
+
+    // only vectorize if we'll at least enter the loop a single time, and we have at least 128-bit
+    // vectors (256-bit on intel to dodge performance landmines)
+    if (a.byteSize() >= 16 && PanamaVectorConstants.HAS_FAST_INTEGER_VECTORS) {
+      // compute vectorized dot product consistent with VPDPBUSD instruction
+      if (VECTOR_BITSIZE >= 512) {
+        i += BYTE_SPECIES.loopBound(a.byteSize());
+        res += dotProductBody512(a, b, i);
+      } else if (VECTOR_BITSIZE == 256) {
+        i += BYTE_SPECIES.loopBound(a.byteSize());
+        res += dotProductBody256(a, b, i);
+      } else {
+        // tricky: we don't have SPECIES_32, so we workaround with "overlapping read"
+        i += ByteVector.SPECIES_64.loopBound(a.byteSize() - ByteVector.SPECIES_64.length());
+        res += dotProductBody128(a, b, i);
+      }
+    }
+
+    // scalar tail
+    for (; i < a.byteSize(); i++) {
+      res += b.get(JAVA_BYTE, i) * a.get(JAVA_BYTE, i);
+    }
+    return res;
   }

Where dotProductWTF is just a clone of dotProduct

@msokolov
Contributor

BTW the above results were on ARM/Graviton 2. I also tried on an Intel laptop and got speedups, although not as much, and the weird faster-search-after-indexing behavior also persists there.

## candidate
 0.985        3.263   3.256        0.998  500000   100      50       32        250         no    560.67        891.80             0            0.00         0.000        0.000       HNSW
 0.985        4.255   4.242        0.997  500000   100      50       32        250         no      0.00      Infinity             0            0.00         0.000        0.000       HNSW
 0.985        4.084   4.071        0.997  500000   100      50       32        250         no      0.00      Infinity             0            0.00         0.000        0.000       HNSW
## baseline
 0.985        5.756   5.744        0.998  500000   100      50       32        250         no      0.00      Infinity             0            0.00         0.000        0.000       HNSW
 0.985        5.696   5.684        0.998  500000   100      50       32        250         no      0.00      Infinity             0            0.00         0.000        0.000       HNSW

@vigyasharma
Contributor

Wow, these are very impressive gains! Nice find @kaivalnp.

So the key change is Arena.ofAuto().allocateFrom(JAVA_BYTE, queryVector), which allocates an off-heap MemorySegment for the query vector? Do we have an intuition for why it creates such a dramatic (4x?!) speedup?

My understanding was that off-heap document vectors helped by avoiding a copy back into the heap, plus avoiding the cost of reallocation and copying if some of them got garbage collected. But doesn't this change add a copy, by copying the byte[] queryVector from the heap to the allocated off-heap segment? Also, since the query vector is only used during the lifetime of the query, I would've thought keeping it on heap should be okay?

...

Separately, I tried using the Arena.ofAuto().allocateFrom() construct in the on-heap case that is used during indexing and this made indexing incredibly slow. I guess it is because we force the allocation of many many small memory segments that have to be cleaned up by the garbage collector.

@msokolov: so indexing is faster if vectors are on-heap, but search is faster if vectors are off-heap? Or do you think it's mainly because of ofAuto(), which defers cleanup to the GC? Can you share the code path for indexing that you modified?

...

Where dotProductWTF is just a clone of dotProduct

I'm confused: if dotProductWTF and dotProduct are exactly identical, why did dotProductWTF fix the 'search after indexing' case?

...
Separately, this PR holds up in real-world benchmarks and looks good to merge. There are some interesting spin-off issues here that could make indexing faster.

vigyasharma previously approved these changes Jul 10, 2025
@msokolov
Contributor

My understanding was that off heap document vectors helped by avoiding a copy back into the heap, plus avoiding the cost of reallocation and copy if some of them got garbage collected. But doesn't this change add a copy, by copying the byte[] queryVector from heap to the allocated off-heap segment? Also, since the query vector is only used during the lifetime of the query, I would've thought keeping it on heap should be okay?

It is confusing to me too. I think to understand it we need to decompile and look at the instructions that are generated -- after hotspot does its work. Maybe we are bypassing memory barriers that get applied to on-heap arrays? I am really not sure.

I'm confused, if dotProductWTF and dotProduct are exactly identical, why did dotProductWTF fix the 'search after indexing' case?

The idea behind this was to create two separate code paths: one used during indexing (when both arrays are on-heap) and another used during search, when one array is on-heap and the other is off-heap (memory-mapped from disk). This seems to enable hotspot to optimize these two code paths separately.

There is yet another mystery here, which is: why, after adding this hotspot hackery, do we see even faster performance on the query path when it is preceded by an indexing workflow than we do when it is not (although it's still faster than the baseline).

@vigyasharma
Contributor

This seems to enable hotspot to separately optimize these two code paths.

Ah okay! That makes sense.

There is yet another mystery here, which is: why, after adding this hotspot hackery, do we see even faster performance on the query path when it is preceded by an indexing workflow than we do when it is not (although it's still faster than the baseline).

Could indexing have caused some document vectors to already be loaded in memory? Does it create any off-heap vector values?

@msokolov
Contributor

I'm not really comfortable pushing the PR as it is given that it makes searching slower in the benchmark where we reindex first, and I think we should understand the hotspot hack a little better before we push that, because it's really kind of gross and feels like voodoo to me. I'm really not sure I see at all why forking this method makes a difference -- it doesn't really do very much work in its body. OTOH maybe hotspot keeps track of some stack frame context so it is actually influencing the rewrite of methods called by this (the dotProductBody methods)??

@msokolov
Contributor

I did dump the generated machine instructions, and while it's hard to really understand what is going on in there, there is clearly a major difference between the code being emitted for dotProductBody256 and dotProductBody256 @ 10 when the hack is in place vs. not. So I think the theory that hotspot is optimizing differently in these two cases seems correct.

@msokolov
Contributor

msokolov commented Jul 11, 2025

This got me wondering -- what is hotspot doing with the code we have on main? By default, knnPerfTest.py runs 1000 iterations, plus we have a warmup phase that runs the same number of iters as the "main" phase (so a total of 2000 by default).

On Intel CPU with preferredBitSize=256

recall  latency(ms)  netCPU  avgCpuCount    nDoc  nIters
 0.982        5.500   5.100        0.927  500000   10
 0.991        3.510   3.450        0.983  500000   100
 0.979        5.851   5.834        0.997  500000   1000
 0.975        5.891   5.881        0.998  500000   2500

On ARM with preferredBitSize=128

recall  latency(ms)  netCPU  avgCpuCount    nDoc  nIters
 0.986        9.300   9.400        1.011  500000     10
 0.989        8.010   8.010        1.000  500000     100
 0.986       25.860  25.858        1.000  500000     1000
 0.983        7.628   7.628        1.000  500000     2500

So I think the bad performance we measured may just be a hotspot glitch around 1-2000 iterations, and in fact this PR is making things worse.

Now I tested this PR + the weird hack (on ARM)

Results:
recall  latency(ms)  netCPU  avgCpuCount    nDoc  nIters
 0.986       15.900  15.900        1.000  500000     10
 0.989       14.270  14.280        1.001  500000     100
 0.986       13.219  13.220        1.000  500000     1000
 0.983       13.508  13.508        1.000  500000     2500

and finally this PR (no weird hotspot hack):

recall  latency(ms)  netCPU  avgCpuCount    nDoc  nIters
 0.986       16.000  16.000        1.000  500000   10
 0.989       14.640  14.650        1.001  500000   100
 0.986       13.381  13.382        1.000  500000   1000
 0.983       13.794  13.794        1.000  500000   2500

Net-net, I don't think we should merge this.

@kaivalnp
Contributor Author

Thanks for all the deep-dive here @msokolov!

I tried running a set of benchmarks (Cohere, 768d, byte vectors) with niter=100,500,1000,5000,10000,50000 (only searching an existing index, no reindex) to test how the compiler optimizes over longer runs.

main:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  num_segments
 0.968        2.790   2.710        0.971  100000   100      50       64        250         no             1
 0.964        2.692   2.674        0.993  100000   100      50       64        250         no             1
 0.963        2.685   2.677        0.997  100000   100      50       64        250         no             1
 0.963        2.599   2.597        0.999  100000   100      50       64        250         no             1
 0.962        2.552   2.550        0.999  100000   100      50       64        250         no             1
 0.962        2.590   2.589        1.000  100000   100      50       64        250         no             1

This PR:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  num_segments
 0.968        2.050   1.960        0.956  100000   100      50       64        250         no             1
 0.964        1.868   1.850        0.990  100000   100      50       64        250         no             1
 0.963        1.894   1.885        0.995  100000   100      50       64        250         no             1
 0.963        1.770   1.769        0.999  100000   100      50       64        250         no             1
 0.962        1.762   1.761        1.000  100000   100      50       64        250         no             1
 0.962        1.742   1.741        1.000  100000   100      50       64        250         no             1

There's a possibility that the candidate (i.e. this PR) has an inherent benefit due to being run later in time (so vectors are more likely to be loaded into RAM) -- so I ran baseline (i.e. main) immediately afterwards:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  num_segments
 0.968        2.850   2.770        0.972  100000   100      50       64        250         no             1
 0.964        2.712   2.694        0.993  100000   100      50       64        250         no             1
 0.963        2.640   2.632        0.997  100000   100      50       64        250         no             1
 0.963        2.588   2.586        0.999  100000   100      50       64        250         no             1
 0.962        2.561   2.559        0.999  100000   100      50       64        250         no             1
 0.962        2.550   2.549        1.000  100000   100      50       64        250         no             1

For some reason, the changes in this PR are still better on my machine :/

I think we should understand the hotspot hack a little better before we push that, because it's really kind of gross and feels like voodoo to me

+1, not looking to merge this until we find out why we're seeing a difference in performance (it seems counterintuitive, since we're doing more work but seeing better latency!) -- when we (1) create a fresh index, (2) reindex, (3) search an existing index, (4) use different parameters, (5) across different machines?

Performance seems tied to the HotSpot compiler -- is there a way to make optimizations more deterministic? (or at least, explicit)

On a related note, benchmark runs have been fluctuating so wildly -- I wonder if we should set larger defaults to get reliable numbers.

@msokolov
Contributor

I'm curious if you have a Graviton, and which generation? If you run without -quiet you can capture the line where it says preferredBitSize=xxx, and that should also be indicative I think.

@kaivalnp
Contributor Author

I'm curious if you have a Graviton, and which generation?

I have a Graviton3 (m7g.16xlarge), lscpu output here

where it says preferredBitSize=xxx

Here's what the PanamaVectorizationProvider says:

Jul 11, 2025 9:14:01 PM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider logIncubatorSetup
INFO: Java vector incubator API enabled; uses preferredBitSize=256; FMA enabled

@vigyasharma
Contributor

Thanks for the analysis @msokolov , agree with not merging until we understand what's happening.

Now I tested this PR + the weird hack (on ARM)
...
and finally this PR (no weird hotspot hack):

Latency looks similar in both runs. Did the hotspot hack stop working?

@vigyasharma self-requested a review July 11, 2025 22:03
@vigyasharma dismissed their stale review July 11, 2025 22:04

Let's figure out why the hotspot hack works and get it working deterministically before merging this.

@vigyasharma
Contributor

Another question: is the weird "search after indexing" regression specific only to this PR? I believe our current hypothesis is that hotspot sees document vectors on heap during indexing and does something weird (code optimization, memory management, ...) which impacts search.

Is this in any way related to the query vector being off-heap (and why?), or is it an existing regression? Just thinking out loud to see what others have observed; I'll also run the benchmark myself to check.

@kaivalnp
Contributor Author

kaivalnp commented Jul 11, 2025

is the weird "search after indexing" regression specific only to this PR?

It's slightly different here -- I tried the following on main

When a fresh index is created (run 1):

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.962        2.348   2.345        0.999  100000   100      50       64        250         no     11.56       8651.27           50.78             1           77.48       292.969      292.969       HNSW

Then a run without -reindex (run 2, same index as run 1):

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.962        2.544   2.542        0.999  100000   100      50       64        250         no      0.00      Infinity            0.12             1           77.48       292.969      292.969       HNSW

Then a run with -reindex (run 3, keeping index from run 1 around):

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.963        1.801   1.800        1.000  100000   100      50       64        250         no     11.51       8687.34           50.71             1           77.46       292.969      292.969       HNSW

And finally a run without -reindex (run 4, same index as run 3):

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.963        2.491   2.490        0.999  100000   100      50       64        250         no      0.00      Infinity            0.12             1           77.46       292.969      292.969       HNSW

This behavior of seeing a performance improvement after re-indexing (but not in a fresh index) has been consistent for me (see this previous result)

Edit: Sorry, I pasted numbers with quantization earlier (I was running something else for #14863); now updated to byte vectors + no quantization. The trend is pretty consistent though!

@ChrisHegarty
Contributor

I'm going to try to help determine what's going on here. There are a number of subtleties around these off-heap scorers.

@kaivalnp
Contributor Author

FYI I was trying to see the compiled code for the dot product function using -XX:CompileCommand=print,*PanamaVectorUtilSupport.dotProductBody256 and saw that performance improved after adding that flag (no reindexing in any run)

This PR without the flag:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.962        1.711   1.711        1.000  100000   100      50       64        250         no      0.00      Infinity            0.12             1           77.44       292.969      292.969       HNSW

This PR with the flag:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.962        1.171   1.172        1.001  100000   100      50       64        250         no      0.00      Infinity            0.12             1           77.44       292.969      292.969       HNSW

main without the flag:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.962        2.503   2.502        0.999  100000   100      50       64        250         no      0.00      Infinity            0.12             1           77.44       292.969      292.969       HNSW

main with the flag:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.962        1.194   1.192        0.999  100000   100      50       64        250         no      0.00      Infinity            0.12             1           77.44       292.969      292.969       HNSW

Perhaps the flag is forcing the function to be fully optimized by the JVM?
In that case, these numbers should represent the latency of a long-running application (where the function has been fully optimized by the compiler)?

If so, I don't see a lot of value in merging this PR (there isn't a large improvement) -- but I'd love to make our benchmarks more robust and report something representative of a long-running application! (these non-deterministic compiler optimizations are too trappy)

@kaivalnp
Contributor Author

To look into compiler inlining, I added -XX:CompileCommand=PrintInlining,*PanamaVectorUtilSupport.* to the startup arguments and saw:

@ 113   jdk.incubator.vector.AbstractSpecies::loopBound (9 bytes)   force inline by annotation
  @ 5   jdk.incubator.vector.VectorIntrinsics::roundDown (23 bytes)   force inline by annotation
@ 125   org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport::dotProductBody256 (110 bytes)   failed to inline: callee is too large

..in a run without -reindex

However if I performed a run with -reindex, I saw lines like:

@ 113   jdk.incubator.vector.AbstractSpecies::loopBound (9 bytes)   force inline by annotation
  @ 5   jdk.incubator.vector.VectorIntrinsics::roundDown (23 bytes)   force inline by annotation
@ 125   org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport::dotProductBody256 (110 bytes)   inline (hot)

..which means it is being inlined if the dot product function is hot enough


So, I tried to turn on explicit inlining using -XX:CompileCommand=inline,*PanamaVectorUtilSupport.* in some runs:

recall  latency(ms)  inlining  reindex
 0.962        2.540        no       no
 0.962        1.210       yes       no
 0.962        1.829        no      yes
 0.962        1.808       yes      yes

I was able to squeeze out better performance in the run without -reindex, but not if we create an index first; I'll try to dig deeper into why this is happening.
