Fix off-heap byte vector scoring at query time #14874
Conversation
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.
Here's another set of benchmarks on an
This PR:
CPU details:
and
wow, interesting, nice find! If you have time, it would also be great to see some knnPerTest.py results comparing before/after this change, in addition to the microbenchmarking, just to make extra sure
One possible explanation (from the memory profiler) for dot products being expensive is that the conversion from 8 bytes to integers is expensive:
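The profiler snippet itself isn't reproduced above. As a generic illustration (not Lucene's actual implementation), the scalar form of a signed-byte dot product has to widen every byte to an int before the multiply-accumulate, and vectorized variants similarly have to widen 8-bit lanes to wider lanes:

```java
// Generic scalar sketch of a signed-byte dot product; not Lucene's code.
final class ScalarByteDot {
  static int dotProduct(byte[] a, byte[] b) {
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      // each byte operand is sign-extended (widened) to an int before the multiply
      total += a[i] * b[i];
    }
    return total;
  }
}
```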
I would be curious to see if the results hold up for 768d vectors. Maybe 300 not being divisible by 8 is problematic, although I don't believe it. But ... the results look so weird, I just want to try some other random thing and see if it holds up?
Thanks @msokolov! I quantized the Cohere 768d vectors by:
i.e. basically replicating this snippet.

But something really strange is happening! I first ran a set of benchmarks using
This PR (run 2)
This felt a bit strange because of a higher latency at query time (conflicting with JMH results) + lower indexing time in this PR -- so I ran a follow-up without
This PR (run 4)
The performance in

Meanwhile, it'd be great if you could reproduce the benchmarks like I did :)
Okay I ran things slightly differently for 300d vectors. All runs are without
This PR (run 3, no pre-existing index)
This PR (run 4, use same index as run 3)
The results seem consistent with what we see above: the changes in this PR cause a regression if used immediately after indexing, but a speedup if used on an existing index?

Edit: The indexing time had not improved if we started fresh -- so I kept the old index (from run 4) around and did another benchmark (run 5) on this PR using
..we see faster indexing times! (but the slow query-time issue still persists)

I ran a similar setup on
..and both query / indexing times have improved! Maybe it's something related to OS caching; I'm still confused :)
I was able to reproduce the weird behavior where the first run produces higher latency than subsequent runs. I also noticed netCPU and avgCpuCount were both negative in that first run?? I'm suspecting a bug in KnnGraphTester that may be related to having run the indexing (or maybe having computed the exact knn results) in the same run?
Yeah I noticed this too! I think it was because of computing exact KNN results; I didn't see the same issue on re-indexing..
My weird results (w/ 768d Cohere vectors). In sum, it looks to me as if we have some noise immediately after reindexing. Either this is a measurement artifact in run 1.

baseline: first line does reindex -- and then latency and netCPU increase? This has nothing to do with this PR.

candidate

run 2: candidate (reindex each time)

no reindex

baseline (no reindex)
I tried another test:
The idea was that if there was some kind of OS paging thing, it would affect this in the same way as the normal -reindex + -search, but it did not: I saw the latency improvement (around 13 ms). So this makes me think something happens in the JVM that is doing the indexing that predisposes it to have worse search performance? Some kind of hotspot something or other? Next I want to try looking at a profiler, but it is a bit tricky since the index + search profile is going to be muddied by the indexing part.
I did some deep-diving with a profiler and I realized that when indexing, we call these dotProduct methods in a different context in which all of the vectors are on-heap. I'm surmising that this induces the hotspot compiler to make some choices that end up being bad later when half of the vectors (the document ones) are off-heap; ie while searching. To test this theory I cloned the

Separately, this whole line of attack made me wonder if maybe we ought to be storing our vectors differently while indexing ... instead of a

I'm also still somewhat confused about the status of the whole Panama vectorization code. I guess it is still incubating, so we have decided not to require it, and our code paths are conditional on whether it is enabled or not. I think this conditionalization itself may be problematic from a performance perspective, although this is more a suspicion than anything, but I wonder if it would be possible to just require the incubating API to be enabled at this point? It is in its umpteenth incubation cycle; seems like it should be safe.
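For context on the incubating API mentioned above: it lives in the `jdk.incubator.vector` module and has to be enabled explicitly. A standalone sketch (not Lucene's code) that reports the SIMD width HotSpot prefers on the current CPU:

```java
// Run with: java --add-modules jdk.incubator.vector SpeciesCheck.java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

public class SpeciesCheck {
  public static void main(String[] args) {
    VectorSpecies<Byte> species = ByteVector.SPECIES_PREFERRED;
    // e.g. 256 bits on many x86-64 CPUs and 128 bits on many ARM/Graviton CPUs
    System.out.println("preferred byte vector width = " + species.vectorBitSize() + " bits");
  }
}
```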
Separately, I tried using the
OK I discovered the loss of recall was due to a silly bug. After fixing that, these are the results I'm seeing with the addition of a separate code path for

candidate (reindex)
baseline
baseline (reindex)
candidate
so now the situation is reversed and we see faster searching after indexing in the same JVM.
what I did:
Where
BTW the above results were on ARM/Graviton 2. I also tried on an Intel laptop and got speedups, although not as much, and the weird faster search after indexing also persists here
Wow, these are very impressive gains! Nice find @kaivalnp. So the key change is in

My understanding was that off-heap document vectors helped by avoiding a copy back into the heap, plus avoiding the cost of reallocation and copy if some of them got garbage collected. But doesn't this change add a copy, by copying the ...
@msokolov: so indexing is faster if vectors are "on heap", but search is faster if vectors are "off heap".. Or do you think it's mainly because of the ...
I'm confused, if ...
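One JDK-level data point on the copy question (independent of whatever this PR does internally): a heap segment obtained from `MemorySegment.ofArray` is a view over the array, not a copy, so wrapping the query vector that way does not by itself duplicate the data. A minimal sketch:

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class OfArrayView {
  public static void main(String[] args) {
    byte[] query = {1, 2, 3};
    MemorySegment seg = MemorySegment.ofArray(query); // wraps the array, no data copy
    query[0] = 42;
    System.out.println(seg.get(ValueLayout.JAVA_BYTE, 0)); // prints 42: the segment sees the write
  }
}
```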
It is confusing to me too. I think to understand it we need to decompile and look at the instructions that are generated -- after hotspot does its work. Maybe we are bypassing memory barriers that get applied to on-heap arrays? I am really not sure.
The idea behind this was to create two separate code paths: one used during indexing (when both arrays are on-heap) and another one used during search, when one array is on-heap and the others are off-heap (memory-mapped from disk). This seems to enable hotspot to separately optimize these two code paths.

There is yet another mystery here, which is: why, after adding this hotspot hackery, do we see even faster performance on the query path when it is preceded by an indexing workflow than we do when it is not (although it's still faster than the baseline)?
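A minimal sketch of what such a fork could look like (hypothetical names, scalar loops only; not the PR's actual code) -- the point being that HotSpot profiles and compiles the two methods independently:

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

final class ForkedDotProduct {
  // Indexing path: both vectors are plain heap arrays.
  static int dotProductOnHeap(byte[] query, byte[] doc) {
    int sum = 0;
    for (int i = 0; i < query.length; i++) {
      sum += query[i] * doc[i];
    }
    return sum;
  }

  // Search path: the query is on-heap, the doc vector is read from a memory-mapped segment.
  static int dotProductOffHeap(byte[] query, MemorySegment doc, long docOffset) {
    int sum = 0;
    for (int i = 0; i < query.length; i++) {
      sum += query[i] * doc.get(ValueLayout.JAVA_BYTE, docOffset + i);
    }
    return sum;
  }
}
```

Lucene's real scorers are more involved (and vectorized when Panama is enabled); this only illustrates the call-site separation being discussed.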
Ah okay! That makes sense.
Could indexing have caused some document vectors to already be loaded in memory? Does it create any off-heap vector values?
I'm not really comfortable pushing the PR as it is given that it makes searching slower in the benchmark where we reindex first, and I think we should understand the hotspot hack a little better before we push that, because it's really kind of gross and feels like voodoo to me. I'm really not sure I see at all why forking this method makes a difference -- it doesn't really do very much work in its body. OTOH maybe hotspot keeps track of some stack frame context so it is actually influencing the rewrite of methods called by this (the dotProductBody methods)??
I did dump the generated machine instructions and while it's hard to really understand what is going on in there, there is clearly a major difference between the code being emitted for
This got me wondering -- what is hotspot doing with the code we have on main? By default,

On Intel CPU with preferredBitSize=256
On ARM with preferredBitSize=128
So I think the bad performance we measured may just be a hotspot glitch around 1-2000 iterations, and in fact this PR is making things worse. Now I tested this PR + the weird hack (on ARM)
and finally this PR (no weird hotspot hack):
net-net I don't think we should merge this
Thanks for all the deep-dive here @msokolov! I tried running a set of benchmarks (Cohere, 768d, byte vectors) on
This PR:
There's a possibility that the candidate (i.e. this PR) has an inherent benefit due to being run later in time (so vectors are more likely to be loaded into RAM) -- so I ran baseline (i.e.
For some reason, the changes in this PR are still better on my machine :/
+1, not looking to merge this until we find out why we're seeing a difference in performance (seems counterintuitive, as we're doing more work but seeing better latency!) when we (1) create a fresh index, (2) reindex, (3) search an existing index, (4) use different parameters, (5) run across different machines. Performance seems tied to the HotSpot compiler -- is there a way to make its optimizations more deterministic (or at least explicit)?

On a related note, benchmark runs have been fluctuating so wildly -- I wonder if we should set larger defaults for reliable numbers..
I'm curious if you have a Graviton, and which generation? If you run without
I have a Graviton3 (
Here's what the
Thanks for the analysis @msokolov, agree with not merging until we understand what's happening.
Latency looks similar in both runs. Did the hotspot hack stop working?
Let's figure out why the hotspot hack works and get it working deterministically before merging this.
Another question: is the weird "search after indexing" regression specific only to this PR? I believe our current hypothesis is that hotspot sees document vectors on heap during indexing and does something weird (code optimization, memory management, ...) which impacts search. Is this in any way related to the query vector being off-heap (and why?), or is it an existing regression? Just thinking out loud to see what others have observed; I'll also run the benchmark myself to check this.
It's slightly different here -- I tried the following on

When a fresh index is created (run 1):
Then a run without
Then a run with
And finally a run without
This behavior of seeing a performance improvement after re-indexing (but not in a fresh index) has been consistent for me (see this previous result).

Edit: Sorry, I pasted numbers with quantization earlier (was running something else for #14863); now updated to byte vectors + no quantization.. The trend is pretty consistent though!
I'm going to try to help determine what's going on here. There are a number of subtleties around these off-heap scorers.
FYI I was trying to see the compiled code for the dot product function using

This PR without the flag:
This PR with the flag:
Perhaps the flag is forcing the function to be optimized by the JVM? If so, I don't see a lot of value in merging this PR (there isn't a large improvement) -- but I'd love to make our benchmarks more robust and report something representative of a long-running application! (these non-deterministic compiler optimizations are too trappy)
To look into compiler inlining, I added
..in a run without

However, if I performed a run with
..which means it is being inlined if the dot product function is hot enough.

So, I tried to turn on explicit inlining using
I was able to squeeze performance in the run without
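For reference, these are the standard HotSpot diagnostic switches for this kind of investigation (they may or may not be the exact flags used in the runs above; the method pattern is a placeholder):

```
# log inlining decisions made by the JIT
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining ...

# ask HotSpot to inline a specific method
java -XX:CompileCommand=inline,*VectorUtil::dotProduct ...

# print the generated assembly for a method (requires the hsdis disassembler)
java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*VectorUtil::dotProduct ...
```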
Description
In #14863, I noticed a regression in computing vector scores if the query vector was present on heap (i.e. allocated using `MemorySegment.ofArray`) but the doc was on RAM.

We do have a JMH benchmark for testing the off-heap scoring performance -- but it only tests the indexing case (using an `UpdateableRandomVectorScorer`), when both vectors are on RAM (mmap-ed input). I also added benchmark functions to demonstrate the searching case (using a `RandomVectorScorer`).

`main`:
:This PR:
We see ~2x improvement in vector scoring time! (and the off-heap computation is actually slower than on-heap on `main`?)

I'm not sure if this is specific to my machine, or something in general..
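The added benchmark functions themselves aren't reproduced here; as a rough, hypothetical JMH sketch of the two cases being compared (all class and method names are invented, and a plain native segment stands in for an mmap-ed IndexInput):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.Random;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class ByteScorerBench {
  @Param({"768"})
  int dims;

  byte[] query;
  byte[] docOnHeap;
  MemorySegment docOffHeap;

  @Setup
  public void setup() {
    Random random = new Random(42);
    query = new byte[dims];
    docOnHeap = new byte[dims];
    random.nextBytes(query);
    random.nextBytes(docOnHeap);
    // native segment standing in for a memory-mapped doc vector
    docOffHeap = Arena.ofAuto().allocate(dims);
    MemorySegment.copy(docOnHeap, 0, docOffHeap, ValueLayout.JAVA_BYTE, 0, dims);
  }

  @Benchmark
  public int indexingCase() { // both vectors on-heap
    int sum = 0;
    for (int i = 0; i < dims; i++) {
      sum += query[i] * docOnHeap[i];
    }
    return sum;
  }

  @Benchmark
  public int searchingCase() { // on-heap query vs. off-heap doc vector
    int sum = 0;
    for (int i = 0; i < dims; i++) {
      sum += query[i] * docOffHeap.get(ValueLayout.JAVA_BYTE, i);
    }
    return sum;
  }
}
```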
Pasting CPU details in case it is relevant:
and `java --version`: