Feature/scalar quantized off heap scoring #13497


Draft · wants to merge 8 commits into base: main

Conversation

benwtrent (Member)

This adds off-heap scoring for our scalar quantization.

Opening as DRAFT as I still haven't fully tested out the performance characteristics. Opening early for discussion.

benwtrent added this to the 9.12.0 milestone on Jun 17, 2024
@benwtrent (Member, Author)

Half-byte is showing up as measurably slower with this change.

Candidate:

0.909	 0.54
0.911	 0.58
0.919	 0.88

Baseline:

0.909	 0.30
0.911	 0.33
0.919	 0.47

Full-byte is slightly faster.

Candidate:

0.962	 0.41
0.966	 0.43
0.978	 0.66

Baseline:

0.962	 0.47
0.966	 0.48
0.978	 0.73
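One plausible contributor to the half-byte slowdown is the unpacking cost: int4 quantization stores two 4-bit values per byte, so a scorer must split nibbles before every multiply, work that full-byte scoring avoids. A minimal sketch (illustrative only, not Lucene's actual scorer code) of a packed int4 dot product:

```java
// Hypothetical sketch of "half-byte" (int4) scoring: two 4-bit values are
// packed per byte, so each byte read requires two mask/shift operations
// before the multiplies can happen.
public class Int4DotProduct {
    // Dot product over vectors packed two 4-bit values per byte,
    // low nibble first.
    static int dotProductPacked(byte[] a, byte[] b) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            int aLo = a[i] & 0x0F, aHi = (a[i] >> 4) & 0x0F;
            int bLo = b[i] & 0x0F, bHi = (b[i] >> 4) & 0x0F;
            sum += aLo * bLo + aHi * bHi;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Packed values (1, 2, 3, 4) in both vectors.
        byte[] a = {0x21, 0x43};
        byte[] b = {0x21, 0x43};
        System.out.println(dotProductPacked(a, b)); // 1 + 4 + 9 + 16 = 30
    }
}
```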

@msokolov (Contributor)

are you reporting indexing times? query times?

@benwtrent (Member, Author)

> are you reporting indexing times? query times?

Query times, single segment, 10k docs of 1024 dims.

@benwtrent (Member, Author)

OK, I double-checked, and indeed half-byte is way slower when reading directly from memory segments instead of reading on heap.
memsegment_vs_baseline.zip

The flamegraphs are wildly different. Much more time is being spent reading from the memory segment and then comparing the vectors.

candidate (this PR): [flamegraph image]

baseline: [flamegraph image]
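The difference the flamegraphs show can be caricatured as two access patterns: the baseline bulk-copies each vector into a `byte[]` and scores in a tight on-heap loop, while the candidate reads elements straight from the segment. A hedged sketch using JDK 22's `java.lang.foreign` (names and loops illustrative, not Lucene code):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Illustrative contrast of the two access patterns, not Lucene's scorers.
public class AccessPatterns {
    // Baseline-style: one bulk copy off heap, then a tight on-heap loop.
    static long sumViaCopy(MemorySegment seg) {
        byte[] buf = seg.toArray(ValueLayout.JAVA_BYTE);
        long sum = 0;
        for (byte b : buf) sum += b;
        return sum;
    }

    // Candidate-style: per-element reads directly from the segment.
    static long sumDirect(MemorySegment seg) {
        long sum = 0;
        for (long i = 0; i < seg.byteSize(); i++) {
            sum += seg.get(ValueLayout.JAVA_BYTE, i);
        }
        return sum;
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(4);
            for (long i = 0; i < 4; i++) {
                seg.set(ValueLayout.JAVA_BYTE, i, (byte) (i + 1));
            }
            System.out.println(sumViaCopy(seg)); // 10
            System.out.println(sumDirect(seg));  // 10
        }
    }
}
```

Both compute the same result; the question the flamegraphs raise is how well the JIT optimizes the per-element segment reads on a given JDK.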

@benwtrent (Member, Author)

@ChrisHegarty have you seen a significant performance regression on MemorySegments & JDK22?

Doing some testing, I updated my performance testing for this PR to use JDK22 and now it is WAY slower, more than 2x slower, even for full-byte.

For int7, this branch is marginally faster (20%) with JDK21, but basically 2x slower on JDK22.

I wonder if our off-heap scoring for byte vectors also suffers on JDK22. The quantized scorer for int7 is just using those same methods.

@benwtrent (Member, Author)

To verify it wasn't some weird artifact in my code, I changed it slightly so that my execution path always reads the vectors on heap and then wraps them in a MemorySegment. Now JDK22 performs the same as JDK21 and the current baseline.

It's weird to me that reading from a memory segment into ByteVector objects would be 2x slower on JDK22 than on JDK21.

Regardless, it's already much slower for the int4 case on both JDK 21 & 22.
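The workaround described above can be sketched roughly as follows (names are illustrative, not the PR's actual code); `MemorySegment.ofArray` is the finalized `java.lang.foreign` API from JDK 22:

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Illustrative sketch of the workaround: read the quantized vector fully on
// heap first, then wrap the array in a heap-backed MemorySegment so
// downstream code can keep a segment-based API while the actual element
// reads hit a plain byte[] rather than mapped memory.
public class OnHeapWrap {
    static MemorySegment readOnHeapThenWrap(byte[] rawVectorBytes) {
        // In the real code path the bytes would come from an index read;
        // here we just take them as an argument.
        return MemorySegment.ofArray(rawVectorBytes);
    }

    public static void main(String[] args) {
        byte[] vec = {1, 2, 3, 4};
        MemorySegment seg = readOnHeapThenWrap(vec);
        // Element reads now go through the heap array, not mapped memory.
        System.out.println(seg.get(ValueLayout.JAVA_BYTE, 2)); // 3
    }
}
```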

@ChrisHegarty (Contributor)

> Regardless, it's already much slower for the int4 case on both JDK 21 & 22.

@benwtrent I was not aware, lemme take a look.

@kaivalnp (Contributor)

+1 to this feature

I work on Amazon product search, and in one of our searchers we see a high proportion of CPU cycles within HNSW search spent copying quantized vectors to heap:

[profiler screenshot]

Perhaps off-heap scoring could help us!

@benwtrent (Member, Author)

@kaivalnp feel free to take my initial work here and dig in deeper.

I haven't benchmarked it recently on later JVMs to figure out why I was experiencing such a weird slowdown when going off heap :/

@kaivalnp (Contributor)

Thanks @benwtrent! I opened #14863
