
GroupVarInt Encoding Implementation for HNSW Graphs #14932


Open · wants to merge 1 commit into main

Conversation


@aylonsk aylonsk commented Jul 10, 2025

Description

For HNSW graphs, the alternate encoding I implemented is GroupVarInt, which in theory should be cheaper in both space and runtime than the current VarInt encoding. Its advantages are that it allocates the space for a group of four integers up front, and that it uses all 8 bits of each byte for data instead of the 7 used by VarInt (which reserves one continuation bit per byte). Its drawbacks are that it can only encode integers of at most 32 bits, and that it spends an extra leading byte per group to record the byte length of each value. However, since we delta-encode the neighbor lists before writing them, the values never exceed 32 bits, so the first limitation does not matter here.
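For context, here is a rough sketch (not the code in this PR) of delta + group-varint encoding of a sorted neighbor list. The class and method names are made up, and it skips the tail case where the list length is not a multiple of four:

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch only; the actual implementation in the PR differs.
public class GroupVarIntSketch {

  /** Encodes four non-negative ints: one selector byte holding the byte length
   *  (1-4) of each value in 2-bit slots, then the values themselves,
   *  little-endian, using only as many bytes as each value needs. */
  static void encodeGroup(int[] values, int offset, ByteArrayOutputStream out) {
    int selector = 0;
    int[] lengths = new int[4];
    for (int i = 0; i < 4; i++) {
      int v = values[offset + i];
      lengths[i] = Math.max(1, (32 - Integer.numberOfLeadingZeros(v) + 7) / 8);
      selector |= (lengths[i] - 1) << (i * 2);
    }
    out.write(selector);
    for (int i = 0; i < 4; i++) {
      int v = values[offset + i];
      for (int j = 0; j < lengths[i]; j++) {
        out.write((v >>> (8 * j)) & 0xFF);
      }
    }
  }

  /** Delta-encodes a sorted neighbor list, then writes it in groups of four.
   *  Assumes the list length is a multiple of four for brevity. */
  static byte[] encodeNeighbors(int[] sortedNeighbors) {
    int[] deltas = new int[sortedNeighbors.length];
    int prev = 0;
    for (int i = 0; i < sortedNeighbors.length; i++) {
      deltas[i] = sortedNeighbors[i] - prev; // gaps stay small, so few bytes each
      prev = sortedNeighbors[i];
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int i = 0; i < deltas.length; i += 4) {
      encodeGroup(deltas, i, out);
    }
    return out.toByteArray();
  }
}
```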


This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@benwtrent
Member

Hi @aylonsk ! Thank you for digging into this issue. I am sure you are still working on it, but I had some feedback:

  • It would be interesting to get statistics on the resulting index size changes and performance changes (indexing & search). luceneutil is the preferred tool for this.

  • As with most Lucene formats, changes like this need to be backwards compatible. Readers are loaded via their format names, so users might have indices written under the Lucene99Hnsw format name that do not have group-varint applied, and those could not be read by your change here. There are a couple of options to handle this:

    • Add versioning to the format (a rough sketch of this option follows below)
    • Create a new format (Lucene103Hnsw...) and move Lucene99Hnsw... to the bwc formats package for readers (there are many example PRs in the past doing this).
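For illustration, a rough sketch of option 1 (versioning the format via CodecUtil's index-header methods) might look like the following; CODEC_NAME and the version constants are hypothetical, not taken from this PR:

```java
import java.io.IOException;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.DataOutput;
import org.apache.lucene.store.IndexInput;

// Hypothetical sketch of versioning the HNSW format; constant names are illustrative.
final class HnswFormatVersioningSketch {
  static final String CODEC_NAME = "Lucene99HnswVectorsFormatMeta"; // placeholder name
  static final int VERSION_START = 0;         // original plain-vint encoding
  static final int VERSION_GROUP_VARINT = 1;  // neighbor lists written with group-varint
  static final int VERSION_CURRENT = VERSION_GROUP_VARINT;

  // Writer: stamp the current version into the file header.
  static void writeHeader(DataOutput out, byte[] segmentId, String segmentSuffix) throws IOException {
    CodecUtil.writeIndexHeader(out, CODEC_NAME, VERSION_CURRENT, segmentId, segmentSuffix);
  }

  // Reader: accept any version in [VERSION_START, VERSION_CURRENT] and return it.
  static int readHeader(IndexInput in, byte[] segmentId, String segmentSuffix) throws IOException {
    return CodecUtil.checkIndexHeader(in, CODEC_NAME, VERSION_START, VERSION_CURRENT, segmentId, segmentSuffix);
  }
}
```

The reader can then branch on the returned version to choose between the legacy decoder and the group-varint decoder, which keeps existing Lucene99Hnsw indices readable without introducing a new format name.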

Handling the format change can be complicated. So, my first step would be to justify the change with performance metrics. Then do all the complicated format stuff.

Good luck!

@aylonsk
Author

aylonsk commented Jul 10, 2025

Thanks for your response! My apologies, I forgot to post my results from LuceneUtil.

Because I noticed variance between runs, I tested each set of hyperparameters 10 times and took the median for latency, netCPU, and avgCpuCount. As a result, my results aren't in the standard table format.

I ran 12 comparison tests in total, each with a different combination of hyperparameters. The following were held constant across all tests: topK=100, fanout=50, beamWidth=250, numSegments=1.

Here are some specific tests:

BENCHMARKS (10 runs per test):

  1. Base HPs: nDocs=500,000, maxConn=64, quantized=no, numSegments=1

Baseline:
Recall: 0.832
Latency (Median): 0.73 ms
NetCPU (Median): 0.708 ms
AvgCpuCount (Median): 0.973
Index Size: 220.55MB
Vec Disk/Vec RAM: 190.735MB

Candidate:
Recall: 0.835
Latency (Median): 0.7 ms
NetCPU (Median): 0.677 ms
AvgCpuCount (Median): 0.966
Index Size: 220.12MB
Vec Disk/Vec RAM: 190.735MB

Latency Improvement: ~4.11% speedup

  2. nDocs=500,000, maxConn=32, quantized=no, numSegments=1

Baseline:
Recall: 0.834
Latency (Median): 0.722 ms
NetCPU (Median): 0.701 ms
AvgCpuCount (Median): 0.966
Index Size: 220.19MB
Vec Disk/Vec RAM: 190.735MB

Candidate:
Recall: 0.83
Latency (Median): 0.691 ms
NetCPU (Median): 0.665 ms
AvgCpuCount (Median): 0.96
Index Size: 219.67MB
Vec Disk/Vec RAM: 190.735MB

Latency Improvement: ~4.3% speedup

  3. nDocs=500,000, maxConn=32, quantized=7bits, numSegments=1

Baseline:
Recall: 0.671
Latency (Median): 1.2935 ms
NetCPU (Median): 1.2635 ms
AvgCpuCount (Median): 0.976
Index Size: 255.74MB
Vec Disk: 240.326MB
Vec RAM: 49.591MB

Candidate:
Recall: 0.696
Latency (Median): 1.2525 ms
NetCPU (Median): 1.192 ms
AvgCpuCount (Median): 0.974
Index Size: 259.34MB
Vec Disk: 240.326MB
Vec RAM: 49.591MB

Latency Improvement: ~3.17% speedup

  4. nDocs=2,000,000, maxConn=32, quantized=7bits, numSegments=1

Baseline:
Recall: 0.74
Latency (Median): 2.6675 ms
NetCPU (Median): 2.545 ms
AvgCpuCount (Median): 0.969
Index Size: 1049.52MB
Vec Disk: 961.30MB
Vec RAM: 198.364MB

Candidate:
Recall: 0.717
Latency (Median): 2.521 ms
NetCPU (Median): 2.398 ms
AvgCpuCount (Median): 0.98
Index Size: 1043.27MB
Vec Disk: 961.304MB
Vec RAM: 198.364MB

Latency Improvement: 5.49% speedup

  5. nDocs=100,000, maxConn=64, quantized=7bits, numSegments=1

Baseline:
Recall: 0.848
Latency (Median): 2.305 ms
NetCPU (Median): 2.2575 ms
AvgCpuCount (Median): 0.976
Index Size: 51.52MB
Vec Disk: 48.07MB
Vec RAM: 9.918MB

Candidate:
Recall: 0.848
Latency (Median): 1.85 ms
NetCPU (Median): 1.80 ms
AvgCpuCount (Median): 0.974
Index Size: 51.52MB
Vec Disk: 48.07MB
Vec RAM: 9.918MB

Latency Improvement: ~18.1% speedup

While the degree of improvement varied between tests, all but one of the tests showed a latency improvement over the baseline. Considering how simple and non-intrusive this implementation is, I think it would be an easy net benefit.

Thank you for letting me know about the backwards compatibility requirement. I will look into fixing that tomorrow.

@benwtrent
Member

@aylonsk great-looking numbers! I expect that for cheaper vector ops (e.g. single-bit quantization), the impact would be even higher.
