Skip to content

Fix HNSW index crashes and incorrect query results#181

Merged
ruvnet merged 1 commit intoruvnet:mainfrom
grparry:fix/hnsw-index-crashes
Feb 20, 2026
Merged

Fix HNSW index crashes and incorrect query results#181
ruvnet merged 1 commit intoruvnet:mainfrom
grparry:fix/hnsw-index-crashes

Conversation

@grparry
Copy link
Copy Markdown

@grparry grparry commented Feb 17, 2026

Hi! I've been evaluating ruvector as a pgvector alternative for a project and ran into a few issues with the HNSW index access method on PostgreSQL 17. I traced them down and put together fixes — hope these are helpful.

Summary

Six bugs fixed in the HNSW access method (hnsw_am.rs), plus one shared fix in ivfflat_am.rs:

  1. SIGSEGV on repeated queriesbeginscan must allocate xs_orderbyvals and xs_orderbynulls arrays (like GiST and SP-GiST do). RelationGetIndexScan allocates the scan descriptor but not these arrays, so the executor was writing distance values to stale/uninitialized memory. First query in a backend often worked; second+ crashed at addresses like 0x1.

  2. Empty HNSW graphconnect_node_to_neighbors was a TODO stub. Implemented bidirectional connections (new node → neighbors, each neighbor → new node) with pruning when neighbor count exceeds M (upper layers) or M0 (layer 0).

  3. Wrong distance metrichnsw_build used HnswConfig::default() which hardcodes DistanceMetric::Euclidean, regardless of the operator class (e.g. ruvector_cosine_ops). Added metric_from_index() that reads the opclass support function name via index_getprocid + get_func_name.

  4. Wrong result orderingBinaryHeap::into_iter().take(k) returns k arbitrary items from the heap's backing array, not the k smallest. Removed .take(k) before the sort+truncate.

  5. xs_recheckorderby must be false — Setting it to true triggers PG17's IndexNextWithReorder distance comparison, which raises "index returned tuples in wrong order" on harmless floating-point differences between index-computed and operator-computed distances.

  6. Use-after-free in endscan — Added null check and null-out of scan->opaque to prevent double-free if endscan is called after a rescan that resets the state.

Testing

  • Built and tested on PostgreSQL 17.7 with pgrx 0.12.9
  • Verified with 5d and 8d vectors, dataset sizes 15 and 100
  • 20+ consecutive HNSW queries per session with no crashes
  • HNSW results match sequential scan ground truth across all test queries

Test plan

  • Build with cargo pgrx install --release
  • Create table with ruvector(N) column and HNSW index using ruvector_cosine_ops
  • Run SELECT ... ORDER BY embedding <=> query::ruvector LIMIT k multiple times in the same psql session
  • Compare HNSW results against sequential scan (SET enable_indexscan = off)
  • Verify no crashes on repeated queries

🤖 Generated with Claude Code

Six bugs fixed in the HNSW access method:

1. SIGSEGV on repeated queries: beginscan must allocate xs_orderbyvals
   and xs_orderbynulls arrays (like GiST/SP-GiST do). Without this,
   the executor writes distance values to stale palloc memory, causing
   segfaults on the second+ query in the same backend.

2. Empty HNSW graph: connect_node_to_neighbors was a no-op TODO stub.
   Implemented bidirectional connections (new→neighbors, neighbor→new)
   with pruning at M/M0 capacity limits.

3. Wrong distance metric: hnsw_build hardcoded DistanceMetric::Euclidean
   regardless of the operator class used (e.g. ruvector_cosine_ops).
   Added metric_from_index() to read the metric from the opclass
   support function via index_getprocid + get_func_name.

4. Wrong result ordering: BinaryHeap::into_iter().take(k) returns k
   arbitrary items, not the k closest. Removed .take(k) before sort.

5. xs_recheckorderby must be false: setting it to true triggers PG17's
   IndexNextWithReorder distance comparison, which errors on harmless
   floating-point differences between index-stored and heap vectors.

6. Use-after-free in endscan: added null check and null-out of
   scan->opaque to prevent double-free across rescans.

Also applied the same xs_orderbyvals fix to ivfflat_ambeginscan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ruvnet ruvnet merged commit 72c2a8e into ruvnet:main Feb 20, 2026
ruvnet added a commit that referenced this pull request Feb 20, 2026
Merges PR #181 fixes:
- SIGSEGV on repeated queries (xs_orderbyvals allocation)
- Bidirectional HNSW connections with pruning
- Correct distance metric from operator class
- Sorted result ordering (into_sorted_vec)
- xs_recheckorderby=false for PG17
- Null-safe endscan (use-after-free fix)
- Bounds checks for page boundary reads
- Non-kNN scan fallback (COUNT/WHERE IS NOT NULL)
- Dimension extraction from atttypmod

Co-Authored-By: claude-flow <ruv@ruv.net>
Copy link
Copy Markdown
Owner

ruvnet commented Feb 20, 2026

Thanks @grparry for this excellent contribution! All 6 fixes have been reviewed, confirmed correct, and merged to main.

Summary of fixes:

  1. SIGSEGV on repeated queriesxs_orderbyvals/xs_orderbynulls allocation in beginscan
  2. Empty graph on insert — Bidirectional HNSW connections with pruning
  3. Wrong distance metricmetric_from_index() reads opclass support function
  4. Wrong result orderinginto_sorted_vec() replaces arbitrary BinaryHeap iteration
  5. xs_recheckorderby — Set to false to avoid PG17 false positive
  6. Use-after-free in endscan — Null check + null-out of scan->opaque

We also merged your fixes with the additional bounds-check and non-kNN scan improvements that were on main.

Published as:

Great work — these are critical stability fixes for production HNSW workloads. 🙏

@grparry grparry deleted the fix/hnsw-index-crashes branch February 24, 2026 23:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants