Add Distance Metric to VINFO and update diskann-garnet #1505

Merged
kevin-montrose merged 1 commit into vectorApiPoC from users/tiagonapoli/vinfo-distance-metric
Jan 22, 2026

Conversation

@tiagonapoli
Collaborator

No description provided.

@tiagonapoli tiagonapoli force-pushed the users/tiagonapoli/vinfo-distance-metric branch from 8c8c1ea to 518aeb7 Compare January 21, 2026 15:03
@kevin-montrose kevin-montrose merged commit 259513f into vectorApiPoC Jan 22, 2026
1 check passed
@kevin-montrose kevin-montrose deleted the users/tiagonapoli/vinfo-distance-metric branch January 22, 2026 15:16
kevin-montrose added a commit that referenced this pull request Feb 6, 2026
* stopgap commit; sketch out a multi-key read_callback that both prefetches and doesn't copy

* stopgap commit: add missing file

* stopgap commit; correctly set count

* stopgap commit; rework for IReadArgBatch

* stopgap commit; fixup bugs with IReadArgBatch implementation

* stopgap commit; correctly capture callback and callback context on VectorReadBatch

* small refactor to avoid extra accesses and recalcs

* bump diskann-garnet

* rework session tracking so we can set it high up during adds, which lets read/write/delete callbacks work _during_ index creation; this will simplify restoration and clean up some stuff on the DiskANN side

* micro optimization around allocating space for vector set index reads

* stopgap commit; start working on recovering indexes from disk / without AOF

* suppress for now

* remove temporary copies and allocations from VADD replication

* fix replication tests by pausing for VADDs to also catch up

* bump diskann-garnet to 1.0.4

* 1.0.4 has issues, rolling back to 1.0.3

* DRY up index reading to simplify recreation and prepare for shared lock sharding

* extend locking DRY'ing to replication

* bump to 1.0.5

* sketch out sharded read locks

* bump to 1.0.8

* rework replication to (probably) fix a bad pointer on passed SpanBytes

* implement (sort of) VEMB for debugging purposes

* stopgap commit; get some stopwatch based logging in for diagnostics

* Revert "stopgap commit; get some stopwatch based logging in for diagnostics"

This reverts commit 0aa68d1.

* less naive prefetch approach, working in batches of 12 and only if we have a batch in the first place

* JIT may not be smart enough to elide these bounds checks, so just go unsafe

* bump diskann and garnet release version

* fail deadly while upstream Entra fixes are rolling out

* memory corruption bug somewhere - kick up DiskANN in the optimistic hope it was in there

* change stress amounts

* diskann is hard assuming 75 for now, so change tests accordingly

* more bounds checking, more logging, let's find this corruption

* sketch out VREM

* DRY up dimension calculation on VADD

* don't return success if delete didn't do anything

* tweak library resolution logic; when hosted as a service on Linux, current directory is / which does not play nice with this path style; instead base on location of assemblies if initial lookup fails

* bump version

* be more defensive, though shouldn't really matter; also log more on faulting

* Revert "rework replication to (probably) fix a bad pointer on passed SpanBytes"

This reverts commit 6d144ac.

* Revert "fix replication tests by pausing for VADDs to also catch up"

This reverts commit 333b4e1.

* Revert "remove temporary copies and allocations from VADD replication"

This reverts commit 9104b92.

* after reverting replication optimizations, bump version

* ruled out corruption, remove all these bounds checks and other validation

* bump diskann-garnet; VREM implemented and VREM replication tested

* deleting a vector set causes its internal values to be cleaned up (very slowly, but still)

* fix DEL replays w.r.t. vector sets

* more bits for diskann in context

* diskann-garnet to .12, attributes now managed on that side

* exclude vector set data from a number of places; get most (all?) tests passing

* fixes for recovery, more tests for recovery, diskann-garnet needs some changes to complete the rest of this

* temp hack around a re-entrancy issue

* hack harder

* fix recovery test

* bump to .13

* bump diskann-garnet to fix bugs

* restart cleanups upon recovery

* start a design doc now that we've mostly nailed down the PoC

* Remove dead code; we're not using multiinsert right now, and won't for the foreseeable future

* remove more dead code

* finish up first draft of vector-sets.md

* fixup some links

* naturally, a typo in the first two lines

* formatting

* typos

* more typos

* note migration is still a WIP

* remove hack from index creation

* remove hack from index recreation

* expand tests

* fix tests

* fix tests

* sketch out rmw callback for DiskANN

* don't roll version back

* fix a bunch of typos

* more corrections and cleanup upon review

* move VectorManager onto GarnetDatabase, preparing for multi-DB testing

* mention docs

* implement copy-update functions, I seem to have misunderstood the point of these

* knock out remainder of recreate tests

* track hash slots with vector set metadata

* add (failing) basic migration test

* stopgap commit; sketch out and document the migration flow

* stopgap commit; primary -> primary for _hash slots_ works; replicas don't see the changes, which is unfortunate but not unexpected; key migrations not yet implemented

* stopgap commit; all Vector Set tests passing, though there's still migration work to be done

* fix tests

* replicas now follow migrated primary Vector Sets; needs a lot more testing, but all tests pass right now

* migrate ... keys implemented, which wraps up migration (in theory)

* fix tests; all tests passing now

* test moving multiple vector sets to a primary that already has vector sets

* more vector set migration tests, and fixes

* Rework timeouts for some cluster migration tests

* stopgap commit; lots of hackery to try and make writes during migrations not fail

* stopgap commit; this appears to work, need to stress and remove lots of logging

* stopgap commit; remove a bunch of hackery and logging

* stress test is still a bit flaky, but there are common non-Vector Set failures that can be excluded

* note blocking during migrations in vector-sets.md

* restore AAD, this is long since debugged

* knock a number of hacks out

* remove another hack

* hide Vector Sets behind a feature flag - flag defaults on for tests, but is off in defaults.config

* dry up exclusive lock acquisition

* split VectorManager up to make it easier to review

* knock out more todos

* cleanup after migration failures

* this TODO is invalid

* don't bump version

* implement ReadWithPrefetch (pulled off of vectorApiPoC work)

* revert change to NativeStorageDevice, not needed as part of Vector Sets

* formatting

* actually bump to latest internal, rather than leaving this stashed

* move MGET (normal and scatter-gather) onto ReadWithPrefetch

* address feedback

* document that 4-bytes before key for RMW callback is required

* move method to migration partial

* correctly update session metrics with new MGET impls

* stopgap commit; sketch out alternative locking scheme to replace object store locks

* tweaks to locking impl after some benchmarking

* clarify docs, naming, and the 'why' of some optimizations in new locking proposal

* handle feedback; rather than process number, use a thread static which saves off managed thread id - good enough in practice, and cheap everywhere

* formatting

* fix merge

* fix website build

* address feedback; generalize vector set locks, move and rename

* bump DiskANN integration to 1.0.16 to fix Linux issue

* GH actions are hitting disk throttle issues in this test, so attempt to remove some pressure

* Revert "GH actions are hitting disk throttle issues in this test, so attempt to remove some pressure"

This reverts commit d2e139a.

* another attempt at taking IO pressure off GH linux tests

* helped some, but more explicit throttling required

* explicit throttling works some of the time, but still fails occasionally - try just slowing writes down on GitHub

* move off ObjectStore in preparation of retargeting PR against storev2 work

* Update libs/server/Storage/Session/MainStore/VectorStoreOps.cs

Co-authored-by: Tiago Nápoli <napoli.tiago96@gmail.com>

* address feedback; remove dead code

* address feedback; correct comment, denote missing WRONGTYPE behavior

* address feedback; remove dead code

* address feedback; remove commented out usings

* cleanup Vector Set dev docs

* knock out a TODO

* address feedback; remove dead code

* address feedback; fixes in migration logic around failures

* harden RepeatedVectorSetDeletes test; fix a math issue in the 'WRONGTYPE' path

* wrongtype check leaving a null in some cases; wasn't possible before because of shared locking context, removing Tsavorite locks made the bug possible

* log more on this failure, as only happening in GH

* add missing logic when outside mutable region

* address feedback; some fixes around locking

* remove accidental using

* correctly assert; fix typo

* rework WRONGTYPE logic

* deal with consequences of WRONGTYPE cleanup

* stopgap commit; start working on fixing failed deletions with a (failing) test

* address feedback; goto is a smell here

* address feedback; add missing asserts

* stopgap commit; start tracking in progress deletes, not actually handling them just yet

* address feedback; DRY up and remove some unused bytes around completing pending operations in VectorManager

* stopgap commit; recover partial deletes on startup, still needs testing

* stopgap commit; fail Vector Set commands sensibly if operating on a partially deleted Vector Set

* stopgap commit; start testing partial deletion recovery scenario

* stopgap commit; partial delete recovery working, needs docs and checks that cleanup of elements happens in all cases

* stopgap commit; delete recovery working and tested, still needs documentation update

* wrap up delete rework, update documentation

* address feedback; as cluster expected DB=0 for now, add a check for it

* address feedback; parameter names and docs

* address feedback; there's a race here, fix it but this needs to be cleaned up into a utility type

* factor counting logic out into its own reusable type CountingEventSlim

* fix formatting

* fix auth test

* address feedback; hide StoreWrapper from rest of AofProcessor

* stopgap commit; stop VectorManager background tasks if ReplicaReplayTask is stopped, still needs some explicit tests

* fix injection tests in RELEASE

* this is TODONE

* address feedback; cancellation of replay task also spins down (but allows future restart) VectorManager tasks

* address feedback; as Tiago noted, Redis _does_ allow empty vector set keys - document that that is a divergence, validate, and note for future fixing

* address flaky test

* fix more test flakiness

* another flaky test fix - common theme here is that raising an error kinda kills the connections; makes sense, SE.Redis will resume later and Redis isn't durable in ways these tests implicitly assumed

* these tests are super chatty, suppress logging for PR CI purposes

* address feedback; sizeof(byte) is unnecessary as namespace is part of the key, this revealed a bug in InPlaceUpdates too which has been fixed

* grow and shrink records during vector set delete tracking

* fix WRONGTYPE for Vector Set ops on non-Vector Set keys - still cancellation based which is a bit hacky, but we can at least test for regressions as we move towards store v2

* Implement VINFO (#1469)

* dotnet format

* typo

* validate that values will fit in store before calling into DiskANN

* validate minimum page size (16k for now, 8k is actual internal minimum but that ignored headers) iff vector set preview is turned on

* disable low memory in DiskANNServiceTests

* bump up to 1.0.17

* update tests for variable length

* fix VEMB, fix tests; all non-cluster tests passing

* bump to 1.0.18

* repro cluster regression in non-cluster context; VREM is leaving data around

* add (failing) tests and calls for check_internal_id_valid(...) against DiskANN

* bump to 1.0.19 which implements check_internal_id_valid; there are still some failures due to internal id reuse, adding dedicated (and failing) tests for that

* undo prefix mangling on DiskANN provided buffers

* Update diskann-garnet and add distance metric to VINFO (#1505)

Co-authored-by: tiagonapoli <tiagonapoli@microsoft.com>

* wire distance metric up throughout

* fixes around distance metric addition

* Implement VGETATTR (#1474)

* Implement VGETATTR

* Fix test

* update comments

* Minor refactor on test

* Address comments

* Fix format

* Fix build

* Handle unknown GarnetStatus

---------

Co-authored-by: tiagonapoli <tiagonapoli@microsoft.com>

* minor cleanup as part of syncing into dev targeting branch

* Create Grid tests for DiskANN (#1513)

* DiskANN grid E2E tests

* nit

* Address comments

---------

Co-authored-by: tiagonapoli <tiagonapoli@microsoft.com>

* dotnet format

* bump to 1.0.20

* slap Allure attributes everywhere new

* missed an allure attribute

* missed an allure inherit

* missed more allure inherits

* fix output_ids length when initial size is too small

* fix output_distances length when initial size is too small

* pin buffers when they're too large to allocate on the stack

* bump MinimumSpacePerId to 4 + 8

* typos

* fix a bunch of unused variables and minor errors

---------

Co-authored-by: Tiago Nápoli <napoli.tiago96@gmail.com>
Co-authored-by: tiagonapoli <tiagonapoli@microsoft.com>