feat: support IVF_* vector indexes via ALTER TABLE ... CREATE INDEX #601
Open
sezruby wants to merge 2 commits into
Author
For visibility, a quick diff vs #479 (which this builds on, with the original author credited via Co-Authored-By):
Same core design (driver-trained centroids/codebook → fragment-parallel segment build → atomic segment commit). If #479 lands first, this becomes a smaller cleanup PR; if this lands first, #479 is superseded.
Contributor
@sezruby Could you fix the conflict?
Adds ivf_flat, ivf_pq, ivf_sq as CREATE INDEX methods using lance-core's
multi-segment commit API. The driver pre-trains IVF centroids (and PQ
codebook for IVF_PQ) once via VectorTrainer, then each Spark task calls
createIndex(withFragmentIds([fid])) with those shared artifacts. The driver
collects the uncommitted per-fragment segments and publishes them atomically
under one logical index name via commitExistingIndexSegments — same pattern
as the existing zonemap distributed-build path.
Pre-trained centroids are required: lance-core's distributed build path
rejects per-fragment-trained centroids. Sharing centroids across segments
also keeps them in the same query-time compatibility group.
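
The train-once, build-per-fragment, commit-once flow described above can be sketched in plain Java. This is a minimal illustration, not lance-spark code: TrainedArtifacts, Segment, trainOnce, buildSegment and commit are hypothetical stand-ins for VectorTrainer, the per-task createIndex(withFragmentIds([fid])) call and commitExistingIndexSegments, and the loop stands in for Spark tasks. The identity check in commit is an assumption used to model the requirement that all segments share the driver-trained centroids.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Hypothetical stand-ins only; none of these types are lance-core or lance-spark classes. */
final class DistributedIvfBuildSketch {

  /** Centroids trained once on the driver and shipped to every task. */
  static final class TrainedArtifacts {
    final float[][] centroids;
    TrainedArtifacts(float[][] centroids) { this.centroids = centroids; }
  }

  /** An uncommitted index segment covering exactly one fragment. */
  static final class Segment {
    final UUID uuid = UUID.randomUUID();
    final int fragmentId;
    final TrainedArtifacts artifacts;
    Segment(int fragmentId, TrainedArtifacts artifacts) {
      this.fragmentId = fragmentId;
      this.artifacts = artifacts;
    }
  }

  /** Driver side: placeholder for training (a real trainer would cluster sampled vectors). */
  static TrainedArtifacts trainOnce(int numPartitions, int dim) {
    return new TrainedArtifacts(new float[numPartitions][dim]);
  }

  /** Task side: build a segment for one fragment from the shared artifacts. */
  static Segment buildSegment(int fragmentId, TrainedArtifacts shared) {
    return new Segment(fragmentId, shared);
  }

  /** Driver side: refuse to publish unless every segment used the same artifacts. */
  static List<Segment> commit(List<Segment> segments) {
    if (segments.isEmpty()) {
      throw new IllegalArgumentException("no segments to commit");
    }
    TrainedArtifacts first = segments.get(0).artifacts;
    for (Segment s : segments) {
      if (s.artifacts != first) {
        throw new IllegalArgumentException(
            "segment " + s.uuid + " was built with different centroids");
      }
    }
    return new ArrayList<>(segments); // stands in for one atomic manifest update
  }

  static List<Segment> run(List<Integer> fragmentIds) {
    TrainedArtifacts shared = trainOnce(256, 8);
    List<Segment> uncommitted = new ArrayList<>();
    for (int fid : fragmentIds) { // in Spark this would be one task per fragment
      uncommitted.add(buildSegment(fid, shared));
    }
    return commit(uncommitted);
  }
}
```

Calling DistributedIvfBuildSketch.run(Arrays.asList(0, 1, 2)) yields three segments that all reference the same TrainedArtifacts instance, while committing segments built from two separate trainOnce calls is rejected.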
User-facing:
ALTER TABLE t CREATE INDEX v_idx USING ivf_pq (embedding)
WITH (num_partitions=256, num_sub_vectors=16, num_bits=8, metric_type='l2');
Grammar is unchanged; IndexUtils.buildIndexType learns three new cases.
Replace-on-recreate is preserved by commitExistingIndexSegments's overlap-aware
behavior (same-fragment segments are dropped in the commit transaction).
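
To make the replace-on-recreate behavior concrete, here is a small sketch of an overlap-aware commit. It only models the rule stated above (an existing segment is dropped when an incoming segment covers one of its fragments); the Segment class and commit method are hypothetical and say nothing about how lance-core implements the transaction internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of an overlap-aware segment commit; not the lance-core API. */
final class OverlapAwareCommitSketch {

  static final class Segment {
    final String uuid;
    final Set<Integer> fragmentIds = new HashSet<>();
    Segment(String uuid, int... fragmentIds) {
      this.uuid = uuid;
      for (int fid : fragmentIds) {
        this.fragmentIds.add(fid);
      }
    }
  }

  /**
   * Returns the segment list after the commit: committed segments that share
   * a fragment with any incoming segment are removed, then the incoming
   * segments are added. Both steps are meant to happen in one transaction.
   */
  static List<Segment> commit(List<Segment> committed, List<Segment> incoming) {
    Set<Integer> covered = new HashSet<>();
    for (Segment s : incoming) {
      covered.addAll(s.fragmentIds);
    }
    List<Segment> next = new ArrayList<>();
    for (Segment old : committed) {
      if (Collections.disjoint(old.fragmentIds, covered)) {
        next.add(old); // untouched fragments keep their existing segment
      }
    }
    next.addAll(incoming);
    return next;
  }
}
```

Under this model, re-running CREATE INDEX over the same fragments leaves only the new segments, so the old and new segment UUID sets are disjoint and each fragment is covered exactly once.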
DistanceType is threaded through to VectorTrainer.trainIvfCentroids and
trainPqCodebook (lance 7 honors metric_type end-to-end; the lance 6 JNI
hardcoded MetricType::L2). use_residual is rejected with a clear error —
lance-core's training path doesn't honor it, and silently dropping the flag
would degrade recall.
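
The option handling described above might look roughly like the following. This is a hedged sketch, not the PR's code: VectorIndexOptionsSketch, its Metric enum and its method names are hypothetical, the default of l2 when metric_type is absent is an assumption, and the accepted metric names are taken from the PR description (l2/cosine/dot/hamming).

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical validation of WITH (...) options; names are illustrative only. */
final class VectorIndexOptionsSketch {

  enum Metric { L2, COSINE, DOT, HAMMING }

  /** Parses metric_type case-insensitively; defaulting to l2 is an assumption. */
  static Metric parseMetric(Map<String, String> options) {
    String raw = options.getOrDefault("metric_type", "l2");
    try {
      return Metric.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException(
          "Unsupported metric_type '" + raw + "'; expected l2, cosine, dot, or hamming");
    }
  }

  /** IVF indexes need an explicit partition count. */
  static int requireNumPartitions(Map<String, String> options) {
    String raw = options.get("num_partitions");
    if (raw == null) {
      throw new IllegalArgumentException("num_partitions is required for IVF indexes");
    }
    return Integer.parseInt(raw.trim());
  }

  /** Fail fast instead of silently ignoring a flag the training path does not honor. */
  static void rejectUseResidual(Map<String, String> options) {
    if (options.containsKey("use_residual")) {
      throw new IllegalArgumentException(
          "use_residual is not supported for IVF index creation");
    }
  }
}
```

Rejecting the flag outright trades convenience for safety: a user who sets use_residual gets an immediate error rather than an index whose recall quietly differs from what was requested.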
Tests:
- 5 happy-path cases for IVF_FLAT/IVF_PQ/IVF_SQ creation, recreate-replaces,
and missing-num_partitions failure.
- metric_type='cosine' is exercised on IVF_SQ (regression guard for the
lance 6 silent-L2 bug fixed in lance 7).
- use_residual rejection is asserted explicitly.
Closes lance-format#165 in part. Builds on lance-format#479 (jiaoew1991) — rebased onto current main,
which already provides commitIndexSegments and a vector-friendly Kryo serializer
for Index, simplifying the original PR by dropping its LanceIndexHandle shim.
Co-Authored-By: Enwei Jiao <jiaoew2011@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Replace test now captures segment UUIDs after first run, asserts disjointness
  after recreate, and confirms segments cover every fragment exactly once.
  Mirrors testRepeatedCreateZonemapIndexReplacesExistingSegments.
- Subtype assertion via describeIndices().getIndexType(): manifest
  Index#indexType() reports umbrella VECTOR for all IVF variants, so a
  regression that builds IVF_FLAT when the user asked for IVF_PQ would have
  been invisible. Asserting "IVF_PQ"/"IVF_FLAT"/"IVF_SQ" string from
  describeIndices catches that.
- Recall test: clustered embeddings (8 clusters x 64 rows, 8-dim, sigma=0.1)
  with metric_type='cosine' on IVF_PQ. Queries each cluster centroid via
  VECTOR_SEARCH, asserts top-K returns mostly the matching cluster
  (recall >= 0.5). This is the test that actually proves metric_type is honored
  end-to-end; without it the lance 6 silent-L2 bug would have been invisible
  to CI.
- Multi-column rejection test (mirrors testZonemapRejectsMultipleColumns).
- Bad metric_type assertThrows.

9 new vector tests total: FLAT/PQ/SQ creation + subtype, recreate-replaces,
recall, multi-column rejection, missing-num_partitions, bad metric,
use_residual.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
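
As an illustration of the recall check, the sketch below generates clustered data with the stated shape (8 clusters x 64 rows, 8-dim, Gaussian noise with a given sigma) and measures how many of the top-K cosine neighbours of each cluster centre belong to that cluster. Several things here are assumptions rather than the PR's test code: the class and method names are hypothetical, cluster centres are placed on the coordinate axes, and an exact brute-force scan stands in for VECTOR_SEARCH over an IVF_PQ index.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical recall measurement on clustered data; brute force replaces the index. */
final class ClusterRecallSketch {
  static final int CLUSTERS = 8;
  static final int ROWS_PER_CLUSTER = 64;
  static final int DIM = 8;

  static double cosine(float[] a, float[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    double denom = Math.sqrt(na) * Math.sqrt(nb);
    return denom == 0 ? 0 : dot / denom;
  }

  /** Fraction of top-k neighbours (by cosine) that carry the queried cluster's label. */
  static double recallAtK(int k, double sigma, long seed) {
    Random rnd = new Random(seed);
    int n = CLUSTERS * ROWS_PER_CLUSTER;
    float[][] rows = new float[n][DIM];
    int[] label = new int[n];
    for (int i = 0; i < n; i++) {
      int c = i / ROWS_PER_CLUSTER;
      label[i] = c;
      for (int d = 0; d < DIM; d++) {
        // Assumed layout: cluster c is centred on the c-th unit axis.
        rows[i][d] = (float) ((d == c ? 1.0 : 0.0) + sigma * rnd.nextGaussian());
      }
    }
    int hits = 0;
    for (int c = 0; c < CLUSTERS; c++) {
      float[] query = new float[DIM];
      query[c] = 1f;
      double[] sim = new double[n];
      Integer[] order = new Integer[n];
      for (int i = 0; i < n; i++) {
        sim[i] = cosine(query, rows[i]);
        order[i] = i;
      }
      // Sort row indices by descending similarity to the query.
      Arrays.sort(order, (x, y) -> Double.compare(sim[y], sim[x]));
      for (int j = 0; j < k; j++) {
        if (label[order[j]] == c) {
          hits++;
        }
      }
    }
    return hits / (double) (CLUSTERS * k);
  }
}
```

A real test would replace the brute-force scan with a query against the index; if the index were built with L2 while the query assumed cosine, recall on data like this is the number that would be expected to drop.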
8da1b01 to 53caaff
Summary
Adds ivf_flat, ivf_pq, ivf_sq as CREATE INDEX methods, using lance-core's multi-segment commit API to publish per-fragment segments atomically under one logical index name.
Closes #165 in part. Builds on #479 (@jiaoew1991), rebased onto current main, which already provides commitIndexSegments and a vector-friendly Kryo serializer for Index, simplifying the original PR by dropping its LanceIndexHandle shim.

Design
- lance-core's distributed build path (createIndex + withFragmentIds) requires precomputed IVF centroids and, for IVF_PQ, a precomputed PQ codebook. The driver calls VectorTrainer.trainIvfCentroids / trainPqCodebook once; every per-fragment task uses the same artifacts. This also keeps all segments in the same query-time compatibility group.
- Each Spark task calls dataset.createIndex(IndexOptions.builder(...).withFragmentIds([fid]).build()) and returns an uncommitted Index segment.
- The driver publishes the collected segments via commitExistingIndexSegments(indexName, column, segments), the same commitIndexSegments helper already used by the zonemap path. Replace-on-recreate is handled by lance-core's overlap-aware atomic add/remove in the segment commit.
- The IndexUtils.isVectorIndex predicate dispatches IVF_* alongside useLogicalSegmentCommit (zonemap) into a vector-only branch in AddIndexExec.run.

Correctness
- metric_type (l2/cosine/dot/hamming) is parsed on the driver and threaded into both trainIvfCentroids and trainPqCodebook via DistanceType; lance 7 honors it end-to-end. (The lance 6 JNI hardcoded MetricType::L2, which would have made metric_type='cosine' silently degrade recall; see #479 review. Lance 7 fixes that in vector_trainer.rs and VectorTrainer.java.)
- use_residual is rejected with a clear error: lance 7's vector_trainer.rs#build_ivf_params_from_java doesn't pass the flag to Rust, and IvfBuildParams no longer carries the field. Silently dropping it would degrade PQ recall, so we fail fast instead.

What's not in this PR
- hnsw_m, hnsw_ef_construction, etc.
- OPTIMIZE INDEX for incremental builds over new fragments.

Test plan
- make test SPARK_VERSION=3.5 SCALA_VERSION=2.12 -Dtest='AddIndexTest': 30 tests pass (21 existing + 9 new vector tests: FLAT/PQ/SQ create + subtype assertion via describeIndices, recreate-replaces with UUID disjointness, IVF_PQ + cosine recall ≥ 0.5 on clustered data, multi-column rejection, missing-num_partitions, bad metric_type, use_residual rejection)
- make lint: checkstyle + spotless clean

🤖 Generated with Claude Code