feat: support IVF_* vector indexes via ALTER TABLE ... CREATE INDEX #601

Open
sezruby wants to merge 2 commits into lance-format:main from sezruby:feat/ivf-create-index
@sezruby sezruby commented Jun 9, 2026

Summary

Adds ivf_flat, ivf_pq, ivf_sq as CREATE INDEX methods, using lance-core's multi-segment commit API to publish per-fragment segments atomically under one logical index name.

ALTER TABLE t CREATE INDEX v_idx USING ivf_pq (embedding)
  WITH (num_partitions=256, num_sub_vectors=16, num_bits=8, metric_type='l2');

Closes #165 in part. Builds on #479 (@jiaoew1991) — rebased onto current main, which already provides commitIndexSegments and a vector-friendly Kryo serializer for Index, simplifying the original PR by dropping its LanceIndexHandle shim.

Design

  • Driver-side training. lance-core's distributed build (createIndex + withFragmentIds) requires precomputed IVF centroids — and for IVF_PQ, a precomputed PQ codebook. The driver calls VectorTrainer.trainIvfCentroids / trainPqCodebook once; every per-fragment task uses the same artifacts. This also keeps all segments in the same query-time compatibility group.
  • Per-fragment tasks. Each Spark task calls dataset.createIndex(IndexOptions.builder(...).withFragmentIds([fid]).build()) and returns an uncommitted Index segment.
  • Driver commit. Driver collects segments and calls commitExistingIndexSegments(indexName, column, segments) — same commitIndexSegments helper already used by the zonemap path. Replace-on-recreate is handled by lance-core's overlap-aware atomic add/remove in the segment commit.
  • Routing. A new IndexUtils.isVectorIndex predicate dispatches IVF_* alongside useLogicalSegmentCommit (zonemap) into a vector-only branch in AddIndexExec.run.
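
The train-once, build-per-fragment, commit-once flow described above can be sketched as follows. This is a minimal, self-contained illustration and not the PR's code: TrainedArtifacts, Segment, and DistributedIvfBuildSketch are hypothetical stand-ins, "training" simply copies sample rows, and the commit step models the overlap-aware replace as a plain list filter instead of calling lance-core.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.UUID;
import java.util.stream.Collectors;

// Hypothetical stand-in for the shared training output (not a lance-core type).
final class TrainedArtifacts {
  final float[][] centroids; // IVF centroids, trained once on the driver

  TrainedArtifacts(float[][] centroids) {
    this.centroids = centroids;
  }
}

// Hypothetical stand-in for an uncommitted per-fragment index segment.
final class Segment {
  final UUID uuid;
  final int fragmentId;
  final int numPartitions;

  Segment(UUID uuid, int fragmentId, int numPartitions) {
    this.uuid = uuid;
    this.fragmentId = fragmentId;
    this.numPartitions = numPartitions;
  }
}

final class DistributedIvfBuildSketch {
  // Step 1 (driver): train once. Placeholder training: reuse sample rows as centroids.
  static TrainedArtifacts trainOnDriver(float[][] sample, int numPartitions) {
    float[][] centroids = new float[numPartitions][];
    for (int i = 0; i < numPartitions; i++) {
      centroids[i] = sample[i % sample.length].clone();
    }
    return new TrainedArtifacts(centroids);
  }

  // Step 2 (task): build one segment for one fragment from the shared artifacts.
  static Segment buildSegment(int fragmentId, TrainedArtifacts artifacts) {
    return new Segment(UUID.randomUUID(), fragmentId, artifacts.centroids.length);
  }

  // Step 3 (driver): publish all new segments together and drop any existing
  // segment that covers a fragment the new segments also cover.
  static List<Segment> commit(List<Segment> existing, List<Segment> fresh) {
    Set<Integer> covered =
        fresh.stream().map(s -> s.fragmentId).collect(Collectors.toSet());
    List<Segment> result =
        existing.stream()
            .filter(s -> !covered.contains(s.fragmentId))
            .collect(Collectors.toCollection(ArrayList::new));
    result.addAll(fresh);
    return result;
  }

  static List<Segment> createIndex(
      List<Integer> fragmentIds, float[][] sample, int numPartitions, List<Segment> existing) {
    TrainedArtifacts artifacts = trainOnDriver(sample, numPartitions);
    List<Segment> fresh =
        fragmentIds.stream()
            .map(fid -> buildSegment(fid, artifacts))
            .collect(Collectors.toCollection(ArrayList::new));
    return commit(existing, fresh);
  }
}
```

Because every segment is built from the same TrainedArtifacts instance, all segments report the same partition count, which loosely mirrors the "same query-time compatibility group" property the design relies on.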

Correctness

  • metric_type (l2 / cosine / dot / hamming) is parsed on the driver and threaded into both trainIvfCentroids and trainPqCodebook via DistanceType — lance 7 honors it end-to-end. (lance 6 JNI hardcoded MetricType::L2, which would have made metric_type='cosine' silently degrade recall — see #479 review. Lance 7 fixes that in vector_trainer.rs and VectorTrainer.java.)
  • use_residual is rejected with a clear error: lance 7's vector_trainer.rs#build_ivf_params_from_java doesn't pass the flag to Rust, and IvfBuildParams no longer carries the field. Silently dropping it would degrade PQ recall, so we fail fast instead.
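
The eager, driver-side checks described above might look roughly like the sketch below. It is an illustration under assumptions, not the PR's VectorIndexSpec.fromArgs: the class name VectorIndexArgsSketch, the Metric enum, the error messages, and the default of l2 when metric_type is absent are all hypothetical.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of driver-side WITH-args validation; not the PR's parser.
final class VectorIndexArgsSketch {
  enum Metric { L2, COSINE, DOT, HAMMING }

  final int numPartitions;
  final Metric metric;

  private VectorIndexArgsSketch(int numPartitions, Metric metric) {
    this.numPartitions = numPartitions;
    this.metric = metric;
  }

  static VectorIndexArgsSketch fromArgs(Map<String, String> args) {
    // Reject rather than silently ignore an option the training path would drop.
    if (args.containsKey("use_residual")) {
      throw new IllegalArgumentException(
          "use_residual is not supported: the training path would ignore it");
    }
    String partitions = args.get("num_partitions");
    if (partitions == null) {
      throw new IllegalArgumentException("num_partitions is required");
    }
    // Assumption: default to l2 when metric_type is not given.
    String metricName = args.getOrDefault("metric_type", "l2");
    Metric metric;
    try {
      metric = Metric.valueOf(metricName.toUpperCase(Locale.ROOT));
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException("Unsupported metric_type: " + metricName);
    }
    return new VectorIndexArgsSketch(Integer.parseInt(partitions), metric);
  }
}
```

Validating on the driver means a bad option fails the statement before any Spark task is launched, instead of surfacing as a task failure on an executor.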

What's not in this PR

  • IVF_HNSW_FLAT / IVF_HNSW_SQ / IVF_HNSW_PQ — supported by lance-core but need separate verification that the segment-commit path handles HNSW. Small follow-up.
  • IVF_RQ — same.
  • More WITH-args: hnsw_m, hnsw_ef_construction, etc.
  • OPTIMIZE INDEX for incremental builds over new fragments.

Test plan

  • make test SPARK_VERSION=3.5 SCALA_VERSION=2.12 -Dtest='AddIndexTest' — 30 tests pass (21 existing + 9 new vector tests: FLAT/PQ/SQ create + subtype assertion via describeIndices, recreate-replaces with UUID disjointness, IVF_PQ + cosine recall ≥ 0.5 on clustered data, multi-column rejection, missing-num_partitions, bad metric_type, use_residual rejection)
  • make lint — checkstyle + spotless clean
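
For readers unfamiliar with the recall assertion in the test plan, the sketch below shows one way such a check can be computed. It is not the PR's test: it uses exact brute-force cosine ranking instead of VECTOR_SEARCH against an IVF_PQ index, and the class name CosineRecallSketch, the one-hot cluster centers, and the seed are assumptions made only for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

// Hypothetical sketch: cluster-level recall on synthetic clustered embeddings.
final class CosineRecallSketch {
  static double cosine(float[] a, float[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  // Row r belongs to cluster r / rowsPerCluster. Assumption: cluster c is
  // centered on the unit vector along axis (c % dim), plus Gaussian noise.
  static float[][] clusteredRows(
      int clusters, int rowsPerCluster, int dim, double sigma, long seed) {
    Random rng = new Random(seed);
    float[][] rows = new float[clusters * rowsPerCluster][dim];
    for (int r = 0; r < rows.length; r++) {
      int cluster = r / rowsPerCluster;
      for (int d = 0; d < dim; d++) {
        double center = (d == cluster % dim) ? 1.0 : 0.0;
        rows[r][d] = (float) (center + sigma * rng.nextGaussian());
      }
    }
    return rows;
  }

  // Fraction of the top-k rows (by cosine similarity to the query) that
  // belong to the expected cluster.
  static double recallAtK(
      float[][] rows, int rowsPerCluster, int cluster, float[] query, int k) {
    Integer[] order = new Integer[rows.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    // Sort row indices by descending similarity.
    Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -cosine(rows[i], query)));
    int hits = 0;
    for (int i = 0; i < k; i++) {
      if (order[i] / rowsPerCluster == cluster) {
        hits++;
      }
    }
    return (double) hits / k;
  }
}
```

A test in this style would query each cluster center and require recallAtK to stay at or above a threshold such as 0.5; an index trained with the wrong metric would be expected to pull that number down on data where cosine and L2 neighborhoods differ.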

🤖 Generated with Claude Code

@github-actions github-actions Bot added the enhancement New feature or request label Jun 9, 2026
sezruby commented Jun 9, 2026

Author

For visibility, a quick diff vs #479 (which this PR builds on; its author is credited via Co-Authored-By):

| | #479 | this PR |
|---|---|---|
| Base | lance 6.0.0-beta.2, pre-zonemap main | lance 7.0.0-rc.1, current main |
| Index cross-executor transport | custom LanceIndexHandle shim (~80 LOC); Index wasn't Kryo-serializable in 6.x | direct encode/decode[Index]; main's Kryo serializer (added with zonemap) handles it |
| Commit path | inline dropIndex + commitExistingIndexSegments (two manifests, brief no-index window) | reuses main's commitIndexSegments helper; lance-core's overlap-aware add+remove makes it single-commit atomic |
| metric_type honored at training | no: lance 6 JNI hardcoded MetricType::L2 (review thread). metric_type='cosine' would silently build L2 centroids → degraded recall | yes: lance 7's vector_trainer.rs reads distance_type_jstr; DistanceType is threaded into trainIvfCentroids / trainPqCodebook |
| use_residual | threaded through executor builder, but Rust IvfBuildParams doesn't carry the field, so it was silently dropped | rejected at parse time with clear error, since lance-core training doesn't honor it |
| metric_type validation | lazy (executor-side) | eager (driver-side VectorIndexSpec.fromArgs) |

Same core design (driver-trained centroids/codebook → fragment-parallel segment build → atomic segment commit). If #479 lands first, this becomes a smaller cleanup PR; if this lands first, #479 is superseded.
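
The replace-on-recreate behavior compared above is the kind of property a test can pin down with two checks: segment UUIDs before and after the recreate are disjoint, and the new segments cover every fragment exactly once. The sketch below is a hypothetical, self-contained version of those checks; SegmentInfo and ReplaceCheckSketch are illustrative names, not lance-core or lance-spark types.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical description of a committed segment: its UUID and covered fragments.
final class SegmentInfo {
  final UUID uuid;
  final List<Integer> fragmentIds;

  SegmentInfo(UUID uuid, List<Integer> fragmentIds) {
    this.uuid = uuid;
    this.fragmentIds = fragmentIds;
  }
}

final class ReplaceCheckSketch {
  // True when no segment UUID from before the recreate is still present afterwards.
  static boolean uuidsDisjoint(List<SegmentInfo> before, List<SegmentInfo> after) {
    Set<UUID> old = new HashSet<>();
    for (SegmentInfo s : before) {
      old.add(s.uuid);
    }
    for (SegmentInfo s : after) {
      if (old.contains(s.uuid)) {
        return false;
      }
    }
    return true;
  }

  // True when every expected fragment is covered by exactly one segment
  // and no unexpected fragment is covered.
  static boolean coversEachFragmentOnce(List<SegmentInfo> segments, Set<Integer> fragments) {
    Map<Integer, Integer> counts = new HashMap<>();
    for (SegmentInfo s : segments) {
      for (int fid : s.fragmentIds) {
        counts.merge(fid, 1, Integer::sum);
      }
    }
    if (!counts.keySet().equals(fragments)) {
      return false;
    }
    for (int count : counts.values()) {
      if (count != 1) {
        return false;
      }
    }
    return true;
  }
}
```

A two-step drop-then-commit would also pass both checks once it finishes, so they guard the end state; the absence of a no-index window between manifests would need a separate check.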

@sezruby sezruby marked this pull request as ready for review June 9, 2026 20:27
@jiaoew1991

Contributor

@sezruby Could you fix the conflict?

sezruby and others added 2 commits June 11, 2026 20:06
Adds ivf_flat, ivf_pq, ivf_sq as CREATE INDEX methods using lance-core's
multi-segment commit API. The driver pre-trains IVF centroids (and PQ
codebook for IVF_PQ) once via VectorTrainer, then each Spark task calls
createIndex(withFragmentIds([fid])) with those shared artifacts. The driver
collects the uncommitted per-fragment segments and publishes them atomically
under one logical index name via commitExistingIndexSegments — same pattern
as the existing zonemap distributed-build path.

Pre-trained centroids are required: lance-core's distributed build path
rejects per-fragment-trained centroids. Sharing centroids across segments
also keeps them in the same query-time compatibility group.

User-facing:

  ALTER TABLE t CREATE INDEX v_idx USING ivf_pq (embedding)
    WITH (num_partitions=256, num_sub_vectors=16, num_bits=8, metric_type='l2');

Grammar is unchanged; IndexUtils.buildIndexType learns three new cases.
Replace-on-recreate is preserved by commitExistingIndexSegments's overlap-aware
behavior (same-fragment segments are dropped in the commit transaction).

DistanceType is threaded through to VectorTrainer.trainIvfCentroids and
trainPqCodebook (lance 7 honors metric_type end-to-end; the lance 6 JNI
hardcoded MetricType::L2). use_residual is rejected with a clear error —
lance-core's training path doesn't honor it, and silently dropping the flag
would degrade recall.

Tests:
- 5 happy-path cases for IVF_FLAT/IVF_PQ/IVF_SQ creation, recreate-replaces,
  and missing-num_partitions failure.
- metric_type='cosine' is exercised on IVF_SQ (regression guard for the
  lance 6 silent-L2 bug fixed in lance 7).
- use_residual rejection is asserted explicitly.

Closes lance-format#165 in part. Builds on lance-format#479 (jiaoew1991) — rebased onto current main,
which already provides commitIndexSegments and a vector-friendly Kryo serializer
for Index, simplifying the original PR by dropping its LanceIndexHandle shim.

Co-Authored-By: Enwei Jiao <jiaoew2011@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Replace test now captures segment UUIDs after first run, asserts
  disjointness after recreate, and confirms segments cover every fragment
  exactly once. Mirrors testRepeatedCreateZonemapIndexReplacesExistingSegments.
- Subtype assertion via describeIndices().getIndexType(): manifest
  Index#indexType() reports umbrella VECTOR for all IVF variants, so a
  regression that builds IVF_FLAT when the user asked for IVF_PQ would have
  been invisible. Asserting "IVF_PQ"/"IVF_FLAT"/"IVF_SQ" string from
  describeIndices catches that.
- Recall test: clustered embeddings (8 clusters x 64 rows, 8-dim, sigma=0.1)
  with metric_type='cosine' on IVF_PQ. Queries each cluster centroid via
  VECTOR_SEARCH, asserts top-K returns mostly the matching cluster
  (recall >= 0.5). This is the test that actually proves metric_type is
  honored end-to-end — without it the lance 6 silent-L2 bug would have
  been invisible to CI.
- Multi-column rejection test (mirrors testZonemapRejectsMultipleColumns).
- Bad metric_type assertThrows.

9 new vector tests total: FLAT/PQ/SQ creation + subtype, recreate-replaces,
recall, multi-column rejection, missing-num_partitions, bad metric, use_residual.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sezruby sezruby force-pushed the feat/ivf-create-index branch from 8da1b01 to 53caaff Compare June 12, 2026 03:07
Development

Successfully merging this pull request may close these issues.

Support build IVF index distributively in Spark

2 participants