Storages: Reusing distance results when scan vector index #10103

Merged
merged 16 commits into from
Apr 16, 2025

EricZequan
Contributor

@EricZequan EricZequan commented Apr 14, 2025

What problem does this PR solve?

Issue Number: ref #9032

Problem Summary:

What is changed and how it works?

This PR introduces an optimization to the vector search for typical queries like:

select id from t1 order by vec_cosine_distance(vec, '[1,1,1]') limit 10;

Before this PR, this query is executed as follows:

  • Look up the vector index to find the row IDs of the top 10 vectors
  • Look up the vector index again to fill in the vector data of these 10 vectors
  • The storage layer produces columns vec and id (with only 10 rows), as if there were no vector index
  • The compute layer computes the distance from vec and performs a TopN by distance, as if there were no vector index

This is because ANNQueryInfo is currently only a hint. It can be ignored or have no effect (for example, when the vector index has not finished building), and all results remain correct.
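The compute-layer step described above can be sketched as follows. This is only an illustration in Python, not TiFlash code: the function names are hypothetical, cosine distance is taken as 1 minus cosine similarity, and vectors are assumed to be non-zero.

```python
import heapq
import math


def cosine_distance(a, b):
    """1 - cosine similarity. Assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def topn_from_vec_column(rows, ref_vec, limit):
    """Hypothetical compute-layer step before this PR.

    rows: (id, vec) pairs produced by the storage layer.
    The distance is recomputed from vec even if the index already knew it.
    """
    scored = [(cosine_distance(vec, ref_vec), row_id) for row_id, vec in rows]
    return [row_id for _, row_id in heapq.nsmallest(limit, scored)]


rows = [(1, [1, 0, 0]), (2, [1, 1, 0]), (3, [-1, 0, 0])]
top2 = topn_from_vec_column(rows, [1, 0, 0], 2)
```

Because the TopN is recomputed from vec in any case, the result does not depend on whether the storage layer used the index to pre-filter rows, which is why the hint can be dropped safely.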

The optimization in this PR changes the plan on the TiDB side:

  • bool enable_distance_proj is added to ANNQueryInfo. It indicates whether the storage layer should additionally produce a distance column.
  • ColumnInfo column is added to ANNQueryInfo. It provides the vector column's info for reading.

When enable_distance_proj == true, TiDB removes the vector column from the tableScan plan and uses a Nullable(float32) column in its place.
TiFlash storage must then:

  • For dmfile and columnfile, read the index for column ID -2000 and fill the column with the distance results.
  • For data in memory, read the vector data for column ID -2000, construct a constColumn (filled with ref_vec from annqueryInfo), calculate the distance, and fill the distance column with the result.
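The effect on the scan's output columns can be sketched as follows. This is a rough model under stated assumptions: the classes below are hypothetical stand-ins for the real ANNQueryInfo and ColumnInfo messages, the column name "distance" is invented, and treating -2000 as the ID of the distance column is this sketch's reading of the description. Only the names enable_distance_proj, column, and ref_vec, the ID -2000, and the Nullable(float32) type come from the description.

```python
from dataclasses import dataclass
from typing import List

DISTANCE_COLUMN_ID = -2000  # column ID mentioned in the PR description


@dataclass
class ColumnSketch:
    """Hypothetical stand-in for a ColumnInfo entry."""
    column_id: int
    name: str
    type_name: str


@dataclass
class AnnQueryInfoSketch:
    """Hypothetical stand-in for ANNQueryInfo with the two new fields."""
    ref_vec: List[float]
    enable_distance_proj: bool
    column: ColumnSketch  # info about the vector column


def table_scan_columns(requested, info):
    """Return the columns the table scan outputs for this query."""
    if not info.enable_distance_proj:
        return list(requested)  # unchanged: the scan still reads vec
    # Drop the vector column and output a distance column in its place.
    out = [c for c in requested if c.column_id != info.column.column_id]
    out.append(ColumnSketch(DISTANCE_COLUMN_ID, "distance", "Nullable(float32)"))
    return out


id_col = ColumnSketch(1, "id", "Int64")
vec_col = ColumnSketch(2, "vec", "Array(Float32)")
info = AnnQueryInfoSketch([1.0, 1.0, 1.0], True, vec_col)
scan_cols = table_scan_columns([id_col, vec_col], info)
```

With the flag off, the scan returns id and vec as before, so older plans keep working unchanged.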

A typical plan looks like this:

mysql> explain select /*+ read_from_storage(tiflash[t1]) */ id from t1 order by vec_cosine_distance(vec, '[1,1,1]') limit 10;
+--------------------------------+---------+--------------+------------------------------------+------------------------------------------------------------------------------------+
| id                             | estRows | task         | access object                      | operator info                                                                      |
+--------------------------------+---------+--------------+------------------------------------+------------------------------------------------------------------------------------+
| TopN_11                        | 3.00    | root         |                                    | Column#11, offset:0, count:10                                                      |
| └─TableReader_25               | 3.00    | root         |                                    | MppVersion: 2, data:ExchangeSender_24                                              |
|   └─ExchangeSender_24          | 3.00    | mpp[tiflash] |                                    | ExchangeType: PassThrough                                                          |
|     └─TopN_23                    | 3.00    | mpp[tiflash] |                                    | Column#11, offset:0, count:10                                                      |
|       └─TableFullScan_21         | 3.00    | mpp[tiflash] | table:t1, index:idx_embedding(vec) | keep order:false, stats:pseudo, annIndex:COSINE(vec..[1,1,1], limit:10)->Column#11 |
+--------------------------------+---------+--------------+------------------------------------+------------------------------------------------------------------------------------+
6 rows in set (0.00 sec)

With this design, when there is only stable data, TiFlash can take a faster path:

  • Look up the vector index to get the row IDs and distances of the TopN vectors
  • The storage layer produces columns id and distance (with only N rows). The vec column does not need to be read.
  • The compute layer performs a TopN by distance

This eliminates the data read for the vector column and reduces distance computation.

When there are delta layers, or index building has not finished, the vector index cannot be used. In this case, the corresponding rows of the distance column are filled by evaluating the vec_xxx_distance function. The storage layer performs this computation, so the compute layer does not need to calculate the distance and only performs a TopN.
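A minimal sketch of how the two sources of distances could be combined, assuming cosine distance and non-zero vectors. The function names and the shape of the inputs are hypothetical and are not taken from the TiFlash code.

```python
import heapq
import math


def cosine_distance(a, b):
    """1 - cosine similarity. Assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def scan_with_distance(stable_index_hits, delta_rows, ref_vec):
    """Hypothetical storage-layer output: (id, distance) rows.

    stable_index_hits: (id, distance) pairs returned by the vector index.
    delta_rows: (id, vec) pairs that have no index coverage, so their
    distance is computed here instead of in the compute layer.
    """
    rows = list(stable_index_hits)
    rows += [(rid, cosine_distance(vec, ref_vec)) for rid, vec in delta_rows]
    return rows


def topn_by_distance(rows, limit):
    """Compute-layer step: TopN on the ready-made distance column."""
    return heapq.nsmallest(limit, rows, key=lambda r: r[1])


stable_hits = [(1, 0.2), (2, 0.5)]
delta = [(3, [2, 0, 0]), (4, [0, 3, 0])]
produced = scan_with_distance(stable_hits, delta, [1, 0, 0])
top2 = topn_by_distance(produced, 2)
```

Either way the compute layer sees the same (id, distance) schema, so the plan does not need to change when delta data is present.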


Performance
ref: https://github.com/zilliztech/VectorDBBench
Dataset: 500K/1536d, tested with stable data only.

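| metric  | version 8.5 | this PR  | improvement   |
|---------|-------------|----------|---------------|
| qps     | 200.2039    | 315.6918 | 57.68% higher |
| latency | 0.0372      | 0.0208   | 44.08% lower  |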
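The reported percentages can be checked as relative changes against the version 8.5 baseline. The helper name below is made up for this check. The results agree with the reported figures to within rounding of the displayed values.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


qps_change = pct_change(200.2039, 315.6918)   # positive: throughput went up
latency_change = pct_change(0.0372, 0.0208)   # negative: latency went down
```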

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

Signed-off-by: “EricZequan” <zequany33@gmail.com>
@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 14, 2025
@breezewish
Member

Please fix the format.

Member

@breezewish breezewish left a comment

The rest looks good

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Apr 14, 2025
Contributor

ti-chi-bot bot commented Apr 14, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: breezewish, Lloyd-Pottiger

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [Lloyd-Pottiger,breezewish]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Apr 14, 2025
Contributor

ti-chi-bot bot commented Apr 14, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-04-14 10:50:49.809919133 +0000 UTC m=+2685543.494155226: ☑️ agreed by breezewish.
  • 2025-04-14 10:57:06.22933461 +0000 UTC m=+2685919.913570703: ☑️ agreed by Lloyd-Pottiger.

Signed-off-by: “EricZequan” <zequany33@gmail.com>
@breezewish
Member

/merge

EricZequan and others added 3 commits April 15, 2025 10:10
Signed-off-by: “EricZequan” <zequany33@gmail.com>
Signed-off-by: “EricZequan” <zequany33@gmail.com>
Contributor

ti-chi-bot bot commented Apr 15, 2025

@EricZequan: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

trigger some heavy tests which will not run always when PR updated.

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@breezewish
Member

/hold

@ti-chi-bot ti-chi-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 15, 2025
Signed-off-by: “EricZequan” <zequany33@gmail.com>
@EricZequan
Contributor Author

/retest

@breezewish
Member

/unhold

@ti-chi-bot ti-chi-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 15, 2025
@breezewish
Member

/hold

Need to include a bug fix.

@ti-chi-bot ti-chi-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 15, 2025
EricZequan and others added 4 commits April 15, 2025 21:49
Signed-off-by: “EricZequan” <zequany33@gmail.com>
…pick-cse395

Signed-off-by: “EricZequan” <zequany33@gmail.com>
Signed-off-by: “EricZequan” <zequany33@gmail.com>
@EricZequan
Contributor Author

@breezewish update the bug fix, PTAL~

@breezewish
Member

@EricZequan Please update the comment to use the new one. The rest looks good.

breezewish and others added 2 commits April 16, 2025 10:23
Signed-off-by: “EricZequan” <zequany33@gmail.com>
@EricZequan
Contributor Author

@breezewish updated~

Signed-off-by: “EricZequan” <zequany33@gmail.com>
@breezewish
Member

/unhold

@ti-chi-bot ti-chi-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 16, 2025
@breezewish
Member

/merge

@ti-chi-bot ti-chi-bot bot merged commit ed9408b into pingcap:master Apr 16, 2025
5 checks passed
Labels
approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
3 participants