Storages: Reusing distance results when scan vector index #10103
Signed-off-by: “EricZequan” <zequany33@gmail.com>
Please fix the format.
The rest looks good
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: breezewish, Lloyd-Pottiger The full list of commands accepted by this bot can be found here. The pull request process is described here
Signed-off-by: “EricZequan” <zequany33@gmail.com>
/merge
e4a5379 to 0e29a68
Signed-off-by: “EricZequan” <zequany33@gmail.com>
@EricZequan: Your PR was out of date, I have automatically updated it for you. At the same time I will also trigger all tests for you: /run-all-tests
If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
/hold
/retest
/unhold
/hold Need to include a bug fix.
Signed-off-by: “EricZequan” <zequany33@gmail.com>
…pick-cse395 Signed-off-by: “EricZequan” <zequany33@gmail.com>
Signed-off-by: “EricZequan” <zequany33@gmail.com>
@breezewish update the bug fix, PTAL~
@EricZequan Please update the comment to use the new one. The rest looks good.
Signed-off-by: “EricZequan” <zequany33@gmail.com>
@breezewish updated~
Signed-off-by: “EricZequan” <zequany33@gmail.com>
/unhold
/merge
What problem does this PR solve?
Issue Number: ref #9032
Problem Summary:
What is changed and how it works?
This PR introduces an optimization to the vector search for typical queries like:
In this query, currently (before this PR) we do the following things:
- Read `vec` and `id` (with only 10 rows), as if there is no vector index
- Compute `distance` based on `vec` and perform a TopN by `distance`, as if there is no vector index

This is because `ANNQueryInfo` is currently only a hint. It can be ignored, or may not take effect (for example, when the vector index has not finished building), and all results still remain correct.
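The pre-PR flow described above can be modeled with a small Python sketch. This is an illustration only, not TiFlash code: `Row`, `storage_scan`, `compute_topn`, the `hint_ids` parameter, and the choice of L2 distance are all assumptions made for the example. It shows why treating the ANN information as a hint is safe: the compute layer recomputes distances and runs TopN on whatever rows the storage layer returns.

```python
import heapq
import math
from typing import Iterable, List, NamedTuple, Optional, Set, Tuple


class Row(NamedTuple):
    id: int
    vec: Tuple[float, ...]


def l2_distance(a: Tuple[float, ...], b: Tuple[float, ...]) -> float:
    """Euclidean distance, standing in for any vec_xxx_distance function."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def storage_scan(rows: Iterable[Row], hint_ids: Optional[Set[int]] = None) -> List[Row]:
    """Hypothetical storage layer: always returns (id, vec) rows.

    hint_ids models the ANN hint. When an index lookup happened, only those
    rows come back; when the hint is ignored, every row comes back.
    """
    if hint_ids is None:
        return list(rows)
    return [row for row in rows if row.id in hint_ids]


def compute_topn(rows: Iterable[Row], query: Tuple[float, ...], k: int) -> List[Tuple[float, int]]:
    """Hypothetical compute layer: recomputes every distance, keeps the k smallest."""
    return heapq.nsmallest(k, ((l2_distance(row.vec, query), row.id) for row in rows))


table = [Row(1, (0.0, 0.0)), Row(2, (3.0, 4.0)), Row(3, (1.0, 0.0)), Row(4, (10.0, 10.0))]
query = (0.0, 0.0)

# Hint applied: storage returns only the index candidates.
with_hint = compute_topn(storage_scan(table, hint_ids={1, 3}), query, k=2)
# Hint ignored: storage returns everything; the answer is the same.
without_hint = compute_topn(storage_scan(table), query, k=2)
```

In both cases the storage layer still has to ship `vec`, and the compute layer still recomputes distances, which is the cost the PR targets.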
The optimization in this PR touches the plan on the TiDB side:
- `bool enable_distance_proj` is added in `ANNQueryInfo`, indicating whether the storage layer should additionally produce a distance column.
- `ColumnInfo column` is added in `ANNQueryInfo`, providing the info of the vector column for reading.

When `enable_distance_proj == true`, TiDB removes the vector column from the tableScan plan and uses a `Nullable(float32)` column in its place instead of reading the vector column.

TiFlash storage must:
A typical plan looks like this:
With this design, when there is only stable data, TiFlash can do the following to make things faster:
- Read `id` and `distance` (with only N rows). `vec` does not need to be read.
- Perform a TopN by `distance`

As you can see, we eliminated the data read for vector columns and reduced distance computation.
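The stable-only fast path above can be sketched in Python as follows. Only the flag name `enable_distance_proj` comes from the PR description; `AnnQueryInfo`'s other fields, `ToyVectorIndex` (an exact search standing in for a real ANN index), `StableStore`, and the scan functions are hypothetical names invented for this illustration. The `vec_reads` counter makes the saving visible: the legacy scan reads the vector column for every hit, while the distance-projection scan reads none.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance, standing in for any vec_xxx_distance function."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


@dataclass
class AnnQueryInfo:
    query: Vector               # hypothetical field
    top_k: int                  # hypothetical field
    enable_distance_proj: bool  # flag named in the PR description


class ToyVectorIndex:
    """Exact search standing in for a real ANN index (assumption)."""

    def __init__(self, vectors: Dict[int, Vector]) -> None:
        self._vectors = dict(vectors)

    def search(self, query: Vector, k: int) -> List[Tuple[int, float]]:
        scored = [(row_id, l2_distance(vec, query)) for row_id, vec in self._vectors.items()]
        scored.sort(key=lambda pair: pair[1])
        return scored[:k]


class StableStore:
    """Hypothetical stable layer that counts reads of the vector column."""

    def __init__(self, vectors: Dict[int, Vector]) -> None:
        self._vectors = dict(vectors)
        self.index = ToyVectorIndex(vectors)
        self.vec_reads = 0

    def read_vec(self, row_id: int) -> Vector:
        self.vec_reads += 1
        return self._vectors[row_id]


def scan_legacy(store: StableStore, info: AnnQueryInfo) -> List[Tuple[int, Vector]]:
    """Before: index hits are turned back into (id, vec) rows."""
    hits = store.index.search(info.query, info.top_k)
    return [(row_id, store.read_vec(row_id)) for row_id, _ in hits]


def scan_with_distance_proj(store: StableStore, info: AnnQueryInfo) -> List[Tuple[int, float]]:
    """After: index hits are returned directly as (id, distance) rows."""
    if not info.enable_distance_proj:
        raise ValueError("distance projection was not requested")
    return store.index.search(info.query, info.top_k)


def topn_by_distance(rows: List[Tuple[int, float]], k: int) -> List[Tuple[int, float]]:
    """Compute layer: no distance computation, only a TopN on the given column."""
    return sorted(rows, key=lambda pair: pair[1])[:k]


store = StableStore({1: (0.0, 0.0), 2: (3.0, 4.0), 3: (1.0, 0.0)})
info = AnnQueryInfo(query=(0.0, 0.0), top_k=2, enable_distance_proj=True)

rows = scan_with_distance_proj(store, info)
reads_after_fast_path = store.vec_reads
result = topn_by_distance(rows, info.top_k)

legacy_rows = scan_legacy(store, info)
reads_after_legacy = store.vec_reads
```

The sketch reuses the distances the index already computed during lookup, which is the "reusing distance results" idea in the PR title.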
When there are delta layers, or index building is not finished, we cannot look up the vector index. In this case, the corresponding rows of the `distance` column are filled by evaluating the `vec_xxx_distance` function. The storage layer performs this computation, so the compute layer does not need to calculate the distance and only performs a TopN.
Performance
ref: https://github.com/zilliztech/VectorDBBench
500K/1536d dataset, tested with stable data only.
Check List
Tests
Side effects
Documentation
Release note