update vector search docs #18779
## Prerequisites

To complete this tutorial, you need:

- [Python 3.8 or higher](https://www.python.org/downloads/) installed.
- [Git](https://git-scm.com/downloads) installed.
- A TiDB Cloud Serverless cluster. Follow [creating a TiDB Cloud Serverless cluster](/tidb-cloud/create-tidb-cluster-serverless.md) to create your own TiDB Cloud cluster if you don't have one.
- A TiDB cluster.
Should we specify that v8.4.0 or higher is required for TiDB Self-Managed clusters?
@Oreoxmt Good idea. How about we add that in L45?
>
> This section is only applicable to [TiDB Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-serverless) clusters.

Define a 3-dimensional vector column and optimize it with a [vector search index](https://docs.pingcap.com/tidbcloud/vector-search-index) (HNSW index).
Define a 3-dimensional vector column and optimize it with a [vector search index](https://docs.pingcap.com/tidbcloud/vector-search-index) (HNSW index).

Define a 3-dimensional vector column and optimize it with a [vector search index](/vector-search-index.md) (HNSW index).

# For example: TIDB_DATABASE_URL="mysql+pymysql://root@127.0.0.1:4000/test"
```

If you are running TiDB on your local machine, `HOST` is `127.0.0.1` by default. The initial `PASSWORD` is empty, so if you are starting the cluster for the first time, you can omit this field.
If you are running TiDB on your local machine, `HOST` is `127.0.0.1` by default. The initial `PASSWORD` is empty, so if you are starting the cluster for the first time, you can omit this field.

If you are running TiDB on your local machine, `<HOST>` is `127.0.0.1` by default. The initial `<PASSWORD>` is empty, so if you are starting the cluster for the first time, you can omit this field.

If you are running TiDB on your local machine, `HOST` is `127.0.0.1` by default. The initial `PASSWORD` is empty, so if you are starting the cluster for the first time, you can omit this field.

The following are descriptions for each parameter:
The following are descriptions for each parameter:

The following are descriptions for each placeholder:
"parameter" here matches "connection parameters" in L124
- `<HOST>`: The host of the TiDB cluster.
- `<PORT>`: The port of the TiDB cluster.
- `<USER>`: The username to connect to the TiDB cluster.
- `<PASSWORD>`: The password to connect to the TiDB cluster.
- `<DATABASE>`: The name of the database you want to connect to.
- `<HOST>`: The host of the TiDB cluster.
- `<PORT>`: The port of the TiDB cluster.
- `<USER>`: The username to connect to the TiDB cluster.
- `<PASSWORD>`: The password to connect to the TiDB cluster.
- `<DATABASE>`: The name of the database you want to connect to.

- `<USER>`: The username to connect to the TiDB cluster.
- `<PASSWORD>`: The password to connect to the TiDB cluster.
- `<HOST>`: The host of the TiDB cluster.
- `<PORT>`: The port of the TiDB cluster.
- `<DATABASE>`: The name of the database you want to connect to.
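As a minimal sketch of how these placeholders fit together, the following Python helper assembles a connection URL in the same `mysql+pymysql://` form as the `TIDB_DATABASE_URL` example quoted earlier. The function name `build_tidb_url` is hypothetical and is not part of the docs under review. It assumes the password segment can simply be left out when the password is empty, as with a freshly started local cluster.

```python
from urllib.parse import quote_plus


def build_tidb_url(user: str, password: str, host: str, port: int, database: str) -> str:
    """Assemble a mysql+pymysql URL from <USER>, <PASSWORD>, <HOST>, <PORT>, <DATABASE>.

    Hypothetical helper for illustration only.
    """
    # Percent-encode credentials so characters such as '@' or ':' do not break the URL.
    credentials = quote_plus(user)
    if password:
        # Only add the ':<PASSWORD>' segment when a password is actually set.
        credentials += ":" + quote_plus(password)
    return f"mysql+pymysql://{credentials}@{host}:{port}/{database}"


# Local cluster started for the first time: empty password, default host and port.
local_url = build_tidb_url("root", "", "127.0.0.1", 4000, "test")
print(local_url)
```

The argument order follows the order in which the values appear in the URL, which is also the order proposed in the suggested change.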
Signed-off-by: JaySon-Huang <tshent@qq.com>
Co-authored-by: Aolin <aolinz@outlook.com>
TOC.md
Outdated
- ORM Libraries
- [SQLAlchemy](/vector-search-integrate-with-sqlalchemy.md)
- [peewee](/vector-search-integrate-with-peewee.md)
- [Django ORM](/vector-search-integrate-with-django-orm.md)
- [Django ORM](/vector-search-integrate-with-django-orm.md)

- [Django](/vector-search-integrate-with-django-orm.md)

## Create the HNSW vector index

[HNSW](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world) is one of the most popular vector indexing algorithms. The HNSW index provides good performance with relatively high accuracy (> 98% in typical cases).
[HNSW](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world) is one of the most popular vector indexing algorithms. The HNSW index provides good performance with relatively high accuracy (> 98% in typical cases).

[HNSW](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world) is one of the most popular vector indexing algorithms. The HNSW index provides good performance with relatively high accuracy, up to 98% in specific cases.

- Cosine Distance: `((VEC_COSINE_DISTANCE(embedding)))`
- L2 Distance: `((VEC_L2_DISTANCE(embedding)))`

The vector index can only be created for fixed-dimensional vector columns, such as a column defined as `VECTOR(3)`. It cannot be created for mixed-dimensional vector columns (such as a column defined as `VECTOR`) because vector distances can only be calculated between vectors with the same dimensions.
The vector index can only be created for fixed-dimensional vector columns, such as a column defined as `VECTOR(3)`. It cannot be created for mixed-dimensional vector columns (such as a column defined as `VECTOR`) because vector distances can only be calculated between vectors with the same dimensions.

The vector index can only be created for fixed-dimensional vector columns, such as a column defined as `VECTOR(3)`. It cannot be created for mixed-dimensional vector columns (such as a column defined as `VECTOR`) because vector distances can only be calculated between vectors with the same dimension.
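To illustrate the same-dimension requirement, here is a small plain-Python sketch of an L2 distance that rejects vectors of different lengths. It is an illustration only and does not reflect how TiDB computes `VEC_L2_DISTANCE` internally; the function name `l2_distance` and the error message are assumptions made for this example.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance, defined only for vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Both vectors are 3-dimensional, as in a VECTOR(3) column, so the distance is defined.
same_dim = l2_distance([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

# A 3-dimensional and a 2-dimensional vector cannot be compared.
try:
    l2_distance([1.0, 2.0, 3.0], [1.0, 2.0])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

print(same_dim, mismatch_error)
```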
LIMIT 10
```

You must use the same distance metric as you have defined when creating the vector index if you want to utilize the index in vector search.
The Chinese version does not have this sentence, and its meaning seems to be the same as L77? It looks like it can be deleted.
You must use the same distance metric as you have defined when creating the vector index if you want to utilize the index in vector search.
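One way to see why the metric must match is that cosine distance and L2 distance can rank the same rows differently, so an index organized around one metric is not a valid shortcut for a query ordered by the other. The sketch below uses plain Python with made-up vectors and a hypothetical `cosine_distance` helper. It says nothing about TiDB's internal implementation.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; depends only on the angle between the vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


# Illustrative 3-dimensional data (assumed values, not taken from the docs).
query = [1.0, 0.0, 0.0]
rows = {
    "a": [10.0, 0.0, 0.0],  # same direction as the query, but far away
    "b": [1.0, 0.5, 0.0],   # close to the query, but at a slight angle
}

# Rank the rows by each metric; math.dist is the Euclidean (L2) distance.
by_cosine = sorted(rows, key=lambda k: cosine_distance(rows[k], query))
by_l2 = sorted(rows, key=lambda k: math.dist(rows[k], query))

print(by_cosine, by_l2)
```

Here the nearest row under cosine distance is `a`, while the nearest row under L2 distance is `b`.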

Explanation of some important fields:

- `vector_index.load.total`: The total duration of loading index. This field could be larger than actual query time because multiple vector indexes may be loaded in parallel.
- `vector_index.load.total`: The total duration of loading index. This field could be larger than actual query time because multiple vector indexes may be loaded in parallel.

- `vector_index.load.total`: The total duration of loading index. This field might be larger than the actual query time because multiple vector indexes might be loaded in parallel.

- `vector_index.load.from_s3`: Number of indexes loaded from S3.
- `vector_index.load.from_disk`: Number of indexes loaded from disk. The index was already downloaded from S3 previously.
- `vector_index.load.from_cache`: Number of indexes loaded from cache. The index was already downloaded from S3 previously.
- `vector_index.search.total`: The total duration of searching in the index. Large latency usually means the index is cold (never accessed before, or accessed long ago) so that there are heavy I/O operations when searching through the index. This field could be larger than actual query time because multiple vector indexes might be searched in parallel.
- `vector_index.search.total`: The total duration of searching in the index. Large latency usually means the index is cold (never accessed before, or accessed long ago) so that there are heavy I/O operations when searching through the index. This field could be larger than actual query time because multiple vector indexes might be searched in parallel.

- `vector_index.search.total`: The total duration of searching in the index. Large latency usually means the index is cold (never accessed before, or accessed long ago) so that there are heavy I/O operations when searching through the index. This field might be larger than the actual query time because multiple vector indexes might be searched in parallel.
First-time contributors' checklist
What is changed, added or deleted? (Required)
This PR moves 15 vector search docs from the tidb-cloud folder to the vector-search folder so that they can be reused by the TiDB Self-Managed docs.
Which TiDB version(s) do your changes apply to? (Required)
Tips for choosing the affected version(s):
By default, CHOOSE MASTER ONLY so your changes will be applied to the next TiDB major or minor releases. If your PR involves a product feature behavior change or a compatibility change, CHOOSE THE AFFECTED RELEASE BRANCH(ES) AND MASTER.
For details, see tips for choosing the affected versions.
What is the related PR or file link(s)?
Do your changes match any of the following descriptions?