update vector search docs #18779

Open: wants to merge 40 commits into base: master

qiancai
Collaborator

@qiancai qiancai commented Sep 2, 2024

First-time contributors' checklist

What is changed, added or deleted? (Required)

This PR moves 15 vector search docs from the tidb-cloud folder to the vector-search folder so that they can be reused by TiDB Self-Managed docs.

Which TiDB version(s) do your changes apply to? (Required)

Tips for choosing the affected version(s):

By default, CHOOSE MASTER ONLY so your changes will be applied to the next TiDB major or minor releases. If your PR involves a product feature behavior change or a compatibility change, CHOOSE THE AFFECTED RELEASE BRANCH(ES) AND MASTER.

For details, see tips for choosing the affected versions.

  • master (the latest development version)
  • v8.4 (TiDB 8.4 versions)
  • v8.3 (TiDB 8.3 versions)
  • v8.2 (TiDB 8.2 versions)
  • v8.1 (TiDB 8.1 versions)
  • v7.5 (TiDB 7.5 versions)
  • v7.1 (TiDB 7.1 versions)
  • v6.5 (TiDB 6.5 versions)
  • v6.1 (TiDB 6.1 versions)
  • v5.4 (TiDB 5.4 versions)
  • v5.3 (TiDB 5.3 versions)

What is the related PR or file link(s)?

  • This PR is translated from:
  • Other reference link(s):

Do your changes match any of the following descriptions?

  • Delete files
  • Change aliases
  • Need modification after applied to another branch
  • Might cause conflicts after applied to another branch

@ti-chi-bot ti-chi-bot bot added missing-translation-status This PR does not have translation status info. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 2, 2024
@qiancai qiancai added the translation/no-need No need to translate this PR. label Sep 2, 2024
@ti-chi-bot ti-chi-bot bot removed the missing-translation-status This PR does not have translation status info. label Sep 2, 2024
@qiancai qiancai changed the base branch from master to v8.4-vector-search September 2, 2024 09:54
@qiancai qiancai self-assigned this Sep 2, 2024
@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 3, 2024
@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Sep 3, 2024

ti-chi-bot bot commented Sep 3, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-09-03 02:39:01.388127072 +0000 UTC m=+325665.906179996: ☑️ agreed by Oreoxmt.

@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 3, 2024
@qiancai qiancai changed the base branch from v8.4-vector-search to master September 3, 2024 07:00
@qiancai qiancai added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 3, 2024
@qiancai qiancai changed the title from "reuse vector search docs as a base" to "update vector search docs" Sep 3, 2024

ti-chi-bot bot commented Sep 3, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from qiancai, ensuring that each of them provides their approval before proceeding. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Oreoxmt
Collaborator

Oreoxmt commented Sep 3, 2024

/approve cancel

@ti-chi-bot ti-chi-bot bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 6, 2024

ti-chi-bot bot commented Sep 6, 2024

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@qiancai qiancai added v8.4 This PR/issue applies to TiDB v8.4. translation/from-docs-cn This PR is translated from a PR in pingcap/docs-cn. labels Sep 14, 2024

## Prerequisites

To complete this tutorial, you need:

- [Python 3.8 or higher](https://www.python.org/downloads/) installed.
- [Git](https://git-scm.com/downloads) installed.
- A TiDB Cloud Serverless cluster. Follow [creating a TiDB Cloud Serverless cluster](/tidb-cloud/create-tidb-cluster-serverless.md) to create your own TiDB Cloud cluster if you don't have one.
- A TiDB cluster.
Collaborator

Should we specify that v8.4.0 or higher is required for TiDB Self-Managed clusters?

Collaborator Author

@Oreoxmt Good idea. How about we add that in L45?

vector-search-get-started-using-sql.md (outdated, resolved)
vector-search-get-started-using-sql.md (outdated, resolved)
vector-search-get-started-using-sql.md (outdated, resolved)
vector-search-get-started-using-python.md (outdated, resolved)
vector-search-improve-performance.md (outdated, resolved)
>
> This section is only applicable to [TiDB Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-serverless) clusters.

Define a 3-dimensional vector column and optimize it with a [vector search index](https://docs.pingcap.com/tidbcloud/vector-search-index) (HNSW index).
Collaborator

Suggested change
Define a 3-dimensional vector column and optimize it with a [vector search index](https://docs.pingcap.com/tidbcloud/vector-search-index) (HNSW index).
Define a 3-dimensional vector column and optimize it with a [vector search index](/vector-search-index.md) (HNSW index).

# For example: TIDB_DATABASE_URL="mysql+pymysql://root@127.0.0.1:4000/test"
```

If you are running TiDB on your local machine, `HOST` is `127.0.0.1` by default. The initial `PASSWORD` is empty, so if you are starting the cluster for the first time, you can omit this field.
Collaborator

Suggested change
If you are running TiDB on your local machine, `HOST` is `127.0.0.1` by default. The initial `PASSWORD` is empty, so if you are starting the cluster for the first time, you can omit this field.
If you are running TiDB on your local machine, `<HOST>` is `127.0.0.1` by default. The initial `<PASSWORD>` is empty, so if you are starting the cluster for the first time, you can omit this field.


If you are running TiDB on your local machine, `HOST` is `127.0.0.1` by default. The initial `PASSWORD` is empty, so if you are starting the cluster for the first time, you can omit this field.

The following are descriptions for each parameter:
Collaborator

Suggested change
The following are descriptions for each parameter:
The following are descriptions for each placeholder:

Collaborator Author

"parameter" here matches "connection parameters" in L124

Comment on lines 135 to 139
- `<HOST>`: The host of the TiDB cluster.
- `<PORT>`: The port of the TiDB cluster.
- `<USER>`: The username to connect to the TiDB cluster.
- `<PASSWORD>`: The password to connect to the TiDB cluster.
- `<DATABASE>`: The name of the database you want to connect to.
Collaborator

Suggested change
- `<HOST>`: The host of the TiDB cluster.
- `<PORT>`: The port of the TiDB cluster.
- `<USER>`: The username to connect to the TiDB cluster.
- `<PASSWORD>`: The password to connect to the TiDB cluster.
- `<DATABASE>`: The name of the database you want to connect to.
- `<USER>`: The username to connect to the TiDB cluster.
- `<PASSWORD>`: The password to connect to the TiDB cluster.
- `<HOST>`: The host of the TiDB cluster.
- `<PORT>`: The port of the TiDB cluster.
- `<DATABASE>`: The name of the database you want to connect to.
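
For reference, the reordered list follows the order in which the placeholders appear in the connection string. The following is a minimal sketch, assuming the URL has the shape `mysql+pymysql://<USER>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>` implied by the `TIDB_DATABASE_URL` example in this doc. `build_tidb_url` is a hypothetical helper used only for illustration and is not part of the docs or of any library.

```python
from urllib.parse import quote


def build_tidb_url(user: str, password: str, host: str, port: int, database: str) -> str:
    """Assemble a SQLAlchemy-style URL from the five placeholders (hypothetical helper)."""
    # Percent-encode the credentials so characters such as "@" or ":" do not break the URL.
    credentials = quote(user, safe="")
    # An empty password is omitted entirely, as in the local-cluster example.
    if password:
        credentials += ":" + quote(password, safe="")
    return f"mysql+pymysql://{credentials}@{host}:{port}/{database}"


# Local cluster started for the first time: the initial password is empty.
print(build_tidb_url("root", "", "127.0.0.1", 4000, "test"))
```

With the values above, the result matches the example in the doc: `mysql+pymysql://root@127.0.0.1:4000/test`.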

vector-search-data-types.md (outdated, resolved)
vector-search-data-types.md (outdated, resolved)
vector-search-data-types.md (outdated, resolved)
vector-search-data-types.md (outdated, resolved)
vector-search-data-types.md (outdated, resolved)
vector-search-improve-performance.md (outdated, resolved)
vector-search-improve-performance.md (outdated, resolved)
vector-search-improve-performance.md (outdated, resolved)
vector-search-improve-performance.md (outdated, resolved)
vector-search-improve-performance.md (outdated, resolved)
vector-search-limitations.md (outdated, resolved)
github-actions bot pushed a commit to qiancai/pingcap-docsite-preview that referenced this pull request Oct 17, 2024
github-actions bot pushed a commit to qiancai/pingcap-docsite-preview that referenced this pull request Oct 17, 2024
github-actions bot pushed a commit to qiancai/pingcap-docsite-preview that referenced this pull request Oct 17, 2024
TOC.md Outdated
Collaborator Author

- ORM Libraries
- [SQLAlchemy](/vector-search-integrate-with-sqlalchemy.md)
- [peewee](/vector-search-integrate-with-peewee.md)
- [Django ORM](/vector-search-integrate-with-django-orm.md)
Collaborator

Suggested change
- [Django ORM](/vector-search-integrate-with-django-orm.md)
- [Django](/vector-search-integrate-with-django-orm.md)


## Create the HNSW vector index

[HNSW](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world) is one of the most popular vector indexing algorithms. The HNSW index provides good performance with relatively high accuracy (> 98% in typical cases).
Collaborator

Suggested change
[HNSW](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world) is one of the most popular vector indexing algorithms. The HNSW index provides good performance with relatively high accuracy (> 98% in typical cases).
[HNSW](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world) is one of the most popular vector indexing algorithms. The HNSW index provides good performance with relatively high accuracy, up to 98% in specific cases.

- Cosine Distance: `((VEC_COSINE_DISTANCE(embedding)))`
- L2 Distance: `((VEC_L2_DISTANCE(embedding)))`

The vector index can only be created for fixed-dimensional vector columns, such as a column defined as `VECTOR(3)`. It cannot be created for mixed-dimensional vector columns (such as a column defined as `VECTOR`) because vector distances can only be calculated between vectors with the same dimensions.
Collaborator

Suggested change
The vector index can only be created for fixed-dimensional vector columns, such as a column defined as `VECTOR(3)`. It cannot be created for mixed-dimensional vector columns (such as a column defined as `VECTOR`) because vector distances can only be calculated between vectors with the same dimensions.
The vector index can only be created for fixed-dimensional vector columns, such as a column defined as `VECTOR(3)`. It cannot be created for mixed-dimensional vector columns (such as a column defined as `VECTOR`) because vector distances can only be calculated between vectors with the same dimension.
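
To illustrate why both vectors must have the same dimension, here is a minimal sketch of the two metrics using their textbook definitions, with cosine distance taken as 1 minus cosine similarity. The function names are hypothetical, and this is not how TiDB implements `VEC_L2_DISTANCE` or `VEC_COSINE_DISTANCE`; it only shows that both formulas pair up elements position by position.

```python
import math
from typing import Sequence


def _check_same_dimension(a: Sequence[float], b: Sequence[float]) -> None:
    # Both formulas below pair a[i] with b[i], so the lengths must match.
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two vectors of equal dimension."""
    _check_same_dimension(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; undefined (division by zero) for zero vectors."""
    _check_same_dimension(a, b)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(l2_distance([0, 0, 0], [3, 4, 0]))      # two 3-dimensional vectors
print(cosine_distance([1, 0, 0], [0, 1, 0]))  # orthogonal vectors

try:
    l2_distance([1, 2, 3], [1, 2])            # 3 dimensions vs 2 dimensions
except ValueError as err:
    print(err)
```

In this sketch, the last call fails because there is no element in the second vector to pair with the third element of the first.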

LIMIT 10
```

You must use the same distance metric as you have defined when creating the vector index if you want to utilize the index in vector search.
Collaborator

The Chinese version does not have this sentence, and it seems to mean the same as L77. It looks like it can be deleted.

Suggested change
You must use the same distance metric as you have defined when creating the vector index if you want to utilize the index in vector search.


Explanation of some important fields:

- `vector_index.load.total`: The total duration of loading index. This field could be larger than actual query time because multiple vector indexes may be loaded in parallel.
Collaborator

Suggested change
- `vector_index.load.total`: The total duration of loading index. This field could be larger than actual query time because multiple vector indexes may be loaded in parallel.
- `vector_index.load.total`: The total duration of loading index. This field might be larger than the actual query time because multiple vector indexes might be loaded in parallel.

- `vector_index.load.from_s3`: Number of indexes loaded from S3.
- `vector_index.load.from_disk`: Number of indexes loaded from disk. The index was already downloaded from S3 previously.
- `vector_index.load.from_cache`: Number of indexes loaded from cache. The index was already downloaded from S3 previously.
- `vector_index.search.total`: The total duration of searching in the index. Large latency usually means the index is cold (never accessed before, or accessed long ago) so that there are heavy I/O operations when searching through the index. This field could be larger than actual query time because multiple vector indexes might be searched in parallel.
Collaborator

Suggested change
- `vector_index.search.total`: The total duration of searching in the index. Large latency usually means the index is cold (never accessed before, or accessed long ago) so that there are heavy I/O operations when searching through the index. This field could be larger than actual query time because multiple vector indexes might be searched in parallel.
- `vector_index.search.total`: The total duration of searching in the index. Large latency usually means the index is cold (never accessed before, or accessed long ago) so that there are heavy I/O operations when searching through the index. This field might be larger than the actual query time because multiple vector indexes might be searched in parallel.
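
To make the "larger than the actual query time" wording concrete, here is a minimal arithmetic sketch. It assumes that the reported total is the sum of per-index durations and that the indexes are processed fully in parallel; both are assumptions for illustration, and the durations are made-up numbers rather than real `EXPLAIN ANALYZE` output.

```python
from typing import Sequence


def summed_duration(durations: Sequence[float]) -> float:
    """Total time spent across all indexes, counted per index (assumed reporting model)."""
    return sum(durations)


def parallel_wall_clock(durations: Sequence[float]) -> float:
    """Elapsed time if every index is processed at the same time."""
    return max(durations, default=0.0)


# Hypothetical per-index durations in seconds.
per_index = [0.12, 0.10, 0.15]

print(f"summed: {summed_duration(per_index):.2f}s")
print(f"wall clock: {parallel_wall_clock(per_index):.2f}s")
```

Under these assumptions, the summed value exceeds the wall-clock value whenever more than one index takes a nonzero amount of time, which is why "might" reads better than a definite statement.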

Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-1-more-lgtm Indicates a PR needs 1 more LGTM. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. translation/from-docs-cn This PR is translated from a PR in pingcap/docs-cn. translation/no-need No need to translate this PR. v8.4 This PR/issue applies to TiDB v8.4.

7 participants