From 21cf22d48f1647bd15d8d823594b1ca73aa7bdf1 Mon Sep 17 00:00:00 2001 From: Docsite Preview Bot <> Date: Thu, 17 Oct 2024 05:41:17 +0000 Subject: [PATCH] Preview PR https://github.com/pingcap/docs/pull/18779 and this preview is triggered from commit https://github.com/pingcap/docs/pull/18779/commits/cb7e4dc1319617b1d525d083235bdd53d4099a08 --- .../en/tidb/master/TOC-tidb-cloud.md | 760 ++++++++++++++++++ markdown-pages/en/tidb/master/TOC.md | 64 +- .../master/br/backup-and-restore-overview.md | 144 ++++ .../en/tidb/master/dm/dm-overview.md | 95 +++ .../tidb/master/ticdc/ticdc-compatibility.md | 74 ++ .../data-service-manage-endpoint.md | 478 +++++++++++ .../tidb-cloud/tidb-cloud-release-notes.md | 349 ++++++++ .../en/tidb/master/tiflash-upgrade-guide.md | 133 +++ .../master/tiflash/tiflash-configuration.md | 371 +++++++++ .../tidb/master/vector-search-data-types.md | 254 ++++++ .../vector-search-functions-and-operators.md | 292 +++++++ .../vector-search-get-started-using-python.md | 251 ++++++ .../vector-search-get-started-using-sql.md | 195 +++++ .../vector-search-improve-performance.md | 52 ++ .../en/tidb/master/vector-search-index.md | 303 +++++++ ...vector-search-integrate-with-django-orm.md | 300 +++++++ ...-search-integrate-with-jinaai-embedding.md | 295 +++++++ .../vector-search-integrate-with-langchain.md | 655 +++++++++++++++ ...vector-search-integrate-with-llamaindex.md | 328 ++++++++ .../vector-search-integrate-with-peewee.md | 290 +++++++ ...vector-search-integrate-with-sqlalchemy.md | 259 ++++++ .../vector-search-integration-overview.md | 79 ++ .../tidb/master/vector-search-limitations.md | 67 ++ .../en/tidb/master/vector-search-overview.md | 88 ++ 24 files changed, 6165 insertions(+), 11 deletions(-) create mode 100644 markdown-pages/en/tidb/master/TOC-tidb-cloud.md create mode 100644 markdown-pages/en/tidb/master/br/backup-and-restore-overview.md create mode 100644 markdown-pages/en/tidb/master/dm/dm-overview.md create mode 100644 markdown-pages/en/tidb/master/ticdc/ticdc-compatibility.md create mode 100644 markdown-pages/en/tidb/master/tidb-cloud/data-service-manage-endpoint.md create mode 100644 markdown-pages/en/tidb/master/tidb-cloud/tidb-cloud-release-notes.md create mode 100644 markdown-pages/en/tidb/master/tiflash-upgrade-guide.md create mode 100644 markdown-pages/en/tidb/master/tiflash/tiflash-configuration.md create mode 100644 markdown-pages/en/tidb/master/vector-search-data-types.md create mode 100644 markdown-pages/en/tidb/master/vector-search-functions-and-operators.md create mode 100644 markdown-pages/en/tidb/master/vector-search-get-started-using-python.md create mode 100644 markdown-pages/en/tidb/master/vector-search-get-started-using-sql.md create mode 100644 markdown-pages/en/tidb/master/vector-search-improve-performance.md create mode 100644 markdown-pages/en/tidb/master/vector-search-index.md create mode 100644 markdown-pages/en/tidb/master/vector-search-integrate-with-django-orm.md create mode 100644 markdown-pages/en/tidb/master/vector-search-integrate-with-jinaai-embedding.md create mode 100644 markdown-pages/en/tidb/master/vector-search-integrate-with-langchain.md create mode 100644 markdown-pages/en/tidb/master/vector-search-integrate-with-llamaindex.md create mode 100644 markdown-pages/en/tidb/master/vector-search-integrate-with-peewee.md create mode 100644 markdown-pages/en/tidb/master/vector-search-integrate-with-sqlalchemy.md create mode 100644 markdown-pages/en/tidb/master/vector-search-integration-overview.md create mode 100644 
markdown-pages/en/tidb/master/vector-search-limitations.md create mode 100644 markdown-pages/en/tidb/master/vector-search-overview.md diff --git a/markdown-pages/en/tidb/master/TOC-tidb-cloud.md b/markdown-pages/en/tidb/master/TOC-tidb-cloud.md new file mode 100644 index 0000000..144979b --- /dev/null +++ b/markdown-pages/en/tidb/master/TOC-tidb-cloud.md @@ -0,0 +1,760 @@ + + + +- [Docs Home](https://docs.pingcap.com/) +- About TiDB Cloud + - [What is TiDB Cloud](/tidb-cloud/tidb-cloud-intro.md) + - [Architecture](/tidb-cloud/tidb-cloud-intro.md#architecture) + - [High Availability](/tidb-cloud/high-availability-with-multi-az.md) + - [MySQL Compatibility](/mysql-compatibility.md) + - [Roadmap](/tidb-cloud/tidb-cloud-roadmap.md) +- Get Started + - [Try Out TiDB Cloud](/tidb-cloud/tidb-cloud-quickstart.md) + - [Try Out TiDB + AI](/vector-search-get-started-using-python.md) + - [Try Out HTAP](/tidb-cloud/tidb-cloud-htap-quickstart.md) + - [Try Out TiDB Cloud CLI](/tidb-cloud/get-started-with-cli.md) + - [Perform a PoC](/tidb-cloud/tidb-cloud-poc.md) +- Develop Applications + - [Overview](/develop/dev-guide-overview.md) + - Quick Start + - [Build a TiDB Cloud Serverless Cluster](/develop/dev-guide-build-cluster-in-cloud.md) + - [CRUD SQL in TiDB](/develop/dev-guide-tidb-crud-sql.md) + - Connect to TiDB Cloud + - GUI Database Tools + - [JetBrains DataGrip](/develop/dev-guide-gui-datagrip.md) + - [DBeaver](/develop/dev-guide-gui-dbeaver.md) + - [VS Code](/develop/dev-guide-gui-vscode-sqltools.md) + - [MySQL Workbench](/develop/dev-guide-gui-mysql-workbench.md) + - [Navicat](/develop/dev-guide-gui-navicat.md) + - [Choose Driver or ORM](/develop/dev-guide-choose-driver-or-orm.md) + - BI + - [Looker Studio](/tidb-cloud/dev-guide-bi-looker-studio.md) + - Java + - [JDBC](/develop/dev-guide-sample-application-java-jdbc.md) + - [MyBatis](/develop/dev-guide-sample-application-java-mybatis.md) + - [Hibernate](/develop/dev-guide-sample-application-java-hibernate.md) + - [Spring Boot](/develop/dev-guide-sample-application-java-spring-boot.md) + - [Connection Pools and Connection Parameters](/develop/dev-guide-connection-parameters.md) + - Go + - [Go-MySQL-Driver](/develop/dev-guide-sample-application-golang-sql-driver.md) + - [GORM](/develop/dev-guide-sample-application-golang-gorm.md) + - Python + - [mysqlclient](/develop/dev-guide-sample-application-python-mysqlclient.md) + - [MySQL Connector/Python](/develop/dev-guide-sample-application-python-mysql-connector.md) + - [PyMySQL](/develop/dev-guide-sample-application-python-pymysql.md) + - [SQLAlchemy](/develop/dev-guide-sample-application-python-sqlalchemy.md) + - [peewee](/develop/dev-guide-sample-application-python-peewee.md) + - [Django](/develop/dev-guide-sample-application-python-django.md) + - Node.js + - [node-mysql2](/develop/dev-guide-sample-application-nodejs-mysql2.md) + - [mysql.js](/develop/dev-guide-sample-application-nodejs-mysqljs.md) + - [Prisma](/develop/dev-guide-sample-application-nodejs-prisma.md) + - [Sequelize](/develop/dev-guide-sample-application-nodejs-sequelize.md) + - [TypeORM](/develop/dev-guide-sample-application-nodejs-typeorm.md) + - [Next.js](/develop/dev-guide-sample-application-nextjs.md) + - [AWS Lambda](/develop/dev-guide-sample-application-aws-lambda.md) + - Ruby + - [mysql2](/develop/dev-guide-sample-application-ruby-mysql2.md) + - [Rails](/develop/dev-guide-sample-application-ruby-rails.md) + - [WordPress](/tidb-cloud/dev-guide-wordpress.md) + - Serverless Driver (Beta) + - [TiDB Cloud Serverless 
Driver](/tidb-cloud/serverless-driver.md) + - [Node.js Example](/tidb-cloud/serverless-driver-node-example.md) + - [Prisma Example](/tidb-cloud/serverless-driver-prisma-example.md) + - [Kysely Example](/tidb-cloud/serverless-driver-kysely-example.md) + - [Drizzle Example](/tidb-cloud/serverless-driver-drizzle-example.md) + - Third-Party Support + - [Third-Party Tools Supported by TiDB](/develop/dev-guide-third-party-support.md) + - [Known Incompatibility Issues with Third-Party Tools](/develop/dev-guide-third-party-tools-compatibility.md) + - Development Reference + - Design Database Schema + - [Overview](/develop/dev-guide-schema-design-overview.md) + - [Create a Database](/develop/dev-guide-create-database.md) + - [Create a Table](/develop/dev-guide-create-table.md) + - [Create a Secondary Index](/develop/dev-guide-create-secondary-indexes.md) + - Write Data + - [Insert Data](/develop/dev-guide-insert-data.md) + - [Update Data](/develop/dev-guide-update-data.md) + - [Delete Data](/develop/dev-guide-delete-data.md) + - [Periodically Delete Expired Data Using TTL (Time to Live)](/time-to-live.md) + - [Prepared Statements](/develop/dev-guide-prepared-statement.md) + - Read Data + - [Query Data from a Single Table](/develop/dev-guide-get-data-from-single-table.md) + - [Multi-table Join Queries](/develop/dev-guide-join-tables.md) + - [Subquery](/develop/dev-guide-use-subqueries.md) + - [Paginate Results](/develop/dev-guide-paginate-results.md) + - [Views](/develop/dev-guide-use-views.md) + - [Temporary Tables](/develop/dev-guide-use-temporary-tables.md) + - [Common Table Expression](/develop/dev-guide-use-common-table-expression.md) + - Read Replica Data + - [Follower Read](/develop/dev-guide-use-follower-read.md) + - [Stale Read](/develop/dev-guide-use-stale-read.md) + - [HTAP Queries](/develop/dev-guide-hybrid-oltp-and-olap-queries.md) + - Transaction + - [Overview](/develop/dev-guide-transaction-overview.md) + - [Optimistic and Pessimistic Transactions](/develop/dev-guide-optimistic-and-pessimistic-transaction.md) + - [Transaction Restraints](/develop/dev-guide-transaction-restraints.md) + - [Handle Transaction Errors](/develop/dev-guide-transaction-troubleshoot.md) + - Optimize + - [Overview](/develop/dev-guide-optimize-sql-overview.md) + - [SQL Performance Tuning](/develop/dev-guide-optimize-sql.md) + - [Best Practices for Performance Tuning](/develop/dev-guide-optimize-sql-best-practices.md) + - [Best Practices for Indexing](/develop/dev-guide-index-best-practice.md) + - Other Optimization Methods + - [Avoid Implicit Type Conversions](/develop/dev-guide-implicit-type-conversion.md) + - [Unique Serial Number Generation](/develop/dev-guide-unique-serial-number-generation.md) + - Troubleshoot + - [SQL or Transaction Issues](/develop/dev-guide-troubleshoot-overview.md) + - [Unstable Result Set](/develop/dev-guide-unstable-result-set.md) + - [Timeouts](/develop/dev-guide-timeouts-in-tidb.md) + - Development Guidelines + - [Object Naming Convention](/develop/dev-guide-object-naming-guidelines.md) + - [SQL Development Specifications](/develop/dev-guide-sql-development-specification.md) + - [Bookshop Example Application](/develop/dev-guide-bookshop-schema-design.md) +- Manage Cluster + - Plan Your Cluster + - [Select Your Cluster Tier](/tidb-cloud/select-cluster-tier.md) + - [Determine Your TiDB Size](/tidb-cloud/size-your-cluster.md) + - [TiDB Cloud Performance Reference](/tidb-cloud/tidb-cloud-performance-reference.md) + - Manage TiDB Cloud Serverless Clusters + - [Create a TiDB Cloud 
Serverless Cluster](/tidb-cloud/create-tidb-cluster-serverless.md) + - Connect to Your TiDB Cloud Serverless Cluster + - [Connection Overview](/tidb-cloud/connect-to-tidb-cluster-serverless.md) + - [Connect via Public Endpoint](/tidb-cloud/connect-via-standard-connection-serverless.md) + - [Connect via Private Endpoint](/tidb-cloud/set-up-private-endpoint-connections-serverless.md) + - Branch (Beta) + - [Overview](/tidb-cloud/branch-overview.md) + - [Manage Branches](/tidb-cloud/branch-manage.md) + - [GitHub Integration](/tidb-cloud/branch-github-integration.md) + - [Manage Spending Limit](/tidb-cloud/manage-serverless-spend-limit.md) + - [Back Up and Restore TiDB Cloud Serverless Data](/tidb-cloud/backup-and-restore-serverless.md) + - [Export Data from TiDB Cloud Serverless](/tidb-cloud/serverless-export.md) + - Manage TiDB Cloud Dedicated Clusters + - [Create a TiDB Cloud Dedicated Cluster](/tidb-cloud/create-tidb-cluster.md) + - Connect to Your TiDB Cloud Dedicated Cluster + - [Connection Method Overview](/tidb-cloud/connect-to-tidb-cluster.md) + - [Connect via Standard Connection](/tidb-cloud/connect-via-standard-connection.md) + - [Connect via Private Endpoint with AWS](/tidb-cloud/set-up-private-endpoint-connections.md) + - [Connect via Private Endpoint (Private Service Connect) with Google Cloud](/tidb-cloud/set-up-private-endpoint-connections-on-google-cloud.md) + - [Connect via VPC Peering](/tidb-cloud/set-up-vpc-peering-connections.md) + - [Connect via SQL Shell](/tidb-cloud/connect-via-sql-shell.md) + - [Scale a TiDB Cloud Dedicated Cluster](/tidb-cloud/scale-tidb-cluster.md) + - [Back Up and Restore TiDB Cloud Dedicated Data](/tidb-cloud/backup-and-restore.md) + - [Pause or Resume a TiDB Cloud Dedicated Cluster](/tidb-cloud/pause-or-resume-tidb-cluster.md) + - [Configure Maintenance Window](/tidb-cloud/configure-maintenance-window.md) + - Use an HTAP Cluster with TiFlash + - [TiFlash Overview](/tiflash/tiflash-overview.md) + - [Create TiFlash Replicas](/tiflash/create-tiflash-replicas.md) + - [Read Data from TiFlash](/tiflash/use-tidb-to-read-tiflash.md) + - [Use MPP Mode](/tiflash/use-tiflash-mpp-mode.md) + - [Use FastScan](/tiflash/use-fastscan.md) + - [Supported Push-down Calculations](/tiflash/tiflash-supported-pushdown-calculations.md) + - [TiFlash Query Result Materialization](/tiflash/tiflash-results-materialization.md) + - [TiFlash Late Materialization](/tiflash/tiflash-late-materialization.md) + - [Compatibility](/tiflash/tiflash-compatibility.md) + - [Pipeline Execution Model](/tiflash/tiflash-pipeline-model.md) + - Monitor and Alert + - [Overview](/tidb-cloud/monitor-tidb-cluster.md) + - [Built-in Metrics](/tidb-cloud/built-in-monitoring.md) + - [Built-in Alerting](/tidb-cloud/monitor-built-in-alerting.md) + - [Cluster Events](/tidb-cloud/tidb-cloud-events.md) + - [Third-Party Metrics Integrations (Beta)](/tidb-cloud/third-party-monitoring-integrations.md) + - Tune Performance + - [Overview](/tidb-cloud/tidb-cloud-tune-performance-overview.md) + - Analyze Performance + - [Use the Diagnosis Tab](/tidb-cloud/tune-performance.md) + - [Use Index Insight (Beta)](/tidb-cloud/index-insight.md) + - [Use Statement Summary Tables](/statement-summary-tables.md) + - SQL Tuning + - [Overview](/tidb-cloud/tidb-cloud-sql-tuning-overview.md) + - Understanding the Query Execution Plan + - [Overview](/explain-overview.md) + - [`EXPLAIN` Walkthrough](/explain-walkthrough.md) + - [Indexes](/explain-indexes.md) + - [Joins](/explain-joins.md) + - [MPP Queries](/explain-mpp.md) + - 
[Subqueries](/explain-subqueries.md) + - [Aggregation](/explain-aggregation.md) + - [Views](/explain-views.md) + - [Partitions](/explain-partitions.md) + - [Index Merge](/explain-index-merge.md) + - SQL Optimization Process + - [Overview](/sql-optimization-concepts.md) + - Logic Optimization + - [Overview](/sql-logical-optimization.md) + - [Subquery Related Optimizations](/subquery-optimization.md) + - [Column Pruning](/column-pruning.md) + - [Decorrelation of Correlated Subquery](/correlated-subquery-optimization.md) + - [Eliminate Max/Min](/max-min-eliminate.md) + - [Predicates Push Down](/predicate-push-down.md) + - [Partition Pruning](/partition-pruning.md) + - [TopN and Limit Push Down](/topn-limit-push-down.md) + - [Join Reorder](/join-reorder.md) + - [Derive TopN or Limit from Window Functions](/derive-topn-from-window.md) + - Physical Optimization + - [Overview](/sql-physical-optimization.md) + - [Index Selection](/choose-index.md) + - [Statistics](/statistics.md) + - [Extended Statistics](/extended-statistics.md) + - [Wrong Index Solution](/wrong-index-solution.md) + - [Distinct Optimization](/agg-distinct-optimization.md) + - [Cost Model](/cost-model.md) + - [Runtime Filter](/runtime-filter.md) + - [Prepared Execution Plan Cache](/sql-prepared-plan-cache.md) + - [Non-Prepared Execution Plan Cache](/sql-non-prepared-plan-cache.md) + - Control Execution Plans + - [Overview](/control-execution-plan.md) + - [Optimizer Hints](/optimizer-hints.md) + - [SQL Plan Management](/sql-plan-management.md) + - [The Blocklist of Optimization Rules and Expression Pushdown](/blocklist-control-plan.md) + - [Optimizer Fix Controls](/optimizer-fix-controls.md) + - [TiKV Follower Read](/follower-read.md) + - [Coprocessor Cache](/coprocessor-cache.md) + - Garbage Collection (GC) + - [Overview](/garbage-collection-overview.md) + - [Configuration](/garbage-collection-configuration.md) + - [Tune TiFlash Performance](/tiflash/tune-tiflash-performance.md) + - [Upgrade a TiDB Cluster](/tidb-cloud/upgrade-tidb-cluster.md) + - [Delete a TiDB Cluster](/tidb-cloud/delete-tidb-cluster.md) +- Migrate or Import Data + - [Overview](/tidb-cloud/tidb-cloud-migration-overview.md) + - Migrate Data into TiDB Cloud + - [Migrate Existing and Incremental Data Using Data Migration](/tidb-cloud/migrate-from-mysql-using-data-migration.md) + - [Migrate Incremental Data Using Data Migration](/tidb-cloud/migrate-incremental-data-from-mysql-using-data-migration.md) + - [Migrate and Merge MySQL Shards of Large Datasets](/tidb-cloud/migrate-sql-shards.md) + - [Migrate from On-Premises TiDB to TiDB Cloud](/tidb-cloud/migrate-from-op-tidb.md) + - [Migrate from MySQL-Compatible Databases Using AWS DMS](/tidb-cloud/migrate-from-mysql-using-aws-dms.md) + - [Migrate from Amazon RDS for Oracle Using AWS DMS](/tidb-cloud/migrate-from-oracle-using-aws-dms.md) + - Import Data into TiDB Cloud + - [Import Local Files](/tidb-cloud/tidb-cloud-import-local-files.md) + - [Import Sample Data (SQL File)](/tidb-cloud/import-sample-data.md) + - [Import CSV Files from Amazon S3 or GCS](/tidb-cloud/import-csv-files.md) + - [Import Apache Parquet Files from Amazon S3 or GCS](/tidb-cloud/import-parquet-files.md) + - [Import with MySQL CLI](/tidb-cloud/import-with-mysql-cli.md) + - Reference + - [Configure Amazon S3 Access and GCS Access](/tidb-cloud/config-s3-and-gcs-access.md) + - [Naming Conventions for Data Import](/tidb-cloud/naming-conventions-for-data-import.md) + - [CSV Configurations for Importing Data](/tidb-cloud/csv-config-for-import-data.md) + 
- [Troubleshoot Access Denied Errors during Data Import from Amazon S3](/tidb-cloud/troubleshoot-import-access-denied-error.md) + - [Precheck Errors, Migration Errors, and Alerts for Data Migration](/tidb-cloud/tidb-cloud-dm-precheck-and-troubleshooting.md) + - [Connect AWS DMS to TiDB Cloud clusters](/tidb-cloud/tidb-cloud-connect-aws-dms.md) +- Explore Data + - [Chat2Query (Beta) in SQL Editor](/tidb-cloud/explore-data-with-chat2query.md) +- Vector Search (Beta) + - [Overview](/vector-search-overview.md) + - Get Started + - [Get Started with SQL](/vector-search-get-started-using-sql.md) + - [Get Started with Python](/vector-search-get-started-using-python.md) + - Integrations + - [Overview](/vector-search-integration-overview.md) + - AI Frameworks + - [LlamaIndex](/vector-search-integrate-with-llamaindex.md) + - [Langchain](/vector-search-integrate-with-langchain.md) + - Embedding Models/Services + - [Jina AI](/vector-search-integrate-with-jinaai-embedding.md) + - ORM Libraries + - [SQLAlchemy](/vector-search-integrate-with-sqlalchemy.md) + - [peewee](/vector-search-integrate-with-peewee.md) + - [Django ORM](/vector-search-integrate-with-django-orm.md) + - Reference + - [Vector Data Types](/vector-search-data-types.md) + - [Vector Functions and Operators](/vector-search-functions-and-operators.md) + - [Vector Index](/vector-search-index.md) + - [Improve Performance](/vector-search-improve-performance.md) + - [Limitations](/vector-search-limitations.md) + - [Changelogs](/tidb-cloud/vector-search-changelogs.md) +- Data Service (Beta) + - [Overview](/tidb-cloud/data-service-overview.md) + - [Get Started](/tidb-cloud/data-service-get-started.md) + - Chat2Query API + - [Get Started](/tidb-cloud/use-chat2query-api.md) + - [Start Multi-round Chat2Query](/tidb-cloud/use-chat2query-sessions.md) + - [Use Knowledge Bases](/tidb-cloud/use-chat2query-knowledge.md) + - [Manage Data App](/tidb-cloud/data-service-manage-data-app.md) + - [Manage Endpoint](/tidb-cloud/data-service-manage-endpoint.md) + - [API Key](/tidb-cloud/data-service-api-key.md) + - [Custom Domain](/tidb-cloud/data-service-custom-domain.md) + - [Integrations](/tidb-cloud/data-service-integrations.md) + - [Run in Postman](/tidb-cloud/data-service-postman-integration.md) + - [Deploy Automatically with GitHub](/tidb-cloud/data-service-manage-github-connection.md) + - [Use OpenAPI Specification with Next.js](/tidb-cloud/data-service-oas-with-nextjs.md) + - [Data App Configuration Files](/tidb-cloud/data-service-app-config-files.md) + - [Response and Status Code](/tidb-cloud/data-service-response-and-status-code.md) +- Stream Data + - [Changefeed Overview](/tidb-cloud/changefeed-overview.md) + - [To MySQL Sink](/tidb-cloud/changefeed-sink-to-mysql.md) + - [To Kafka Sink](/tidb-cloud/changefeed-sink-to-apache-kafka.md) + - [To TiDB Cloud Sink](/tidb-cloud/changefeed-sink-to-tidb-cloud.md) + - [To Cloud Storage](/tidb-cloud/changefeed-sink-to-cloud-storage.md) +- Disaster Recovery + - [Recovery Group Overview](/tidb-cloud/recovery-group-overview.md) + - [Get Started](/tidb-cloud/recovery-group-get-started.md) + - [Failover and Reprotect Databases](/tidb-cloud/recovery-group-failover.md) + - [Delete a Recovery Group](/tidb-cloud/recovery-group-delete.md) +- Security + - Identity Access Control + - [Password Authentication](/tidb-cloud/tidb-cloud-password-authentication.md) + - [Basic SSO Authentication](/tidb-cloud/tidb-cloud-sso-authentication.md) + - [Organization SSO Authentication](/tidb-cloud/tidb-cloud-org-sso-authentication.md) + - 
[Identity Access Management](/tidb-cloud/manage-user-access.md) + - [OAuth 2.0](/tidb-cloud/oauth2.md) + - Network Access Control + - TiDB Cloud Serverless + - [Connect via Private Endpoint](/tidb-cloud/set-up-private-endpoint-connections-serverless.md) + - [TLS Connections to TiDB Cloud Serverless](/tidb-cloud/secure-connections-to-serverless-clusters.md) + - TiDB Cloud Dedicated + - [Configure an IP Access List](/tidb-cloud/configure-ip-access-list.md) + - [Connect via Private Endpoint with AWS](/tidb-cloud/set-up-private-endpoint-connections.md) + - [Connect via Private Endpoint (Private Service Connect) with Google Cloud](/tidb-cloud/set-up-private-endpoint-connections-on-google-cloud.md) + - [Connect via VPC Peering](/tidb-cloud/set-up-vpc-peering-connections.md) + - [TLS Connections to TiDB Cloud Dedicated](/tidb-cloud/tidb-cloud-tls-connect-to-dedicated.md) + - Data Access Control + - [Encryption at Rest Using Customer-Managed Encryption Keys](/tidb-cloud/tidb-cloud-encrypt-cmek.md) + - Database Access Control + - [Configure Cluster Security Settings](/tidb-cloud/configure-security-settings.md) + - Audit Management + - [Database Audit Logging](/tidb-cloud/tidb-cloud-auditing.md) + - [Console Audit Logging](/tidb-cloud/tidb-cloud-console-auditing.md) +- Billing + - [Invoices](/tidb-cloud/tidb-cloud-billing.md#invoices) + - [Billing Details](/tidb-cloud/tidb-cloud-billing.md#billing-details) + - [Cost Explorer](/tidb-cloud/tidb-cloud-billing.md#cost-explorer) + - [Billing Profile](/tidb-cloud/tidb-cloud-billing.md#billing-profile) + - [Credits](/tidb-cloud/tidb-cloud-billing.md#credits) + - [Payment Method Setting](/tidb-cloud/tidb-cloud-billing.md#payment-method) + - [Billing from AWS or GCP Marketplace](/tidb-cloud/tidb-cloud-billing.md#billing-from-aws-marketplace-or-google-cloud-marketplace) + - [Billing for Changefeed](/tidb-cloud/tidb-cloud-billing-ticdc-rcu.md) + - [Billing for Data Migration](/tidb-cloud/tidb-cloud-billing-dm.md) + - [Billing for Recovery Groups](/tidb-cloud/tidb-cloud-billing-recovery-group.md) + - [Manage Budgets](/tidb-cloud/tidb-cloud-budget.md) +- TiDB Cloud Partner Web Console + - [TiDB Cloud Partners](/tidb-cloud/tidb-cloud-partners.md) + - [MSP Customer](/tidb-cloud/managed-service-provider-customer.md) + - [Reseller's Customer](/tidb-cloud/cppo-customer.md) +- API + - [API Overview](/tidb-cloud/api-overview.md) + - API Reference + - v1beta1 + - [Billing](https://docs.pingcap.com/tidbcloud/api/v1beta1/billing) + - [Data Service](https://docs.pingcap.com/tidbcloud/api/v1beta1/dataservice) + - [IAM](https://docs.pingcap.com/tidbcloud/api/v1beta1/iam) + - [MSP](https://docs.pingcap.com/tidbcloud/api/v1beta1/msp) + - [v1beta](https://docs.pingcap.com/tidbcloud/api/v1beta) +- Integrations + - [Airbyte](/tidb-cloud/integrate-tidbcloud-with-airbyte.md) + - [Amazon AppFlow](/develop/dev-guide-aws-appflow-integration.md) + - [Cloudflare](/tidb-cloud/integrate-tidbcloud-with-cloudflare.md) + - [Datadog](/tidb-cloud/monitor-datadog-integration.md) + - [dbt](/tidb-cloud/integrate-tidbcloud-with-dbt.md) + - [Gitpod](/develop/dev-guide-playground-gitpod.md) + - [n8n](/tidb-cloud/integrate-tidbcloud-with-n8n.md) + - [Netlify](/tidb-cloud/integrate-tidbcloud-with-netlify.md) + - [New Relic](/tidb-cloud/monitor-new-relic-integration.md) + - [Prometheus and Grafana](/tidb-cloud/monitor-prometheus-and-grafana-integration.md) + - [ProxySQL](/develop/dev-guide-proxysql-integration.md) + - Terraform + - [Terraform Integration 
Overview](/tidb-cloud/terraform-tidbcloud-provider-overview.md) + - [Get TiDB Cloud Terraform Provider](/tidb-cloud/terraform-get-tidbcloud-provider.md) + - [Use Cluster Resource](/tidb-cloud/terraform-use-cluster-resource.md) + - [Use Backup Resource](/tidb-cloud/terraform-use-backup-resource.md) + - [Use Restore Resource](/tidb-cloud/terraform-use-restore-resource.md) + - [Use Import Resource](/tidb-cloud/terraform-use-import-resource.md) + - [Vercel](/tidb-cloud/integrate-tidbcloud-with-vercel.md) + - [Zapier](/tidb-cloud/integrate-tidbcloud-with-zapier.md) +- Reference + - TiDB Cluster Architecture + - [Overview](/tidb-architecture.md) + - [Storage](/tidb-storage.md) + - [Computing](/tidb-computing.md) + - [Scheduling](/tidb-scheduling.md) + - [TSO](/tso.md) + - [TiDB Cloud Dedicated Limitations and Quotas](/tidb-cloud/limitations-and-quotas.md) + - [TiDB Cloud Serverless Limitations](/tidb-cloud/serverless-limitations.md) + - [Limited SQL Features on TiDB Cloud](/tidb-cloud/limited-sql-features.md) + - [TiDB Limitations](/tidb-limitations.md) + - TiDB Distributed eXecution Framework (DXF) + - [Introduction](/tidb-distributed-execution-framework.md) + - [TiDB Global Sort](/tidb-global-sort.md) + - Benchmarks + - TiDB v8.1 + - [TPC-C Performance Test Report](/tidb-cloud/v8.1-performance-benchmarking-with-tpcc.md) + - [Sysbench Performance Test Report](/tidb-cloud/v8.1-performance-benchmarking-with-sysbench.md) + - TiDB v7.5 + - [TPC-C Performance Test Report](/tidb-cloud/v7.5-performance-benchmarking-with-tpcc.md) + - [Sysbench Performance Test Report](/tidb-cloud/v7.5-performance-benchmarking-with-sysbench.md) + - TiDB v7.1 + - [TPC-C Performance Test Report](/tidb-cloud/v7.1-performance-benchmarking-with-tpcc.md) + - [Sysbench Performance Test Report](/tidb-cloud/v7.1-performance-benchmarking-with-sysbench.md) + - TiDB v6.5 + - [TPC-C Performance Test Report](/tidb-cloud/v6.5-performance-benchmarking-with-tpcc.md) + - [Sysbench Performance Test Report](/tidb-cloud/v6.5-performance-benchmarking-with-sysbench.md) + - SQL + - [Explore SQL with TiDB](/basic-sql-operations.md) + - SQL Language Structure and Syntax + - Attributes + - [AUTO_INCREMENT](/auto-increment.md) + - [AUTO_RANDOM](/auto-random.md) + - [SHARD_ROW_ID_BITS](/shard-row-id-bits.md) + - [Literal Values](/literal-values.md) + - [Schema Object Names](/schema-object-names.md) + - [Keywords and Reserved Words](/keywords.md) + - [User-Defined Variables](/user-defined-variables.md) + - [Expression Syntax](/expression-syntax.md) + - [Comment Syntax](/comment-syntax.md) + - SQL Statements + - [Overview](/sql-statements/sql-statement-overview.md) + - [`ADMIN`](/sql-statements/sql-statement-admin.md) + - [`ADMIN CANCEL DDL`](/sql-statements/sql-statement-admin-cancel-ddl.md) + - [`ADMIN CHECKSUM TABLE`](/sql-statements/sql-statement-admin-checksum-table.md) + - [`ADMIN CHECK [TABLE|INDEX]`](/sql-statements/sql-statement-admin-check-table-index.md) + - [`ADMIN CLEANUP INDEX`](/sql-statements/sql-statement-admin-cleanup.md) + - [`ADMIN PAUSE DDL`](/sql-statements/sql-statement-admin-pause-ddl.md) + - [`ADMIN RECOVER INDEX`](/sql-statements/sql-statement-admin-recover.md) + - [`ADMIN RESUME DDL`](/sql-statements/sql-statement-admin-resume-ddl.md) + - [`ADMIN SHOW DDL [JOBS|JOB QUERIES]`](/sql-statements/sql-statement-admin-show-ddl.md) + - [`ALTER DATABASE`](/sql-statements/sql-statement-alter-database.md) + - [`ALTER INSTANCE`](/sql-statements/sql-statement-alter-instance.md) + - [`ALTER PLACEMENT 
POLICY`](/sql-statements/sql-statement-alter-placement-policy.md) + - [`ALTER RANGE`](/sql-statements/sql-statement-alter-range.md) + - [`ALTER RESOURCE GROUP`](/sql-statements/sql-statement-alter-resource-group.md) + - [`ALTER SEQUENCE`](/sql-statements/sql-statement-alter-sequence.md) + - `ALTER TABLE` + - [Overview](/sql-statements/sql-statement-alter-table.md) + - [`ADD COLUMN`](/sql-statements/sql-statement-add-column.md) + - [`ADD INDEX`](/sql-statements/sql-statement-add-index.md) + - [`ALTER INDEX`](/sql-statements/sql-statement-alter-index.md) + - [`CHANGE COLUMN`](/sql-statements/sql-statement-change-column.md) + - [`COMPACT`](/sql-statements/sql-statement-alter-table-compact.md) + - [`DROP COLUMN`](/sql-statements/sql-statement-drop-column.md) + - [`DROP INDEX`](/sql-statements/sql-statement-drop-index.md) + - [`MODIFY COLUMN`](/sql-statements/sql-statement-modify-column.md) + - [`RENAME INDEX`](/sql-statements/sql-statement-rename-index.md) + - [`ALTER USER`](/sql-statements/sql-statement-alter-user.md) + - [`ANALYZE TABLE`](/sql-statements/sql-statement-analyze-table.md) + - [`BACKUP`](/sql-statements/sql-statement-backup.md) + - [`BATCH`](/sql-statements/sql-statement-batch.md) + - [`BEGIN`](/sql-statements/sql-statement-begin.md) + - [`CANCEL IMPORT JOB`](/sql-statements/sql-statement-cancel-import-job.md) + - [`COMMIT`](/sql-statements/sql-statement-commit.md) + - [`CREATE [GLOBAL|SESSION] BINDING`](/sql-statements/sql-statement-create-binding.md) + - [`CREATE DATABASE`](/sql-statements/sql-statement-create-database.md) + - [`CREATE INDEX`](/sql-statements/sql-statement-create-index.md) + - [`CREATE PLACEMENT POLICY`](/sql-statements/sql-statement-create-placement-policy.md) + - [`CREATE RESOURCE GROUP`](/sql-statements/sql-statement-create-resource-group.md) + - [`CREATE ROLE`](/sql-statements/sql-statement-create-role.md) + - [`CREATE SEQUENCE`](/sql-statements/sql-statement-create-sequence.md) + - [`CREATE TABLE LIKE`](/sql-statements/sql-statement-create-table-like.md) + - [`CREATE TABLE`](/sql-statements/sql-statement-create-table.md) + - [`CREATE USER`](/sql-statements/sql-statement-create-user.md) + - [`CREATE VIEW`](/sql-statements/sql-statement-create-view.md) + - [`DEALLOCATE`](/sql-statements/sql-statement-deallocate.md) + - [`DELETE`](/sql-statements/sql-statement-delete.md) + - [`DESC`](/sql-statements/sql-statement-desc.md) + - [`DESCRIBE`](/sql-statements/sql-statement-describe.md) + - [`DO`](/sql-statements/sql-statement-do.md) + - [`DROP [GLOBAL|SESSION] BINDING`](/sql-statements/sql-statement-drop-binding.md) + - [`DROP DATABASE`](/sql-statements/sql-statement-drop-database.md) + - [`DROP INDEX`](/sql-statements/sql-statement-drop-index.md) + - [`DROP PLACEMENT POLICY`](/sql-statements/sql-statement-drop-placement-policy.md) + - [`DROP RESOURCE GROUP`](/sql-statements/sql-statement-drop-resource-group.md) + - [`DROP ROLE`](/sql-statements/sql-statement-drop-role.md) + - [`DROP SEQUENCE`](/sql-statements/sql-statement-drop-sequence.md) + - [`DROP STATS`](/sql-statements/sql-statement-drop-stats.md) + - [`DROP TABLE`](/sql-statements/sql-statement-drop-table.md) + - [`DROP USER`](/sql-statements/sql-statement-drop-user.md) + - [`DROP VIEW`](/sql-statements/sql-statement-drop-view.md) + - [`EXECUTE`](/sql-statements/sql-statement-execute.md) + - [`EXPLAIN ANALYZE`](/sql-statements/sql-statement-explain-analyze.md) + - [`EXPLAIN`](/sql-statements/sql-statement-explain.md) + - [`FLASHBACK CLUSTER`](/sql-statements/sql-statement-flashback-cluster.md) + - 
[`FLASHBACK DATABASE`](/sql-statements/sql-statement-flashback-database.md) + - [`FLASHBACK TABLE`](/sql-statements/sql-statement-flashback-table.md) + - [`FLUSH PRIVILEGES`](/sql-statements/sql-statement-flush-privileges.md) + - [`FLUSH STATUS`](/sql-statements/sql-statement-flush-status.md) + - [`FLUSH TABLES`](/sql-statements/sql-statement-flush-tables.md) + - [`GRANT `](/sql-statements/sql-statement-grant-privileges.md) + - [`GRANT `](/sql-statements/sql-statement-grant-role.md) + - [`IMPORT INTO`](/sql-statements/sql-statement-import-into.md) + - [`INSERT`](/sql-statements/sql-statement-insert.md) + - [`KILL [TIDB]`](/sql-statements/sql-statement-kill.md) + - [`LOAD DATA`](/sql-statements/sql-statement-load-data.md) + - [`LOAD STATS`](/sql-statements/sql-statement-load-stats.md) + - [`LOCK STATS`](/sql-statements/sql-statement-lock-stats.md) + - [`LOCK TABLES` and `UNLOCK TABLES`](/sql-statements/sql-statement-lock-tables-and-unlock-tables.md) + - [`PREPARE`](/sql-statements/sql-statement-prepare.md) + - [`QUERY WATCH`](/sql-statements/sql-statement-query-watch.md) + - [`RECOVER TABLE`](/sql-statements/sql-statement-recover-table.md) + - [`RENAME TABLE`](/sql-statements/sql-statement-rename-table.md) + - [`RENAME USER`](/sql-statements/sql-statement-rename-user.md) + - [`REPLACE`](/sql-statements/sql-statement-replace.md) + - [`RESTORE`](/sql-statements/sql-statement-restore.md) + - [`REVOKE `](/sql-statements/sql-statement-revoke-privileges.md) + - [`REVOKE `](/sql-statements/sql-statement-revoke-role.md) + - [`ROLLBACK`](/sql-statements/sql-statement-rollback.md) + - [`SAVEPOINT`](/sql-statements/sql-statement-savepoint.md) + - [`SELECT`](/sql-statements/sql-statement-select.md) + - [`SET DEFAULT ROLE`](/sql-statements/sql-statement-set-default-role.md) + - [`SET [NAMES|CHARACTER SET]`](/sql-statements/sql-statement-set-names.md) + - [`SET PASSWORD`](/sql-statements/sql-statement-set-password.md) + - [`SET RESOURCE GROUP`](/sql-statements/sql-statement-set-resource-group.md) + - [`SET ROLE`](/sql-statements/sql-statement-set-role.md) + - [`SET TRANSACTION`](/sql-statements/sql-statement-set-transaction.md) + - [`SET [GLOBAL|SESSION] `](/sql-statements/sql-statement-set-variable.md) + - [`SHOW ANALYZE STATUS`](/sql-statements/sql-statement-show-analyze-status.md) + - [`SHOW [BACKUPS|RESTORES]`](/sql-statements/sql-statement-show-backups.md) + - [`SHOW [GLOBAL|SESSION] BINDINGS`](/sql-statements/sql-statement-show-bindings.md) + - [`SHOW BUILTINS`](/sql-statements/sql-statement-show-builtins.md) + - [`SHOW CHARACTER SET`](/sql-statements/sql-statement-show-character-set.md) + - [`SHOW COLLATION`](/sql-statements/sql-statement-show-collation.md) + - [`SHOW COLUMN_STATS_USAGE`](/sql-statements/sql-statement-show-column-stats-usage.md) + - [`SHOW COLUMNS FROM`](/sql-statements/sql-statement-show-columns-from.md) + - [`SHOW CREATE DATABASE`](/sql-statements/sql-statement-show-create-database.md) + - [`SHOW CREATE PLACEMENT POLICY`](/sql-statements/sql-statement-show-create-placement-policy.md) + - [`SHOW CREATE RESOURCE GROUP`](/sql-statements/sql-statement-show-create-resource-group.md) + - [`SHOW CREATE SEQUENCE`](/sql-statements/sql-statement-show-create-sequence.md) + - [`SHOW CREATE TABLE`](/sql-statements/sql-statement-show-create-table.md) + - [`SHOW CREATE USER`](/sql-statements/sql-statement-show-create-user.md) + - [`SHOW DATABASES`](/sql-statements/sql-statement-show-databases.md) + - [`SHOW ENGINES`](/sql-statements/sql-statement-show-engines.md) + - [`SHOW 
ERRORS`](/sql-statements/sql-statement-show-errors.md) + - [`SHOW FIELDS FROM`](/sql-statements/sql-statement-show-fields-from.md) + - [`SHOW GRANTS`](/sql-statements/sql-statement-show-grants.md) + - [`SHOW IMPORT JOB`](/sql-statements/sql-statement-show-import-job.md) + - [`SHOW INDEXES [FROM|IN]`](/sql-statements/sql-statement-show-indexes.md) + - [`SHOW MASTER STATUS`](/sql-statements/sql-statement-show-master-status.md) + - [`SHOW PLACEMENT`](/sql-statements/sql-statement-show-placement.md) + - [`SHOW PLACEMENT FOR`](/sql-statements/sql-statement-show-placement-for.md) + - [`SHOW PLACEMENT LABELS`](/sql-statements/sql-statement-show-placement-labels.md) + - [`SHOW PLUGINS`](/sql-statements/sql-statement-show-plugins.md) + - [`SHOW PRIVILEGES`](/sql-statements/sql-statement-show-privileges.md) + - [`SHOW PROCESSLIST`](/sql-statements/sql-statement-show-processlist.md) + - [`SHOW PROFILES`](/sql-statements/sql-statement-show-profiles.md) + - [`SHOW SCHEMAS`](/sql-statements/sql-statement-show-schemas.md) + - [`SHOW STATS_BUCKETS`](/sql-statements/sql-statement-show-stats-buckets.md) + - [`SHOW STATS_HEALTHY`](/sql-statements/sql-statement-show-stats-healthy.md) + - [`SHOW STATS_HISTOGRAMS`](/sql-statements/sql-statement-show-stats-histograms.md) + - [`SHOW STATS_LOCKED`](/sql-statements/sql-statement-show-stats-locked.md) + - [`SHOW STATS_META`](/sql-statements/sql-statement-show-stats-meta.md) + - [`SHOW STATS_TOPN`](/sql-statements/sql-statement-show-stats-topn.md) + - [`SHOW STATUS`](/sql-statements/sql-statement-show-status.md) + - [`SHOW TABLE NEXT_ROW_ID`](/sql-statements/sql-statement-show-table-next-rowid.md) + - [`SHOW TABLE REGIONS`](/sql-statements/sql-statement-show-table-regions.md) + - [`SHOW TABLE STATUS`](/sql-statements/sql-statement-show-table-status.md) + - [`SHOW TABLES`](/sql-statements/sql-statement-show-tables.md) + - [`SHOW [GLOBAL|SESSION] VARIABLES`](/sql-statements/sql-statement-show-variables.md) + - [`SHOW WARNINGS`](/sql-statements/sql-statement-show-warnings.md) + - [`SPLIT REGION`](/sql-statements/sql-statement-split-region.md) + - [`START TRANSACTION`](/sql-statements/sql-statement-start-transaction.md) + - [`TABLE`](/sql-statements/sql-statement-table.md) + - [`TRACE`](/sql-statements/sql-statement-trace.md) + - [`TRUNCATE`](/sql-statements/sql-statement-truncate.md) + - [`UNLOCK STATS`](/sql-statements/sql-statement-unlock-stats.md) + - [`UPDATE`](/sql-statements/sql-statement-update.md) + - [`USE`](/sql-statements/sql-statement-use.md) + - [`WITH`](/sql-statements/sql-statement-with.md) + - Data Types + - [Overview](/data-type-overview.md) + - [Default Values](/data-type-default-values.md) + - [Numeric Types](/data-type-numeric.md) + - [Date and Time Types](/data-type-date-and-time.md) + - [String Types](/data-type-string.md) + - [JSON Type](/data-type-json.md) + - Functions and Operators + - [Overview](/functions-and-operators/functions-and-operators-overview.md) + - [Type Conversion in Expression Evaluation](/functions-and-operators/type-conversion-in-expression-evaluation.md) + - [Operators](/functions-and-operators/operators.md) + - [Control Flow Functions](/functions-and-operators/control-flow-functions.md) + - [String Functions](/functions-and-operators/string-functions.md) + - [Numeric Functions and Operators](/functions-and-operators/numeric-functions-and-operators.md) + - [Date and Time Functions](/functions-and-operators/date-and-time-functions.md) + - [Bit Functions and Operators](/functions-and-operators/bit-functions-and-operators.md) + - 
[Cast Functions and Operators](/functions-and-operators/cast-functions-and-operators.md) + - [Encryption and Compression Functions](/functions-and-operators/encryption-and-compression-functions.md) + - [Locking Functions](/functions-and-operators/locking-functions.md) + - [Information Functions](/functions-and-operators/information-functions.md) + - JSON Functions + - [Overview](/functions-and-operators/json-functions.md) + - [Functions That Create JSON](/functions-and-operators/json-functions/json-functions-create.md) + - [Functions That Search JSON](/functions-and-operators/json-functions/json-functions-search.md) + - [Functions That Modify JSON](/functions-and-operators/json-functions/json-functions-modify.md) + - [Functions That Return JSON](/functions-and-operators/json-functions/json-functions-return.md) + - [JSON Utility Functions](/functions-and-operators/json-functions/json-functions-utility.md) + - [Functions That Aggregate JSON](/functions-and-operators/json-functions/json-functions-aggregate.md) + - [Functions That Validate JSON](/functions-and-operators/json-functions/json-functions-validate.md) + - [Aggregate (GROUP BY) Functions](/functions-and-operators/aggregate-group-by-functions.md) + - [GROUP BY Modifiers](/functions-and-operators/group-by-modifier.md) + - [Window Functions](/functions-and-operators/window-functions.md) + - [Miscellaneous Functions](/functions-and-operators/miscellaneous-functions.md) + - [Precision Math](/functions-and-operators/precision-math.md) + - [Set Operations](/functions-and-operators/set-operators.md) + - [Sequence Functions](/functions-and-operators/sequence-functions.md) + - [List of Expressions for Pushdown](/functions-and-operators/expressions-pushed-down.md) + - [TiDB Specific Functions](/functions-and-operators/tidb-functions.md) + - [Clustered Indexes](/clustered-indexes.md) + - [Constraints](/constraints.md) + - [Generated Columns](/generated-columns.md) + - [SQL Mode](/sql-mode.md) + - [Table Attributes](/table-attributes.md) + - Transactions + - [Overview](/transaction-overview.md) + - [Isolation Levels](/transaction-isolation-levels.md) + - [Optimistic Transactions](/optimistic-transaction.md) + - [Pessimistic Transactions](/pessimistic-transaction.md) + - [Non-Transactional DML Statements](/non-transactional-dml.md) + - [Views](/views.md) + - [Partitioning](/partitioned-table.md) + - [Temporary Tables](/temporary-tables.md) + - [Cached Tables](/cached-tables.md) + - [FOREIGN KEY Constraints](/foreign-key.md) + - Character Set and Collation + - [Overview](/character-set-and-collation.md) + - [GBK](/character-set-gbk.md) + - Read Historical Data + - Use Stale Read (Recommended) + - [Usage Scenarios of Stale Read](/stale-read.md) + - [Perform Stale Read Using `As OF TIMESTAMP`](/as-of-timestamp.md) + - [Perform Stale Read Using `tidb_read_staleness`](/tidb-read-staleness.md) + - [Perform Stale Read Using `tidb_external_ts`](/tidb-external-ts.md) + - [Use the `tidb_snapshot` System Variable](/read-historical-data.md) + - [Placement Rules in SQL](/placement-rules-in-sql.md) + - System Tables + - `mysql` Schema + - [Overview](/mysql-schema/mysql-schema.md) + - [`user`](/mysql-schema/mysql-schema-user.md) + - INFORMATION_SCHEMA + - [Overview](/information-schema/information-schema.md) + - [`ANALYZE_STATUS`](/information-schema/information-schema-analyze-status.md) + - [`CHECK_CONSTRAINTS`](/information-schema/information-schema-check-constraints.md) + - 
[`CLIENT_ERRORS_SUMMARY_BY_HOST`](/information-schema/client-errors-summary-by-host.md) + - [`CLIENT_ERRORS_SUMMARY_BY_USER`](/information-schema/client-errors-summary-by-user.md) + - [`CLIENT_ERRORS_SUMMARY_GLOBAL`](/information-schema/client-errors-summary-global.md) + - [`CHARACTER_SETS`](/information-schema/information-schema-character-sets.md) + - [`CLUSTER_INFO`](/information-schema/information-schema-cluster-info.md) + - [`COLLATIONS`](/information-schema/information-schema-collations.md) + - [`COLLATION_CHARACTER_SET_APPLICABILITY`](/information-schema/information-schema-collation-character-set-applicability.md) + - [`COLUMNS`](/information-schema/information-schema-columns.md) + - [`DATA_LOCK_WAITS`](/information-schema/information-schema-data-lock-waits.md) + - [`DDL_JOBS`](/information-schema/information-schema-ddl-jobs.md) + - [`DEADLOCKS`](/information-schema/information-schema-deadlocks.md) + - [`ENGINES`](/information-schema/information-schema-engines.md) + - [`KEYWORDS`](/information-schema/information-schema-keywords.md) + - [`KEY_COLUMN_USAGE`](/information-schema/information-schema-key-column-usage.md) + - [`MEMORY_USAGE`](/information-schema/information-schema-memory-usage.md) + - [`MEMORY_USAGE_OPS_HISTORY`](/information-schema/information-schema-memory-usage-ops-history.md) + - [`PARTITIONS`](/information-schema/information-schema-partitions.md) + - [`PLACEMENT_POLICIES`](/information-schema/information-schema-placement-policies.md) + - [`PROCESSLIST`](/information-schema/information-schema-processlist.md) + - [`REFERENTIAL_CONSTRAINTS`](/information-schema/information-schema-referential-constraints.md) + - [`RESOURCE_GROUPS`](/information-schema/information-schema-resource-groups.md) + - [`RUNAWAY_WATCHES`](/information-schema/information-schema-runaway-watches.md) + - [`SCHEMATA`](/information-schema/information-schema-schemata.md) + - [`SEQUENCES`](/information-schema/information-schema-sequences.md) + - [`SESSION_VARIABLES`](/information-schema/information-schema-session-variables.md) + - [`SLOW_QUERY`](/information-schema/information-schema-slow-query.md) + - [`STATISTICS`](/information-schema/information-schema-statistics.md) + - [`TABLES`](/information-schema/information-schema-tables.md) + - [`TABLE_CONSTRAINTS`](/information-schema/information-schema-table-constraints.md) + - [`TABLE_STORAGE_STATS`](/information-schema/information-schema-table-storage-stats.md) + - [`TIDB_HOT_REGIONS_HISTORY`](/information-schema/information-schema-tidb-hot-regions-history.md) + - [`TIDB_INDEXES`](/information-schema/information-schema-tidb-indexes.md) + - [`TIDB_INDEX_USAGE`](/information-schema/information-schema-tidb-index-usage.md) + - [`TIDB_SERVERS_INFO`](/information-schema/information-schema-tidb-servers-info.md) + - [`TIDB_TRX`](/information-schema/information-schema-tidb-trx.md) + - [`TIFLASH_REPLICA`](/information-schema/information-schema-tiflash-replica.md) + - [`TIFLASH_SEGMENTS`](/information-schema/information-schema-tiflash-segments.md) + - [`TIFLASH_TABLES`](/information-schema/information-schema-tiflash-tables.md) + - [`TIKV_REGION_PEERS`](/information-schema/information-schema-tikv-region-peers.md) + - [`TIKV_REGION_STATUS`](/information-schema/information-schema-tikv-region-status.md) + - [`TIKV_STORE_STATUS`](/information-schema/information-schema-tikv-store-status.md) + - [`USER_ATTRIBUTES`](/information-schema/information-schema-user-attributes.md) + - [`USER_PRIVILEGES`](/information-schema/information-schema-user-privileges.md) + - 
[`VARIABLES_INFO`](/information-schema/information-schema-variables-info.md) + - [`VIEWS`](/information-schema/information-schema-views.md) + - PERFORMANCE_SCHEMA + - [Overview](/performance-schema/performance-schema.md) + - [`SESSION_CONNECT_ATTRS`](/performance-schema/performance-schema-session-connect-attrs.md) + - SYS + - [Overview](/sys-schema/sys-schema.md) + - [`schema_unused_indexes`](/sys-schema/sys-schema-unused-indexes.md) + - [Metadata Lock](/metadata-lock.md) + - [Use UUIDs](/best-practices/uuid.md) + - [TiDB Accelerated Table Creation](/accelerated-table-creation.md) + - [System Variables](/system-variables.md) + - [Server Status Variables](/status-variables.md) + - Storage Engines + - TiKV + - [TiKV Overview](/tikv-overview.md) + - [RocksDB Overview](/storage-engine/rocksdb-overview.md) + - TiFlash + - [TiFlash Overview](/tiflash/tiflash-overview.md) + - [Spill to Disk](/tiflash/tiflash-spill-disk.md) + - CLI + - [Overview](/tidb-cloud/cli-reference.md) + - auth + - [login](/tidb-cloud/ticloud-auth-login.md) + - [logout](/tidb-cloud/ticloud-auth-logout.md) + - serverless + - [create](/tidb-cloud/ticloud-cluster-create.md) + - [delete](/tidb-cloud/ticloud-cluster-delete.md) + - [describe](/tidb-cloud/ticloud-cluster-describe.md) + - [list](/tidb-cloud/ticloud-cluster-list.md) + - [update](/tidb-cloud/ticloud-serverless-update.md) + - [spending-limit](/tidb-cloud/ticloud-serverless-spending-limit.md) + - [region](/tidb-cloud/ticloud-serverless-region.md) + - [shell](/tidb-cloud/ticloud-serverless-shell.md) + - branch + - [create](/tidb-cloud/ticloud-branch-create.md) + - [delete](/tidb-cloud/ticloud-branch-delete.md) + - [describe](/tidb-cloud/ticloud-branch-describe.md) + - [list](/tidb-cloud/ticloud-branch-list.md) + - [shell](/tidb-cloud/ticloud-branch-shell.md) + - import + - [cancel](/tidb-cloud/ticloud-import-cancel.md) + - [describe](/tidb-cloud/ticloud-import-describe.md) + - [list](/tidb-cloud/ticloud-import-list.md) + - [start](/tidb-cloud/ticloud-import-start.md) + - export + - [create](/tidb-cloud/ticloud-serverless-export-create.md) + - [describe](/tidb-cloud/ticloud-serverless-export-describe.md) + - [list](/tidb-cloud/ticloud-serverless-export-list.md) + - [cancel](/tidb-cloud/ticloud-serverless-export-cancel.md) + - [download](/tidb-cloud/ticloud-serverless-export-download.md) + - [ai](/tidb-cloud/ticloud-ai.md) + - [completion](/tidb-cloud/ticloud-completion.md) + - config + - [create](/tidb-cloud/ticloud-config-create.md) + - [delete](/tidb-cloud/ticloud-config-delete.md) + - [describe](/tidb-cloud/ticloud-config-describe.md) + - [edit](/tidb-cloud/ticloud-config-edit.md) + - [list](/tidb-cloud/ticloud-config-list.md) + - [set](/tidb-cloud/ticloud-config-set.md) + - [use](/tidb-cloud/ticloud-config-use.md) + - project + - [list](/tidb-cloud/ticloud-project-list.md) + - [update](/tidb-cloud/ticloud-update.md) + - [help](/tidb-cloud/ticloud-help.md) + - [Table Filter](/table-filter.md) + - [Resource Control](/tidb-resource-control.md) + - [URI Formats of External Storage Services](/external-storage-uri.md) + - [DDL Execution Principles and Best Practices](/ddl-introduction.md) + - [Troubleshoot Inconsistency Between Data and Indexes](/troubleshoot-data-inconsistency-errors.md) + - [Support](/tidb-cloud/tidb-cloud-support.md) + - [Glossary](/tidb-cloud/tidb-cloud-glossary.md) +- FAQs + - [TiDB Cloud FAQs](/tidb-cloud/tidb-cloud-faq.md) + - [TiDB Cloud Serverless FAQs](/tidb-cloud/serverless-faqs.md) +- Release Notes + - 
[2024](/tidb-cloud/tidb-cloud-release-notes.md) + - [2023](/tidb-cloud/release-notes-2023.md) + - [2022](/tidb-cloud/release-notes-2022.md) + - [2021](/tidb-cloud/release-notes-2021.md) + - [2020](/tidb-cloud/release-notes-2020.md) +- Maintenance Notification + - [[2024-09-15] TiDB Cloud Console Maintenance Notification](/tidb-cloud/notification-2024-09-15-console-maintenance.md) + - [[2024-04-18] TiDB Cloud Data Migration (DM) Feature Maintenance Notification](/tidb-cloud/notification-2024-04-18-dm-feature-maintenance.md) + - [[2024-04-16] TiDB Cloud Monitoring Features Maintenance Notification](/tidb-cloud/notification-2024-04-16-monitoring-features-maintenance.md) + - [[2024-04-11] TiDB Cloud Data Migration (DM) Feature Maintenance Notification](/tidb-cloud/notification-2024-04-11-dm-feature-maintenance.md) + - [[2024-04-09] TiDB Cloud Monitoring Features Maintenance Notification](/tidb-cloud/notification-2024-04-09-monitoring-features-maintenance.md) + - [[2023-11-14] TiDB Cloud Dedicated Scale Feature Maintenance Notification](/tidb-cloud/notification-2023-11-14-scale-feature-maintenance.md) + - [[2023-09-26] TiDB Cloud Console Maintenance Notification](/tidb-cloud/notification-2023-09-26-console-maintenance.md) + - [[2023-08-31] TiDB Cloud Console Maintenance Notification](/tidb-cloud/notification-2023-08-31-console-maintenance.md) diff --git a/markdown-pages/en/tidb/master/TOC.md b/markdown-pages/en/tidb/master/TOC.md index bd6eccd..1e42f03 100644 --- a/markdown-pages/en/tidb/master/TOC.md +++ b/markdown-pages/en/tidb/master/TOC.md @@ -81,6 +81,24 @@ - [Follower Read](/develop/dev-guide-use-follower-read.md) - [Stale Read](/develop/dev-guide-use-stale-read.md) - [HTAP Queries](/develop/dev-guide-hybrid-oltp-and-olap-queries.md) + - Vector Search + - [Overview](/vector-search-overview.md) + - Get Started + - [Get Started with SQL](/vector-search-get-started-using-sql.md) + - [Get Started with Python](/vector-search-get-started-using-python.md) + - Integrations + - [Overview](/vector-search-integration-overview.md) + - AI Frameworks + - [LlamaIndex](/vector-search-integrate-with-llamaindex.md) + - [Langchain](/vector-search-integrate-with-langchain.md) + - Embedding Models/Services + - [Jina AI](/vector-search-integrate-with-jinaai-embedding.md) + - ORM Libraries + - [SQLAlchemy](/vector-search-integrate-with-sqlalchemy.md) + - [peewee](/vector-search-integrate-with-peewee.md) + - [Django ORM](/vector-search-integrate-with-django-orm.md) + - [Improve Performance](/vector-search-improve-performance.md) + - [Limitations](/vector-search-limitations.md) - Transaction - [Overview](/develop/dev-guide-transaction-overview.md) - [Optimistic and Pessimistic Transactions](/develop/dev-guide-optimistic-and-pessimistic-transaction.md) @@ -119,6 +137,7 @@ - [PD Microservices Topology](/pd-microservices-deployment-topology.md) - [TiProxy Topology](/tiproxy/tiproxy-deployment-topology.md) - [TiCDC Topology](/ticdc-deployment-topology.md) + - [TiDB Binlog Topology](/tidb-binlog-deployment-topology.md) - [TiSpark Topology](/tispark-deployment-topology.md) - [Cross-DC Topology](/geo-distributed-deployment-topology.md) - [Hybrid Topology](/hybrid-deployment-topology.md) @@ -157,14 +176,6 @@ - [Integrate with Confluent and Snowflake](/ticdc/integrate-confluent-using-ticdc.md) - [Integrate with Apache Kafka and Apache Flink](/replicate-data-to-kafka.md) - Maintain - - Security - - [Best Practices for TiDB Security Configuration](/best-practices-for-security-configuration.md) - - [Enable TLS Between TiDB 
Clients and Servers](/enable-tls-between-clients-and-servers.md) - - [Enable TLS Between TiDB Components](/enable-tls-between-components.md) - - [Generate Self-signed Certificates](/generate-self-signed-certificates.md) - - [Encryption at Rest](/encryption-at-rest.md) - - [Enable Encryption for Disk Spill](/enable-disk-spill-encrypt.md) - - [Log Redaction](/log-redaction.md) - Upgrade - [Use TiUP](/upgrade-tidb-using-tiup.md) - [Use TiDB Operator](https://docs.pingcap.com/tidb-in-kubernetes/stable/upgrade-a-tidb-cluster) @@ -611,6 +622,26 @@ - [Troubleshoot](/ticdc/troubleshoot-ticdc.md) - [FAQs](/ticdc/ticdc-faq.md) - [Glossary](/ticdc/ticdc-glossary.md) + - TiDB Binlog (Deprecated) + - [Overview](/tidb-binlog/tidb-binlog-overview.md) + - [Quick Start](/tidb-binlog/get-started-with-tidb-binlog.md) + - [Deploy](/tidb-binlog/deploy-tidb-binlog.md) + - [Maintain](/tidb-binlog/maintain-tidb-binlog-cluster.md) + - [Configure](/tidb-binlog/tidb-binlog-configuration-file.md) + - [Pump](/tidb-binlog/tidb-binlog-configuration-file.md#pump) + - [Drainer](/tidb-binlog/tidb-binlog-configuration-file.md#drainer) + - [Upgrade](/tidb-binlog/upgrade-tidb-binlog.md) + - [Monitor](/tidb-binlog/monitor-tidb-binlog-cluster.md) + - [Reparo](/tidb-binlog/tidb-binlog-reparo.md) + - [binlogctl](/tidb-binlog/binlog-control.md) + - [Binlog Consumer Client](/tidb-binlog/binlog-consumer-client.md) + - [TiDB Binlog Relay Log](/tidb-binlog/tidb-binlog-relay-log.md) + - [Bidirectional Replication Between TiDB Clusters](/tidb-binlog/bidirectional-replication-between-tidb-clusters.md) + - [Glossary](/tidb-binlog/tidb-binlog-glossary.md) + - Troubleshoot + - [Troubleshoot](/tidb-binlog/troubleshoot-tidb-binlog.md) + - [Handle Errors](/tidb-binlog/handle-tidb-binlog-errors.md) + - [FAQ](/tidb-binlog/tidb-binlog-faq.md) - PingCAP Clinic Diagnostic Service - [Overview](/clinic/clinic-introduction.md) - [Quick Start](/clinic/quick-start-with-clinic.md) @@ -689,6 +720,13 @@ - [TiFlash](/tiflash/monitor-tiflash.md) - [TiCDC](/ticdc/monitor-ticdc.md) - [Resource Control](/grafana-resource-control-dashboard.md) + - Security + - [Enable TLS Between TiDB Clients and Servers](/enable-tls-between-clients-and-servers.md) + - [Enable TLS Between TiDB Components](/enable-tls-between-components.md) + - [Generate Self-signed Certificates](/generate-self-signed-certificates.md) + - [Encryption at Rest](/encryption-at-rest.md) + - [Enable Encryption for Disk Spill](/enable-disk-spill-encrypt.md) + - [Log Redaction](/log-redaction.md) - Privileges - [Security Compatibility with MySQL](/security-compatibility-with-mysql.md) - [Privilege Management](/privilege-management.md) @@ -746,6 +784,8 @@ - [`CALIBRATE RESOURCE`](/sql-statements/sql-statement-calibrate-resource.md) - [`CANCEL IMPORT JOB`](/sql-statements/sql-statement-cancel-import-job.md) - [`COMMIT`](/sql-statements/sql-statement-commit.md) + - [`CHANGE DRAINER`](/sql-statements/sql-statement-change-drainer.md) + - [`CHANGE PUMP`](/sql-statements/sql-statement-change-pump.md) - [`CREATE BINDING`](/sql-statements/sql-statement-create-binding.md) - [`CREATE DATABASE`](/sql-statements/sql-statement-create-database.md) - [`CREATE INDEX`](/sql-statements/sql-statement-create-index.md) @@ -825,6 +865,7 @@ - [`SHOW CREATE TABLE`](/sql-statements/sql-statement-show-create-table.md) - [`SHOW CREATE USER`](/sql-statements/sql-statement-show-create-user.md) - [`SHOW DATABASES`](/sql-statements/sql-statement-show-databases.md) + - [`SHOW DRAINER 
STATUS`](/sql-statements/sql-statement-show-drainer-status.md) - [`SHOW ENGINES`](/sql-statements/sql-statement-show-engines.md) - [`SHOW ERRORS`](/sql-statements/sql-statement-show-errors.md) - [`SHOW FIELDS FROM`](/sql-statements/sql-statement-show-fields-from.md) @@ -839,6 +880,7 @@ - [`SHOW PRIVILEGES`](/sql-statements/sql-statement-show-privileges.md) - [`SHOW PROCESSLIST`](/sql-statements/sql-statement-show-processlist.md) - [`SHOW PROFILES`](/sql-statements/sql-statement-show-profiles.md) + - [`SHOW PUMP STATUS`](/sql-statements/sql-statement-show-pump-status.md) - [`SHOW SCHEMAS`](/sql-statements/sql-statement-show-schemas.md) - [`SHOW STATS_BUCKETS`](/sql-statements/sql-statement-show-stats-buckets.md) - [`SHOW STATS_HEALTHY`](/sql-statements/sql-statement-show-stats-healthy.md) @@ -870,6 +912,7 @@ - [Date and Time Types](/data-type-date-and-time.md) - [String Types](/data-type-string.md) - [JSON Type](/data-type-json.md) + - [Vector Types](/vector-search-data-types.md) - Functions and Operators - [Overview](/functions-and-operators/functions-and-operators-overview.md) - [Type Conversion in Expression Evaluation](/functions-and-operators/type-conversion-in-expression-evaluation.md) @@ -883,6 +926,7 @@ - [Encryption and Compression Functions](/functions-and-operators/encryption-and-compression-functions.md) - [Locking Functions](/functions-and-operators/locking-functions.md) - [Information Functions](/functions-and-operators/information-functions.md) + - [Vector Functions and Operators](/vector-search-functions-and-operators.md) - JSON Functions - [Overview](/functions-and-operators/json-functions.md) - [Functions That Create JSON](/functions-and-operators/json-functions/json-functions-create.md) @@ -903,6 +947,7 @@ - [TiDB Specific Functions](/functions-and-operators/tidb-functions.md) - [Comparisons between Functions and Syntax of Oracle and TiDB](/oracle-functions-to-tidb.md) - [Clustered Indexes](/clustered-indexes.md) + - [Vector Index](/vector-search-index.md) - [Constraints](/constraints.md) - [Generated Columns](/generated-columns.md) - [SQL Mode](/sql-mode.md) @@ -970,7 +1015,6 @@ - [`TABLES`](/information-schema/information-schema-tables.md) - [`TABLE_CONSTRAINTS`](/information-schema/information-schema-table-constraints.md) - [`TABLE_STORAGE_STATS`](/information-schema/information-schema-table-storage-stats.md) - - [`TIDB_CHECK_CONSTRAINTS`](/information-schema/information-schema-tidb-check-constraints.md) - [`TIDB_HOT_REGIONS`](/information-schema/information-schema-tidb-hot-regions.md) - [`TIDB_HOT_REGIONS_HISTORY`](/information-schema/information-schema-tidb-hot-regions-history.md) - [`TIDB_INDEXES`](/information-schema/information-schema-tidb-indexes.md) @@ -1062,7 +1106,6 @@ - v7.6 - [7.6.0-DMR](/releases/release-7.6.0.md) - v7.5 - - [7.5.4](/releases/release-7.5.4.md) - [7.5.3](/releases/release-7.5.3.md) - [7.5.2](/releases/release-7.5.2.md) - [7.5.1](/releases/release-7.5.1.md) @@ -1085,7 +1128,6 @@ - v6.6 - [6.6.0-DMR](/releases/release-6.6.0.md) - v6.5 - - [6.5.11](/releases/release-6.5.11.md) - [6.5.10](/releases/release-6.5.10.md) - [6.5.9](/releases/release-6.5.9.md) - [6.5.8](/releases/release-6.5.8.md) diff --git a/markdown-pages/en/tidb/master/br/backup-and-restore-overview.md b/markdown-pages/en/tidb/master/br/backup-and-restore-overview.md new file mode 100644 index 0000000..516a0f0 --- /dev/null +++ b/markdown-pages/en/tidb/master/br/backup-and-restore-overview.md @@ -0,0 +1,144 @@ +--- +title: TiDB Backup & Restore Overview +summary: TiDB Backup & 
Restore (BR) ensures high availability of clusters and data safety. It supports disaster recovery with a short RPO, handles misoperations, and provides history data auditing. It is recommended to perform backup operations during off-peak hours and store backup data to compatible storage systems. BR supports full backup and log backup, as well as restoring data to any point in time. It is important to use BR of the same major version as the TiDB cluster for backup and restoration. +aliases: ['/docs/dev/br/backup-and-restore-tool/','/docs/dev/reference/tools/br/br/','/docs/dev/how-to/maintain/backup-and-restore/br/','/tidb/dev/backup-and-restore-tool/','/tidb/dev/point-in-time-recovery/'] +--- + +# TiDB Backup & Restore Overview + +Based on the Raft protocol and a reasonable deployment topology, TiDB realizes high availability of clusters. When a few nodes in the cluster fail, the cluster can still be available. On this basis, to further ensure data safety, TiDB provides the Backup & Restore (BR) feature as the last resort to recover data from natural disasters and misoperations. + +BR satisfies the following requirements: + +- Back up cluster data to a disaster recovery (DR) system with an RPO as short as 5 minutes, reducing data loss in disaster scenarios. +- Handle the cases of misoperations from applications by rolling back data to a time point before the error event. +- Perform history data auditing to meet the requirements of judicial supervision. +- Clone the production environment, which is convenient for troubleshooting, performance tuning, and simulation testing. + +## Before you use + +This section describes the prerequisites for using TiDB backup and restore, including restrictions, usage tips and compatibility issues. + +### Restrictions + +- PITR only supports restoring data to **an empty cluster**. +- PITR only supports cluster-level restore and does not support database-level or table-level restore. +- PITR does not support restoring the data of user tables or privilege tables from system tables. +- BR does not support running multiple backup tasks on a cluster **at the same time**. +- BR does not support running snapshot backup tasks and data restore tasks on a cluster **at the same time**. +- When a PITR is running, you cannot run a log backup task or use TiCDC to replicate data to a downstream cluster. + +### Some tips + +Snapshot backup: + +- It is recommended that you perform the backup operation during off-peak hours to minimize the impact on applications. +- It is recommended that you execute multiple backup or restore tasks one by one. Running multiple backup tasks in parallel leads to low performance. Worse still, a lack of collaboration between multiple tasks might result in task failures and affect cluster performance. + +Snapshot restore: + +- BR uses resources of the target cluster as much as possible. Therefore, it is recommended that you restore data to a new cluster or an offline cluster. Avoid restoring data to a production cluster. Otherwise, your application will be affected inevitably. + +Backup storage and network configuration: + +- It is recommended that you store backup data to a storage system that is compatible with Amazon S3, GCS, or Azure Blob Storage. +- You need to ensure that BR, TiKV, and the backup storage system have enough network bandwidth, and that the backup storage system can provide sufficient read and write performance (IOPS). Otherwise, they might become a performance bottleneck during backup and restore. 
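+
+For example, the following is a minimal sketch of starting a snapshot backup to S3-compatible storage during off-peak hours; the PD address, bucket name, region, and rate limit are illustrative assumptions rather than recommended values:
+
+```shell
+# Sketch: run during off-peak hours; --ratelimit caps the backup speed (in MiB/s) on each TiKV node.
+tiup br backup full \
+    --pd "10.0.1.10:2379" \
+    --storage "s3://backup-bucket/snapshot-2024-10-17?region=us-west-2" \
+    --ratelimit 128 \
+    --log-file backupfull.log
+```
+
+Adjust `--ratelimit` according to the spare I/O capacity of the cluster, and see [Specify backup storage in URI](/external-storage-uri.md) for the storage URI format.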
+ +## Use backup and restore + +The way to use BR varies with the deployment method of TiDB. This document introduces how to use the br command-line tool to back up and restore TiDB cluster data in an on-premise deployment. + +For information about how to use this feature in other deployment scenarios, see the following documents: + +- [Back Up and Restore TiDB Deployed on TiDB Cloud](https://docs.pingcap.com/tidbcloud/backup-and-restore): It is recommended that you create TiDB clusters on [TiDB Cloud](https://www.pingcap.com/tidb-cloud/?from=en). TiDB Cloud offers fully managed databases to let you focus on your applications. +- [Back Up and Restore Data Using TiDB Operator](https://docs.pingcap.com/tidb-in-kubernetes/stable/backup-restore-overview): If you deploy a TiDB cluster using TiDB Operator on Kubernetes, it is recommended to back up and restore data using Kubernetes CustomResourceDefinition (CRD). + +## BR features + +TiDB BR provides the following features: + +- Back up cluster data: You can back up full data (**full backup**) of the cluster at a certain time point, or back up the data changes in TiDB (**log backup**, in which log means KV changes in TiKV). + +- Restore backup data: + + - You can **restore a full backup** or **specific databases or tables** in a full backup. + - Based on backup data (full backup and log backup), you can restore the target cluster to any time point of the backup cluster. This type of restore is called point-in-time recovery, or PITR for short. + +### Back up cluster data + +Full backup backs up all data of a cluster at a specific time point. TiDB supports the following way of full backup: + +- Back up cluster snapshots: A snapshot of a TiDB cluster contains transactionally consistent data at a specific time. For details, see [Snapshot backup](/br/br-snapshot-guide.md#back-up-cluster-snapshots). + +Full backup occupies much storage space and contains only cluster data at a specific time point. If you want to choose the restore point as required, that is, to perform point-in-time recovery (PITR), you can use the following two ways of backup at the same time: + +- Start [log backup](/br/br-pitr-guide.md#start-log-backup). After log backup is started, the task keeps running on all TiKV nodes and backs up TiDB incremental data in small batches to the specified storage periodically. +- Perform snapshot backup regularly. Back up the full cluster data to the backup storage, for example, perform cluster snapshot backup at 0:00 AM every day. + +#### Backup performance and impact on TiDB clusters + +- When CPU and I/O resources are sufficient in the cluster, the snapshot backup has a limited impact on the TiDB cluster, generally staying below 20%. With appropriate configuration of the TiDB cluster, this impact can be further minimized to 10% or even less. When CPU and I/O resources are insufficient, you can adjust the TiKV configuration item [`backup.num-threads`](/tikv-configuration-file.md#num-threads-1) to change the number of worker threads used by the backup task to reduce the impact of the backup task on the TiDB cluster. The backup speed of a TiKV node is scalable and ranges from 50 MB/s to 100 MB/s. For more information, see [Backup performance and impact](/br/br-snapshot-guide.md#performance-and-impact-of-snapshot-backup). +- When there are only log backup tasks, the impact on the cluster is about 5%. 
Log backup flushes all the changes generated after the last refresh every 3-5 minutes to the backup storage, which can **achieve a Recovery Point Objective (RPO) as short as five minutes**. + +### Restore backup data + +Corresponding to the backup features, you can perform two types of restore: full restore and PITR. + +- Restore a full backup + + - Restore cluster snapshot backup: You can restore snapshot backup data to an empty cluster or a cluster that does not have data conflicts (with the same schema or tables). For details, see [Restore snapshot backup](/br/br-snapshot-guide.md#restore-cluster-snapshots). In addition, you can restore specific databases or tables from the backup data and filter out unwanted data. For details, see [Restore specific databases or tables from backup data](/br/br-snapshot-guide.md#restore-a-database-or-a-table). + +- Restore data to any point in time (PITR) + + - By running the `br restore point` command, you can restore the latest snapshot backup data before recovery time point and log backup data to a specified time. BR automatically determines the restore scope, accesses backup data, and restores data to the target cluster in turn. + +#### Restore performance and impact on TiDB clusters + +- Data restore is performed at a scalable speed. Generally, the speed is 100 MiB/s per TiKV node. For more details, see [Restore performance and impact](/br/br-snapshot-guide.md#performance-and-impact-of-snapshot-restore). +- On each TiKV node, PITR can restore log data at 30 GiB/h. For more details, see [PITR performance and impact](/br/br-pitr-guide.md#performance-capabilities-of-pitr). + +## Backup storage + +TiDB supports backing up data to Amazon S3, Google Cloud Storage (GCS), Azure Blob Storage, NFS, and other S3-compatible file storage services. For details, see the following documents: + +- [Specify backup storage in URI](/external-storage-uri.md) +- [Configure access privileges to backup storages](/br/backup-and-restore-storages.md#authentication) + +## Compatibility + +### Compatibility with other features + +Backup and restore might go wrong when some TiDB features are enabled or disabled. If these features are not consistently enabled or disabled during backup and restore, compatibility issues might occur. + +| Feature | Issue | Solution | +| ---- | ---- | ----- | +|GBK charset|| BR of versions earlier than v5.4.0 does not support restoring `charset=GBK` tables. No version of BR supports recovering `charset=GBK` tables to TiDB clusters earlier than v5.4.0. | +| Clustered index | [#565](https://github.com/pingcap/br/issues/565) | Make sure that the value of the `tidb_enable_clustered_index` global variable during restore is consistent with that during backup. Otherwise, data inconsistency might occur, such as `default not found` error and inconsistent data index. | +| New collation | [#352](https://github.com/pingcap/br/issues/352) | Make sure that the value of the `new_collation_enabled` variable in the `mysql.tidb` table during restore is consistent with that during backup. Otherwise, inconsistent data index might occur and checksum might fail to pass. For more information, see [FAQ - Why does BR report `new_collations_enabled_on_first_bootstrap` mismatch?](/faq/backup-and-restore-faq.md#why-is-new_collation_enabled-mismatch-reported-during-restore). | +| Global temporary tables | | Make sure that you are using v5.3.0 or a later version of BR to back up and restore data. Otherwise, an error occurs in the definition of the backed global temporary tables. 
| +| TiDB Lightning Physical Import| | If the upstream database uses the physical import mode of TiDB Lightning, data cannot be backed up in log backup. It is recommended to perform a full backup after the data import. For more information, see [When the upstream database imports data using TiDB Lightning in the physical import mode, the log backup feature becomes unavailable. Why?](/faq/backup-and-restore-faq.md#when-the-upstream-database-imports-data-using-tidb-lightning-in-the-physical-import-mode-the-log-backup-feature-becomes-unavailable-why).| +| TiCDC | | BR v8.2.0 and later: if the target cluster to be restored has a changefeed and the changefeed [CheckpointTS](/ticdc/ticdc-architecture.md#checkpointts) is earlier than the BackupTS, BR does not perform the restoration. BR versions before v8.2.0: if the target cluster to be restored has any active TiCDC changefeeds, BR does not perform the restoration. | +| Vector search | | Make sure that you are using v8.4.0 or a later version of BR to back up and restore data. Restoring tables with [vector data types](/vector-search-data-types.md) to TiDB clusters earlier than v8.4.0 is not supported. | + +### Version compatibility + +> **Note:** +> +> It is recommended to use the BR of the same major version as your TiDB cluster for backup and restoration. + +Before performing backup and restore, BR compares the TiDB cluster version with its own and checks their compatibility. If the versions are incompatible, BR reports an error and exits. To forcibly skip the version check, you can set `--check-requirements=false`. Note that skipping the version check might introduce incompatibility in data. + +Starting from v7.0.0, TiDB gradually supports performing backup and restore operations through SQL statements. Therefore, it is strongly recommended to use the BR tool of the same major version as the TiDB cluster when backing up and restoring cluster data, and avoid performing data backup and restore operations across major versions. This helps ensure smooth execution of restore operations and data consistency. Starting from v7.6.0, BR restores data in some `mysql` system tables by default, that is, the `--with-sys-table` option is set to `true` by default. When restoring data to a TiDB cluster with a different version, if you encounter an error similar to `[BR:Restore:ErrRestoreIncompatibleSys]incompatible system table` due to different schemas of system tables, you can set `--with-sys-table=false` to skip restoring the system tables and avoid this error. + +The compatibility information for BR before TiDB v6.6.0 is as follows: + +| Backup version (vertical) \ Restore version (horizontal) | Restore to TiDB v6.0 | Restore to TiDB v6.1 | Restore to TiDB v6.2 | Restore to TiDB v6.3, v6.4, or v6.5 | Restore to TiDB v6.6 | +| ---- | ---- | ---- | ---- | ---- | ---- | +| TiDB v6.0, v6.1, v6.2, v6.3, v6.4, or v6.5 snapshot backup | Compatible (known issue [#36379](https://github.com/pingcap/tidb/issues/36379): if backup data contains an empty schema, BR might report an error.) 
| Compatible | Compatible | Compatible | Compatible (BR must be v6.6) | +| TiDB v6.3, v6.4, v6.5, or v6.6 log backup| Incompatible | Incompatible | Incompatible | Compatible | Compatible | + +## See also + +- [TiDB Snapshot Backup and Restore Guide](/br/br-snapshot-guide.md) +- [TiDB Log Backup and PITR Guide](/br/br-pitr-guide.md) +- [Backup Storages](/br/backup-and-restore-storages.md) diff --git a/markdown-pages/en/tidb/master/dm/dm-overview.md b/markdown-pages/en/tidb/master/dm/dm-overview.md new file mode 100644 index 0000000..d3fef74 --- /dev/null +++ b/markdown-pages/en/tidb/master/dm/dm-overview.md @@ -0,0 +1,95 @@ +--- +title: TiDB Data Migration Overview +summary: Learn about the Data Migration tool, the architecture, the key components, and features. +aliases: ['/docs/tidb-data-migration/dev/overview/','/docs/tidb-data-migration/dev/feature-overview/','/tidb/dev/dm-key-features'] +--- + + + +# TiDB Data Migration Overview + + + +[TiDB Data Migration](https://github.com/pingcap/tiflow/tree/master/dm) (DM) is an integrated data migration task management platform, which supports the full data migration and the incremental data replication from MySQL-compatible databases (such as MySQL, MariaDB, and Aurora MySQL) into TiDB. It can help to reduce the operation cost of data migration and simplify the troubleshooting process. + +## Basic features + +- **Compatibility with MySQL.** DM is compatible with the MySQL protocol and most of the features and syntax of MySQL 5.7 and MySQL 8.0. +- **Replicating DML and DDL events.** It supports parsing and replicating DML and DDL events in MySQL binlog. +- **Migrating and merging MySQL shards.** DM supports migrating and merging multiple MySQL database instances upstream to one TiDB database downstream. It supports customizing replication rules for different migration scenarios. It can automatically detect and handle DDL changes of upstream MySQL shards, which greatly reduces the operational cost. +- **Various types of filters.** You can predefine event types, regular expressions, and SQL expressions to filter out MySQL binlog events during the data migration process. +- **Centralized management.** DM supports thousands of nodes in a cluster. It can run and manage a large number of data migration tasks concurrently. +- **Optimization of the third-party Online Schema Change process.** In the MySQL ecosystem, tools such as gh-ost and pt-osc are widely used. DM optimizes its change process to avoid unnecessary migration of intermediate data. For details, see [online-ddl](/dm/dm-online-ddl-tool-support.md). +- **High availability.** DM supports data migration tasks to be scheduled freely on different nodes. The running tasks are not affected when a small number of nodes crash. + +## Quick installation + +Run the following command to install DM: + +```shell +curl --proto '=https' --tlsv1.2 -sSf https://tiup-mirrors.pingcap.com/install.sh | sh +tiup install dm dmctl +``` + +## Usage restrictions + +Before using the DM tool, note the following restrictions: + ++ Database version requirements + + - MySQL version 5.6 ~ 8.0 + - MariaDB version >= 10.1.2 (experimental features) + + > **Note:** + > + > If there is a primary-secondary migration structure between the upstream MySQL/MariaDB servers, then choose the following version. + > + > - MySQL version > 5.7.1 + > - MariaDB version >= 10.1.3 + ++ DDL syntax compatibility + + - Currently, TiDB is not compatible with all the DDL statements that MySQL supports. 
Because DM uses the TiDB parser to process DDL statements, it only supports the DDL syntax supported by the TiDB parser. For details, see [MySQL Compatibility](/mysql-compatibility.md#ddl-operations). + + - DM reports an error when it encounters an incompatible DDL statement. To solve this error, you need to manually handle it using dmctl, either skipping this DDL statement or replacing it with specified DDL statements. For details, see [Skip or replace abnormal SQL statements](/dm/dm-faq.md#how-to-handle-incompatible-ddl-statements). + + - DM does not replicate view-related DDL statements and DML statements to the downstream TiDB cluster. It is recommended that you create the view in the downstream TiDB cluster manually. + ++ GBK character set compatibility + + - DM does not support migrating `charset=GBK` tables to TiDB clusters earlier than v5.4.0. + ++ Binlog compatibility + + - DM does not support the MySQL 8.0 new feature binlog [Transaction_payload_event](https://dev.mysql.com/doc/refman/8.0/en/binary-log-transaction-compression.html). Using binlog Transaction_payload_event might result in data inconsistency between upstream and downstream. + ++ Vector data type replication + + - DM does not support migrating or replicating MySQL 9.0 vector data types to TiDB. + +## Contributing + +You are welcome to participate in the DM open sourcing project. Your contribution would be highly appreciated. For more details, see [CONTRIBUTING.md](https://github.com/pingcap/tiflow/blob/master/dm/CONTRIBUTING.md). + +## Community support + +You can learn about DM through the online documentation. If you have any questions, contact us on [GitHub](https://github.com/pingcap/tiflow/tree/master/dm). + +## License + +DM complies with the Apache 2.0 license. For more details, see [LICENSE](https://github.com/pingcap/tiflow/blob/master/LICENSE). + +## DM versions + +Before v5.4, the DM documentation is independent of the TiDB documentation. To access these earlier versions of the DM documentation, click one of the following links: + +- [DM v5.3 documentation](https://docs.pingcap.com/tidb-data-migration/v5.3) +- [DM v2.0 documentation](https://docs.pingcap.com/tidb-data-migration/v2.0/) +- [DM v1.0 documentation](https://docs.pingcap.com/tidb-data-migration/v1.0/) + +> **Note:** +> +> - Since October 2021, DM's GitHub repository has been moved to [pingcap/tiflow](https://github.com/pingcap/tiflow/tree/master/dm). If you see any issues with DM, submit your issue to the `pingcap/tiflow` repository for feedback. +> - In earlier versions (v1.0 and v2.0), DM uses version numbers that are independent of TiDB. Since v5.3, DM uses the same version number as TiDB. The next version of DM v2.0 is DM v5.3. There are no compatibility changes from DM v2.0 to v5.3, and the upgrade process is the same as a normal upgrade, only an increase in version number. diff --git a/markdown-pages/en/tidb/master/ticdc/ticdc-compatibility.md b/markdown-pages/en/tidb/master/ticdc/ticdc-compatibility.md new file mode 100644 index 0000000..c127839 --- /dev/null +++ b/markdown-pages/en/tidb/master/ticdc/ticdc-compatibility.md @@ -0,0 +1,74 @@ +--- +title: TiCDC Compatibility +summary: Learn about compatibility issues of TiCDC and how to handle them. +--- + +# TiCDC Compatibility + +This section describes compatibility issues related to TiCDC and how to handle them. 
+ + + +## CLI and configuration file compatibility + +* In TiCDC v4.0.0, `ignore-txn-commit-ts` is removed and `ignore-txn-start-ts` is added, which uses start_ts to filter transactions. +* In TiCDC v4.0.2, `db-dbs`/`db-tables`/`ignore-dbs`/`ignore-tables` are removed and `rules` is added, which uses new filter rules for databases and tables. For detailed filter syntax, see [Table Filter](/table-filter.md). +* Starting from TiCDC v6.2.0, `cdc cli` can directly interact with TiCDC server via TiCDC Open API. You can specify the address of the TiCDC server using the `--server` parameter. `--pd` is deprecated. +* Since v6.4.0, only the changefeed with the `SYSTEM_VARIABLES_ADMIN` or `SUPER` privilege can use the TiCDC Syncpoint feature. + +## Handle compatibility issues + +This section describes compatibility issues related to TiCDC and how to handle them. + +### Incompatibility issue caused by using the TiCDC v5.0.0-rc `cdc cli` tool to operate a v4.0.x cluster + +When using the `cdc cli` tool of TiCDC v5.0.0-rc to operate a v4.0.x TiCDC cluster, you might encounter the following abnormal situations: + +- If the TiCDC cluster is v4.0.8 or an earlier version, using the v5.0.0-rc `cdc cli` tool to create a replication task might cause cluster anomalies and get the replication task stuck. + +- If the TiCDC cluster is v4.0.9 or a later version, using the v5.0.0-rc `cdc cli` tool to create a replication task will cause the old value and unified sorter features to be unexpectedly enabled by default. + +Solutions: + +Use the `cdc` executable file corresponding to the TiCDC cluster version to perform the following operations: + +1. Delete the changefeed created using the v5.0.0-rc `cdc cli` tool. For example, run the `tiup cdc:v4.0.9 cli changefeed remove -c xxxx --pd=xxxxx --force` command. +2. If the replication task is stuck, restart the TiCDC cluster. For example, run the `tiup cluster restart -R cdc` command. +3. Re-create the changefeed. For example, run the `tiup cdc:v4.0.9 cli changefeed create --sink-uri=xxxx --pd=xxx` command. + +> **Note:** +> +> This issue exists only when `cdc cli` is v5.0.0-rc. `cdc cli` tool of other v5.0.x versions is compatible with v4.0.x clusters. + +### Compatibility notes for `sort-dir` and `data-dir` + +The `sort-dir` configuration is used to specify the temporary file directory for the TiCDC sorter. Its functionalities might vary in different versions. The following table lists `sort-dir`'s compatibility changes across versions. + +| Version | `sort-engine` functionality | Note | Recommendation | +| :--- | :--- | :-- | :-- | +| v4.0.11 or an earlier v4.0 version, v5.0.0-rc | It is a changefeed configuration item and specifies temporary file directory for the `file` sorter and `unified` sorter. | In these versions, `file` sorter and `unified` sorter are **experimental features** and **NOT** recommended for the production environment.

If multiple changefeeds use the `unified` sorter as its `sort-engine`, the actual temporary file directory might be the `sort-dir` configuration of any changefeed, and the directory used for each TiCDC node might be different. | It is not recommended to use `unified` sorter in the production environment. |
+| v4.0.12, v4.0.13, v5.0.0, and v5.0.1 | It is a configuration item of changefeed or of `cdc server`. | By default, the `sort-dir` configuration of a changefeed does not take effect, and the `sort-dir` configuration of `cdc server` defaults to `/tmp/cdc_sort`. It is recommended to only configure `cdc server` in the production environment.<br/><br/>If you use TiUP to deploy TiCDC, it is recommended to use the latest TiUP version and set `sorter.sort-dir` in the TiCDC server configuration.<br/><br/>The `unified` sorter is enabled by default in v4.0.13, v5.0.0, and v5.0.1. If you want to upgrade your cluster to these versions, make sure that you have correctly configured `sorter.sort-dir` in the TiCDC server configuration. | You need to configure `sort-dir` using the `cdc server` command-line parameter (or TiUP). |
+| v4.0.14 and later v4.0 versions, v5.0.3 and later v5.0 versions, later TiDB versions | `sort-dir` is deprecated. It is recommended to configure `data-dir`. | You can configure `data-dir` using the latest version of TiUP. In these TiDB versions, `unified` sorter is enabled by default. Make sure that `data-dir` has been configured correctly when you upgrade your cluster. Otherwise, `/tmp/cdc_data` will be used by default as the temporary file directory.<br/><br/>
If the storage capacity of the device where the directory is located is insufficient, the problem of insufficient hard disk space might occur. In this situation, the previous `sort-dir` configuration of changefeed will become invalid.| You need to configure `data-dir` using the `cdc server` command-line parameter (or TiUP). | +| v6.0.0 and later versions | `data-dir` is used for saving the temporary files generated by TiCDC. | Starting from v6.0.0, TiCDC uses `db sorter` as the sort engine by default. `data-dir` is the disk directory for this engine. | You need to configure `data-dir` using the `cdc server` command-line parameter (or TiUP). | + +### Compatibility with temporary tables + +Since v5.3.0, TiCDC supports [global temporary tables](/temporary-tables.md#global-temporary-tables). Replicating global temporary tables to the downstream using TiCDC of a version earlier than v5.3.0 causes table definition error. + +If the upstream cluster contains a global temporary table, the downstream TiDB cluster is expected to be v5.3.0 or a later version. Otherwise, an error occurs during the replication process. + +### Compatibility with vector data types + +Starting from v8.4.0, TiCDC supports replicating tables with [vector data types](/vector-search-data-types.md) to downstream (experimental). + +When the downstream is Kafka or a storage service (such as Amazon S3, GCS, Azure Blob Storage, or NFS), TiCDC converts vector data types into string types before writing to the downstream. + +When the downstream is a MySQL-compatible database that does not support vector data types, TiCDC fails to write DDL events involving vector types to the downstream. In this case, add the `has-vector-type=true` parameter to `sink-url`, which allows TiCDC to convert vector data types into the `LONGTEXT` type before writing. \ No newline at end of file diff --git a/markdown-pages/en/tidb/master/tidb-cloud/data-service-manage-endpoint.md b/markdown-pages/en/tidb/master/tidb-cloud/data-service-manage-endpoint.md new file mode 100644 index 0000000..0814155 --- /dev/null +++ b/markdown-pages/en/tidb/master/tidb-cloud/data-service-manage-endpoint.md @@ -0,0 +1,478 @@ +--- +title: Manage an Endpoint +summary: Learn how to create, develop, test, deploy, and delete an endpoint in a Data App in the TiDB Cloud console. +--- + +# Manage an Endpoint + +An endpoint in Data Service (beta) is a web API that you can customize to execute SQL statements. You can specify parameters for the SQL statements, such as the value used in the `WHERE` clause. When a client calls an endpoint and provides values for the parameters in a request URL, the endpoint executes the SQL statement with the provided parameters and returns the results as part of the HTTP response. + +This document describes how to manage your endpoints in a Data App in the TiDB Cloud console. + +## Before you begin + +- Before you create an endpoint, make sure the following: + + - You have created a cluster and a Data App. For more information, see [Create a Data App](/tidb-cloud/data-service-manage-data-app.md#create-a-data-app). + - The databases, tables, and columns that the endpoint will operate on already exist in the target cluster. + +- Before you call an endpoint, make sure that you have created an API key in the Data App. For more information, see [Create an API key](/tidb-cloud/data-service-api-key.md#create-an-api-key). 
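+
+For example, the following is a minimal sketch of preparing a table for an endpoint to operate on; the connection placeholders, database name, table name, and columns are assumptions for illustration only (the `age` and `career` columns match the fields used in the curl examples later in this document):
+
+```shell
+# Sketch: create an illustrative database and table over the MySQL protocol before building endpoints against them.
+mysql -h <host> -P 4000 -u <user> -p -e "
+CREATE DATABASE IF NOT EXISTS demo;
+CREATE TABLE IF NOT EXISTS demo.sample_table (
+    id BIGINT PRIMARY KEY AUTO_RANDOM,
+    age INT,
+    career VARCHAR(64)
+);"
+```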
+ +## Create an endpoint + +In Data Service, you can automatically generate endpoints, manually create endpoints, or add predefined system endpoints. + +> **Tip:** +> +> You can also create an endpoint from a SQL file in SQL Editor. For more information, see [Generate an endpoint from a SQL file](/tidb-cloud/explore-data-with-chat2query.md#generate-an-endpoint-from-a-sql-file). + +### Generate an endpoint automatically + +In TiDB Cloud Data Service, you can generate one or multiple endpoints automatically in one go as follows: + +1. Navigate to the [**Data Service**](https://tidbcloud.com/console/data-service) page of your project. +2. In the left pane, locate your target Data App, click **+** to the right of the App name, and then click **Autogenerate Endpoint**. The dialog for endpoint generation is displayed. +3. In the dialog, do the following: + + 1. Select the target cluster, database, and table for the endpoint to be generated. + + > **Note:** + > + > The **Table** drop-down list includes only user-defined tables with at least one column, excluding system tables and any tables without a column definition. + + 2. Select at least one HTTP operation (such as `GET (Retrieve)`, `POST (Create)`, and `PUT (Update)`) for the endpoint to be generated. + + For each operation you select, TiDB Cloud Data Service will generate a corresponding endpoint. If you select a batch operation (such as `POST (Batch Create)`), the generated endpoint lets you operate on multiple rows in a single request. + + If the table you selected contains [vector data types](/vector-search-data-types.md), you can enable the **Vector Search Operations** option and select a vector distance function to generate a vector search endpoint that automatically calculates vector distances based on your selected distance function. The supported [vector distance functions](/vector-search-functions-and-operators.md) include the following: + + - `VEC_L2_DISTANCE` (default): calculates the L2 distance (Euclidean distance) between two vectors. + - `VEC_COSINE_DISTANCE`: calculates the cosine distance between two vectors. + - `VEC_NEGATIVE_INNER_PRODUCT`: calculates the distance by using the negative of the inner product between two vectors. + - `VEC_L1_DISTANCE`: calculates the L1 distance (Manhattan distance) between two vectors. + + 3. (Optional) Configure a timeout and tag for the operations. All the generated endpoints will automatically inherit the configured properties, which can be modified later as needed. + 4. (Optional) The **Auto-Deploy Endpoint** option (disabled by default) controls whether to enable the direct deployment of the generated endpoints. When it is enabled, the draft review process is skipped, and the generated endpoints are deployed immediately without further manual review or approval. + +4. Click **Generate**. + + The generated endpoint is displayed at the top of the endpoint list. + +5. Check the generated endpoint name, SQL statements, properties, and parameters of the new endpoint. + + - Endpoint name: the generated endpoint name is in the `/` format, and the request method (such as `GET`, `POST`, and `PUT`) is displayed before the endpoint name. For example, if the selected table name is `sample_table` and the selected operation is `POST (Create)`, the generated endpoint is displayed as `POST /sample_table`. + + - If a batch operation is selected, TiDB Cloud Data Service appends `/bulk` to the name of the generated endpoint. 
For example, if the selected table name is `/sample_table` and the selected operation is `POST (Batch Create)`, the generated endpoint is displayed as `POST /sample_table/bulk`. + - If `POST (Vector Similarity Search)` is selected, TiDB Cloud Data Service appends `/vector_search` to the name of the generated endpoint. For example, if the selected table name is `/sample_table` and the selected operation is `POST (Vector Similarity Search)`, the generated endpoint is displayed as `POST /sample_table/vector_search`. + - If there has been already an endpoint with the same request method and endpoint name, TiDB Cloud Data Service appends `_dump_` to the name of the generated endpoint. For example, `/sample_table_dump_EUKRfl`. + + - SQL statements: TiDB Cloud Data Service automatically writes SQL statements for the generated endpoints according to the table column specifications and the selected endpoint operations. You can click the endpoint name to view its SQL statements in the middle section of the page. + - Endpoint properties: TiDB Cloud Data Service automatically configures the endpoint path, request method, timeout, and tag according to your selection. You can find the properties in the right pane of the page. + - Endpoint parameters: TiDB Cloud Data Service automatically configures parameters for the generated endpoints. You can find the parameters in the right pane of the page. + +6. If you want to modify the details of the generated endpoint, such as its name, SQL statements, properties, or parameters, refer to the instructions provided in [Develop an endpoint](#deploy-an-endpoint). + +### Create an endpoint manually + +To create an endpoint manually, perform the following steps: + +1. Navigate to the [**Data Service**](https://tidbcloud.com/console/data-service) page of your project. +2. In the left pane, locate your target Data App, click **+** to the right of the App name, and then click **Create Endpoint**. +3. Update the default name if necessary. The newly created endpoint is added to the top of the endpoint list. +4. Configure the new endpoint according to the instructions in [Develop an endpoint](#develop-an-endpoint). + +### Add a predefined system endpoint + +Data Service provides an endpoint library with predefined system endpoints that you can directly add to your Data App, reducing the effort in your endpoint development. Currently, the library only includes the `/system/query` endpoint, which enables you to execute any SQL statement by simply passing the statement in the predefined `sql` parameter. + +To add a predefined system endpoint to your Data App, perform the following steps: + +1. Navigate to the [**Data Service**](https://tidbcloud.com/console/data-service) page of your project. + +2. In the left pane, locate your target Data App, click **+** to the right of the App name, and then click **Manage Endpoint Library**. + + A dialog for endpoint library management is displayed. Currently, only **Execute Query** (that is, the `/system/query` endpoint) is provided in the dialog. + +3. To add the `/system/query` endpoint to your Data App, toggle the **Execute Query** switch to **Added**. + + > **Tip:** + > + > To remove an added predefined endpoint from your Data App, toggle the **Execute Query** switch to **Removed**. + +4. Click **Save**. + + > **Note:** + > + > - After you click **Save**, the added or removed endpoint is deployed to production immediately, which makes the added endpoint accessible and the removed endpoint inaccessible immediately. 
+ > - If a non-predefined endpoint with the same path and method already exists in the current App, the creation of the system endpoint will fail. + + The added system-provided endpoint is displayed at the top of the endpoint list. + +5. Check the endpoint name, SQL statements, properties, and parameters of the new endpoint. + + > **Note:** + > + > The `/system/query` endpoint is powerful and versatile but can be potentially destructive. Use it with discretion and ensure the queries are secure and well-considered to prevent unintended consequences. + + - Endpoint name: the endpoint name and path is `/system/query`, and the request method `POST`. + - SQL statements: the `/system/query` endpoint does not come with any SQL statement. You can find the SQL editor in the middle section of the page and write your desired SQL statements in the SQL editor. Note that the SQL statements written in the SQL editor for the `/system/query` endpoint will be saved in the SQL editor so you can further develop and test them next time but they will not be saved in the endpoint configuration. + - Endpoint properties: in the right pane of the page, you can find the endpoint properties on the **Properties** tab. Unlike other custom endpoints, only the `timeout` and `max rows` properties can be customized for system endpoints. + - Endpoint parameters: in the right pane of the page, you can find the endpoint parameters on the **Params** tab. The parameters of the `/system/query` endpoint are configured automatically and cannot be modified. + +## Develop an endpoint + +For each endpoint, you can write SQL statements to execute on a TiDB cluster, define parameters for the SQL statements, or manage the name and version. + +> **Note:** +> +> If you have connected your Data App to GitHub with **Auto Sync & Deployment** enabled, you can also update the endpoint configurations using GitHub. Any changes you made in GitHub will be deployed in TiDB Cloud Data Service automatically. For more information, see [Deploy automatically with GitHub](/tidb-cloud/data-service-manage-github-connection.md). + +### Configure properties + +On the right pane of the endpoint details page, you can click the **Properties** tab to view and configure properties of the endpoint. + +#### Basic properties + +- **Path**: the path that users use to access the endpoint. + + - The length of the path must be less than 64 characters. + - The combination of the request method and the path must be unique within a Data App. + - Only letters, numbers, underscores (`_`), slashes (`/`), and parameters enclosed in curly braces (such as `{var}`) are allowed in a path. Each path must start with a slash (`/`) and end with a letter, number, or underscore (`_`). For example, `/my_endpoint/get_id`. + - For parameters enclosed in `{ }`, only letters, numbers, and underscores (`_`) are allowed. Each parameter enclosed in `{ }` must start with a letter or underscore (`_`). + + > **Note:** + > + > - In a path, each parameter must be at a separate level and does not support prefixes or suffixes. + > + > Valid path: ```/var/{var}``` and ```/{var}``` + > + > Invalid path: ```/var{var}``` and ```/{var}var``` + > + > - Paths with the same method and prefix might conflict, as in the following example: + > + > ```GET /var/{var1}``` + > + > ```GET /var/{var2}``` + > + > These two paths will conflict with each other because `GET /var/123` matches both. + > + > - Paths with parameters have lower priority than paths without parameters. 
For example: + > + > ```GET /var/{var1}``` + > + > ```GET /var/123``` + > + > These two paths will not conflict because `GET /var/123` takes precedence. + > + > - Path parameters can be used directly in SQL. For more information, see [Configure parameters](#configure-parameters). + +- **Endpoint URL**: (read-only) the default URL is automatically generated based on the region where the corresponding cluster is located, the service URL of the Data App, and the path of the endpoint. For example, if the path of the endpoint is `/my_endpoint/get_id`, the endpoint URL is `https://.data.tidbcloud.com/api/v1beta/app//endpoint/my_endpoint/get_id`. To configure a custom domain for the Data App, see [Custom Domain in Data Service](/tidb-cloud/data-service-custom-domain.md). + +- **Request Method**: the HTTP method of the endpoint. The following methods are supported: + + - `GET`: use this method to query or retrieve data, such as a `SELECT` statement. + - `POST`: use this method to insert or create data, such as an `INSERT` statement. + - `PUT`: use this method to update or modify data, such as an `UPDATE` statement. + - `DELETE`: use this method to delete data, such as a `DELETE` statement. + +- **Description** (Optional): the description of the endpoint. + +#### Advanced properties + +- **Timeout(ms)**: the timeout for the endpoint, in milliseconds. +- **Max Rows**: the maximum number of rows that the endpoint can operate or return. +- **Tag**: the tag used for identifying a group of endpoints. +- **Pagination**: this property is available only when the request method is `GET` and the last SQL statement of the endpoint is a `SELECT` operation. When **Pagination** is enabled, you can paginate the results by specifying `page` and `page_size` as query parameters when calling the endpoint, such as `https://.data.tidbcloud.com/api/v1beta/app//endpoint/my_endpoint/get_id?page=&page_size=`. For more information, see [Call an endpoint](#call-an-endpoint). + + > **Note:** + > + > - If you do not include the `page` and `page_size` parameters in the request, the default behavior is to return the maximum number of rows specified in the **Max Rows** property on a single page. + > - The `page_size` must be less than or equal to the **Max Rows** property. Otherwise, an error is returned. + +- **Cache Response**: this property is available only when the request method is `GET`. When **Cache Response** is enabled, TiDB Cloud Data Service can cache the response returned by your `GET` requests within a specified time-to-live (TTL) period. +- **Time-to-live(s)**: this property is available only when **Cache Response** is enabled. You can use it to specify the time-to-live (TTL) period in seconds for cached response. During the TTL period, if you make the same `GET` requests again, Data Service returns the cached response directly instead of fetching data from the target database again, which improves your query performance. +- **Batch Operation**: this property is visible only when the request method is `POST` or `PUT`. When **Batch Operation** is enabled, you can operate on multiple rows in a single request. For example, you can insert multiple rows of data in a single `POST` request by putting an array of data objects to the `items` field of an object in the `--data-raw` option of your curl command when [calling the endpoint](#call-an-endpoint). 
+ + > **Note:** + > + > The endpoint with **Batch Operation** enabled supports both array and object formats for the request body: `[{dataObject1}, {dataObject2}]` and `{items: [{dataObject1}, {dataObject2}]}`. For better compatibility with other systems, it is recommended that you use the object format `{items: [{dataObject1}, {dataObject2}]}`. + +### Write SQL statements + +On the SQL editor of the endpoint details page, you can write and run the SQL statements for an endpoint. You can also simply type `--` followed by your instructions to let AI generate SQL statements automatically. + +1. Select a cluster. + + > **Note:** + > + > Only clusters that are linked to the Data App are displayed in the drop-down list. To manage the linked clusters, see [Manage linked clusters](/tidb-cloud/data-service-manage-data-app.md#manage-linked-data-sources). + + On the upper part of the SQL editor, select a cluster on which you want to execute SQL statements from the drop-down list. Then, you can view all databases of this cluster in the **Schema** tab on the right pane. + +2. Depending on your endpoint type, do one of the following to select a database: + + - Predefined system endpoints: on the upper part of the SQL editor, select the target database from the drop-down list. + - Other endpoints: write a SQL statement to specify the target database in the SQL editor. For example, `USE database_name;`. + +3. Write SQL statements. + + In the SQL editor, you can write statements such as table join queries, complex queries, and aggregate functions. You can also simply type `--` followed by your instructions to let AI generate SQL statements automatically. + + To define a parameter, you can insert it as a variable placeholder like `${ID}` in the SQL statement. For example, `SELECT * FROM table_name WHERE id = ${ID}`. Then, you can click the **Params** tab on the right pane to change the parameter definition and test values. For more information, see [Parameters](#configure-parameters). + + When defining an array parameter, the parameter is automatically converted to multiple comma-separated values in the SQL statement. To make sure that the SQL statement is valid, you need to add parentheses (`()`) around the parameter in some SQL statements (such as `IN`). For example, if you define an array parameter `ID` with test value `1,2,3`, use `SELECT * FROM table_name WHERE id IN (${ID})` to query the data. + + > **Note:** + > + > - The parameter name is case-sensitive. + > - The parameter cannot be used as a table name or column name. + +4. Run SQL statements. + + If you have inserted parameters in the SQL statements, make sure that you have set test values or default values for the parameters in the **Params** tab on the right pane. Otherwise, an error is returned. + + +
+ + For macOS: + + - If you have only one statement in the editor, to run it, press **⌘ + Enter** or click **Run**. + + - If you have multiple statements in the editor, to run one or several of them sequentially, place your cursor on your target statement or select the lines of the target statements with your cursor, and then press **⌘ + Enter** or click **Run**. + + - To run all statements in the editor sequentially, press **⇧ + ⌘ + Enter**, or select the lines of all statements with your cursor and click **Run**. + +
+ +
+ + For Windows or Linux: + + - If you have only one statement in the editor, to run it, press **Ctrl + Enter** or click **Run**. + + - If you have multiple statements in the editor, to run one or several of them sequentially, place your cursor on your target statement or select the lines of the target statements with your cursor, and then press **Ctrl + Enter** or click **Run**. + + - To run all statements in the editor sequentially, press **Shift + Ctrl + Enter**, or select the lines of all statements with your cursor and click **Run**. + +
+
+ + After running the statements, you can see the query results immediately in the **Result** tab at the bottom of the page. + + > **Note:** + > + > The returned result has a size limit of 8 MiB. + +### Configure parameters + +On the right pane of the endpoint details page, you can click the **Params** tab to view and manage the parameters used in the endpoint. + +In the **Definition** section, you can view and manage the following properties for a parameter: + +- The parameter name: the name can only include letters, digits, and underscores (`_`) and must start with a letter or an underscore (`_`). **DO NOT** use `page` and `page_size` as parameter names, which are reserved for pagination of request results. +- **Required**: specifies whether the parameter is required in the request. For path parameters, the configuration is required and cannot be modified. For other parameters, the default configuration is not required. +- **Type**: specifies the data type of the parameter. For path parameters, only `STRING` and `INTEGER` are supported. For other parameters, `STRING`, `NUMBER`, `INTEGER`, `BOOLEAN`, and `ARRAY` are supported. + + When using a `STRING` type parameter, you do not need to add quotation marks (`'` or `"`). For example, `foo` is valid for the `STRING` type and is processed as `"foo"`, whereas `"foo"` is processed as `"\"foo\""`. + +- **Enum Value**: (optional) specifies the valid values for the parameter and is available only when the parameter type is `STRING`, `INTEGER`, or `NUMBER`. + + - If you leave this field empty, the parameter can be any value of the specified type. + - To specify multiple valid values, you can separate them with a comma (`,`). For example, if you set the parameter type to `STRING` and specify this field as `foo, bar`, the parameter value can only be `foo` or `bar`. + +- **ItemType**: specifies the item type of an `ARRAY` type parameter. +- **Default Value**: specifies the default value of the parameter. + + - For `ARRAY` type, you need to separate multiple values with a comma (`,`). + - Make sure that the value can be converted to the type of parameter. Otherwise, the endpoint returns an error. + - If you do not set a test value for a parameter, the default value is used when testing the endpoint. +- **Location**: indicates the location of the parameter. This property cannot be modified. + - For path parameters, this property is `Path`. + - For other parameters, if the request method is `GET` or `DELETE`, this property is `Query`. If the request method is `POST` or `PUT`, this property is `Body`. + +In the **Test Values** section, you can view and set test parameters. These values are used as the parameter values when you test the endpoint. Make sure that the value can be converted to the type of parameter. Otherwise, the endpoint returns an error. + +### Rename + +To rename an endpoint, perform the following steps: + +1. Navigate to the [**Data Service**](https://tidbcloud.com/console/data-service) page of your project. +2. In the left pane, click the name of your target Data App to view its endpoints. +3. Locate the endpoint you want to rename, click **...** > **Rename**., and enter a new name for the endpoint. + +> **Note:** +> +> Predefined system endpoints do not support renaming. + +## Test an endpoint + +To test an endpoint, perform the following steps: + +> **Tip:** +> +> If you have imported your Data App to Postman, you can also test endpoints of the Data App in Postman. 
For more information, see [Run Data App in Postman](/tidb-cloud/data-service-postman-integration.md). + +1. Navigate to the [**Data Service**](https://tidbcloud.com/console/data-service) page of your project. +2. In the left pane, click the name of your target Data App to view its endpoints. +3. Click the name of the endpoint you want to test to view its details. +4. (Optional) If the endpoint contains parameters, you need to set test values before testing. + + 1. On the right pane of the endpoint details page, click the **Params** tab. + 2. Expand the **Test Values** section and set test values for the parameters. + + If you do not set a test value for a parameter, the default value is used. + +5. Click **Test** in the upper-right corner. + + > **Tip:** + > + > Alternatively, you can also press F5 to test the endpoint. + +After testing the endpoint, you can see the response as JSON at the bottom of the page. For more information about the JSON response, refer to [Response of an endpoint](#response). + +## Deploy an endpoint + +> **Note:** +> +> If you have connected your Data App to GitHub with **Auto Sync & Deployment** enabled, any Data App changes you made in GitHub will be deployed in TiDB Cloud Data Service automatically. For more information, see [Deploy automatically with GitHub](/tidb-cloud/data-service-manage-github-connection.md). + +To deploy an endpoint, perform the following steps: + +1. Navigate to the [**Data Service**](https://tidbcloud.com/console/data-service) page of your project. +2. In the left pane, click the name of your target Data App to view its endpoints. +3. Locate the endpoint you want to deploy, click the endpoint name to view its details, and then click **Deploy** in the upper-right corner. +4. If **Review Draft** is enabled for your Data App, a dialog is displayed for you to review the changes you made. You can choose whether to discard the changes based on the review. +5. Click **Deploy** to confirm the deployment. You will get the **Endpoint has been deployed** prompt if the endpoint is successfully deployed. + + On the right pane of the endpoint details page, you can click the **Deployments** tab to view the deployed history. + +## Call an endpoint + +To call an endpoint, you can send an HTTPS request to either an undeployed draft version or a deployed online version of the endpoint. + +> **Tip:** +> +> If you have imported your Data App to Postman, you can also call endpoints of the Data App in Postman. For more information, see [Run Data App in Postman](/tidb-cloud/data-service-postman-integration.md). + +### Prerequisites + +Before calling an endpoint, you need to create an API key. For more information, refer to [Create an API key](/tidb-cloud/data-service-api-key.md#create-an-api-key). + +### Request + +TiDB Cloud Data Service generates code examples to help you call an endpoint. To get the code example, perform the following steps: + +1. Navigate to the [**Data Service**](https://tidbcloud.com/console/data-service) page of your project. +2. In the left pane, click the name of your target Data App to view its endpoints. +3. Locate the endpoint you want to call and click **...** > **Code Example**. The **Code Example** dialog box is displayed. + + > **Tip:** + > + > Alternatively, you can also click the endpoint name to view its details and click **...** > **Code Example** in the upper-right corner. + +4. In the dialog box, select the environment and authentication method that you want to use to call the endpoint, and then copy the code example. 
+ + > **Note:** + > + > - The code examples are generated based on the properties and parameters of the endpoint. + > - Currently, TiDB Cloud Data Service only provides the curl code example. + + - Environment: choose **Test Environment** or **Online Environment** depending on your need. **Online Environment** is available only after you deploy the endpoint. + - Authentication method: choose **Basic Authentication** or **Digest Authentication**. + - **Basic Authentication** transmits your API key as based64 encoded text. + - **Digest Authentication** transmits your API key in an encrypted form, which is more secure. + + Compared with **Basic Authentication**, the curl code of **Digest Authentication** includes an additional `--digest` option. + + Here is an example of a curl code snippet for a `POST` request that enables **Batch Operation** and uses **Digest Authentication**: + + +
+ + To call a draft version of the endpoint, you need to add the `endpoint-type: draft` header: + + ```bash + curl --digest --user ':' \ + --request POST 'https://.data.tidbcloud.com/api/v1beta/app//endpoint/' \ + --header 'content-type: application/json' \ + --header 'endpoint-type: draft' \ + --data-raw '{ + "items": [ + { + "age": "${age}", + "career": "${career}" + } + ] + }' + ``` + 
+ +
+ + You must deploy your endpoint first before checking the code example in the online environment. + + To call the current online version of the endpoint, use the following command: + + ```bash + curl --digest --user ':' \ + --request POST 'https://.data.tidbcloud.com/api/v1beta/app//endpoint/' \ + --header 'content-type: application/json' \ + --data-raw '{ + "items": [ + { + "age": "${age}", + "career": "${career}" + } + ] + }' + ``` + 
+
+ + > **Note:** + > + > - By requesting the regional domain `.data.tidbcloud.com`, you can directly access the endpoint in the region where the TiDB cluster is located. + > - Alternatively, you can also request the global domain `data.tidbcloud.com` without specifying a region. In this way, TiDB Cloud Data Service will internally redirect the request to the target region, but this might result in additional latency. If you choose this way, make sure to add the `--location-trusted` option to your curl command when calling an endpoint. + +5. Paste the code example in your application, edit the example according to your need, and then run it. + + - You need to replace the `` and `` placeholders with your API key. For more information, refer to [Manage an API key](/tidb-cloud/data-service-api-key.md). + - If the request method of your endpoint is `GET` and **Pagination** is enabled for the endpoint, you can paginate the results by updating the values of `page=` and `page_size=` with your desired values. For example, to get the second page with 10 items per page, use `page=2` and `page_size=10`. + - If the request method of your endpoint is `POST` or `PUT`, fill in the `--data-raw` option according to the rows of data that you want to operate on. + + - For endpoints with **Batch Operation** enabled, the `--data-raw` option accepts an object with an `items` field containing an array of data objects so you can operate on multiple rows of data using one endpoint. + - For endpoints with **Batch Operation** not enabled, the `--data-raw` option only accepts one data object. + + - If the endpoint contains parameters, specify the parameter values when calling the endpoint. + +### Response + +After calling an endpoint, you can see the response in JSON format. For more information, see [Response and Status Codes of Data Service](/tidb-cloud/data-service-response-and-status-code.md). + +## Undeploy an endpoint + +> **Note:** +> +> If you have [connected your Data App to GitHub](/tidb-cloud/data-service-manage-github-connection.md) with **Auto Sync & Deployment** enabled, undeploying an endpoint of this Data App will also delete the configuration of this endpoint on GitHub. + +To undeploy an endpoint, perform the following steps: + +1. Navigate to the [**Data Service**](https://tidbcloud.com/console/data-service) page of your project. +2. In the left pane, click the name of your target Data App to view its endpoints. +3. Locate the endpoint you want to undeploy, click **...** > **Undeploy**. +4. Click **Undeploy** to confirm the undeployment. + +## Delete an endpoint + +> **Note:** +> +> Before you delete an endpoint, make sure that the endpoint is not online. Otherwise, the endpoint cannot be deleted. To undeploy an endpoint, refer to [Undeploy an endpoint](#undeploy-an-endpoint). + +To delete an endpoint, perform the following steps: + +1. Navigate to the [**Data Service**](https://tidbcloud.com/console/data-service) page of your project. +2. In the left pane, click the name of your target Data App to view its endpoints. +3. Click the name of the endpoint you want to delete, and then click **...** > **Delete** in the upper-right corner. +4. Click **Delete** to confirm the deletion. 
diff --git a/markdown-pages/en/tidb/master/tidb-cloud/tidb-cloud-release-notes.md b/markdown-pages/en/tidb/master/tidb-cloud/tidb-cloud-release-notes.md new file mode 100644 index 0000000..aba9c9f --- /dev/null +++ b/markdown-pages/en/tidb/master/tidb-cloud/tidb-cloud-release-notes.md @@ -0,0 +1,349 @@ +--- +title: TiDB Cloud Release Notes in 2024 +summary: Learn about the release notes of TiDB Cloud in 2024. +aliases: ['/tidbcloud/supported-tidb-versions','/tidbcloud/release-notes'] +--- + +# TiDB Cloud Release Notes in 2024 + +This page lists the release notes of [TiDB Cloud](https://www.pingcap.com/tidb-cloud/) in 2024. + +## September 10, 2024 + +**General changes** + +- Launch the TiDB Cloud Partner Web Console and Open API to enhance resource and billing management for TiDB Cloud partners. + + Managed Service Providers (MSPs) and resellers through AWS Marketplace Channel Partner Private Offer (CPPO) can now leverage the [TiDB Cloud Partner Web Console](https://partner-console.tidbcloud.com/) and Open API to streamline their daily operations. + + For more information, see [TiDB Cloud Partner Web Console](/tidb-cloud/tidb-cloud-partners.md). + +## September 3, 2024 + +**Console changes** + +- Support exporting data from TiDB Cloud Serverless clusters using the [TiDB Cloud console](https://tidbcloud.com/). + + Previously, TiDB Cloud only supported exporting data using the [TiDB Cloud CLI](/tidb-cloud/cli-reference.md). Now, you can easily export data from TiDB Cloud Serverless clusters to local files and Amazon S3 in the [TiDB Cloud console](https://tidbcloud.com/). + + For more information, see [Export Data from TiDB Cloud Serverless](/tidb-cloud/serverless-export.md) and [Configure External Storage Access for TiDB Cloud Serverless](/tidb-cloud/serverless-external-storage.md). + +- Enhance the connection experience for [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated) clusters. + + - Revise the **Connect** dialog interface to provide TiDB Cloud Dedicated users with a more streamlined and efficient connection experience. + - Introduce a new cluster-level **Networking** page to simplify network configuration for your cluster. + - Replace the **Security Settings** page with a new **Password Settings** page and move IP access list settings to the new **Networking** page. + + For more information, see [Connect to TiDB Cloud Dedicated](/tidb-cloud/connect-to-tidb-cluster.md). + +- Enhance the data import experience for [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) and [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated) clusters: + + - Refine the layout of the **Import** page with a clearer layout. + - Unify the import steps for TiDB Cloud Serverless and TiDB Cloud Dedicated clusters. + - Simplify the AWS Role ARN creation process for easier connection setup. + + For more information, see [Import data from files to TiDB Cloud](/tidb-cloud/tidb-cloud-migration-overview.md#import-data-from-files-to-tidb-cloud). + +## August 20, 2024 + +**Console changes** + +- Refine the layout of the **Create Private Endpoint Connection** page to improve the user experience for creating new private endpoint connections in [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated) clusters. 
+ + For more information, see [Connect to a TiDB Cloud Dedicated Cluster via Private Endpoint with AWS](/tidb-cloud/set-up-private-endpoint-connections.md) and [Connect to a TiDB Cloud Dedicated Cluster via Google Cloud Private Service Connect](/tidb-cloud/set-up-private-endpoint-connections-on-google-cloud.md). + +## August 6, 2024 + +**General changes** + +- [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated) billing changes for load balancing on AWS. + + Starting from August 1, 2024, TiDB Cloud Dedicated bills include new AWS charges for public IPv4 addresses, aligned with [AWS pricing changes effective from February 1, 2024](https://aws.amazon.com/blogs/aws/new-aws-public-ipv4-address-charge-public-ip-insights/). The charge for each public IPv4 address is $0.005 per hour, which will result in approximately $10 per month for each TiDB Cloud Dedicated cluster hosted on AWS. + + This charge will appear under the existing **TiDB Cloud Dedicated - Data Transfer - Load Balancing** service in your [billing details](/tidb-cloud/tidb-cloud-billing.md#billing-details). + +- Upgrade the default TiDB version of new [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated) clusters from [v7.5.2](https://docs.pingcap.com/tidb/v7.5/release-7.5.2) to [v7.5.3](https://docs.pingcap.com/tidb/v7.5/release-7.5.3). + +**Console changes** + +- Enhance the cluster size configuration experience for [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated). + + Refine the layout of the **Cluster Size** section on the [**Create Cluster**](/tidb-cloud/create-tidb-cluster.md) and [**Modify Cluster**](/tidb-cloud/scale-tidb-cluster.md) pages for TiDB Cloud Dedicated clusters. In addition, the **Cluster Size** section now includes links to node size recommendation documents, which helps you select an appropriate cluster size. + +## July 23, 2024 + +**General changes** + +- [Data Service (beta)](https://tidbcloud.com/console/data-service) supports automatically generating vector search endpoints. + + If your table contains [vector data types](/vector-search-data-types.md), you can automatically generate a vector search endpoint that calculates vector distances based on your selected distance function. + + This feature enables seamless integration with AI platforms such as [Dify](https://docs.dify.ai/guides/tools) and [GPTs](https://openai.com/blog/introducing-gpts), enhancing your applications with advanced natural language processing and AI capabilities for more complex tasks and intelligent solutions. + + For more information, see [Generate an endpoint automatically](/tidb-cloud/data-service-manage-endpoint.md#generate-an-endpoint-automatically) and [Integrate a Data App with Third-Party Tools](/tidb-cloud/data-service-integrations.md). + +- Introduce the budget feature to help you track actual TiDB Cloud costs against planned expenses, preventing unexpected costs. + + To access this feature, you must be in the `Organization Owner` or `Organization Billing Admin` role of your organization. + + For more information, see [Manage budgets for TiDB Cloud](/tidb-cloud/tidb-cloud-budget.md). + +## July 9, 2024 + +**General changes** + +- Enhance the [System Status](https://status.tidbcloud.com/) page to provide better insights into TiDB Cloud system health and performance. + + To access it, visit directly, or navigate via the [TiDB Cloud console](https://tidbcloud.com) by clicking **?** in the lower-right corner and selecting **System Status**. 
+ +**Console changes** + +- Refine the **VPC Peering** page layout to improve the user experience for [creating VPC Peering connections](/tidb-cloud/set-up-vpc-peering-connections.md) in [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated) clusters. + +## July 2, 2024 + +**General changes** + +- [Data Service (beta)](https://tidbcloud.com/console/data-service) provides an endpoint library with predefined system endpoints that you can directly add to your Data App, reducing the effort in your endpoint development. + + Currently, the library only includes the `/system/query` endpoint, which enables you to execute any SQL statement by simply passing the statement in the predefined `sql` parameter. This endpoint facilitates the immediate execution of SQL queries, enhancing flexibility and efficiency. + + For more information, see [Add a predefined system endpoint](/tidb-cloud/data-service-manage-endpoint.md#add-a-predefined-system-endpoint). + +- Enhance slow query data storage. + + The slow query access on the [TiDB Cloud console](https://tidbcloud.com) is now more stable and does not affect database performance. + +## June 25, 2024 + +**General changes** + +- [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) supports vector search (beta). + + The vector search (beta) feature provides an advanced search solution for performing semantic similarity searches across various data types, including documents, images, audio, and video. This feature enables developers to easily build scalable applications with generative artificial intelligence (AI) capabilities using familiar MySQL skills. Key features include: + + - [Vector data types](/vector-search-data-types.md), [vector index](/vector-search-index.md), and [vector functions and operators](/vector-search-functions-and-operators.md). + - Ecosystem integrations with [LangChain](/vector-search-integrate-with-langchain.md), [LlamaIndex](/vector-search-integrate-with-llamaindex.md), and [JinaAI](/vector-search-integrate-with-jinaai-embedding.md). + - Programming language support for Python: [SQLAlchemy](/vector-search-integrate-with-sqlalchemy.md), [Peewee](/vector-search-integrate-with-peewee.md), and [Django ORM](/vector-search-integrate-with-django-orm.md). + - Sample applications and tutorials: perform semantic searches for documents using [Python](/vector-search-get-started-using-python.md) or [SQL](/vector-search-get-started-using-sql.md). + + For more information, see [Vector search (beta) overview](/vector-search-overview.md). + +- [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) now offers weekly email reports for organization owners. + + These reports provide insights into the performance and activity of your clusters. By receiving automatic weekly updates, you can stay informed about your clusters and make data-driven decisions to optimize your clusters. + +- Release Chat2Query API v3 endpoints and deprecate the Chat2Query API v1 endpoint `/v1/chat2data`. + + With Chat2Query API v3 endpoints, you can start multi-round Chat2Query by using sessions. + + For more information, see [Get started with Chat2Query API](/tidb-cloud/use-chat2query-api.md). + +**Console changes** + +- Rename Chat2Query (beta) to SQL Editor (beta). + + The interface previously known as Chat2Query is renamed to SQL Editor. This change clarifies the distinction between manual SQL editing and AI-assisted query generation, enhancing usability and your overall experience. 
+ + - **SQL Editor**: the default interface for manually writing and executing SQL queries in the TiDB Cloud console. + - **Chat2Query**: the AI-assisted text-to-query feature, which enables you to interact with your databases using natural language to generate, rewrite, and optimize SQL queries. + + For more information, see [Explore your data with AI-assisted SQL Editor](/tidb-cloud/explore-data-with-chat2query.md). + +## June 18, 2024 + +**General changes** + +- Increase the maximum node storage of 16 vCPU TiFlash and 32 vCPU TiFlash for [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated) clusters from 2048 GiB to 4096 GiB. + + This enhancement increases the analytics data storage capacity of your TiDB Cloud Dedicated cluster, improves workload scaling efficiency, and accommodates growing data requirements. + + For more information, see [TiFlash node storage](/tidb-cloud/size-your-cluster.md#tiflash-node-storage). + +- Upgrade the default TiDB version of new [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated) clusters from [v7.5.1](https://docs.pingcap.com/tidb/v7.5/release-7.5.1) to [v7.5.2](https://docs.pingcap.com/tidb/v7.5/release-7.5.2). + +## June 4, 2024 + +**General changes** + +- Introduce the Recovery Group feature (beta) for disaster recovery of [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated) clusters deployed on AWS. + + This feature enables you to replicate your databases between TiDB Cloud Dedicated clusters, ensuring rapid recovery in the event of a regional disaster. If you are in the `Project Owner` role, you can enable this feature by creating a new recovery group and assigning databases to the group. By replicating databases with recovery groups, you can improve disaster readiness, meet stricter availability SLAs, and achieve more aggressive Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO). + + For more information, see [Get started with recovery groups](/tidb-cloud/recovery-group-get-started.md). + +- Introduce billing and metering (beta) for the [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) columnar storage [TiFlash](/tiflash/tiflash-overview.md). + + Until June 30, 2024, columnar storage in TiDB Cloud Serverless clusters remains free with a 100% discount. After this date, each TiDB Cloud Serverless cluster will include a free quota of 5 GiB for columnar storage. Usage beyond the free quota will be charged. + + For more information, see [TiDB Cloud Serverless pricing details](https://www.pingcap.com/tidb-serverless-pricing-details/#storage). + +- [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) supports [Time to live (TTL)](/time-to-live.md). + +## May 28, 2024 + +**General changes** + +- Google Cloud `Taiwan (asia-east1)` region supports the [Data Migration](/tidb-cloud/migrate-from-mysql-using-data-migration.md) feature. + + The [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated) clusters hosted in the Google Cloud `Taiwan (asia-east1)` region now support the Data Migration (DM) feature. If your upstream data is stored in or near this region, you can now take advantage of faster and more reliable data migration from Google Cloud to TiDB Cloud. 
+ +- Provide a new [TiDB node size](/tidb-cloud/size-your-cluster.md#tidb-vcpu-and-ram) for [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated) clusters hosted on AWS and Google Cloud: `16 vCPU, 64 GiB` + +**API changes** + +- Introduce TiDB Cloud Data Service API for managing the following resources automatically and efficiently: + + * **Data App**: a collection of endpoints that you can use to access data for a specific application. + * **Data Source**: clusters linked to Data Apps for data manipulation and retrieval. + * **Endpoint**: a web API that you can customize to execute SQL statements. + * **Data API Key**: used for secure endpoint access. + * **OpenAPI Specification**: Data Service supports generating the OpenAPI Specification 3.0 for each Data App, which enables you to interact with your endpoints in a standardized format. + + These TiDB Cloud Data Service API endpoints are released in TiDB Cloud API v1beta1, which is the latest API version of TiDB Cloud. + + For more information, see [API documentation (v1beta1)](https://docs.pingcap.com/tidbcloud/api/v1beta1/dataservice). + +## May 21, 2024 + +**General changes** + +- Provide a new [TiDB node size](/tidb-cloud/size-your-cluster.md#tidb-vcpu-and-ram) for [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated) clusters hosted on Google Cloud: `8 vCPU, 16 GiB` + +## May 14, 2024 + +**General changes** + +- Expand the selection of time zones in the [**Time Zone**](/tidb-cloud/manage-user-access.md#set-the-time-zone-for-your-organization) section to better accommodate customers from diverse regions. + +- Support [creating a VPC peering](/tidb-cloud/set-up-vpc-peering-connections.md) when your VPC is in a different region from the VPC of TiDB Cloud. + +- [Data Service (beta)](https://tidbcloud.com/console/data-service) supports path parameters alongside query parameters. + + This feature enhances resource identification with structured URLs and improves user experience, search engine optimization (SEO), and client integration, offering developers more flexibility and better alignment with industry standards. + + For more information, see [Basic properties](/tidb-cloud/data-service-manage-endpoint.md#basic-properties). + +## April 16, 2024 + +**CLI changes** + +- Introduce [TiDB Cloud CLI 1.0.0-beta.1](https://github.com/tidbcloud/tidbcloud-cli), built upon the new [TiDB Cloud API](/tidb-cloud/api-overview.md). The new CLI brings the following new features: + + - [Export data from TiDB Cloud Serverless clusters](/tidb-cloud/serverless-export.md) + - [Import data from local storage into TiDB Cloud Serverless clusters](/tidb-cloud/ticloud-import-start.md) + - [Authenticate via OAuth](/tidb-cloud/ticloud-auth-login.md) + - [Ask questions via TiDB Bot](/tidb-cloud/ticloud-ai.md) + + Before upgrading your TiDB Cloud CLI, note that this new CLI is incompatible with previous versions. For example, `ticloud cluster` in CLI commands is now updated to `ticloud serverless`. For more information, see [TiDB Cloud CLI reference](/tidb-cloud/cli-reference.md). + +## April 9, 2024 + +**General changes** + +- Provide a new [TiDB node size](/tidb-cloud/size-your-cluster.md#tidb-vcpu-and-ram) for [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated) clusters hosted on AWS: `8 vCPU, 32 GiB`. 
+ +## April 2, 2024 + +**General changes** + +- Introduce two service plans for [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) clusters: **Free** and **Scalable**. + + To meet different user requirements, TiDB Cloud Serverless offers the free and scalable service plans. Whether you are just getting started or scaling to meet the increasing application demands, these plans provide the flexibility and capabilities you need. + + For more information, see [Cluster plans](/tidb-cloud/select-cluster-tier.md#cluster-plans). + +- Modify the throttling behavior for TiDB Cloud Serverless clusters upon reaching their usage quota. Now, once a cluster reaches its usage quota, it immediately denies any new connection attempts, thereby ensuring uninterrupted service for existing operations. + + For more information, see [Usage quota](/tidb-cloud/serverless-limitations.md#usage-quota). + +## March 5, 2024 + +**General changes** + +- Upgrade the default TiDB version of new [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated) clusters from [v7.5.0](https://docs.pingcap.com/tidb/v7.5/release-7.5.0) to [v7.5.1](https://docs.pingcap.com/tidb/v7.5/release-7.5.1). + +**Console changes** + +- Introduce the **Cost Explorer** tab on the [**Billing**](https://tidbcloud.com/console/org-settings/billing/payments) page, which provides an intuitive interface for analyzing and customizing cost reports for your organization over time. + + To use this feature, navigate to the **Billing** page of your organization and click the **Cost Explorer** tab. + + For more information, see [Cost Explorer](/tidb-cloud/tidb-cloud-billing.md#cost-explorer). + +- [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated) displays a **limit** label for [node-level resource metrics](/tidb-cloud/built-in-monitoring.md#server). + + The **limit** label shows the maximum usage of resources such as CPU, memory, and storage for each component in a cluster. This enhancement simplifies the process of monitoring the resource usage rate of your cluster. + + To access these metric limits, navigate to the **Monitoring** page of your cluster, and then check the **Server** category under the **Metrics** tab. + + For more information, see [Metrics for TiDB Cloud Dedicated clusters](/tidb-cloud/built-in-monitoring.md#server). + +## February 21, 2024 + +**General changes** + +- Upgrade the TiDB version of [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) clusters from [v6.6.0](https://docs.pingcap.com/tidb/v6.6/release-6.6.0) to [v7.1.3](https://docs.pingcap.com/tidb/v7.1/release-7.1.3). + +## February 20, 2024 + +**General changes** + +- Support creating more TiDB Cloud nodes on Google Cloud. + + - By [configuring a regional CIDR size](/tidb-cloud/set-up-vpc-peering-connections.md#prerequisite-set-a-cidr-for-a-region) of `/19` for Google Cloud, you can now create up to 124 TiDB Cloud nodes within any region of a project. + - If you want to create more than 124 nodes in any region of a project, you can contact [TiDB Cloud Support](/tidb-cloud/tidb-cloud-support.md) for assistance in customizing an IP range size ranging from `/16` to `/18`. + +## January 23, 2024 + +**General changes** + +- Add 32 vCPU as a node size option for TiDB, TiKV, and TiFlash. + + For each `32 vCPU, 128 GiB` TiKV node, the node storage ranges from 200 GiB to 6144 GiB. 
+ + It is recommended to use such nodes in the following scenarios: + + - High-workload production environments + - Extremely high performance + +## January 16, 2024 + +**General changes** + +- Enhance CIDR configuration for projects. + + - You can directly set a region-level CIDR for each project. + - You can choose your CIDR configurations from a broader range of CIDR values. + + Note: The previous global-level CIDR settings for projects are retired, but all existing regional CIDR in active state remain unaffected. There will be no impact on the network of existing clusters. + + For more information, see [Set a CIDR for a region](/tidb-cloud/set-up-vpc-peering-connections.md#prerequisite-set-a-cidr-for-a-region). + +- TiDB Cloud Serverless users now have the capability to disable public endpoints for your clusters. + + For more information, see [Disable a Public Endpoint](/tidb-cloud/connect-via-standard-connection-serverless.md#disable-a-public-endpoint). + +- [Data Service (beta)](https://tidbcloud.com/console/data-service) supports configuring a custom domain to access endpoints in a Data App. + + By default, TiDB Cloud Data Service provides a domain `.data.tidbcloud.com` to access each Data App's endpoints. For enhanced personalization and flexibility, you can now configure a custom domain for your Data App instead of using the default domain. This feature enables you to use branded URLs for your database services and enhances security. + + For more information, see [Custom domain in Data Service](/tidb-cloud/data-service-custom-domain.md). + +## January 3, 2024 + +**General changes** + +- Support [Organization SSO](https://tidbcloud.com/console/preferences/authentication) to streamline enterprise authentication processes. + + With this feature, you can seamlessly integrate TiDB Cloud with any identity provider (IdP) using [Security Assertion Markup Language (SAML)](https://en.wikipedia.org/wiki/Security_Assertion_Markup_Language) or [OpenID Connect (OIDC)](https://openid.net/developers/how-connect-works/). + + For more information, see [Organization SSO Authentication](/tidb-cloud/tidb-cloud-org-sso-authentication.md). + +- Upgrade the default TiDB version of new [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated) clusters from [v7.1.1](https://docs.pingcap.com/tidb/v7.1/release-7.1.1) to [v7.5.0](https://docs.pingcap.com/tidb/v7.5/release-7.5.0). + +- The dual region backup feature for [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated) is now in General Availability (GA). + + By using this feature, you can replicate backups across geographic regions within AWS or Google Cloud. This feature provides an additional layer of data protection and disaster recovery capabilities. + + For more information, see [Dual region backup](/tidb-cloud/backup-and-restore.md#turn-on-dual-region-backup). diff --git a/markdown-pages/en/tidb/master/tiflash-upgrade-guide.md b/markdown-pages/en/tidb/master/tiflash-upgrade-guide.md new file mode 100644 index 0000000..8492418 --- /dev/null +++ b/markdown-pages/en/tidb/master/tiflash-upgrade-guide.md @@ -0,0 +1,133 @@ +--- +title: TiFlash Upgrade Guide +summary: Learn the precautions when you upgrade TiFlash. +aliases: ['/tidb/dev/tiflash-620-upgrade-guide'] +--- + +# TiFlash Upgrade Guide + +This document describes the function changes and recommended actions that you need to learn when you upgrade TiFlash. 
+
+To learn the standard upgrade process, see the following documents:
+
+- [Upgrade TiDB Using TiUP](/upgrade-tidb-using-tiup.md)
+- [Upgrade TiDB on Kubernetes](https://docs.pingcap.com/tidb-in-kubernetes/stable/upgrade-a-tidb-cluster)
+
+> **Note:**
+>
+> - [FastScan](/tiflash/use-fastscan.md) is introduced in v6.2.0 as an experimental feature and becomes generally available (GA) in v7.0.0. It provides more efficient query performance at the cost of strong data consistency.
+>
+> - It is not recommended that you upgrade TiDB that includes TiFlash across major versions, for example, from v4.x to v6.x. Instead, you need to upgrade from v4.x to v5.x first, and then to v6.x.
+>
+> - v4.x is near the end of its life cycle. It is recommended that you upgrade to v5.x or later as soon as possible. For more information, see [TiDB Release Support Policy](https://www.pingcap.com/tidb-release-support-policy/).
+>
+> - PingCAP does not provide bug fixes for non-LTS versions, such as v6.0. It is recommended that you upgrade to v6.1 and later LTS versions whenever possible.
+
+## Upgrade TiFlash using TiUP
+
+To upgrade TiFlash from versions earlier than v5.3.0 to v5.3.0 or later, you must stop TiFlash and then upgrade it. When you upgrade TiFlash using TiUP, note the following:
+
+- If the TiUP cluster version is v1.12.0 or later, you cannot stop TiFlash and then upgrade it. If the target version requires a TiUP cluster version of v1.12.0 or later, it is recommended that you first use `tiup cluster:v1.11.3 <command>` to upgrade TiFlash to an intermediate version, perform an online upgrade of the TiDB cluster, upgrade the TiUP version, and then upgrade the TiDB cluster to the target version directly without stopping it.
+- If the TiUP cluster version is earlier than v1.12.0, perform the following steps to upgrade TiFlash.
+
+The following steps help you use TiUP to upgrade TiFlash without interrupting other components:
+
+1. Stop the TiFlash instance:
+
+    ```shell
+    tiup cluster stop <cluster-name> -R tiflash
+    ```
+
+2. Upgrade the TiDB cluster without restarting it (only updating the files):
+
+    ```shell
+    tiup cluster upgrade <cluster-name> <version> --offline
+    ```
+
+    For example:
+
+    ```shell
+    tiup cluster upgrade <cluster-name> v5.3.0 --offline
+    ```
+
+3. Reload the TiDB cluster. After the reload, the TiFlash instance is started and you do not need to manually start it.
+
+    ```shell
+    tiup cluster reload <cluster-name>
+    ```
+
+## From v5.x or v6.0 to v6.1
+
+When you upgrade TiFlash from v5.x or v6.0 to v6.1, pay attention to the functional changes in TiFlash Proxy and dynamic pruning.
+
+### TiFlash Proxy
+
+TiFlash Proxy is upgraded in v6.1.0 (aligned with TiKV v6.0.0). The new version has upgraded the RocksDB version. After you upgrade TiFlash to v6.1, the data format is converted to the new version automatically.
+
+In regular upgrades, the data conversion does not involve any risks. However, if you need to downgrade TiFlash from v6.1 to any earlier version in special scenarios (for example, testing or verification scenarios), the earlier version might fail to parse the new RocksDB configuration. As a result, TiFlash will fail to restart. It is recommended that you fully test and verify the upgrade process and prepare an emergency plan.
+
+**Workaround for downgrading TiFlash in testing or other special scenarios**
+
+You can forcibly scale in the target TiFlash node and then replicate data from TiKV again. For detailed steps, see [Scale in a TiFlash cluster](/scale-tidb-using-tiup.md#scale-in-a-tiflash-cluster).
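+
+For reference, a forced scale-in with TiUP might look like the following sketch. The cluster name and node ID here are placeholders; follow the linked guide for the complete procedure, including adjusting TiFlash replica counts before removing the node:
+
+```shell
+# Forcibly remove the target TiFlash node; its data is replicated from TiKV again afterward.
+tiup cluster scale-in <cluster-name> --node <node-ip>:<port> --force
+```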
+ +### Dynamic pruning + +If you do not enable [dynamic pruning mode](/partitioned-table.md#dynamic-pruning-mode) and will not use it in the future, you can skip this section. + +- Newly installed TiDB v6.1.0: Dynamic pruning is enabled by default. + +- TiDB v6.0 and earlier: Dynamic pruning is disabled by default. The setting of dynamic pruning after an upgrade inherits that of the previous version. That is, dynamic pruning will not be enabled (or disabled) automatically after an upgrade. + + After an upgrade, to enable dynamic pruning, set `tidb_partition_prune_mode` to `dynamic` and manually update GlobalStats of partitioned tables. For details, see [Dynamic pruning mode](/partitioned-table.md#dynamic-pruning-mode). + +## From v5.x or v6.0 to v6.2 + +In TiDB v6.2, TiFlash upgrades its data storage format to the V3 version. Therefore, when you upgrade TiFlash from v5.x or v6.0 to v6.2, besides functional changes in [TiFlash Proxy](#tiflash-proxy) and [Dynamic pruning](#dynamic-pruning), you also need to pay attention to the functional change in PageStorage. + +### PageStorage + +By default, TiFlash v6.2.0 uses PageStorage V3 version [`format_version = 4`](/tiflash/tiflash-configuration.md#configure-the-tiflashtoml-file). This new data format significantly reduces the peak write I/O traffic. In scenarios with high update traffic and high concurrency or heavy queries, it effectively relieves excessive CPU usage caused by TiFlash data GC. Meanwhile, compared with the earlier storage format, the V3 version significantly reduces space amplification and resource consumption. + +- After an upgrade to v6.2.0, as new data is written to the existing TiFlash nodes, earlier data will be gradually converted to the new format. +- However, earlier data cannot be completely converted to the new format during the upgrade, because the conversion consumes a certain amount of system overhead (services are not affected, but you still need to pay attention). After the upgrade, it is recommended that you run the [`Compact` command](/sql-statements/sql-statement-alter-table-compact.md) to convert the data to the new format. The steps are as follows: + + 1. Run the following command for each table containing TiFlash replicas: + + ```sql + ALTER TABLE COMPACT tiflash replica; + ``` + + 2. Restart the TiFlash node. + +You can check whether tables still use the old data format on Grafana: **TiFlash-Summary** > **Storage Pool** > **Storage Pool Run Mode**. + +- Only V2: Number of tables using PageStorage V2 (including partitions) +- Only V3: Number of tables using PageStorage V3 (including partitions) +- Mix Mode: Number of tables with data format converted from PageStorage V2 to PageStorage V3 (including partitions) + +**Workaround for downgrading TiFlash in testing or other special scenarios** + +You can forcibly scale in the target TiFlash node and then replicate data from TiKV again. For detailed steps, see [Scale in a TiFlash cluster](/scale-tidb-using-tiup.md#scale-in-a-tiflash-cluster). + +## From v6.1 to v6.2 + +When you upgrade TiFlash from v6.1 to v6.2, pay attention to the change in data storage format. For details, see [PageStorage](#pagestorage). + +## From v6.x or v7.x to v7.3 with `storage.format_version = 5` configured + +Starting from v7.3, TiFlash introduces a new DTFile version: DTFile V3 (experimental). This new DTFile version can merge multiple small files into a single larger file to reduce the total number of files. In v7.3, the default DTFile version is still V2. 
To use V3, you can set the [TiFlash configuration parameter](/tiflash/tiflash-configuration.md) `storage.format_version = 5`. After the setting, TiFlash can still read V2 DTFiles and will gradually rewrite existing V2 DTFiles to V3 DTFiles during subsequent data compaction. + +After upgrading TiFlash to v7.3 and configuring TiFlash to use V3 DTFiles, if you need to revert TiFlash to an earlier version, you can use the DTTool offline to rewrite V3 DTFiles back to V2 DTFiles. For more information, see [DTTool Migration Tool](/tiflash/tiflash-command-line-flags.md#dttool-migrate). + +## From v6.x or v7.x to v7.4 or a later version + +Starting from v7.4, to reduce the read and write amplification generated during data compaction, TiFlash optimizes the data compaction logic of PageStorage V3, which leads to changes to some of the underlying storage file names. Therefore, after TiFlash is upgraded to v7.4 or a later version, in-place downgrading to the original version is not supported. + +## From v7.x to v8.4 or a later version + +Starting from v8.4, the underlying storage format of TiFlash is updated to support [vector search](/vector-search-overview.md). Therefore, after TiFlash is upgraded to v8.4 or a later version, in-place downgrading to the original version is not supported. + +**Workaround for downgrading TiFlash in testing or other special scenarios** + +To downgrade TiFlash in testing or other special scenarios, you can forcibly scale in the target TiFlash node and then replicate data from TiKV again. For detailed steps, see [Scale in a TiFlash cluster](/scale-tidb-using-tiup.md#scale-in-a-tiflash-cluster). diff --git a/markdown-pages/en/tidb/master/tiflash/tiflash-configuration.md b/markdown-pages/en/tidb/master/tiflash/tiflash-configuration.md new file mode 100644 index 0000000..73f29a3 --- /dev/null +++ b/markdown-pages/en/tidb/master/tiflash/tiflash-configuration.md @@ -0,0 +1,371 @@ +--- +title: Configure TiFlash +summary: Learn how to configure TiFlash. +aliases: ['/docs/dev/tiflash/tiflash-configuration/','/docs/dev/reference/tiflash/configuration/'] +--- + +# Configure TiFlash + +This document introduces the configuration parameters related to the deployment and use of TiFlash. + +## PD scheduling parameters + +You can adjust the PD scheduling parameters using [pd-ctl](/pd-control.md). Note that you can use `tiup ctl:v pd` to replace `pd-ctl -u ` when using tiup to deploy and manage your cluster. + +- [`replica-schedule-limit`](/pd-configuration-file.md#replica-schedule-limit): determines the rate at which the replica-related operator is generated. The parameter affects operations such as making nodes offline and add replicas. + + > **Note:** + > + > The value of this parameter should be less than that of `region-schedule-limit`. Otherwise, the normal Region scheduling among TiKV nodes is affected. + +- `store-balance-rate`: limits the rate at which Regions of each TiKV/TiFlash store are scheduled. Note that this parameter takes effect only when the stores have newly joined the cluster. If you want to change the setting for existing stores, use the following command. + + > **Note:** + > + > Since v4.0.2, the `store-balance-rate` parameter has been deprecated and changes have been made to the `store limit` command. See [store-limit](/configure-store-limit.md) for details. + + - Execute the `pd-ctl -u store limit ` command to set the scheduling rate of a specified store. (To get `store_id`, you can execute the `pd-ctl -u store` command. 
+ - If you do not set the scheduling rate for Regions of a specified store, this store inherits the setting of `store-balance-rate`. + - You can execute the `pd-ctl -u store limit` command to view the current setting value of `store-balance-rate`. + +- [`replication.location-labels`](/pd-configuration-file.md#location-labels): indicates the topological relationship of TiKV instances. The order of the keys indicates the layering relationship of different labels. If TiFlash is enabled, you need to use [`pd-ctl config placement-rules`](/pd-control.md#config-show--set-option-value--placement-rules) to set the default value. For details, see [geo-distributed-deployment-topology](/geo-distributed-deployment-topology.md). + +## TiFlash configuration parameters + +This section introduces the configuration parameters of TiFlash. + +> **Tip:** +> +> If you need to adjust the value of a configuration item, refer to [Modify the configuration](/maintain-tidb-using-tiup.md#modify-the-configuration). + +### Configure the `tiflash.toml` file + +```toml +## The listening host for supporting services such as TPC/HTTP. It is recommended to configure it as "0.0.0.0", which means to listen on all IP addresses of this machine. +listen_host = "0.0.0.0" +## The TiFlash TCP service port. This port is used for internal testing and is set to 9000 by default. Before TiFlash v7.1.0, this port is enabled by default with a security risk. To enhance security, it is recommended to apply access control on this port to only allow access from whitelisted IP addresses. Starting from TiFlash v7.1.0, you can avoid the security risk by commenting out the configuration of this port. When the TiFlash configuration file does not specify this port, it will be disabled. +## It is **NOT** recommended to configure this port in any TiFlash deployment. (Note: Starting from TiFlash v7.1.0, TiFlash deployed by TiUP >= v1.12.5 or TiDB Operator >= v1.5.0 disables the port by default and is more secure.) +# tcp_port = 9000 +## The cache size limit of the metadata of a data block. Generally, you do not need to change this value. +mark_cache_size = 1073741824 +## The cache size limit of the min-max index of a data block. Generally, you do not need to change this value. +minmax_index_cache_size = 1073741824 +## The cache size limit of the DeltaIndex. The default value is 0, which means no limit. +delta_index_cache_size = 0 + +## The storage path of TiFlash data. If there are multiple directories, separate each directory with a comma. +## path and path_realtime_mode are deprecated since v4.0.9. Use the configurations +## in the [storage] section to get better performance in the multi-disk deployment scenarios +## Since TiDB v5.2.0, if you need to use the storage.io_rate_limit configuration, you need to set the storage path of TiFlash data to storage.main.dir at the same time. +## When the [storage] configurations exist, both path and path_realtime_mode configurations are ignored. +# path = "/tidb-data/tiflash-9000" +## or +# path = "/ssd0/tidb-data/tiflash,/ssd1/tidb-data/tiflash,/ssd2/tidb-data/tiflash" +## The default value is false. If you set it to true and multiple directories +## are set in the path, the latest data is stored in the first directory and older +## data is stored in the rest directories. +# path_realtime_mode = false + +## The path in which the TiFlash temporary files are stored. By default it is the first directory in path +## or in storage.latest.dir appended with "/tmp". 
+# tmp_path = "/tidb-data/tiflash-9000/tmp" + +## Storage paths settings take effect starting from v4.0.9 +[storage] + + ## DTFile format + ## * format_version = 2, the default format for versions < v6.0.0. + ## * format_version = 3, the default format for v6.0.0 and v6.1.x, which provides more data validation features. + ## * format_version = 4, the default format for versions from v6.2.0 to v7.3.0, which reduces write amplification and background task resource consumption + ## * format_version = 5, introduced in v7.3.0, the default format for versions from v7.4.0 to v8.3.0, which reduces the number of physical files by merging smaller files. + # format_version = 5 + ## * format_version = 6, introduced in v8.4.0, which partially supports the building and storage of vector indexes. + ## * format_version = 7, introduced in v7.3.0, the default format for v8.4.0 and later versions, which supports the build and storage of vector indexes + # format_version = 7 + + [storage.main] + ## The list of directories to store the main data. More than 90% of the total data is stored in + ## the directory list. + dir = [ "/tidb-data/tiflash-9000" ] + ## or + # dir = [ "/ssd0/tidb-data/tiflash", "/ssd1/tidb-data/tiflash" ] + + ## The maximum storage capacity of each directory in storage.main.dir. + ## If it is not set, or is set to multiple 0, the actual disk (the disk where the directory is located) capacity is used. + ## Note that human-readable numbers such as "10GB" are not supported yet. + ## Numbers are specified in bytes. + ## The size of the capacity list should be the same with the dir size. + ## For example: + # capacity = [ 10737418240, 10737418240 ] + + [storage.latest] + ## The list of directories to store the latest data. About 10% of the total data is stored in + ## the directory list. The directories (or directory) listed here require higher IOPS + ## metrics than those in storage.main.dir. + ## If it is not set (by default), the values of storage.main.dir are used. + # dir = [ ] + ## The maximum storage capacity of each directory in storage.latest.dir. + ## If it is not set, or is set to multiple 0, the actual disk (the disk where the directory is located) capacity is used. + # capacity = [ 10737418240, 10737418240 ] + + ## [storage.io_rate_limit] settings are new in v5.2.0. + [storage.io_rate_limit] + ## This configuration item determines whether to limit the I/O traffic, which is disabled by default. This traffic limit in TiFlash is suitable for cloud storage that has the disk bandwidth of a small and specific size. + ## The total I/O bandwidth for disk reads and writes. The unit is bytes and the default value is 0, which means the I/O traffic is not limited by default. + # max_bytes_per_sec = 0 + ## max_read_bytes_per_sec and max_write_bytes_per_sec have similar meanings to max_bytes_per_sec. max_read_bytes_per_sec means the total I/O bandwidth for disk reads, and max_write_bytes_per_sec means the total I/O bandwidth for disk writes. + ## These configuration items limit I/O bandwidth for disk reads and writes separately. You can use them for cloud storage that calculates the limit of I/O bandwidth for disk reads and writes separately, such as the Persistent Disk provided by Google Cloud. + ## When the value of max_bytes_per_sec is not 0, max_bytes_per_sec is prioritized. + # max_read_bytes_per_sec = 0 + # max_write_bytes_per_sec = 0 + + ## The following parameters control the bandwidth weights assigned to different I/O traffic types. Generally, you do not need to adjust these parameters. 
+ ## TiFlash internally divides I/O requests into four types: foreground writes, background writes, foreground reads, background reads. + ## When the I/O traffic limit is initialized, TiFlash assigns the bandwidth according to the following weight ratio. + ## The following default configurations indicate that each type of traffic gets a weight of 25% (25 / (25 + 25 + 25 + 25) = 25%). + ## If the weight is configured to 0, the corresponding I/O traffic is not limited. + # foreground_write_weight = 25 + # background_write_weight = 25 + # foreground_read_weight = 25 + # background_read_weight = 25 + ## TiFlash supports automatically tuning the traffic limit for different I/O types according to the current I/O load. Sometimes, the tuned bandwidth might exceed the weight ratio set above. + ## auto_tune_sec indicates the interval of automatic tuning. The unit is seconds. If the value of auto_tune_sec is 0, the automatic tuning is disabled. + # auto_tune_sec = 5 + + ## The following configuration items only take effect for the TiFlash disaggregated storage and compute architecture mode. For details, see documentation at https://docs.pingcap.com/tidb/dev/tiflash-disaggregated-and-s3. + # [storage.s3] + # endpoint: http://s3.{region}.amazonaws.com # S3 endpoint address + # bucket: mybucket # TiFlash stores all data in this bucket + # root: /cluster1_data # Root directory where data is stored in the S3 bucket + # access_key_id: {ACCESS_KEY_ID} # Access S3 with ACCESS_KEY_ID + # secret_access_key: {SECRET_ACCESS_KEY} # Access S3 with SECRET_ACCESS_KEY + # [storage.remote.cache] + # dir: /data1/tiflash/cache # Local data cache directory of the Compute Node + # capacity: 858993459200 # 800 GiB + +[flash] + ## The listening address of TiFlash coprocessor services. + service_addr = "0.0.0.0:3930" + + ## Introduced in v7.4.0. When the gap between the `applied_index` advanced by the current Raft state machine and the `applied_index` at the last disk spilling exceeds `compact_log_min_gap`, TiFlash executes the `CompactLog` command from TiKV and spills data to disk. Increasing this gap might reduce the disk spilling frequency of TiFlash, thus reducing read latency in random write scenarios, but it might also increase memory overhead. Decreasing this gap might increase the disk spilling frequency of TiFlash, thus alleviating memory pressure in TiFlash. However, at this stage, the disk spilling frequency of TiFlash will not be higher than that of TiKV, even if this gap is set to 0. + ## It is recommended to keep the default value. + # compact_log_min_gap = 200 + ## Introduced in v5.0. When the number or the size of rows in the Regions cached by TiFlash exceeds either of the following thresholds, TiFlash executes the `CompactLog` command from TiKV and spills data to disk. + ## It is recommended to keep the default value. + # compact_log_min_rows = 40960 # 40k + # compact_log_min_bytes = 33554432 # 32MB + + ## The following configuration item only takes effect for the TiFlash disaggregated storage and compute architecture mode. For details, see documentation at https://docs.pingcap.com/tidb/dev/tiflash-disaggregated-and-s3. + # disaggregated_mode = tiflash_write # The supported mode is `tiflash_write` or `tiflash_compute. + +[flash.proxy] + ## The listening address of proxy. If it is left empty, 127.0.0.1:20170 is used by default. + addr = "127.0.0.1:20170" + ## The external access address of addr. If it is left empty, "addr" is used by default. 
+ ## Should guarantee that other nodes can access through `advertise-addr` when you deploy the cluster on multiple nodes. + advertise-addr = "" + ## The listening address from which the proxy pulls metrics or status information. If it is left empty, 127.0.0.1:20292 is used by default. + status-addr = "127.0.0.1:20292" + ## The external access address of status-addr. If it is left empty, the value of "status-addr" is used by default. + ## Should guarantee that other nodes can access through `advertise-status-addr` when you deploy the cluster on multiple nodes. + advertise-status-addr = "" + ## The external access address of the TiFlash coprocessor service. + engine-addr = "10.0.1.20:3930" + ## The data storage path of proxy. + data-dir = "/tidb-data/tiflash-9000/flash" + ## The configuration file path of proxy. + config = "/tidb-deploy/tiflash-9000/conf/tiflash-learner.toml" + ## The log path of proxy. + log-file = "/tidb-deploy/tiflash-9000/log/tiflash_tikv.log" + +[logger] + ## Note that the following parameters only take effect in tiflash.log and tiflash_error.log. If you need to configure log parameters of TiFlash Proxy, specify them in tiflash-learner.toml. + ## log level (available options: "trace", "debug", "info", "warn", "error"). The default value is "info". + level = "info" + ## The log of TiFlash. + log = "/tidb-deploy/tiflash-9000/log/tiflash.log" + ## The error log of TiFlash. The "warn" and "error" level logs are also output to this log file. + errorlog = "/tidb-deploy/tiflash-9000/log/tiflash_error.log" + ## Size of a single log file. The default value is "100M". + size = "100M" + ## Maximum number of log files to save. The default value is 10. For TiFlash logs and TiFlash error logs, the maximum number of log files to save is `count` respectively. + count = 10 + +[raft] + ## PD service address. Multiple addresses are separated with commas. + pd_addr = "10.0.1.11:2379,10.0.1.12:2379,10.0.1.13:2379" + +[status] + ## The port through which Prometheus pulls metrics information. The default value is 8234. + metrics_port = 8234 + +[profiles] + +[profiles.default] + ## The default value is false. This parameter determines whether the segment + ## of DeltaTree Storage Engine uses logical split. + ## Using the logical split can reduce the write amplification. + ## However, these are at the cost of disk space waste. + ## It is strongly recommended to keep the default value `false` and + ## not to change it to `true` in v6.2.0 and later versions. For details, + ## see known issue [#5576](https://github.com/pingcap/tiflash/issues/5576). + # dt_enable_logical_split = false + + ## `max_threads` indicates the internal thread concurrency when TiFlash executes an MPP task. + ## The default value is 0. When it is set to 0, + ## TiFlash uses the number of CPU cores as the execution concurrency. + ## This parameter only takes effect + ## when the system variable `tidb_max_tiflash_threads` is set to -1. + max_threads = 0 + + ## The memory usage limit for the generated intermediate data in a single query. + ## When the value is an integer, the unit is byte. For example, 34359738368 means 32 GiB of memory limit, and 0 means no limit. + ## When the value is a floating-point number in the range of [0.0, 1.0), it means the ratio of the allowed memory usage to the total memory of the node. For example, 0.8 means 80% of the total memory, and 0.0 means no limit. + ## The default value is 0, which means no limit. 
+ ## When a query attempts to consume memory that exceeds this limit, the query is terminated and an error is reported. + max_memory_usage = 0 + + ## The memory usage limit for the generated intermediate data in all queries. + ## When the value is an integer, the unit is byte. For example, 34359738368 means 32 GiB of memory limit, and 0 means no limit. + ## When the value is a floating-point number in the range of [0.0, 1.0), it means the ratio of the allowed memory usage to the total memory of the node. For example, 0.8 means 80% of the total memory, and 0.0 means no limit. + ## The default value is 0.8, which means 80% of the total memory. + ## When the queries attempt to consume memory that exceeds this limit, the queries are terminated and an error is reported. + max_memory_usage_for_all_queries = 0.8 + + ## New in v5.0. This item specifies the maximum number of cop requests that TiFlash Coprocessor executes at the same time. If the number of requests exceeds the specified value, the exceeded requests will queue. If the configuration value is set to 0 or not set, the default value is used, which is twice the number of physical cores. + cop_pool_size = 0 + ## New in v5.0. This item specifies the maximum number of batch requests that TiFlash Coprocessor executes at the same time. If the number of requests exceeds the specified value, the exceeded requests will queue. If the configuration value is set to 0 or not set, the default value is used, which is twice the number of physical cores. + batch_cop_pool_size = 0 + ## New in v6.1.0. This item specifies the number of requests that TiFlash can concurrently process when it receives ALTER TABLE ... COMPACT from TiDB. + ## If the value is set to 0, the default value 1 prevails. + manual_compact_pool_size = 1 + ## New in v5.4.0. This item enables or disables the elastic thread pool feature, which significantly improves CPU utilization in high concurrency scenarios of TiFlash. The default value is true. + enable_elastic_threadpool = true + ## Compression algorithm of the TiFlash storage engine. The value can be LZ4, zstd, or LZ4HC, and is case-insensitive. By default, LZ4 is used. + dt_compression_method = "LZ4" + ## Compression level of the TiFlash storage engine. The default value is 1. + ## It is recommended that you set this value to 1 if dt_compression_method is LZ4. + ## It is recommended that you set this value to -1 (smaller compression rate, but better read performance) or 1 if dt_compression_method is zstd. + ## It is recommended that you set this value to 9 if dt_compression_method is LZ4HC. + dt_compression_level = 1 + + ## New in v6.2.0. This item specifies the minimum ratio of valid data in a PageStorage data file. When the ratio of valid data in a PageStorage data file is less than the value of this configuration, GC is triggered to compact data in the file. The default value is 0.5. + dt_page_gc_threshold = 0.5 + + ## New in v7.0.0. This item specifies the maximum memory available for the HashAggregation operator with group by key before a disk spill is triggered. When the memory usage exceeds the threshold, HashAggregation reduces memory usage by spilling to disk. This item defaults to 0, which means that the memory usage is unlimited and spill to disk is never used for HashAggregation. + max_bytes_before_external_group_by = 0 + + ## New in v7.0.0. This item specifies the maximum memory available for the sort or topN operator before a disk spill is triggered. 
When the memory usage exceeds the threshold, the sort or topN operator reduces memory usage by spilling to disk. This item defaults to 0, which means that the memory usage is unlimited and spill to disk is never used for sort or topN. + max_bytes_before_external_sort = 0 + + ## New in v7.0.0. This item specifies the maximum memory available for the HashJoin operator with EquiJoin before a disk spill is triggered. When the memory usage exceeds the threshold, HashJoin reduces memory usage by spilling to disk. This item defaults to 0, which means that the memory usage is unlimited and spill to disk is never used for HashJoin with EquiJoin. + max_bytes_before_external_join = 0 + + ## New in v7.4.0. This item controls whether to enable the TiFlash resource control feature. When it is set to true, TiFlash uses the pipeline execution model. + enable_resource_control = true + +## Security settings take effect starting from v4.0.5. +[security] + ## New in v5.0. This configuration item enables or disables log redaction. Value options: `true`, `false`, `"on"`, `"off"`, and `"marker"`. The `"on"`, `"off"`, and `"marker"` options are introduced in v8.2.0. + ## If the configuration item is set to `false` or `"off"`, log redaction is disabled. + ## If the configuration item is set to `true` or `"on"`, all user data in the log is replaced by `?`. + ## If the configuration item is set to `"marker"`, all user data in the log is wrapped in `‹ ›`. If user data contains `‹` or `›`, `‹` is escaped as `‹‹`, and `›` is escaped as `››`. Based on the marked logs, you can decide whether to desensitize the marked information when the logs are displayed. + ## The default value is `false`. + ## Note that you also need to set security.redact-info-log for tiflash-learner's logging in tiflash-learner.toml. + # redact_info_log = false + + ## Path of the file that contains a list of trusted SSL CAs. If set, the following settings + ## cert_path and key_path are also needed. + # ca_path = "/path/to/ca.pem" + ## Path of the file that contains X509 certificate in PEM format. + # cert_path = "/path/to/tiflash-server.pem" + ## Path of the file that contains X509 key in PEM format. + # key_path = "/path/to/tiflash-server-key.pem" +``` + +### Configure the `tiflash-learner.toml` file + +The parameters in `tiflash-learner.toml` are basically the same as those in TiKV. You can refer to [TiKV configuration](/tikv-configuration-file.md) for TiFlash Proxy configuration. The following are only commonly used parameters. Note that: + +- Compared with TiKV, TiFlash Proxy has an extra `raftstore.snap-handle-pool-size` parameter. +- The `label` whose key is `engine` is reserved and cannot be configured manually. + +```toml +[log] + ## The log level of TiFlash Proxy (available options: "trace", "debug", "info", "warn", "error"). The default value is "info". Introduced in v5.4.0. + level = "info" + +[log.file] + ## The maximum number of log files to save. Introduced in v5.4.0. + ## If this parameter is not set or set to the default value `0`, TiFlash Proxy saves all log files. + ## If this parameter is set to a non-zero value, TiFlash Proxy retains at most the number of old log files specified by `max-backups`. For example, if you set it to `7`, TiFlash Proxy retains at most 7 old log files. + max-backups = 0 + ## The maximum number of days that the log files are retained. Introduced in v5.4.0. + ## If this parameter is not set or set to the default value `0`, TiFlash Proxy retains all log files. 
+ ## If this parameter is set to a non-zero value, TiFlash Proxy cleans up outdated log files after the number of days specified by `max-days`. + max-days = 0 + +[raftstore] + ## The allowable number of threads in the pool that flushes Raft data to storage. + apply-pool-size = 4 + + ## The allowable number of threads that process Raft, which is the size of the Raftstore thread pool. + store-pool-size = 4 + + ## The number of threads that handle snapshots. + ## The default value is 2. If you set it to 0, the multi-thread optimization is disabled. + ## A specific parameter of TiFlash Proxy, introduced in v4.0.0. + snap-handle-pool-size = 2 + +[security] + ## New in v5.0. This configuration item enables or disables log redaction. Value options: `true`, `false`, `"on"`, `"off"`, and `"marker"`. The `"on"`, `"off"`, and `"marker"` options are introduced in v8.3.0. + ## If the configuration item is set to `false` or `"off"`, log redaction is disabled. + ## If the configuration item is set to `true` or `"on"`, all user data in the log is replaced by `?`. + ## If the configuration item is set to `"marker"`, all user data in the log is wrapped in `‹ ›`. If user data contains `‹` or `›`, `‹` is escaped as `‹‹`, and `›` is escaped as `››`. Based on the marked logs, you can decide whether to desensitize the marked information when the logs are displayed. + ## The default value is `false`. + redact-info-log = false + +[security.encryption] + ## The encryption method for data files. + ## Value options: "aes128-ctr", "aes192-ctr", "aes256-ctr", "sm4-ctr" (supported since v6.4.0), and "plaintext". + ## Default value: `"plaintext"`, which means encryption is disabled by default. A value other than "plaintext" means that encryption is enabled, in which case the master key must be specified. + data-encryption-method = "aes128-ctr" + ## Specifies how often the data encryption key is rotated. Default value: `7d`. + data-key-rotation-period = "168h" # 7 days + +[security.encryption.master-key] + ## Specifies the master key if encryption is enabled. To learn how to configure a master key, see Configure encryption: https://docs.pingcap.com/tidb/dev/encryption-at-rest#configure-encryption . + +[security.encryption.previous-master-key] + ## Specifies the old master key when rotating the new master key. The configuration format is the same as that of `master-key`. To learn how to configure a master key, see Configure encryption: https://docs.pingcap.com/tidb/dev/encryption-at-rest#configure-encryption . +``` + +### Schedule replicas by topology labels + +See [Set available zones](/tiflash/create-tiflash-replicas.md#set-available-zones). + +### Multi-disk deployment + +TiFlash supports multi-disk deployment. If there are multiple disks in your TiFlash node, you can make full use of those disks by configuring the parameters described in the following sections. For TiFlash's configuration template to be used for TiUP, see [The complex template for the TiFlash topology](https://github.com/pingcap/docs/blob/master/config-templates/complex-tiflash.yaml). + +#### Multi-disk deployment with TiDB version earlier than v4.0.9 + +For TiDB clusters earlier than v4.0.9, TiFlash only supports storing the main data of the storage engine on multiple disks. You can set up a TiFlash node on multiple disks by specifying the `path` (`data_dir` in TiUP) and `path_realtime_mode` configuration. + +If there are multiple data storage directories in `path`, separate each with a comma. 
For example, `/nvme_ssd_a/data/tiflash,/sata_ssd_b/data/tiflash,/sata_ssd_c/data/tiflash`. If there are multiple disks in your environment, it is recommended that each directory corresponds to one disk and you put disks with the best performance at the front to maximize the performance of all disks. + +If there are multiple disks with similar I/O metrics on your TiFlash node, you can leave the `path_realtime_mode` parameter to the default value (or you can explicitly set it to `false`). It means that data will be evenly distributed among all storage directories. However, the latest data is written only to the first directory, so the corresponding disk is busier than other disks. + +If there are multiple disks with different I/O metrics on your TiFlash node, it is recommended to set `path_realtime_mode` to `true` and put disks with the best I/O metrics at the front of `path`. It means that the first directory only stores the latest data, and the older data are evenly distributed among the other directories. Note that in this case, the capacity of the first directory should be planned as 10% of the total capacity of all directories. + +#### Multi-disk deployment with TiDB v4.0.9 or later + +For TiDB clusters with v4.0.9 or later versions, TiFlash supports storing the main data and the latest data of the storage engine on multiple disks. If you want to deploy a TiFlash node on multiple disks, it is recommended to specify your storage directories in the `[storage]` section to make full use of your node. Note that the configurations earlier than v4.0.9 (`path` and `path_realtime_mode`) are still supported. + +If there are multiple disks with similar I/O metrics on your TiFlash node, it is recommended to specify corresponding directories in the `storage.main.dir` list and leave `storage.latest.dir` empty. TiFlash will distribute I/O pressure and data among all directories. + +If there are multiple disks with different I/O metrics on your TiFlash node, it is recommended to specify directories with higher metrics in the `storage.latest.dir` list, and specify directories with lower metrics in the `storage.main.dir` list. For example, for one NVMe-SSD and two SATA-SSDs, you can set `storage.latest.dir` to `["/nvme_ssd_a/data/tiflash"]` and `storage.main.dir` to `["/sata_ssd_b/data/tiflash", "/sata_ssd_c/data/tiflash"]`. TiFlash will distribute I/O pressure and data among these two directories list respectively. Note that in this case, the capacity of `storage.latest.dir` should be planned as 10% of the total planned capacity. + +> **Warning:** +> +> The `[storage]` configuration is supported in TiUP since v1.2.5. If your TiDB cluster version is v4.0.9 or later, make sure that your TiUP version is v1.2.5 or later. Otherwise, the data directories defined in `[storage]` will not be managed by TiUP. diff --git a/markdown-pages/en/tidb/master/vector-search-data-types.md b/markdown-pages/en/tidb/master/vector-search-data-types.md new file mode 100644 index 0000000..62031f5 --- /dev/null +++ b/markdown-pages/en/tidb/master/vector-search-data-types.md @@ -0,0 +1,254 @@ +--- +title: Vector Data Types +summary: Learn about the Vector data types in TiDB. +--- + +# Vector Data Types + +A vector is a sequence of floating-point numbers, such as `[0.3, 0.5, -0.1, ...]`. TiDB offers Vector data types, specifically optimized for efficiently storing and querying vector embeddings widely used in AI applications. + + + +> **Warning:** +> +> This feature is experimental. 
It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub.
+
+
+> **Note:**
+>
+> Vector data types are only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters.
+
+The following Vector data types are currently available:
+
+- `VECTOR`: A sequence of single-precision floating-point numbers with any dimension.
+- `VECTOR(D)`: A sequence of single-precision floating-point numbers with a fixed dimension `D`.
+
+Using vector data types provides the following advantages over using the [`JSON`](/data-type-json.md) type:
+
+- Vector index support: You can build a [vector search index](/vector-search-index.md) to speed up vector searching.
+- Dimension enforcement: You can specify a dimension to forbid inserting vectors with different dimensions.
+- Optimized storage format: Vector data types are optimized for handling vector data, offering better space efficiency and performance compared to `JSON` types.
+
+## Syntax
+
+You can use a string in the following syntax to represent a Vector value:
+
+```sql
+'[<float>, <float>, ...]'
+```
+
+Example:
+
+```sql
+CREATE TABLE vector_table (
+    id INT PRIMARY KEY,
+    embedding VECTOR(3)
+);
+
+INSERT INTO vector_table VALUES (1, '[0.3, 0.5, -0.1]');
+
+INSERT INTO vector_table VALUES (2, NULL);
+```
+
+Inserting vector values with invalid syntax will result in an error:
+
+```sql
+[tidb]> INSERT INTO vector_table VALUES (3, '[5, ]');
+ERROR 1105 (HY000): Invalid vector text: [5, ]
+```
+
+In the following example, because dimension `3` is enforced for the `embedding` column when the table is created, inserting a vector with a different dimension will result in an error:
+
+```sql
+[tidb]> INSERT INTO vector_table VALUES (4, '[0.3, 0.5]');
+ERROR 1105 (HY000): vector has 2 dimensions, does not fit VECTOR(3)
+```
+
+For available functions and operators over the vector data types, see [Vector Functions and Operators](/vector-search-functions-and-operators.md).
+
+For more information about building and using a vector search index, see [Vector Search Index](/vector-search-index.md).
+
+## Store vectors with different dimensions
+
+You can store vectors with different dimensions in the same column by omitting the dimension parameter in the `VECTOR` type:
+
+```sql
+CREATE TABLE vector_table (
+    id INT PRIMARY KEY,
+    embedding VECTOR
+);
+
+INSERT INTO vector_table VALUES (1, '[0.3, 0.5, -0.1]'); -- 3-dimensional vector, OK
+INSERT INTO vector_table VALUES (2, '[0.3, 0.5]');       -- 2-dimensional vector, OK
+```
+
+However, note that you cannot build a [vector search index](/vector-search-index.md) for this column, as vector distances can only be calculated between vectors with the same dimensions.
+
+## Comparison
+
+You can compare vector data types using [comparison operators](/functions-and-operators/operators.md) such as `=`, `!=`, `<`, `>`, `<=`, and `>=`. For a complete list of comparison operators and functions for vector data types, see [Vector Functions and Operators](/vector-search-functions-and-operators.md).
+
+Vector data types are compared element-wise numerically.
For example: + +- `[1] < [12]` +- `[1,2,3] < [1,2,5]` +- `[1,2,3] = [1,2,3]` +- `[2,2,3] > [1,2,3]` + +Two vectors with different dimensions are compared using lexicographical comparison, with the following rules: + +- Two vectors are compared element by element from the start, and each element is compared numerically. +- The first mismatching element determines which vector is lexicographically _less_ or _greater_ than the other. +- If one vector is a prefix of another, the shorter vector is lexicographically _less_ than the other. For example, `[1,2,3] < [1,2,3,0]`. +- Vectors of the same length with identical elements are lexicographically _equal_. +- An empty vector is lexicographically _less_ than any non-empty vector. For example, `[] < [1]`. +- Two empty vectors are lexicographically _equal_. + +When comparing vector constants, consider performing an [explicit cast](#cast) from string to vector to avoid comparisons based on string values: + +```sql +-- Because string is given, TiDB is comparing strings: +[tidb]> SELECT '[12.0]' < '[4.0]'; ++--------------------+ +| '[12.0]' < '[4.0]' | ++--------------------+ +| 1 | ++--------------------+ +1 row in set (0.01 sec) + +-- Cast to vector explicitly to compare by vectors: +[tidb]> SELECT VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]'); ++--------------------------------------------------+ +| VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]') | ++--------------------------------------------------+ +| 0 | ++--------------------------------------------------+ +1 row in set (0.01 sec) +``` + +## Arithmetic + +Vector data types support arithmetic operations `+` (addition) and `-` (subtraction). However, arithmetic operations between vectors with different dimensions are not supported and will result in an error. + +Examples: + +```sql +[tidb]> SELECT VEC_FROM_TEXT('[4]') + VEC_FROM_TEXT('[5]'); ++---------------------------------------------+ +| VEC_FROM_TEXT('[4]') + VEC_FROM_TEXT('[5]') | ++---------------------------------------------+ +| [9] | ++---------------------------------------------+ +1 row in set (0.01 sec) + +[tidb]> SELECT VEC_FROM_TEXT('[2,3,4]') - VEC_FROM_TEXT('[1,2,3]'); ++-----------------------------------------------------+ +| VEC_FROM_TEXT('[2,3,4]') - VEC_FROM_TEXT('[1,2,3]') | ++-----------------------------------------------------+ +| [1,1,1] | ++-----------------------------------------------------+ +1 row in set (0.01 sec) + +[tidb]> SELECT VEC_FROM_TEXT('[4]') + VEC_FROM_TEXT('[1,2,3]'); +ERROR 1105 (HY000): vectors have different dimensions: 1 and 3 +``` + +## Cast + +### Cast between Vector ⇔ String + +To cast between Vector and String, use the following functions: + +- `CAST(... AS VECTOR)`: String ⇒ Vector +- `CAST(... AS CHAR)`: Vector ⇒ String +- `VEC_FROM_TEXT`: String ⇒ Vector +- `VEC_AS_TEXT`: Vector ⇒ String + +To improve usability, if you call a function that only supports vector data types, such as a vector correlation distance function, you can also just pass in a format-compliant string. TiDB automatically performs an implicit cast in this case. + +```sql +-- The VEC_DIMS function only accepts VECTOR arguments, so you can directly pass in a string for an implicit cast. 
+[tidb]> SELECT VEC_DIMS('[0.3, 0.5, -0.1]'); ++------------------------------+ +| VEC_DIMS('[0.3, 0.5, -0.1]') | ++------------------------------+ +| 3 | ++------------------------------+ +1 row in set (0.01 sec) + +-- You can also explicitly cast a string to a vector using VEC_FROM_TEXT and then pass the vector to the VEC_DIMS function. +[tidb]> SELECT VEC_DIMS(VEC_FROM_TEXT('[0.3, 0.5, -0.1]')); ++---------------------------------------------+ +| VEC_DIMS(VEC_FROM_TEXT('[0.3, 0.5, -0.1]')) | ++---------------------------------------------+ +| 3 | ++---------------------------------------------+ +1 row in set (0.01 sec) + +-- You can also cast explicitly using CAST(... AS VECTOR): +[tidb]> SELECT VEC_DIMS(CAST('[0.3, 0.5, -0.1]' AS VECTOR)); ++----------------------------------------------+ +| VEC_DIMS(CAST('[0.3, 0.5, -0.1]' AS VECTOR)) | ++----------------------------------------------+ +| 3 | ++----------------------------------------------+ +1 row in set (0.01 sec) +``` + +When using an operator or function that accepts multiple data types, you need to explicitly cast the string type to the vector type before passing the string to that operator or function, because TiDB does not perform implicit casts in this case. For example, before performing comparison operations, you need to explicitly cast strings to vectors; otherwise, TiDB compares them as string values rather than as vector numeric values: + +```sql +-- Because string is given, TiDB is comparing strings: +[tidb]> SELECT '[12.0]' < '[4.0]'; ++--------------------+ +| '[12.0]' < '[4.0]' | ++--------------------+ +| 1 | ++--------------------+ +1 row in set (0.01 sec) + +-- Cast to vector explicitly to compare by vectors: +[tidb]> SELECT VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]'); ++--------------------------------------------------+ +| VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]') | ++--------------------------------------------------+ +| 0 | ++--------------------------------------------------+ +1 row in set (0.01 sec) +``` + +You can also explicitly cast a vector to its string representation. Take using the `VEC_AS_TEXT()` function as an example: + +```sql +-- The string is first implicitly cast to a vector, and then the vector is explicitly cast to a string, thus returning a string in the normalized format: +[tidb]> SELECT VEC_AS_TEXT('[0.3, 0.5, -0.1]'); ++--------------------------------------+ +| VEC_AS_TEXT('[0.3, 0.5, -0.1]') | ++--------------------------------------+ +| [0.3,0.5,-0.1] | ++--------------------------------------+ +1 row in set (0.01 sec) +``` + +For additional cast functions, see [Vector Functions and Operators](/vector-search-functions-and-operators.md). + +### Cast between Vector ⇔ other data types + +Currently, direct casting between Vector and other data types (such as `JSON`) is not supported. To work around this limitation, use String as an intermediate data type for casting in your SQL statement. + +Note that vector data type columns stored in a table cannot be converted to other data types using `ALTER TABLE ... MODIFY COLUMN ...`. + +## Restrictions + +For restrictions on vector data types, see [Vector search limitations](/vector-search-limitations.md) and [Vector index restrictions](/vector-search-index.md#restrictions). + +## MySQL compatibility + +Vector data types are TiDB specific, and are not supported in MySQL. 
+ +## See also + +- [Vector Functions and Operators](/vector-search-functions-and-operators.md) +- [Vector Search Index](/vector-search-index.md) +- [Improve Vector Search Performance](/vector-search-improve-performance.md) \ No newline at end of file diff --git a/markdown-pages/en/tidb/master/vector-search-functions-and-operators.md b/markdown-pages/en/tidb/master/vector-search-functions-and-operators.md new file mode 100644 index 0000000..f6ed644 --- /dev/null +++ b/markdown-pages/en/tidb/master/vector-search-functions-and-operators.md @@ -0,0 +1,292 @@ +--- +title: Vector Functions and Operators +summary: Learn about functions and operators available for Vector data types. +--- + +# Vector Functions and Operators + +This document lists the functions and operators available for Vector data types. + + + +> **Warning:** +> +> This feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** +> +> Vector data types and these vector functions are only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. + +## Vector functions + +The following functions are designed specifically for [Vector data types](/vector-search-data-types.md). + +**Vector distance functions:** + +| Function Name | Description | +| --------------------------------------------------------- | ---------------------------------------------------------------- | +| [`VEC_L2_DISTANCE`](#vec_l2_distance) | Calculates L2 distance (Euclidean distance) between two vectors | +| [`VEC_COSINE_DISTANCE`](#vec_cosine_distance) | Calculates the cosine distance between two vectors | +| [`VEC_NEGATIVE_INNER_PRODUCT`](#vec_negative_inner_product) | Calculates the negative of the inner product between two vectors | +| [`VEC_L1_DISTANCE`](#vec_l1_distance) | Calculates L1 distance (Manhattan distance) between two vectors | + +**Other vector functions:** + +| Function Name | Description | +| ------------------------------- | --------------------------------------------------- | +| [`VEC_DIMS`](#vec_dims) | Returns the dimension of a vector | +| [`VEC_L2_NORM`](#vec_l2_norm) | Calculates the L2 norm (Euclidean norm) of a vector | +| [`VEC_FROM_TEXT`](#vec_from_text) | Converts a string into a vector | +| [`VEC_AS_TEXT`](#vec_as_text) | Converts a vector into a string | + +## Extended built-in functions and operators + +The following built-in functions and operators are extended to support operations on [Vector data types](/vector-search-data-types.md). + +**Arithmetic operators:** + +| Name | Description | +| :-------------------------------------------------------------------------------------- | :--------------------------------------- | +| [`+`](https://dev.mysql.com/doc/refman/8.0/en/arithmetic-functions.html#operator_plus) | Vector element-wise addition operator | +| [`-`](https://dev.mysql.com/doc/refman/8.0/en/arithmetic-functions.html#operator_minus) | Vector element-wise subtraction operator | + +For more information about how vector arithmetic works, see [Vector Data Type | Arithmetic](/vector-search-data-types.md#arithmetic). 
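+
+For example, the following statements are an illustrative sketch only. They assume a table `t` that has an `id` column and a `VECTOR(3)` column named `embedding`:
+
+```sql
+-- Element-wise addition and subtraction between a vector column and a vector literal.
+-- Both operands must have the same dimension.
+SELECT id,
+       embedding + VEC_FROM_TEXT('[0.1, 0.1, 0.1]') AS plus_offset,
+       embedding - VEC_FROM_TEXT('[0.1, 0.1, 0.1]') AS minus_offset
+FROM t;
+```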
+ +**Aggregate (GROUP BY) functions:** + +| Name | Description | +| :----------------------- | :----------------------------------------------- | +| [`COUNT()`](https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions.html#function_count) | Return a count of the number of rows returned | +| [`COUNT(DISTINCT)`](https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions.html#function_count-distinct) | Return the count of a number of different values | +| [`MAX()`](https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions.html#function_max) | Return the maximum value | +| [`MIN()`](https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions.html#function_min) | Return the minimum value | + +**Comparison functions and operators:** + +| Name | Description | +| ---------------------------------------- | ----------------------------------------------------- | +| [`BETWEEN ... AND ...`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_between) | Check whether a value is within a range of values | +| [`COALESCE()`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#function_coalesce) | Return the first non-NULL argument | +| [`=`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_equal) | Equal operator | +| [`<=>`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_equal-to) | NULL-safe equal to operator | +| [`>`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_greater-than) | Greater than operator | +| [`>=`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_greater-than-or-equal) | Greater than or equal operator | +| [`GREATEST()`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#function_greatest) | Return the largest argument | +| [`IN()`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_in) | Check whether a value is within a set of values | +| [`IS NULL`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_is-null) | Test whether a value is `NULL` | +| [`ISNULL()`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#function_isnull) | Test whether the argument is `NULL` | +| [`LEAST()`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#function_least) | Return the smallest argument | +| [`<`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_less-than) | Less than operator | +| [`<=`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_less-than-or-equal) | Less than or equal operator | +| [`NOT BETWEEN ... AND ...`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_not-between) | Check whether a value is not within a range of values | +| [`!=`, `<>`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_not-equal) | Not equal operator | +| [`NOT IN()`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_not-in) | Check whether a value is not within a set of values | + +For more information about how vectors are compared, see [Vector Data Type | Comparison](/vector-search-data-types.md#comparison). 
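+
+For example, the following statements are an illustrative sketch only. They assume a table `t` with a nullable `VECTOR` column named `embedding`, and show the extended aggregate and comparison support listed in the preceding tables:
+
+```sql
+-- Aggregate functions work directly on the vector column.
+SELECT COUNT(*), COUNT(DISTINCT embedding) FROM t;
+
+-- Comparison operators and NULL tests also accept vector values.
+SELECT MIN(embedding), MAX(embedding) FROM t WHERE embedding IS NOT NULL;
+SELECT * FROM t WHERE embedding = VEC_FROM_TEXT('[1, 2, 3]');
+```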
+ +**Control flow functions:** + +| Name | Description | +| :------------------------------------------------------------------------------------------------ | :--------------------------- | +| [`CASE`](https://dev.mysql.com/doc/refman/8.0/en/flow-control-functions.html#operator_case) | Case operator | +| [`IF()`](https://dev.mysql.com/doc/refman/8.0/en/flow-control-functions.html#function_if) | If/else construct | +| [`IFNULL()`](https://dev.mysql.com/doc/refman/8.0/en/flow-control-functions.html#function_ifnull) | Null if/else construct | +| [`NULLIF()`](https://dev.mysql.com/doc/refman/8.0/en/flow-control-functions.html#function_nullif) | Return `NULL` if expr1 = expr2 | + +**Cast functions:** + +| Name | Description | +| :------------------------------------------------------------------------------------------ | :----------------------------- | +| [`CAST()`](https://dev.mysql.com/doc/refman/8.0/en/cast-functions.html#function_cast) | Cast a value as a string or vector | +| [`CONVERT()`](https://dev.mysql.com/doc/refman/8.0/en/cast-functions.html#function_convert) | Cast a value as a string | + +For more information about how to use `CAST()`, see [Vector Data Type | Cast](/vector-search-data-types.md#cast). + +## Full references + +### VEC_L2_DISTANCE + +```sql +VEC_L2_DISTANCE(vector1, vector2) +``` + +Calculates the [L2 distance](https://en.wikipedia.org/wiki/Euclidean_distance) (Euclidean distance) between two vectors using the following formula: + +$DISTANCE(p,q)=\sqrt {\sum \limits _{i=1}^{n}{(p_{i}-q_{i})^{2}}}$ + +The two vectors must have the same dimension. Otherwise, an error is returned. + +Example: + +```sql +[tidb]> SELECT VEC_L2_DISTANCE('[0,3]', '[4,0]'); ++-----------------------------------+ +| VEC_L2_DISTANCE('[0,3]', '[4,0]') | ++-----------------------------------+ +| 5 | ++-----------------------------------+ +``` + +### VEC_COSINE_DISTANCE + +```sql +VEC_COSINE_DISTANCE(vector1, vector2) +``` + +Calculates the [cosine distance](https://en.wikipedia.org/wiki/Cosine_similarity) between two vectors using the following formula: + +$DISTANCE(p,q)=1.0 - {\frac {\sum \limits _{i=1}^{n}{p_{i}q_{i}}}{{\sqrt {\sum \limits _{i=1}^{n}{p_{i}^{2}}}}\cdot {\sqrt {\sum \limits _{i=1}^{n}{q_{i}^{2}}}}}}$ + +The two vectors must have the same dimension. Otherwise, an error is returned. + +Example: + +```sql +[tidb]> SELECT VEC_COSINE_DISTANCE('[1, 1]', '[-1, -1]'); ++-------------------------------------------+ +| VEC_COSINE_DISTANCE('[1, 1]', '[-1, -1]') | ++-------------------------------------------+ +| 2 | ++-------------------------------------------+ +``` + +### VEC_NEGATIVE_INNER_PRODUCT + +```sql +VEC_NEGATIVE_INNER_PRODUCT(vector1, vector2) +``` + +Calculates the distance by using the negative of the [inner product](https://en.wikipedia.org/wiki/Dot_product) between two vectors, using the following formula: + +$DISTANCE(p,q)=- INNER\_PROD(p,q)=-\sum \limits _{i=1}^{n}{p_{i}q_{i}}$ + +The two vectors must have the same dimension. Otherwise, an error is returned. 
+ +Example: + +```sql +[tidb]> SELECT VEC_NEGATIVE_INNER_PRODUCT('[1,2]', '[3,4]'); ++----------------------------------------------+ +| VEC_NEGATIVE_INNER_PRODUCT('[1,2]', '[3,4]') | ++----------------------------------------------+ +| -11 | ++----------------------------------------------+ +``` + +### VEC_L1_DISTANCE + +```sql +VEC_L1_DISTANCE(vector1, vector2) +``` + +Calculates the [L1 distance](https://en.wikipedia.org/wiki/Taxicab_geometry) (Manhattan distance) between two vectors using the following formula: + +$DISTANCE(p,q)=\sum \limits _{i=1}^{n}{|p_{i}-q_{i}|}$ + +The two vectors must have the same dimension. Otherwise, an error is returned. + +Example: + +```sql +[tidb]> SELECT VEC_L1_DISTANCE('[0,0]', '[3,4]'); ++-----------------------------------+ +| VEC_L1_DISTANCE('[0,0]', '[3,4]') | ++-----------------------------------+ +| 7 | ++-----------------------------------+ +``` + +### VEC_DIMS + +```sql +VEC_DIMS(vector) +``` + +Returns the dimension of a vector. + +Examples: + +```sql +[tidb]> SELECT VEC_DIMS('[1,2,3]'); ++---------------------+ +| VEC_DIMS('[1,2,3]') | ++---------------------+ +| 3 | ++---------------------+ + +[tidb]> SELECT VEC_DIMS('[]'); ++----------------+ +| VEC_DIMS('[]') | ++----------------+ +| 0 | ++----------------+ +``` + +### VEC_L2_NORM + +```sql +VEC_L2_NORM(vector) +``` + +Calculates the [L2 norm](https://en.wikipedia.org/wiki/Norm_(mathematics)) (Euclidean norm) of a vector using the following formula: + +$NORM(p)=\sqrt {\sum \limits _{i=1}^{n}{p_{i}^{2}}}$ + +Example: + +```sql +[tidb]> SELECT VEC_L2_NORM('[3,4]'); ++----------------------+ +| VEC_L2_NORM('[3,4]') | ++----------------------+ +| 5 | ++----------------------+ +``` + +### VEC_FROM_TEXT + +```sql +VEC_FROM_TEXT(string) +``` + +Converts a string into a vector. + +Example: + +```sql +[tidb]> SELECT VEC_FROM_TEXT('[1,2]') + VEC_FROM_TEXT('[3,4]'); ++-------------------------------------------------+ +| VEC_FROM_TEXT('[1,2]') + VEC_FROM_TEXT('[3,4]') | ++-------------------------------------------------+ +| [4,6] | ++-------------------------------------------------+ +``` + +### VEC_AS_TEXT + +```sql +VEC_AS_TEXT(vector) +``` + +Converts a vector into a string. + +Example: + +```sql +[tidb]> SELECT VEC_AS_TEXT('[1.000, 2.5]'); ++-------------------------------+ +| VEC_AS_TEXT('[1.000, 2.5]') | ++-------------------------------+ +| [1,2.5] | ++-------------------------------+ +``` + +## MySQL compatibility + +The vector functions and the extended usage of built-in functions and operators over vector data types are TiDB specific, and are not supported in MySQL. + +## See also + +- [Vector Data Types](/vector-search-data-types.md) diff --git a/markdown-pages/en/tidb/master/vector-search-get-started-using-python.md b/markdown-pages/en/tidb/master/vector-search-get-started-using-python.md new file mode 100644 index 0000000..0ba00c8 --- /dev/null +++ b/markdown-pages/en/tidb/master/vector-search-get-started-using-python.md @@ -0,0 +1,251 @@ +--- +title: Get Started with TiDB + AI via Python +summary: Learn how to quickly develop an AI application that performs semantic search using Python and TiDB Vector Search. +--- + +# Get Started with TiDB + AI via Python + +This tutorial demonstrates how to develop a simple AI application that provides **semantic search** features. Unlike traditional keyword search, semantic search intelligently understands the meaning behind your query and returns the most relevant result. 
For example, if you have documents titled "dog", "fish", and "tree", and you search for "a swimming animal", the application would identify "fish" as the most relevant result. + +Throughout this tutorial, you will develop this AI application using [TiDB Vector Search](/vector-search-overview.md), Python, [TiDB Vector SDK for Python](https://github.com/pingcap/tidb-vector-python), and AI models. + + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** +> +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. + +## Prerequisites + +To complete this tutorial, you need: + +- [Python 3.8 or higher](https://www.python.org/downloads/) installed. +- [Git](https://git-scm.com/downloads) installed. +- A TiDB cluster. + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- Follow [Deploy a local test TiDB cluster](/quick-start-with-tidb.md#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](/production-deployment-using-tiup.md) to create a local cluster. +- Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. + + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- (Recommended) Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. +- Follow [Deploy a local test TiDB cluster](https://docs.pingcap.com/tidb/stable/quick-start-with-tidb#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](https://docs.pingcap.com/tidb/stable/production-deployment-using-tiup) to create a local cluster. + + + +## Get started + +The following steps show how to develop the application from scratch. To run the demo directly, you can check out the sample code in the [pingcap/tidb-vector-python](https://github.com/pingcap/tidb-vector-python/blob/main/examples/python-client-quickstart) repository. + +### Step 1. Create a new Python project + +In your preferred directory, create a new Python project and a file named `example.py`: + +```shell +mkdir python-client-quickstart +cd python-client-quickstart +touch example.py +``` + +### Step 2. Install required dependencies + +In your project directory, run the following command to install the required packages: + +```shell +pip install sqlalchemy pymysql sentence-transformers tidb-vector python-dotenv +``` + +- `tidb-vector`: the Python client for interacting with TiDB vector search. +- [`sentence-transformers`](https://sbert.net): a Python library that provides pre-trained models for generating [vector embeddings](/vector-search-overview.md#vector-embedding) from text. + +### Step 3. Configure the connection string to the TiDB cluster + +Configure the cluster connection string depending on the TiDB deployment option you've selected. + + +
+ +For a TiDB Cloud Serverless cluster, take the following steps to obtain the cluster connection string and configure environment variables: + +1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. + +2. Click **Connect** in the upper-right corner. A connection dialog is displayed. + +3. Ensure the configurations in the connection dialog match your operating environment. + + - **Connection Type** is set to `Public`. + - **Branch** is set to `main`. + - **Connect With** is set to `SQLAlchemy`. + - **Operating System** matches your environment. + + > **Tip:** + > + > If your program is running in Windows Subsystem for Linux (WSL), switch to the corresponding Linux distribution. + +4. Click the **PyMySQL** tab and copy the connection string. + + > **Tip:** + > + > If you have not set a password yet, click **Generate Password** to generate a random password. + +5. In the root directory of your Python project, create a `.env` file and paste the connection string into it. + + The following is an example for macOS: + + ```dotenv + TIDB_DATABASE_URL="mysql+pymysql://.root:@gateway01..prod.aws.tidbcloud.com:4000/test?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true" + ``` + +
+
+
+For a TiDB Self-Managed cluster, create a `.env` file in the root directory of your Python project. Copy the following content into the `.env` file, and modify the environment variable values according to the connection parameters of your TiDB cluster:
+
+```dotenv
+TIDB_DATABASE_URL="mysql+pymysql://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>"
+# For example: TIDB_DATABASE_URL="mysql+pymysql://root@127.0.0.1:4000/test"
+```
+
+If you are running TiDB on your local machine, `<HOST>` is `127.0.0.1` by default. The initial `<PASSWORD>` is empty, so if you are starting the cluster for the first time, you can omit this field.
+
+The following are descriptions for each parameter:
+
+- `<USERNAME>`: The username to connect to the TiDB cluster.
+- `<PASSWORD>`: The password to connect to the TiDB cluster.
+- `<HOST>`: The host of the TiDB cluster.
+- `<PORT>`: The port of the TiDB cluster.
+- `<DATABASE>`: The name of the database you want to connect to.
+
+ +
+ +### Step 4. Initialize the embedding model + +An [embedding model](/vector-search-overview.md#embedding-model) transforms data into [vector embeddings](/vector-search-overview.md#vector-embedding). This example uses the pre-trained model [**msmarco-MiniLM-L12-cos-v5**](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L12-cos-v5) for text embedding. This lightweight model, provided by the `sentence-transformers` library, transforms text data into 384-dimensional vector embeddings. + +To set up the model, copy the following code into the `example.py` file. This code initializes a `SentenceTransformer` instance and defines a `text_to_embedding()` function for later use. + +```python +from sentence_transformers import SentenceTransformer + +print("Downloading and loading the embedding model...") +embed_model = SentenceTransformer("sentence-transformers/msmarco-MiniLM-L12-cos-v5", trust_remote_code=True) +embed_model_dims = embed_model.get_sentence_embedding_dimension() + +def text_to_embedding(text): + """Generates vector embeddings for the given text.""" + embedding = embed_model.encode(text) + return embedding.tolist() +``` + +### Step 5. Connect to the TiDB cluster + +Use the `TiDBVectorClient` class to connect to your TiDB cluster and create a table `embedded_documents` with a vector column. + +> **Note** +> +> Make sure the dimension of your vector column in the table matches the dimension of the vectors generated by your embedding model. For example, the **msmarco-MiniLM-L12-cos-v5** model generates vectors with 384 dimensions, so the dimension of your vector columns in `embedded_documents` should be 384 as well. + +```python +import os +from tidb_vector.integrations import TiDBVectorClient +from dotenv import load_dotenv + +# Load the connection string from the .env file +load_dotenv() + +vector_store = TiDBVectorClient( + # The 'embedded_documents' table will store the vector data. + table_name='embedded_documents', + # The connection string to the TiDB cluster. + connection_string=os.environ.get('TIDB_DATABASE_URL'), + # The dimension of the vector generated by the embedding model. + vector_dimension=embed_model_dims, + # Recreate the table if it already exists. + drop_existing_table=True, +) +``` + +### Step 6. Embed text data and store the vectors + +In this step, you will prepare sample documents containing single words, such as "dog", "fish", and "tree". The following code uses the `text_to_embedding()` function to transform these text documents into vector embeddings, and then inserts them into the vector store. + +```python +documents = [ + { + "id": "f8e7dee2-63b6-42f1-8b60-2d46710c1971", + "text": "dog", + "embedding": text_to_embedding("dog"), + "metadata": {"category": "animal"}, + }, + { + "id": "8dde1fbc-2522-4ca2-aedf-5dcb2966d1c6", + "text": "fish", + "embedding": text_to_embedding("fish"), + "metadata": {"category": "animal"}, + }, + { + "id": "e4991349-d00b-485c-a481-f61695f2b5ae", + "text": "tree", + "embedding": text_to_embedding("tree"), + "metadata": {"category": "plant"}, + }, +] + +vector_store.insert( + ids=[doc["id"] for doc in documents], + texts=[doc["text"] for doc in documents], + embeddings=[doc["embedding"] for doc in documents], + metadatas=[doc["metadata"] for doc in documents], +) +``` + +### Step 7. Perform semantic search + +In this step, you will search for "a swimming animal", which doesn't directly match any words in existing documents. 
+ +The following code uses the `text_to_embedding()` function again to convert the query text into a vector embedding, and then queries with the embedding to find the top three closest matches. + +```python +def print_result(query, result): + print(f"Search result (\"{query}\"):") + for r in result: + print(f"- text: \"{r.document}\", distance: {r.distance}") + +query = "a swimming animal" +query_embedding = text_to_embedding(query) +search_result = vector_store.query(query_embedding, k=3) +print_result(query, search_result) +``` + +Run the `example.py` file and the output is as follows: + +```plain +Search result ("a swimming animal"): +- text: "fish", distance: 0.4562914811223072 +- text: "dog", distance: 0.6469335836410557 +- text: "tree", distance: 0.798545178640937 +``` + +The three terms in the search results are sorted by their respective distance from the queried vector: the smaller the distance, the more relevant the corresponding `document`. + +Therefore, according to the output, the swimming animal is most likely a fish, or a dog with a gift for swimming. + +## See also + +- [Vector Data Types](/vector-search-data-types.md) +- [Vector Search Index](/vector-search-index.md) \ No newline at end of file diff --git a/markdown-pages/en/tidb/master/vector-search-get-started-using-sql.md b/markdown-pages/en/tidb/master/vector-search-get-started-using-sql.md new file mode 100644 index 0000000..ab69b85 --- /dev/null +++ b/markdown-pages/en/tidb/master/vector-search-get-started-using-sql.md @@ -0,0 +1,195 @@ +--- +title: Get Started with Vector Search via SQL +summary: Learn how to quickly get started with Vector Search in TiDB using SQL statements to power your generative AI applications. +--- + +# Get Started with Vector Search via SQL + +TiDB extends MySQL syntax to support [Vector Search](/vector-search-overview.md) and introduce new [Vector data types](/vector-search-data-types.md) and several [vector functions](/vector-search-functions-and-operators.md). + +This tutorial demonstrates how to get started with TiDB Vector Search just using SQL statements. You will learn how to use the [MySQL command-line client](https://dev.mysql.com/doc/refman/8.4/en/mysql.html) to complete the following operations: + +- Connect to your TiDB cluster. +- Create a vector table. +- Store vector embeddings. +- Perform vector search queries. + + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** +> +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. + +## Prerequisites + +To complete this tutorial, you need: + +- [MySQL command-line client](https://dev.mysql.com/doc/refman/8.4/en/mysql.html) (MySQL CLI) installed on your machine. +- A TiDB cluster. + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- Follow [Deploy a local test TiDB cluster](/quick-start-with-tidb.md#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](/production-deployment-using-tiup.md) to create a local cluster. +- Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. 
+ + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- (Recommended) Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. +- Follow [Deploy a local test TiDB cluster](https://docs.pingcap.com/tidb/stable/quick-start-with-tidb#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](https://docs.pingcap.com/tidb/stable/production-deployment-using-tiup) to create a local cluster. + + + +## Get started + +### Step 1. Connect to the TiDB cluster + +Connect to your TiDB cluster depending on the TiDB deployment option you've selected. + + +
+ +1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. + +2. Click **Connect** in the upper-right corner. A connection dialog is displayed. + +3. In the connection dialog, select **MySQL CLI** from the **Connect With** drop-down list and keep the default setting of the **Connection Type** as **Public**. + +4. If you have not set a password yet, click **Generate Password** to generate a random password. + +5. Copy the connection command and paste it into your terminal. The following is an example for macOS: + + ```bash + mysql -u '.root' -h '' -P 4000 -D 'test' --ssl-mode=VERIFY_IDENTITY --ssl-ca=/etc/ssl/cert.pem -p'' + ``` + +
+
+ +After your TiDB Self-Managed cluster is started, execute your cluster connection command in the terminal. + +The following is an example connection command for macOS: + +```bash +mysql --comments --host 127.0.0.1 --port 4000 -u root +``` + +
+ +
+ +### Step 2. Create a vector table + +When creating a table, you can define a column as a [vector](/vector-search-overview.md#vector-embedding) column by specifying the `VECTOR` data type. + +For example, to create a table `embedded_documents` with a three-dimensional `VECTOR` column, execute the following SQL statements using your MySQL CLI: + +```sql +USE test; +CREATE TABLE embedded_documents ( + id INT PRIMARY KEY, + -- Column to store the original content of the document. + document TEXT, + -- Column to store the vector representation of the document. + embedding VECTOR(3) +); +``` + +The expected output is as follows: + +```text +Query OK, 0 rows affected (0.27 sec) +``` + +### Step 3. Insert vector embeddings to the table + +Insert three documents with their [vector embeddings](/vector-search-overview.md#vector-embedding) into the `embedded_documents` table: + +```sql +INSERT INTO embedded_documents +VALUES + (1, 'dog', '[1,2,1]'), + (2, 'fish', '[1,2,4]'), + (3, 'tree', '[1,0,0]'); +``` + +The expected output is as follows: + +``` +Query OK, 3 rows affected (0.15 sec) +Records: 3 Duplicates: 0 Warnings: 0 +``` + +> **Note** +> +> This example simplifies the dimensions of the vector embeddings and uses only 3-dimensional vectors for demonstration purposes. +> +> In real-world applications, [embedding models](/vector-search-overview.md#embedding-model) often produce vector embeddings with hundreds or thousands of dimensions. + +### Step 4. Query the vector table + +To verify that the documents have been inserted correctly, query the `embedded_documents` table: + +```sql +SELECT * FROM embedded_documents; +``` + +The expected output is as follows: + +```sql ++----+----------+-----------+ +| id | document | embedding | ++----+----------+-----------+ +| 1 | dog | [1,2,1] | +| 2 | fish | [1,2,4] | +| 3 | tree | [1,0,0] | ++----+----------+-----------+ +3 rows in set (0.15 sec) +``` + +### Step 5. Perform a vector search query + +Similar to full-text search, users provide search terms to the application when using vector search. + +In this example, the search term is "a swimming animal", and its corresponding vector embedding is assumed to be `[1,2,3]`. In practical applications, you need to use an embedding model to convert the user's search term into a vector embedding. + +Execute the following SQL statement, and TiDB will identify the top three documents closest to `[1,2,3]` by calculating and sorting the cosine distances (`vec_cosine_distance`) between the vector embeddings in the table. + +```sql +SELECT id, document, vec_cosine_distance(embedding, '[1,2,3]') AS distance +FROM embedded_documents +ORDER BY distance +LIMIT 3; +``` + +The expected output is as follows: + +```plain ++----+----------+---------------------+ +| id | document | distance | ++----+----------+---------------------+ +| 2 | fish | 0.00853986601633272 | +| 1 | dog | 0.12712843905603044 | +| 3 | tree | 0.7327387580875756 | ++----+----------+---------------------+ +3 rows in set (0.15 sec) +``` + +The three terms in the search results are sorted by their respective distance from the queried vector: the smaller the distance, the more relevant the corresponding `document`. + +Therefore, according to the output, the swimming animal is most likely a fish, or a dog with a gift for swimming. 
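+
+For larger datasets, you can usually speed up this kind of query by adding a vector search index to the `embedding` column. The following statements are a sketch only; they assume that your cluster has TiFlash nodes available, which the vector search index requires. See [Vector Search Index](/vector-search-index.md) for details and restrictions:
+
+```sql
+-- Create a TiFlash replica for the table if it does not have one yet.
+ALTER TABLE embedded_documents SET TIFLASH REPLICA 1;
+
+-- Build an HNSW vector search index that uses the cosine distance function.
+ALTER TABLE embedded_documents ADD VECTOR INDEX idx_embedding ((VEC_COSINE_DISTANCE(embedding)));
+```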
+ +## See also + +- [Vector Data Types](/vector-search-data-types.md) +- [Vector Search Index](/vector-search-index.md) diff --git a/markdown-pages/en/tidb/master/vector-search-improve-performance.md b/markdown-pages/en/tidb/master/vector-search-improve-performance.md new file mode 100644 index 0000000..c354384 --- /dev/null +++ b/markdown-pages/en/tidb/master/vector-search-improve-performance.md @@ -0,0 +1,52 @@ +--- +title: Improve Vector Search Performance +summary: Learn best practices for improving the performance of TiDB Vector Search. +--- + +# Improve Vector Search Performance + +TiDB Vector Search enables you to perform Approximate Nearest Neighbor (ANN) queries that search for results similar to an image, document, or other input. To improve the query performance, review the following best practices. + + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** +> +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. + +## Add vector search index for vector columns + +The [vector search index](/vector-search-index.md) dramatically improves the performance of vector search queries, usually by 10x or more, with a trade-off of only a small decrease of recall rate. + +## Ensure vector indexes are fully built + +> **Note** +> +> This practice is only applicable to [TiDB Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-serverless) clusters. + +Vector indexes are built asynchronously. Until all vector data is indexed, vector search performance is suboptimal. To check the index build progress, see [View index build progress](https://docs.pingcap.com/tidbcloud/vector-search-index#view-index-build-progress). + +## Reduce vector dimensions or shorten embeddings + +The computational complexity of vector search indexing and queries increases significantly as the dimension of vectors grows, requiring more floating-point comparisons. + +To optimize performance, consider reducing vector dimensions whenever feasible. This usually needs switching to another embedding model. When switching models, you need to evaluate the impact of the model change on the accuracy of vector queries. + +Certain embedding models like OpenAI `text-embedding-3-large` support [shortening embeddings](https://openai.com/index/new-embedding-models-and-api-updates/), which removes some numbers from the end of vector sequences without losing the embedding's concept-representing properties. You can also use such an embedding model to reduce the vector dimensions. + +## Exclude vector columns from the results + +Vector embedding data is usually large and only used during the search process. By excluding vector columns from query results, you can greatly reduce the data transferred between the TiDB server and your SQL client, thereby improving query performance. + +To exclude vector columns, explicitly list the columns you want to retrieve in the `SELECT` clause, instead of using `SELECT *` to retrieve all columns. 
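+
+For example, the following sketch assumes a table `embedded_documents` with columns `id`, `document`, and a vector column `embedding`:
+
+```sql
+-- Avoid: `SELECT *` also returns the large `embedding` column for every row.
+SELECT * FROM embedded_documents
+ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]')
+LIMIT 10;
+
+-- Prefer: return only the columns you need, plus the distance if it is useful.
+SELECT id, document,
+       VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]') AS distance
+FROM embedded_documents
+ORDER BY distance
+LIMIT 10;
+```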
+ +## Warm up the index + +When accessing an index that has never been used or has not been accessed for a long time (cold access), TiDB needs to load the entire index from cloud storage or disk (instead of from memory). This process takes time and often results in higher query latency. Additionally, if there are no SQL queries for an extended period (for example, several hours), computing resources are reclaimed, causing subsequent access to become cold access. + +To avoid such query latency, warm up your index before actual workload by running similar vector search queries that hit the vector index. \ No newline at end of file diff --git a/markdown-pages/en/tidb/master/vector-search-index.md b/markdown-pages/en/tidb/master/vector-search-index.md new file mode 100644 index 0000000..6f298c3 --- /dev/null +++ b/markdown-pages/en/tidb/master/vector-search-index.md @@ -0,0 +1,303 @@ +--- +title: Vector Search Index +summary: Learn how to build and use the vector search index to accelerate K-Nearest neighbors (KNN) queries in TiDB. +--- + +# Vector Search Index + +K-nearest neighbors (KNN) search is the method for finding the K closest points to a given point in a vector space. The most straightforward approach to perform KNN search is a brute force search, which calculates the distance between the given vector and all other vectors in the space. This approach guarantees perfect accuracy, but it is usually too slow for real-world use. Therefore, approximate algorithms are commonly used in KNN search to enhance speed and efficiency. + +In TiDB, you can create and use vector search indexes for such approximate nearest neighbor (ANN) searches over columns with [vector data types](/vector-search-data-types.md). By using vector search indexes, vector search queries could be finished in milliseconds. + + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** +> +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. + +Currently, TiDB supports the [HNSW (Hierarchical Navigable Small World)](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world) vector search index algorithm. + +## Restrictions + +- TiFlash nodes must be deployed in your cluster in advance. +- Vector search indexes cannot be used as primary keys or unique indexes. +- Vector search indexes can only be created on a single vector column and cannot be combined with other columns (such as integers or strings) to form composite indexes. +- A distance function must be specified when creating and using vector search indexes. Currently, only cosine distance `VEC_COSINE_DISTANCE()` and L2 distance `VEC_L2_DISTANCE()` functions are supported. +- For the same column, creating multiple vector search indexes using the same distance function is not supported. +- Directly dropping columns with vector search indexes is not supported. You can drop such a column by first dropping the vector search index on that column and then dropping the column itself. +- Modifying the type of a column with a vector index is not supported. +- Setting vector search indexes as [invisible](/sql-statements/sql-statement-alter-index.md) is not supported. 
+- Building vector search indexes on TiFlash nodes with [encryption at rest](https://docs.pingcap.com/tidb/stable/encryption-at-rest) enabled is not supported.
+
+## Create the HNSW vector index
+
+[HNSW](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world) is one of the most popular vector indexing algorithms. The HNSW index provides good performance with relatively high accuracy (> 98% in typical cases).
+
+In TiDB, you can create an HNSW index for a column with a [vector data type](/vector-search-data-types.md) in either of the following ways:
+
+- When creating a table, use the following syntax to specify the vector column for the HNSW index:
+
+    ```sql
+    CREATE TABLE foo (
+        id INT PRIMARY KEY,
+        embedding VECTOR(5),
+        VECTOR INDEX idx_embedding ((VEC_COSINE_DISTANCE(embedding)))
+    );
+    ```
+
+- For an existing table that already contains a vector column, use the following syntax to create an HNSW index for the vector column:
+
+    ```sql
+    CREATE VECTOR INDEX idx_embedding ON foo ((VEC_COSINE_DISTANCE(embedding)));
+    ALTER TABLE foo ADD VECTOR INDEX idx_embedding ((VEC_COSINE_DISTANCE(embedding)));
+
+    -- You can also explicitly specify "USING HNSW" to build the vector search index.
+    CREATE VECTOR INDEX idx_embedding ON foo ((VEC_COSINE_DISTANCE(embedding))) USING HNSW;
+    ALTER TABLE foo ADD VECTOR INDEX idx_embedding ((VEC_COSINE_DISTANCE(embedding))) USING HNSW;
+    ```
+
+> **Note:**
+>
+> The vector search index feature relies on TiFlash replicas for tables.
+>
+> - If a vector search index is defined when a table is created, TiDB automatically creates a TiFlash replica for the table.
+> - If no vector search index is defined when a table is created, and the table currently does not have a TiFlash replica, you need to manually create a TiFlash replica before adding a vector search index to the table. For example: `ALTER TABLE table_name SET TIFLASH REPLICA 1;`.
+
+When creating an HNSW vector index, you need to specify the distance function for the vector:
+
+- Cosine Distance: `((VEC_COSINE_DISTANCE(embedding)))`
+- L2 Distance: `((VEC_L2_DISTANCE(embedding)))`
+
+The vector index can only be created for fixed-dimensional vector columns, such as a column defined as `VECTOR(3)`. It cannot be created for mixed-dimensional vector columns (such as a column defined as `VECTOR`) because vector distances can only be calculated between vectors with the same dimensions.
+
+For restrictions and limitations of vector search indexes, see [Restrictions](#restrictions).
+
+## Use the vector index
+
+The vector search index can be used in K-nearest neighbor search queries by using the `ORDER BY ... LIMIT` clause as follows:
+
+```sql
+SELECT *
+FROM foo
+ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3, 4, 5]')
+LIMIT 10;
+```
+
+To use the index in a vector search query, make sure that the `ORDER BY ... LIMIT` clause uses the same distance function as the one specified when creating the vector index.
+
+## Use the vector index with filters
+
+Queries that contain a pre-filter (using the `WHERE` clause) cannot utilize the vector index because they are not querying for K-Nearest neighbors according to the SQL semantics.
For example:
+
+```sql
+-- For the following query, the `WHERE` filter is performed before KNN, so the vector index cannot be used:
+
+SELECT * FROM vec_table
+WHERE category = "document"
+ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]')
+LIMIT 5;
+```
+
+To use the vector index with filters, consider the following workarounds:
+
+**Post-filter after vector search:** Query for the K-Nearest neighbors first, then filter out unwanted results:
+
+```sql
+-- For the following query, the `WHERE` filter is performed after KNN, so the vector index can be used:
+
+SELECT * FROM
+(
+  SELECT * FROM vec_table
+  ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]')
+  LIMIT 5
+) t
+WHERE category = "document";
+
+-- Note that this query might return fewer than 5 results if some are filtered out.
+```
+
+**Use table partitioning:** Queries within a table [partition](/partitioned-table.md) can fully utilize the vector index. This can be useful if you want to perform equality filters, as equality filters can be turned into accessing specified partitions.
+
+For example, suppose you want to find the closest documentation for a specific product version:
+
+```sql
+-- For the following query, the `WHERE` filter is performed before KNN, so the vector index cannot be used:
+SELECT * FROM docs
+WHERE ver = "v2.0"
+ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]')
+LIMIT 5;
+```
+
+Instead of writing a query using the `WHERE` clause, you can partition the table and then query within the partition using the [`PARTITION` keyword](/partitioned-table.md#partition-selection):
+
+```sql
+CREATE TABLE docs (
+    id INT,
+    ver VARCHAR(10),
+    doc TEXT,
+    embedding VECTOR(3),
+    VECTOR INDEX idx_embedding USING HNSW ((VEC_COSINE_DISTANCE(embedding)))
+) PARTITION BY LIST COLUMNS (ver) (
+    PARTITION p_v1_0 VALUES IN ('v1.0'),
+    PARTITION p_v1_1 VALUES IN ('v1.1'),
+    PARTITION p_v1_2 VALUES IN ('v1.2'),
+    PARTITION p_v2_0 VALUES IN ('v2.0')
+);
+
+SELECT * FROM docs
+PARTITION (p_v2_0)
+ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]')
+LIMIT 5;
+```
+
+For more information, see [Table Partitioning](/partitioned-table.md).
+
+## View index build progress
+
+After you insert a large volume of data, some of it might not be instantly persisted to TiFlash. For vector data that has already been persisted, the vector search index is built synchronously. For data that has not yet been persisted, the index will be built once the data is persisted. This process does not affect the accuracy and consistency of the data. You can still perform vector searches at any time and get complete results. However, performance will be suboptimal until vector indexes are fully built.
+ +To view the index build progress, you can query the `INFORMATION_SCHEMA.TIFLASH_INDEXES` table as follows: + +```sql +SELECT * FROM INFORMATION_SCHEMA.TIFLASH_INDEXES; ++---------------+------------+----------+-------------+---------------+-----------+----------+------------+---------------------+-------------------------+--------------------+------------------------+---------------+------------------+ +| TIDB_DATABASE | TIDB_TABLE | TABLE_ID | COLUMN_NAME | INDEX_NAME | COLUMN_ID | INDEX_ID | INDEX_KIND | ROWS_STABLE_INDEXED | ROWS_STABLE_NOT_INDEXED | ROWS_DELTA_INDEXED | ROWS_DELTA_NOT_INDEXED | ERROR_MESSAGE | TIFLASH_INSTANCE | ++---------------+------------+----------+-------------+---------------+-----------+----------+------------+---------------------+-------------------------+--------------------+------------------------+---------------+------------------+ +| test | tcff1d827 | 219 | col1fff | 0a452311 | 7 | 1 | HNSW | 29646 | 0 | 0 | 0 | | 127.0.0.1:3930 | +| test | foo | 717 | embedding | idx_embedding | 2 | 1 | HNSW | 0 | 0 | 0 | 3 | | 127.0.0.1:3930 | ++---------------+------------+----------+-------------+---------------+-----------+----------+------------+---------------------+-------------------------+--------------------+------------------------+---------------+------------------+ +``` + +- You can check the `ROWS_STABLE_INDEXED` and `ROWS_STABLE_NOT_INDEXED` columns for the index build progress. When `ROWS_STABLE_NOT_INDEXED` becomes 0, the index build is complete. + + As a reference, indexing a 500 MiB vector dataset might take up to 20 minutes. The indexer can run in parallel for multiple tables. Currently, adjusting the indexer priority or speed is not supported. + +- You can check the `ROWS_DELTA_NOT_INDEXED` column for the number of rows in the Delta layer. Data in the storage layer of TiFlash is stored in two layers: Delta layer and Stable layer. The Delta layer stores recently inserted or updated rows and is periodically merged into the Stable layer according to the write workload. This merge process is called Compaction. + + The Delta layer is always not indexed. To achieve optimal performance, you can force the merge of the Delta layer into the Stable layer so that all data can be indexed: + + ```sql + ALTER TABLE COMPACT; + ``` + + For more information, see [`ALTER TABLE ... COMPACT`](/sql-statements/sql-statement-alter-table-compact.md). + +In addition, you can monitor the execution progress of the DDL job by executing `ADMIN SHOW DDL JOBS;` and checking the `row count`. However, this method is not fully accurate, because the `row count` value is obtained from the `rows_stable_indexed` field in `TIFLASH_INDEXES`. You can use this approach as a reference for tracking the progress of indexing. + +## Check whether the vector index is used + +Use the [`EXPLAIN`](/sql-statements/sql-statement-explain.md) or [`EXPLAIN ANALYZE`](/sql-statements/sql-statement-explain-analyze.md) statement to check whether a query is using the vector index. When `annIndex:` is presented in the `operator info` column for the `TableFullScan` executor, it means this table scan is utilizing the vector index. + +**Example: the vector index is used** + +```sql +[tidb]> EXPLAIN SELECT * FROM vector_table_with_index +ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]') +LIMIT 10; ++-----+-------------------------------------------------------------------------------------+ +| ... 
| operator info | ++-----+-------------------------------------------------------------------------------------+ +| ... | ... | +| ... | Column#5, offset:0, count:10 | +| ... | ..., vec_cosine_distance(test.vector_table_with_index.embedding, [1,2,3])->Column#5 | +| ... | MppVersion: 1, data:ExchangeSender_16 | +| ... | ExchangeType: PassThrough | +| ... | ... | +| ... | Column#4, offset:0, count:10 | +| ... | ..., vec_cosine_distance(test.vector_table_with_index.embedding, [1,2,3])->Column#4 | +| ... | annIndex:COSINE(test.vector_table_with_index.embedding..[1,2,3], limit:10), ... | ++-----+-------------------------------------------------------------------------------------+ +9 rows in set (0.01 sec) +``` + +**Example: The vector index is not used because of not specifying a Top K** + +```sql +[tidb]> EXPLAIN SELECT * FROM vector_table_with_index + -> ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]'); ++--------------------------------+-----+--------------------------------------------------+ +| id | ... | operator info | ++--------------------------------+-----+--------------------------------------------------+ +| Projection_15 | ... | ... | +| └─Sort_4 | ... | Column#4 | +| └─Projection_16 | ... | ..., vec_cosine_distance(..., [1,2,3])->Column#4 | +| └─TableReader_14 | ... | MppVersion: 1, data:ExchangeSender_13 | +| └─ExchangeSender_13 | ... | ExchangeType: PassThrough | +| └─TableFullScan_12 | ... | keep order:false, stats:pseudo | ++--------------------------------+-----+--------------------------------------------------+ +6 rows in set, 1 warning (0.01 sec) +``` + +When the vector index cannot be used, a warning occurs in some cases to help you learn the cause: + +```sql +-- Using a wrong distance function: +[tidb]> EXPLAIN SELECT * FROM vector_table_with_index +ORDER BY VEC_L2_DISTANCE(embedding, '[1, 2, 3]') +LIMIT 10; + +[tidb]> SHOW WARNINGS; +ANN index not used: not ordering by COSINE distance + +-- Using a wrong order: +[tidb]> EXPLAIN SELECT * FROM vector_table_with_index +ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]') DESC +LIMIT 10; + +[tidb]> SHOW WARNINGS; +ANN index not used: index can be used only when ordering by vec_cosine_distance() in ASC order +``` + +## Analyze vector search performance + +To learn detailed information about how a vector index is used, you can execute the [`EXPLAIN ANALYZE`](/sql-statements/sql-statement-explain-analyze.md) statement and check the `execution info` column in the output: + +```sql +[tidb]> EXPLAIN ANALYZE SELECT * FROM vector_table_with_index +ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]') +LIMIT 10; ++-----+--------------------------------------------------------+-----+ +| | execution info | | ++-----+--------------------------------------------------------+-----+ +| ... | time:339.1ms, loops:2, RU:0.000000, Concurrency:OFF | ... | +| ... | time:339ms, loops:2 | ... | +| ... | time:339ms, loops:3, Concurrency:OFF | ... | +| ... | time:339ms, loops:3, cop_task: {...} | ... | +| ... | tiflash_task:{time:327.5ms, loops:1, threads:4} | ... | +| ... | tiflash_task:{time:327.5ms, loops:1, threads:4} | ... | +| ... | tiflash_task:{time:327.5ms, loops:1, threads:4} | ... | +| ... | tiflash_task:{time:327.5ms, loops:1, threads:4} | ... | +| ... | tiflash_task:{...}, vector_idx:{ | ... 
| +| | load:{total:68ms,from_s3:1,from_disk:0,from_cache:0},| | +| | search:{total:0ms,visited_nodes:2,discarded_nodes:0},| | +| | read:{vec_total:0ms,others_total:0ms}},...} | | ++-----+--------------------------------------------------------+-----+ +``` + +> **Note:** +> +> The execution information is internal. Fields and formats are subject to change without any notification. Do not rely on them. + +Explanation of some important fields: + +- `vector_index.load.total`: The total duration of loading index. This field could be larger than actual query time because multiple vector indexes may be loaded in parallel. +- `vector_index.load.from_s3`: Number of indexes loaded from S3. +- `vector_index.load.from_disk`: Number of indexes loaded from disk. The index was already downloaded from S3 previously. +- `vector_index.load.from_cache`: Number of indexes loaded from cache. The index was already downloaded from S3 previously. +- `vector_index.search.total`: The total duration of searching in the index. Large latency usually means the index is cold (never accessed before, or accessed long ago) so that there are heavy I/O operations when searching through the index. This field could be larger than actual query time because multiple vector indexes might be searched in parallel. +- `vector_index.search.discarded_nodes`: Number of vector rows visited but discarded during the search. These discarded vectors are not considered in the search result. Large values usually indicate that there are many stale rows caused by `UPDATE` or `DELETE` statements. + +See [`EXPLAIN`](/sql-statements/sql-statement-explain.md), [`EXPLAIN ANALYZE`](/sql-statements/sql-statement-explain-analyze.md), and [EXPLAIN Walkthrough](/explain-walkthrough.md) for interpreting the output. + +## See also + +- [Improve Vector Search Performance](/vector-search-improve-performance.md) +- [Vector Data Types](/vector-search-data-types.md) diff --git a/markdown-pages/en/tidb/master/vector-search-integrate-with-django-orm.md b/markdown-pages/en/tidb/master/vector-search-integrate-with-django-orm.md new file mode 100644 index 0000000..2ea4c21 --- /dev/null +++ b/markdown-pages/en/tidb/master/vector-search-integrate-with-django-orm.md @@ -0,0 +1,300 @@ +--- +title: Integrate TiDB Vector Search with Django ORM +summary: Learn how to integrate TiDB Vector Search with Django ORM to store embeddings and perform semantic search. +--- + +# Integrate TiDB Vector Search with Django ORM + +This tutorial walks you through how to use [Django](https://www.djangoproject.com/) ORM to interact with the [TiDB Vector Search](/vector-search-overview.md), store embeddings, and perform vector search queries. + + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** +> +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. + +## Prerequisites + +To complete this tutorial, you need: + +- [Python 3.8 or higher](https://www.python.org/downloads/) installed. +- [Git](https://git-scm.com/downloads) installed. +- A TiDB cluster. 
+ + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- Follow [Deploy a local test TiDB cluster](/quick-start-with-tidb.md#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](/production-deployment-using-tiup.md) to create a local cluster. +- Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. + + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- (Recommended) Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. +- Follow [Deploy a local test TiDB cluster](https://docs.pingcap.com/tidb/stable/quick-start-with-tidb#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](https://docs.pingcap.com/tidb/stable/production-deployment-using-tiup) to create a local cluster. + + + +## Run the sample app + +You can quickly learn about how to integrate TiDB Vector Search with Django ORM by following the steps below. + +### Step 1. Clone the repository + +Clone the `tidb-vector-python` repository to your local machine: + +```shell +git clone https://github.com/pingcap/tidb-vector-python.git +``` + +### Step 2. Create a virtual environment + +Create a virtual environment for your project: + +```bash +cd tidb-vector-python/examples/orm-django-quickstart +python3 -m venv .venv +source .venv/bin/activate +``` + +### Step 3. Install required dependencies + +Install the required dependencies for the demo project: + +```bash +pip install -r requirements.txt +``` + +Alternatively, you can install the following packages for your project: + +```bash +pip install Django django-tidb mysqlclient numpy python-dotenv +``` + +If you encounter installation issues with mysqlclient, refer to the mysqlclient official documentation. + +#### What is `django-tidb` + +`django-tidb` is a TiDB dialect for Django, which enhances the Django ORM to support TiDB-specific features (for example, Vector Search) and resolves compatibility issues between TiDB and Django. + +To install `django-tidb`, choose a version that matches your Django version. For example, if you are using `django==4.2.*`, install `django-tidb==4.2.*`. The minor version does not need to be the same. It is recommended to use the latest minor version. + +For more information, refer to [django-tidb repository](https://github.com/pingcap/django-tidb). + +### Step 4. Configure the environment variables + +Configure the environment variables depending on the TiDB deployment option you've selected. + + +
+ +For a TiDB Cloud Serverless cluster, take the following steps to obtain the cluster connection string and configure environment variables: + +1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. + +2. Click **Connect** in the upper-right corner. A connection dialog is displayed. + +3. Ensure the configurations in the connection dialog match your operating environment. + + - **Connection Type** is set to `Public` + - **Branch** is set to `main` + - **Connect With** is set to `General` + - **Operating System** matches your environment. + + > **Tip:** + > + > If your program is running in Windows Subsystem for Linux (WSL), switch to the corresponding Linux distribution. + +4. Copy the connection parameters from the connection dialog. + + > **Tip:** + > + > If you have not set a password yet, click **Generate Password** to generate a random password. + +5. In the root directory of your Python project, create a `.env` file and paste the connection parameters to the corresponding environment variables. + + - `TIDB_HOST`: The host of the TiDB cluster. + - `TIDB_PORT`: The port of the TiDB cluster. + - `TIDB_USERNAME`: The username to connect to the TiDB cluster. + - `TIDB_PASSWORD`: The password to connect to the TiDB cluster. + - `TIDB_DATABASE`: The database name to connect to. + - `TIDB_CA_PATH`: The path to the root certificate file. + + The following is an example for macOS: + + ```dotenv + TIDB_HOST=gateway01.****.prod.aws.tidbcloud.com + TIDB_PORT=4000 + TIDB_USERNAME=********.root + TIDB_PASSWORD=******** + TIDB_DATABASE=test + TIDB_CA_PATH=/etc/ssl/cert.pem + ``` + +
+
+ +For a TiDB Self-Managed cluster, create a `.env` file in the root directory of your Python project. Copy the following content into the `.env` file, and modify the environment variable values according to the connection parameters of your TiDB cluster: + +```dotenv +TIDB_HOST=127.0.0.1 +TIDB_PORT=4000 +TIDB_USERNAME=root +TIDB_PASSWORD= +TIDB_DATABASE=test +``` + +If you are running TiDB on your local machine, `TIDB_HOST` is `127.0.0.1` by default. The initial `TIDB_PASSWORD` is empty, so if you are starting the cluster for the first time, you can omit this field. + +The following are descriptions for each parameter: + +- `TIDB_HOST`: The host of the TiDB cluster. +- `TIDB_PORT`: The port of the TiDB cluster. +- `TIDB_USERNAME`: The username to connect to the TiDB cluster. +- `TIDB_PASSWORD`: The password to connect to the TiDB cluster. +- `TIDB_DATABASE`: The name of the database you want to connect to. + +
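+Optionally, before running the demo, you can verify that these connection parameters work by connecting with the MySQL client. This is a quick sanity check rather than part of the sample app, and the host, port, and user values below are the defaults from the preceding `.env` example:
+
+```bash
+# Confirm that the TiDB cluster is reachable with the configured parameters.
+mysql -h 127.0.0.1 -P 4000 -u root test -e "SELECT VERSION();"
+```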
+ +
+ +### Step 5. Run the demo + +Migrate the database schema: + +```bash +python manage.py migrate +``` + +Run the Django development server: + +```bash +python manage.py runserver +``` + +Open your browser and visit `http://127.0.0.1:8000` to try the demo application. Here are the available API paths: + +| API Path | Description | +| --------------------------------------- | ---------------------------------------- | +| `POST: /insert_documents` | Insert documents with embeddings. | +| `GET: /get_nearest_neighbors_documents` | Get the 3-nearest neighbor documents. | +| `GET: /get_documents_within_distance` | Get documents within a certain distance. | + +## Sample code snippets + +You can refer to the following sample code snippets to complete your own application development. + +### Connect to the TiDB cluster + +In the file `sample_project/settings.py`, add the following configurations: + +```python +dotenv.load_dotenv() + +DATABASES = { + "default": { + # https://github.com/pingcap/django-tidb + "ENGINE": "django_tidb", + "HOST": os.environ.get("TIDB_HOST", "127.0.0.1"), + "PORT": int(os.environ.get("TIDB_PORT", 4000)), + "USER": os.environ.get("TIDB_USERNAME", "root"), + "PASSWORD": os.environ.get("TIDB_PASSWORD", ""), + "NAME": os.environ.get("TIDB_DATABASE", "test"), + "OPTIONS": { + "charset": "utf8mb4", + }, + } +} + +TIDB_CA_PATH = os.environ.get("TIDB_CA_PATH", "") +if TIDB_CA_PATH: + DATABASES["default"]["OPTIONS"]["ssl_mode"] = "VERIFY_IDENTITY" + DATABASES["default"]["OPTIONS"]["ssl"] = { + "ca": TIDB_CA_PATH, + } +``` + +You can create a `.env` file in the root directory of your project and set up the environment variables `TIDB_HOST`, `TIDB_PORT`, `TIDB_USERNAME`, `TIDB_PASSWORD`, `TIDB_DATABASE`, and `TIDB_CA_PATH` with the actual values of your TiDB cluster. + +### Create vector tables + +#### Define a vector column + +`tidb-django` provides a `VectorField` to store vector embeddings in a table. + +Create a table with a column named `embedding` that stores a 3-dimensional vector. + +```python +class Document(models.Model): + content = models.TextField() + embedding = VectorField(dimensions=3) +``` + +#### Define a vector column optimized with index + +> **Note** +> +> This section is only applicable to [TiDB Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-serverless) clusters. + +Define a 3-dimensional vector column and optimize it with a [vector search index](/vector-search-index.md) (HNSW index). + +```python +class DocumentWithIndex(models.Model): + content = models.TextField() + # Note: + # - Using comment to add hnsw index is a temporary solution. In the future it will use `CREATE INDEX` syntax. + # - Currently the HNSW index cannot be changed after the table has been created. + # - Only Django >= 4.2 supports `db_comment`. + embedding = VectorField(dimensions=3, db_comment="hnsw(distance=cosine)") +``` + +TiDB will use this index to speed up vector search queries based on the cosine distance function. 
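+
+After you define or change these models, you typically generate and apply Django migrations so that the corresponding tables (including the vector columns described above) are created in TiDB. The following commands are the standard Django migration workflow and assume that the app containing these models is listed in `INSTALLED_APPS`:
+
+```bash
+# Generate migration files from the model definitions, then apply them to TiDB.
+python manage.py makemigrations
+python manage.py migrate
+```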
+ +### Store documents with embeddings + +```python +Document.objects.create(content="dog", embedding=[1, 2, 1]) +Document.objects.create(content="fish", embedding=[1, 2, 4]) +Document.objects.create(content="tree", embedding=[1, 0, 0]) +``` + +### Search the nearest neighbor documents + +TiDB Vector support the following distance functions: + +- `L1Distance` +- `L2Distance` +- `CosineDistance` +- `NegativeInnerProduct` + +Search for the top-3 documents that are semantically closest to the query vector `[1, 2, 3]` based on the cosine distance function. + +```python +results = Document.objects.annotate( + distance=CosineDistance('embedding', [1, 2, 3]) +).order_by('distance')[:3] +``` + +### Search documents within a certain distance + +Search for the documents whose cosine distance from the query vector `[1, 2, 3]` is less than 0.2. + +```python +results = Document.objects.annotate( + distance=CosineDistance('embedding', [1, 2, 3]) +).filter(distance__lt=0.2).order_by('distance')[:3] +``` + +## See also + +- [Vector Data Types](/vector-search-data-types.md) +- [Vector Search Index](/vector-search-index.md) diff --git a/markdown-pages/en/tidb/master/vector-search-integrate-with-jinaai-embedding.md b/markdown-pages/en/tidb/master/vector-search-integrate-with-jinaai-embedding.md new file mode 100644 index 0000000..6f7dd56 --- /dev/null +++ b/markdown-pages/en/tidb/master/vector-search-integrate-with-jinaai-embedding.md @@ -0,0 +1,295 @@ +--- +title: Integrate TiDB Vector Search with Jina AI Embeddings API +summary: Learn how to integrate TiDB Vector Search with Jina AI Embeddings API to store embeddings and perform semantic search. +--- + +# Integrate TiDB Vector Search with Jina AI Embeddings API + +This tutorial walks you through how to use [Jina AI](https://jina.ai/) to generate embeddings for text data, and then store the embeddings in TiDB vector storage and search similar texts based on embeddings. + + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** +> +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. + +## Prerequisites + +To complete this tutorial, you need: + +- [Python 3.8 or higher](https://www.python.org/downloads/) installed. +- [Git](https://git-scm.com/downloads) installed. +- A TiDB cluster. + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- Follow [Deploy a local test TiDB cluster](/quick-start-with-tidb.md#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](/production-deployment-using-tiup.md) to create a local cluster. +- Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. + + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- (Recommended) Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. 
+- Follow [Deploy a local test TiDB cluster](https://docs.pingcap.com/tidb/stable/quick-start-with-tidb#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](https://docs.pingcap.com/tidb/stable/production-deployment-using-tiup) to create a local cluster. + + + +## Run the sample app + +You can quickly learn about how to integrate TiDB Vector Search with JinaAI Embedding by following the steps below. + +### Step 1. Clone the repository + +Clone the `tidb-vector-python` repository to your local machine: + +```shell +git clone https://github.com/pingcap/tidb-vector-python.git +``` + +### Step 2. Create a virtual environment + +Create a virtual environment for your project: + +```bash +cd tidb-vector-python/examples/jina-ai-embeddings-demo +python3 -m venv .venv +source .venv/bin/activate +``` + +### Step 3. Install required dependencies + +Install the required dependencies for the demo project: + +```bash +pip install -r requirements.txt +``` + +### Step 4. Configure the environment variables + +Get the Jina AI API key from the [Jina AI Embeddings API](https://jina.ai/embeddings/) page, and then configure the environment variables depending on the TiDB deployment option you've selected. + + +
+ +For a TiDB Cloud Serverless cluster, take the following steps to obtain the cluster connection string and configure environment variables: + +1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. + +2. Click **Connect** in the upper-right corner. A connection dialog is displayed. + +3. Ensure the configurations in the connection dialog match your operating environment. + + - **Connection Type** is set to `Public` + - **Branch** is set to `main` + - **Connect With** is set to `SQLAlchemy` + - **Operating System** matches your environment. + + > **Tip:** + > + > If your program is running in Windows Subsystem for Linux (WSL), switch to the corresponding Linux distribution. + +4. Switch to the **PyMySQL** tab and click the **Copy** icon to copy the connection string. + + > **Tip:** + > + > If you have not set a password yet, click **Create password** to generate a random password. + +5. Set the Jina AI API key and the TiDB connection string as environment variables in your terminal, or create a `.env` file with the following environment variables: + + ```dotenv + JINAAI_API_KEY="****" + TIDB_DATABASE_URL="{tidb_connection_string}" + ``` + + The following is an example connection string for macOS: + + ```dotenv + TIDB_DATABASE_URL="mysql+pymysql://.root:@gateway01..prod.aws.tidbcloud.com:4000/test?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true" + ``` + +
+
+
+For a TiDB Self-Managed cluster, set the environment variables for connecting to your TiDB cluster in your terminal as follows:
+
+```shell
+export JINAAI_API_KEY="****"
+export TIDB_DATABASE_URL="mysql+pymysql://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>"
+# For example: export TIDB_DATABASE_URL="mysql+pymysql://root@127.0.0.1:4000/test"
+```
+
+You need to replace the parameters in the preceding command according to your TiDB cluster. If you are running TiDB on your local machine, `<HOST>` is `127.0.0.1` by default. The initial `<PASSWORD>` is empty, so if you are starting the cluster for the first time, you can omit this field.
+
+The following are descriptions for each parameter:
+
+- `<USERNAME>`: The username to connect to the TiDB cluster.
+- `<PASSWORD>`: The password to connect to the TiDB cluster.
+- `<HOST>`: The host of the TiDB cluster.
+- `<PORT>`: The port of the TiDB cluster.
+- `<DATABASE>`: The name of the database you want to connect to.
+
+
+ +
+
+### Step 5. Run the demo
+
+```bash
+python jina-ai-embeddings-demo.py
+```
+
+Example output:
+
+```text
+- Inserting Data to TiDB...
+  - Inserting: Jina AI offers best-in-class embeddings, reranker and prompt optimizer, enabling advanced multimodal AI.
+  - Inserting: TiDB is an open-source MySQL-compatible database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads.
+- List All Documents and Their Distances to the Query:
+  - distance: 0.3585317326132522
+    content: Jina AI offers best-in-class embeddings, reranker and prompt optimizer, enabling advanced multimodal AI.
+  - distance: 0.10858102967720984
+    content: TiDB is an open-source MySQL-compatible database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads.
+- The Most Relevant Document and Its Distance to the Query:
+  - distance: 0.10858102967720984
+    content: TiDB is an open-source MySQL-compatible database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads.
+```
+
+## Sample code snippets
+
+### Get embeddings from Jina AI
+
+Define a `generate_embeddings` helper function to call the Jina AI embeddings API:
+
+```python
+import os
+import requests
+import dotenv
+
+dotenv.load_dotenv()
+
+JINAAI_API_KEY = os.getenv('JINAAI_API_KEY')
+
+def generate_embeddings(text: str):
+    JINAAI_API_URL = 'https://api.jina.ai/v1/embeddings'
+    JINAAI_HEADERS = {
+        'Content-Type': 'application/json',
+        'Authorization': f'Bearer {JINAAI_API_KEY}'
+    }
+    JINAAI_REQUEST_DATA = {
+        'input': [text],
+        'model': 'jina-embeddings-v2-base-en'  # with dimension 768.
+    }
+    response = requests.post(JINAAI_API_URL, headers=JINAAI_HEADERS, json=JINAAI_REQUEST_DATA)
+    return response.json()['data'][0]['embedding']
+```
+
+### Connect to the TiDB cluster
+
+Connect to the TiDB cluster through SQLAlchemy:
+
+```python
+import os
+import dotenv
+
+from sqlalchemy import create_engine
+from tidb_vector.sqlalchemy import VectorType
+from sqlalchemy.orm import Session, declarative_base
+
+dotenv.load_dotenv()
+
+TIDB_DATABASE_URL = os.getenv('TIDB_DATABASE_URL')
+assert TIDB_DATABASE_URL is not None
+engine = create_engine(url=TIDB_DATABASE_URL, pool_recycle=300)
+```
+
+### Define the vector table schema
+
+Create a table named `jinaai_tidb_demo_documents` with a `content` column for storing texts and a vector column named `content_vec` for storing embeddings:
+
+```python
+from sqlalchemy import Column, Integer, String, create_engine
+from sqlalchemy.orm import declarative_base
+
+Base = declarative_base()
+
+class Document(Base):
+    __tablename__ = "jinaai_tidb_demo_documents"
+
+    id = Column(Integer, primary_key=True)
+    content = Column(String(255), nullable=False)
+    content_vec = Column(
+        # DIMENSIONS is determined by the embedding model,
+        # for Jina AI's jina-embeddings-v2-base-en model it's 768.
+        VectorType(dim=768),
+        comment="hnsw(distance=cosine)"
+    )
+```
+
+> **Note:**
+>
+> - The dimension of the vector column must match the dimension of the embeddings generated by the embedding model.
+> - In this example, the dimension of embeddings generated by the `jina-embeddings-v2-base-en` model is `768`.
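+
+Before inserting any rows, make sure the table exists in your TiDB database. A minimal way to create it from the model definition, assuming the `Base` and `engine` objects defined above, is the standard SQLAlchemy call:
+
+```python
+# Create the jinaai_tidb_demo_documents table if it does not already exist.
+Base.metadata.create_all(engine)
+```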
+ +### Create embeddings with Jina AI and store in TiDB + +Use the Jina AI Embeddings API to generate embeddings for each piece of text and store the embeddings in TiDB: + +```python +TEXTS = [ + 'Jina AI offers best-in-class embeddings, reranker and prompt optimizer, enabling advanced multimodal AI.', + 'TiDB is an open-source MySQL-compatible database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads.', +] +data = [] + +for text in TEXTS: + # Generate embeddings for the texts via Jina AI API. + embedding = generate_embeddings(text) + data.append({ + 'text': text, + 'embedding': embedding + }) + +with Session(engine) as session: + print('- Inserting Data to TiDB...') + for item in data: + print(f' - Inserting: {item["text"]}') + session.add(Document( + content=item['text'], + content_vec=item['embedding'] + )) + session.commit() +``` + +### Perform semantic search with Jina AI embeddings in TiDB + +Generate the embedding for the query text via Jina AI embeddings API, and then search for the most relevant document based on the cosine distance between **the embedding of the query text** and **each embedding in the vector table**: + +```python +query = 'What is TiDB?' +# Generate the embedding for the query via Jina AI API. +query_embedding = generate_embeddings(query) + +with Session(engine) as session: + print('- The Most Relevant Document and Its Distance to the Query:') + doc, distance = session.query( + Document, + Document.content_vec.cosine_distance(query_embedding).label('distance') + ).order_by( + 'distance' + ).limit(1).first() + print(f' - distance: {distance}\n' + f' content: {doc.content}') +``` + +## See also + +- [Vector Data Types](/vector-search-data-types.md) +- [Vector Search Index](/vector-search-index.md) diff --git a/markdown-pages/en/tidb/master/vector-search-integrate-with-langchain.md b/markdown-pages/en/tidb/master/vector-search-integrate-with-langchain.md new file mode 100644 index 0000000..9bc656c --- /dev/null +++ b/markdown-pages/en/tidb/master/vector-search-integrate-with-langchain.md @@ -0,0 +1,655 @@ +--- +title: Integrate Vector Search with LangChain +summary: Learn how to integrate Vector Search in TiDB Cloud with LangChain. +--- + +# Integrate Vector Search with LangChain + +This tutorial demonstrates how to integrate the [vector search](/vector-search-overview.md) feature in TiDB Cloud with [LangChain](https://python.langchain.com/). + + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** +> +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. + +> **Tip** +> +> You can view the complete [sample code](https://github.com/langchain-ai/langchain/blob/master/docs/docs/integrations/vectorstores/tidb_vector.ipynb) on Jupyter Notebook, or run the sample code directly in the [Colab](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/integrations/vectorstores/tidb_vector.ipynb) online environment. + +## Prerequisites + +To complete this tutorial, you need: + +- [Python 3.8 or higher](https://www.python.org/downloads/) installed. +- [Jupyter Notebook](https://jupyter.org/install) installed. 
+- [Git](https://git-scm.com/downloads) installed. +- A TiDB cluster. + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- Follow [Deploy a local test TiDB cluster](/quick-start-with-tidb.md#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](/production-deployment-using-tiup.md) to create a local cluster. +- Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. + + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- (Recommended) Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. +- Follow [Deploy a local test TiDB cluster](https://docs.pingcap.com/tidb/stable/quick-start-with-tidb#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](https://docs.pingcap.com/tidb/stable/production-deployment-using-tiup) to create a local cluster. + + + +## Get started + +This section provides step-by-step instructions for integrating TiDB Vector Search with LangChain to perform semantic searches. + +### Step 1. Create a new Jupyter Notebook file + +In your preferred directory, create a new Jupyter Notebook file named `integrate_with_langchain.ipynb`: + +```shell +touch integrate_with_langchain.ipynb +``` + +### Step 2. Install required dependencies + +In your project directory, run the following command to install the required packages: + +```shell +!pip install langchain langchain-community +!pip install langchain-openai +!pip install pymysql +!pip install tidb-vector +``` + +Open the `integrate_with_langchain.ipynb` file in Jupyter Notebook, and then add the following code to import the required packages: + +```python +from langchain_community.document_loaders import TextLoader +from langchain_community.vectorstores import TiDBVectorStore +from langchain_openai import OpenAIEmbeddings +from langchain_text_splitters import CharacterTextSplitter +``` + +### Step 3. Set up your environment + +Configure the environment variables depending on the TiDB deployment option you've selected. + + +
+ +For a TiDB Cloud Serverless cluster, take the following steps to obtain the cluster connection string and configure environment variables: + +1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. + +2. Click **Connect** in the upper-right corner. A connection dialog is displayed. + +3. Ensure the configurations in the connection dialog match your operating environment. + + - **Connection Type** is set to `Public`. + - **Branch** is set to `main`. + - **Connect With** is set to `SQLAlchemy`. + - **Operating System** matches your environment. + +4. Click the **PyMySQL** tab and copy the connection string. + + > **Tip:** + > + > If you have not set a password yet, click **Generate Password** to generate a random password. + +5. Configure environment variables. + + This document uses [OpenAI](https://platform.openai.com/docs/introduction) as the embedding model provider. In this step, you need to provide the connection string obtained from the previous step and your [OpenAI API key](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key). + + To configure the environment variables, run the following code. You will be prompted to enter your connection string and OpenAI API key: + + ```python + # Use getpass to securely prompt for environment variables in your terminal. + import getpass + import os + + # Copy your connection string from the TiDB Cloud console. + # Connection string format: "mysql+pymysql://:@:4000/?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true" + tidb_connection_string = getpass.getpass("TiDB Connection String:") + os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:") + ``` + +
+
+
+This document uses [OpenAI](https://platform.openai.com/docs/introduction) as the embedding model provider. In this step, you need to provide the connection string of your TiDB cluster and your [OpenAI API key](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key).
+
+To configure the environment variables, run the following code. You will be prompted to enter your connection string and OpenAI API key:
+
+```python
+# Use getpass to securely prompt for environment variables in your terminal.
+import getpass
+import os
+
+# Connection string format: "mysql+pymysql://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>"
+tidb_connection_string = getpass.getpass("TiDB Connection String:")
+os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
+```
+
+The connection string is in the following format:
+
+```dotenv
+TIDB_DATABASE_URL="mysql+pymysql://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>"
+# For example: TIDB_DATABASE_URL="mysql+pymysql://root@127.0.0.1:4000/test"
+```
+
+You need to modify the values of the connection parameters according to your TiDB cluster. If you are running TiDB on your local machine, `<HOST>` is `127.0.0.1` by default. The initial `<PASSWORD>` is empty, so if you are starting the cluster for the first time, you can omit this field.
+
+The following are descriptions for each parameter:
+
+- `<USERNAME>`: The username to connect to the TiDB cluster.
+- `<PASSWORD>`: The password to connect to the TiDB cluster.
+- `<HOST>`: The host of the TiDB cluster.
+- `<PORT>`: The port of the TiDB cluster.
+- `<DATABASE>`: The name of the database you want to connect to.
+
+
+ +
+ +### Step 4. Load the sample document + +#### Step 4.1 Download the sample document + +In your project directory, create a directory named `data/how_to/` and download the sample document [`state_of_the_union.txt`](https://github.com/langchain-ai/langchain/blob/master/docs/docs/how_to/state_of_the_union.txt) from the [langchain-ai/langchain](https://github.com/langchain-ai/langchain) GitHub repository. + +```shell +!mkdir -p 'data/how_to/' +!wget 'https://raw.githubusercontent.com/langchain-ai/langchain/master/docs/docs/how_to/state_of_the_union.txt' -O 'data/how_to/state_of_the_union.txt' +``` + +#### Step 4.2 Load and split the document + +Load the sample document from `data/how_to/state_of_the_union.txt` and split it into chunks of approximately 1,000 characters each using a `CharacterTextSplitter`. + +```python +loader = TextLoader("data/how_to/state_of_the_union.txt") +documents = loader.load() +text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0) +docs = text_splitter.split_documents(documents) +``` + +### Step 5. Embed and store document vectors + +TiDB vector store supports both cosine distance (`consine`) and Euclidean distance (`l2`) for measuring similarity between vectors. The default strategy is cosine distance. + +The following code creates a table named `embedded_documents` in TiDB, which is optimized for vector search. + +```python +embeddings = OpenAIEmbeddings() +vector_store = TiDBVectorStore.from_documents( + documents=docs, + embedding=embeddings, + table_name="embedded_documents", + connection_string=tidb_connection_string, + distance_strategy="cosine", # default, another option is "l2" +) +``` + +Upon successful execution, you can directly view and access the `embedded_documents` table in your TiDB database. + +### Step 6. Perform a vector search + +This step demonstrates how to query "What did the president say about Ketanji Brown Jackson" from the document `state_of_the_union.txt`. + +```python +query = "What did the president say about Ketanji Brown Jackson" +``` + +#### Option 1: Use `similarity_search_with_score()` + +The `similarity_search_with_score()` method calculates the vector space distance between the documents and the query. This distance serves as a similarity score, determined by the chosen `distance_strategy`. The method returns the top `k` documents with the lowest scores. A lower score indicates a higher similarity between a document and your query. + +```python +docs_with_score = vector_store.similarity_search_with_score(query, k=3) +for doc, score in docs_with_score: + print("-" * 80) + print("Score: ", score) + print(doc.page_content) + print("-" * 80) +``` + +
+ Expected output + +```plain +-------------------------------------------------------------------------------- +Score: 0.18472413652518527 +Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. + +Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. + +One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. + +And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. +-------------------------------------------------------------------------------- +-------------------------------------------------------------------------------- +Score: 0.21757513022785557 +A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. + +And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. + +We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling. + +We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers. + +We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. + +We’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders. +-------------------------------------------------------------------------------- +-------------------------------------------------------------------------------- +Score: 0.22676987253721725 +And for our LGBTQ+ Americans, let’s finally get the bipartisan Equality Act to my desk. The onslaught of state laws targeting transgender Americans and their families is wrong. + +As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. + +While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice. + +And soon, we’ll strengthen the Violence Against Women Act that I first wrote three decades ago. It is important for us to show the nation that we can come together and do big things. + +So tonight I’m offering a Unity Agenda for the Nation. Four big things we can do together. + +First, beat the opioid epidemic. +-------------------------------------------------------------------------------- +``` + +
+ +#### Option 2: Use `similarity_search_with_relevance_scores()` + +The `similarity_search_with_relevance_scores()` method returns the top `k` documents with the highest relevance scores. A higher score indicates a higher degree of similarity between a document and your query. + +```python +docs_with_relevance_score = vector_store.similarity_search_with_relevance_scores(query, k=2) +for doc, score in docs_with_relevance_score: + print("-" * 80) + print("Score: ", score) + print(doc.page_content) + print("-" * 80) +``` + +
+ Expected output + +```plain +-------------------------------------------------------------------------------- +Score: 0.8152758634748147 +Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. + +Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. + +One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. + +And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. +-------------------------------------------------------------------------------- +-------------------------------------------------------------------------------- +Score: 0.7824248697721444 +A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. + +And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. + +We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling. + +We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers. + +We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. + +We’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders. +-------------------------------------------------------------------------------- +``` + +
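+Comparing the two outputs above, with the cosine strategy the relevance score in this example is simply one minus the distance returned by `similarity_search_with_score()` (for example, 1 - 0.18472 ≈ 0.81528), so both methods rank the documents identically; only the direction of the score is reversed.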
+ +### Use as a retriever + +In Langchain, a [retriever](https://python.langchain.com/v0.2/docs/concepts/#retrievers) is an interface that retrieves documents in response to an unstructured query, providing more functionality than a vector store. The following code demonstrates how to use TiDB vector store as a retriever. + +```python +retriever = vector_store.as_retriever( + search_type="similarity_score_threshold", + search_kwargs={"k": 3, "score_threshold": 0.8}, +) +docs_retrieved = retriever.invoke(query) +for doc in docs_retrieved: + print("-" * 80) + print(doc.page_content) + print("-" * 80) +``` + +The expected output is as follows: + +``` +-------------------------------------------------------------------------------- +Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. + +Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. + +One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. + +And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. +-------------------------------------------------------------------------------- +``` + +### Remove the vector store + +To remove an existing TiDB vector store, use the `drop_vectorstore()` method: + +```python +vector_store.drop_vectorstore() +``` + +## Search with metadata filters + +To refine your searches, you can use metadata filters to retrieve specific nearest-neighbor results that match the applied filters. + +### Supported metadata types + +Each document in the TiDB vector store can be paired with metadata, structured as key-value pairs within a JSON object. Keys are always strings, while values can be any of the following types: + +- String +- Number: integer or floating point +- Boolean: `true` or `false` + +For example, the following is a valid metadata payload: + +```json +{ + "page": 12, + "book_title": "Siddhartha" +} +``` + +### Metadata filter syntax + +Available filters include the following: + +- `$or`: Selects vectors that match any one of the specified conditions. +- `$and`: Selects vectors that match all the specified conditions. +- `$eq`: Equal to the specified value. +- `$ne`: Not equal to the specified value. +- `$gt`: Greater than the specified value. +- `$gte`: Greater than or equal to the specified value. +- `$lt`: Less than the specified value. +- `$lte`: Less than or equal to the specified value. +- `$in`: In the specified array of values. +- `$nin`: Not in the specified array of values. 
+ +If the metadata of a document is as follows: + +```json +{ + "page": 12, + "book_title": "Siddhartha" +} +``` + +The following metadata filters can match this document: + +```json +{ "page": 12 } +``` + +```json +{ "page": { "$eq": 12 } } +``` + +```json +{ + "page": { + "$in": [11, 12, 13] + } +} +``` + +```json +{ "page": { "$nin": [13] } } +``` + +```json +{ "page": { "$lt": 11 } } +``` + +```json +{ + "$or": [{ "page": 11 }, { "page": 12 }], + "$and": [{ "page": 12 }, { "page": 13 }] +} +``` + +In a metadata filter, TiDB treats each key-value pair as a separate filter clause and combines these clauses using the `AND` logical operator. + +### Example + +The following example adds two documents to `TiDBVectorStore` and adds a `title` field to each document as the metadata: + +```python +vector_store.add_texts( + texts=[ + "TiDB Vector offers advanced, high-speed vector processing capabilities, enhancing AI workflows with efficient data handling and analytics support.", + "TiDB Vector, starting as low as $10 per month for basic usage", + ], + metadatas=[ + {"title": "TiDB Vector functionality"}, + {"title": "TiDB Vector Pricing"}, + ], +) +``` + +The expected output is as follows: + +```plain +[UUID('c782cb02-8eec-45be-a31f-fdb78914f0a7'), + UUID('08dcd2ba-9f16-4f29-a9b7-18141f8edae3')] +``` + +Perform a similarity search with metadata filters: + +```python +docs_with_score = vector_store.similarity_search_with_score( + "Introduction to TiDB Vector", filter={"title": "TiDB Vector functionality"}, k=4 +) +for doc, score in docs_with_score: + print("-" * 80) + print("Score: ", score) + print(doc.page_content) + print("-" * 80) +``` + +The expected output is as follows: + +```plain +-------------------------------------------------------------------------------- +Score: 0.12761409169211535 +TiDB Vector offers advanced, high-speed vector processing capabilities, enhancing AI workflows with efficient data handling and analytics support. +-------------------------------------------------------------------------------- +``` + +## Advanced usage example: travel agent + +This section demonstrates a use case of integrating vector search with Langchain for a travel agent. The goal is to create personalized travel reports for clients, helping them find airports with specific amenities, such as clean lounges and vegetarian options. + +The process involves two main steps: + +1. Perform a semantic search across airport reviews to identify airport codes that match the desired amenities. +2. Execute a SQL query to merge these codes with route information, highlighting airlines and destinations that align with user's preferences. + +### Prepare data + +First, create a table to store airport route data: + +```python +# Create a table to store flight plan data. +vector_store.tidb_vector_client.execute( + """CREATE TABLE airplan_routes ( + id INT AUTO_INCREMENT PRIMARY KEY, + airport_code VARCHAR(10), + airline_code VARCHAR(10), + destination_code VARCHAR(10), + route_details TEXT, + duration TIME, + frequency INT, + airplane_type VARCHAR(50), + price DECIMAL(10, 2), + layover TEXT + );""" +) + +# Insert some sample data into airplan_routes and the vector table. 
+vector_store.tidb_vector_client.execute( + """INSERT INTO airplan_routes ( + airport_code, + airline_code, + destination_code, + route_details, + duration, + frequency, + airplane_type, + price, + layover + ) VALUES + ('JFK', 'DL', 'LAX', 'Non-stop from JFK to LAX.', '06:00:00', 5, 'Boeing 777', 299.99, 'None'), + ('LAX', 'AA', 'ORD', 'Direct LAX to ORD route.', '04:00:00', 3, 'Airbus A320', 149.99, 'None'), + ('EFGH', 'UA', 'SEA', 'Daily flights from SFO to SEA.', '02:30:00', 7, 'Boeing 737', 129.99, 'None'); + """ +) +vector_store.add_texts( + texts=[ + "Clean lounges and excellent vegetarian dining options. Highly recommended.", + "Comfortable seating in lounge areas and diverse food selections, including vegetarian.", + "Small airport with basic facilities.", + ], + metadatas=[ + {"airport_code": "JFK"}, + {"airport_code": "LAX"}, + {"airport_code": "EFGH"}, + ], +) +``` + +The expected output is as follows: + +```plain +[UUID('6dab390f-acd9-4c7d-b252-616606fbc89b'), + UUID('9e811801-0e6b-4893-8886-60f4fb67ce69'), + UUID('f426747c-0f7b-4c62-97ed-3eeb7c8dd76e')] +``` + +### Perform a semantic search + +The following code searches for airports with clean facilities and vegetarian options: + +```python +retriever = vector_store.as_retriever( + search_type="similarity_score_threshold", + search_kwargs={"k": 3, "score_threshold": 0.85}, +) +semantic_query = "Could you recommend a US airport with clean lounges and good vegetarian dining options?" +reviews = retriever.invoke(semantic_query) +for r in reviews: + print("-" * 80) + print(r.page_content) + print(r.metadata) + print("-" * 80) +``` + +The expected output is as follows: + +```plain +-------------------------------------------------------------------------------- +Clean lounges and excellent vegetarian dining options. Highly recommended. +{'airport_code': 'JFK'} +-------------------------------------------------------------------------------- +-------------------------------------------------------------------------------- +Comfortable seating in lounge areas and diverse food selections, including vegetarian. 
+{'airport_code': 'LAX'} +-------------------------------------------------------------------------------- +``` + +### Retrieve detailed airport information + +Extract airport codes from the search results and query the database for detailed route information: + +```python +# Extracting airport codes from the metadata +airport_codes = [review.metadata["airport_code"] for review in reviews] + +# Executing a query to get the airport details +search_query = "SELECT * FROM airplan_routes WHERE airport_code IN :codes" +params = {"codes": tuple(airport_codes)} + +airport_details = vector_store.tidb_vector_client.execute(search_query, params) +airport_details.get("result") +``` + +The expected output is as follows: + +```plain +[(1, 'JFK', 'DL', 'LAX', 'Non-stop from JFK to LAX.', datetime.timedelta(seconds=21600), 5, 'Boeing 777', Decimal('299.99'), 'None'), + (2, 'LAX', 'AA', 'ORD', 'Direct LAX to ORD route.', datetime.timedelta(seconds=14400), 3, 'Airbus A320', Decimal('149.99'), 'None')] +``` + +### Streamline the process + +Alternatively, you can streamline the entire process using a single SQL query: + +```python +search_query = f""" + SELECT + VEC_Cosine_Distance(se.embedding, :query_vector) as distance, + ar.*, + se.document as airport_review + FROM + airplan_routes ar + JOIN + {TABLE_NAME} se ON ar.airport_code = JSON_UNQUOTE(JSON_EXTRACT(se.meta, '$.airport_code')) + ORDER BY distance ASC + LIMIT 5; +""" +query_vector = embeddings.embed_query(semantic_query) +params = {"query_vector": str(query_vector)} +airport_details = vector_store.tidb_vector_client.execute(search_query, params) +airport_details.get("result") +``` + +The expected output is as follows: + +```plain +[(0.1219207353407008, 1, 'JFK', 'DL', 'LAX', 'Non-stop from JFK to LAX.', datetime.timedelta(seconds=21600), 5, 'Boeing 777', Decimal('299.99'), 'None', 'Clean lounges and excellent vegetarian dining options. Highly recommended.'), + (0.14613754359804654, 2, 'LAX', 'AA', 'ORD', 'Direct LAX to ORD route.', datetime.timedelta(seconds=14400), 3, 'Airbus A320', Decimal('149.99'), 'None', 'Comfortable seating in lounge areas and diverse food selections, including vegetarian.'), + (0.19840519342700513, 3, 'EFGH', 'UA', 'SEA', 'Daily flights from SFO to SEA.', datetime.timedelta(seconds=9000), 7, 'Boeing 737', Decimal('129.99'), 'None', 'Small airport with basic facilities.')] +``` + +### Clean up data + +Finally, clean up the resources by dropping the created table: + +```python +vector_store.tidb_vector_client.execute("DROP TABLE airplan_routes") +``` + +The expected output is as follows: + +```plain +{'success': True, 'result': 0, 'error': None} +``` + +## See also + +- [Vector Data Types](/vector-search-data-types.md) +- [Vector Search Index](/vector-search-index.md) diff --git a/markdown-pages/en/tidb/master/vector-search-integrate-with-llamaindex.md b/markdown-pages/en/tidb/master/vector-search-integrate-with-llamaindex.md new file mode 100644 index 0000000..9dfb78a --- /dev/null +++ b/markdown-pages/en/tidb/master/vector-search-integrate-with-llamaindex.md @@ -0,0 +1,328 @@ +--- +title: Integrate Vector Search with LlamaIndex +summary: Learn how to integrate TiDB Vector Search with LlamaIndex. +--- + +# Integrate Vector Search with LlamaIndex + +This tutorial demonstrates how to integrate the [vector search](/vector-search-overview.md) feature of TiDB with [LlamaIndex](https://www.llamaindex.ai). + + + +> **Warning:** +> +> The vector search feature is experimental. 
It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** +> +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. + +> **Tip** +> +> You can view the complete [sample code](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/vector_stores/TiDBVector.ipynb) on Jupyter Notebook, or run the sample code directly in the [Colab](https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/vector_stores/TiDBVector.ipynb) online environment. + +## Prerequisites + +To complete this tutorial, you need: + +- [Python 3.8 or higher](https://www.python.org/downloads/) installed. +- [Jupyter Notebook](https://jupyter.org/install) installed. +- [Git](https://git-scm.com/downloads) installed. +- A TiDB cluster. + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- Follow [Deploy a local test TiDB cluster](/quick-start-with-tidb.md#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](/production-deployment-using-tiup.md) to create a local cluster. +- Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. + + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- (Recommended) Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. +- Follow [Deploy a local test TiDB cluster](https://docs.pingcap.com/tidb/stable/quick-start-with-tidb#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](https://docs.pingcap.com/tidb/stable/production-deployment-using-tiup) to create a local cluster. + + + +## Get started + +This section provides step-by-step instructions for integrating TiDB Vector Search with LlamaIndex to perform semantic searches. + +### Step 1. Create a new Jupyter Notebook file + +In the root directory, create a new Jupyter Notebook file named `integrate_with_llamaindex.ipynb`: + +```shell +touch integrate_with_llamaindex.ipynb +``` + +### Step 2. Install required dependencies + +In your project directory, run the following command to install the required packages: + +```shell +pip install llama-index-vector-stores-tidbvector +pip install llama-index +``` + +Open the `integrate_with_llamaindex.ipynb` file in Jupyter Notebook and add the following code to import the required packages: + +```python +import textwrap + +from llama_index.core import SimpleDirectoryReader, StorageContext +from llama_index.core import VectorStoreIndex +from llama_index.vector_stores.tidbvector import TiDBVectorStore +``` + +### Step 3. Configure environment variables + +1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. + +2. Click **Connect** in the upper-right corner. A connection dialog is displayed. + +3. Ensure the configurations in the connection dialog match your operating environment. + + - **Connection Type** is set to `Public`. + - **Branch** is set to `main`. + - **Connect With** is set to `SQLAlchemy`. + - **Operating System** matches your environment. + +4. Click the **PyMySQL** tab and copy the connection string. 
+
+    > **Tip:**
+    >
+    > If you have not set a password yet, click **Generate Password** to generate a random password.
+
+5. Configure environment variables.
+
+    This document uses [OpenAI](https://platform.openai.com/docs/introduction) as the embedding model provider. In this step, you need to provide the connection string obtained from the previous step and your [OpenAI API key](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key).
+
+    To configure the environment variables, run the following code. You will be prompted to enter your connection string and OpenAI API key:
+
+    ```python
+    # Use getpass to securely prompt for environment variables in your terminal.
+    import getpass
+    import os
+
+    # Copy your connection string from the TiDB Cloud console.
+    # Connection string format: "mysql+pymysql://<USERNAME>:<PASSWORD>@<HOST>:4000/<DATABASE>?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true"
+    tidb_connection_url = getpass.getpass("TiDB Connection String:")
+    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
+    ```
+
+
+
+This document uses [OpenAI](https://platform.openai.com/docs/introduction) as the embedding model provider. In this step, you need to provide the connection string of your TiDB cluster and your [OpenAI API key](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key).
+
+To configure the environment variables, run the following code. You will be prompted to enter your connection string and OpenAI API key:
+
+```python
+# Use getpass to securely prompt for environment variables in your terminal.
+import getpass
+import os
+
+# Enter the connection string of your TiDB cluster.
+# Connection string format: "mysql+pymysql://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>"
+tidb_connection_url = getpass.getpass("TiDB Connection String:")
+os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
+```
+
+The cluster connection string is in the following format:
+
+```dotenv
+TIDB_DATABASE_URL="mysql+pymysql://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>"
+# For example: TIDB_DATABASE_URL="mysql+pymysql://root@127.0.0.1:4000/test"
+```
+
+You need to modify the parameters in the connection string according to your TiDB cluster. If you are running TiDB on your local machine, `<HOST>` is `127.0.0.1` by default. The initial `<PASSWORD>` is empty, so if you are starting the cluster for the first time, you can omit this field.
+
+The following are descriptions for each parameter:
+
+- `<USERNAME>`: The username to connect to the TiDB cluster.
+- `<PASSWORD>`: The password to connect to the TiDB cluster.
+- `<HOST>`: The host of the TiDB cluster.
+- `<PORT>`: The port of the TiDB cluster.
+- `<DATABASE>`: The name of the database you want to connect to.
+
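+
+If your password contains characters that need URL encoding, or if you prefer to assemble the connection string from its parts instead of typing it interactively, the following sketch shows one way to do so. It is not part of the original tutorial, and the values are placeholders only; for TiDB Cloud Serverless, it is usually simpler to copy the full connection string from the console as described above:
+
+```python
+# A minimal sketch: build the SQLAlchemy-style connection string from parts.
+# All values below are examples, not real credentials.
+from urllib.parse import quote_plus
+
+username = "root"        # <USERNAME>
+password = ""            # <PASSWORD>; empty for a fresh local cluster
+host = "127.0.0.1"       # <HOST>
+port = 4000              # <PORT>
+database = "test"        # <DATABASE>
+
+# quote_plus() URL-encodes any special characters in the password.
+tidb_connection_url = (
+    f"mysql+pymysql://{username}:{quote_plus(password)}@{host}:{port}/{database}"
+)
+print(tidb_connection_url)
+```
+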
+ + + +### Step 4. Load the sample document + +#### Step 4.1 Download the sample document + +In your project directory, create a directory named `data/paul_graham/` and download the sample document [`paul_graham_essay.txt`](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt) from the [run-llama/llama_index](https://github.com/run-llama/llama_index) GitHub repository. + +```shell +!mkdir -p 'data/paul_graham/' +!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt' +``` + +#### Step 4.2 Load the document + +Load the sample document from `data/paul_graham/paul_graham_essay.txt` using the `SimpleDirectoryReader` class. + +```python +documents = SimpleDirectoryReader("./data/paul_graham").load_data() +print("Document ID:", documents[0].doc_id) + +for index, document in enumerate(documents): + document.metadata = {"book": "paul_graham"} +``` + +### Step 5. Embed and store document vectors + +#### Step 5.1 Initialize the TiDB vector store + +The following code creates a table named `paul_graham_test` in TiDB, which is optimized for vector search. + +```python +tidbvec = TiDBVectorStore( + connection_string=tidb_connection_url, + table_name="paul_graham_test", + distance_strategy="cosine", + vector_dimension=1536, + drop_existing_table=False, +) +``` + +Upon successful execution, you can directly view and access the `paul_graham_test` table in your TiDB database. + +#### Step 5.2 Generate and store embeddings + +The following code parses the documents, generates embeddings, and stores them in the TiDB vector store. + +```python +storage_context = StorageContext.from_defaults(vector_store=tidbvec) +index = VectorStoreIndex.from_documents( + documents, storage_context=storage_context, show_progress=True +) +``` + +The expected output is as follows: + +```plain +Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 8.76it/s] +Generating embeddings: 100%|██████████| 21/21 [00:02<00:00, 8.22it/s] +``` + +### Step 6. Perform a vector search + +The following creates a query engine based on the TiDB vector store and performs a semantic similarity search. + +```python +query_engine = index.as_query_engine() +response = query_engine.query("What did the author do?") +print(textwrap.fill(str(response), 100)) +``` + +> **Note** +> +> `TiDBVectorStore` only supports the [`default`](https://docs.llamaindex.ai/en/stable/api_reference/storage/vector_store/?h=vectorstorequerymode#llama_index.core.vector_stores.types.VectorStoreQueryMode) query mode. + +The expected output is as follows: + +```plain +The author worked on writing, programming, building microcomputers, giving talks at conferences, +publishing essays online, developing spam filters, painting, hosting dinner parties, and purchasing +a building for office use. +``` + +### Step 7. Search with metadata filters + +To refine your searches, you can use metadata filters to retrieve specific nearest-neighbor results that match the applied filters. 
+ +#### Query with `book != "paul_graham"` filter + +The following example excludes results where the `book` metadata field is `"paul_graham"`: + +```python +from llama_index.core.vector_stores.types import ( + MetadataFilter, + MetadataFilters, +) + +query_engine = index.as_query_engine( + filters=MetadataFilters( + filters=[ + MetadataFilter(key="book", value="paul_graham", operator="!="), + ] + ), + similarity_top_k=2, +) +response = query_engine.query("What did the author learn?") +print(textwrap.fill(str(response), 100)) +``` + +The expected output is as follows: + +```plain +Empty Response +``` + +#### Query with `book == "paul_graham"` filter + +The following example filters results to include only documents where the `book` metadata field is `"paul_graham"`: + +```python +from llama_index.core.vector_stores.types import ( + MetadataFilter, + MetadataFilters, +) + +query_engine = index.as_query_engine( + filters=MetadataFilters( + filters=[ + MetadataFilter(key="book", value="paul_graham", operator="=="), + ] + ), + similarity_top_k=2, +) +response = query_engine.query("What did the author learn?") +print(textwrap.fill(str(response), 100)) +``` + +The expected output is as follows: + +```plain +The author learned programming on an IBM 1401 using an early version of Fortran in 9th grade, then +later transitioned to working with microcomputers like the TRS-80 and Apple II. Additionally, the +author studied philosophy in college but found it unfulfilling, leading to a switch to studying AI. +Later on, the author attended art school in both the US and Italy, where they observed a lack of +substantial teaching in the painting department. +``` + +### Step 8. Delete documents + +Delete the first document from the index: + +```python +tidbvec.delete(documents[0].doc_id) +``` + +Check whether the documents had been deleted: + +```python +query_engine = index.as_query_engine() +response = query_engine.query("What did the author learn?") +print(textwrap.fill(str(response), 100)) +``` + +The expected output is as follows: + +```plain +Empty Response +``` + +## See also + +- [Vector Data Types](/vector-search-data-types.md) +- [Vector Search Index](/vector-search-index.md) diff --git a/markdown-pages/en/tidb/master/vector-search-integrate-with-peewee.md b/markdown-pages/en/tidb/master/vector-search-integrate-with-peewee.md new file mode 100644 index 0000000..226fcc7 --- /dev/null +++ b/markdown-pages/en/tidb/master/vector-search-integrate-with-peewee.md @@ -0,0 +1,290 @@ +--- +title: Integrate TiDB Vector Search with peewee +summary: Learn how to integrate TiDB Vector Search with peewee to store embeddings and perform semantic searches. +--- + +# Integrate TiDB Vector Search with peewee + +This tutorial walks you through how to use [peewee](https://docs.peewee-orm.com/) to interact with the [TiDB Vector Search](/vector-search-overview.md), store embeddings, and perform vector search queries. + + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** +> +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. 
+ +## Prerequisites + +To complete this tutorial, you need: + +- [Python 3.8 or higher](https://www.python.org/downloads/) installed. +- [Git](https://git-scm.com/downloads) installed. +- A TiDB cluster. + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- Follow [Deploy a local test TiDB cluster](/quick-start-with-tidb.md#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](/production-deployment-using-tiup.md) to create a local cluster. +- Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. + + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- (Recommended) Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. +- Follow [Deploy a local test TiDB cluster](https://docs.pingcap.com/tidb/stable/quick-start-with-tidb#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](https://docs.pingcap.com/tidb/stable/production-deployment-using-tiup) to create a local cluster. + + + +## Run the sample app + +You can quickly learn about how to integrate TiDB Vector Search with peewee by following the steps below. + +### Step 1. Clone the repository + +Clone the [`tidb-vector-python`](https://github.com/pingcap/tidb-vector-python) repository to your local machine: + +```shell +git clone https://github.com/pingcap/tidb-vector-python.git +``` + +### Step 2. Create a virtual environment + +Create a virtual environment for your project: + +```bash +cd tidb-vector-python/examples/orm-peewee-quickstart +python3 -m venv .venv +source .venv/bin/activate +``` + +### Step 3. Install required dependencies + +Install the required dependencies for the demo project: + +```bash +pip install -r requirements.txt +``` + +Alternatively, you can install the following packages for your project: + +```bash +pip install peewee pymysql python-dotenv tidb-vector +``` + +### Step 4. Configure the environment variables + +Configure the environment variables depending on the TiDB deployment option you've selected. + + +
+ +For a TiDB Cloud Serverless cluster, take the following steps to obtain the cluster connection string and configure environment variables: + +1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. + +2. Click **Connect** in the upper-right corner. A connection dialog is displayed. + +3. Ensure the configurations in the connection dialog match your operating environment. + + - **Connection Type** is set to `Public`. + - **Branch** is set to `main`. + - **Connect With** is set to `General`. + - **Operating System** matches your environment. + + > **Tip:** + > + > If your program is running in Windows Subsystem for Linux (WSL), switch to the corresponding Linux distribution. + +4. Copy the connection parameters from the connection dialog. + + > **Tip:** + > + > If you have not set a password yet, click **Generate Password** to generate a random password. + +5. In the root directory of your Python project, create a `.env` file and paste the connection parameters to the corresponding environment variables. + + - `TIDB_HOST`: The host of the TiDB cluster. + - `TIDB_PORT`: The port of the TiDB cluster. + - `TIDB_USERNAME`: The username to connect to the TiDB cluster. + - `TIDB_PASSWORD`: The password to connect to the TiDB cluster. + - `TIDB_DATABASE`: The database name to connect to. + - `TIDB_CA_PATH`: The path to the root certificate file. + + The following is an example for macOS: + + ```dotenv + TIDB_HOST=gateway01.****.prod.aws.tidbcloud.com + TIDB_PORT=4000 + TIDB_USERNAME=********.root + TIDB_PASSWORD=******** + TIDB_DATABASE=test + TIDB_CA_PATH=/etc/ssl/cert.pem + ``` + +
+
+ +For a TiDB Self-Managed cluster, create a `.env` file in the root directory of your Python project. Copy the following content into the `.env` file, and modify the environment variable values according to the connection parameters of your TiDB cluster: + +```dotenv +TIDB_HOST=127.0.0.1 +TIDB_PORT=4000 +TIDB_USERNAME=root +TIDB_PASSWORD= +TIDB_DATABASE=test +``` + +If you are running TiDB on your local machine, `TIDB_HOST` is `127.0.0.1` by default. The initial `TIDB_PASSWORD` is empty, so if you are starting the cluster for the first time, you can omit this field. + +The following are descriptions for each parameter: + +- `TIDB_HOST`: The host of the TiDB cluster. +- `TIDB_PORT`: The port of the TiDB cluster. +- `TIDB_USERNAME`: The username to connect to the TiDB cluster. +- `TIDB_PASSWORD`: The password to connect to the TiDB cluster. +- `TIDB_DATABASE`: The name of the database you want to connect to. + +
+ +
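+
+Optionally, before running the demo, you can confirm that the values in your `.env` file can actually reach the cluster. The following is a minimal sketch for this check; it is not part of the sample app and assumes that PyMySQL and python-dotenv are installed (both are listed in `requirements.txt`):
+
+```python
+# Optional sanity check for the connection parameters in `.env`.
+import os
+
+import dotenv
+import pymysql
+
+dotenv.load_dotenv()
+
+# TLS options are typically required for TiDB Cloud Serverless. They are read
+# from TIDB_CA_PATH if it is set, and skipped otherwise (for example, for a
+# local cluster).
+ssl_kwargs = {}
+if os.environ.get("TIDB_CA_PATH"):
+    ssl_kwargs = {
+        "ssl_ca": os.environ["TIDB_CA_PATH"],
+        "ssl_verify_cert": True,
+        "ssl_verify_identity": True,
+    }
+
+connection = pymysql.connect(
+    host=os.environ.get("TIDB_HOST", "127.0.0.1"),
+    port=int(os.environ.get("TIDB_PORT", "4000")),
+    user=os.environ.get("TIDB_USERNAME", "root"),
+    password=os.environ.get("TIDB_PASSWORD", ""),
+    database=os.environ.get("TIDB_DATABASE", "test"),
+    **ssl_kwargs,
+)
+with connection.cursor() as cursor:
+    cursor.execute("SELECT VERSION()")
+    print(cursor.fetchone())  # Prints the server version if the connection works.
+connection.close()
+```
+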
+
+### Step 5. Run the demo
+
+```bash
+python peewee-quickstart.py
+```
+
+Example output:
+
+```text
+Get 3-nearest neighbor documents:
+  - distance: 0.00853986601633272
+    document: fish
+  - distance: 0.12712843905603044
+    document: dog
+  - distance: 0.7327387580875756
+    document: tree
+Get documents within a certain distance:
+  - distance: 0.00853986601633272
+    document: fish
+  - distance: 0.12712843905603044
+    document: dog
+```
+
+## Sample code snippets
+
+You can refer to the following sample code snippets to develop your application.
+
+### Create vector tables
+
+#### Connect to TiDB cluster
+
+```python
+import os
+import dotenv
+
+from peewee import Model, MySQLDatabase, SQL, TextField
+from tidb_vector.peewee import VectorField
+
+dotenv.load_dotenv()
+
+# Using `pymysql` as the driver.
+connect_kwargs = {
+    'ssl_verify_cert': True,
+    'ssl_verify_identity': True,
+}
+
+# Using `mysqlclient` as the driver.
+# connect_kwargs = {
+#    'ssl_mode': 'VERIFY_IDENTITY',
+#    'ssl': {
+#        # Root certificate default path
+#        # https://docs.pingcap.com/tidbcloud/secure-connections-to-serverless-clusters/#root-certificate-default-path
+#        'ca': os.environ.get('TIDB_CA_PATH', '/path/to/ca.pem'),
+#    },
+# }
+
+db = MySQLDatabase(
+    database=os.environ.get('TIDB_DATABASE', 'test'),
+    user=os.environ.get('TIDB_USERNAME', 'root'),
+    password=os.environ.get('TIDB_PASSWORD', ''),
+    host=os.environ.get('TIDB_HOST', 'localhost'),
+    port=int(os.environ.get('TIDB_PORT', '4000')),
+    **connect_kwargs,
+)
+```
+
+#### Define a vector column
+
+Create a table named `peewee_demo_documents` with a column named `embedding` that stores a 3-dimensional vector.
+
+```python
+class Document(Model):
+    class Meta:
+        database = db
+        table_name = 'peewee_demo_documents'
+
+    content = TextField()
+    embedding = VectorField(3)
+```
+
+#### Define a vector column optimized with index
+
+> **Note**
+>
+> This section is only applicable to [TiDB Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-serverless) clusters.
+
+Define a 3-dimensional vector column and optimize it with a [vector search index](/vector-search-index.md) (HNSW index).
+
+```python
+class DocumentWithIndex(Model):
+    class Meta:
+        database = db
+        table_name = 'peewee_demo_documents_with_index'
+
+    content = TextField()
+    embedding = VectorField(3, constraints=[SQL("COMMENT 'hnsw(distance=cosine)'")])
+```
+
+TiDB will use this index to accelerate vector search queries based on the cosine distance function.
+
+### Store documents with embeddings
+
+```python
+Document.create(content='dog', embedding=[1, 2, 1])
+Document.create(content='fish', embedding=[1, 2, 4])
+Document.create(content='tree', embedding=[1, 0, 0])
+```
+
+### Search the nearest neighbor documents
+
+Search for the top-3 documents that are semantically closest to the query vector `[1, 2, 3]` based on the cosine distance function.
+
+```python
+distance = Document.embedding.cosine_distance([1, 2, 3]).alias('distance')
+results = Document.select(Document, distance).order_by(distance).limit(3)
+```
+
+### Search documents within a certain distance
+
+Search for the documents whose cosine distance from the query vector `[1, 2, 3]` is less than 0.2.
+ +```python +distance_expression = Document.embedding.cosine_distance([1, 2, 3]) +distance = distance_expression.alias('distance') +results = Document.select(Document, distance).where(distance_expression < 0.2).order_by(distance).limit(3) +``` + +## See also + +- [Vector Data Types](/vector-search-data-types.md) +- [Vector Search Index](/vector-search-index.md) diff --git a/markdown-pages/en/tidb/master/vector-search-integrate-with-sqlalchemy.md b/markdown-pages/en/tidb/master/vector-search-integrate-with-sqlalchemy.md new file mode 100644 index 0000000..c92ec95 --- /dev/null +++ b/markdown-pages/en/tidb/master/vector-search-integrate-with-sqlalchemy.md @@ -0,0 +1,259 @@ +--- +title: Integrate TiDB Vector Search with SQLAlchemy +summary: Learn how to integrate TiDB Vector Search with SQLAlchemy to store embeddings and perform semantic searches. +--- + +# Integrate TiDB Vector Search with SQLAlchemy + +This tutorial walks you through how to use [SQLAlchemy](https://www.sqlalchemy.org/) to interact with [TiDB Vector Search](/vector-search-overview.md), store embeddings, and perform vector search queries. + + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** +> +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. + +## Prerequisites + +To complete this tutorial, you need: + +- [Python 3.8 or higher](https://www.python.org/downloads/) installed. +- [Git](https://git-scm.com/downloads) installed. +- A TiDB cluster. + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- Follow [Deploy a local test TiDB cluster](/quick-start-with-tidb.md#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](/production-deployment-using-tiup.md) to create a local cluster. +- Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. + + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- (Recommended) Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. +- Follow [Deploy a local test TiDB cluster](https://docs.pingcap.com/tidb/stable/quick-start-with-tidb#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](https://docs.pingcap.com/tidb/stable/production-deployment-using-tiup) to create a local cluster. + + + +## Run the sample app + +You can quickly learn about how to integrate TiDB Vector Search with SQLAlchemy by following the steps below. + +### Step 1. Clone the repository + +Clone the `tidb-vector-python` repository to your local machine: + +```shell +git clone https://github.com/pingcap/tidb-vector-python.git +``` + +### Step 2. Create a virtual environment + +Create a virtual environment for your project: + +```bash +cd tidb-vector-python/examples/orm-sqlalchemy-quickstart +python3 -m venv .venv +source .venv/bin/activate +``` + +### Step 3. 
Install the required dependencies + +Install the required dependencies for the demo project: + +```bash +pip install -r requirements.txt +``` + +Alternatively, you can install the following packages for your project: + +```bash +pip install pymysql python-dotenv sqlalchemy tidb-vector +``` + +### Step 4. Configure the environment variables + +Configure the environment variables depending on the TiDB deployment option you've selected. + + +
+
+For a TiDB Cloud Serverless cluster, take the following steps to obtain the cluster connection string and configure environment variables:
+
+1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page.
+
+2. Click **Connect** in the upper-right corner. A connection dialog is displayed.
+
+3. Ensure the configurations in the connection dialog match your operating environment.
+
+    - **Connection Type** is set to `Public`.
+    - **Branch** is set to `main`.
+    - **Connect With** is set to `SQLAlchemy`.
+    - **Operating System** matches your environment.
+
+    > **Tip:**
+    >
+    > If your program is running in Windows Subsystem for Linux (WSL), switch to the corresponding Linux distribution.
+
+4. Click the **PyMySQL** tab and copy the connection string.
+
+    > **Tip:**
+    >
+    > If you have not set a password yet, click **Generate Password** to generate a random password.
+
+5. In the root directory of your Python project, create a `.env` file and paste the connection string into it.
+
+    The following is an example for macOS:
+
+    ```dotenv
+    TIDB_DATABASE_URL="mysql+pymysql://********.root:********@gateway01.****.prod.aws.tidbcloud.com:4000/test?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true"
+    ```
+
+
+
+For a TiDB Self-Managed cluster, create a `.env` file in the root directory of your Python project. Copy the following content into the `.env` file, and modify the environment variable values according to the connection parameters of your TiDB cluster:
+
+```dotenv
+TIDB_DATABASE_URL="mysql+pymysql://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>"
+# For example: TIDB_DATABASE_URL="mysql+pymysql://root@127.0.0.1:4000/test"
+```
+
+If you are running TiDB on your local machine, `<HOST>` is `127.0.0.1` by default. The initial `<PASSWORD>` is empty, so if you are starting the cluster for the first time, you can omit this field.
+
+The following are descriptions for each parameter:
+
+- `<USERNAME>`: The username to connect to the TiDB cluster.
+- `<PASSWORD>`: The password to connect to the TiDB cluster.
+- `<HOST>`: The host of the TiDB cluster.
+- `<PORT>`: The port of the TiDB cluster.
+- `<DATABASE>`: The name of the database you want to connect to.
+
+ +
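+
+Optionally, before running the demo, you can verify that `TIDB_DATABASE_URL` works by opening a short-lived connection. This is a minimal sketch and is not part of the sample app; it only uses packages that are already listed in `requirements.txt`:
+
+```python
+# Optional sanity check for TIDB_DATABASE_URL in `.env`.
+import os
+
+import dotenv
+from sqlalchemy import create_engine, text
+
+dotenv.load_dotenv()
+
+engine = create_engine(os.environ["TIDB_DATABASE_URL"])
+with engine.connect() as connection:
+    # Prints the server version if the connection string is valid.
+    print(connection.execute(text("SELECT VERSION()")).scalar())
+```
+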
+ +### Step 5. Run the demo + +```bash +python sqlalchemy-quickstart.py +``` + +Example output: + +```text +Get 3-nearest neighbor documents: + - distance: 0.00853986601633272 + document: fish + - distance: 0.12712843905603044 + document: dog + - distance: 0.7327387580875756 + document: tree +Get documents within a certain distance: + - distance: 0.00853986601633272 + document: fish + - distance: 0.12712843905603044 + document: dog +``` + +## Sample code snippets + +You can refer to the following sample code snippets to develop your application. + +### Create vector tables + +#### Connect to TiDB cluster + +```python +import os +import dotenv + +from sqlalchemy import Column, Integer, create_engine, Text +from sqlalchemy.orm import declarative_base, Session +from tidb_vector.sqlalchemy import VectorType + +dotenv.load_dotenv() + +tidb_connection_string = os.environ['TIDB_DATABASE_URL'] +engine = create_engine(tidb_connection_string) +``` + +#### Define a vector column + +Create a table with a column named `embedding` that stores a 3-dimensional vector. + +```python +Base = declarative_base() + +class Document(Base): + __tablename__ = 'sqlalchemy_demo_documents' + id = Column(Integer, primary_key=True) + content = Column(Text) + embedding = Column(VectorType(3)) +``` + +#### Define a vector column optimized with index + +> **Note** +> +> This section is only applicable to [TiDB Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-serverless) clusters. + +Define a 3-dimensional vector column and optimize it with a [vector search index](/vector-search-index.md) (HNSW index). + +```python +class DocumentWithIndex(Base): + __tablename__ = 'sqlalchemy_demo_documents_with_index' + id = Column(Integer, primary_key=True) + content = Column(Text) + embedding = Column(VectorType(3), comment="hnsw(distance=cosine)") +``` + +TiDB will use this index to accelerate vector search queries based on the cosine distance function. + +### Store documents with embeddings + +```python +with Session(engine) as session: + session.add(Document(content="dog", embedding=[1, 2, 1])) + session.add(Document(content="fish", embedding=[1, 2, 4])) + session.add(Document(content="tree", embedding=[1, 0, 0])) + session.commit() +``` + +### Search the nearest neighbor documents + +Search for the top-3 documents that are semantically closest to the query vector `[1, 2, 3]` based on the cosine distance function. + +```python +with Session(engine) as session: + distance = Document.embedding.cosine_distance([1, 2, 3]).label('distance') + results = session.query( + Document, distance + ).order_by(distance).limit(3).all() +``` + +### Search documents within a certain distance + +Search for documents whose cosine distance from the query vector `[1, 2, 3]` is less than 0.2. 
+ +```python +with Session(engine) as session: + distance = Document.embedding.cosine_distance([1, 2, 3]).label('distance') + results = session.query( + Document, distance + ).filter(distance < 0.2).order_by(distance).limit(3).all() +``` + +## See also + +- [Vector Data Types](/vector-search-data-types.md) +- [Vector Search Index](/vector-search-index.md) diff --git a/markdown-pages/en/tidb/master/vector-search-integration-overview.md b/markdown-pages/en/tidb/master/vector-search-integration-overview.md new file mode 100644 index 0000000..d96deac --- /dev/null +++ b/markdown-pages/en/tidb/master/vector-search-integration-overview.md @@ -0,0 +1,79 @@ +--- +title: Vector Search Integration Overview +summary: An overview of TiDB vector search integration, including supported AI frameworks, embedding models, and ORM libraries. +--- + +# Vector Search Integration Overview + +This document provides an overview of TiDB vector search integration, including supported AI frameworks, embedding models, and Object Relational Mapping (ORM) libraries. + + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** +> +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. + +## AI frameworks + +TiDB provides official support for the following AI frameworks, enabling you to easily integrate AI applications developed based on these frameworks with TiDB Vector Search. + +| AI frameworks | Tutorial | +|---------------|---------------------------------------------------------------------------------------------------| +| Langchain | [Integrate Vector Search with LangChain](/vector-search-integrate-with-langchain.md) | +| LlamaIndex | [Integrate Vector Search with LlamaIndex](/vector-search-integrate-with-llamaindex.md) | + +Moreover, you can also use TiDB for various purposes, such as document storage and knowledge graph storage for AI applications. + +## Embedding models and services + +TiDB Vector Search supports storing vectors of up to 16383 dimensions, which accommodates most embedding models. + +You can either use self-deployed open-source embedding models or third-party embedding APIs provided by third-party embedding providers to generate vectors. + +The following table lists some mainstream embedding service providers and the corresponding integration tutorials. + +| Embedding service providers | Tutorial | +|-----------------------------|---------------------------------------------------------------------------------------------------------------------| +| Jina AI | [Integrate Vector Search with Jina AI Embeddings API](/vector-search-integrate-with-jinaai-embedding.md) | + +## Object Relational Mapping (ORM) libraries + +You can integrate TiDB Vector Search with your ORM library to interact with the TiDB database. + +The following table lists the supported ORM libraries and the corresponding integration tutorials: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| Language | ORM/Client | How to install | Tutorial |
+|----------|------------|----------------|----------|
+| Python | TiDB Vector Client | `pip install tidb-vector[client]` | [Get Started with Vector Search Using Python](/vector-search-get-started-using-python.md) |
+| Python | SQLAlchemy | `pip install tidb-vector` | [Integrate TiDB Vector Search with SQLAlchemy](/vector-search-integrate-with-sqlalchemy.md) |
+| Python | peewee | `pip install tidb-vector` | [Integrate TiDB Vector Search with peewee](/vector-search-integrate-with-peewee.md) |
+| Python | Django | `pip install django-tidb[vector]` | [Integrate TiDB Vector Search with Django](/vector-search-integrate-with-django-orm.md) |
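+
+As a quick taste of the TiDB Vector Client listed above, the following sketch stores two toy embeddings and runs a nearest-neighbor query. The table name, vectors, and environment variable are made up for illustration, and the parameter names follow the Python quick start guide; treat this as a sketch and refer to that guide for the authoritative API:
+
+```python
+# A minimal, illustrative sketch of the TiDB Vector Client.
+import os
+
+from tidb_vector.integrations import TiDBVectorClient
+
+vector_store = TiDBVectorClient(
+    table_name="orm_overview_demo",                      # hypothetical table name
+    connection_string=os.environ["TIDB_DATABASE_URL"],   # assumed to be set in your environment
+    vector_dimension=3,
+    drop_existing_table=True,
+)
+
+vector_store.insert(
+    ids=["1", "2"],
+    texts=["dog", "fish"],
+    embeddings=[[1, 2, 1], [1, 2, 4]],
+    metadatas=[{"category": "pet"}, {"category": "pet"}],
+)
+
+for result in vector_store.query([1, 2, 3], k=1):
+    print(result.document, result.distance)
+```
+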
diff --git a/markdown-pages/en/tidb/master/vector-search-limitations.md b/markdown-pages/en/tidb/master/vector-search-limitations.md new file mode 100644 index 0000000..063ddcd --- /dev/null +++ b/markdown-pages/en/tidb/master/vector-search-limitations.md @@ -0,0 +1,67 @@ +--- +title: Vector Search Limitations +summary: Learn the limitations of the TiDB vector search. +--- + +# Vector Search Limitations + +This document describes the known limitations of TiDB vector search. + + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** +> +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. + +## Vector data type limitations + +- Each [vector](/vector-search-data-types.md) supports up to 16383 dimensions. +- Vector data types cannot store `NaN`, `Infinity`, or `-Infinity` values. +- Vector data types cannot store double-precision floating-point numbers. If you insert or store double-precision floating-point numbers in vector columns, TiDB converts them to single-precision floating-point numbers. +- Vector columns cannot be used as primary keys or as part of a primary key. +- Vector columns cannot be used as unique indexes or as part of a unique index. +- Vector columns cannot be used as partition keys or as part of a partition key. +- Currently, TiDB does not support modifying a vector column to other data types (such as `JSON` and `VARCHAR`). + +## Vector index limitations + +See [Vector search restrictions](/vector-search-index.md#restrictions). + +## Compatibility with TiDB tools + + + +- Make sure that you are using v8.4.0 or a later version of BR to back up and restore data. Restoring tables with vector data types to TiDB clusters earlier than v8.4.0 is not supported. +- TiDB Data Migration (DM) does not support migrating or replicating MySQL 9.0 vector data types to TiDB. +- When TiCDC replicates vector data to a downstream that does not support vector data types, it will change the vector data types to another type. For more information, see [Compatibility with vector data types](/ticdc/ticdc-compatibility.md#compatibility-with-vector-data-types). + + + + + +- The Data Migration feature in the TiDB Cloud console does not support migrating or replicating MySQL 9.0 vector data types to TiDB Cloud. + + + +## Feedback + +We value your feedback and are always here to help: + + + +- [Join our Discord](https://discord.gg/zcqexutz2R) + + + + + +- [Join our Discord](https://discord.gg/zcqexutz2R) +- [Visit our Support Portal](https://tidb.support.pingcap.com/) + + \ No newline at end of file diff --git a/markdown-pages/en/tidb/master/vector-search-overview.md b/markdown-pages/en/tidb/master/vector-search-overview.md new file mode 100644 index 0000000..3a21165 --- /dev/null +++ b/markdown-pages/en/tidb/master/vector-search-overview.md @@ -0,0 +1,88 @@ +--- +title: Vector Search Overview +summary: Learn about Vector Search in TiDB Cloud. This feature provides an advanced search solution for performing semantic similarity searches across various data types, including documents, images, audio, and video. 
+--- + +# Vector Search Overview + +TiDB Vector Search provides an advanced search solution for performing semantic similarity searches across various data types, including documents, images, audio, and video. This feature enables developers to easily build scalable applications with generative artificial intelligence (AI) capabilities using familiar MySQL skills. + + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + + + +> **Warning:** +> +> The vector search feature is in beta. It might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** +> +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. + +## Concepts + +Vector search is a search method that prioritizes the meaning of your data to deliver relevant results. + +Unlike traditional full-text search, which relies on exact keyword matching and word frequency, vector search converts various data types (such as text, images, or audio) into high-dimensional vectors and queries based on the similarity between these vectors. This search method captures the semantic meaning and contextual information of the data, leading to a more precise understanding of user intent. + +Even when the search terms do not exactly match the content in the database, vector search can still provide results that align with the user's intent by analyzing the semantics of the data. + +For example, a full-text search for "a swimming animal" only returns results containing these exact keywords. In contrast, vector search can return results for other swimming animals, such as fish or ducks, even if these results do not contain the exact keywords. + +### Vector embedding + +A vector embedding, also known as an embedding, is a sequence of numbers that represents real-world objects in a high-dimensional space. It captures the meaning and context of unstructured data, such as documents, images, audio, and videos. + +Vector embeddings are essential in machine learning and serve as the foundation for semantic similarity searches. + +TiDB introduces [Vector data types](/vector-search-data-types.md) designed to optimize the storage and retrieval of vector embeddings, enhancing their use in AI applications. You can store vector embeddings in TiDB and perform vector search queries to find the most relevant data using these data types. + +### Embedding model + +Embedding models are algorithms that transform data into [vector embeddings](#vector-embedding). + +Choosing an appropriate embedding model is crucial for ensuring the accuracy and relevance of semantic search results. For unstructured text data, you can find top-performing text embedding models on the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard). + +To learn how to generate vector embeddings for your specific data types, refer to integration tutorials or examples of embedding models. 
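+
+As a concrete illustration (not a requirement of TiDB), the following sketch uses the open-source `sentence-transformers` package with the `all-MiniLM-L6-v2` model, which produces 384-dimensional embeddings, to turn a sentence into a vector. Any other embedding model or embedding API works the same way, as long as its output dimension matches the vector column you define in TiDB:
+
+```python
+# Illustrative only: turn a piece of text into a fixed-length embedding vector.
+from sentence_transformers import SentenceTransformer
+
+model = SentenceTransformer("all-MiniLM-L6-v2")
+embedding = model.encode("a swimming animal")
+
+print(len(embedding))  # 384, the dimension to use for the corresponding vector column
+print(embedding[:5])   # the first few components of the embedding
+```
+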
+ +## How vector search works + +After converting raw data into vector embeddings and storing them in TiDB, your application can execute vector search queries to find the data most semantically or contextually relevant to a user's query. + +TiDB vector search identifies the top-k nearest neighbor (KNN) vectors by using a [distance function](/vector-search-functions-and-operators.md) to calculate the distance between the given vector and vectors stored in the database. The vectors closest to the given vector in the query represent the most similar data in meaning. + +![The Schematic TiDB Vector Search](/media/vector-search/embedding-search.png) + +As a relational database with integrated vector search capabilities, TiDB enables you to store data and their corresponding vector representations (that is, vector embeddings) together in one database. You can choose any of the following ways for storage: + +- Store data and their corresponding vector representations in different columns of the same table. +- Store data and their corresponding vector representation in different tables. In this way, you need to use `JOIN` queries to combine the tables when retrieving data. + +## Use cases + +### Retrieval-Augmented Generation (RAG) + +Retrieval-Augmented Generation (RAG) is an architecture designed to optimize the output of Large Language Models (LLMs). By using vector search, RAG applications can store vector embeddings in the database and retrieve relevant documents as additional context when the LLM generates responses, thereby improving the quality and relevance of the answers. + +### Semantic search + +Semantic search is a search technology that returns results based on the meaning of a query, rather than simply matching keywords. It interprets the meaning across different languages and various types of data (such as text, images, and audio) using embeddings. Vector search algorithms then use these embeddings to find the most relevant data that satisfies the user's query. + +### Recommendation engine + +A recommendation engine is a system that proactively suggests content, products, or services that are relevant and personalized to users. It accomplishes this by creating embeddings that represent user behavior and preferences. These embeddings help the system identify similar items that other users have interacted with or shown interest in. This increases the likelihood that the recommendations will be both relevant and appealing to the user. + +## See also + +To get started with TiDB Vector Search, see the following documents: + +- [Get started with vector search using Python](/vector-search-get-started-using-python.md) +- [Get started with vector search using SQL](/vector-search-get-started-using-sql.md)