add two HTAP documents (pingcap#6206)

Frank945946 · Aug 27, 2021 · b589312 · b589312
1 parent db55105
commit b589312

Showing 4 changed files with 322 additions and 1 deletion.
diff --git a/TOC.md b/TOC.md
@@ -20,7 +20,9 @@
   + [Credits](/credits.md)
 + Quick Start
   + [Try Out TiDB](/quick-start-with-tidb.md)
+  + [Try Out HTAP](/quick-start-with-htap.md)
   + [Learn TiDB SQL](/basic-sql-operations.md)
+  + [Learn HTAP](/explore-htap.md)
   + [Import Example Database](/import-example-data.md)
 + Deploy
   + [Software and Hardware Requirements](/hardware-and-software-requirements.md)

diff --git a/_index.md b/_index.md
@@ -26,8 +26,10 @@ Designed for the cloud, TiDB provides flexible scalability, reliability and secu
 <NavColumn>
 <ColumnTitle>Quick Start</ColumnTitle>
 
-- [Quick Start Guide](/quick-start-with-tidb.md)
+- [Quick Start with TiDB](/quick-start-with-tidb.md)
+- [Quick Start with HTAP](/quick-start-with-htap.md)
 - [Explore SQL with TiDB](/basic-sql-operations.md)
+- [Explore HTAP](/explore-htap.md)
 
 </NavColumn>
 

diff --git a/explore-htap.md b/explore-htap.md
@@ -0,0 +1,104 @@
+---
+title: Explore HTAP
+summary: Learn how to explore and use the features of TiDB HTAP.
+---
+
+# Explore HTAP
+
+This guide describes how to explore and use the features of TiDB Hybrid Transactional and Analytical Processing (HTAP).
+
+> **Note:**
+>
+> If you are new to TiDB HTAP and want to start using it quickly, see [Quick start with HTAP](/quick-start-with-htap.md).
+
+## Use cases
+
+TiDB HTAP can handle massive, rapidly growing data, reduce the cost of DevOps, and be deployed easily in either on-premises or cloud environments, so that you can realize the value of your data assets in real time.
+
+The following are the typical use cases of HTAP:
+
+- Hybrid workload
+
+    When using TiDB for real-time Online Analytical Processing (OLAP) in hybrid workload scenarios, you only need to provide a single TiDB entry point to your data. TiDB automatically selects the processing engine based on the specific workload.
+
+- Real-time stream processing
+
+    When using TiDB in real-time stream processing scenarios, TiDB ensures that all incoming data can be queried in real time. At the same time, TiDB can also handle highly concurrent data workloads and Business Intelligence (BI) queries.
+
+- Data hub
+
+    When using TiDB as a data hub, TiDB can meet specific business needs by seamlessly connecting the data for the application and the data warehouse.
+
+For more information about use cases of TiDB HTAP, see [blogs about HTAP on the PingCAP website](https://en.pingcap.com/blog/tag/HTAP).
+
+## Architecture
+
+In TiDB, a row-based storage engine [TiKV](/tikv-overview.md) for Online Transactional Processing (OLTP) and a columnar storage engine [TiFlash](/tiflash/tiflash-overview.md) for Online Analytical Processing (OLAP) co-exist, replicate data automatically, and keep strong consistency. 
+
+For more information about the architecture, see [architecture of TiDB HTAP](/tiflash/tiflash-overview.md#architecture).
+
+## Environment preparation 
+
+Before exploring the features of TiDB HTAP, you need to deploy TiDB and the corresponding storage engines according to the data volume. If the data volume is large (for example, 100 TB), it is recommended to use TiFlash Massively Parallel Processing (MPP) as the primary solution and TiSpark as the supplementary solution.
+
+- TiFlash
+
+    - If you have deployed a TiDB cluster with no TiFlash node, add the TiFlash nodes in the current TiDB cluster. For detailed information, see [Scale out a TiFlash cluster](/scale-tidb-using-tiup.md#scale-out-a-tiflash-cluster).
+    - If you have not deployed a TiDB cluster, see [Deploy a TiDB Cluster using TiUP](/production-deployment-using-tiup.md).  Based on the minimal TiDB topology, you also need to deploy the [topology of TiFlash](/tiflash-deployment-topology.md).
+    - When choosing the number of TiFlash nodes, consider the following scenarios:
+
+        - If your use case requires OLTP with small-scale analytical processing and ad hoc queries, deploy one or several TiFlash nodes. They can dramatically increase the speed of analytic queries.
+        - If the OLTP throughput does not put significant pressure on the I/O usage of the TiFlash nodes, each TiFlash node uses more resources for computation, and thus the TiFlash cluster can have near-linear scalability. Tune the number of TiFlash nodes based on the expected performance and response time.
+        - If the OLTP throughput is relatively high (for example, the write or update throughput is higher than 10 million rows per hour), the limited write capacity of the network and physical disks makes the I/O between TiKV and TiFlash a bottleneck, and read and write hotspots are also likely. In this case, the number of TiFlash nodes has a complex non-linear relationship with the computation volume of analytical processing, so you need to tune the number of TiFlash nodes based on the actual status of the system.
+
+- TiSpark
+
+    - If your data needs to be analyzed with Spark, deploy TiSpark (Spark 3.x is not currently supported). For the specific process, see [TiSpark User Guide](/tispark-overview.md).
+
+<!--    - Real-time stream processing
+  - If you want to build an efficient and easy-to-use real-time data warehouse with TiDB and Flink, you are welcome to participate in Apache Flink x TiDB meetups.-->
+
+## Data preparation 
+
+After TiFlash is deployed, TiKV does not replicate data to TiFlash automatically. You need to manually specify which tables need to be replicated to TiFlash. After that, TiDB creates the corresponding TiFlash replicas.
+
+- If there is no data in the TiDB cluster, migrate the data to TiDB first. For detailed information, see [data migration](/migration-overview.md).
+- If the TiDB cluster already has the replicated data from upstream, after TiFlash is deployed, data replication does not automatically begin. You need to manually specify the tables to be replicated to TiFlash. For detailed information, see [Use TiFlash](/tiflash/use-tiflash.md).
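+
+As a minimal sketch of the manual step described above, the following statements create one TiFlash replica for a table and then check its replication status. The table name `test.t` is a placeholder assumed for illustration only; replace it with your own table.
+
+{{< copyable "sql" >}}
+
+```sql
+-- `test.t` is a hypothetical table name used only for illustration.
+ALTER TABLE test.t SET TIFLASH REPLICA 1;
+
+-- AVAILABLE = 1 means that the TiFlash replica of the table is available.
+SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
+FROM information_schema.tiflash_replica
+WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 't';
+```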
+
+## Data processing
+
+With TiDB, you can simply enter SQL statements for query or write requests. For the tables with TiFlash replicas, TiDB uses the front-end optimizer to automatically choose the optimal execution plan.
+
+> **Note:**
+> 
+> The MPP mode of TiFlash is enabled by default. When an SQL statement is executed, TiDB automatically determines whether to run in the MPP mode through the optimizer.
+>
+> - To disable the MPP mode of TiFlash, set the value of the [tidb_allow_mpp](/system-variables.md#tidb_allow_mpp-new-in-v50) system variable to `OFF`.
+> - To forcibly enable the MPP mode of TiFlash for query execution, set the values of [tidb_allow_mpp](/system-variables.md#tidb_allow_mpp-new-in-v50) and [tidb_enforce_mpp](/system-variables.md#tidb_enforce_mpp-new-in-v51) to `ON`.
+> - To check whether TiDB chooses the MPP mode to execute a specific query, see [Explain Statements in the MPP Mode](/explain-mpp.md#explain-statements-in-the-mpp-mode). If the output of the `EXPLAIN` statement includes the `ExchangeSender` and `ExchangeReceiver` operators, the MPP mode is in use.
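+
+The settings in the preceding note can be tried in a session roughly as follows. This is only a sketch: `test.t` is a hypothetical table that is assumed to have a TiFlash replica, and whether the MPP operators appear in the plan depends on your TiDB version and data.
+
+{{< copyable "sql" >}}
+
+```sql
+-- Disable the MPP mode of TiFlash for the current session.
+SET SESSION tidb_allow_mpp = OFF;
+
+-- Forcibly enable the MPP mode for the current session.
+SET SESSION tidb_allow_mpp = ON;
+SET SESSION tidb_enforce_mpp = ON;
+
+-- Look for ExchangeSender and ExchangeReceiver in the plan.
+-- `test.t` is a hypothetical table name used only for illustration.
+EXPLAIN SELECT COUNT(*) FROM test.t;
+```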
+
+## Performance monitoring
+
+When using TiDB, you can monitor the TiDB cluster status and performance metrics in either of the following ways:
+
+- [TiDB Dashboard](/dashboard/dashboard-intro.md): you can see the overall running status of the TiDB cluster, analyze the distribution and trends of read and write traffic, and learn the detailed execution information of slow queries.
+- [Monitoring system (Prometheus & Grafana)](/grafana-overview-dashboard.md): you can see the monitoring parameters of TiDB cluster-related components including PD, TiDB, TiKV, TiFlash, TiCDC, and Node_exporter.
+
+To see the alert rules of the TiDB cluster and the TiFlash cluster, see [TiDB cluster alert rules](/alert-rules.md) and [TiFlash alert rules](/tiflash/tiflash-alert-rules.md).
+
+## Troubleshooting
+
+If any issue occurs while you are using TiDB, refer to the following documents:
+
+- [Analyze slow queries](/analyze-slow-queries.md)
+- [Identify expensive queries](/identify-expensive-queries.md)
+- [Troubleshoot hotspot issues](/troubleshoot-hot-spot-issues.md)
+- [TiDB cluster troubleshooting guide](/troubleshoot-tidb-cluster.md)
+- [Troubleshoot a TiFlash Cluster](/tiflash/troubleshoot-tiflash.md)
+
+You are also welcome to create [GitHub Issues](https://github.com/pingcap/tiflash/issues) or submit your questions on [AskTUG](https://asktug.com/).
+
+## What's next
+
+- To check the TiFlash version, critical logs, and system tables, see [Maintain a TiFlash cluster](/tiflash/maintain-tiflash.md).
+- To remove a specific TiFlash node, see [Scale in a TiFlash cluster](/scale-tidb-using-tiup.md#scale-in-a-tiflash-cluster).
diff --git a/quick-start-with-htap.md b/quick-start-with-htap.md
@@ -0,0 +1,213 @@
+---
+title: Quick start with HTAP
+summary: Learn how to quickly get started with TiDB HTAP.
+---
+
+# Quick Start Guide for TiDB HTAP
+
+This guide walks you through the quickest way to get started with TiDB's one-stop solution for Hybrid Transactional and Analytical Processing (HTAP).
+
+> **Note:**
+>
+> The steps provided in this guide are ONLY for a quick start in a test environment. For production environments, see [Explore HTAP](/explore-htap.md).
+
+## Basic concepts
+
+Before using TiDB HTAP, you need to have some basic knowledge about [TiKV](/tikv-overview.md), a row-based storage engine for TiDB Online Transactional Processing (OLTP), and [TiFlash](/tiflash/tiflash-overview.md), a columnar storage engine for TiDB Online Analytical Processing (OLAP).
+
+- Storage engines of HTAP: The row-based storage engine and the columnar storage engine co-exist for HTAP. Both storage engines can replicate data automatically and keep strong consistency. The row-based storage engine optimizes OLTP performance, and the columnar storage engine optimizes OLAP performance.
+- Data consistency of HTAP: As a distributed and transactional key-value database, TiKV provides transactional interfaces with ACID compliance, and guarantees data consistency between multiple replicas and high availability with the implementation of the [Raft consensus algorithm](https://raft.github.io/raft.pdf). As a columnar storage extension of TiKV, TiFlash replicates data from TiKV in real time according to the Raft Learner consensus algorithm, which ensures that data is strongly consistent between TiKV and TiFlash.
+- Data isolation of HTAP: TiKV and TiFlash can be deployed on different machines as needed to solve the problem of HTAP resource isolation.
+- MPP computing engine: [MPP](/tiflash/use-tiflash.md#control-whether-to-select-the-mpp-mode) is a distributed computing framework provided by the TiFlash engine since TiDB 5.0, which allows data exchange between nodes and provides high-performance, high-throughput SQL algorithms. In the MPP mode, the run time of the analytic queries can be significantly reduced.
+
+## Steps
+
+In this document, you can experience the convenience and high performance of TiDB HTAP by querying example tables in the popular [TPC-H](http://www.tpc.org/tpch/) dataset.
+
+### Step 1. Deploy a local test environment 
+
+Before using TiDB HTAP, follow the steps in the [Quick Start Guide for the TiDB Database Platform](/quick-start-with-tidb.md) to prepare a local test environment, and run the following command to deploy a TiDB cluster:
+
+{{< copyable "shell-regular" >}}
+
+```shell
+tiup playground
+```
+
+> **Note:**
+>
+> The `tiup playground` command is ONLY for quick start, NOT for production.
+
+### Step 2. Prepare test data
+
+In the following steps, you can create a [TPC-H](http://www.tpc.org/tpch/) dataset as the test data to use TiDB HTAP. If you are interested in TPC-H, see [General Implementation Guidelines](http://tpc.org/tpc_documents_current_versions/pdf/tpc-h_v3.0.0.pdf).
+
+> **Note:**
+>
+> If you want to use your existing data for analytic queries, you can [migrate your data to TiDB](/migration-overview.md). If you want to design and create your own test data, you can create it by executing SQL statements or using related tools.
+
+1. Install the test data generation tool by running the following command:
+
+    {{< copyable "shell-regular" >}}
+
+    ```shell
+    tiup install bench
+    ```
+
+2. Generate the test data by running the following command:
+
+    {{< copyable "shell-regular" >}}
+
+    ```shell
+    tiup bench tpch --sf=1 prepare
+    ```
+
+    If the output of this command shows `Finished`, it indicates that the data is created.
+
+3. Execute the following SQL statement to view the generated data:
+
+    {{< copyable "sql" >}}
+
+    ```sql
+    SELECT 
+      CONCAT(table_schema,'.',table_name) AS 'Table Name', 
+      table_rows AS 'Number of Rows', 
+      FORMAT_BYTES(data_length) AS 'Data Size', 
+      FORMAT_BYTES(index_length) AS 'Index Size', 
+      FORMAT_BYTES(data_length+index_length) AS 'Total'
+    FROM 
+      information_schema.TABLES 
+    WHERE 
+      table_schema='test';
+    ```
+
+    As you can see from the output, eight tables are created in total, and the largest table has about 6.5 million rows (the exact row counts in your environment might differ, because the data is randomly generated).
+
+    ```sql
+    +---------------+----------------+-----------+------------+-----------+
+    |  Table Name   | Number of Rows | Data Size | Index Size |   Total   |
+    +---------------+----------------+-----------+------------+-----------+
+    | test.nation   |             25 | 2.44 KiB  | 0 bytes    | 2.44 KiB  |
+    | test.region   |              5 | 416 bytes | 0 bytes    | 416 bytes |
+    | test.part     |         200000 | 25.07 MiB | 0 bytes    | 25.07 MiB |
+    | test.supplier |          10000 | 1.45 MiB  | 0 bytes    | 1.45 MiB  |
+    | test.partsupp |         800000 | 120.17 MiB| 12.21 MiB  | 132.38 MiB|
+    | test.customer |         150000 | 24.77 MiB | 0 bytes    | 24.77 MiB |
+    | test.orders   |        1527648 | 174.40 MiB| 0 bytes    | 174.40 MiB|
+    | test.lineitem |        6491711 | 849.07 MiB| 99.06 MiB  | 948.13 MiB|
+    +---------------+----------------+-----------+------------+-----------+
+    8 rows in set (0.06 sec)
+    ```
+
+    This is a database of a commercial ordering system, in which the `test.nation` table stores information about countries, the `test.region` table stores information about regions, the `test.part` table stores information about parts, the `test.supplier` table stores information about suppliers, the `test.partsupp` table stores information about the parts of suppliers, the `test.customer` table stores information about customers, the `test.orders` table stores information about orders, and the `test.lineitem` table stores information about line items.
+
+### Step 3. Query data with the row-based storage engine
+
+To see the performance of TiDB with only the row-based storage engine, execute the following SQL statement:
+
+{{< copyable "sql" >}}
+
+```sql
+SELECT
+    l_orderkey,
+    SUM(
+        l_extendedprice * (1 - l_discount)
+    ) AS revenue,
+    o_orderdate,
+    o_shippriority
+FROM
+    customer,
+    orders,
+    lineitem
+WHERE
+    c_mktsegment = 'BUILDING'
+AND c_custkey = o_custkey
+AND l_orderkey = o_orderkey
+AND o_orderdate < DATE '1996-01-01'
+AND l_shipdate > DATE '1996-02-01'
+GROUP BY
+    l_orderkey,
+    o_orderdate,
+    o_shippriority
+ORDER BY
+    revenue DESC,
+    o_orderdate
+LIMIT 10;
+```
+
+This is a shipping priority query, which retrieves the shipping priority and potential revenue of the highest-revenue orders that had not been shipped as of a specified date. The potential revenue is defined as the sum of `l_extendedprice * (1-l_discount)`. The orders are listed in descending order of revenue. In this example, the query lists the top 10 unshipped orders by potential revenue.
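+
+To confirm which storage engine serves a query at this stage, you can inspect an execution plan. The following single-table statement is an illustrative assumption and not part of the original steps. Because no TiFlash replica exists yet, the `task` column of the `EXPLAIN` output is expected to show TiKV tasks (such as `cop[tikv]`), although the exact plan depends on your TiDB version.
+
+{{< copyable "sql" >}}
+
+```sql
+-- A simplified query on one of the generated tables, used only to inspect the plan.
+EXPLAIN SELECT COUNT(*)
+FROM test.lineitem
+WHERE l_shipdate > DATE '1996-02-01';
+```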
+
+### Step 4. Replicate the test data to the columnar storage engine
+
+After TiFlash is deployed, TiKV does not replicate data to TiFlash automatically. You need to execute the following DDL statements in a MySQL client connected to TiDB to specify which tables need to be replicated. After that, TiDB creates the specified replicas in TiFlash accordingly.
+
+{{< copyable "sql" >}}
+
+```sql
+ALTER TABLE test.customer SET TIFLASH REPLICA 1;
+ALTER TABLE test.orders SET TIFLASH REPLICA 1;
+ALTER TABLE test.lineitem SET TIFLASH REPLICA 1;
+```
+
+To check the replication status of the specific tables, execute the following statements:
+
+{{< copyable "sql" >}}
+
+```sql
+SELECT * FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 'customer';
+SELECT * FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 'orders';
+SELECT * FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 'lineitem';
+```
+
+In the result of the above statements:
+
+- `AVAILABLE` indicates whether the TiFlash replica of a specific table is available or not. `1` means available and `0` means unavailable. Once a replica becomes available, this status does not change any more. If you use DDL statements to modify the number of replicas, the replication status will be recalculated.
+- `PROGRESS` indicates the progress of the replication. The value is between `0.0` and `1.0`. `1.0` means that at least one replica is replicated.
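+
+If you prefer a single result set, the three checks above can be combined as sketched below. The column list is a selection made for readability. Only tables for which a TiFlash replica has been specified appear in the result.
+
+{{< copyable "sql" >}}
+
+```sql
+-- Check the replication status of all three tables in one statement.
+SELECT TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
+FROM information_schema.tiflash_replica
+WHERE TABLE_SCHEMA = 'test'
+  AND TABLE_NAME IN ('customer', 'orders', 'lineitem');
+```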
+
+### Step 5. Analyze data faster using HTAP
+
+Execute the SQL statements in [Step 3](#step-3-query-data-with-the-row-based-storage-engine) again, and you can see the performance of TiDB HTAP.
+
+For tables with TiFlash replicas, the TiDB optimizer automatically determines whether to use TiFlash replicas based on the cost estimation. To check whether a TiFlash replica is selected, you can use the `DESC` or `EXPLAIN ANALYZE` statement. For example:
+
+{{< copyable "sql" >}}
+
+```sql
+EXPLAIN ANALYZE SELECT
+    l_orderkey,
+    SUM(
+        l_extendedprice * (1 - l_discount)
+    ) AS revenue,
+    o_orderdate,
+    o_shippriority
+FROM
+    customer,
+    orders,
+    lineitem
+WHERE
+    c_mktsegment = 'BUILDING'
+AND c_custkey = o_custkey
+AND l_orderkey = o_orderkey
+AND o_orderdate < DATE '1996-01-01'
+AND l_shipdate > DATE '1996-02-01'
+GROUP BY
+    l_orderkey,
+    o_orderdate,
+    o_shippriority
+ORDER BY
+    revenue DESC,
+    o_orderdate
+LIMIT 10;
+```
+
+If the result of the `EXPLAIN` statement shows `ExchangeSender` and `ExchangeReceiver` operators, it indicates that the MPP mode has taken effect.
+
+In addition, you can specify that each part of the entire query is computed using only the TiFlash engine. For detailed information, see [Use TiDB to read TiFlash replicas](/tiflash/use-tiflash.md#use-tidb-to-read-tiflash-replicas).
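+
+As a rough sketch of what this can look like, the following statements show two options: an optimizer hint for a single table, and a session-level engine setting. The simplified query is an illustrative assumption, and the sketch assumes that `test` is the current database. See the linked document for the authoritative usage.
+
+{{< copyable "sql" >}}
+
+```sql
+USE test;
+
+-- Option 1: hint the optimizer to read the `lineitem` table from TiFlash.
+SELECT /*+ READ_FROM_STORAGE(TIFLASH[lineitem]) */ COUNT(*)
+FROM lineitem
+WHERE l_shipdate > DATE '1996-02-01';
+
+-- Option 2: let the current session read from TiFlash replicas only.
+SET SESSION tidb_isolation_read_engines = 'tiflash';
+
+SELECT COUNT(*)
+FROM lineitem
+WHERE l_shipdate > DATE '1996-02-01';
+
+-- Restore the default engines for the current session.
+SET SESSION tidb_isolation_read_engines = 'tikv,tiflash,tidb';
+```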
+
+You can compare query results and query performance of these two methods.
+
+## What's next
+
+- [Architecture of TiDB HTAP](/tiflash/tiflash-overview.md#architecture)
+- [Explore HTAP](/explore-htap.md)
+- [Use TiFlash](/tiflash/use-tiflash.md#use-tiflash)