add two HTAP documents #6206

Merged
Changes from 18 commits
2 changes: 2 additions & 0 deletions TOC.md
@@ -20,7 +20,9 @@
+ [Credits](/credits.md)
+ Quick Start
+ [Try Out TiDB](/quick-start-with-tidb.md)
+ [Try Out HTAP](/quick-start-with-htap.md)
+ [Learn TiDB SQL](/basic-sql-operations.md)
+ [Learn HTAP](/explore-htap.md)
+ [Import Example Database](/import-example-data.md)
+ Deploy
+ [Software and Hardware Requirements](/hardware-and-software-requirements.md)
4 changes: 3 additions & 1 deletion _index.md
@@ -26,8 +26,10 @@ Designed for the cloud, TiDB provides flexible scalability, reliability and secu
<NavColumn>
<ColumnTitle>Quick Start</ColumnTitle>

- [Quick Start Guide](/quick-start-with-tidb.md)
- [Quick Start with TiDB](/quick-start-with-tidb.md)
- [Quick Start with HTAP](/quick-start-with-htap.md)
- [Explore SQL with TiDB](/basic-sql-operations.md)
- [Explore HTAP](/explore-htap.md)

</NavColumn>

104 changes: 104 additions & 0 deletions explore-htap.md
@@ -0,0 +1,104 @@
---
title: Explore HTAP
summary: Learn how to explore and use the features of TiDB HTAP.
---

# Explore HTAP

This guide describes how to explore and use the features of TiDB Hybrid Transactional and Analytical Processing (HTAP).

> **Note:**
>
> If you are new to TiDB HTAP and want to start using it quickly, see [Quick start with HTAP](/quick-start-with-htap.md).

## Use cases

TiDB HTAP can handle massive, rapidly growing data, reduce the cost of DevOps, and be deployed easily either on-premises or in the cloud, which lets you realize the value of data assets in real time.

The following are the typical use cases of HTAP:

- Hybrid workload

When you use TiDB for real-time Online Analytical Processing (OLAP) in hybrid workload scenarios, you only need to provide an entry point to TiDB for your data. TiDB automatically selects the processing engine based on the specific business needs.

- Real-time stream processing

When you use TiDB in real-time stream processing scenarios, TiDB ensures that all the data that constantly flows in can be queried in real time. At the same time, TiDB can also handle highly concurrent data workloads and Business Intelligence (BI) queries.

- Data hub

When using TiDB as a data hub, TiDB can meet specific business needs by seamlessly connecting the data for the application and the data warehouse.

For more information about use cases of TiDB HTAP, see [blogs about HTAP on the PingCAP website](https://pingcap.com/blog-cn/#HTAP).

## Architecture

In TiDB, a row-based storage engine [TiKV](/tikv-overview.md) for Online Transactional Processing (OLTP) and a columnar storage engine [TiFlash](/tiflash/tiflash-overview.md) for Online Analytical Processing (OLAP) co-exist, replicate data automatically, and keep strong consistency.

For more information about the architecture, see [architecture of TiDB HTAP](/tiflash/tiflash-overview.md#architecture).

## Environment preparation

Before exploring the features of TiDB HTAP, you need to deploy TiDB and the corresponding storage engines according to the data volume. If the data volume is large (for example, 100 TB), it is recommended to use TiFlash Massively Parallel Processing (MPP) as the primary solution and TiSpark as the supplementary solution.

- TiFlash

- If you have deployed a TiDB cluster with no TiFlash node, add the TiFlash nodes in the current TiDB cluster. For detailed information, see [Scale out a TiFlash cluster](/scale-tidb-using-tiup.md#scale-out-a-tiflash-cluster).
- If you have not deployed a TiDB cluster, see [Deploy a TiDB Cluster using TiUP](/production-deployment-using-tiup.md). Based on the minimal TiDB topology, you also need to deploy the [topology of TiFlash](/tiflash-deployment-topology.md).
- When deciding on the number of TiFlash nodes, consider the following scenarios:

- If your workload is mainly OLTP with small-scale analytical processing, deploy one or several TiFlash nodes. They can dramatically increase the speed of analytical queries.
- If the OLTP throughput does not put significant pressure on the I/O usage of the TiFlash nodes, each TiFlash node can use more resources for computation, so the TiFlash cluster can scale nearly linearly. Tune the number of TiFlash nodes based on the expected performance and response time.
- If the OLTP throughput is relatively high (for example, the write or update throughput is higher than 10 million rows per hour), the I/O usage in TiKV and TiFlash becomes the bottleneck because of the limited write capacity of the network and physical disks, and hot write regions and hot read regions can form. In this case, the number of TiFlash nodes has a complex non-linear relationship with the volume of analytical processing, so you need to tune the number of TiFlash nodes based on the actual status of the system.

- TiSpark

- If your data needs to be analyzed with Spark, deploy TiSpark (Spark 3.x is not currently supported). For the specific process, see [TiSpark User Guide](/tispark-overview.md).

<!-- - Real-time stream processing
- If you want to build an efficient and easy-to-use real-time data warehouse with TiDB and Flink, you are welcome to participate in Apache Flink x TiDB meetups.-->
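
To make the "relatively high OLTP throughput" example above more concrete, the following sketch converts a measured write rate in rows per second into an hourly rate and compares it with the example threshold of 10 million per hour. The function and variable names are hypothetical, and the threshold is only the example figure from this document, not a hard limit.

{{< copyable "shell-regular" >}}

```shell
# Rule-of-thumb check based on the example threshold mentioned above.
# HIGH_WRITE_ROWS_PER_HOUR and is_high_write_rate are hypothetical names.
HIGH_WRITE_ROWS_PER_HOUR=10000000

# Usage: is_high_write_rate ROWS_PER_SECOND
# Succeeds (exit code 0) when the hourly rate exceeds the threshold.
is_high_write_rate() {
    rows_per_hour=$(( $1 * 3600 ))
    [ "$rows_per_hour" -gt "$HIGH_WRITE_ROWS_PER_HOUR" ]
}

# Example: 3000 rows/s is 10.8 million rows/hour, so this reports "high".
if is_high_write_rate 3000; then
    echo "high"
else
    echo "moderate"
fi
```

If the check reports a high rate, treat it as a hint to size TiFlash nodes based on the observed I/O usage rather than on a linear estimate.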

## Data preparation

After TiFlash is deployed, TiKV does not replicate data to TiFlash automatically. You need to manually specify which tables need to be replicated to TiFlash. After that, TiDB creates the corresponding TiFlash replicas.

- If there is no data in the TiDB cluster, migrate the data to TiDB first. For detailed information, see [data migration](/migration-overview.md).
- If the TiDB cluster already has the replicated data from upstream, after TiFlash is deployed, data replication does not automatically begin. You need to manually specify the tables to be replicated to TiFlash. For detailed information, see [Use TiFlash](/tiflash/use-tiflash.md).
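
As a minimal sketch of specifying the tables manually, the following helper prints one `ALTER TABLE ... SET TIFLASH REPLICA` statement per table. The helper name `tiflash_replica_ddl` and the database and table names in the usage comment are illustrative assumptions. The connection parameters in the comment assume a local test cluster that listens on the default port 4000.

{{< copyable "shell-regular" >}}

```shell
# Print one "ALTER TABLE ... SET TIFLASH REPLICA" statement per table.
# Usage: tiflash_replica_ddl DATABASE REPLICA_COUNT TABLE...
# (tiflash_replica_ddl is a hypothetical helper, not part of TiDB or TiUP.)
tiflash_replica_ddl() {
    db=$1
    count=$2
    shift 2
    for table in "$@"; do
        printf 'ALTER TABLE %s.%s SET TIFLASH REPLICA %s;\n' "$db" "$table" "$count"
    done
}

# Preview the statements for three example tables:
#   tiflash_replica_ddl test 1 customer orders lineitem
# Apply them by piping the output to a MySQL client (assumed connection settings):
#   tiflash_replica_ddl test 1 customer orders lineitem | mysql --host 127.0.0.1 --port 4000 -u root
```

Printing the statements first lets you review which tables get TiFlash replicas before anything is executed.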

## Data processing

With TiDB, you can simply enter SQL statements for query or write requests. For the tables with TiFlash replicas, TiDB uses the front-end optimizer to automatically choose the optimal execution plan.

> **Note:**
>
> The MPP mode of TiFlash is enabled by default. When an SQL statement is executed, TiDB automatically determines whether to run in the MPP mode through the optimizer.
>
> - To disable the MPP mode of TiFlash, set the value of the [tidb_allow_mpp](/system-variables.md#tidb_allow_mpp-new-in-v50) system variable to `OFF`.
> - To forcibly enable MPP mode of TiFlash for query execution, set the values of [tidb_allow_mpp](/system-variables.md#tidb_allow_mpp-new-in-v50) and [tidb_enforce_mpp](/system-variables.md#tidb_enforce_mpp-new-in-v51) to `ON`.
> - To check whether TiDB chooses the MPP mode to execute a specific query, see [Explain Statements in the MPP Mode](/explain-mpp.md#explain-statements-in-the-mpp-mode). If the output of the `EXPLAIN` statement includes the `ExchangeSender` and `ExchangeReceiver` operators, the MPP mode is in use.
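
The settings in the note above can be sketched as two small helpers. `mpp_session_sql` prints the session-level `SET` statements for disabling or forcing the MPP mode, and `mpp_in_use` checks saved `EXPLAIN` output for the two exchange operators. Both helper names are hypothetical. The sketch assumes that you save the `EXPLAIN` output as plain text, and `tidb_enforce_mpp` requires TiDB v5.1 or later.

{{< copyable "shell-regular" >}}

```shell
# Print the session-level SET statements for an MPP mode.
# Usage: mpp_session_sql off|force   (hypothetical helper)
mpp_session_sql() {
    case "$1" in
        off)
            echo "SET SESSION tidb_allow_mpp = OFF;"
            ;;
        force)
            echo "SET SESSION tidb_allow_mpp = ON;"
            echo "SET SESSION tidb_enforce_mpp = ON;"
            ;;
        *)
            echo "usage: mpp_session_sql off|force" >&2
            return 1
            ;;
    esac
}

# Read EXPLAIN output on stdin and succeed only if both the
# ExchangeSender and ExchangeReceiver operators appear in it.
mpp_in_use() {
    plan=$(cat)
    case "$plan" in
        *ExchangeSender*ExchangeReceiver*|*ExchangeReceiver*ExchangeSender*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example, assuming the EXPLAIN output was saved to plan.txt:
#   mpp_in_use < plan.txt && echo "MPP mode is in use"
```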

## Performance monitoring

When using TiDB, you can monitor the TiDB cluster status and performance metrics in either of the following ways:

- [TiDB Dashboard](/dashboard/dashboard-intro.md): you can see the overall running status of the TiDB cluster, analyze the distribution and trends of read and write traffic, and learn the detailed execution information of slow queries.
- [Monitoring system (Prometheus & Grafana)](/grafana-overview-dashboard.md): you can see the monitoring metrics of TiDB cluster-related components, including PD, TiDB, TiKV, TiFlash, TiCDC, and Node_exporter.

To see the alert rules of TiDB cluster and TiFlash cluster, see [TiDB cluster alert rules](/alert-rules.md) and [TiFlash alert rules](/tiflash/tiflash-alert-rules.md).

## Troubleshooting

If any issue occurs while you are using TiDB, refer to the following documents:

- [Analyze slow queries](/analyze-slow-queries.md)
- [Identify expensive queries](/identify-expensive-queries.md)
- [Troubleshoot hotspot issues](/troubleshoot-hot-spot-issues.md)
- [TiDB cluster troubleshooting guide](/troubleshoot-tidb-cluster.md)
- [Troubleshoot a TiFlash Cluster](/tiflash/troubleshoot-tiflash.md)

You are also welcome to create [GitHub Issues](https://github.com/pingcap/tiflash/issues) or submit your questions on [AskTUG](https://asktug.com/).

## What's next

- To check the TiFlash version, critical logs and system tables, see [Maintain a TiFlash cluster](/tiflash/maintain-tiflash.md).
- To remove a specific TiFlash node, see [Scale in a TiFlash cluster](/scale-tidb-using-tiup.md#scale-in-a-tiflash-cluster).
225 changes: 225 additions & 0 deletions quick-start-with-htap.md
@@ -0,0 +1,225 @@
---
title: Quick start with HTAP
summary: Learn how to quickly get started with TiDB HTAP.
---

# Quick Start Guide for TiDB HTAP

This guide walks you through the quickest way to get started with TiDB's one-stop solution of Hybrid Transactional and Analytical Processing (HTAP).

> **Note:**
>
> The steps provided in this guide are ONLY for quick start in a test environment. For production environments, see [Explore HTAP](/explore-htap.md).

## Basic concepts

Before using TiDB HTAP, you need to have some basic knowledge about [TiKV](/tikv-overview.md), a row-based storage engine for TiDB Online Transactional Processing (OLTP), and [TiFlash](/tiflash/tiflash-overview.md), a columnar storage engine for TiDB Online Analytical Processing (OLAP).

- Storage engines of HTAP: The row-based storage engine and the columnar storage engine co-exist for HTAP. Both storage engines can replicate data automatically and keep strong consistency. The row-based storage engine optimizes OLTP performance, and the columnar storage engine optimizes OLAP performance.
- Data consistency of HTAP: As a distributed and transactional key-value database, TiKV provides transactional interfaces with ACID compliance, and guarantees data consistency between multiple replicas and high availability with the implementation of the [Raft consensus algorithm](https://raft.github.io/raft.pdf). As a columnar storage extension of TiKV, TiFlash replicates data from TiKV in real time according to the Raft Learner consensus algorithm, which ensures that data is strongly consistent between TiKV and TiFlash.
- Data isolation of HTAP: TiKV and TiFlash can be deployed on different machines as needed to solve the problem of HTAP resource isolation.
- MPP computing engine: [MPP](/tiflash/use-tiflash.md#control-whether-to-select-the-mpp-mode) is a distributed computing framework provided by the TiFlash engine since TiDB 5.0, which allows data exchange between nodes and provides high-performance, high-throughput SQL algorithms. In the MPP mode, the run time of the analytic queries can be significantly reduced.

## Steps

In this document, you can experience the convenience and high performance of TiDB HTAP by querying an example table in a popular [TPC-H](http://www.tpc.org/tpch/) dataset.

### Step 1. Deploy a local test environment

Before using TiDB HTAP, follow the steps in the [Quick Start Guide for the TiDB Database Platform](/quick-start-with-tidb.md) to deploy a local test environment.

In [Quick Start Guide for the TiDB Database Platform](/quick-start-with-tidb.md):

- It is recommended to run `tiup playground` to deploy a TiDB cluster of the latest version. When you run the following command, 1 TiDB instance, 1 TiKV instance, 1 PD instance, and 1 TiFlash instance are deployed automatically:

{{< copyable "shell-regular" >}}

```shell
tiup playground
```

- If you want to specify the TiDB version and the number of the instances of each component, you need to also specify the number of the TiFlash instances as in the following example command:

{{< copyable "shell-regular" >}}

```shell
tiup playground v5.1.0 --db 2 --pd 3 --kv 3 --tiflash 1 --monitor
```

Review comments on this command:

- Contributor: `--monitor` is the default, isn't it? Shouldn't we use more than one TiFlash instance to show the benefits of MPP?
- Reply: The benefits of MPP need more TiFlash nodes.
- Collaborator: @LiSong0214 Do we need to increase the number of nodes to a larger number? If so, we might need a test on that too.

> **Note:**
>
> The `tiup playground` command is ONLY for quick start, NOT for production.

### Step 2. Prepare test data

In the following steps, you can create a [TPC-H](http://www.tpc.org/tpch/) dataset as the test data to use TiDB HTAP. If you are interested in TPC-H, see [General Implementation Guidelines](http://tpc.org/tpc_documents_current_versions/pdf/tpc-h_v3.0.0.pdf).

> **Note:**
>
> If you want to use your existing data for analytic queries, you can [migrate your data to TiDB](/migration-overview.md). If you want to design and create your own test data, you can create it by executing SQL statements or using related tools.

1. Install the test data generation tool by running the following command:

{{< copyable "shell-regular" >}}

```shell
tiup install bench
```

2. Generate the test data by running the following command:

{{< copyable "shell-regular" >}}

```shell
tiup bench tpch --sf=1 prepare
```

If the output of this command shows `Finished`, it indicates that the data is created.

3. Execute the following SQL statement to view the generated data:

{{< copyable "sql" >}}

```sql
SELECT
CONCAT(table_schema,'.',table_name) AS 'Table Name',
table_rows AS 'Number of Rows',
FORMAT_BYTES(data_length) AS 'Data Size',
FORMAT_BYTES(index_length) AS 'Index Size',
FORMAT_BYTES(data_length+index_length) AS 'Total'
FROM
information_schema.TABLES
WHERE
table_schema='test';
```

As you can see from the output, eight tables are created in total, and the largest table has about 6.5 million rows. Because the data is randomly generated, the number of rows that the tool creates might differ from this example, so refer to your actual query result.

```sql
+---------------+----------------+------------+------------+------------+
| Table Name    | Number of Rows | Data Size  | Index Size | Total      |
+---------------+----------------+------------+------------+------------+
| test.nation   |             25 | 2.44 KiB   | 0 bytes    | 2.44 KiB   |
| test.region   |              5 | 416 bytes  | 0 bytes    | 416 bytes  |
| test.part     |         200000 | 25.07 MiB  | 0 bytes    | 25.07 MiB  |
| test.supplier |          10000 | 1.45 MiB   | 0 bytes    | 1.45 MiB   |
| test.partsupp |         800000 | 120.17 MiB | 12.21 MiB  | 132.38 MiB |
| test.customer |         150000 | 24.77 MiB  | 0 bytes    | 24.77 MiB  |
| test.orders   |        1527648 | 174.40 MiB | 0 bytes    | 174.40 MiB |
| test.lineitem |        6491711 | 849.07 MiB | 99.06 MiB  | 948.13 MiB |
+---------------+----------------+------------+------------+------------+
8 rows in set (0.06 sec)
```

This is the database of a commercial ordering system, in which:

- The `test.nation` table stores information about countries.
- The `test.region` table stores information about regions.
- The `test.part` table stores information about parts.
- The `test.supplier` table stores information about suppliers.
- The `test.partsupp` table stores information about the parts of suppliers.
- The `test.customer` table stores information about customers.
- The `test.orders` table stores information about orders.
- The `test.lineitem` table stores information about the line items of orders.

### Step 3. Query data with the row-based storage engine

To see how TiDB performs with only the row-based storage engine, execute the following SQL statement:

{{< copyable "sql" >}}

```sql
SELECT
l_orderkey,
SUM(
l_extendedprice * (1 - l_discount)
) AS revenue,
o_orderdate,
o_shippriority
FROM
customer,
orders,
lineitem
WHERE
c_mktsegment = 'BUILDING'
AND c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND o_orderdate < DATE '1996-01-01'
AND l_shipdate > DATE '1996-02-01'
GROUP BY
l_orderkey,
o_orderdate,
o_shippriority
ORDER BY
revenue DESC,
o_orderdate
limit 10;
```

This is a shipping priority query that returns the shipping priority and potential revenue of the highest-revenue orders that have not been shipped as of a specified date. The potential revenue is defined as the sum of `l_extendedprice * (1-l_discount)`. The orders are listed in descending order of revenue. In this example, the query lists the 10 unshipped orders with the highest potential revenue.
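
To illustrate how the revenue column is computed, the following sketch applies the same formula, the sum of `l_extendedprice * (1 - l_discount)`, to two made-up line items of a single order. The helper name `potential_revenue` and the sample prices and discounts are hypothetical and are not taken from the TPC-H data.

{{< copyable "shell-regular" >}}

```shell
# Sum l_extendedprice * (1 - l_discount) over "price discount" lines on stdin.
# (potential_revenue is a hypothetical helper used only for illustration.)
potential_revenue() {
    awk '{ revenue += $1 * (1 - $2) } END { printf "%.2f\n", revenue }'
}

# Two made-up line items of one order:
#   100.00 at a 10% discount ->  90.00
#   200.00 at a  5% discount -> 190.00
printf '100.00 0.10\n200.00 0.05\n' | potential_revenue   # prints 280.00
```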

### Step 4. Replicate the test data to the columnar storage engine

After TiFlash is deployed, TiKV does not replicate data to TiFlash immediately. You need to execute the following DDL statements in a MySQL client of TiDB to specify which tables need to be replicated. After that, TiDB will create the specified replicas in TiFlash accordingly.

{{< copyable "sql" >}}

```sql
ALTER TABLE test.customer SET TIFLASH REPLICA 1;
ALTER TABLE test.orders SET TIFLASH REPLICA 1;
ALTER TABLE test.lineitem SET TIFLASH REPLICA 1;
```

To check the replication status of the specific tables, execute the following statements:

{{< copyable "sql" >}}

```sql
SELECT * FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = 'test' and TABLE_NAME = 'customer';
SELECT * FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = 'test' and TABLE_NAME = 'orders';
SELECT * FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = 'test' and TABLE_NAME = 'lineitem';
```

In the result of the above statements:

- `AVAILABLE` indicates whether the TiFlash replica of a specific table is available or not. `1` means available and `0` means unavailable. Once a replica becomes available, this status does not change any more. If you use DDL statements to modify the number of replicas, the replication status will be recalculated.
- `PROGRESS` indicates the progress of the replication. The value is between `0.0` and `1.0`. `1.0` means that at least one replica has been replicated.
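
If you want to script this check, the following sketch reads tab-separated `TABLE_NAME`, `AVAILABLE`, and `PROGRESS` rows and succeeds only when every listed replica is available. The helper name `all_replicas_available` is hypothetical. The `mysql` command in the comment assumes a local playground cluster on the default port 4000 with the `root` user and no password.

{{< copyable "shell-regular" >}}

```shell
# Read "TABLE_NAME<TAB>AVAILABLE<TAB>PROGRESS" rows on stdin. Succeed only when
# at least one row is present and every row has AVAILABLE = 1.
# (all_replicas_available is a hypothetical helper.)
all_replicas_available() {
    awk -F '\t' '
        $2 != 1 { pending++; print "not ready: " $1 " (progress " $3 ")" }
        END     { if (NR == 0 || pending > 0) exit 1 }
    '
}

# Example (assumed connection settings for a local test cluster):
#   mysql --host 127.0.0.1 --port 4000 -u root --batch --skip-column-names \
#       -e "SELECT TABLE_NAME, AVAILABLE, PROGRESS FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = 'test'" \
#       | all_replicas_available && echo "all TiFlash replicas are available"
```

Selecting the three columns explicitly keeps the helper independent of the column order of the system table.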

### Step 5. Analyze data faster using HTAP

Execute the SQL statements in [Step 3](#step-3-query-data-with-the-row-based-storage-engine) again, and you can see the performance of TiDB HTAP.

For tables with TiFlash replicas, the TiDB optimizer automatically determines whether to use TiFlash replicas based on the cost estimation. To check whether or not a TiFlash replica is selected, you can use the `desc` or `explain analyze` statement. For example:

{{< copyable "sql" >}}

```sql
explain analyze SELECT
l_orderkey,
SUM(
l_extendedprice * (1 - l_discount)
) AS revenue,
o_orderdate,
o_shippriority
FROM
customer,
orders,
lineitem
WHERE
c_mktsegment = 'BUILDING'
AND c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND o_orderdate < DATE '1996-01-01'
AND l_shipdate > DATE '1996-02-01'
GROUP BY
l_orderkey,
o_orderdate,
o_shippriority
ORDER BY
revenue DESC,
o_orderdate
limit 10;
```

If the result of the `EXPLAIN` statement shows `ExchangeSender` and `ExchangeReceiver` operators, it indicates that the MPP mode has taken effect.
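
To check which storage engine a plan reads from, you can also look at the task column of the plan. The following sketch assumes that the `EXPLAIN` output has been saved as text and that the task column contains markers such as `cop[tikv]`, `cop[tiflash]`, or `mpp[tiflash]`. The helper name `plan_engines` is hypothetical.

{{< copyable "shell-regular" >}}

```shell
# Read EXPLAIN output on stdin and print which storage engines the plan reads:
# "tikv", "tiflash", "tikv tiflash", or "none".
# (plan_engines is a hypothetical helper; it only searches for the
# "[tikv]" and "[tiflash]" markers in the text.)
plan_engines() {
    plan=$(cat)
    engines=""
    case "$plan" in *"[tikv]"*)    engines="tikv" ;; esac
    case "$plan" in *"[tiflash]"*) engines="${engines:+$engines }tiflash" ;; esac
    echo "${engines:-none}"
}

# Example, assuming the EXPLAIN output was saved to plan.txt:
#   plan_engines < plan.txt
```

If the result is `tikv` only, the optimizer did not select a TiFlash replica for this query.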

In addition, you can specify that each part of the entire query is computed using only the TiFlash engine. For detailed information, see [Use TiDB to read TiFlash replicas](/tiflash/use-tiflash.md#use-tidb-to-read-tiflash-replicas).

You can compare query results and query performance of these two methods.

## What's next

- [Architecture of TiDB HTAP](/tiflash/tiflash-overview.md#architecture)
- [Explore HTAP](/explore-htap.md)
- [Use TiFlash](/tiflash/use-tiflash.md#use-tiflash)