
Optimize the Performance of Bulk Deletion #18028

Closed
scsldb opened this issue Jun 15, 2020 · 14 comments
Assignees
Labels
feature/accepted This feature request is accepted by product managers priority/P0 The issue has P0 priority. type/feature-request Categorizes issue or PR as related to a new feature.
Milestone

Comments

@scsldb

scsldb commented Jun 15, 2020

Description

Optimize the performance and efficiency of bulk deletion, a typical query is:

delete from table where id < ?

Category

Feature, Stability

Value

Implement special SQL in the session that uses an offset to improve the performance of delete from table where t < xxx (run in the background, non-atomically).

Time

GanttStart: 2020-07-31
GanttDue: 2020-08-15

@scsldb scsldb added type/feature-request Categorizes issue or PR as related to a new feature. priority/P0 The issue has P0 priority. labels Jun 15, 2020
@scsldb scsldb added this to the v5.0-alpha milestone Jun 15, 2020
@zhangjinpeng87
Contributor

The original scenario should be

affect_rows = 1
while affect_rows > 0 {
    affect_rows = `delete from table where t < xxx limit 5000`
}
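The loop above can be sketched as a small runnable function. Here `run_delete` is a hypothetical callable (not a TiDB API) that executes one bounded DELETE and returns the affected-row count, e.g. `cursor.rowcount` after a commit; the fake executor below just simulates a 12,000-row table so the sketch runs without a database.

```python
# Sketch of the batched-delete loop above, assuming a hypothetical
# `run_delete(limit)` callable that executes one `delete ... limit N`
# statement and returns the number of rows it deleted.
def delete_in_batches(run_delete, batch_size=5000):
    """Repeat the bounded DELETE until no rows are affected; return total deleted."""
    total = 0
    affected = run_delete(batch_size)
    while affected > 0:
        total += affected
        affected = run_delete(batch_size)
    return total

# Fake executor simulating a 12,000-row table (no database required).
remaining = [12000]
def fake_delete(limit):
    n = min(limit, remaining[0])
    remaining[0] -= n
    return n

total = delete_in_batches(fake_delete, 5000)  # three batches: 5000 + 5000 + 2000
```

Against a real cluster, `run_delete` would execute the bounded DELETE and commit on each iteration, so every transaction stays small.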

@zhangjinpeng87
Contributor

As a first step, we can test the performance with the coprocessor cache enabled.

@zhangjinpeng87
Contributor

zhangjinpeng87 commented Jul 9, 2020

#18308 may be a common solution for this issue. We can also enable cop cache
at the same time if the transaction is extremely huge.

@zz-jason zz-jason changed the title Optimize performance and efficiency of bulk deletion - delete * from table where id < ? Optimize the Performance of Bulk Deletion Jul 11, 2020
@zz-jason
Member

@SunRunAway PTAL. If we rely on coprocessor cache to speed up this scenario, we should make coprocessor cache generally available at least in 5.0.0.

BTW, we can also brainstorm whether there are other methods to speed up this scenario.

@zz-jason
Member

As a first step, we can test the performance with the coprocessor cache enabled.

We need to construct a bulk delete scenario to reproduce the performance issue.

@zz-jason
Member

zz-jason commented Jul 13, 2020

Another solution is to separate all the data into "pages", calculate the page boundaries, delete the data "page" by "page":

create a table with shard_row_id_bits set to 4 and load some data:

create table t(a bigint, b bigint) shard_row_id_bits = 4;
insert into t values(1, 1);
insert into t select * from t;
insert into t select * from t;
insert into t select * from t;
insert into t select * from t;
insert into t select * from t;
insert into t select * from t;
insert into t select * from t;
insert into t select * from t;

separate all the data into "pages" and calculate the page bounds; in the following example, each "page" contains 5 rows:

set @batch_size = 5;

select
    min(tmp._tidb_rowid) as start_rowid,
    max(tmp._tidb_rowid) as end_rowid,
    count(*) as page_size
from (
    select
        *,
        _tidb_rowid as _tidb_rowid,
        row_number() over (order by t._tidb_rowid) as row_num
    from t
) tmp
group by floor((tmp.row_num - 1) / @batch_size)
order by start_rowid;

the query result of the above SQL is:

+---------------------+---------------------+-----------+
| start_rowid         | end_rowid           | page_size |
+---------------------+---------------------+-----------+
|                   5 |  576460752303423505 |         5 |
|  576460752303423506 |  576460752303423510 |         5 |
|  576460752303423511 |  576460752303423515 |         5 |
|  576460752303423516 |  576460752303423520 |         5 |
| 1152921504606846979 | 2882303761517117475 |         5 |
| 2882303761517117476 | 2882303761517117480 |         5 |
| 2882303761517117481 | 2882303761517117485 |         5 |
| 2882303761517117486 | 2882303761517117490 |         5 |
| 2882303761517117491 | 2882303761517117495 |         5 |
| 2882303761517117496 | 2882303761517117500 |         5 |
| 2882303761517117501 | 2882303761517117505 |         5 |
| 2882303761517117506 | 2882303761517117510 |         5 |
| 2882303761517117511 | 2882303761517117515 |         5 |
| 2882303761517117516 | 2882303761517117520 |         5 |
| 2882303761517117521 | 2882303761517117525 |         5 |
| 2882303761517117526 | 2882303761517117530 |         5 |
| 2882303761517117531 | 2882303761517117535 |         5 |
| 2882303761517117536 | 2882303761517117540 |         5 |
| 2882303761517117541 | 2882303761517117545 |         5 |
| 2882303761517117546 | 2882303761517117550 |         5 |
| 2882303761517117551 | 2882303761517117555 |         5 |
| 2882303761517117556 | 2882303761517117560 |         5 |
| 2882303761517117561 | 2882303761517117565 |         5 |
| 2882303761517117566 | 4611686018427387913 |         5 |
| 4611686018427387914 | 4611686018427387918 |         5 |
| 4611686018427387919 | 5764607523034235011 |         5 |
| 5764607523034235012 | 5764607523034235016 |         5 |
| 5764607523034235017 | 5764607523034235021 |         5 |
| 5764607523034235022 | 5764607523034235026 |         5 |
| 5764607523034235027 | 5764607523034235031 |         5 |
| 5764607523034235032 | 5764607523034235036 |         5 |
| 5764607523034235037 | 5764607523034235041 |         5 |
| 5764607523034235042 | 5764607523034235046 |         5 |
| 5764607523034235047 | 5764607523034235051 |         5 |
| 5764607523034235052 | 5764607523034235056 |         5 |
| 5764607523034235057 | 5764607523034235061 |         5 |
| 5764607523034235062 | 5764607523034235066 |         5 |
| 5764607523034235067 | 5764607523034235071 |         5 |
| 5764607523034235072 | 5764607523034235076 |         5 |
| 5764607523034235077 | 5764607523034235081 |         5 |
| 5764607523034235082 | 5764607523034235086 |         5 |
| 5764607523034235087 | 5764607523034235091 |         5 |
| 5764607523034235092 | 5764607523034235096 |         5 |
| 5764607523034235097 | 5764607523034235101 |         5 |
| 5764607523034235102 | 5764607523034235106 |         5 |
| 5764607523034235107 | 5764607523034235111 |         5 |
| 5764607523034235112 | 5764607523034235116 |         5 |
| 5764607523034235117 | 5764607523034235121 |         5 |
| 5764607523034235122 | 5764607523034235126 |         5 |
| 5764607523034235127 | 5764607523034235131 |         5 |
| 5764607523034235132 | 5764607523034235136 |         5 |
| 7493989779944505346 | 7493989779944505346 |         1 |
+---------------------+---------------------+-----------+
52 rows in set (0.00 sec)

delete them page by page:

delete from t where _tidb_rowid >=                   5 and _tidb_rowid <=  576460752303423505;
delete from t where _tidb_rowid >=  576460752303423506 and _tidb_rowid <=  576460752303423510;
delete from t where _tidb_rowid >=  576460752303423511 and _tidb_rowid <=  576460752303423515;
delete from t where _tidb_rowid >=  576460752303423516 and _tidb_rowid <=  576460752303423520;
delete from t where _tidb_rowid >= 1152921504606846979 and _tidb_rowid <= 2882303761517117475;
delete from t where _tidb_rowid >= 2882303761517117476 and _tidb_rowid <= 2882303761517117480;
delete from t where _tidb_rowid >= 2882303761517117481 and _tidb_rowid <= 2882303761517117485;
delete from t where _tidb_rowid >= 2882303761517117486 and _tidb_rowid <= 2882303761517117490;
delete from t where _tidb_rowid >= 2882303761517117491 and _tidb_rowid <= 2882303761517117495;
delete from t where _tidb_rowid >= 2882303761517117496 and _tidb_rowid <= 2882303761517117500;
delete from t where _tidb_rowid >= 2882303761517117501 and _tidb_rowid <= 2882303761517117505;
delete from t where _tidb_rowid >= 2882303761517117506 and _tidb_rowid <= 2882303761517117510;
delete from t where _tidb_rowid >= 2882303761517117511 and _tidb_rowid <= 2882303761517117515;
delete from t where _tidb_rowid >= 2882303761517117516 and _tidb_rowid <= 2882303761517117520;
delete from t where _tidb_rowid >= 2882303761517117521 and _tidb_rowid <= 2882303761517117525;
delete from t where _tidb_rowid >= 2882303761517117526 and _tidb_rowid <= 2882303761517117530;
delete from t where _tidb_rowid >= 2882303761517117531 and _tidb_rowid <= 2882303761517117535;
delete from t where _tidb_rowid >= 2882303761517117536 and _tidb_rowid <= 2882303761517117540;
delete from t where _tidb_rowid >= 2882303761517117541 and _tidb_rowid <= 2882303761517117545;
delete from t where _tidb_rowid >= 2882303761517117546 and _tidb_rowid <= 2882303761517117550;
delete from t where _tidb_rowid >= 2882303761517117551 and _tidb_rowid <= 2882303761517117555;
delete from t where _tidb_rowid >= 2882303761517117556 and _tidb_rowid <= 2882303761517117560;
delete from t where _tidb_rowid >= 2882303761517117561 and _tidb_rowid <= 2882303761517117565;
delete from t where _tidb_rowid >= 2882303761517117566 and _tidb_rowid <= 4611686018427387913;
delete from t where _tidb_rowid >= 4611686018427387914 and _tidb_rowid <= 4611686018427387918;
delete from t where _tidb_rowid >= 4611686018427387919 and _tidb_rowid <= 5764607523034235011;
delete from t where _tidb_rowid >= 5764607523034235012 and _tidb_rowid <= 5764607523034235016;
delete from t where _tidb_rowid >= 5764607523034235017 and _tidb_rowid <= 5764607523034235021;
delete from t where _tidb_rowid >= 5764607523034235022 and _tidb_rowid <= 5764607523034235026;
delete from t where _tidb_rowid >= 5764607523034235027 and _tidb_rowid <= 5764607523034235031;
delete from t where _tidb_rowid >= 5764607523034235032 and _tidb_rowid <= 5764607523034235036;
delete from t where _tidb_rowid >= 5764607523034235037 and _tidb_rowid <= 5764607523034235041;
delete from t where _tidb_rowid >= 5764607523034235042 and _tidb_rowid <= 5764607523034235046;
delete from t where _tidb_rowid >= 5764607523034235047 and _tidb_rowid <= 5764607523034235051;
delete from t where _tidb_rowid >= 5764607523034235052 and _tidb_rowid <= 5764607523034235056;
delete from t where _tidb_rowid >= 5764607523034235057 and _tidb_rowid <= 5764607523034235061;
delete from t where _tidb_rowid >= 5764607523034235062 and _tidb_rowid <= 5764607523034235066;
delete from t where _tidb_rowid >= 5764607523034235067 and _tidb_rowid <= 5764607523034235071;
delete from t where _tidb_rowid >= 5764607523034235072 and _tidb_rowid <= 5764607523034235076;
delete from t where _tidb_rowid >= 5764607523034235077 and _tidb_rowid <= 5764607523034235081;
delete from t where _tidb_rowid >= 5764607523034235082 and _tidb_rowid <= 5764607523034235086;
delete from t where _tidb_rowid >= 5764607523034235087 and _tidb_rowid <= 5764607523034235091;
delete from t where _tidb_rowid >= 5764607523034235092 and _tidb_rowid <= 5764607523034235096;
delete from t where _tidb_rowid >= 5764607523034235097 and _tidb_rowid <= 5764607523034235101;
delete from t where _tidb_rowid >= 5764607523034235102 and _tidb_rowid <= 5764607523034235106;
delete from t where _tidb_rowid >= 5764607523034235107 and _tidb_rowid <= 5764607523034235111;
delete from t where _tidb_rowid >= 5764607523034235112 and _tidb_rowid <= 5764607523034235116;
delete from t where _tidb_rowid >= 5764607523034235117 and _tidb_rowid <= 5764607523034235121;
delete from t where _tidb_rowid >= 5764607523034235122 and _tidb_rowid <= 5764607523034235126;
delete from t where _tidb_rowid >= 5764607523034235127 and _tidb_rowid <= 5764607523034235131;
delete from t where _tidb_rowid >= 5764607523034235132 and _tidb_rowid <= 5764607523034235136;
delete from t where _tidb_rowid >= 7493989779944505346 and _tidb_rowid <= 7493989779944505346;

With the filters on _tidb_rowid, we only need to scan the desired key-value pairs, which should improve performance:

TiDB(root@127.0.0.1:test) > explain delete from t where _tidb_rowid >=                   5 and _tidb_rowid <=  576460752303423505;
+--------------------------+---------+-----------+---------------+--------------------------------------------------------------+
| id                       | estRows | task      | access object | operator info                                                |
+--------------------------+---------+-----------+---------------+--------------------------------------------------------------+
| Delete_4                 | N/A     | root      |               | N/A                                                          |
| └─TableReader_7          | 256.00  | root      |               | data:TableRangeScan_6                                        |
|   └─TableRangeScan_6     | 256.00  | cop[tikv] | table:t       | range:[5,576460752303423505], keep order:false, stats:pseudo |
+--------------------------+---------+-----------+---------------+--------------------------------------------------------------+
3 rows in set (0.00 sec)

We can also delete the "pages" in parallel if faster execution is required.
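The paging step can also be sketched in Python, assuming the sorted rowids have already been fetched; the function names below are illustrative, not an existing tool:

```python
# Sketch: split sorted rowids into fixed-size "pages" and emit one bounded
# DELETE per page, mirroring the SQL above.
def page_bounds(rowids, page_size):
    """Return (start_rowid, end_rowid, page_size) for each page."""
    rowids = sorted(rowids)
    chunks = (rowids[i:i + page_size] for i in range(0, len(rowids), page_size))
    return [(c[0], c[-1], len(c)) for c in chunks]

def delete_statements(pages, table="t"):
    """Render one range-bounded DELETE per page."""
    return [
        "delete from %s where _tidb_rowid >= %d and _tidb_rowid <= %d;" % (table, start, end)
        for start, end, _ in pages
    ]

pages = page_bounds(range(1, 12), 5)   # 11 rowids, pages of 5 rows each
statements = delete_statements(pages)
```

Because the page ranges are disjoint, each statement could be handed to a separate worker connection for parallel deletion.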


@zz-jason zz-jason assigned zz-jason and SunRunAway and unassigned zz-jason Jul 14, 2020
@zz-jason zz-jason added the feature/accepted This feature request is accepted by product managers label Jul 29, 2020
@SunRunAway SunRunAway assigned lzmhhh123 and unassigned SunRunAway Sep 10, 2020
@lzmhhh123
Contributor

I have tested continuous deletion with and without the coprocessor cache.

The environment is:

  • 3 TiDB, 3 PD, 3 TiKV, 3 TiFlash
  • version v4.0.6
  • TPC-H scale factor 10, lineitem table

The script is:

#!/usr/bin/env python
# encoding=utf-8
import mysql.connector

conn = mysql.connector.connect(host='172.16.5.85', port=8092, user='root',
                               password='', database='tpch_10')

# Repeatedly delete up to 10,000 rows per transaction.
for i in range(2000):
    cursor = conn.cursor()
    cursor.execute("delete from lineitem limit 10000")
    print(cursor.rowcount, i)
    conn.commit()
    cursor.close()

conn.close()
print("finished")

From the Grafana monitoring, the result is:

  • TiDB with coprocessor cache:
    (Grafana screenshot)

  • TiDB without coprocessor cache:
    (Grafana screenshot)

With the coprocessor cache turned on, p99 is about 250ms, p999 is about 500ms, and the latency is very stable. With the coprocessor cache turned off, p99 is about 1s and p999 is about 2s; moreover, the script got a TiKV server timeout error halfway through. So in my opinion, turning on the coprocessor cache could, to a great extent, resolve this issue.

@lzmhhh123
Contributor

For now, we could turn on the copr-cache as a temporary solution to the problem. If we need further improvement in the future, I will reopen the issue.

@zz-jason
Member

@lzmhhh123 could you summarize the time to delete all the data in the table with and without the coprocessor cache?

@zz-jason
Member

zz-jason commented Sep 24, 2020

Could you also summarize the following questions:

  • What's the coprocessor cache hit ratio when the coprocessor cache is enabled?
  • What's the memory usage of coprocessor cache enabled and disabled respectively?
  • Is there any scenario in which the coprocessor cache cannot speed up the bulk deletion?

@lzmhhh123
Contributor

lzmhhh123 commented Sep 24, 2020

@lzmhhh123 could you summarize the time to delete all the data in the table with and without the coprocessor cache?

I only deleted 20M rows of data; the table has 60M rows in total. With the coprocessor cache turned on, it took about 6 minutes. With the coprocessor cache turned off, it took 9 minutes to delete 9M rows and then hit a TiKV server timeout error.

  • What's the coprocessor cache hit ratio when the coprocessor cache is enabled?

There is no accurate value yet, but I'd guess it exceeds 90%.

  • What's the memory usage of coprocessor cache enabled and disabled respectively?

Here is the memory usage:
(Grafana screenshot)

  • with copr-cache: 11:08-11:14, memory increased from 800MB to 1.2GB
  • without copr-cache: 11:34-11:43, memory stayed around 250MB+

By the way, the large memory cost in the middle period is due to loading data.

  • Is there any scenario in which the coprocessor cache cannot speed up the bulk deletion?

Yes: when the user manually maintains the deleted range for each iteration, every DELETE scans a different range, so the coprocessor cache cannot be hit.

@zz-jason
Member

I only deleted 20M rows of data; the table has 60M rows in total. With the coprocessor cache turned on, it took about 6 minutes.

Could you add more tests to help users understand the deletion speed on different data sizes and the best batch size for each delete transaction?

  1. For different data sizes, how about testing 1, 10, and 100 million records?
  2. For the best batch size (N in limit N), how about testing 1000, 10000, 100000, and other batch sizes to find which works best?

Finally, we can add a new best practice document about the bulk deletion.
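One way to run this test matrix is a small harness that reloads the data and times a full batched deletion for each candidate batch size. `run_delete_batch` and `reload_data` are hypothetical hooks into the test environment, and the stub below stands in for a real cluster so the sketch is runnable:

```python
import time

# Sketch of a benchmark harness for the test matrix above. `run_delete_batch`
# and `reload_data` are hypothetical hooks, not an existing tool.
def benchmark_batch_sizes(run_delete_batch, reload_data,
                          batch_sizes=(1000, 10000, 100000)):
    """Time a full batched deletion for each candidate batch size."""
    results = {}
    for n in batch_sizes:
        reload_data()                      # restore the test data set
        start = time.monotonic()
        while run_delete_batch(n) > 0:     # loop until the table is empty
            pass
        results[n] = time.monotonic() - start
    return results

# Stub environment simulating a 30,000-row table (no cluster required).
state = {"rows": 0}
def reload_data():
    state["rows"] = 30000
def fake_delete(limit):
    n = min(limit, state["rows"])
    state["rows"] -= n
    return n

timings = benchmark_batch_sizes(fake_delete, reload_data, (1000, 10000))
```

The same harness could be repeated at 1, 10, and 100 million records to fill in the whole matrix.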

There is not an accurate value now. However, it may exceed 90% I guess.

It's better to add some metrics to observe the exact number. BTW, do we have metrics for the memory consumption of the coprocessor cache?

Yes, the user manually maintains the deleted range for each iteration.

It's important for users to know which scenarios can benefit from the coprocessor cache and which cannot. Could you summarize these scenarios with some examples and experiments?

@lzmhhh123
Contributor

lzmhhh123 commented Sep 24, 2020

@zz-jason The metrics have been added; the PR is under review. For the best practice, I will do more experiments and publish an article about it.
