
Support incremental commit of INSERT...SELECT #18038

Open
kolbe opened this issue Jun 15, 2020 · 12 comments
Assignees
Labels
feature/discussing This feature request is discussing among product managers type/feature-request Categorizes issue or PR as related to a new feature.

Comments

@kolbe
Contributor

kolbe commented Jun 15, 2020

Feature Request

Is your feature request related to a problem? Please describe:

TiDB does not handle large transactions well.

INSERT...SELECT is a common and generally inexpensive mechanism used by MySQL applications to insert a large amount of data from one table (or query result) into another table. This does not work well in TiDB if there are very many rows in the transaction.

Describe the feature you'd like:

A solution to support large INSERT...SELECT transactions in TiDB could be to automatically commit the running transaction periodically so that the transaction does not grow to be too large.

I'd like to see support for this in 2 forms:

  1. A per-statement flag that could be added to the SQL itself, like this:

    INSERT INTO tbl1 SELECT * from tbl2 /*+ TIDB_INCREMENTAL_COMMIT(10000) */

  2. A system variable that can be set either as a SESSION or GLOBAL variable (with a corresponding configuration variable to set the GLOBAL value on startup):

    SET GLOBAL tidb_incremental_commit=10000;

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Migration Strategy:

Users migrating from MySQL might run into big problems in TiDB because they're using this idiom. Adding support for this could make it easier for those users to migrate their applications to TiDB.

If this feature is implemented, it should be considered whether it applies to all statements or only specific types of statements (LOAD DATA, INSERT...SELECT, etc.).
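For context, the manual workaround that this feature would automate looks roughly like the following sketch: splitting one large INSERT...SELECT into key-ranged statements, each committed separately under autocommit. Table and column names here are illustrative, assuming the source table has an integer primary key `id`.

```sql
-- Manual equivalent of the proposed incremental commit, batch size 10000.
-- Each statement is its own transaction under autocommit, so no single
-- transaction grows too large.
INSERT INTO tbl1 SELECT * FROM tbl2 WHERE id >= 0     AND id < 10000;
INSERT INTO tbl1 SELECT * FROM tbl2 WHERE id >= 10000 AND id < 20000;
-- ...repeat until the key range of tbl2 is exhausted.
```

The proposed flag or sysvar would do this chunking server-side, without requiring a convenient key column or client-side orchestration.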

@kolbe kolbe added the type/feature-request Categorizes issue or PR as related to a new feature. label Jun 15, 2020
@scsldb scsldb added the feature/reviewing This feature request is reviewing by product managers label Jul 16, 2020
@zz-jason
Member

A solution to support large INSERT...SELECT transactions in TiDB could be to automatically commit the running transaction periodically so that the transaction does not grow to be too large.

"Commit the running transaction periodically" would break the atomicity of the original transaction. Is that acceptable to the user? This seems to be the same as #18028: although its issue summary is about "performance", the original problem that #18028 wants to solve is similar to this issue.

Is #18028 (comment) a workaround for users?

@zz-jason
Member

zz-jason commented Jul 20, 2020

For now:

@ghost

ghost commented Jul 22, 2020

  • A per-statement flag that could be added to the SQL itself, like this:
    INSERT INTO tbl1 SELECT * from tbl2 /*+ TIDB_INCREMENTAL_COMMIT(10000) */

As an alternative, may I suggest the SET_VAR hint? It introduces per-statement scoped variables without needing additional syntax.

Edit: See #18748 for SET_VAR hint.
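To illustrate, a SET_VAR-based version of the original proposal might look like this. This is a sketch assuming a batch-size variable (here `tidb_dml_batch_size`) were made SET_VAR-hintable, which is not yet the case:

```sql
-- Hypothetical: per-statement incremental commit via SET_VAR,
-- assuming tidb_dml_batch_size becomes hintable.
INSERT /*+ SET_VAR(tidb_dml_batch_size=10000) */ INTO tbl1
SELECT * FROM tbl2;
```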

@zz-jason zz-jason added feature/discussing This feature request is discussing among product managers and removed feature/reviewing This feature request is reviewing by product managers labels Aug 10, 2020
@scsldb

scsldb commented Aug 28, 2020

We need to investigate the user scenario in depth, and then make a decision after the research is completed.

@ghost

ghost commented Sep 4, 2020

Note that LOAD DATA already has this behavior, although no longer by default. See: #18807

The setting to make it incremental commit is called tidb_dml_batch_size.
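For reference, opting back in to incremental commit for LOAD DATA looks like the following. The file path and table schema are illustrative:

```sql
-- Session-scoped: commit every 20000 rows instead of one big transaction.
SET tidb_dml_batch_size = 20000;
LOAD DATA LOCAL INFILE '/tmp/t1.csv' INTO TABLE t1
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
```

This issue asks for the same batching behavior to cover INSERT...SELECT.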

@ghost

ghost commented Sep 6, 2020

I hit this today while trying to create examples for partitioning on moderate-sized tables. Because TiDB supports neither adding partitioning to nor removing partitioning from existing tables, this requires an export and then a reload:

mysql> ALTER TABLE t1 PARTITION BY RANGE ( YEAR(`start_date`) ) (
    ->   PARTITION `p2010` VALUES LESS THAN (2011),
    ->   PARTITION `p2011` VALUES LESS THAN (2012),
    ->   PARTITION `p2012` VALUES LESS THAN (2013),
    ->   PARTITION `p2013` VALUES LESS THAN (2014),
    ->   PARTITION `p2014` VALUES LESS THAN (2015),
    ->   PARTITION `p2015` VALUES LESS THAN (2016),
    ->   PARTITION `p2016` VALUES LESS THAN (2017),
    ->   PARTITION `p2017` VALUES LESS THAN (2018),
    ->   PARTITION `p2018` VALUES LESS THAN (2019),
    ->   PARTITION `p2019` VALUES LESS THAN (2020),
    ->   PARTITION `pmax` VALUES LESS THAN (MAXVALUE)
    -> );
ERROR 1105 (HY000): alter table partition is unsupported

mysql> ALTER TABLE trips_partitioned REMOVE PARTITIONING;
ERROR 8200 (HY000): Unsupported remove partitioning

It is not user-friendly to show examples that require a local dumper as an in-between step. It would obviously be optimal for TiDB to support all DDL variants, but since that could take some time, incremental commit is important because it provides a viable workaround.
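The server-side version of that export/reload would be to create a partitioned copy and populate it with INSERT...SELECT, which is exactly the statement that needs incremental commit on a large table. The schema below is a minimal illustration, not the actual table from the example above:

```sql
-- Hypothetical minimal schema with a date column to partition on.
CREATE TABLE t1_part (
  id BIGINT NOT NULL,
  start_date DATE NOT NULL,
  PRIMARY KEY (id, start_date)
) PARTITION BY RANGE ( YEAR(start_date) ) (
  PARTITION p2019 VALUES LESS THAN (2020),
  PARTITION pmax  VALUES LESS THAN (MAXVALUE)
);

-- The step that fails today when t1 is large, and that incremental
-- commit would make viable:
INSERT INTO t1_part SELECT id, start_date FROM t1;
```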

@ghost

ghost commented Oct 21, 2020

Here is another use case:

This feature combines very well with a time-travel query ( #18672 ), so you can effectively INSERT INTO tmplocation SELECT * FROM maintable AS OF SYSTEM TIME '2020-10-01 10:00:00'.

Since there is already a PR for the SET_VAR hint, and LOAD DATA already supports incremental commit with tidb_dml_batch_size, my suggestion is that this could be implemented by extending tidb_dml_batch_size to be settable with SET_VAR and apply to INSERT .. SELECT.

@zz-jason @jackysp what do you think?

@jackysp
Member

jackysp commented Oct 21, 2020

tidb_dml_batch_size can already apply to INSERT ... SELECT: as long as https://github.com/pingcap/tidb/blob/master/config/config.toml.example#L65 is set to true and tidb_batch_insert = 1, transactions are committed in batches. However, I am not sure whether the SET_VAR hint can allow writing when using tidb_snapshot.
@nullnotnil

@ghost

ghost commented Oct 22, 2020

@jackysp Thanks! I can confirm it works (kind of):

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (
 pk VARBINARY(36) NOT NULL PRIMARY KEY,
 b BIGINT NOT NULL,
 c BIGINT NOT NULL,
 pad VARBINARY(2048),
 INDEX (b),
 INDEX (c)
);
INSERT INTO t1 SELECT uuid(), FLOOR(RAND()*5), FLOOR(RAND()*1000000), HEX(RANDOM_BYTES(1000)) FROM dual;
INSERT INTO t1 SELECT uuid(), FLOOR(RAND()*5), FLOOR(RAND()*1000000), HEX(RANDOM_BYTES(1000)) FROM t1 a JOIN t1 b JOIN t1 c;
INSERT INTO t1 SELECT uuid(), FLOOR(RAND()*5), FLOOR(RAND()*1000000), HEX(RANDOM_BYTES(1000)) FROM t1 a JOIN t1 b JOIN t1 c;

-- This should insert 1000 rows (10^3)
INSERT INTO t1 SELECT uuid(), FLOOR(RAND()*5), FLOOR(RAND()*1000000), HEX(RANDOM_BYTES(1000)) FROM t1 a JOIN t1 b JOIN t1 c;

-- 100K rows (fails by default)
INSERT INTO t1 SELECT uuid(), FLOOR(RAND()*5), FLOOR(RAND()*1000000), HEX(RANDOM_BYTES(1000)) FROM t1 a JOIN t1 b LIMIT 100000;

SET tidb_dml_batch_size = 20000;
SET tidb_batch_insert = 1;
-- 100K rows (now works)
INSERT INTO t1 SELECT uuid(), FLOOR(RAND()*5), FLOOR(RAND()*1000000), HEX(RANDOM_BYTES(1000)) FROM t1 a JOIN t1 b LIMIT 100000;

I don't think set_var + tidb_snapshot can work together here, since tidb_snapshot implies read-only and set_var scope is statement (not query block). We will need to add full support for time-travel query.

I also noted that if I try to simulate a very large table (inserting 1010^3 rows) I could trigger out-of-memory controls. But this also happened when running a SELECT statement and piping to /dev/null. Edit: it was the client OOM'ing, not the server; if I add the --quick option, memory is stable. I am going to fork this to a separate bug report.

@kennytm
Contributor

kennytm commented Oct 27, 2020

(Note that the hint should be placed after the INSERT keyword, not at the end of the statement.)

INSERT /*+ SET_VAR(...) */ INTO tbl1 SELECT * from tbl2;

@ghost

ghost commented Oct 31, 2020

I have a proposal for how to address this issue. I'm missing some of the context about why certain feature flags exist, so I'd appreciate input:

  1. The config option enable-batch-dml is enabled by default.
  2. The sysvar tidb_batch_insert is deprecated, and implied by a non-zero value for tidb_dml_batch_size.
  3. The sysvar tidb_dml_batch_size is changed to be SET_VAR hintable.
  4. tidb_batch_insert should not hit executor OOM quota #20597 is fixed. (Done)

Items 1 and 2 make INSERT ... SELECT behave the same as LOAD DATA, which (inconsistently) does not currently depend on these two settings.

@Mini256
Member

Mini256 commented Mar 22, 2024

Has this been resolved by non-transactional DML?
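For readers landing here: TiDB's non-transactional DML splits a single statement into multiple batches, each committed separately, which matches what this issue asked for. A sketch based on the documented `BATCH ... LIMIT ...` syntax (shard column and table names illustrative; the exact supported forms and version requirements should be checked against current docs):

```sql
-- Split the insert into batches of 10000 rows, sharded on tbl2.id;
-- each batch commits independently, so atomicity across batches is lost.
BATCH ON tbl2.id LIMIT 10000
INSERT INTO tbl1 SELECT * FROM tbl2;
```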
