Cloud storage sink: support merging DML events #9169

Open
@Lloyd-Pottiger

Description

Is your feature request related to a problem?

I am trying to use TiCDC to migrate TiDB data to Snowflake. The main process is:

  1. Create a file format and a stage in Snowflake to store the files:

CREATE OR REPLACE FILE FORMAT my_log_csv_format
  TYPE = CSV
  FIELD_OPTIONALLY_ENCLOSED_BY='"';

CREATE STAGE my_s3_stage_log
  STORAGE_INTEGRATION = my_s3
  URL = 's3://wenxuan-snowflake-test/cdc/test2/chbenchmark/'
  FILE_FORMAT = my_log_csv_format;

  2. Create a changefeed in TiCDC to capture data changes and store them in S3.

  3. Put the files from S3 into a Snowflake stage:

CREATE OR REPLACE STAGE "table_a" FILE_FORMAT = my_log_csv_format;

  4. Merge the staged files into the Snowflake table.

  5. Remove the files from the stage.

This causes an error in step 4 when a single file contains multiple DML events. Snowflake's MERGE INTO statement does not update the target table in real time while it processes the staged rows. So if there are two DML events on the same row, such as insert row1 -> delete row1, row1 will not be deleted.
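The failure mode can be illustrated with a toy model. This is only a sketch, not Snowflake's implementation: it assumes that every staged row is matched against the target table as it was before the statement started, and the function name `snapshot_merge` and the `(op, key, value)` event shape are made up for this example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical event shape: (op, key, value), op is "I", "U" or "D".
Event = Tuple[str, str, Optional[str]]


def snapshot_merge(target: Dict[str, str], events: List[Event]) -> Dict[str, str]:
    """Toy model of a merge that matches against a fixed snapshot of the target.

    Assumption: "matched" is decided from the target's state before the
    statement started, so changes made by earlier events in the same file
    are not visible to later events.
    """
    snapshot = dict(target)   # state used for matching, never updated
    result = dict(target)     # state after the statement finishes
    for op, key, value in events:
        matched = key in snapshot
        if op == "D":
            if matched:
                del result[key]
            # Not matched: nothing to delete, the event is silently ignored.
        else:
            # "I" and "U" both end up writing the new value.
            result[key] = value
    return result


# insert row1 -> delete row1 in the same file, target initially empty
after = snapshot_merge({}, [("I", "row1", "A"), ("D", "row1", None)])
```

In this model `after` still contains `row1`: the delete is evaluated against the empty snapshot, finds no match, and is dropped.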

Describe the feature you'd like

Merge the DML events that affect the same row within the same file.

For example, in CDC0000001.csv, we have:

Case 1

    uk
U  0   1  A
U  0   2  A

merge to

    uk
U  0   2  A

Case 2

    uk
I  0   1  A
U  0   2  A

merge to

    uk
I  0   2  A

Case 3

    uk
I  0   1  A
D  0   1  A

merge to

    uk
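The three cases above can be sketched as a per-key fold over the events of one file. This is a minimal illustration, not TiCDC code: the `Event` type, `combine`, and `merge_events` are hypothetical names, the unique key is assumed to be the column holding `0` in the examples, and the delete-then-insert rule is an extra assumption that the issue does not cover.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    op: str      # "I" (insert), "U" (update) or "D" (delete)
    key: int     # value of the unique key (the "uk" column)
    row: Tuple   # remaining column values


def combine(prev: Optional[Event], cur: Event) -> Optional[Event]:
    """Fold two consecutive events on the same key into at most one event."""
    if prev is None:
        return cur
    if prev.op == "I" and cur.op == "U":
        # Case 2: the row is new in this file, so emit an insert of the latest values.
        return Event("I", cur.key, cur.row)
    if prev.op == "I" and cur.op == "D":
        # Case 3: the row was created and removed within the file, emit nothing.
        return None
    if prev.op == "D" and cur.op == "I":
        # Assumption (not in the issue): delete followed by insert becomes an update.
        return Event("U", cur.key, cur.row)
    # Case 1 (U then U) and U then D: the later event wins.
    return cur


def merge_events(events: Iterable[Event]) -> List[Event]:
    """Return at most one event per key, ordered by first appearance of the key."""
    merged: Dict[int, Optional[Event]] = {}
    for ev in events:
        merged[ev.key] = combine(merged.get(ev.key), ev)
    return [ev for ev in merged.values() if ev is not None]


case1 = merge_events([Event("U", 0, (1, "A")), Event("U", 0, (2, "A"))])
case2 = merge_events([Event("I", 0, (1, "A")), Event("U", 0, (2, "A"))])
case3 = merge_events([Event("I", 0, (1, "A")), Event("D", 0, (1, "A"))])
```

The sketch does not handle updates that change the unique key itself; those would need the old and new key to be tracked separately.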

Describe alternatives you've considered

Merging the events can also help improve the performance of downstream consumption.

Teachability, Documentation, Adoption, Migration Strategy

No response


Metadata

Labels

type/feature: Issues about a new feature
