sink to external storage (cdc) depends on local clock to generate file/dir name is not reliable for distributed system #10374

Open
zhangjinpeng87 opened this issue Dec 27, 2023 · 1 comment
Labels
type/enhancement The issue or PR belongs to an enhancement.

Comments

@zhangjinpeng87
Contributor

Description

Sink to external storage writes DDL and DML events to files and then uploads them to external storage such as S3. TiCDC's path generator (https://github.com/pingcap/tiflow/blob/master/pkg/sink/cloudstorage/path.go) uses a local monotonic clock's time for dir/file names, which is not reliable in some cases. For example, suppose a changefeed is handled by CDC node1, node1 crashes or restarts, and the changefeed is re-scheduled to CDC node2. If there is clock drift between node1 and node2, the consumer may see time go backwards in the file names, which may cause it to miss some data.

|-------------|                               |-------------|
| CDC Node1   |                               | CDC Node2   |
|-------------|                               |-------------|
          |                                          |
          |        |---------------------------|     |
          |-->     |  External Storage like S3 |  <--|
                   |---------------------------|

file-20231227-084133-xxx (generated by cdc node1)
file-20231227-084132-xxx (generated by cdc node2 after cdc node1 crashed; the consumer may skip this file because its checkpoint may already have reached 20231227-084133 when cdc node1 crashed)
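
To make the rewind concrete, here is a minimal sketch in Go. It assumes a consumer that tracks its checkpoint as the largest time prefix it has seen and skips anything older. The `fileName` helper and the `consumer` type are hypothetical and only mirror the example names above; they are not TiCDC's actual path format or consumer logic.

```go
package main

import (
	"fmt"
	"time"
)

// fileName is a hypothetical naming helper that mirrors the example above.
// It is NOT the format produced by TiCDC's path generator.
func fileName(localNow time.Time, suffix string) string {
	return fmt.Sprintf("file-%s-%s", localNow.Format("20060102-150405"), suffix)
}

// consumer is a hypothetical reader that remembers the largest time prefix
// it has consumed and skips any file whose prefix is older than that.
type consumer struct {
	checkpoint string
	skipped    []string
}

// offer reports whether the file was consumed (true) or skipped (false).
func (c *consumer) offer(name string) bool {
	prefix := name[:len("file-20060102-150405")]
	if prefix < c.checkpoint {
		c.skipped = append(c.skipped, name)
		return false
	}
	c.checkpoint = prefix
	return true
}

func main() {
	// Assumption: node2's local clock is one second behind node1's.
	node1Clock := time.Date(2023, 12, 27, 8, 41, 33, 0, time.UTC)
	node2Clock := node1Clock.Add(-1 * time.Second)

	c := &consumer{}
	for _, name := range []string{
		fileName(node1Clock, "aaa"), // written by node1 before it crashed
		fileName(node2Clock, "bbb"), // written by node2 after takeover
	} {
		fmt.Printf("%s consumed=%v\n", name, c.offer(name))
	}
}
```

In this sketch the file written by node2 is skipped, even though it was produced later in real time, because its name sorts before the consumer's checkpoint.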

Enhancement

As a distributed system, TiCDC should use a reliable source, such as a global monotonic timestamp, to generate file/dir names. That way TiCDC works as expected in cross-region deployments, when there is clock drift between nodes, and in other extreme cases.
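
One way to picture the idea is a single shared timestamp source that never goes backwards, no matter which node asks. The sketch below uses a hypothetical in-memory `oracle` as a stand-in for such a service; it is not the PD client API, and the zero-padded name format is an assumption chosen so that lexical order matches numeric order.

```go
package main

import (
	"fmt"
	"sync"
)

// oracle is a hypothetical in-memory stand-in for a cluster-wide timestamp
// service. It is an illustration only, not a real PD or TiCDC type.
type oracle struct {
	mu   sync.Mutex
	last uint64
}

// next returns a timestamp strictly greater than every timestamp returned
// before, even if the caller's wall-clock reading is behind.
func (o *oracle) next(wallClockMillis uint64) uint64 {
	o.mu.Lock()
	defer o.mu.Unlock()
	if wallClockMillis > o.last {
		o.last = wallClockMillis
	} else {
		o.last++
	}
	return o.last
}

// fileName zero-pads the timestamp to 20 digits (enough for any uint64)
// so that sorting names as strings matches sorting by timestamp.
func fileName(ts uint64, suffix string) string {
	return fmt.Sprintf("file-%020d-%s", ts, suffix)
}

func main() {
	o := &oracle{}
	// Hypothetical readings: node2's clock is one second behind node1's.
	fromNode1 := fileName(o.next(1703666493000), "node1")
	fromNode2 := fileName(o.next(1703666492000), "node2")
	fmt.Println(fromNode1)
	fmt.Println(fromNode2)
	fmt.Println("names still in order:", fromNode1 < fromNode2)
}
```

Because ordering comes from the shared source rather than from each node's clock, a consumer that compares names never observes a rewind in this sketch.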

Alternatives

Enable NTP to keep the clock drift between nodes under some threshold, such as 500ms, and make sure the changefeed waits for a safe time range after it is rescheduled to another node. This makes TiCDC strongly dependent on these preparations, which is not a good design for a distributed system.
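
For comparison, the wait described in this alternative could look roughly like the sketch below. It assumes that the drift between any two nodes is bounded by `assumedMaxDrift` (500ms here, taken from the example threshold) and that the new node measures the wait on its own clock. The `safeWait` helper is hypothetical and not part of TiCDC.

```go
package main

import (
	"fmt"
	"time"
)

// assumedMaxDrift is the drift bound that NTP is assumed to enforce.
// If the real drift exceeds it, the wait below is no longer sufficient.
const assumedMaxDrift = 500 * time.Millisecond

// takeover is a hypothetical moment, on the new node's clock, at which the
// changefeed was rescheduled to it.
var takeover = time.Date(2023, 12, 27, 8, 41, 33, 0, time.UTC)

// safeWait returns how long the new node should still wait before writing
// files. At takeover the old node's clock read at most takeoverAt+maxDrift,
// so once the new node's clock passes that point its names cannot be
// earlier than any name the old node produced.
func safeWait(takeoverAt, localNow time.Time, maxDrift time.Duration) time.Duration {
	safeAt := takeoverAt.Add(maxDrift)
	if localNow.Before(safeAt) {
		return safeAt.Sub(localNow)
	}
	return 0
}

func main() {
	now := takeover.Add(200 * time.Millisecond)
	fmt.Println("remaining wait:", safeWait(takeover, now, assumedMaxDrift))
}
```

The sketch also shows the weakness named above: correctness rests entirely on the drift bound holding, which TiCDC itself cannot guarantee.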

@zhangjinpeng87 zhangjinpeng87 added the type/enhancement The issue or PR belongs to an enhancement. label Dec 27, 2023
@zhangjinpeng87
Contributor Author

#10351 uses the PD clock as a short-term fix.

ti-chi-bot bot pushed a commit that referenced this issue Jan 22, 2024
CharlesCheung96 added a commit to ti-chi-bot/tiflow that referenced this issue Jan 24, 2024
CharlesCheung96 added a commit to ti-chi-bot/tiflow that referenced this issue Jan 24, 2024
CharlesCheung96 added a commit to ti-chi-bot/tiflow that referenced this issue Jan 24, 2024
CharlesCheung96 added a commit to ti-chi-bot/tiflow that referenced this issue Jan 24, 2024
CharlesCheung96 added a commit to ti-chi-bot/tiflow that referenced this issue Feb 7, 2024
CharlesCheung96 added a commit to ti-chi-bot/tiflow that referenced this issue Feb 9, 2024