@yuxiqian
Member

This closes FLINK-37578 and #3858.

There's a subtle ordering bug in both SchemaCoordinators that shows up after a schema evolution coordination process finishes. The coordinator may release the operators' blocking state before it has properly restored its own internal state, which may accidentally expose unwanted internal state or freeze the entire pipeline job.
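To illustrate the ordering problem, here is a minimal single-threaded Java sketch. It is a toy model, not the actual SchemaCoordinator code: `ToyCoordinator`, `finishReleaseFirst`, `finishRestoreFirst`, and the `State` enum are hypothetical names made up for this example, and the operator is assumed to send its next request as soon as it is unblocked.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the ordering issue; NOT the real Flink CDC SchemaCoordinator. */
final class CoordinatorOrderingSketch {

    /** Hypothetical internal state of the coordinator. */
    enum State { IDLE, EVOLVING }

    /** Hypothetical stand-in for a coordinator; all names are illustrative. */
    static final class ToyCoordinator {
        private State state = State.IDLE;
        private final List<String> observed = new ArrayList<>();

        void startEvolution() {
            state = State.EVOLVING;
        }

        /** Called by an operator as soon as it is unblocked. */
        void onOperatorRequest() {
            // Record which internal state the operator's next request sees.
            observed.add(state.name());
        }

        /** Problematic order: operators are released before state is reset. */
        void finishReleaseFirst(Runnable releaseOperators) {
            releaseOperators.run(); // an operator may call back in right here
            state = State.IDLE;
        }

        /** Safer order: reset internal state first, then release operators. */
        void finishRestoreFirst(Runnable releaseOperators) {
            state = State.IDLE;
            releaseOperators.run();
        }

        List<String> observed() {
            return observed;
        }
    }

    /** Runs one evolution round and returns the states seen by the operator. */
    static List<String> run(boolean restoreFirst) {
        ToyCoordinator coordinator = new ToyCoordinator();
        coordinator.startEvolution();
        // Assumption: the released operator immediately sends its next request.
        Runnable release = coordinator::onOperatorRequest;
        if (restoreFirst) {
            coordinator.finishRestoreFirst(release);
        } else {
            coordinator.finishReleaseFirst(release);
        }
        return coordinator.observed();
    }

    public static void main(String[] args) {
        System.out.println("release first: " + run(false));
        System.out.println("restore first: " + run(true));
    }
}
```

In this sketch, releasing first lets the operator observe the stale `EVOLVING` state, while restoring first lets it observe `IDLE`. In a real job the same race would be timing-dependent rather than deterministic.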

@linjianchang's optimization in #3858 is actually correct; however, it increases the chance of hitting this glitch. I've baked the original commit into this PR to test whether it works well.

@lvyanquan
Contributor

Hi, @yuxiqian. This PR looks good to me.
But I believe we lack the e2e tests needed to expose such issues. Can you create a Jira ticket to track it?

@yuxiqian
Member Author

Thanks for the suggestion, @lvyanquan; it is now tracked in FLINK-37704. Perhaps more test cases could be added based on the changes in #3965.

Contributor

@leonardBang left a comment

Thanks @yuxiqian for the contribution, LGTM.

@leonardBang merged commit 4743399 into apache:master on Apr 21, 2025
28 checks passed
linjianchang pushed a commit to linjianchang/flink-cdc that referenced this pull request on May 16, 2025
…d internal state accidentally

This closes apache#3972

Co-authored-by: linjc13 <linjc13@chinatelecom.cn>