[FLINK-18625][runtime] Maintain redundant taskmanagers to speed up failover #12958
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request.
Automated Checks
Last check on commit 65182ec (Wed Jul 22 09:46:45 UTC 2020)
Mention the bot in a comment to re-run the automated checks.
Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
Thanks for preparing this PR, @Myracle.
I have some comments regarding the changes. My main concern is how it would behave in case of JM failovers.
Some additional comments regarding organizing the commits:
- Usually we put the JIRA ID and component in [] at the beginning of commit messages, e.g., [FLINK-18625][runtime] Support redundant slots to speed up failover
- The second commit should be squashed into the first one. Alternatively, you can use a fixup commit.
e4616f4 to deb09ce
Thanks for addressing my comments, @Myracle.
The PR already looks quite good to me. I have only a few minor comments.
Moreover, I think you misunderstood my previous comment. I meant that the hotfix commit which generates the docs should be squashed into the commit that makes changes to the config options. Otherwise, we would leave an intermediate broken state in the git history.
35ddeb1 to 800ed5c
Thanks for the PR, @Myracle. It generally looks good to me. I don't see any logic that breaks the lower and upper bounds on the number of slots at the moment.
Thanks, @KarmaGYZ.
Thanks for addressing the comments, @Myracle. LGTM.
I'll take over from here.
800ed5c to 37599ea
@flinkbot run azure
I observed in manual tests that a Flink session cluster does not release redundant task managers when no job is running. In such cases, I think we should release the redundant task managers, because the extra resource occupation does not bring us any benefit. I discussed this offline with @Myracle and we agreed to make this optimization a follow-up issue. I opened FLINK-18760 to track it.
What is the purpose of the change
When a Flink job fails because taskmanagers were killed, it requests new containers when restarting. Requesting new containers can be very slow; sometimes it takes dozens of seconds or even more. The reasons vary: for example, YARN and HDFS may be slow, or machine performance may be poor.
To speed up the failover process, we can maintain redundant slots. Once the job restarts, it can use the redundant slots immediately instead of requesting new taskmanagers.
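The idea above can be illustrated with a minimal sketch. It is not the code from this PR: the class `RedundantTaskManagerPlanner`, its method, and its parameters are hypothetical names chosen for illustration, and the real SlotManagerImpl logic may differ. The sketch assumes every taskmanager offers the same number of slots and computes how many extra taskmanagers to request so that the free (or already pending) slots cover the configured redundancy.

```java
/**
 * Hypothetical sketch of keeping spare taskmanagers for fast failover.
 * Not Flink's SlotManagerImpl; names and structure are assumptions.
 */
public final class RedundantTaskManagerPlanner {

    private final int slotsPerTaskManager;
    private final int redundantTaskManagers;

    public RedundantTaskManagerPlanner(int slotsPerTaskManager, int redundantTaskManagers) {
        if (slotsPerTaskManager <= 0) {
            throw new IllegalArgumentException("slotsPerTaskManager must be positive");
        }
        if (redundantTaskManagers < 0) {
            throw new IllegalArgumentException("redundantTaskManagers must not be negative");
        }
        this.slotsPerTaskManager = slotsPerTaskManager;
        this.redundantTaskManagers = redundantTaskManagers;
    }

    /**
     * Returns how many additional taskmanagers to request so that free slots
     * plus slots already being requested reach the redundancy target.
     */
    public int taskManagersToRequest(int freeSlots, int pendingSlots) {
        int targetFreeSlots = redundantTaskManagers * slotsPerTaskManager;
        int missingSlots = targetFreeSlots - freeSlots - pendingSlots;
        if (missingSlots <= 0) {
            return 0; // enough spare capacity already exists or is on its way
        }
        // Round up: a partially needed taskmanager still has to be requested whole.
        return (missingSlots + slotsPerTaskManager - 1) / slotsPerTaskManager;
    }

    public static void main(String[] args) {
        // 2 redundant taskmanagers with 4 slots each: the target is 8 free slots.
        RedundantTaskManagerPlanner planner = new RedundantTaskManagerPlanner(4, 2);
        System.out.println(planner.taskManagersToRequest(3, 0)); // prints 2
    }
}
```

With spare slots kept free this way, a restarting job can take them directly, and the planner only requests replacements afterwards to restore the redundancy.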
Brief change log
Verifying this change
This change added tests and can be verified as follows:
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (no)
Documentation