[FLINK-18625][runtime] Maintain redundant taskmanagers to speed up failover #12958

Myracle · 2020-07-22T09:43:45Z

What is the purpose of the change

When a Flink job fails because taskmanagers were killed, it requests new containers on restart. Requesting new containers can be very slow, sometimes taking dozens of seconds or more. The causes vary: for example, YARN or HDFS may be slow, or machine performance may be poor.

To speed up failover, we can maintain redundant slots. When the job restarts, it can use the redundant slots immediately instead of requesting new taskmanagers.

Brief change log

Add config option slotmanager.redundant-slot-num to ResourceManagerOptions
Allocate redundant slots when SlotManagerImpl starts
Rename method taskManagerTimeoutCheck to checkValidTaskManagers
In checkValidTaskManagers, maintain enough redundant slots; subject to that, release timed-out taskmanagers
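The bookkeeping described in the change log can be illustrated with a small, self-contained sketch. It is not the code from this PR: the class name, method names, and the assumption that every taskmanager offers the same fixed number of slots are all hypothetical and chosen only for illustration. It shows how a redundancy target could limit the number of timed-out taskmanagers that are released, and how many taskmanagers would have to be requested when free slots fall below the target.

```java
public class RedundantSlotSketch {

    /** Taskmanagers needed to provide at least {@code slots} slots (ceiling division). */
    static int taskManagersFor(int slots, int slotsPerTaskManager) {
        if (slotsPerTaskManager <= 0) {
            throw new IllegalArgumentException("slotsPerTaskManager must be positive");
        }
        return (slots + slotsPerTaskManager - 1) / slotsPerTaskManager;
    }

    /**
     * How many idle, timed-out taskmanagers may be released while still keeping
     * at least {@code redundantSlots} free slots. Hypothetical helper, not Flink API.
     */
    static int releasableTaskManagers(
            int freeSlots, int redundantSlots, int slotsPerTaskManager, int timedOutTaskManagers) {
        int surplusSlots = freeSlots - redundantSlots;
        if (surplusSlots <= 0) {
            return 0; // releasing anything would break the redundancy target
        }
        // Only whole taskmanagers can be released, so round down.
        int coveredBySurplus = surplusSlots / slotsPerTaskManager;
        return Math.min(coveredBySurplus, timedOutTaskManagers);
    }

    /** How many extra taskmanagers to request when free slots are below the target. */
    static int taskManagersToRequest(int freeSlots, int redundantSlots, int slotsPerTaskManager) {
        int missingSlots = redundantSlots - freeSlots;
        return missingSlots <= 0 ? 0 : taskManagersFor(missingSlots, slotsPerTaskManager);
    }

    public static void main(String[] args) {
        // 10 free slots, target of 4 redundant slots, 2 slots per taskmanager, 5 timed out.
        System.out.println(releasableTaskManagers(10, 4, 2, 5));
        // 1 free slot, target of 4 redundant slots, 2 slots per taskmanager.
        System.out.println(taskManagersToRequest(1, 4, 2));
    }
}
```

Rounding down on release and rounding up on request keeps the number of free slots at or above the target in both directions, at the cost of occasionally holding one more taskmanager than strictly necessary.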

Verifying this change

This change added tests and can be verified as follows:

Add a test case to SlotManagerImplTest
Rename class TaskManagerReleaseInSlotManagerTest to TaskManagerValidateInSlotManagerTest and add tests

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
The serializers: (no)
The runtime per-record code paths (performance sensitive): (no)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (no)
The S3 file system connector: (no)

Documentation

Does this pull request introduce a new feature? (yes)
If yes, how is the feature documented? (not documented)

flinkbot · 2020-07-22T09:46:46Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 65182ec (Wed Jul 22 09:46:45 UTC 2020)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

flinkbot · 2020-07-22T10:35:11Z

CI report:

37599ea Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run travis re-run the last Travis build
@flinkbot run azure re-run the last Azure build

xintongsong

Thanks for preparing this PR, @Myracle.
I have some comments regarding the changes. My main concern is how it would behave in case of JM failovers.

Some additional comments regarding organizing the commits:

Usually we put jira id & component in [] at the beginning of commit messages. E.g., [FLINK-18625][runtime] Support redundant slots to speed up failover
The second commit should be squashed into the first one. Alternatively, you can use a fixup commit.

flink-core/src/main/java/org/apache/flink/configuration/ResourceManagerOptions.java

...time/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/SlotManagerImpl.java

.../src/test/java/org/apache/flink/runtime/resourcemanager/slotmanager/SlotManagerImplTest.java

...g/apache/flink/runtime/resourcemanager/slotmanager/TaskManagerValidateInSlotManagerTest.java

xintongsong

Thanks for addressing my comments, @Myracle.
The PR already looks quite good to me. I have only a few minor comments.
Moreover, I think you misunderstood my previous comment. I meant that the hotfix commit which generates the docs should be squashed into the commit that changes the config options. Otherwise, we would leave an intermediate broken state in the git history.

flink-core/src/main/java/org/apache/flink/configuration/ResourceManagerOptions.java

.../org/apache/flink/runtime/resourcemanager/slotmanager/TaskManagerCheckInSlotManagerTest.java

KarmaGYZ

Thanks for the PR, @Myracle. It generally looks good to me. I don't see any logic that breaks the lower and upper bounds on the number of slots at the moment.

Myracle · 2020-07-29T23:56:22Z

Thanks, @KarmaGYZ .

xintongsong

Thanks for addressing the comments, @Myracle. LGTM.
I'll take over from here.

xintongsong · 2020-07-30T02:32:01Z

@flinkbot run azure

xintongsong · 2020-07-30T05:15:31Z

Observed in manual tests that a Flink session cluster does not release redundant task managers when no job is running. In such cases, I think we should release the redundant task managers, because the extra resource occupation brings no benefit.

Discussed offline with @Myracle, and we agreed to handle this optimization as a follow-up issue. I opened FLINK-18760 to track it.
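One way to picture the behavior suggested above is to make the redundancy target depend on whether the cluster is idle. The sketch below is purely illustrative and is not taken from FLINK-18760: the class and method names are hypothetical, and "no job running" is approximated here as "no allocated slots and no pending slot requests", which is an assumption.

```java
public class IdleClusterRedundancySketch {

    /**
     * Returns the redundancy target to enforce. When the cluster looks idle,
     * the target drops to zero so that redundant task managers can time out
     * and be released. Hypothetical helper, not Flink API.
     */
    static int effectiveRedundantSlots(
            int configuredRedundantSlots, int allocatedSlots, int pendingSlotRequests) {
        boolean clusterIdle = allocatedSlots == 0 && pendingSlotRequests == 0;
        return clusterIdle ? 0 : configuredRedundantSlots;
    }

    public static void main(String[] args) {
        // Idle session cluster: nothing allocated, nothing pending.
        System.out.println(effectiveRedundantSlots(4, 0, 0));
        // A job is running: keep the configured redundancy.
        System.out.println(effectiveRedundantSlots(4, 8, 0));
    }
}
```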

rmetzger added review=description? component=Runtime/Coordination labels Jul 22, 2020

xintongsong changed the title ~~[FLINK-18625] [ Runtime / Coordination] Maintain redundant taskmanagers to speed up failover~~ [FLINK-18625] [runtime] Maintain redundant taskmanagers to speed up failover Jul 23, 2020

xintongsong changed the title ~~[FLINK-18625] [runtime] Maintain redundant taskmanagers to speed up failover~~ [FLINK-18625][runtime] Maintain redundant taskmanagers to speed up failover Jul 23, 2020

xintongsong requested changes Jul 23, 2020

Myracle force-pushed the FLINK-18625-redundant-slots branch 2 times, most recently from e4616f4 to deb09ce Compare July 27, 2020 00:59

xintongsong requested changes Jul 29, 2020

Myracle force-pushed the FLINK-18625-redundant-slots branch from 35ddeb1 to 800ed5c Compare July 29, 2020 12:14

KarmaGYZ approved these changes Jul 29, 2020

xintongsong approved these changes Jul 30, 2020

[FLINK-18625][runtime] Maintain redundant taskmanagers to speed up fa…

37599ea

…ilover

xintongsong force-pushed the FLINK-18625-redundant-slots branch from 800ed5c to 37599ea Compare July 30, 2020 02:24

xintongsong closed this in 73a3111 Jul 30, 2020
