[FLINK-18625][runtime] Maintain redundant taskmanagers to speed up failover #12958

Closed
wants to merge 1 commit into from

Conversation

Contributor

@Myracle Myracle commented Jul 22, 2020

What is the purpose of the change

When a Flink job fails because TaskManagers were killed, it requests new containers on restart. Requesting new containers can be very slow, sometimes taking dozens of seconds or more. The causes vary: for example, YARN or HDFS may be slow, or the machines may perform poorly.

To speed up failover, we can maintain redundant slots. When the job restarts, it can use the redundant slots immediately instead of requesting new TaskManagers.

Brief change log

  • Add config slotmanager.redundant-slot-num in ResourceManagerOptions
  • Allocate redundant slots when SlotManagerImpl starts
  • Rename method taskManagerTimeoutCheck to checkValidTaskManagers
  • In checkValidTaskManagers, first make sure enough redundant slots are maintained; only then release timed-out TaskManagers
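
The last two items can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: `TaskManagerInfo`, `RedundantSlotSketch`, and both method names are hypothetical, and the model assumes that "redundant slots" means free slots counted across all registered TaskManagers.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for a registered TaskManager (hypothetical, not a Flink class). */
final class TaskManagerInfo {
    final String id;
    final int freeSlots;        // slots not assigned to any job
    final int totalSlots;       // all slots offered by this TaskManager
    final boolean idleTimedOut; // idle for longer than the configured timeout

    TaskManagerInfo(String id, int freeSlots, int totalSlots, boolean idleTimedOut) {
        this.id = id;
        this.freeSlots = freeSlots;
        this.totalSlots = totalSlots;
        this.idleTimedOut = idleTimedOut;
    }

    /** A TaskManager is idle when none of its slots is in use. */
    boolean isIdle() {
        return freeSlots == totalSlots;
    }
}

final class RedundantSlotSketch {

    /** Extra slots to request so that at least redundantSlotNum slots are free. */
    static int missingRedundantSlots(int freeSlots, int redundantSlotNum) {
        return Math.max(0, redundantSlotNum - freeSlots);
    }

    /**
     * Returns the ids of idle, timed-out TaskManagers that can be released
     * while still leaving at least redundantSlotNum free slots.
     */
    static List<String> selectReleasable(List<TaskManagerInfo> taskManagers, int redundantSlotNum) {
        int freeSlots = 0;
        for (TaskManagerInfo tm : taskManagers) {
            freeSlots += tm.freeSlots;
        }

        List<String> releasable = new ArrayList<>();
        for (TaskManagerInfo tm : taskManagers) {
            boolean keepsEnoughRedundancy = freeSlots - tm.freeSlots >= redundantSlotNum;
            if (tm.isIdle() && tm.idleTimedOut && keepsEnoughRedundancy) {
                releasable.add(tm.id);
                freeSlots -= tm.freeSlots; // these slots are gone once the TaskManager is released
            }
        }
        return releasable;
    }
}
```

In this model, setting the redundant slot number to 0 reduces the check to a plain idle-timeout release, which is presumably the behavior before this change.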

Verifying this change

This change added tests and can be verified as follows:

  • Add test case in SlotManagerImplTest
  • Rename class TaskManagerReleaseInSlotManagerTest to TaskManagerValidateInSlotManagerTest and add tests

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (not documented)

@flinkbot
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 65182ec (Wed Jul 22 09:46:45 UTC 2020)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Jul 22, 2020

CI report:

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot run travis: re-run the last Travis build
  • @flinkbot run azure: re-run the last Azure build

@xintongsong xintongsong changed the title [FLINK-18625] [ Runtime / Coordination] Maintain redundant taskmanagers to speed up failover [FLINK-18625] [runtime] Maintain redundant taskmanagers to speed up failover Jul 23, 2020
@xintongsong xintongsong changed the title [FLINK-18625] [runtime] Maintain redundant taskmanagers to speed up failover [FLINK-18625][runtime] Maintain redundant taskmanagers to speed up failover Jul 23, 2020
Contributor

@xintongsong xintongsong left a comment

Thanks for preparing this PR, @Myracle.
I have some comments regarding the changes. My main concern is how it would behave in case of JM failovers.

Some additional comments regarding organizing the commits:

  • Usually we put jira id & component in [] at the beginning of commit messages. E.g., [FLINK-18625][runtime] Support redundant slots to speed up failover
  • The second commit should be squashed into the first one. Alternatively, you can use a fixup commit.

@Myracle Myracle force-pushed the FLINK-18625-redundant-slots branch 2 times, most recently from e4616f4 to deb09ce July 27, 2020 00:59
Contributor

@xintongsong xintongsong left a comment

Thanks for addressing my comments, @Myracle.
The PR already looks quite good to me. I have only a few minor comments.
Moreover, I think you misunderstood my previous comment. I meant that the hotfix commit which generates the docs should be squashed into the commit that changes the config options. Otherwise, we would leave an intermediate broken state in the git history.

@Myracle Myracle force-pushed the FLINK-18625-redundant-slots branch from 35ddeb1 to 800ed5c July 29, 2020 12:14
Contributor

@KarmaGYZ KarmaGYZ left a comment

Thanks for the PR, @Myracle. It generally looks good to me. At the moment, I don't see any logic that breaks the lower and upper bounds on the number of slots.

@Myracle
Contributor Author

Myracle commented Jul 29, 2020

Thanks, @KarmaGYZ.

Contributor

@xintongsong xintongsong left a comment

Thanks for addressing the comments, @Myracle. LGTM.
I'll take over from here.

@xintongsong xintongsong force-pushed the FLINK-18625-redundant-slots branch from 800ed5c to 37599ea July 30, 2020 02:24
@xintongsong
Contributor

@flinkbot run azure

@xintongsong
Contributor

I observed in manual tests that a Flink session cluster does not release redundant task managers when there is no job running. In such cases, I think we should release the redundant task managers, because the extra resource occupation does not bring us any benefit.

I discussed this offline with @Myracle, and we agreed to make this optimization a follow-up issue. I opened FLINK-18760 to track it.

5 participants