Always allow local blocks to be put #1
Closed
holdenk wants to merge 4 commits into agrawaldevesh:block-manager-decom-flaky from
An interesting failure happens when migrateDuring = true (and persist or shuffle is true):
- We schedule the job with tasks on executors 0, 1, 2.
- We wait 300 ms and decommission executor 0.
- If the task is not yet done on executor 0, it will now fail because the block manager won't be able to save the block. This condition is easy to trigger on a loaded machine where the github checks run.
- The task will retry on a different executor (1 or 2) and its shuffle blocks will land there.
- No actual block migration happens here because the decommissioned executor technically failed before it could even produce a block.

So this change makes two fixes to remove the above race condition.
- When migrateDuring = true, wait for a task to complete and write the block, and then decommission that executor.
- When migrateDuring = false, it is still possible (because of delay scheduling) for two tasks to be run on the same executor serially and one executor to go idle. In that case, we must make sure to decommission an executor that actually had a task run on it.
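The race and the first fix can be illustrated with a small, self-contained sketch. This is not Spark code: `ToyBlockManager`, `run_task`, and both scenario functions are hypothetical names, and the "loaded machine" is simulated by holding the task back with an event until after the decommission call.

```python
import threading


class ToyBlockManager:
    """Hypothetical stand-in for an executor's block manager (not Spark's API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.decommissioned = False
        self.blocks = {}

    def decommission(self):
        with self._lock:
            self.decommissioned = True

    def put(self, block_id, data):
        # Once decommissioned, this toy manager refuses to save blocks.
        with self._lock:
            if self.decommissioned:
                raise RuntimeError("block manager is decommissioned")
            self.blocks[block_id] = data


def run_task(bm, block_id, may_write, task_done, result):
    """A toy task: do some 'work', then try to write one block."""
    may_write.wait()  # stands in for the time the task spends computing
    try:
        bm.put(block_id, b"data")
        result["ok"] = True
    except RuntimeError:
        result["ok"] = False
    finally:
        task_done.set()  # plays the role of a task-end event


def _start_task(bm):
    may_write, task_done, result = threading.Event(), threading.Event(), {}
    thread = threading.Thread(
        target=run_task, args=(bm, "shuffle_0", may_write, task_done, result)
    )
    thread.start()
    return thread, may_write, task_done, result


def decommission_after_fixed_delay():
    """Old test shape: the fixed wait elapses while the task is still running."""
    bm = ToyBlockManager()
    thread, may_write, _, result = _start_task(bm)
    bm.decommission()  # the timer fired before the slow task wrote its block
    may_write.set()
    thread.join()
    return result["ok"], len(bm.blocks)


def decommission_after_task_end():
    """Fixed test shape: wait for the task to finish, then decommission."""
    bm = ToyBlockManager()
    thread, may_write, task_done, result = _start_task(bm)
    may_write.set()
    task_done.wait(timeout=5)  # block until the task has written its block
    bm.decommission()
    thread.join()
    return result["ok"], len(bm.blocks)
```

In the first scenario the task fails and the executor holds no block, so there is nothing to migrate. In the second, the block exists before decommissioning begins.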
Now that we wait for an actual task to succeed, we don't need to wait for the events prior to that: broadcast of job-info finished and task started. Waiting for the task end/success subsumes them, which simplifies the test even further.
…cal blocks from the BlockManager (useful in testing).
d709002 to
1b469fb
From our conversation, this changes the block manager to always allow local blocks to be put, avoiding the race condition. Generally speaking this shouldn't matter in production, but it should help avoid the test race condition.
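A minimal sketch of the idea, under stated assumptions: the class, the `is_local` flag, and the exception name below are all hypothetical and are not Spark's BlockManager API. It also assumes that puts arriving from other executors are still rejected while decommissioning, which the description above does not spell out.

```python
class DecommissioningError(Exception):
    """Hypothetical error raised when a put is refused during decommissioning."""


class LocalFriendlyBlockManager:
    """Toy block manager that always accepts blocks produced locally."""

    def __init__(self):
        self.decommissioning = False
        self.blocks = {}

    def put(self, block_id, data, is_local):
        # Assumption: only non-local puts are refused while decommissioning.
        # A block written by a task running on this executor is always saved,
        # so a task still in flight when decommissioning starts does not fail.
        if self.decommissioning and not is_local:
            raise DecommissioningError(f"refusing non-local block {block_id}")
        self.blocks[block_id] = data


bm = LocalFriendlyBlockManager()
bm.decommissioning = True
bm.put("rdd_0_0", b"local data", is_local=True)  # accepted despite decommissioning
```

With this shape, the outcome of a test no longer depends on whether the task finishes before or after decommissioning starts.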