The current reconstruction process for EC blocks is built on the process for ordinary contiguous blocks. It is mainly implemented through the work items constructed by computeReconstructionWorkForBlocks, and can be roughly divided into three steps:
scheduleReconstruction
chooseTargets
validateReconstructionWork
For ordinary contiguous blocks:
(1) scheduleReconstruction
Select srcNodes as the source of the copy block according to the status of each replica of the block.
(2) chooseTargets
Select the target of the copy.
(3) validateReconstructionWork
Add the copy command to srcNode. srcNode receives the command through its heartbeat and copies the block from src to target.
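The three steps for a contiguous block can be sketched roughly as follows. This is a simplified model for illustration, not the actual BlockManager code: the `Replica` and `Work` classes, the `usableAsSource` flag, the candidate list, and the command string format are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Simplified model of the three steps; not the actual BlockManager code. */
final class ContiguousReconstructionSketch {

    /** Hypothetical replica state: the holding node and whether it may act as a source. */
    static final class Replica {
        final String node;
        final boolean usableAsSource;

        Replica(String node, boolean usableAsSource) {
            this.node = node;
            this.usableAsSource = usableAsSource;
        }
    }

    /** Work item built in steps (1) and (2) and consumed in step (3). */
    static final class Work {
        final String blockId;
        final String srcNode;
        final List<String> targets = new ArrayList<>();

        Work(String blockId, String srcNode) {
            this.blockId = blockId;
            this.srcNode = srcNode;
        }
    }

    /** (1) Pick a source node from the replica states; returns null when none is usable. */
    static Work scheduleReconstruction(String blockId, List<Replica> replicas) {
        for (Replica r : replicas) {
            if (r.usableAsSource) {
                return new Work(blockId, r.node);
            }
        }
        return null;
    }

    /** (2) Choose up to `needed` targets among candidates that hold no replica yet. */
    static void chooseTargets(Work work, List<Replica> replicas,
                              List<String> candidates, int needed) {
        for (String candidate : candidates) {
            if (work.targets.size() >= needed) {
                break;
            }
            boolean alreadyHoldsReplica = false;
            for (Replica r : replicas) {
                if (r.node.equals(candidate)) {
                    alreadyHoldsReplica = true;
                }
            }
            if (!alreadyHoldsReplica) {
                work.targets.add(candidate);
            }
        }
    }

    /** (3) Queue the copy command on the source node, to be delivered via heartbeat. */
    static boolean validateReconstructionWork(Work work,
                                              Map<String, List<String>> commandsByNode) {
        if (work.targets.isEmpty()) {
            return false; // nothing to copy to
        }
        commandsByNode.computeIfAbsent(work.srcNode, k -> new ArrayList<>())
                .add("COPY " + work.blockId + " -> " + work.targets);
        return true;
    }
}
```

In this model the source node only learns about the copy when it drains its entry in `commandsByNode`, which stands in for the heartbeat response.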
For EC blocks:
Steps (1) and (2) are nearly the same. However, whether to perform a simple block copy or a block reconstruction for an EC block is decided in (3). When some storage is busy, step (3) may produce no work at all, which leads to the problem described in HDFS-17516. Even if no block copy or block reconstruction is generated, pendingReconstruction and neededReconstruction are still updated until the block times out, which wastes the scheduling opportunity.
Because the decision between block copy and block reconstruction is made in (3), the otherwise unnecessary liveBusyBlockIndices and excludeReconstructedIndices had to be introduced. Many known bugs are related to this logic, so it should be avoided.
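The wasted scheduling opportunity can be illustrated with a small model. This is a sketch under stated assumptions, not the real code path: the `decide` rule, its parameters (`needsCopy`, `copySourceBusy`, `readableInternalBlocks`, `dataBlocks`), and the use of plain sets for the two queues are hypothetical simplifications of the decision made in step (3).

```java
import java.util.Set;

/** Simplified model of deciding copy vs. reconstruction late, in step (3). */
final class LateDecisionSketch {

    enum Action { COPY, RECONSTRUCT, NONE }

    /**
     * Hypothetical decision rule (the real conditions are more involved):
     * copy if a copy is needed and its source is free, otherwise reconstruct
     * if enough internal blocks are readable, otherwise do nothing.
     */
    static Action decide(boolean needsCopy, boolean copySourceBusy,
                         int readableInternalBlocks, int dataBlocks) {
        if (needsCopy) {
            return copySourceBusy ? Action.NONE : Action.COPY;
        }
        return readableInternalBlocks >= dataBlocks ? Action.RECONSTRUCT : Action.NONE;
    }

    /**
     * Step (3) in this model: the queues are updated regardless of the
     * decision, so a block with Action.NONE still moves to pending.
     */
    static Action validateReconstructionWork(String blockId, Action action,
                                             Set<String> needed, Set<String> pending) {
        needed.remove(blockId);
        pending.add(blockId);
        return action;
    }
}
```

With a busy copy source, `decide` returns `NONE`, yet the block has already left `needed` and sits in `pending`, where in this model it would only become schedulable again after its pending entry times out.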
How was this patch tested?
Unit tests and testing on a cluster.
For code changes:
Move the work of deciding whether to copy or reconstruct blocks from validateReconstructionWork to scheduleReconstruction.
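A rough sketch of what making the decision early could look like is shown below. It is not the patch itself: the `Work` class, the decision rule and its parameters, and the assumption that a block with no viable action produces no work item (and therefore stays in the needed queue) are illustrative choices.

```java
import java.util.Optional;
import java.util.Set;

/** Simplified model of deciding copy vs. reconstruction early, in step (1). */
final class EarlyDecisionSketch {

    enum Action { COPY, RECONSTRUCT }

    /** Work item that already carries the chosen action. */
    static final class Work {
        final String blockId;
        final Action action;

        Work(String blockId, Action action) {
            this.blockId = blockId;
            this.action = action;
        }
    }

    /**
     * Step (1) with the decision moved in (hypothetical rule): when neither a
     * copy nor a reconstruction is possible, no work item is created at all.
     */
    static Optional<Work> scheduleReconstruction(String blockId, boolean needsCopy,
                                                 boolean copySourceBusy,
                                                 int readableInternalBlocks,
                                                 int dataBlocks) {
        if (needsCopy) {
            if (copySourceBusy) {
                return Optional.empty();
            }
            return Optional.of(new Work(blockId, Action.COPY));
        }
        if (readableInternalBlocks >= dataBlocks) {
            return Optional.of(new Work(blockId, Action.RECONSTRUCT));
        }
        return Optional.empty();
    }

    /** Step (3): the queues are touched only when a work item actually exists. */
    static boolean validateReconstructionWork(Optional<Work> work,
                                              Set<String> needed, Set<String> pending) {
        if (!work.isPresent()) {
            return false; // block stays in needed and can be retried next round
        }
        needed.remove(work.get().blockId);
        pending.add(work.get().blockId);
        return true;
    }
}
```

Because the action is fixed before targets are chosen, step (3) in this model no longer needs to track which sources were busy in order to work out afterwards what can still be done.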
Description of PR
https://issues.apache.org/jira/browse/HDFS-17542