HDFS-16070. DataTransfer block storm when datanode's io is busy. #3105
Conversation
Random rand = new Random(SEED);
rand.nextBytes(buffer);
stm.write(buffer);
LOG.info("Created file " + name + " with " + repl + " replicas.");
Use the logger's parameterized format ({}) instead of string concatenation.
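A minimal sketch of the suggested change, reusing the names from the snippet above:

```java
// SLF4J parameterized logging: the message is only formatted when INFO
// is enabled, so no string concatenation happens up front.
LOG.info("Created file {} with {} replicas.", name, repl);
```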
  writeFile(fileSys, name, repl, 2);
}

static protected void writeFile(FileSystem fileSys, Path name, int repl,
Prefer `protected static void writeFile()`, and the same for the other methods.
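For reference, a sketch with the conventional modifier order; the body is elided and the fourth parameter name is a guess, since the original signature is truncated above:

```java
// Conventional Java modifier order: access modifier first, then static.
protected static void writeFile(FileSystem fileSys, Path name, int repl,
    int numOfBlocks) throws IOException {
  // ... body unchanged from the snippet above ...
}
```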
stm.flush();
// Do not close stream, return it
// so that it is not garbage collected
return stm;
You never capture the returned stream when calling writeFile(), so it will be garbage collected in this current test.
Sure, thank you for checking so carefully. In fact, I copied this from another test and didn't rework it carefully. I will restructure the unit test.
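A minimal sketch of one way to restructure it, assuming writeFile returns the FSDataOutputStream it created, as the snippet above suggests:

```java
// Capture the returned stream so the open file is not garbage collected
// while the test runs; close it explicitly once the test is done with it.
FSDataOutputStream stm = writeFile(fileSys, name, repl, 2);
try {
  // ... assertions that need the file to stay open ...
} finally {
  stm.close();
}
```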
Assert.assertEquals(replicas.liveReplicas(), 2);

// 5. sleep and verify the unnecessary transfer.
Thread.sleep(3000);
Can this be made to wait for something? Otherwise, explain why 3 seconds?
There is no need to sleep, because we have already made sure the block meets the expected replication. I will remove it.
public boolean add(ExtendedBlock block) {
  boolean ret = super.add(block);
  try {
    Thread.sleep(30000);
Add a comment saying why this waits 30 seconds.
We need to sleep for more than DFS_NAMENODE_RECONSTRUCTION_PENDING_TIMEOUT_SEC_KEY + DFS_HEARTBEAT_INTERVAL_KEY + DFS_NAMENODE_REDUNDANCY_INTERVAL_SECONDS_KEY seconds; that guarantees the pending reconstruction block times out, which triggers the next DataTransfer.
In this unit test, I want to simulate a duplicated DataTransfer for the same block.
30 seconds is too long; I will reset this value.
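A minimal sketch of deriving the stall time from the test configuration instead of hard-coding 30 seconds; conf is assumed to be the mini cluster's Configuration, and the default values shown are illustrative:

```java
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.hdfs.DFSConfigKeys;

// Stall just longer than pending-reconstruction timeout + heartbeat
// interval + redundancy monitor interval, so the pending reconstruction
// is guaranteed to time out while the transfer is "stuck".
long stallMillis = TimeUnit.SECONDS.toMillis(
    conf.getInt(DFSConfigKeys.DFS_NAMENODE_RECONSTRUCTION_PENDING_TIMEOUT_SEC_KEY, 5)
    + conf.getLong(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, 3)
    + conf.getLong(DFSConfigKeys.DFS_NAMENODE_REDUNDANCY_INTERVAL_SECONDS_KEY, 3)
    + 1); // one extra second of margin
```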
    return ret;
  }
});
Field transferringBlock = DataNode.class
You are doing some reflection here; can you pull it into its own block with a comment explaining why?
We set DataNode.transferringBlock to sleepSet so that we can simulate a DataNode whose DataTransfer gets stuck.
When a node is being decommissioned, some nodes get stuck on I/O (network or disk), so DataTransfer can run for a long time.
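A minimal sketch of the injection, assuming transferringBlock is a private instance field on DataNode (the field name follows the snippet above; dataNode and sleepSet are the test's variables):

```java
import java.lang.reflect.Field;

// Replace DataNode's private transferringBlock set with the blocking
// sleepSet via reflection, so every scheduled transfer stalls.
Field transferringBlock = DataNode.class.getDeclaredField("transferringBlock");
transferringBlock.setAccessible(true);
transferringBlock.set(dataNode, sleepSet);
```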
private void waitBlockMeetReplication(BlockInfo blockInfo, Short repl) {
  boolean done = repl == blockInfo.numNodes();
  while (!done) {
    LOG.info("Waiting for repl change to " + repl + " current repl: "
Same here: use the logger's parameterized format ({}).
/*
 * Wait till node is fully decommissioned.
 */
private void waitBlockMeetReplication(BlockInfo blockInfo, Short repl) {
This could be done with LambdaTestUtils.waitFor().
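A minimal sketch using GenericTestUtils.waitFor from org.apache.hadoop.test, a closely related helper; the polling interval and timeout are illustrative:

```java
import org.apache.hadoop.test.GenericTestUtils;

// Poll every 100 ms until the block reaches the expected replication,
// failing the test after 30 s instead of hand-rolling a while loop.
GenericTestUtils.waitFor(() -> blockInfo.numNodes() == repl, 100, 30_000);
```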
LOG.info("{} Starting thread to transfer {} to {}", bpReg, block, | ||
xferTargetsString); | ||
if (transferringBlock.contains(block)) { | ||
LOG.warn( |
Can you explain why we are avoiding the block storm?
How big can this set get?
Just like I said: when a datanode's I/O is busy, DataTransfer can get stuck for a long time and the pending reconstruction block times out. Then a duplicated DataTransfer thread for the same block starts, and that duplicate makes the I/O even busier. In fact, we found our datanode's I/O stayed busy until we restarted the datanode. So I think we should avoid the duplicated DataTransfer.
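A minimal sketch of the guard as described; the early return is illustrative of skipping just this block, and per the discussion below, DataTransfer.run itself adds the block to the set and removes it in its finally clause:

```java
// Skip scheduling a second DataTransfer for a block that already has an
// in-flight transfer thread on this DataNode.
if (transferringBlock.contains(block)) {
  LOG.warn("Thread to transfer block {} is already running, ignore this request.",
      block);
  return;
}
```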
@@ -2646,6 +2655,7 @@ public void run() {
      } catch (Throwable t) {
        LOG.error("Failed to transfer block {}", b, t);
      } finally {
        transferringBlock.remove(b);
Are we sure we are cleaning this up and won't leave garbage?
I don't quite follow your advice: transferringBlock.add and transferringBlock.remove can only be called from DataTransfer.run, so I think it won't leave garbage.
When I sped up decommissioning, I found that some datanodes' I/O was busy: the hosts' load was very high, and tens of thousands of DataTransfer threads were running.
Then I found logs like the ones below.
You will see the last DataTransfer thread finished at 13:54:08, but the next DataTransfer had already started at 13:52:36.
If a DataTransfer does not finish within 10 minutes (pending timeout + check interval), the next DataTransfer for the same block starts running, and the disk and network load get even heavier.