-
Notifications
You must be signed in to change notification settings - Fork 9.1k
HDFS-14318:dn cannot be recognized and must be restarted to recognize the Repaired disk #1104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Would you like to add new unit test to cover this update?
- It is better to submit single one PR for one improvement. I think this PR is same as HDFS-14318:dn cannot be recognized and must be restarted to recognize the Repaired disk #538 .
@@ -404,6 +404,11 @@ public static InetSocketAddress createSocketAddr(String target) { | |||
private final StorageLocationChecker storageLocationChecker; | |||
|
|||
private final DatasetVolumeChecker volumeChecker; | |||
volatile FsDatasetSpi<? extends FsVolumeSpi> allData = null; | |||
private Thread checkDiskThread = null; | |||
private final int checkDiskInterval = 5*1000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is better as one parameter to config.
private Thread checkDiskThread = null; | ||
private final int checkDiskInterval = 5*1000; | ||
private Object checkDiskMutex = new Object(); | ||
private List<StorageLocation> errorDisk = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allData = data + errorDisk
right? It seems that allData
is redundant variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allData = data + errorDisk
is right. allData
will use in volumeChecker.checkAllVolumes(allData)
.
If no allData
, we must make a variable before volumeChecker.checkAllVolumes
, the variable is data + errorDisk
. So I think allData
is ok .
@@ -819,6 +825,14 @@ public IOException call() { | |||
} else { | |||
effectiveVolumes.add(volume.toString()); | |||
LOG.info("Successfully added volume: {}", volume); | |||
if (errorDisk != null && !errorDisk.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be not necessary to check if errorDisk
is empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. If refreshVolumes
is invoked by reconfigurePropertyImpl
, errorDisk
may be null.
public String reconfigurePropertyImpl(String property, String newVal)
throws ReconfigurationException {
switch (property) {
case DFS_DATANODE_DATA_DIR_KEY: {
IOException rootException = null;
try {
LOG.info("Reconfiguring {} to {}", property, newVal);
this.refreshVolumes(newVal);
...
}
@@ -2191,6 +2208,44 @@ public void shutdown() { | |||
tracer.close(); | |||
} | |||
|
|||
public void startCheckDiskThread() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest to use ScheduledExecutorService
to schedule this task at fixed rate. since,
- same target,
- simpler,
checkDiskMutex
, it seems to me that it it no need to concurrency control.
...-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
Show resolved
Hide resolved
errorDisk = new ArrayList<>(); | ||
} | ||
List<StorageLocation> tmpDisk = Lists.newArrayList(errorDisk); | ||
errorDisk = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this statement is unexpected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
errorDisk=null
should be errorDisk.clear()
...-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
Show resolved
Hide resolved
Thanks @hunshenshi for reporting the issue and the PR. Agree it would be great if you can add a unit test that repro the issue. The Datanode has a disk volume checker that checks volumes every 15m (by default), you may use the callback to populate the healthy volumes and failed volumes. |
Thanks, I will fix that. |
@xiaoyuyao ok. I will read it. |
@Hexiaoqiao Do you have any idea for the UT? |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
@Hexiaoqiao |
💔 -1 overall
This message was automatically generated. |
…ot reflected in code In the current code, the window is only applied and checked on the last retry. However, the check should be done at all retries. This was found during apache#1104 rmatharu please take a look Author: Daniel Nishimura <dnishimura@gmail.com> Reviewers: Ray Matharu <rmatharu@linkedin.com> Closes apache#1108 from dnishimura/samza-2277-cluster-manager-retry-window-bug-fix and squashes the following commits: 23db835e [Daniel Nishimura] Address minor comments from @rmatharu 0a7c9270 [Daniel Nishimura] Trigger build b/c of flakey test. 0c56499d [Daniel Nishimura] Fix checkstyle 2674f75e [Daniel Nishimura] Address @rmatharu's comments. ecfe63eb [Daniel Nishimura] SAMZA-2277: Semantics for cluster-manager.container.retry.window.ms not reflected in code
…host-affinity allocations (apache#1104) * SAMZA-2266: Introduce a backoff when there are repeated failures for host-affinity allocations
dn detected that disk a has failed. After disk a is repaired, dn cannot be recognized and must be restarted to recognize
I make a patch to dn for recognize the repaired disk without restart dn