HDFS-16521. DFS API to retrieve slow datanodes #4107
Conversation
I think similar functionality already exists and is exposed via JMX, and I believe most automations use that.
CDP has a doc on this as well:
https://docs.cloudera.com/runtime/7.2.10/scaling-namespaces/topics/hdfs-detecting-slow-datanodes.html
Firing periodic RPCs and then building automation on top of that doesn't look like something an ideal system should do. Since this is already available as metrics, I think systems should adapt to that and build on top of it rather than shooting RPCs periodically.
@ayushtkn While I agree that the JMX metric for slow nodes is already available, not every downstreamer has direct access to it. For instance, in K8s-managed clusters, unless port forwarding is enabled (not a common case in prod), an HDFS downstreamer would not be able to access JMX metrics. We have a similar case with … Moreover, it's not only about downstreamers using the API; we should also provide …
I agree with @ayushtkn that modifying ClientProtocol is overkill for this use case. @virajjasani
Thanks to JMXJsonServlet, we can get metrics in JSON format via the HTTP/HTTPS port of the NameNode without additional configuration. JSON over HTTP is usually easier to access from outside/downstream than Protobuf over RPC.
How about enhancing the metrics if the current information in the SlowPeersReport is insufficient?
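For context, a minimal sketch of the JSON-over-HTTP approach described above, using only the standard JDK HTTP client; the NameNode address and the JMX query/bean names are illustrative assumptions, not something defined by this PR.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class NameNodeJmxProbe {
  public static void main(String[] args) throws Exception {
    // Assumed NameNode web address (default HTTP port shown); JMXJsonServlet serves /jmx
    // and supports a "qry" filter. The bean pattern below is only an illustrative guess.
    String url = "http://namenode.example.com:9870/jmx?qry=Hadoop:service=NameNode,name=*";

    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

    // A real consumer would parse this JSON and pick out the slow-peer related
    // attribute (e.g. the SlowPeersReport mentioned above) instead of printing it all.
    System.out.println(response.body());
  }
}
```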
@iwasakims @ayushtkn I was earlier thinking about adding SLOW_NODE in … When HDFS throughput is affected, it would be really helpful for operators to check slownode details using `dfsadmin -report` (similar to the command that retrieves decommissioning, dead and live nodes).
We can do this, but I believe that adding more info about slow nodes only when required, i.e. via a user-triggered API (similar to ClientProtocol), would be less overhead than continuously exposing additional details in the metrics. WDYT?
Yes, this is helpful for sure, but only if the Namenode port is exposed to the downstream application.
I meant only the http port (not rpc).
I've wanted to build a UI to expose the slow datanode metrics more easily, for example at the NameNode itself or in the Cloudera Manager chart system, but never got the time to make one. The biggest complaint from users was that the feature is disabled by default, and it's annoying to restart the cluster just to refresh the configuration and then wait for the slow node to show up again. It would be much more useful if it could be made available on demand at runtime.
HDFS-16396 has a good attempt at making it reconfigurable for datanodes; we can extend the support to the namenode as well in a follow-up Jira.
You mean HDFS-16327 (#3716) needs a follow-up?
I mean HDFS-16396 (#3827) can be extended for the namenode as well (reconfig of …)
@jojochuang @iwasakims @ayushtkn Could you please help review this PR?
@jojochuang @iwasakims @ayushtkn @aajisaka Could you please take a look?
I have updated the Jira/PR description to summarize the above points.
To provide more insight: FanOutOneBlockAsyncDFSOutput in HBase currently has to rely on its own way of marking and excluding slow nodes while 1) creating pipelines and 2) handling acks, based on factors like the data length of the packet, the processing time relative to the last ack timestamp, whether the flush to replicas has finished, etc. If it could utilize a slownode API from HDFS to exclude nodes appropriately while writing a block, a lot of its own post-ack computation of slow nodes could be saved or improved, or, based on further experiments, we could find a better solution for managing slow node detection logic in both HDFS and HBase. However, in order to collect more data points and run more POCs in this area, at the very least we should expect HDFS to provide an API for downstreamers to efficiently utilize slownode info for such critical low-latency use cases (like writing WALs).
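As a rough illustration of that consumption pattern, here is a minimal sketch that relies only on the ClientProtocol#getSlowDatanodeReport() call under review; how the ClientProtocol handle is obtained and how the exclusion set would actually be applied to pipeline selection are left out, and the class name below is hypothetical.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

import org.apache.hadoop.hdfs.protocol.ClientProtocol;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

/** Sketch: derive an exclusion set for pipeline selection from the slow-node report. */
public final class SlowNodeExclusion {

  private SlowNodeExclusion() {
  }

  /**
   * Returns the UUIDs of datanodes currently reported as slow, so a writer
   * (e.g. an HBase WAL output) could skip them when building a new pipeline.
   */
  public static Set<String> slowNodeUuids(ClientProtocol namenode) throws IOException {
    DatanodeInfo[] slowNodes = namenode.getSlowDatanodeReport();
    return Arrays.stream(slowNodes)
        .map(DatanodeInfo::getDatanodeUuid)
        .collect(Collectors.toSet());
  }
}
```

Whether the exclusion belongs at pipeline creation, in ack handling, or both is exactly the kind of question the POCs mentioned above would need to answer.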
DatanodeInfo[] slowDataNodesReport() throws IOException {
  String operationName = "slowDataNodesReport";
  DatanodeInfo[] datanodeInfos;
  checkSuperuserPrivilege(operationName);
Does it need to require superuser privilege?
Not really, removed, thanks.
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/DFSAdmin.java
@@ -632,6 +638,20 @@ private static void printDataNodeReports(DistributedFileSystem dfs,
    }
  }

  private static void printSlowDataNodeReports(DistributedFileSystem dfs, boolean listNodes,
Can you provide a sample output? It would be confusing, I guess. I suspect you would need some kind of header to distinguish from the other data node reports.
I suspect you would need some kind of header to distinguish from the other data node reports.
This is called only if the condition `listAll || listSlowNodes` is true:

if (listAll || listSlowNodes) {
  printSlowDataNodeReports(dfs, listSlowNodes, "Slow");
}
Sample output:
Header:
-------------------------------------------------
Slow datanodes (n):
One comment on the slow datanode report is that it seems to say nothing about why the NN thinks it is slow; it does not note the attributes that are in excess of normal. Should it? (Hard to do when this info is not part of DatanodeInfo.) For example, say something about how far in excess a DN's latency is? (Perhaps this could be added later.)
One comment on the slow datanode report is that it seems to say nothing about why the NN thinks it is slow;
It's the datanodes that determine whether their peer datanodes are slower; the NN just aggregates all the DN reports.
For example, say something about how far in excess a DN's latency is? (Perhaps this could be added later.)
Sure, this can be added as additional info. Will create a follow-up Jira. Thanks @saintstack
Fantastic. Thanks for offering the output screenshot.
 */
@Idempotent
@ReadOnly
DatanodeInfo[] getSlowDatanodeReport() throws IOException;
I just want to check with everyone that it is okay to have an array of objects as the return value.
I think it's fine, but I just want to check with everyone, because once we decide the interface it can't be changed later.
I thought a List would also be fine, but I kept it an Array to keep the API contract in line with getDatanodeReport(), so that both APIs can use the same underlying utility methods (e.g. getDatanodeInfoFromDescriptors()).
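As an aside, a hedged sketch of why keeping the array type helps: with both getDatanodeReport() and getSlowDatanodeReport() returning DatanodeInfo[], a single formatting/utility path can serve both report flavors. The class below is hypothetical, not code from the patch.

```java
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

/** Hypothetical helper: one printing path serves both APIs because both return DatanodeInfo[]. */
public final class ReportPrinter {

  private ReportPrinter() {
  }

  public static void print(String header, DatanodeInfo[] nodes) {
    System.out.println(header + " (" + nodes.length + "):");
    for (DatanodeInfo node : nodes) {
      // DatanodeInfo#getDatanodeReport() returns the per-node textual summary
      // used by the existing dfsadmin -report output.
      System.out.println(node.getDatanodeReport());
    }
  }
}
```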
Looks good.
- Did you consider adding slow node attribute to datanode report? (Perhaps asked already)
- It is beyond this PR, but this PR might include a pointer to the definition of what a slownode is (including how an operator might edit the qualifying boundaries).
Thanks
 *
 * @param outlier outlier directly set by tests.
 */
public void setTestOutliers(Map<String, Double> outlier) {
This is a little awkward? Add a comment on the testOutlier data member noting that it is for tests only?
Yeah, it's very difficult to reproduce an actual slow node in a UT, hence I had to do it this way. Sure, added a comment on the testOutlier member as well (in addition to this setter method's Javadoc).
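To make the test-only intent concrete, a small self-contained sketch of the injection pattern; PeerMetricsStub below is a hypothetical stand-in, not the actual Hadoop metrics class touched by this patch.

```java
import java.util.Map;

public class SlowPeerInjectionSketch {

  /** Hypothetical stand-in mirroring the reviewed setter; for illustration only. */
  static class PeerMetricsStub {
    private Map<String, Double> testOutliers = Map.of();

    /** For tests only: bypasses real latency sampling and fixes the outlier map. */
    void setTestOutliers(Map<String, Double> outliers) {
      this.testOutliers = outliers;
    }

    Map<String, Double> outliers() {
      return testOutliers;
    }
  }

  public static void main(String[] args) {
    PeerMetricsStub metrics = new PeerMetricsStub();
    // Pretend one peer is an outlier with an aggregate latency value of 15.5.
    metrics.setTestOutliers(Map.of("dn-2.example.com:9866", 15.5));
    System.out.println("Injected outliers: " + metrics.outliers());
  }
}
```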
@@ -394,7 +394,7 @@ Usage:

| COMMAND\_OPTION | Description |
|:---- |:---- |
| `-report` `[-live]` `[-dead]` `[-decommissioning]` `[-enteringmaintenance]` `[-inmaintenance]` | Reports basic filesystem information and statistics, The dfs usage can be different from "du" usage, because it measures raw space used by replication, checksums, snapshots and etc. on all the DNs. Optional flags may be used to filter the list of displayed DataNodes. |
| `-report` `[-live]` `[-dead]` `[-decommissioning]` `[-enteringmaintenance]` `[-inmaintenance]` `[-slownodes]` | Reports basic filesystem information and statistics, The dfs usage can be different from "du" usage, because it measures raw space used by replication, checksums, snapshots and etc. on all the DNs. Optional flags may be used to filter the list of displayed DataNodes. Filters are either based on the DN state (e.g. live, dead, decommissioning) or the nature of the DN (e.g. slow nodes). |
Where do we explain what a 'slownode' is?
So far, some good explanation can be found in the OutlierDetector class itself, but I get your point; we should have this documented on the site as well.
Did you consider adding slow node attribute to datanode report? (Perhaps asked already)
Yes, they are available, but the additional attributes like the actual latency vs the aggregate are not available; we can follow up in a separate Jira.
It is beyond this PR, but this PR might include a pointer to the definition of what a slownode is (including how an operator might edit the qualifying boundaries).
Agree, we need better docs around slow nodes; this can be taken up as a follow-up task.
Thanks
+1 from me.
This PR adds exposure of 'slow nodes' and makes it so downstreamers can ask about the phenomenon in-line (without having to query via a different channel/protocol and serialization). There is still work to do around what a 'slow node' is, and around giving operators clues on why a node is slow and how to address it, but being able to list what the NN has accumulated in this regard makes for a good start at tackling the phenomenon.
Conflicts resolved with latest changes.
Ok I think this is good to go.
Thanks everyone for the reviews. Here is the branch-3.3 backport PR: #4259
A follow-up PR to expose the latency of slow nodes as perceived by their reporting nodes: #4323
Signed-off-by: stack <stack@apache.org>
Signed-off-by: Wei-Chiu Chuang <weichiu@apache.org>
Description of PR
Providing a DFS API to retrieve slow nodes helps add an additional option to `dfsadmin -report` that lists slow datanode info for operators to look at; this is an especially useful filter for larger clusters.
The other purpose of such an API is to let HDFS downstreamers without direct access to the namenode http port (only the rpc port being accessible) retrieve slownodes.
Created follow-up Jira HDFS-16528 to support enabling slow peer stats without having to restart the Namenode.
FanOutOneBlockAsyncDFSOutput in HBase currently has to rely on its own way of marking and excluding slow nodes while 1) creating pipelines and 2) handling acks, based on factors like the data length of the packet, the processing time relative to the last ack timestamp, whether the flush to replicas has finished, etc. If it could utilize a slownode API from HDFS to exclude nodes appropriately while writing a block, a lot of its own post-ack computation of slow nodes could be saved or improved, or, based on further experiments, we could find a better solution for managing slow node detection logic in both HDFS and HBase. However, in order to collect more data points and run more POCs in this area, HDFS should provide an API for downstreamers to efficiently utilize slownode info for such critical low-latency use cases (like writing WALs).
How was this patch tested?
Dev cluster:
