-
Notifications
You must be signed in to change notification settings - Fork 9.1k
HDFS-16917 Add transfer rate quantile metrics for DataNode reads #5397
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
💔 -1 overall
This message was automatically generated. |
3c9d4b3
to
3f8ce0c
Compare
* @return the number of megabytes/second of the transfer rate | ||
*/ | ||
public static long transferRateMBs(long bytes, long durationMS) { | ||
if (durationMS == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it is <= 0, just return -1? Let's add a check for bytes as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont feel we should handle other cases. This is a Utils method and any unexpected data should be left for the client to interpret. For some clients the negative values might even make sense.
The idea behind handling for durationMS = 0 is to take care of DivideByZero for cases when data transfer did not happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we specify our function as: "we expect both inputs to be positive. Otherwise, this function will return -1".
Then returning -1 is a clear signal we don't know how to handle such inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
@@ -61,6 +61,8 @@ public class DataNodeMetrics { | |||
@Metric MutableCounterLong bytesRead; | |||
@Metric("Milliseconds spent reading") | |||
MutableCounterLong totalReadTime; | |||
@Metric MutableRate bytesReadTransferRate; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: rename to readTransferRateMBs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
@@ -370,6 +370,7 @@ Each metrics record contains tags such as SessionId and Hostname as additional i | |||
|:---- |:---- | | |||
| `BytesWritten` | Total number of bytes written to DataNode | | |||
| `BytesRead` | Total number of bytes read from DataNode | | |||
| `BytesReadTransferRate`*num*`s(50/75/90/95/99)thPercentileRate` | The 50/75/90/95/99th percentile of the transfer rate of bytes read from the DataNode. The transfer rate is measured in megabytes per second. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: rename to ReadTransferRateMBs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
4271417
to
b3267f3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
The UTs failing in the build are unrelated to my change. Verified locally that the same tests fail after rebasing to current trunk. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
if (durationMS == 0) { | ||
durationMS = 1; | ||
} | ||
return bytes / (1024 * 1024) * 1000 / durationMS; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably need to change the division to be done based on float/double. Then return a float/double as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Metrics take long, so eventually we will have to convert to long anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would make this bytes per second and let the consumer do the conversion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to have the consumer do the conversion, then we can just keep it to bytes/milliseconds. bytes and milliseconds are the units used for most of the metrics in HDFS.
Thoughts @xinglin @mkuchenbecker ?
@@ -1122,6 +1124,7 @@ public void copyBlock(final ExtendedBlock block, | |||
datanode.metrics.incrBytesRead((int) read); | |||
datanode.metrics.incrBlocksRead(); | |||
datanode.metrics.incrTotalReadTime(duration); | |||
datanode.metrics.addReadTransferRateMBs(DFSUtil.transferRateMBs(read, duration)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will be publishing the metrics, including the -1 here. Assuming duration is 0, I would expect bytes read to be 0 as well (only instance with sub millisecond transfers should be when there's an error prior to the transfer).
Bytes Read 0, Duration 0: Return 0.
Bytes Read N, duration 0: Exception which we log.
Bytes Read N, duration K: Normal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking we can publish -1 for the error scenarios which can be visualized from the graphs as well.
For duration of 0, changing it to 1 will give us the number of bytes if transferred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to log for cases when we get unexpected data.
- Bytes read is negative
- Duration is negative or 0.
Since we are logging these cases not emitting the data to metrics.
* @param durationMS duration in milliseconds | ||
* @return the number of megabytes/second of the transfer rate | ||
*/ | ||
public static long transferRateMBs(long bytes, long durationMS) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function needs unit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added unit tests
b3267f3
to
4ddbda7
Compare
4ddbda7
to
b00ea94
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm.
31407d7
to
27a79aa
Compare
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 other than one comment.
@@ -632,6 +632,11 @@ public void readBlock(final ExtendedBlock block, | |||
datanode.metrics.incrBytesRead((int) read); | |||
datanode.metrics.incrBlocksRead(); | |||
datanode.metrics.incrTotalReadTime(duration); | |||
if (read < 0 || duration <= 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You duplicated this logic, I would put behind a private helper so people don't forget this check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to a Utils method
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
Thanks @xinglin and @mkuchenbecker for the reviews and @omalley for helping to merge the change. |
@omalley Can you also help in backporting the PR to branch 3.3 |
Co-authored-by: Ravindra Dingankar <rdingankar@linkedin.com>
…che#5397) Co-authored-by: Ravindra Dingankar <rdingankar@linkedin.com>
@rdingankar after updating our cluster to hadoop 3.3.6, we are now seeing |
…ads (apache#5397)" Reason for revert: After updating cluster to hadoop 3.3.6, all datanodes started to log `WARN hdfs.DFSUtil: Unexpected value for data transfer bytes=XXXX duration=0`, every minute, where XXXX ranges from ~4-5kb to ~200kb. This reverts commit 0ca5686.
I agree this is way too verbose. I created https://issues.apache.org/jira/browse/HDFS-17262 |
Description of PR
Transfer rate metric for datanode reads will be calculated as the rate at which bytes are read ( bytes per ms )
With quantiles we will get a distribution of this rate which will be helpful in identifying slow datanodes.
How was this patch tested?
Updated Unit Tests to verify metrics will be emitted.
For code changes:
LICENSE
,LICENSE-binary
,NOTICE-binary
files?