HDFS-17055 Export HAState as a metric from Namenode for monitoring #5764
💔 -1 overall
This message was automatically generated. |
When all nodes are shut down, we should pass `null` to getCurrentBlockPoolID() instead of `cluster`: `UpgradeUtilities.getCurrentBlockPoolID(null)` rather than `UpgradeUtilities.getCurrentBlockPoolID(cluster)`.
Hi @goiri, thanks for taking a look at this PR and approving it. Please don't merge it yet. I am still working on fixing some unit test failures.
Hi @goiri, Summary of changes to fix unit test failures.
TestRollingUpgrade passed all tests on my laptop. Let's see how the build goes.
💔 -1 overall
...-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
🎊 +1 overall
💔 -1 overall
💔 -1 overall
The TestObserverNode unit test failure does not seem to be related to the change in this PR. The error is a connection error in RPC (Client.getRpcResponse()).
It passed all unit tests when run on my laptop as well.
Triggering another build.
💔 -1 overall
💔 -1 overall
💔 -1 overall
💔 -1 overall
🎊 +1 overall
🎊 +1 overall
We have two clean builds. This PR is ready to merge. Thanks.
@melissayou further comments?
LGTM to merge. Thanks! @goiri
Thanks @goiri / @melissayou for reviewing this PR. Thanks @goiri for merging into trunk!
Description of PR
We'd like to measure the uptime of the Namenodes: the percentage of time during which the active/standby/observer node is available (up and running). We could monitor the Namenode from an external service, such as ZKFC, but that would require the external service itself to be available 100% of the time. When this external monitoring service is down, we have no information on whether our Namenodes are still up.
We propose a different approach: emit the Namenode state directly from the Namenode itself. Whenever a data point for this metric is missing, we consider the corresponding Namenode to be down/unavailable. In other words, we assume the metric collection/monitoring infrastructure is 100% reliable.
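To make the "missing data point means down" rule concrete, here is a minimal sketch of how a monitoring backend might turn the emitted data points into an uptime percentage. It is not part of this patch: the class `UptimeEstimator`, its method, and the fixed emission period are all hypothetical, and it assumes the metric is expected once per period.

```java
import java.util.List;

/** Hypothetical helper (not part of Hadoop): estimates uptime from metric data points. */
final class UptimeEstimator {
    private UptimeEstimator() {}

    /**
     * Splits the window [startSec, endSec) into slots of periodSec seconds and
     * returns the fraction of slots that contain at least one data point.
     * A slot without a data point is counted as "Namenode down".
     */
    static double availability(List<Long> sampleTimesSec, long startSec, long endSec, long periodSec) {
        if (periodSec <= 0 || endSec <= startSec) {
            throw new IllegalArgumentException("need periodSec > 0 and endSec > startSec");
        }
        // Round up so a trailing partial slot is still counted.
        int slots = (int) ((endSec - startSec + periodSec - 1) / periodSec);
        boolean[] seen = new boolean[slots];
        for (long t : sampleTimesSec) {
            if (t >= startSec && t < endSec) {
                seen[(int) ((t - startSec) / periodSec)] = true;
            }
        }
        int covered = 0;
        for (boolean s : seen) {
            if (s) {
                covered++;
            }
        }
        return (double) covered / slots;
    }
}
```

For example, with a 10-second period over a 40-second window, data points at 0, 10 and 30 seconds cover three of the four slots, so the estimate is 0.75. The estimate is only as good as the assumption stated above, namely that the metric pipeline itself never drops data.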
One implementation detail: in Hadoop, we have the NameNodeMetrics class, which is currently used to emit all metrics for NameNode.java. However, we don't think that is a good place to emit the NameNode HAState. HAState is stored in NameNode.java and we should emit it directly from NameNode.java. Otherwise, we would duplicate this information in two classes and have to keep them in sync. Besides, the NameNodeMetrics class does not have a reference to the NameNode object it belongs to. A NameNodeMetrics is created by a static function initMetrics() in NameNode.java.
We shouldn't emit HA state from FSNameSystem.java either, as it is initialized from NameNode.java and all state transitions are implemented in NameNode.java.
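The design argument above can be illustrated with a small, Hadoop-free sketch: the object that owns the HA state also exposes the gauge, so there is no second copy to keep in sync. Everything here is an assumption for illustration. The class names, the `haStateGauge()` method, the state list beyond active/standby/observer, and the numeric encoding are hypothetical and are not taken from the patch.

```java
/** Hypothetical sketch: the state owner exposes the gauge itself. Not the actual NameNode code. */
final class HaStateGaugeSketch {
    private HaStateGaugeSketch() {}

    /** Assumed state list; the numeric encoding below is simply the enum ordinal. */
    enum State { INITIALIZING, ACTIVE, STANDBY, OBSERVER, STOPPING }

    /** Stand-in for the NameNode: the single place where the HA state lives. */
    static final class Node {
        // volatile so a metrics thread sees the latest transition.
        private volatile State state = State.INITIALIZING;

        /** Called on every HA state transition. */
        void transitionTo(State next) {
            state = next;
        }

        /** Gauge value read by the metrics system on each collection cycle. */
        int haStateGauge() {
            return state.ordinal();
        }
    }
}
```

Because the gauge reads the field that the transitions write, a separate metrics class never needs a back-reference to the node or its own cached copy of the state.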
How was this patch tested?
mvn test -Dtest="TestHAMetrics"