-
Couldn't load subscription status.
- Fork 2.3k
Add FS health check Failure metric #17949
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
❌ Gradle check result for ebe8756: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
@arpitpatawat related failures. Can you fix them and we should be good to get this in? |
bacbcd9 to
4450222
Compare
|
❌ Gradle check result for 4450222: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
❌ Gradle check result for 166088c: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
|
||
| import static org.opensearch.common.util.concurrent.ConcurrentCollections.newConcurrentMap; | ||
| import static org.opensearch.monitor.StatusInfo.Status.UNHEALTHY; | ||
| import static org.opensearch.telemetry.tracing.AttributeNames.NODE_ID; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for traces not metrics.
| long differenceInVersion = version - appliedVersion; | ||
| clusterManagerMetrics.incrementCounter( | ||
| clusterManagerMetrics.lagCounter, | ||
| (double) differenceInVersion, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't publish version diff.
| COUNTER_METRICS_UNIT | ||
| ); | ||
|
|
||
| lagCounter = metricsRegistry.createCounter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the usecase of emitting this metric ? We will have to consider Term and Version together for determining the cluster-state version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To ensure consistency, at the end of the periodic run should we add a method emitMetrics which invokes getHealth method to get a consistent state of the health and accordingly emit a metrics?
I feel that should simplify the code and make it more maintainable
Signed-off-by: Arpit Patawat <arpitptw@amazon.com>
Signed-off-by: Arpit Patawat <arpitptw@amazon.com>
Signed-off-by: Arpit Patawat <arpitptw@amazon.com>
3522d20 to
518c343
Compare
|
❌ Gradle check result for 518c343: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
| private static final Logger logger = LogManager.getLogger(Coordinator.class); | ||
| public static final String NODE_LEFT_REASON_LAGGING = "lagging"; | ||
| public static final String NODE_LEFT_REASON_DISCONNECTED = "disconnected"; | ||
| public static final String NODE_LEFT_REASON_HEALTHCHECK_FAIL = "health check failed"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please add this as a single string and follow telemetry standards like health.check.failed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are just the name for the actual reason, however the requested convention is done.
server/src/main/java/org/opensearch/cluster/coordination/LagDetector.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/cluster/coordination/FollowersChecker.java
Outdated
Show resolved
Hide resolved
|
|
||
| clusterManagerMetrics.incrementCounter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets move this logic outside.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per the requested changes, this metric is no longer required, removed these changes.
Signed-off-by: Arpit Patawat <arpitptw@amazon.com>
Signed-off-by: Arpit Patawat <arpitptw@amazon.com>
Signed-off-by: Arpit Patawat <arpitptw@amazon.com>
Signed-off-by: Arpit Patawat <arpitptw@amazon.com>
Signed-off-by: Arpit Patawat <arpitptw@amazon.com>
78c2721 to
c3ad4a8
Compare
Signed-off-by: Arpit Patawat <arpitptw@amazon.com>
c3ad4a8 to
ff5e7ed
Compare
|
❕ Gradle check result for ff5e7ed: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #17949 +/- ##
============================================
+ Coverage 72.47% 72.57% +0.10%
- Complexity 67034 67150 +116
============================================
Files 5478 5473 -5
Lines 310132 310102 -30
Branches 45087 45058 -29
============================================
+ Hits 224775 225069 +294
+ Misses 66960 66697 -263
+ Partials 18397 18336 -61 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
This PR is stalled because it has been open for 30 days with no activity. |
Description
This PR will add a metric when there is FS health check failure. The tag will contain detail about the node whose FS health check has failed. This PR also adds the node tag to leader/follower check failure metric.
Related Issues
None
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.