Skip to content

Conversation

@arpitpatawat
Copy link

Description

This PR will add a metric when there is FS health check failure. The tag will contain detail about the node whose FS health check has failed. This PR also adds the node tag to leader/follower check failure metric.

Related Issues

None

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

❌ Gradle check result for ebe8756: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@owaiskazi19
Copy link
Member

@arpitpatawat related failures. Can you fix them and we should be good to get this in?

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testLoggingOnHungIO" -Dtests.seed=18F8ECCBCAA54DA1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=pa-Arab-PK -Dtests.timezone=PNT -Druntime.java=21

FsHealthServiceTests > testLoggingOnHungIO FAILED
    java.lang.NullPointerException: Cannot invoke "org.opensearch.telemetry.metrics.MetricsRegistry.createCounter(String, String, String)" because "this.metricsRegistry" is null
        at __randomizedtesting.SeedInfo.seed([18F8ECCBCAA54DA1:52392A5319A37360]:0)
        at org.opensearch.monitor.fs.FsHealthServiceTests.createObjects(FsHealthServiceTests.java:93)

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout" -Dtests.seed=18F8ECCBCAA54DA1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=pa-Arab-PK -Dtests.timezone=PNT -Druntime.java=21

FsHealthServiceTests > testFailsHealthOnHungIOBeyondHealthyTimeout FAILED
    java.lang.NullPointerException: Cannot invoke "org.opensearch.telemetry.metrics.MetricsRegistry.createCounter(String, String, String)" because "this.metricsRegistry" is null
        at __randomizedtesting.SeedInfo.seed([18F8ECCBCAA54DA1:9A8CB2C2BE593BE9]:0)
        at org.opensearch.monitor.fs.FsHealthServiceTests.createObjects(FsHealthServiceTests.java:93)

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testSchedulesHealthCheckAtRefreshIntervals" -Dtests.seed=18F8ECCBCAA54DA1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=pa-Arab-PK -Dtests.timezone=PNT -Druntime.java=21

FsHealthServiceTests > testSchedulesHealthCheckAtRefreshIntervals FAILED
    java.lang.NullPointerException: Cannot invoke "org.opensearch.telemetry.metrics.MetricsRegistry.createCounter(String, String, String)" because "this.metricsRegistry" is null
        at __randomizedtesting.SeedInfo.seed([18F8ECCBCAA54DA1:6755A9BE8FD65332]:0)
        at org.opensearch.monitor.fs.FsHealthServiceTests.createObjects(FsHealthServiceTests.java:93)

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnSinglePathWriteFailure" -Dtests.seed=18F8ECCBCAA54DA1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=pa-Arab-PK -Dtests.timezone=PNT -Druntime.java=21

FsHealthServiceTests > testFailsHealthOnSinglePathWriteFailure FAILED
    java.lang.NullPointerException: Cannot invoke "org.opensearch.telemetry.metrics.MetricsRegistry.createCounter(String, String, String)" because "this.metricsRegistry" is null
        at __randomizedtesting.SeedInfo.seed([18F8ECCBCAA54DA1:49BF9A3A99B7457D]:0)
        at org.opensearch.monitor.fs.FsHealthServiceTests.createObjects(FsHealthServiceTests.java:93)

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnIOException" -Dtests.seed=18F8ECCBCAA54DA1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=pa-Arab-PK -Dtests.timezone=PNT -Druntime.java=21

FsHealthServiceTests > testFailsHealthOnIOException FAILED
    java.lang.NullPointerException: Cannot invoke "org.opensearch.telemetry.metrics.MetricsRegistry.createCounter(String, String, String)" because "this.metricsRegistry" is null
        at __randomizedtesting.SeedInfo.seed([18F8ECCBCAA54DA1:8FB87904E30A3C74]:0)
        at org.opensearch.monitor.fs.FsHealthServiceTests.createObjects(FsHealthServiceTests.java:93)

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnSinglePathFsyncFailure" -Dtests.seed=18F8ECCBCAA54DA1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=pa-Arab-PK -Dtests.timezone=PNT -Druntime.java=21

FsHealthServiceTests > testFailsHealthOnSinglePathFsyncFailure FAILED
    java.lang.NullPointerException: Cannot invoke "org.opensearch.telemetry.metrics.MetricsRegistry.createCounter(String, String, String)" because "this.metricsRegistry" is null
        at __randomizedtesting.SeedInfo.seed([18F8ECCBCAA54DA1:57DE9DA58242E08]:0)
        at org.opensearch.monitor.fs.FsHealthServiceTests.createObjects(FsHealthServiceTests.java:93)

@github-actions
Copy link
Contributor

❌ Gradle check result for 4450222: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

❌ Gradle check result for 166088c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


import static org.opensearch.common.util.concurrent.ConcurrentCollections.newConcurrentMap;
import static org.opensearch.monitor.StatusInfo.Status.UNHEALTHY;
import static org.opensearch.telemetry.tracing.AttributeNames.NODE_ID;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for traces not metrics.

long differenceInVersion = version - appliedVersion;
clusterManagerMetrics.incrementCounter(
clusterManagerMetrics.lagCounter,
(double) differenceInVersion,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't publish version diff.

COUNTER_METRICS_UNIT
);

lagCounter = metricsRegistry.createCounter(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the usecase of emitting this metric ? We will have to consider Term and Version together for determining the cluster-state version.

Copy link
Contributor

@Bukhtawar Bukhtawar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To ensure consistency, at the end of the periodic run should we add a method emitMetrics which invokes getHealth method to get a consistent state of the health and accordingly emit a metrics?
I feel that should simplify the code and make it more maintainable

Signed-off-by: Arpit Patawat <arpitptw@amazon.com>
Signed-off-by: Arpit Patawat <arpitptw@amazon.com>
Signed-off-by: Arpit Patawat <arpitptw@amazon.com>
@github-actions
Copy link
Contributor

❌ Gradle check result for 518c343: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

private static final Logger logger = LogManager.getLogger(Coordinator.class);
public static final String NODE_LEFT_REASON_LAGGING = "lagging";
public static final String NODE_LEFT_REASON_DISCONNECTED = "disconnected";
public static final String NODE_LEFT_REASON_HEALTHCHECK_FAIL = "health check failed";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please add this as a single string and follow telemetry standards like health.check.failed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are just the name for the actual reason, however the requested convention is done.

Comment on lines 215 to 216

clusterManagerMetrics.incrementCounter(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets move this logic outside.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per the requested changes, this metric is no longer required, removed these changes.

Signed-off-by: Arpit Patawat <arpitptw@amazon.com>
Signed-off-by: Arpit Patawat <arpitptw@amazon.com>
Signed-off-by: Arpit Patawat <arpitptw@amazon.com>
Signed-off-by: Arpit Patawat <arpitptw@amazon.com>
Signed-off-by: Arpit Patawat <arpitptw@amazon.com>
Signed-off-by: Arpit Patawat <arpitptw@amazon.com>
@github-actions
Copy link
Contributor

❕ Gradle check result for ff5e7ed: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@codecov
Copy link

codecov bot commented Apr 29, 2025

Codecov Report

Attention: Patch coverage is 95.83333% with 1 line in your changes missing coverage. Please review.

Project coverage is 72.57%. Comparing base (54e02a7) to head (ff5e7ed).
Report is 133 commits behind head on main.

Files with missing lines Patch % Lines
...g/opensearch/cluster/coordination/Coordinator.java 90.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17949      +/-   ##
============================================
+ Coverage     72.47%   72.57%   +0.10%     
- Complexity    67034    67150     +116     
============================================
  Files          5478     5473       -5     
  Lines        310132   310102      -30     
  Branches      45087    45058      -29     
============================================
+ Hits         224775   225069     +294     
+ Misses        66960    66697     -263     
+ Partials      18397    18336      -61     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stalled Issues that have stalled

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants