Health indicators based on Service Level Objectives #21311

Open: jkschneider wants to merge 2 commits into main

jkschneider
Contributor

@jkschneider jkschneider commented May 4, 2020

This feature adds commonly requested functionality: it lets an application aggregate a set of metrics-based key performance indicators down into a health indicator.

I fully expect some changes, probably significant changes, based on feedback iterations on this, but want to offer this up early in the 2.4.0 release iteration so we have time to iterate and also dogfood any autoconfigured service level objectives.

Some indicators are known to be broadly applicable to a wide range of Java applications, and those could be autoconfigured. An example of a set of such indicators is defined here and autoconfigured by this pull request (JvmServiceLevelObjectives.MEMORY).

In many cases, users would like to configure a load balancer to avoid instances that are failing a key performance indicator by configuring an HTTP health check on the load balancer. In fact, some applications may already be doing this for the health indicators Spring Boot or users already provide. Example platform load balancer configurations that can be pointed to /actuator/health:

metadata:
  name: instance-reported-utilization
  annotations:
    service.beta.kubernetes.io/do-loadbalancer-healthcheck-port: "80"
    service.beta.kubernetes.io/do-loadbalancer-healthcheck-protocol: "http"
    service.beta.kubernetes.io/do-loadbalancer-healthcheck-path: "/actuator/health"

See micrometer-metrics/micrometer#2055 for more detail.

The HealthMeterRegistry

As of 1.6.0, Micrometer has a new implementation: micrometer-registry-health. An autoconfiguration was added to spring-boot-actuator-autoconfigure for this new implementation.

Any @Bean ServiceLevelObjective is configured onto the HealthMeterRegistry and bound as a Spring Boot HealthIndicator.

What it looks like in /actuator/health

[image: screenshot of the /actuator/health response]

About ServiceLevelObjective

Service level objectives broadly have the following capabilities:

  • Are defined as a single- or multi-indicator test against a set of time series registered to the HealthMeterRegistry.
  • Can declare the MeterBinders they require, which supply the measurements needed to determine availability.
  • Contain a filterable and transformable name and tag set, mapped to the Spring Boot bean name and the Health#details map, respectively.
  • Optionally contain a readable base unit that is mapped to health details.
  • Can pretty-print values and thresholds for human-readable interpretation of an SLO at some instant.
  • Can be defined to look back and aggregate over a time window in different ways.
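
To make the capabilities listed above concrete, here is a minimal plain-Java sketch of a single-indicator objective that looks back over a time window, takes the maximum, tests it against a threshold, and pretty-prints the result. `WindowedMaxSlo` and all of its methods are hypothetical names invented for illustration; this is not Micrometer's `ServiceLevelObjective` API, and treating an empty window as healthy is an assumption.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;

/** Hypothetical model of a windowed SLO; not part of Micrometer or Spring Boot. */
final class WindowedMaxSlo {

    /** One timestamped measurement of the underlying indicator. */
    private static final class Sample {
        final long timestampMillis;
        final double value;

        Sample(long timestampMillis, double value) {
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    private final String name;
    private final double threshold;
    private final long windowMillis;
    private final Deque<Sample> samples = new ArrayDeque<>();

    WindowedMaxSlo(String name, double threshold, Duration window) {
        this.name = name;
        this.threshold = threshold;
        this.windowMillis = window.toMillis();
    }

    /** Records a measurement taken at the given time (timestamps must not decrease). */
    void record(long nowMillis, double value) {
        samples.addLast(new Sample(nowMillis, value));
        evict(nowMillis);
    }

    /** Drops samples that have fallen out of the look-back window. */
    private void evict(long nowMillis) {
        while (!samples.isEmpty()
                && samples.peekFirst().timestampMillis <= nowMillis - windowMillis) {
            samples.removeFirst();
        }
    }

    /** Aggregates the window by taking its maximum; 0.0 when empty (an assumption). */
    double maxInWindow(long nowMillis) {
        evict(nowMillis);
        return samples.stream().mapToDouble(s -> s.value).max().orElse(0.0);
    }

    /** The objective is met while the windowed maximum stays below the threshold. */
    boolean healthy(long nowMillis) {
        return maxInWindow(nowMillis) < threshold;
    }

    /** Human-readable rendering of the current value and the threshold. */
    String describe(long nowMillis) {
        return String.format(Locale.ROOT, "%s: value=%.2f%%, mustBe=<%.0f%%",
                name, maxInWindow(nowMillis) * 100, threshold * 100);
    }
}
```

Other aggregations (average, count, rate) would replace only `maxInWindow`; the window bookkeeping stays the same.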

API error ratio property-driven configuration

management.metrics.export.health.api-error-budgets.api.customer=0.01
management.metrics.export.health.api-error-budgets.admin=0.02

The above properties result in two service level objective health indicators called apiErrorRatioApiCustomer and apiErrorRatioAdmin, which check for a SERVER_ERROR outcome to total throughput ratio of less than 1% for requests to paths starting with /api/customer and 2% for requests to paths starting with /admin, respectively.
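
The check described above can be sketched in plain Java as follows. `ApiErrorBudget` and its methods are hypothetical helpers written for this illustration, not code from the pull request; the rule for deriving names and path prefixes from property keys is inferred from the two examples, and treating zero traffic as within budget is an assumption.

```java
/** Hypothetical illustration of an API error budget check; not the PR's implementation. */
final class ApiErrorBudget {

    private ApiErrorBudget() {
    }

    /** Maps a property key suffix such as "api.customer" to "apiErrorRatioApiCustomer". */
    static String indicatorName(String key) {
        StringBuilder name = new StringBuilder("apiErrorRatio");
        for (String part : key.split("\\.")) {
            if (part.isEmpty()) {
                continue;
            }
            name.append(Character.toUpperCase(part.charAt(0))).append(part.substring(1));
        }
        return name.toString();
    }

    /** Maps a property key suffix such as "api.customer" to the path prefix "/api/customer". */
    static String pathPrefix(String key) {
        return "/" + key.replace('.', '/');
    }

    /**
     * Returns true when the ratio of SERVER_ERROR outcomes to total requests
     * is strictly below the budget. No traffic counts as within budget (an assumption).
     */
    static boolean withinBudget(long serverErrors, long totalRequests, double budget) {
        if (totalRequests == 0) {
            return true;
        }
        return (double) serverErrors / totalRequests < budget;
    }
}
```

For example, 5 server errors out of 1,000 requests is a ratio of 0.5%, which stays within the 1% budget for /api/customer, while 30 out of 1,000 (3%) would exceed the 2% budget for /admin.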

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label May 4, 2020
@jkschneider
Contributor Author

jkschneider commented May 4, 2020

Open questions

We build health indicators with AbstractHealthIndicator(slo.getFailedMessage()). It's unclear to me if the failed message ever appears in /actuator/health response body output.

Some of the SLOs are a combination of two or more indicators. For example, in jvmTotalMemory, we set a relatively low threshold on GC overhead (20% of CPU time over the last 5 minutes) if there is 90% pool utilization as well. These composite SLOs are registered with the relatively new CompositeHealthContributor.fromMap(..) API. Unfortunately there is no way I can see to provide details and a failed message name on the composite. I'd like to add details and a failed message for each contributing health indicator and potentially a different one for what it means for a set of such indicators to fail together. @philwebb you may have suggestions? An example is included below of what I think might be nice (specifically the details directly underneath jvmTotalMemory)?

"jvmTotalMemory": {
  "status": "UP",
  "details": { 
     "someTag": "someValue"
  },
  "components": {
    "jvmGcOverhead": {
      "status": "UP",
      "details": {
        "value": "0.01%",
        "mustBe": "<20%",
        "unit": "percent CPU time spent"
      }
    },
    "jvmMemoryConsumption": {
      "status": "UP",
      "details": {
        "value": "9.09%",
        "mustBe": "<90%",
        "unit": "maximum percent used in last 5 minutes"
      }
    }
  }
}
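
As a rough illustration of the shape proposed above, the following plain-Java sketch builds the same nested structure as maps, with details attached directly to the composite. `CompositeSloSketch` is a hypothetical helper, not a Spring Boot or Micrometer API, and the "DOWN if any component is DOWN" rule is an assumption chosen for the example; the actual composite SLO may combine its indicators differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of a composite health entry that carries its own details. */
final class CompositeSloSketch {

    private CompositeSloSketch() {
    }

    /** Builds one component entry with the value / mustBe / unit details shown above. */
    static Map<String, Object> component(boolean up, String value, String mustBe, String unit) {
        Map<String, Object> details = new LinkedHashMap<>();
        details.put("value", value);
        details.put("mustBe", mustBe);
        details.put("unit", unit);

        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("status", up ? "UP" : "DOWN");
        entry.put("details", details);
        return entry;
    }

    /**
     * Builds the composite entry. Its status is DOWN if any component is DOWN
     * (an assumed aggregation rule), and its own details sit beside the components.
     */
    static Map<String, Object> composite(Map<String, String> details,
                                         Map<String, Map<String, Object>> components) {
        boolean allUp = components.values().stream()
                .allMatch(c -> "UP".equals(c.get("status")));

        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("status", allUp ? "UP" : "DOWN");
        entry.put("details", new LinkedHashMap<>(details));
        entry.put("components", new LinkedHashMap<>(components));
        return entry;
    }
}
```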

@jkschneider jkschneider force-pushed the health-slos branch 3 times, most recently from 220c8ba to d907ba5 Compare May 5, 2020 13:26
@philwebb philwebb added type: enhancement A general enhancement and removed status: waiting-for-triage An issue we've not yet triaged labels May 5, 2020
@philwebb philwebb added this to the 2.4.x milestone May 5, 2020
@philwebb
Member

philwebb commented May 5, 2020

Thanks @jkschneider! I'll target this for 2.4.x so we remember to take a look as soon as the 2.3.0 release crunch is over.

@snicoll snicoll added the for: team-attention An issue we'd like other members of the team to review label Sep 9, 2020
@bclozel bclozel modified the milestones: 2.4.x, 2.x Sep 28, 2020
@bclozel bclozel added status: blocked An issue that's blocked on an external project change and removed for: team-attention An issue we'd like other members of the team to review labels Sep 28, 2020
@bclozel
Member

bclozel commented Sep 28, 2020

We haven't had a chance to take a look at this change, nor upgrade to Micrometer 1.6.
We're already quite late in the Milestone cycle and we don't think we'll have time to address this change properly.
We need to take a look at this change and its implications (including the new concepts introduced and the Health endpoint format).

@mbhave
Contributor

mbhave commented Sep 16, 2021

@snicoll and I discussed this today. There are a few things that came up:

  1. Since we decided that the diskspace health indicator should ideally be something that can be configured in the monitoring system, this feels very much along those lines. If we decide to surface the SLOs as health indicators, we should align our strategy for diskspace accordingly. Even with the deprecation of the diskspace indicator, we could surface that information in health via the SLOs.
  2. We are not sure if having a top-level component for every SLO is the best way to do this. Maybe having some sort of nested structure for the SLOs might be a better alternative.
  3. From an API perspective, we could have an API to expose SLOs which we could use to create the composite rather than the current method which registers beans within a bean method.

Flagging for team-meeting so that we can discuss this on the next team call.

@mbhave mbhave added for: team-meeting An issue we'd like to discuss as a team to make progress and removed status: blocked An issue that's blocked on an external project change labels Sep 16, 2021
@wilkinsona
Member

wilkinsona commented Sep 17, 2021

We discussed this some more as a team today and our feeling is that we're not sure that we have a strong enough opinion to auto-configure SLOs as health indicators. We can see that it may make sense for some users but not for others. For example, in some cases, a proxy will already be aware of the error rate for requests that it routes to an application instance. In this case, exposing the information via a health endpoint that it will also be monitoring will be of minimal value, and may even be harmful depending on how things behave when the application's health changes. For users that do want to expose SLOs as health indicators, we could provide some classes that make it easier to do so.

Since this proposal was made, we've also introduced the concept of application state. It may be that some users want to configure things such that an unmet objective results in a change to the application state to indicate that it's no longer ready, for example. We could provide some helper classes that a user can configure to connect SLOs to application state.
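
One possible shape for such a helper, sketched in plain Java under stated assumptions: `SloReadinessBridge` is a hypothetical class, its `ReadinessState` enum only mirrors the names of Spring Boot's readiness states, and the `Consumer` stands in for whatever actually publishes the state change (in a real application that might be a call to `AvailabilityChangeEvent.publish(...)`). The sketch publishes only on transitions so that repeated checks do not flood listeners.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

/** Hypothetical bridge from an SLO check to a readiness state; not a Spring Boot class. */
final class SloReadinessBridge {

    /** Stand-in for Spring Boot's readiness states, redefined here to stay self-contained. */
    enum ReadinessState { ACCEPTING_TRAFFIC, REFUSING_TRAFFIC }

    private final BooleanSupplier objectiveMet;
    private final Consumer<ReadinessState> publisher;
    private ReadinessState lastPublished; // null until the first check

    SloReadinessBridge(BooleanSupplier objectiveMet, Consumer<ReadinessState> publisher) {
        this.objectiveMet = objectiveMet;
        this.publisher = publisher;
    }

    /** Evaluates the objective and publishes the state only when it has changed. */
    ReadinessState check() {
        ReadinessState current = objectiveMet.getAsBoolean()
                ? ReadinessState.ACCEPTING_TRAFFIC
                : ReadinessState.REFUSING_TRAFFIC;
        if (current != lastPublished) {
            lastPublished = current;
            publisher.accept(current);
        }
        return current;
    }
}
```

A caller would invoke `check()` on a schedule of its choosing; how often to evaluate, and whether to add hysteresis before refusing traffic, is left to the user in this sketch.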

We discussed possibly auto-configuring the HealthMeterRegistry, automatically adding any ServiceLevelObjective beans to it. We could auto-configure some ServiceLevelObjective beans such as JvmServiceLevelObjectives.MEMORY and OperatingSystemServiceLevelObjectives.DISK rather than hard-coding them as proposed here. This would align with our auto-configuring of Micrometer's various Jvm…Metrics classes.

Overall, our feeling was that we would stop short of anything that exposes the SLOs externally, instead auto-configuring the HealthMeterRegistry and supporting beans and making it easier for a user to then plug the SLOs into health or application state in a way that meets their specific needs.

@shakuzen @jonatan-ivanov Could we have your input here please? Are we right to be cautious and just give users the parts they need and leave them to join things together or is there some clearly established usage of HealthMeterRegistry and SLOs that means that we can proceed with confidence in a particular direction?

@wilkinsona wilkinsona added status: blocked An issue that's blocked on an external project change and removed for: team-meeting An issue we'd like to discuss as a team to make progress labels Sep 17, 2021
@mbhave mbhave removed their assignment Sep 17, 2021
@philwebb philwebb added status: pending-design-work Needs design work before any code can be developed and removed status: blocked An issue that's blocked on an external project change labels Sep 20, 2021
@philwebb philwebb force-pushed the main branch 3 times, most recently from 1ca278f to 902dd0b Compare November 19, 2021 20:17
@philwebb philwebb modified the milestones: 2.x, 3.x Aug 19, 2022
Labels
status: pending-design-work Needs design work before any code can be developed type: enhancement A general enhancement