Actuator - support optional downstream services #25151

jaredpetersen · 2021-02-09T18:16:16Z

I'm using the Actuator project to display health statuses for downstream services. My application is utilizing an optional downstream service; the service is used all the time normally but we have fallbacks in place if it goes down for whatever reason.

Optional services are not supported at this time and I'd like to request that we change that. The problem is that the HealthIndicator for the optional service will show the downstream service as being down, that DOWN status will be aggregated to the overall service status, and the health check will fail because of a non-200 status code. This is not desirable. Optional downstream services should show up as "DOWN" but not have that status aggregated.

This enhancement has been requested in the past a couple of times over the years:

The general recommendation was to build a custom HealthAggregator. HealthAggregator has been deprecated so the replacement recommendation is a StatusAggregator + Health Groups. This does not work.

Here's a StatusAggregator that I built for a health group named "optional":

@Qualifier("optional")
@Component
public class OptionalStatusAggregator implements StatusAggregator {
    @Override
    public Status getAggregateStatus(Set<Status> statuses) {
        return Status.UP;
    }
}

And here's my actuator configuration:

management:
  endpoint:
    health:
      show-details: always
      group:
        optional:
          include:
            - someservice
  endpoints:
    web:
      base-path: /
      exposure:
        include:
          - info
          - health

The intent is to always show "UP" for the optional health group since none of these are critical for the overall system health. The components in the group will still have their individual statuses but they are to be ignored once aggregated.

If you hit the health group's health endpoint (localhost:8080/health/optional), the aggregation works as intended. However, my custom status aggregator is not used for the overall health endpoint (localhost:8080/health) so the "DOWN" status is propagated up to the service health.

The only solution I can see is to create a StatusAggregator that's used for everything. However, the StatusAggregator does not provide where the statuses came from, only the raw Status values of "UP", "DOWN", etc. in the form of a Set. There is no way to have the aggregator ignore the statuses of some downstream services because there's no context. It was proposed that this be fixed in #23759 but the team decided not to extend the API of the StatusAggregator.

The recommendation was to instead create a custom status called "PARTIAL_DOWN" and create a status aggregator to handle that. The problem with this recommendation is that the downstream service that is actually down will show up with the status of "PARTIAL_DOWN" which is not true and the overall status will be "PARTIAL_DOWN" (kind of true) or "UP" (true) depending on how the StatusAggregator is implemented.

I'd like you to reconsider adding the component context information into StatusAggregator as an enhancement. This information is needed in order to do any sort of meaningful status aggregation. In the current state, the existing SimpleStatusAggregator is effectively the only possible way to aggregate statuses unless you choose to hardcode the status like I did in my example (which is not useful for the overall health endpoint status aggregation). At this time, I will not be able to add my optional downstream services to the health endpoint and there will unfortunately not be any visibility into how they are doing from the application's perspective.

The text was updated successfully, but these errors were encountered:

bclozel · 2021-02-10T10:28:31Z

Indeed, this has been asked a few times and so far the team pointed at a custom HealthAggregator or health groups as the preferred solutions for this.

I understand this is still a valid concern, so let's try to take a step back. The main problem here is that:

we want custom health indicators to give us information about the broader environment, such as remote services availability, rate-limiting for API tokens, or just the health of services that are not vital to the application (since we might have sensible fallbacks for them)
an indicator being down doesn't mean that the entire application should be considered as unavailable. Again, the application can have fallbacks for those.
there should be a difference between "the application has a problem" and "the application is really broken and needs to be restarted". If we don't make that difference, a 3rd party system having issues might take down all services depending on it.
there should also be a difference between "the application is not available" and "the application is not ready to serve traffic". When applications need to perform long-running tasks at startup before serving traffic, it can be challenging to configure the health checks to adapt to that while not configuring super long timeouts overall.
platform health checks are usually pointing at /actuator/health - so if something fails, the whole application is considered as failing and the platform takes action (usually restarting the app, raising notifications, etc)

Since #19593, Spring Boot ships support for liveness and readiness probes. I think they're solving all issues listed above.
Once enabled, developers can point the health checks to /actuator/health/liveness and /actuator/health/readiness for optimal platform checks. We can still look at the global /actuator/health for more information.

#22825 will make those health groups enabled by default and I think this approach makes more sense in the long run.

Would this approach solve your current use case? If not, could you explain what's missing?

jaredpetersen · 2021-02-10T19:13:00Z

@bclozel Thanks for your quick response!

Points 1 - 3 and 5 are a correct assessment of the problem. 4 is not really relevant because if it's an optional downstream service with appropriate fallbacks, the main service will always be ready for traffic.

I took a look at the liveness and readiness probes a bit more in detail and used this as my application config to get a grasp on it:

management:
  endpoint:
    health:
      show-details: always
      probes:
        enabled: true
      group:
        readiness:
          include:
            - db
            - redis
        liveness:
          include:
            - db
            - redis
  health:
    livenessstate:
      enabled: true
    readinessstate:
      enabled: true

It looks like you need the services you actually want to affect the probes explicitly defined in the readiness/liveness groups. So since we have a fallback for our optional service, we don't want that to be included in either group.

This looks like it solves the problem of automated health checks since the readiness/liveness probes only show the components that are needed to make the application run. If you want greater details of all of the individual components, humans can look at the regular health endpoint.

However, I think that this is a bit of a bandaid. The greater health endpoint shows the statuses of everything but lies to you (the human) by saying that the overall application is down. I'd say that the problem is only partially solved until the StatusAggregator interface is adjusted to provide the component name context.

bclozel · 2021-02-10T19:51:52Z

@jaredpetersen thanks for your reply!

Point 4) is still key - in your case, how is the platform responding to an /actuator/health endpoint being "DOWN" in general? Also, how does the platform decides that an application instance is up and can process traffic? What happens if an application is available but takes time to process startup tasks (like ApplicationRunner tasks)?

The issue here is that the "DOWN" state for an application is subjective. For sure, it's down if you can't even connect to it. But with fallbacks, optional services, or health indicators used to track the state of external systems - there are a lot of shades to consider.

We decided in the past to not extend this part because we felt like it would introduce more complexity in the design and not really solve the core problem: the health of an application is a matter of perspective. We thought that creating several specialized views with a simple status aggregation logic (health groups) would be easier to reason about.

Now that liveness and readiness are part of the picture, we need to decide whether health groups are the best way to go or if we should improve the contracts here.

I'm marking this issue for team attention and we'll discuss this. Thanks!

jaredpetersen · 2021-02-10T21:19:39Z

@bclozel

Point 4) is still key - in your case, how is the platform responding to an /actuator/health endpoint being "DOWN" in general? Also, how does the platform decides that an application instance is up and can process traffic? What happens if an application is available but takes time to process startup tasks (like ApplicationRunner tasks)?

Sure, this is relevant generally speaking. You want your application manager (Kubernetes, etc.) to know that your service is up and ready to receive traffic. But in the context of downstream dependencies with suitable fallbacks, these indicators are always going to show healthy so that your application isn't killed off. In that case, we just want some way for a human to see from the application's perspective that performance is degraded.

The issue here is that the "DOWN" state for an application is subjective. For sure, it's down if you can't even connect to it. But with fallbacks, optional services, or health indicators used to track the state of external systems - there are a lot of shades to consider.

100%. I view "DOWN" as the application is basically dead, please send help. And even with these liveness and readiness checks in place which can be set to ignore the health of these services with fallbacks, the overall health endpoint still shows that something is very broken.

We decided in the past to not extend this part because we felt like it would introduce more complexity in the design and not really solve the core problem: the health of an application is a matter of perspective. We thought that creating several specialized views with a simple status aggregation logic (health groups) would be easier to reason about.

I definitely understand where you're coming from here. But it's more confusing when an "optional" service makes the entire application health look like it's totally broken.

Ideally, there would be a simple config where you can pass which checks are optional and should not be aggregated. In that case, no one would need to override the StatusAggregator at all to remove these optional services from aggregation. This isn't a totally off-the-wall problem so having it built right into the configuration would be very useful. But I would settle for an enhanced contract so that I can do my thing without impacting others in a way they may find confusing.

I'm marking this issue for team attention and we'll discuss this. Thanks!

Thank you so much for your help on this and your careful, nuanced thoughts on the matter. Looking forward to hearing the results of your discussion 😄

jaredpetersen · 2021-02-27T05:16:33Z

I see that the label changed -- what was the result of the meeting?

wilkinsona · 2021-03-02T08:27:32Z

We need to do some design work to figure out what the next steps should be.

rellikt · 2021-07-13T10:29:02Z

Is there any progress? Is there a chance that this will be implemented in one of the upcoming releases?

wilkinsona · 2021-07-13T10:51:28Z

@rellikt We don't have any firm plans at the moment and have not yet had a chance to do the necessary design work.

FinalFrag · 2022-06-20T16:00:31Z

Are there any updates on this topic? I'm looking to set up exactly what is being discussed here.

Is there a workaround to get a Map of the name and the status in the StatusAggregator?

wilkinsona · 2022-06-20T16:17:09Z

I'm afraid not. Updates will be added to this issue when we have them.

wimdeblauwe · 2022-07-04T09:42:42Z

I just created https://stackoverflow.com/questions/72854464/exclude-mail-health-indicator-from-default-actuator-health-endpoint, but now I found this issue that seems to be related.

In my example, I don't want /actuator/health to return DOWN if the mail server is down, otherwise Azure will think my application is down and will try to restart it.

I guess for now the only workaround would be to create a new group (e.g. withoutmail), exclude mail and point Azure to that group (/actuator/health/withoutmail)?

FinalFrag · 2022-07-04T09:55:45Z

Mine was a very similar situation. In the end we decided to put all mandatory dependencies in the health check and move all optional dependencies to the info section by implementing an InfoContributor for each dependency.

imosapatryk · 2023-04-07T16:45:53Z

Any news about the feature being introduced? I was trying to aggregate the statuses for composite contributors with specific behaviour for specific composite. Passing the name of the composite to the status aggregator would be really enough, especially when the name is carried by the aggregate method.

wilkinsona · 2023-04-11T10:43:06Z

@imosapatryk please see #25151 (comment).

imosapatryk · 2023-04-11T17:27:57Z

I understand but it's been almost a year and still nothing on the horizon. Would be really nice get to know if at least you are planning to introduce this change.

wilkinsona · 2023-04-12T07:32:14Z

If we had no plans to introduce the change, the issue would have been closed. The issue is in the general backlog milestone which is described as an "unordered set of issues that we want to solve in a future release". That's an accurate description of the state of this issue. It's an enhancement that we'd like to make but we have no specific plans for the timing of the change.

ferrarimartin · 2023-08-14T07:30:54Z

Hello, just to add my use case for this, I was hoping to use /actuator/health as both a readiness probe and a source for firing alerts. So, if redis is down, the service would still show as ready, but there could be alerts fired stating that somethins is wrong with redis, and maybe alerts about the service being degraded. That would be really natural to think about and implement, I believe. It can be done with health groups, but it involves more work declaring the groups, and the /actuator/health endpoint would still show a state that's not real (the whole service down if there's an optional component down). Thanks!!

voliveirajr · 2023-08-29T14:49:45Z

Checking if there is any expectation of this getting implemented soon and what would be the suggestion for a workaround till we get it.
In my situation we have a service that supports multiple "flows", the health status of it is relevant for different clients, Client 1 uses flows A and B, Client 2 flows A and C.
The main problem for us is even using the agregators to compose it any unhealthy flow will set the whole application down.
Suggestions?

bclozel · 2023-08-29T16:09:58Z

@voliveirajr I think your flows use case points to health groups, which is already implemented in Spring Boot. You can point each client to a different health group. I don't see how this particular issue would solve your problem, as clients would have no way to select which "flow" they're getting when they're querying the root health endpoint.

bclozel · 2023-08-29T16:14:03Z

Since this issue was created, I think many platforms are now considering the liveness and readiness aspects of an application and use separate endpoints. The root endpoint and aggregated status doesn't reflect the "overall status of the application", but rather the aggregated status of all configured indicators.

I think we should now close this issue; the status aggregator route introduces complex logic where this preference should be configured in a declarative way by creating health groups.

jaredpetersen · 2023-09-25T04:02:40Z

The focus here is on services that have fallbacks and wanting a way to cleanly get statuses for these aggregated up properly. Services/features that have a fallback are called "optional" here; the application works, but maybe in a deteriorated state. You don't want it to fail the whole health check yet. If the fallback fails (everything in the group fails), then you want to fail the health check. So the specific request is for a more nuanced take on health checks and status aggregation.

I'll admit that this could have been more effectively distilled back in 2021, but you live and learn 🙂.

Readiness and liveness aren't a replacement here. That's something else entirely, and it already existed at the time. You don't want a service/feature with a functioning fallback to fail either probes as that will get your application killed.

Health groups were never a solution either, and also existed at the time. If anything in the health group fails, the whole group fails. There's no nuance, no support for optionality.

The solutions seems to be:

Native support for optionality
Custom status aggregation written by the developer, as referenced previously
Custom status check written by the developer that handles that fallback nuance internally -- has the downside of only being able to expose UP/DOWN externally

If the organization wants to reject it on the basis of introducing too much complexity or not having enough value, that's okay! There are other options here aside from native support. But this isn't something that is solved by the framework today without some extra work.

sheetalj2205 · 2024-03-21T08:35:16Z

Hey @jaredpetersen @bclozel is there any update on the aggregate status of the service? I also want to implement this, I don't want the aggregate status to be DOWN if any of the optional services is DOWN.

sheetalj2205 · 2024-04-03T13:11:22Z

Mine was a very similar situation. In the end we decided to put all mandatory dependencies in the health check and move all optional dependencies to the info section by implementing an InfoContributor for each dependency.

Hey, @FinalFrag How you're configuring this in yaml file so it will also get triggered after every 15 seconds like /health?

spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Feb 9, 2021

bclozel added the status: waiting-for-feedback We need additional information before we can continue label Feb 10, 2021

spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Feb 10, 2021

bclozel added the for: team-meeting An issue we'd like to discuss as a team to make progress label Feb 10, 2021

philwebb added status: pending-design-work Needs design work before any code can be developed and removed for: team-meeting An issue we'd like to discuss as a team to make progress labels Feb 19, 2021

wilkinsona added type: enhancement A general enhancement and removed status: feedback-provided Feedback has been provided labels Jul 13, 2021

wilkinsona removed the status: waiting-for-triage An issue we've not yet triaged label Apr 26, 2022

wilkinsona added this to the General Backlog milestone Apr 26, 2022

bclozel mentioned this issue Aug 30, 2023

Configurable Health Indicator to observe HTTP Endpoints #36953

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Actuator - support optional downstream services #25151

Actuator - support optional downstream services #25151

jaredpetersen commented Feb 9, 2021 •

edited

Loading

bclozel commented Feb 10, 2021 •

edited

Loading

jaredpetersen commented Feb 10, 2021 •

edited

Loading

bclozel commented Feb 10, 2021

jaredpetersen commented Feb 10, 2021

jaredpetersen commented Feb 27, 2021

wilkinsona commented Mar 2, 2021

rellikt commented Jul 13, 2021 •

edited

Loading

wilkinsona commented Jul 13, 2021

FinalFrag commented Jun 20, 2022

wilkinsona commented Jun 20, 2022

wimdeblauwe commented Jul 4, 2022

FinalFrag commented Jul 4, 2022

imosapatryk commented Apr 7, 2023 •

edited

Loading

wilkinsona commented Apr 11, 2023

imosapatryk commented Apr 11, 2023

wilkinsona commented Apr 12, 2023

ferrarimartin commented Aug 14, 2023

voliveirajr commented Aug 29, 2023

bclozel commented Aug 29, 2023

bclozel commented Aug 29, 2023

jaredpetersen commented Sep 25, 2023

sheetalj2205 commented Mar 21, 2024

sheetalj2205 commented Apr 3, 2024

Actuator - support optional downstream services #25151

Actuator - support optional downstream services #25151

Comments

jaredpetersen commented Feb 9, 2021 • edited Loading

bclozel commented Feb 10, 2021 • edited Loading

jaredpetersen commented Feb 10, 2021 • edited Loading

bclozel commented Feb 10, 2021

jaredpetersen commented Feb 10, 2021

jaredpetersen commented Feb 27, 2021

wilkinsona commented Mar 2, 2021

rellikt commented Jul 13, 2021 • edited Loading

wilkinsona commented Jul 13, 2021

FinalFrag commented Jun 20, 2022

wilkinsona commented Jun 20, 2022

wimdeblauwe commented Jul 4, 2022

FinalFrag commented Jul 4, 2022

imosapatryk commented Apr 7, 2023 • edited Loading

wilkinsona commented Apr 11, 2023

imosapatryk commented Apr 11, 2023

wilkinsona commented Apr 12, 2023

ferrarimartin commented Aug 14, 2023

voliveirajr commented Aug 29, 2023

bclozel commented Aug 29, 2023

bclozel commented Aug 29, 2023

jaredpetersen commented Sep 25, 2023

sheetalj2205 commented Mar 21, 2024

sheetalj2205 commented Apr 3, 2024

jaredpetersen commented Feb 9, 2021 •

edited

Loading

bclozel commented Feb 10, 2021 •

edited

Loading

jaredpetersen commented Feb 10, 2021 •

edited

Loading

rellikt commented Jul 13, 2021 •

edited

Loading

imosapatryk commented Apr 7, 2023 •

edited

Loading