[QUERY] On Nature of the Spring Boot CosmosHealthIndicator, Aggressive Health Checks, and Control Plane Resource Constraints #41980

Open
@MapFilterMagic

Description

Query/Question
I've been digging into an oddity: my application's Cosmos DB health checks time out sporadically, but on an interval. At random times of the day or night, health checks on many of my instances start timing out for a few seconds at a predictable rate of every 30-45 minutes. The affected instances are a mix of collocated and non-collocated ones. Sometimes the issue resolves itself; sometimes I must restart the instances to stop the cycle. I can't reproduce it; it just pops up randomly. I've looked at other GitHub issues for the SDK around connection issues in general, and I have seen people's problems resolve when they move to a newer SDK version. I am actively working on getting my application to the latest JDK and SDK versions. For example, TCPEndpointRediscovery is not enabled by default in my current SDK version but is enabled in later versions, so there are a lot of performance gains to be had. In the meantime, this has sparked my interest because, for a while now, I have been seeing the following advisor recommendations in the Azure portal, along with occasional 429 rate-limiting notices around metadata operations:

Screenshot 2024-09-19 at 4 47 31 PM

These are the errors that seem to pop up in bunches across instances:
Screenshot 2024-09-23 at 8 09 07 AM

Screenshot 2024-09-23 at 8 16 39 AM


I have traced it back to the CosmosHealthIndicator, whose timeout constant of 3 seconds matches the timeouts I am seeing.

Am I correct in assuming that the following code from the CosmosHealthIndicator is, in fact, a metadata operation?

CosmosDatabaseResponse response = this.cosmosAsyncClient.getDatabase(database)
    .read()
    .block(timeout);

I'm thinking that it has to be because I specifically provision throughput at the container level, and this operation does not know about my container. So, if I am correct, is this not antithetical to the performance recommendations for the Java SDK?:

Screenshot 2024-09-19 at 1 06 44 PM

Since there is a relatively strict constraint on throughput and requests for the control plane:
Screenshot 2024-09-23 at 8 26 52 AM

If that is also correct, what went into the decision to tie a metadata operation to an indicator that is picked up by the actuator, and that we can therefore assume will be invoked fairly regularly?

Additionally, I have read that metadata operations primarily go through the gateway node. Does that also hold for these health checks even though my application is set to Direct Connection mode? I'm asking again for performance reasons, since the recommendation states that we should use Direct Connection mode whenever possible.

Suppose all or even some of the above is correct. In that case, I'm wondering whether my application(s) would be better off rolling their own health checks, given how aggressively they are called. If I create a separate health check container in my DB with a single near-empty document, I can do an optimized point lookup (~1 RU) on it. That is a minimal RU hit against the manual throughput I provision at the container level, which is a considerably larger pool than that of the control plane. If I can get the document, I know the DB is healthy; otherwise, I know it's not. We currently have 3 to 4 different health check invocations that trigger this Cosmos indicator across internal and external load balancers, along with an internal metrics solution that checks the health of every single application instance (currently 30+, but eventually a couple hundred instances) every 15-20 seconds. Each one of these invocations costs 2.0 RUs according to the output.
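To put rough numbers on that load, here is a back-of-the-envelope sketch. It assumes that every checker polls every instance once per interval, that there are 4 checkers (the upper end of "3 to 4"), that the interval is 15 seconds for all of them, and that "a couple hundred" means 200 instances. The class and method names are hypothetical; only the 2.0 RU cost per check comes from the output mentioned above.

```java
class HealthCheckLoad {

    // Health check requests per second reaching Cosmos DB, assuming every
    // checker polls every instance exactly once per interval.
    static double requestsPerSecond(int instances, int checkersPerInstance, double intervalSeconds) {
        return instances * checkersPerInstance / intervalSeconds;
    }

    // Request units consumed per second at a given request rate.
    static double requestUnitsPerSecond(double requestsPerSecond, double ruPerCheck) {
        return requestsPerSecond * ruPerCheck;
    }

    public static void main(String[] args) {
        double ruPerCheck = 2.0;        // reported cost of one indicator call
        int checkers = 4;               // assumption: upper end of "3 to 4"
        double intervalSeconds = 15.0;  // assumption: fastest polling interval, applied to all checkers

        // 30 is the current fleet size; 200 is an assumed future size.
        for (int instances : new int[] {30, 200}) {
            double rps = requestsPerSecond(instances, checkers, intervalSeconds);
            double rus = requestUnitsPerSecond(rps, ruPerCheck);
            System.out.printf("%d instances: %.1f checks/s, %.1f RU/s%n", instances, rps, rus);
        }
    }
}
```

Under those assumptions, 30 instances produce about 8 checks per second (16 RU/s), and 200 instances produce roughly 53 checks per second (about 107 RU/s), all of it against whatever limit applies to metadata requests.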


Why is this not a Bug or a feature Request?
It could be a bug; I'm not sure. I'm trying to discern whether my assumptions are correct and then, from there, figure out why this was implemented in this way. It could turn into a feature request depending on the answer.

Setup (please complete the following information if applicable):

  • DB Information:
    • Cosmos DB Core API
    • Client Information:
      • Using CosmosAsync client
      • Direct Connection Mode enabled by default
      • No other autoconfigs set except for enabling populate-query-metrics
  • Library/Libraries:
    • JDK11/Azure SDK for Java 3.20.0
    • Library Azure Cosmos DB (Core API); single write region, multiple read region
    • org.springframework.boot:spring-boot-starter-parent:2.6.6
    • Managed dependencies:
      • org.springframework.cloud:spring-cloud-dependencies:2021.0.1
      • com.azure.spring:spring-cloud-azure-dependencies:4.1.0
    • com.azure:azure-spring-data-cosmos
    • com.azure.spring:spring-cloud-azure-starter-actuator
    • com.azure.spring:spring-cloud-azure-starter-data-cosmos

Information Checklist
Kindly make sure that you have added all of the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.

  • Query Added
  • Setup information Added
