[QUERY] On Nature of the Spring Boot CosmosHealthIndicator, Aggressive Health Checks, and Control Plane Resource Constraints

**Query/Question**
I've been digging into an oddity where my application's Cosmos DB health checks time out sporadically, but on an interval. At random times of the day or night, health checks on many of my instances start timing out for a few seconds at a predictable rate of every 30-45 minutes. This happens across many instances, some of which are collocated and some of which aren't. Sometimes the issue resolves itself; sometimes I must restart the instances to stop the cycle. I can't reproduce it; it just pops up randomly. I've looked at other GitHub issues for the SDK around connection problems in general, and I have seen people's problems resolve when they move to a newer SDK version. I am actively working on getting my application to the latest JDK and SDK versions. For example, `TCPEndpointRediscovery` is not enabled by default in my current version of the SDK but is enabled in later versions, so there are a lot of performance gains to be had. In the meantime, however, this has sparked my interest because, for a while now, I have been seeing advisor recommendations in the Azure portal for the following, along with occasional 429 rate-limiting notices around metadata operations:

![Screenshot 2024-09-19 at 4 47 31 PM](https://github.com/user-attachments/assets/df8d0ec6-de38-4c24-9cf2-3de186807181)

These are the errors that seemingly pop up in bunches across instances:
![Screenshot 2024-09-23 at 8 09 07 AM](https://github.com/user-attachments/assets/1b13137e-f8f6-4dae-adcb-d8789b805d48)

![Screenshot 2024-09-23 at 8 16 39 AM](https://github.com/user-attachments/assets/55d39cbe-073f-4516-9409-97d76d42502b)

<br>

I have traced it back to the `CosmosHealthIndicator`, which fits: the timeouts I see match the 3-second timeout constant it uses.

**_Am I correct in assuming that the following code from the [CosmosHealthIndicator](https://github.com/Azure/azure-sdk-for-java/blob/695284b9ff4b1cd21d03e5dfea95bfe63db74458/sdk/spring/spring-cloud-azure-actuator/src/main/java/com/azure/spring/cloud/actuator/implementation/cosmos/CosmosHealthIndicator.java#L56) is, in fact, a metadata operation?_**

```java
CosmosDatabaseResponse response = this.cosmosAsyncClient.getDatabase(database)
    .read()
    .block(timeout);
```
<br>

I'm thinking that it has to be because I specifically provision throughput at the container level, and this operation does not know about my container.  **_So, if I am correct, is this not antithetical to the [performance recommendations for the Java SDK](https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/best-practice-java)?:_**

![Screenshot 2024-09-19 at 1 06 44 PM](https://github.com/user-attachments/assets/441119f5-95db-4dab-b481-4d87a4a5e374)

[Since there is a relatively strict constraint on throughput and requests for the control plane](https://learn.microsoft.com/en-us/azure/cosmos-db/concepts-limits#control-plane):
![Screenshot 2024-09-23 at 8 26 52 AM](https://github.com/user-attachments/assets/d7a3934b-c967-4bd9-81c4-d55d857e9161)


**_If that is also correct, what went into the decision to tie a metadata operation to an indicator that the actuator picks up and that we can assume will be invoked fairly regularly?_**

**_Additionally, since I have read that metadata operations primarily go through the gateway node, does that also hold for these health checks despite my application being set to Direct Connection mode?_** Again, I ask for performance reasons, since the recommendation states that we should use Direct Connection mode whenever possible.

Suppose all, or even some, of the above is correct. In that case, I'm wondering whether my application(s) would be better off rolling their own health checks, given how aggressively they are called. If I create a separate health check container in my DB with a single near-empty document, I can do an optimized point lookup (~1 RU) on it. That RU hit is minimal and lands on the manual throughput I provision at the container level, a considerably larger pool than that of the control plane. If I can get the document, I know the DB is healthy; otherwise, I know it's not. We currently have 3 to 4 different health check invocations that trigger this Cosmos indicator across internal and external load balancers, along with an internal metrics solution that checks the health of every single application instance (currently 30+, but eventually a couple hundred) every 15-20 seconds. Each of these invocations costs 2.0 RUs according to the output.
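To make the idea concrete, here is a minimal sketch of what I have in mind, not a tested implementation. It only shows the shape of the check: run a cheap probe under a hard timeout and map the outcome to UP or DOWN. The class name `PointReadHealthCheck` and the `Status` enum are hypothetical, and the probe is left as a plain `Callable` so the sketch has no SDK or Spring dependency. In my application I assume the probe would wrap a point read on the dedicated health check container, something like `container.readItem(id, new PartitionKey(pk), JsonNode.class).block()`, and the result would be exposed through a Spring Boot `HealthIndicator`.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs a cheap data-plane probe with a hard timeout. */
final class PointReadHealthCheck implements AutoCloseable {

    enum Status { UP, DOWN }

    // Assumption: in a real app this wraps a point read of one tiny document.
    private final Callable<Boolean> probe;
    private final Duration timeout;
    private final ExecutorService executor;

    PointReadHealthCheck(Callable<Boolean> probe, Duration timeout) {
        this.probe = probe;
        this.timeout = timeout;
        // Daemon thread so a stuck probe never blocks JVM shutdown.
        this.executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "cosmos-health-probe");
            t.setDaemon(true);
            return t;
        });
    }

    /** Returns UP only if the probe returns true within the timeout. */
    Status check() {
        Future<Boolean> result = executor.submit(probe);
        try {
            Boolean ok = result.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            return Boolean.TRUE.equals(ok) ? Status.UP : Status.DOWN;
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt the probe so it does not pile up
            return Status.DOWN;
        } catch (ExecutionException e) {
            return Status.DOWN;  // the probe threw, e.g. a failed read
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Status.DOWN;
        }
    }

    @Override
    public void close() {
        executor.shutdownNow();
    }
}
```

Usage would be along the lines of `new PointReadHealthCheck(() -> readHealthDoc(), Duration.ofSeconds(3)).check()`, where `readHealthDoc()` is a placeholder for the point read. The single-threaded executor is a deliberate simplification: probes run one at a time, so a slow probe delays the next one until it is cancelled.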

---

***Why is this not a Bug or a feature Request?***
It could be a bug; I'm not sure. I'm trying to discern whether my assumptions are correct and then, from there, figure out why this was implemented in this way. It could turn into a feature request depending on the answer.

**Setup (please complete the following information if applicable):**
 - DB Information:
   * Cosmos DB Core API
   * Client Information:
     * Using `CosmosAsync` client
     * Direct Connection Mode enabled by default
     * No other autoconfigs set except for enabling `populate-query-metrics`
 - Library/Libraries:
    * JDK11/Azure SDK for Java 3.20.0
    * Library Azure Cosmos DB (Core API); single write region, multiple read regions
    * `org.springframework.boot:spring-boot-starter-parent:2.6.6`
    * Managed dependencies:
      * `org.springframework.cloud:spring-cloud-dependencies:2021.0.1`
      * `com.azure.spring:spring-cloud-azure-dependencies:4.1.0`
    * `com.azure:azure-spring-data-cosmos`
    * `com.azure.spring:spring-cloud-azure-starter-actuator`
    * `com.azure.spring:spring-cloud-azure-starter-data-cosmos`

---

 **Information Checklist**
 Kindly make sure that you have added all the following information above and check off the required fields; otherwise, we will treat the issue as an incomplete report
- [X] Query Added
- [X] Setup information Added


[QUERY] On Nature of the Spring Boot CosmosHealthIndicator, Aggressive Health Checks, and Control Plane Resource Constraints #41980
