
Healthcheck for zero ingress connection count #3719


Merged: 6 commits, Mar 12, 2025

@yacovm yacovm commented Feb 13, 2025

Why this should be merged

This PR adds a new health check that detects when the node is a primary network validator and no other node is connecting to it.

Our current health checks only monitor whether the node is connected to enough stake. However, a node can satisfy that check by opening egress connections to all other validators while remaining unreachable from every validator and non-validator node.

This health check covers that case. It alerts when all of the following hold:

  1. The node is a primary network validator.
  2. The node has zero ingress connections to it.
  3. More than 10 minutes (configurable) passed since its startup.

How this works

A node can establish a connection through two pathways:

  1. It initiates a connection to a remote node.
  2. A remote node initiates a connection to it.

We count the number of connections of the latter type via an atomic counter, and add a health check that checks the three conditions mentioned earlier.
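
As a rough sketch of the counting idea, an atomic counter can be incremented when a remote node's connection is accepted and read by the health check. The `ingressCounter` type and its method names are hypothetical, and decrementing on disconnect is an assumption about how the count is kept current; the PR's actual bookkeeping may differ.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// ingressCounter is a hypothetical tracker for connections initiated by
// remote nodes. atomic.Int64 requires Go 1.19 or later.
type ingressCounter struct {
	count atomic.Int64
}

// onAccepted is called when a remote node connects to us.
func (c *ingressCounter) onAccepted() { c.count.Add(1) }

// onDisconnected is called when such a connection closes (assumed behavior).
func (c *ingressCounter) onDisconnected() { c.count.Add(-1) }

// current returns the number of live ingress connections.
func (c *ingressCounter) current() int64 { return c.count.Load() }

func main() {
	var c ingressCounter
	var wg sync.WaitGroup

	// Simulate 100 remote nodes connecting concurrently.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.onAccepted()
		}()
	}
	wg.Wait()
	fmt.Println(c.current()) // 100

	// Simulate every inbound connection being dropped.
	for i := 0; i < 100; i++ {
		c.onDisconnected()
	}
	fmt.Println(c.current()) // 0
}
```

An atomic counter avoids taking a lock on the connection-accept path, which is why it suits a value that many goroutines update and a periodic health check only reads.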

How this was tested

Added unit tests and also did a manual test:

I launched a validator node on Fuji (testnet) and added it to the validator set:

[03-03|17:34:09.380] INFO <P Chain> validators/logger.go:50 node added to validator set {"subnetID": "11111111111111111111111111111111LpoYY", "nodeID": "NodeID-49DuEYJ8Kub8pdHz6yucyVqKBhEw5urMZ", "publicKey": "0x80f073c922fbfe9fe8ee22a5c4d47f8b1c30d6c95574d7f5fb9d7d673f5170141751d51b2c9e6ca0a1ca2f2acde2477b", "txID": "2XuGeB5LeKcrR5N2yjPvWMJj19NVE7kTiLk6gzU437sL9qjDZN", "weight": 2991109255}

~/avalanchego$ curl -s  -H 'Content-Type: application/json' --data '{
    "jsonrpc":"2.0",
    "id"     :1,
    "method" :"health.health",
    "params": {
        "tags": ["11111111111111111111111111111111LpoYY", "29uVeLPJB1eQJkzRemU8g8wZDw5uJRqpab5U2mX9euieVwiEbL"]
    }
}' 'http://localhost:9650/ext/health' | jq '.result.checks.network.message["primary network validator health"]'
{
  "ingressConnectionCount": 1708,
  "primary network validator": true
}


I then caused all other nodes to disconnect from it by blocking port 9651:

sudo iptables -I INPUT -p tcp --dport 9651 -j DROP

After a short time, the health check started failing:

[03-03|19:08:12.853] INFO health/worker.go:261 check started passing {"name": "health", "name": "bootstrapped", "tags": ["application"]}
[03-03|19:08:12.853] INFO health/worker.go:261 check started passing {"name": "readiness", "name": "bootstrapped", "tags": ["application"]}
[03-03|19:10:42.853] WARN health/worker.go:252 check started failing {"name": "health", "name": "network", "tags": ["application"], "error": "network layer is unhealthy reason: primary network validator is unreachable"}

and probing the health check API showed it:

~/avalanchego$ curl -s  -H 'Content-Type: application/json' --data '{
    "jsonrpc":"2.0",
    "id"     :1,
    "method" :"health.health",
    "params": {
        "tags": ["11111111111111111111111111111111LpoYY", "29uVeLPJB1eQJkzRemU8g8wZDw5uJRqpab5U2mX9euieVwiEbL"]
    }
}' 'http://localhost:9650/ext/health' | jq '.result.checks.network'
{
  "message": {
    "connectedPeers": 227,
    "primary network validator health": {
      "ingressConnectionCount": 0,
      "primary network validator": true
    },
    "sendFailRate": 0,
    "timeSinceLastMsgReceived": "853.896432ms",
    "timeSinceLastMsgSent": "853.896432ms"
  },
  "error": "network layer is unhealthy reason: primary network validator is unreachable",
  "timestamp": "2025-03-03T19:11:12.853915018Z",
  "duration": 25832,
  "contiguousFailures": 2,
  "timeOfFirstFailure": "2025-03-03T19:10:42.853036554Z"
}
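
For monitoring tools that consume this response programmatically, the network check shown above can be decoded with Go's standard `encoding/json` package. This is only a sketch: the struct and function names are hypothetical, the JSON keys are copied from the output above, and the sample is a trimmed copy of that output rather than a live response.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// validatorHealth mirrors the "primary network validator health" object.
type validatorHealth struct {
	IngressConnectionCount  int  `json:"ingressConnectionCount"`
	PrimaryNetworkValidator bool `json:"primary network validator"`
}

// networkCheck mirrors the subset of `.result.checks.network` used here.
type networkCheck struct {
	Message struct {
		ConnectedPeers  int             `json:"connectedPeers"`
		ValidatorHealth validatorHealth `json:"primary network validator health"`
	} `json:"message"`
	Error              string `json:"error"`
	ContiguousFailures int    `json:"contiguousFailures"`
}

// sampleResponse is a trimmed copy of the jq output shown above.
const sampleResponse = `{
  "message": {
    "connectedPeers": 227,
    "primary network validator health": {
      "ingressConnectionCount": 0,
      "primary network validator": true
    }
  },
  "error": "network layer is unhealthy reason: primary network validator is unreachable",
  "contiguousFailures": 2
}`

// parseNetworkCheck decodes the raw JSON of the network check.
func parseNetworkCheck(raw []byte) (networkCheck, error) {
	var c networkCheck
	err := json.Unmarshal(raw, &c)
	return c, err
}

func main() {
	c, err := parseNetworkCheck([]byte(sampleResponse))
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	v := c.Message.ValidatorHealth
	if v.PrimaryNetworkValidator && v.IngressConnectionCount == 0 {
		fmt.Printf("validator unreachable: %d peers connected, %d contiguous failures\n",
			c.Message.ConnectedPeers, c.ContiguousFailures)
	}
}
```

The sample also shows why the existing stake-based check is not enough: the node still reports 227 connected peers while its ingress connection count is zero.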


I then restored connectivity by deleting the iptables rule:

sudo iptables -D INPUT -p tcp --dport 9651 -j DROP

and observed that the health check recovered:

[03-03|19:12:12.853] INFO health/worker.go:261 check started passing {"name": "health", "name": "network", "tags": ["application"]}
~/avalanchego$ curl -s  -H 'Content-Type: application/json' --data '{
    "jsonrpc":"2.0",
    "id"     :1,
    "method" :"health.health",
    "params": {
        "tags": ["11111111111111111111111111111111LpoYY", "29uVeLPJB1eQJkzRemU8g8wZDw5uJRqpab5U2mX9euieVwiEbL"]
    }
}' 'http://localhost:9650/ext/health' | jq '.result.healthy'
true

Need to be documented in RELEASES.md?

Added a section about the new health check.

yacovm marked this pull request as draft on February 13, 2025 20:00
yacovm force-pushed the alertIfNoConnections branch 13 times, most recently from a5a21ff to b8ad75a, on February 18, 2025 15:42
yacovm marked this pull request as ready for review on February 18, 2025 16:26
yacovm force-pushed the alertIfNoConnections branch from de6b79f to d7fdf19 on February 18, 2025 16:32
yacovm changed the title from "alert if validator connections drop quickly" to "Healthcheck for zero ingress connection count" on Feb 18, 2025

@tsachiherman tsachiherman left a comment

Overall, looks great. I've added a few comments.


yacovm commented Feb 18, 2025

> Overall, looks great. I've added a few comments.

Thanks, addressed/responded to your comments.


tsachiherman commented Feb 18, 2025 via email

yacovm force-pushed the alertIfNoConnections branch from cd6d014 to a52673e on February 27, 2025 00:50
yacovm force-pushed the alertIfNoConnections branch from d820099 to 2005233 on March 2, 2025 20:20
yacovm marked this pull request as draft on March 2, 2025 20:20

yacovm commented Mar 2, 2025

Parking this as a draft until I manually test it again after this code change.

yacovm force-pushed the alertIfNoConnections branch 11 times, most recently from 4169d09 to 1cf5056, on March 3, 2025 19:15
yacovm marked this pull request as ready for review on March 3, 2025 19:24

yacovm commented Mar 3, 2025

@StephenButtolph the PR is ready for re-review.

yacovm force-pushed the alertIfNoConnections branch from ff2344e to d1b4d92 on March 10, 2025 18:33
yacovm added 5 commits March 12, 2025 00:08
This commit adds a healthcheck that fails if:

- The node is a validator of the primary network
- The node has zero ingress connections
- Enough time (defaults to 20 min) has passed since the node finished bootstrapping

Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
yacovm force-pushed the alertIfNoConnections branch from cf8c3d6 to 9c9ca56 on March 11, 2025 23:08
Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
@StephenButtolph StephenButtolph added this pull request to the merge queue Mar 12, 2025
Merged via the queue into ava-labs:master with commit a83e692 Mar 12, 2025
23 checks passed
cam-schultz pushed a commit that referenced this pull request Mar 24, 2025
Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
5 participants