Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

healthcheck token impersonation issue ("sunken tokens") #12145

Open
blaggacao opened this issue Jan 20, 2022 · 5 comments · May be fixed by #12560
Open

healthcheck token impersonation issue ("sunken tokens") #12145

blaggacao opened this issue Jan 20, 2022 · 5 comments · May be fixed by #12560
Labels
theme/consul-nomad Consul & Nomad shared usability theme/health-checks Health Check functionality theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner type/enhancement Proposed improvement or new feature

Comments

@blaggacao
Copy link

blaggacao commented Jan 20, 2022

This initial error report aims at getting feedback from developers to put me into a position where I can better pin down the error.

consul: v1.10.3

In a given cluster (ceteris paribus), we have the following commands run from within a nomad allocation:

  • A
curl -XPUT $CONSUL_HTTP_ADDR/v1/agent/service/register \
-H 'Content-Type: application/json' -d '{"name":"testnet-staging-database","id": "testnet-staging-database/pg-2", "address": "10.32.153.145", "port": 5432, "check": { "interval": "5s", "timeout": "2s", "DeregisterCriticalServiceAfter": "150.0s", "http": "https://10.32.153.145:8008/replica"}, "tags": [ "replica" ], "enable_tag_override": true }'
  • B
curl -XPUT $CONSUL_HTTP_ADDR/v1/agent/service/register?token=$CONSUL_HTTP_TOKEN \
-H 'Content-Type: application/json' -d '{"name":"testnet-staging-database","id": "testnet-staging-database/pg-2", "address": "10.32.153.145", "port": 5432, "check": { "interval": "5s", "timeout": "2s", "DeregisterCriticalServiceAfter": "150.0s", "http": "https://10.32.153.145:8008/replica"}, "tags": [ "replica" ], "enable_tag_override": true }'

The nomad allocation recovers a token lease (CONSUL_HTTP_TOKEN) from {{ with secret "consul/creds/patroni" }} on startup.

  • A produces an agent state where the agent can update the server health-check at all times
  • B produces an agent state where the agent can NOT update the server health-check at all times
    • Specifically, the consul-agent cannot update the servers any more after the token lease has been expired.
    • Nomad appears to be managing the token lease lifecycle, that is, freeing the lease when the job is stopped.
    • Note!!: agent has correct state (health-check: critical) when job stops. Server has not.

It appears as if the agent's healthcheck would inherit the token lease from the initiating party which breaks the purpose of healthchecks notification if that token does not outlive the initiating party.

This is critical since an effectively dead service instance is never updated againts the cluster and hence the cluster effectively serves invalid DNS responses.

@Amier3 Amier3 added theme/health-checks Health Check functionality type/bug Feature does not function as expected theme/consul-nomad Consul & Nomad shared usability labels Jan 21, 2022
@dnephin
Copy link
Contributor

dnephin commented Jan 24, 2022

Thank you for reporting this issue!

As you noticed, the client agent will sync the state to the catalog. To perform this sync, the client agent uses the ACL token that was passed along with the service or check registration.

If there is no token in the registration the client agent will use the acl.tokens.agent token to do the deregistration.

Currently if there is any token, and the sync fails, it will keep retrying with the registration token. Instead of retrying with the same token, we could retry the request with the agent token. That way if the original registration token is removed, the second attempt to sync the delete should succeed with the agent token.

That change would need to happen around here: https://github.com/hashicorp/consul/blob/v1.11.2/agent/local/state.go#L1278

Instead of only logging the problem, we could repeat the RPC request with a different token.

@dnephin dnephin added type/enhancement Proposed improvement or new feature and removed type/bug Feature does not function as expected labels Jan 24, 2022
@blaggacao
Copy link
Author

@dnephin a theleological analysis would warrant a type/bug, though.

@blaggacao
Copy link
Author

blaggacao commented Mar 11, 2022

The moment I ask myself: "now what?"

image


The answer is:

  1. log into the client host (ip to the right) -- identify ip's that are no more scheduled in nomad
    image

  2. re-register with a "outliving" token scope:

srvname="infra-database"
suffix="pg-2"
ip="10.32.153.145"
tag="replica"
curl -XPUT $CONSUL_HTTP_ADDR/v1/agent/service/register \
-H 'Content-Type: application/json' -d '{"name":"$srvname","id": "$srvname/$suffix", "address": "$ip", "port": 5432, "check": { "interval": "5s", "timeout": "2s", "DeregisterCriticalServiceAfter": "10.0s", "http": "https://$ip:8008/$tag"}, "tags": [ "$tag" ], "enable_tag_override": true }'

@jkirschner-hashicorp jkirschner-hashicorp added the theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner label Mar 11, 2022
@johnalotoski
Copy link

Alternative workaround to clean up dead services and on-going related logging errors due to this issue is to use the API to de-register the no-longer-alive service on the respective node:

curl -v -XPUT \
  -H 'Content-Type: application/json' \
  -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
  -d "{\"Datacenter\":\"$DATACENTER\",\"Node\":\"$NODE\",\"ServiceID\":\"$SERVICEID\"}" \
  $CONSUL_HTTP_ADDR/v1/catalog/deregister

And then also restart the consul local agent on machines which will otherwise repeatedly log ACL RPC errors, forever, ex:

2022-03-14T16:13:41.389Z [ERROR] agent.client: RPC failed to server: method=Catalog.Deregister server=$IP:8300 error="rpc error making call: rpc error making call: ACL not found"
2022-03-14T16:13:41.389Z [WARN]  agent: Service deregistration blocked by ACLs: service=$SERVICE/$TAG accessorID=

<...repeated...>

Would be great to see the solution suggested above added @dnephin. Thanks!

@kisunji
Copy link
Contributor

kisunji commented Mar 15, 2022

possibly relates to #11949

@blaggacao blaggacao linked a pull request Mar 15, 2022 that will close this issue
@blaggacao blaggacao changed the title healthcheck token impersonation issue. healthcheck token impersonation issue ("sunken tokens") Mar 15, 2022
blaggacao added a commit to blaggacao/consul that referenced this issue Mar 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/consul-nomad Consul & Nomad shared usability theme/health-checks Health Check functionality theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner type/enhancement Proposed improvement or new feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants