healthcheck token impersonation issue ("sunken tokens") #12145
Comments
Thank you for reporting this issue! As you noticed, the client agent syncs its state to the catalog. To perform this sync, the client agent uses the ACL token that was passed along with the service or check registration; if there is no token in the registration, the client agent falls back to its default token.

Currently, if there is any token and the sync fails, the agent keeps retrying with the registration token. Instead of retrying with the same token, we could retry the request with the agent token. That way, if the original registration token has been removed, the second attempt to sync the delete should succeed with the agent token.

That change would need to happen around here: https://github.com/hashicorp/consul/blob/v1.11.2/agent/local/state.go#L1278. Instead of only logging the problem, we could repeat the RPC request with a different token.
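A minimal sketch of what that fallback might look like, assuming a hypothetical `deregisterWithToken` helper that wraps the catalog deregistration RPC (the real code in `agent/local/state.go` is structured differently; this only illustrates the proposed retry):

```go
package local

import (
	"github.com/hashicorp/consul/acl"
	"github.com/hashicorp/consul/agent/structs"
)

// syncServiceDeletion is a hypothetical sketch of the proposed fallback,
// not the actual code path. deregisterWithToken stands in for the RPC
// that removes the service from the catalog.
func (l *State) syncServiceDeletion(id structs.ServiceID) error {
	// First attempt: use the token that came with the registration.
	err := l.deregisterWithToken(id, l.ServiceToken(id))
	if err != nil && acl.IsErrNotFound(err) {
		// The registration token may have been deleted along with the
		// registering party (a "sunken token"); retry the deregistration
		// with the agent token so the catalog can still be cleaned up.
		err = l.deregisterWithToken(id, l.tokens.AgentToken())
	}
	return err
}
```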
@dnephin a teleological analysis would warrant a …
An alternative workaround to clean up dead services, and the ongoing related logging errors caused by this issue, is to use the API to deregister the no-longer-alive service on the respective node, as in the sketch below.
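For example, a minimal sketch using the official Go API client (datacenter, node name, and service ID are placeholders for the dead instance):

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Reads the address and token from the usual CONSUL_HTTP_ADDR /
	// CONSUL_HTTP_TOKEN environment variables.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Deregister the dead instance directly from the catalog.
	// Node and ServiceID below are placeholders.
	_, err = client.Catalog().Deregister(&api.CatalogDeregistration{
		Datacenter: "dc1",
		Node:       "node-with-dead-service",
		ServiceID:  "patroni",
	}, nil)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("service deregistered from the catalog")
}
```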
Then also restart the local Consul agent on the affected machines, which will otherwise repeatedly log ACL RPC errors forever, e.g. as shown below.
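The recurring errors look roughly like the following (an illustrative line, not verbatim from the reporter's logs; Consul's anti-entropy loop logs "failed to sync remote state" when the sync RPC fails):

```
[ERROR] agent: failed to sync remote state: error="ACL not found"
```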
It would be great to see the solution suggested above added, @dnephin. Thanks!
Possibly relates to #11949
This initial error report aims to get feedback from the developers, to put me in a position where I can better pin down the error.
consul: v1.10.3
In a given cluster (ceteris paribus), we observe the following from within a Nomad allocation:

- The Nomad allocation obtains a token lease (`CONSUL_HTTP_TOKEN`) from `{{ with secret "consul/creds/patroni" }}` on startup.
- The local agent has marked the healthcheck as failing (`critical`) when the job stops; the server has not.

It appears as if the agent's healthcheck inherits the token lease from the initiating party, which defeats the purpose of healthcheck notifications if that token does not outlive the initiating party.
This is critical since an effectively dead service instance is never updated against the cluster, and hence the cluster effectively serves invalid DNS responses.