A way to force or reset consul CA root during leadership failure scenario. #6375

Open
ericbrumfield opened this issue Aug 22, 2019 · 3 comments
Labels
theme/certificates Related to creating, distributing, and rotating certificates in Consul theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies theme/consul-vault Relating to Consul & Vault interactions type/enhancement Proposed improvement or new feature

Comments

@ericbrumfield

While evaluating Consul, we configured Consul Connect with the Vault CA provider and things worked well. At one point, however, we assumed we could empty Vault and that Consul would be able to set up or change the root CA that is baked into the Raft data. Once our test cluster was in this state, the server nodes came online and fell into a very fast loop, failing to establish leadership and repeating the following error:

consul: failed to establish leadership: stored CA root "06:e7:b6:ab:8f:93:c2:50:45:bf:b1:8c:b6:75:74:8f:52:dd:47:85" is not the active root (f1:40:88:39:b7:ef:39:7e:28:ed:4d:f7:89:45:22:5f:75:06:e2:4c)

After a lot of hunting through the docs and trying different ways to force a leader and get the certificate rolled or switched out, we ended up rebuilding the 3 server nodes to fix this. The lesson we took away is never to touch the Vault PKI mounts that the Consul Connect CA uses; otherwise the cluster gets into this state, and it doesn't seem like you can ever bring it back online. While it is stuck electing a leader, it doesn't seem you can even work with a server node to fix the CA cert or roll it over to a new one. It's actually quite easy to end up here: all one has to do is modify the PKI mount in Vault that the Consul Connect CA is configured to use.

Are there any plans for a way to force, expunge, or otherwise remove the root CA in Consul in a scenario like this, so that a leader can be elected and things can run again? Possibly a way to "re-bootstrap" the Consul CA bits?
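
For anyone scanning server logs for this condition, the leadership error quoted above carries both fingerprints, and they can be pulled out with a regular expression. This is a minimal sketch, assuming the message format matches the single line quoted in this issue (other Consul versions may word it differently); `parse_root_mismatch` is a hypothetical helper name, not part of Consul.

```python
import re

# Pattern modeled on the one log line quoted in this issue.
# Assumption: the wording is stable; other Consul versions may differ.
_ROOT_MISMATCH = re.compile(
    r'stored CA root "(?P<stored>[0-9a-f:]+)" is not the active root '
    r'\((?P<active>[0-9a-f:]+)\)'
)


def parse_root_mismatch(line):
    """Return (stored, active) fingerprints, or None if the line does not match."""
    match = _ROOT_MISMATCH.search(line)
    if match is None:
        return None
    return match.group("stored"), match.group("active")


# The log line from this issue, split only for readability.
line = (
    'consul: failed to establish leadership: stored CA root '
    '"06:e7:b6:ab:8f:93:c2:50:45:bf:b1:8c:b6:75:74:8f:52:dd:47:85" '
    'is not the active root '
    '(f1:40:88:39:b7:ef:39:7e:28:ed:4d:f7:89:45:22:5f:75:06:e2:4c)'
)

result = parse_root_mismatch(line)
if result is not None:
    stored, active = result
    print("stored root:", stored)
    print("active root:", active)
```

Having the two fingerprints side by side makes it easier to tell which root the Raft data still holds and which one the CA provider currently reports.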

@mkeeler mkeeler added this to the 1.6.x milestone Aug 22, 2019
@hanshasselberg hanshasselberg added the theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies label Aug 23, 2019
@stale

stale bot commented Oct 22, 2019

Hey there,
We wanted to check in on this request since it has been inactive for at least 60 days.
If you think this is still an important issue in the latest version of Consul
or its documentation please reply with a comment here which will cause it to stay open for investigation.
If there is still no activity on this issue for 30 more days, we will go ahead and close it.

Feel free to check out the community forum as well!
Thank you!

@stale stale bot added the waiting-reply Waiting on response from Original Poster or another individual in the thread label Oct 22, 2019
@rboyer rboyer added the type/enhancement Proposed improvement or new feature label Oct 22, 2019
@stale stale bot removed the waiting-reply Waiting on response from Original Poster or another individual in the thread label Oct 22, 2019
@hanshasselberg hanshasselberg modified the milestones: 1.6.x, 1.7.x Jan 13, 2020
@dnephin dnephin removed this from the 1.7.x milestone Jun 30, 2020
@jsosulska jsosulska added theme/certificates Related to creating, distributing, and rotating certificates in Consul theme/consul-vault Relating to Consul & Vault interactions labels Sep 29, 2020
@jstachowiak

jstachowiak commented Jun 14, 2021

This also affected us when we stopped ACL replication between datacenters by changing the primary_datacenter parameter. Consul was failing to establish leadership, complaining about the stored CA root. As a workaround we disabled Consul Connect, but this didn't resolve the underlying issue: when we enabled it again we saw frequent failed leadership elections, this time without the error message.

Consul v1.9.0
Revision a417fe510
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

@jstachowiak

I would probably consider this a bug: after you disable ACL replication, the built-in CA generates and stores a new root certificate, but the ActiveRootID still points to the primary root certificate. This causes frequent failed leadership elections, which make it impossible to trigger a rotation in the hope of updating the active root certificate.
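
The mismatch described above can be checked from outside by comparing `ActiveRootID` against the roots flagged `Active` in the response of Consul's `/v1/connect/ca/roots` endpoint. The sketch below works on an already-parsed response and makes no network call; whether the endpoint answers at all while the cluster has no leader is not established in this thread. `summarize_ca_roots` is a hypothetical helper, and the root IDs in the sample payload are made up for illustration.

```python
def summarize_ca_roots(payload):
    """Compare ActiveRootID with the roots whose Active flag is set.

    `payload` is the parsed JSON body of GET /v1/connect/ca/roots.
    """
    active_id = payload.get("ActiveRootID")
    roots = payload.get("Roots") or []
    flagged = [root["ID"] for root in roots if root.get("Active")]
    return {
        "active_root_id": active_id,
        "flagged_active": flagged,
        # Consistent only if exactly one root is flagged and it is the active ID.
        "consistent": flagged == [active_id],
    }


# Made-up payload: ActiveRootID points at one root while a different,
# newly generated root is the one flagged Active.
sample = {
    "ActiveRootID": "aa:aa",
    "Roots": [
        {"ID": "aa:aa", "Active": False},
        {"ID": "bb:bb", "Active": True},
    ],
}

summary = summarize_ca_roots(sample)
print(summary["consistent"], summary["active_root_id"], summary["flagged_active"])
```

A result with `consistent` set to `False` would match the situation in this comment, where a new root is stored but `ActiveRootID` still refers to the old one.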


7 participants