Vault 1.11 Multi-Issuer CA breaks Connect CA Intermediate CAs (<-> Vault Provider) #15217
Comments
Hi @exo-cedric, Thank you for reporting this. We'll have to look into the changes resulting from Vault 1.11's multi-issuer feature, as you described. Separately, you mentioned that the intermediate CA list grows indefinitely. That should be fixed as of 1.11.9, 1.12.5*, and 1.13.2 by PR #14429, which prunes expired certificates. From your observations, it sounds like you might have a growing list of not-yet-expired certificates (one added per hour?), so even if you were to upgrade from 1.12.3 to 1.12.6, you might still see a growing list (until some of the certs pass their expiration). *If you consider upgrading from 1.12.3 (to a later 1.12.x or 1.13.x), I strongly recommend you first review this guidance: https://developer.hashicorp.com/consul/docs/upgrading/upgrade-specific#modify-vault-policy-for-vault-ca-provider. For example, if you stay on 1.12.x, I would go straight to 1.12.6 (skip 1.12.5). |
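To illustrate why pruning expired certificates would not shrink a list of not-yet-expired ones, here is a minimal Python sketch. It is not the code from PR #14429; the function name `prune_expired` and the sample entries are hypothetical and only model the idea of dropping entries whose expiration has passed.

```python
from datetime import datetime, timezone


def prune_expired(certs, now):
    """Keep only the (name, not_after) entries that are still valid at `now`."""
    return [(name, not_after) for name, not_after in certs if not_after > now]


# Hypothetical sample data: one expired entry, two still-valid entries.
now = datetime(2022, 11, 1, tzinfo=timezone.utc)
chain = [
    ("ica-old", datetime(2022, 10, 1, tzinfo=timezone.utc)),  # already expired
    ("ica-a", datetime(2022, 11, 2, tzinfo=timezone.utc)),    # still valid
    ("ica-b", datetime(2022, 11, 3, tzinfo=timezone.utc)),    # still valid
]

kept = prune_expired(chain, now)
print([name for name, _ in kept])  # ['ica-a', 'ica-b']
```

Only the expired entry is dropped. Entries that are added repeatedly but have not yet expired all survive the pruning, which matches the caveat above.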
Thank you for the quick feedback.
The list is growing, on an hourly basis with the same IntermediateCert (the one that is erroneously being kept in use):
Thanks for the heads-up! (that was already on our radar... and applied as the first potential culprit for the case at hand :-) ) |
Hi @exo-cedric, Just to confirm, where/when are you seeing the following error?
Are you seeing that because the new leaf certificates generated after step 3 (>50% of intermediate CA cert TTL) have an ever-growing list of intermediate certs, so services in the mesh using those new leaf certificates begin to fail the TLS handshake when communicating with other services? |
I don't have a precise analysis of when/how exactly this error shows up, except it did at the same time we started experiencing the Typical log message is: |
@exo-cedric : We're actively looking into this to understand cause(s), potential workaround(s), and fix(es) we could make in Consul. We'll reach out if we have further questions as we go. Thank you for the detailed report! |
@exo-cedric: Are your Consul client agents using either auto-config or auto-encrypt? I'm wondering if you are just seeing the TLS handshake error messages on control plane traffic, and whether that's because Consul client agents are using certificates issued by the Connect CA (which only happens if using auto-config or auto-encrypt). |
Thanks for actively looking into this and your feedback. We're using |
Status update For anyone using Vault 1.11.0+ as Consul's Connect CA provider, we've published this knowledge base article with more details on the issue, including the recommended workaround. We've also added mentions of this known issue to relevant places in the Consul and Vault docs. The Consul team is working on fixes to be included in an upcoming Consul patch release. Refer to PR #15253. |
Awesome! Thanks to all parties involved for quickly addressing and fixing this issue. |
Latest status as of Dec 2: At this time, we recommend that multi-datacenter deployments wait until an upcoming patch release for a full fix. Consul 1.12.7, 1.13.4, and 1.14.2 were released on Nov 30 with a fix that resolves this issue in primary datacenters. An additional fix is still needed to resolve this issue in the secondary datacenters that exist in multi-datacenter deployments using WAN federation. For now, those affected should continue to refer to the knowledge base article with more details on the issue, including the recommended workaround. |
@exo-cedric : I updated the previous comment to reflect our latest understanding. For now, we recommend that multi-DC deployments use the workaround and wait until the next patch release before upgrading. |
Hello @jkirschner-hashicorp We've updated our preprod Consul servers to 1.13.4 and the default issuer appears to be updated as expected:
On the other hand, we don't see the "obsolete" issuers being cleaned up in the Vault PKI "store". I'm afraid this might lead to issues in the medium (to very long?) term; maybe something worth keeping on the radar too? PS: In retrospect, I now believe this to be a Vault (1.11+) issue; API behavior should not have changed in such a significant/breaking manner (?). |
My non-expert understanding is that Vault performance may be affected once the number of issuers in a PKI secrets engine approaches ~100+. It should take a while to reach that point assuming intermediate TTL isn't really low. Vault 1.13.0+ will include the For now, an operator could manually remove "obsolete" issuers if the number of obsolete issuers becomes problematic. We realize that's not ideal long-term. Once Vault 1.13 is released, perhaps Consul could set Alternatively, we could consider having Consul try to delete issuers itself, but that would require giving Consul additional privileges to do that in its Vault token ( What are your thoughts? |
Yes. I think this would be the best approach
Entirely agree this approach might not be desirable |
I can now confirm our 32-day Intermediate CA has been successfully rotated at 50% of its lifetime, without manual intervention, including the Vault default issuer:
As far as I'm concerned, this issue may be considered Solved |
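As a worked example of the "50% of its lifetime" rotation point mentioned in this thread, the following Python sketch computes when a 32-day intermediate CA would be due for rotation. The function name `rotation_due` and the issuance date are hypothetical, and the exact scheduling logic inside Consul may differ.

```python
from datetime import datetime, timedelta, timezone


def rotation_due(not_before, ttl, fraction=0.5):
    """Return the point in time at which `fraction` of the certificate TTL has elapsed."""
    return not_before + ttl * fraction


# Hypothetical issuance date; the 32-day TTL comes from the comment above.
not_before = datetime(2022, 12, 1, tzinfo=timezone.utc)
ttl = timedelta(days=32)

print(rotation_due(not_before, ttl))  # 2022-12-17 00:00:00+00:00
```

With a 32-day TTL the rotation point falls 16 days after issuance.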
Hi @exo-cedric, I'm glad to hear that. With both your confirmation and the merge/release of the fix for secondary datacenters (#15661), I'm marking this closed. The full fix (for both primary and secondary datacenters) is available in:
|
Overview of the Issue
The Multi-Issuer feature introduced in Vault 1.11 breaks Consul Connect CA (Vault Provider).
When the time comes to issue a new ICA (<-> IntermediateCertTTL), Consul successfully calls pki/.../root/sign-intermediate and pki/.../intermediate/set-signed but fails to pick up the new ICA, because it does not switch the default issuer.
Reproduction Steps
IntermediateCertTTL for the sake of the test (we stumbled on the issue using 168h)
"path":"pki/.../root/sign-intermediate/")
PUT https://vault:8200/v1/pki/.../sign/leaf-cert\nCode: 400. Errors:\n\n* cannot satisfy request, as TTL would result in notAfter 2022-11-03T16:21:21.775770518Z that is beyond the expiration of the CA certificate at 2022-11-02T14:56:33Z error message
(another tell-tale sign of the issue is the tls: handshake message of length 118637 bytes exceeds maximum of 65536 bytes error message, presumably because of the (nonsensically) ever-growing Intermediate CA list)
Consul info for both Client and Server
Consul Server and Agent: 1.12.3
Vault Server: 1.11.4
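The sign/leaf-cert error quoted in the reproduction steps comes down to a date comparison: the requested leaf expiry falls after the expiry of the CA that would sign it. The Python sketch below reproduces that comparison with the timestamps from the error message and the 72h leaf TTL mentioned in this report. The function name `leaf_fits_within_ca` is hypothetical, sub-second precision is dropped, and Vault's actual check may include details not modeled here.

```python
from datetime import datetime, timedelta, timezone


def leaf_fits_within_ca(issued_at, leaf_ttl, ca_not_after):
    """True if a leaf issued at `issued_at` would expire no later than the CA."""
    return issued_at + leaf_ttl <= ca_not_after


# Values taken from the error message (fractional seconds dropped).
ca_not_after = datetime(2022, 11, 2, 14, 56, 33, tzinfo=timezone.utc)
requested_not_after = datetime(2022, 11, 3, 16, 21, 21, tzinfo=timezone.utc)
leaf_ttl = timedelta(hours=72)

# The request time is inferred as the requested notAfter minus the leaf TTL.
issued_at = requested_not_after - leaf_ttl

print(issued_at.isoformat())                                    # 2022-10-31T16:21:21+00:00
print(leaf_fits_within_ca(issued_at, leaf_ttl, ca_not_after))   # False

# Latest moment at which a 72h leaf would still have fit under this CA.
print((ca_not_after - leaf_ttl).isoformat())                    # 2022-10-30T14:56:33+00:00
```

Under these assumptions, any 72h leaf requested after 2022-10-30T14:56:33Z would be rejected for as long as the stale intermediate CA stays in use.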
Logs etc.
Problem:
Mark the notAfter=Nov 2 14:56:33 2022 GMT; that Intermediate CA should have been renewed/rotated long ago (and fails to issue new 72h Leaf Certs)
Code base
I believe the issue lies in those portions of the code base:
https://github.com/hashicorp/consul/blob/v1.12.3/agent/connect/ca/provider_vault.go#L542-L574
consul/agent/connect/ca/provider_vault.go
Lines 457 to 471 in 2f005f2
https://github.com/hashicorp/consul/blob/v1.12.3/agent/connect/ca/provider_vault.go#L485-L515
More specifically in the ActivateIntermediate function, which ought to PUT pki/.../issuers/config with the ID of the new IntermediateCerts CA.
Validation
We have validated the above hypothesis by:
IntermediateCerts
)
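The missing step described under Code base, switching the default issuer after the new intermediate is imported, could look roughly like the Python sketch below. It only builds the HTTP request and does not send it. The mount name, issuer ID, and token are hypothetical placeholders. The request path assumes Vault's config/issuers endpoint with a `default` field, as listed in the Vault PKI API documentation; the report above writes the path as pki/.../issuers/config. This is not Consul's actual fix.

```python
import json
import urllib.request


def build_set_default_issuer_request(vault_addr, mount, issuer_id, token):
    """Build (but do not send) a request that sets the default issuer of a PKI mount."""
    # Assumed endpoint: <mount>/config/issuers with a "default" field.
    url = f"{vault_addr}/v1/{mount}/config/issuers"
    body = json.dumps({"default": issuer_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
    )


# Hypothetical values for illustration only.
req = build_set_default_issuer_request(
    vault_addr="https://vault:8200",
    mount="pki-intermediate",      # hypothetical mount name
    issuer_id="new-issuer-id",     # hypothetical issuer ID of the new intermediate
    token="example-token",         # hypothetical Vault token
)
print(req.get_method(), req.full_url)
# POST https://vault:8200/v1/pki-intermediate/config/issuers
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen`, with a token whose policy allows updating that path.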