Skip to content
This repository has been archived by the owner on Aug 25, 2021. It is now read-only.

Enable TLS #313

Merged
merged 26 commits into from
Jan 10, 2020
Merged

Enable TLS #313

merged 26 commits into from
Jan 10, 2020

Conversation

ishustava
Copy link
Contributor

@ishustava ishustava commented Dec 13, 2019

This PR builds on the work done in #137 and proposes the following changes:

  • Optionally allows you to turn on TLS by setting global.tls.enabled to true.
  • Exposes the default HTTPS port 8501 on clients and servers if TLS enabled.
  • Optionally allows you to disable HTTP ports on clients and servers by setting global.tls.httpsOnly to true. This is set to true by default.
  • Adds TLS bootstrapping job that generates Consul cluster CA and server certificate.
  • Adds tls-init-cleanup job that deletes certs created by the tls-init job when the Helm chart is deleted.
  • Consul client certs are generated in an init container, so that we can add HOST_IP as a SAN to enable other clients, e.g. sync-catalog, to talk to it. This will be replaced by auto_encrypt in the future once it supports providing additional SANs to the client certificate.
  • Doesn't expose verify_* config options in the Helm chart as it was proposed in the original PR Support TLS encryption #137. Instead, it assumes secure defaults if global.tls.enabled is true.

Note 1: This currently doesn't support external CAs. It also doesn't work if servers aren't deployed on k8s. Once we support external CAs, we can also support various deployment configurations. Hence, if servers are disabled, tls-init will not run.

Note 2: We're not using mTLS for HTTP API. That is because in Consul's threat model we recommend using ACLs instead. To prevent clients from sending unencrypted RPC requests to the servers, we set verify_incoming_rpc to true.

Fixes #136

@ishustava ishustava requested a review from a team December 13, 2019 20:24
@ishustava ishustava added enhancement New feature or request theme/tls About running Consul with TLS labels Dec 13, 2019
@@ -52,6 +52,35 @@ global:
# consul-k8s v0.8.0+.
bootstrapACLs: false

# Enables TLS encryption across the cluster to verify authenticity of the
# servers and clients that connect. Note: It is HIGHLY recommended that you also
# enable Gossip encryption.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should really link to our own docs page on enabling gossip encryption... if we had one.

templates/client-daemonset.yaml Show resolved Hide resolved
-H "Authorization: Bearer $( cat /var/run/secrets/kubernetes.io/serviceaccount/token )" \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-d "{ \"kind\": \"Secret\", \"apiVersion\": \"v1\", \"metadata\": { \"name\": \"{{ template "consul.fullname" . }}-server-cert\", \"namespace\": \"${NAMESPACE}\" }, \"type\": \"kubernetes.io/tls\", \"data\": { \"tls.crt\": \"$( cat {{ .Values.global.datacenter }}-server-{{ .Values.global.domain }}-0.pem | base64 | tr -d '\n' )\", \"tls.key\": \"$( cat {{ .Values.global.datacenter }}-server-{{ .Values.global.domain }}-0-key.pem | base64 | tr -d '\n' )\" } }" > /dev/null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's one cert for all the servers right? So isn't this weird it's suffixed with -0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is what consul CLI does by default. It doesn't matter so much though because we're uploading only the contents here.

It's maybe weird for client certs. Each client cert has the -0 suffix, and so it could be a bit confusing. We could rename them, but with auto_encrypt we won't need this anymore.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha, yeah for servers this is only place we see -0 since the key is tls.crt so this is cool

@ishustava
Copy link
Contributor Author

This PR requires connectInject to support TLS before merging.

@lkysow lkysow self-assigned this Dec 16, 2019
Copy link
Member

@lkysow lkysow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to look at

  • UI
  • how it interacts with ACLs
  • how we might update the certs

but thought it best to add this feedback so far

CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated
@@ -11,6 +11,15 @@ IMPROVEMENTS:
false in your local values file. NOTE: If `connectInject.enabled` is false,
then central config is not enabled so this change will not affect you.

* Optionally allow enabling TLS for servers and clients [[GH-313](https://github.com/hashicorp/consul-helm/pull/313/files#)].

Note that consul-k8s components don't currently work with HTTPS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless you fix the injector you should update that Consul connect won't work at all

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix the injector before this PR gets merged.

resources:
- secrets
resourceNames:
- {{ template "consul.fullname" . }}-ca-cert
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🍻 for template

templates/client-daemonset.yaml Show resolved Hide resolved
templates/client-daemonset.yaml Show resolved Hide resolved
test/unit/client-daemonset.bats Outdated Show resolved Hide resolved
test/unit/client-daemonset.bats Outdated Show resolved Hide resolved
values.yaml Outdated Show resolved Hide resolved
values.yaml Outdated Show resolved Hide resolved
values.yaml Show resolved Hide resolved
@lkysow
Copy link
Member

lkysow commented Dec 17, 2019

Outstanding question from me: How do users rotate their certificates?

@ishustava ishustava force-pushed the enable-tls branch 4 times, most recently from c04993a to 12ea51b Compare December 23, 2019 17:47
@ishustava
Copy link
Contributor Author

ishustava commented Dec 23, 2019

@lkysow I've made the updates from your review and also added support for some consul-k8s components.
Changes made since your last review:

Outstanding work that will be submitted separately:

  • Bringing your own CA
  • Different deployment architectures, e.g. clients outside k8s or servers outside k8s

@tpetrychyn
Copy link

Testing this branch in one of my clusters and I ran in to an issue that may be related to these changes.

Upgrading from consul-helm version 0.8.1 results in:
Error creating: pods "consul-consul-tls-init-" is forbidden: error looking up service account sd/consul-consul-tls-init: serviceaccount "consul-consul-tls-init" not found

I opted to manually create a service account which allowed consul-consul-tls-init to report complete so I deleted it and re-ran helm upgrade but then consul-consul-server-2 would not initialize, giving the error: MountVolume.SetUp failed for volume "tls-server-cert" : secret "consul-consul-server-cert" not found

I finally just blew away my existing consul installation and ran this chart clean.

I will say that running it on a clean environment worked great and the Golang Hashicorp Consul api with TlsConfig plugged in seamlessly so thank you for the work on this! Will be watching closely for using my own CA.

@ishustava
Copy link
Contributor Author

@tpetrychyn thank you so much for trying it out and leaving feedback here. Really appreciate it!

It looks like I forgot to mark the tls-init-{service-account,clusterrole,clusterrolebinding} as a pre-upgrade hook. I've made the change, and it should work now.

@tpetrychyn
Copy link

Encountered another issue that I've been banging my head against for days..

Using commit 12ea51b I am now getting the error in consul-sync-catalog:

[GET /health/ready] Error getting leader status: Get https://10.137.132.250:8501/v1/status/leader: dial tcp 10.137.132.250:8501: i/o timeout

I am using Digital Ocean managed k8 v1.16.2-do.1, my CNI is cilium v1.6.4.

Using v0.8.1 of this chart works fine in this environment.

The oddest thing is 2 weeks I setup a completely clean DO cluster on 1.16.2-do.1 and this chart worked out of the box, but now even a clean cluster made today has that error..

P.S. commit ^6b659a2 breaks because consul-k8s 0.11.0 is not published on dockerhub

@ishustava
Copy link
Contributor Author

@tpetrychyn This PR requires a new version of consul-k8s for many components, such as Connect Inject and ACL bootstrapping, to work correctly. I optimistically updated the version. This PR, however, should not be merged without that docker image being published first. Sorry if that was confusing!

Have you tried deploying it with TLS disabled? Do you see the same error? At a first glance, it doesn't seem TLS related since the TLS handshake happens after the TCP connection is established.

@tpetrychyn
Copy link

Thanks for the clarification! Non-tls also has that issue. SettingexposeGossipPorts: true appears to fix it.. Does that provide any insight?

# Note: remember to switch it back to true once the rollout is complete.
# Please see this guide for more details:
# https://learn.hashicorp.com/consul/security-networking/certificates
verify: true
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lkysow this flag is for incremental TLS rollout

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@lkysow lkysow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! A couple things came up. Here's what I've tested so far.

  • review new verify: setting
  • components (with httpsonly)
    • ACL bootstrapping
    • connect
    • sync catalog
    • ent license
    • snapshot
    • mesh gateway (just that they came up)
    • consul UI
    • consul DNS
  • pod security polices (does not work)
  • gradual deployment
  • upgrades

CHANGELOG.md Show resolved Hide resolved
@@ -232,6 +280,38 @@ spec:
- name: aclconfig
mountPath: /consul/aclconfig
{{- end }}
{{- if .Values.global.tls.enabled }}
- name: client-tls-init
image: {{ .Values.global.image }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it is very unlikely to make a difference but I think we should use

image: "{{ default .Values.global.image .Values.client.image }}"

here to keep it consistent with the image for the main container.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point! thanks for calling this out. I'll update.

-ca-file=/consul/tls/ca/tls.crt \
-http-addr="https://${HOST_IP}:8501"
{{- else }}
-http-addr="${HOST_IP}:8500"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you removed http:// from this. Makes sense to keep it I think so it's clear.

@@ -79,8 +83,13 @@ spec:
{{- if .Values.global.bootstrapACLs}}
-config-dir=/consul/aclconfig \
{{- end }}
-http-addr=http://${HOST_IP}:8500
{{- if (or .Values.global.bootstrapACLs (and .Values.client.snapshotAgent.configSecret.secretName .Values.client.snapshotAgent.configSecret.secretKey) ) }}
{{- if .Values.global.tls.enabled }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: in other places we're setting the CONSUL_HTTP_ADDR and CONSUL_CACERT env vars instead of using flags

resources: ["podsecuritypolicies"]
resourceNames:
- {{ template "consul.fullname" . }}-server
verbs:
- use
{{- else }}
{{- else}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😭

CHANGELOG.md Outdated
@@ -1,5 +1,15 @@
## Unreleased

IMPROVEMENTS:

* Optionally allow enabling TLS [[GH-313](https://github.com/hashicorp/consul-helm/pull/313/files#)].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Optionally allow enabling TLS [[GH-313](https://github.com/hashicorp/consul-helm/pull/313/files#)].
* Optionally allow enabling TLS [[GH-313](https://github.com/hashicorp/consul-helm/pull/313)] for Consul communication.

@@ -10,13 +10,13 @@ metadata:
release: {{ .Release.Name }}
{{- if .Values.global.enablePodSecurityPolicies }}
rules:
- apiGroups: ["policy"]
- apiGroups: ["policy"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this breaks this template. Either unindent or indent everything nested below

{{- end }}
-dc={{ .Values.global.datacenter }} \
-domain={{ .Values.global.domain }}
curl -s -X POST --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment here these POSTs this won't replace any existing secrets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -32,6 +32,12 @@ spec:
spec:
restartPolicy: Never
serviceAccountName: {{ template "consul.fullname" . }}-server-acl-init
{{- if .Values.global.tls.enabled }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs pod security policy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the PSP for this job you've created in #326, I don't think anything will need to change once that's merged. This PR only adds secrets, and the PSP already allows attaching secret volumes. But let me know if I misunderstood you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we're good now with my PR. That was just a reminder that we needed to do it for acl init.

@@ -0,0 +1,52 @@
# tls-init-cleanup job deletes Kubernetes secrets created by tls-init
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs pod security policy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@@ -166,17 +207,17 @@ spec:
- name: aclconfig
mountPath: /consul/aclconfig
{{- end }}
lifecycle:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not something that needs to be changed but we might be adding this back as consul force-leave -prune due to hashicorp/consul#6897 (comment)

Todd Radel and others added 4 commits January 8, 2020 13:43
This commit exposes TLS configuration in the Helm chart and
makes the following changes:

 * Optionally allows you to turn on TLS by setting global.tls.enabled
   to true.
 * Exposes the default HTTPS port 8501 on clients and servers
   if TLS enabled.
 * Optionally allows you to disable HTTP ports on clients and servers
   by setting global.tls.httpsOnly to true.
   This is recommended, however, disabling HTTP will break existing
   consul-k8s components, which currently do not support TLS.
 * Adds TLS bootstrapping job that generates Consul cluster CA,
   as well as server and client certificates signed by that CA.
 * Exposes UI service on port 443 if TLS is enabled.
* Rename tls-secret job to tls-init.
* Ensure principle of least privilege for the tls-init-clusterrole
* Set httpsOnly to false by default
* Client certs are generated in an init container, so that we can add
  HOST_IP as a SAN to enable other clients, e.g. sync-catalog, to talk to it
* Update pod security policies to account for HTTPS ports exposed as
  as hostPort if TLS is enabled
* Don't expose various verify_* config options in the Helm chart and,
  instead, assume secure defaults if global.tls.enabled is true.
* Ensure only the CA certificate file is present on the filesystem
  of the server and client containers.
* Don't run tls-init job if servers are disabled
* Add tls-init-cleanup job that deletes certs created by the tls-init
  job when the Helm chart is deleted.
* Add missing tests.
* Use Consul CLI to set enterprise license
* Don't generate CLI clients certs because we are not verifying incoming
  for HTTP requests (ACLs are recommended).
* Remove Helm hooks annotations for RBAC resources.
  They are not needed because resources are already managed
  correctly by Helm.
* separate CA cert and key into their own secrets
* remove duplicate/redundant tests
* increase server certificate validity to 2 years
* add server service DNS names as SANs to the server certificate
* set env variables for consul server and clients pods to be able
  to easily talk to consul when 'exec'ing into the container
* fix graceful termination for the servers (previously
  terminationGracePeriod was set to 10sec, which wasn't enough times
  for the servers to terminate).
* Add pod security policies
* use http everywhere explicitly for CONSUL_HTTP_ADDR
* Fix templating issue in server-clusterrole.yaml
@ishustava
Copy link
Contributor Author

@lkysow I've made the changes you requested.

Copy link
Member

@lkysow lkysow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔐 🚀
Be sure to do a release of consul-k8s first

@ishustava ishustava merged commit c979a7f into master Jan 10, 2020
@ishustava ishustava deleted the enable-tls branch January 10, 2020 22:47
@ishustava ishustava mentioned this pull request Jan 11, 2020
@tehmoon tehmoon mentioned this pull request Jan 14, 2020
{{- if .Values.connectInject.centralConfig.enabled }}
-enable-central-config=true \
{{- end }}
{{- if (and .Values.connectInject.centralConfig.enabled .Values.connectInject.centralConfig.defaultProtocol) }}
-default-protocol="{{ .Values.connectInject.centralConfig.defaultProtocol }}" \
{{- end }}
{{- if .Values.connectInject.certs.secretName }}
{{- if .Values.connectInject.certs.secretName }}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ishustava does the generated envoy config file include the TLS context (cert/key)? I see that the CA is added inline.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The envoy TLS context for the consul client (aka local_agent) doesn't need cert/key since it's not using mTLS client auth to talk to Consul. That's why it only has the CA cert.

If your question is about authentication, we recommend using ACLs for production environments for that purpose, and when they are enabled in the Helm chart, Envoy will include an ACL token when talking to Consul.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I was more curious about mTLS for service to service (via envoy) communication.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. mTLS for service to service is configured dynamically as you add and remove services by the Consul client agent. These docs talk a bit about that.

Copy link

@fouadsemaan fouadsemaan May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you referring to the dynamically generated envoy-bootstrap.yaml file via the consul connect envoy -bootrap command? Is that where the cert/key should appear?

Specifically, do I expect something like this in the generated file:

tls_context:
common_tls_context:
tls_certificates:
- certificate_chain:
filename: "/etc/example-com.crt"
private_key:
filename: "/etc/example-com.key"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that config is used only for bootstrapping envoy itself. Consul updates envoy config for services dynamically through the API, so you won't find them in a file. You could look at the config though through the envoy admin endpoint. All you need to do is this:

kubectl port-forward po/<your pod that has envoy sidecar container> 19000

Then go to the browser at http://localhost:19000. There you will see an option to dump the config. Alternatively, you could just curl the endpoint to get the config curl http://localhost:19000/config_dump

You could do other things through that endpoint too, like change log levels. Here are the docs for the admin endpoint in case you'll find them useful.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic! I appreciate your help. Thank you! This is awesome!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request theme/tls About running Consul with TLS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support TLS encryption
4 participants