Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add webhook with cert-manager recipe #2100

Closed
wants to merge 5 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
34 changes: 19 additions & 15 deletions cmd/manager/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -12,10 +12,11 @@ import (
"strings"
"time"

"github.com/elastic/cloud-on-k8s/pkg/about"
// allow gcp authentication
_ "k8s.io/client-go/plugin/pkg/client/auth/gcp"

"github.com/elastic/cloud-on-k8s/pkg/about"
estype "github.com/elastic/cloud-on-k8s/pkg/apis/elasticsearch/v1beta1"
"github.com/elastic/cloud-on-k8s/pkg/controller/apmserver"
asesassn "github.com/elastic/cloud-on-k8s/pkg/controller/apmserverelasticsearchassociation"
"github.com/elastic/cloud-on-k8s/pkg/controller/common/certificates"
Expand All @@ -29,16 +30,11 @@ import (
"github.com/elastic/cloud-on-k8s/pkg/dev"
"github.com/elastic/cloud-on-k8s/pkg/dev/portforward"
"github.com/elastic/cloud-on-k8s/pkg/utils/net"

// TODO (sabo): re-enable when webhooks are usable
// "github.com/elastic/cloud-on-k8s/pkg/webhook"

clientgoscheme "k8s.io/client-go/kubernetes/scheme"
ctrl "sigs.k8s.io/controller-runtime"

"github.com/spf13/cobra"
"github.com/spf13/viper"
"k8s.io/client-go/kubernetes"
clientgoscheme "k8s.io/client-go/kubernetes/scheme"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/cache"
logf "sigs.k8s.io/controller-runtime/pkg/log"
"sigs.k8s.io/controller-runtime/pkg/manager/signals"
Expand All @@ -47,6 +43,7 @@ import (
const (
MetricsPortFlag = "metrics-port"
DefaultMetricPort = 0 // disabled
WebhookPort = 9443

AutoPortForwardFlagName = "auto-port-forward"
NamespacesFlag = "namespaces"
Expand All @@ -56,6 +53,7 @@ const (
CertValidityFlag = "cert-validity"
CertRotateBeforeFlag = "cert-rotate-before"

EnableWebhooksFlag = "enable-webhooks"
AutoInstallWebhooksFlag = "auto-install-webhooks"
OperatorNamespaceFlag = "operator-namespace"
WebhookSecretFlag = "webhook-secret"
Expand Down Expand Up @@ -122,6 +120,11 @@ func init() {
certificates.DefaultRotateBefore,
"Duration representing how long before expiration TLS certificates should be reissued",
)
Cmd.Flags().Bool(
EnableWebhooksFlag,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This somewhat duplicates what we used to express with the webhook role. I am fine with moving away from the operator roles concept but I think we should not have both the role and the explicit flag at the same time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah that is an oversight on my part, I forgot about the roles. I'll change this, we'll just want to change the default to not-webhook now I think. And change it back when we have automatic webhook configuration

false,
"enables serving webhooks from the operator pod. Webhook configuration must also be created to function",
)
Cmd.Flags().Bool(
AutoInstallWebhooksFlag,
true,
Expand Down Expand Up @@ -236,6 +239,8 @@ func execute() {
}
opts.MetricsBindAddress = fmt.Sprintf(":%d", metricsPort) // 0 to disable

// override the default of 443 because we run as non-root
opts.Port = WebhookPort
mgr, err := ctrl.NewManager(cfg, opts)
if err != nil {
log.Error(err, "unable to create controller manager")
Expand Down Expand Up @@ -313,13 +318,12 @@ func execute() {
os.Exit(1)
}
}

// TODO (sabo): re-enable when webhooks are usable
// log.Info("Setting up webhooks")
// if err := webhook.AddToManager(mgr, roles, newWebhookParameters); err != nil {
// log.Error(err, "unable to register webhooks to the manager")
// os.Exit(1)
// }
if viper.GetBool(EnableWebhooksFlag) {
if err = (&estype.Elasticsearch{}).SetupWebhookWithManager(mgr); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only validates Elasticsearch resources but we used to have validations on (license) Secrets as well. Just flicking through the docs I think it should be possible to re-introduce them, but it might be worth revisiting the decision to validate secrets because it can potentially have a big impact on the stability of the k8s installation. Maybe worth re-introducing a license CRD?

log.Error(err, "unable to create webhook", "webhook", "Elasticsearch")
os.Exit(1)
}
}

log.Info("Starting the manager", "uuid", operatorInfo.OperatorUUID,
"namespace", operatorNamespace, "version", operatorInfo.BuildInfo.Version,
Expand Down
168 changes: 168 additions & 0 deletions config/recipes/cert-manager/webhook.asciidoc
Original file line number Diff line number Diff line change
@@ -0,0 +1,168 @@
[id="{p}-webhook"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we not add this file in our current documentation ? Not sure I would look into the recipes directory if I was looking for information about how to configure the webhook with the cert manager.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with moving it there. I think I had it here with the idea that this would be a rarely used option once we have automatic configuration enabled

== Validating webhook

A validating webhook provides additional validation of Elasticsearch resources beyond what can be specified in OpenAPI specs. This allows advanced validation to help prevent Elasticsearch resources that are invalid and will not function correctly from being admitted by the Kubernetes API server, which can aid troubleshooting. To enable the validating webhook, you must first install cert-manager v0.11+ as described in their instructions link:from being admitted by the Kubernetes API server[here].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The link at the end of the sentence if not set

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

link to cert-manager doc is missing, probably a copy/paste mistake


Once that is complete, you can apply this yaml to create all of the resources necessary for the validating webhook to function. This same yaml is also available in `cloud-on-k8s/config/recipes/cert-manager/webhook.yaml`:

[source,yaml]
----
cat $$<<$$EOF | kubectl apply -f -
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$$<<$$EOF is rendered "as is" with github, not sure why

---
# this configures
# - a self signed cert-manager issuer
# - a service to point to the webhook
# - a self signed certificate for the webhook service
# - a validating webhook configuration
# - updates the ECK operator statefulset to enable the webhook
apiVersion: cert-manager.io/v1alpha2
kind: Issuer
metadata:
name: selfsigned-issuer
namespace: elastic-system
spec:
selfSigned: {}
...
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason why you put document end markers ("...") between the documents?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just got in the habit of it, I can take it out if there's a reason I'm missing

---
apiVersion: cert-manager.io/v1alpha2
kind: Certificate
metadata:
name: elastic-webhook
namespace: elastic-system
spec:
commonName: elastic-webhook.elastic-system.svc
dnsNames:
- elastic-webhook.elastic-system.svc.cluster.local
- elastic-webhook.elastic-system.svc
issuerRef:
kind: Issuer
name: selfsigned-issuer
secretName: elastic-webhook-server-cert
...
---
apiVersion: v1
kind: Service
metadata:
name: elastic-webhook
namespace: elastic-system
spec:
ports:
- port: 443
protocol: TCP
targetPort: 9443
selector:
control-plane: elastic-operator
sessionAffinity: None
type: ClusterIP
...
---
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
name: elastic-webhook.k8s.elastic.co
annotations:
cert-manager.io/inject-ca-from: elastic-system/elastic-webhook
webhooks:
- clientConfig:
caBundle: Cg==
service:
name: elastic-webhook
namespace: elastic-system
# this is the path controller-runtime automatically generates
path: /validate-elasticsearch-k8s-elastic-co-v1beta1-elasticsearch
failurePolicy: Ignore
name: elastic-es-validation.k8s.elastic.co
timeoutSeconds: 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get this error when applying the ValidationWebhookConfiguration on GKE v1.13.12-gke.8 :
error: error validating "STDIN": error validating data: ValidationError(ValidatingWebhookConfiguration.webhooks[0]): unknown field "timeoutSeconds" in io.k8s.api.admissionregistration.v1beta1.Webhook; if you choose to ignore these errors, turn validation off with --validate=false

TBH I don't understand why... 😕

It's fine on GKE 1.14, so unless someone else understand the error I have on GKE 1.13 I would remove timeoutSeconds and keep the default (which is 30 seconds)

Copy link
Collaborator

@pebrc pebrc Nov 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The field simply did not exist prior to 1.14 see kubernetes/api@c33fb2b and kubernetes/kubernetes#74562

sideEffects: None
rules:
- apiGroups:
- elasticsearch.k8s.elastic.co
apiVersions:
- v1beta1
operations:
- CREATE
- UPDATE
resources:
- elasticsearches
...
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
labels:
control-plane: elastic-operator
name: elastic-operator
namespace: elastic-system
spec:
podManagementPolicy: OrderedReady
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
control-plane: elastic-operator
serviceName: elastic-operator
template:
metadata:
creationTimestamp: null
labels:
control-plane: elastic-operator
spec:
containers:
- args:
- manager
- --operator-roles
- all
- --enable-debug-logs=true
- --enable-webhooks=true
env:
- name: OPERATOR_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
- name: OPERATOR_IMAGE
value: docker.elastic.co/eck/eck-operator:1.0.0-beta1
# this may need to be updated to a snapshot build
image: docker.elastic.co/eck/eck-operator:1.0.0-beta1
imagePullPolicy: Always
name: manager
ports:
- containerPort: 9443
name: webhook-server
protocol: TCP
volumeMounts:
# this is the path controller-runtime looks for certs and should not be changed
- mountPath: /tmp/k8s-webhook-server/serving-certs
name: cert
readOnly: true
resources:
limits:
cpu: "1"
memory: 150Mi
requests:
cpu: 100m
memory: 50Mi
volumes:
- name: cert
secret:
defaultMode: 420
secretName: elastic-webhook-server-cert
dnsPolicy: ClusterFirst
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: elastic-operator
serviceAccountName: elastic-operator
terminationGracePeriodSeconds: 10
updateStrategy:
rollingUpdate:
partition: 0
type: RollingUpdate
...
EOF
----


=== Troubleshooting

You can change the `failurePolicy` of the webhook configuration to `Fail`, which will cause creations to error out if there is an error contacting the webhook.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember why we have a default failurePolicy set to Ignore, I have some mixed feelings about this as I have been able to inject a incorrect manifest while testing certificate renewal: certificate was not up to date in the operator, got this message in ECK: http: TLS handshake error from 10.64.1.1:38142: remote error: tls: bad certificate but the resource was created anyway.

As a side note this part is not specific to the cert-manager so I guess we should eventually include it in a more generic documentation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to have Fail but that caused a lot of issues with users having trouble with the webhooks. We decided because of these issues to loosen the policy to strike a balance between giving a better experience through immediate validation and not blocking users (especially because we were also validating k8s secrets in < 1.0)

Loading