Cache Deployer - Error when upgrading in GCP marketplace #4299

Closed
Bobgy opened this issue Jul 31, 2020 · 3 comments · Fixed by #4320

Bobgy commented Jul 31, 2020

What steps did you take:

  1. install 0.5.1 in GCP marketplace
  2. delete KFP application
  3. reinstall 1.0.0 in GCP marketplace to the same namespace
  4. Run kubectl get pod and check the cache-deployer logs: cache-deployer crashed on the first run but started running on the second (though the resources had already been created by then).

What happened:

Here's the error log for cache deployer:

(base) ➜ ~ k logs cache-deployer-deployment-7c75f45485-zcbz7 --previous

+ echo 'Start deploying cache service to existing cluster:'
+ NAMESPACE=kfp2
+ MUTATING_WEBHOOK_CONFIGURATION_NAME=cache-webhook-kfp2
+ WEBHOOK_SECRET_NAME=webhook-server-tls
Start deploying cache service to existing cluster:
+ kubectl get mutatingwebhookconfigurations cache-webhook-kfp2 --namespace kfp2 --ignore-not-found
+ kubectl get secrets webhook-server-tls --namespace kfp2 --ignore-not-found
+ webhook_config_exists=false
+ grep cache-webhook-kfp2 -w
cache-webhook-kfp2   2020-07-01T05:04:03Z
+ webhook_config_exists=true
+ webhook_secret_exists=false
+ grep webhook-server-tls -w
+ '[' true '==' true ]
+ '[' false '==' true ]
+ '[' true '==' true ]
Warning: Webhook config exists, but the secret does not exist. Reinstalling.
+ echo 'Warning: Webhook config exists, but the secret does not exist. Reinstalling.'
+ kubectl delete mutatingwebhookconfigurations cache-webhook-kfp2 --namespace kfp2
Error from server (Forbidden): mutatingwebhookconfigurations.admissionregistration.k8s.io "cache-webhook-kfp2" is forbidden: User "system:serviceaccount:kfp2:kubeflow-pipelines-cache-deployer-sa" cannot delete resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
+ true
+ '[' false '==' true ]
+ export 'CA_FILE=ca_cert'
+ rm -f ca_cert
+ touch ca_cert
+ ./webhook-create-signed-cert.sh --namespace kfp2 --cert_output_path ca_cert --secret webhook-server-tls
+ [[ 6 -gt 0 ]]
+ case ${1} in
+ namespace=kfp2
+ shift
+ shift
+ [[ 4 -gt 0 ]]
+ case ${1} in
+ cert_output_path=ca_cert
+ shift
+ shift
+ [[ 2 -gt 0 ]]
+ case ${1} in
+ secret=webhook-server-tls
+ shift
+ shift
+ [[ 0 -gt 0 ]]
+ '[' -z ']'
+ service=cache-server
+ '[' -z webhook-server-tls ']'
+ '[' -z kfp2 ']'
+ '[' -z ca_cert ']'
++ command -v openssl
+ '[' '!' -x /usr/bin/openssl ']'
+ csrName=cache-server.kfp2
++ mktemp -d
+ tmpdir=/tmp/tmp.fGKJek
+ echo 'creating certs in tmpdir /tmp/tmp.fGKJek '
creating certs in tmpdir /tmp/tmp.fGKJek
+ cat
+ openssl genrsa -out /tmp/tmp.fGKJek/server-key.pem 2048
Generating RSA private key, 2048 bit long modulus (2 primes)
................................+++++
.................+++++
e is 65537 (0x010001)
+ openssl req -new -key /tmp/tmp.fGKJek/server-key.pem -subj /CN=cache-server.kfp2.svc -out /tmp/tmp.fGKJek/server.csr -config /tmp/tmp.fGKJek/csr.conf
start running kubectl...
+ echo 'start running kubectl...'
+ kubectl delete csr cache-server.kfp2
+ true
+ + cat
kubectl create -f -
++ cat /tmp/tmp.fGKJek/server.csr
++ base64
++ tr -d '\n'
certificatesigningrequest.certificates.k8s.io/cache-server.kfp2 created
+ true
+ kubectl get csr cache-server.kfp2
NAME                AGE   REQUESTOR                                                         CONDITION
cache-server.kfp2   1s    system:serviceaccount:kfp2:kubeflow-pipelines-cache-deployer-sa   Pending
+ '[' 0 -eq 0 ']'
+ break
+ kubectl certificate approve cache-server.kfp2
certificatesigningrequest.certificates.k8s.io/cache-server.kfp2 approved
++ seq 10
+ for x in $(seq 10)
++ kubectl get csr cache-server.kfp2 -o 'jsonpath={.status.certificate}'
+ serverCert=<redacted>
+ break
+ openssl base64 -d -A -out /tmp/tmp.fGKJek/server-cert.pem
<redacted>
+ kubectl create secret generic webhook-server-tls --from-file=key.pem=/tmp/tmp.fGKJek/server-key.pem --from-file=cert.pem=/tmp/tmp.fGKJek/server-cert.pem --dry-run -o yaml
+ kubectl -n kfp2 apply -f -
secret/webhook-server-tls created
+ echo 'Signed certificate generated for cache server'
Signed certificate generated for cache server
+ NAMESPACE=kfp2 ./webhook-patch-ca-bundle.sh --cert_input_path ca_cert
+ [[ 2 -gt 0 ]]
+ case ${1} in
+ cert_input_path=ca_cert
+ shift
+ shift
+ [[ 0 -gt 0 ]]
+ '[' -z ca_cert ']'
++ cat ca_cert
<redacted>
CA_BUNDLE patched successfully
+ echo 'CA_BUNDLE patched successfully'
+ cat ./cache-configmap-ca-bundle.yaml
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: cache-webhook-kfp2
webhooks:
  - name: cache-server.kfp2.svc
    clientConfig:
      service:
        name: cache-server
        namespace: kfp2
        path: "/mutate"
      caBundle: <redacted>
    rules:
    - operations: [ "CREATE" ]
      apiGroups: [""]
      apiVersions: ["v1"]
      resources: ["pods"]
+ kubectl apply -f ./cache-configmap-ca-bundle.yaml --namespace kfp2
Error from server (Forbidden): error when applying patch:
{"$setElementOrder/webhooks":[{"name":"cache-server.kfp2.svc"}],"metadata":{"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"admissionregistration.k8s.io/v1beta1\",\"kind\":\"MutatingWebhookConfiguration\",\"metadata\":{\"annotations\":{},\"name\":\"cache-webhook-kfp2\"},\"webhooks\":[{\"clientConfig\":{\"caBundle\":\"<redacted>",\"service\":{\"name\":\"cache-server\",\"namespace\":\"kfp2\",\"path\":\"/mutate\"}},\"name\":\"cache-server.kfp2.svc\",\"rules\":[{\"apiGroups\":[\"\"],\"apiVersions\":[\"v1\"],\"operations\":[\"CREATE\"],\"resources\":[\"pods\"]}]}]}\n"}},"webhooks":[{"clientConfig":{"caBundle":"<redacted>"},"name":"cache-server.kfp2.svc","rules":[{"apiGroups":[""],"apiVersions":["v1"],"operations":["CREATE"],"resources":["pods"]}]}]}
to:
Resource: "admissionregistration.k8s.io/v1beta1, Resource=mutatingwebhookconfigurations", GroupVersionKind: "admissionregistration.k8s.io/v1beta1, Kind=MutatingWebhookConfiguration"
Name: "cache-webhook-kfp2", Namespace: ""
Object: &{map["apiVersion":"admissionregistration.k8s.io/v1beta1" "kind":"MutatingWebhookConfiguration" "metadata":map["annotations":map["kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"admissionregistration.k8s.io/v1beta1\",\"kind\":\"MutatingWebhookConfiguration\",\"metadata\":{\"annotations\":{},\"name\":\"cache-webhook-kfp2\"},\"webhooks\":[{\"clientConfig\":{\"caBundle\":\"<redacted>\",\"service\":{\"name\":\"cache-server\",\"namespace\":\"kfp2\",\"path\":\"/mutate\"}},\"name\":\"cache-server.kfp2.svc\",\"rules\":[{\"apiGroups\":[\"\"],\"apiVersions\":[\"v1\"],\"operations\":[\"CREATE\"],\"resources\":[\"pods\"]}]}]}\n"] "creationTimestamp":"2020-07-01T05:04:03Z" "generation":'\x01' "name":"cache-webhook-kfp2" "resourceVersion":"22335559" "selfLink":"/apis/admissionregistration.k8s.io/v1beta1/mutatingwebhookconfigurations/cache-webhook-kfp2" "uid":"4828a40a-bb58-11ea-8006-42010a8c00b5"] "webhooks":[map["admissionReviewVersions":["v1beta1"] "clientConfig":map["caBundle":"<redacted>" "service":map["name":"cache-server" "namespace":"kfp2" "path":"/mutate"]] "failurePolicy":"Ignore" "name":"cache-server.kfp2.svc" "namespaceSelector":map[] "rules":[map["apiGroups":[""] "apiVersions":["v1"] "operations":["CREATE"] "resources":["pods"] "scope":"*"]] "sideEffects":"Unknown" "timeoutSeconds":'\x1e']]]}
for: "./cache-configmap-ca-bundle.yaml": mutatingwebhookconfigurations.admissionregistration.k8s.io "cache-webhook-kfp2" is forbidden: User "system:serviceaccount:kfp2:kubeflow-pipelines-cache-deployer-sa" cannot patch resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
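
The branch the deployer took can be read off the `[ ... ]` tests in the trace above. Below is a minimal sketch of that decision, reconstructed only from the traced tests and not copied from the actual deployer script. The function name `decide_action` and the action labels are hypothetical, and the outcomes for cases the trace does not show (both resources present, or no webhook config) are assumptions.

```shell
#!/bin/sh
# Hypothetical reconstruction of the existence checks seen in the trace.
# $1: "true" if the MutatingWebhookConfiguration exists
# $2: "true" if the webhook TLS secret exists
decide_action() {
  webhook_config_exists=$1
  webhook_secret_exists=$2
  if [ "$webhook_config_exists" = "true" ] && [ "$webhook_secret_exists" = "true" ]; then
    # Assumption: nothing to do when both resources are already present.
    echo "skip"
  elif [ "$webhook_config_exists" = "true" ]; then
    # The case in the trace: config exists, secret does not.
    echo "reinstall"
  else
    # Assumption: any other combination leads to a regular install.
    echo "install"
  fi
}

# The state reported in this issue: config left over from 0.5.1, secret gone.
decide_action true false
```

In the reported run the "reinstall" branch needs to delete the old webhook config, which is the call that was rejected as Forbidden.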

What did you expect to happen:

cache-deployer should succeed on the first run.

Environment:

How did you deploy Kubeflow Pipelines (KFP)?
GCP marketplace

KFP version: Upgrade from 0.5.1 to 1.0.0

/kind bug

Bobgy commented Jul 31, 2020

/assign @Ark-kun
I suspect something unexpected is going on here. Why did the first cache deployer run keep going after hitting so many permission errors? Did we forget to configure it to abort when a command fails?

However, that run somehow created enough resources that the second run had everything it needed.
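
One possible explanation, sketched under assumptions: the `+ true` line right after the failed delete in the trace is consistent with a `kubectl delete ... || true` pattern, which discards the failure status. The snippet below does not use the real deployer script. It uses `false` as a hypothetical stand-in for the Forbidden kubectl call and contrasts the two behaviours in isolated child shells.

```shell
#!/bin/sh
# "false" is a hypothetical stand-in for a kubectl call rejected as Forbidden.
# Each pattern runs in its own child shell so its "set -e" state is isolated.

# Pattern 1: "|| true" swallows the failure, so the script keeps going.
ignoring_status=0
ignoring=$(sh -c 'set -e; false || true; echo "kept going"') || ignoring_status=$?

# Pattern 2: with "set -e" and no "|| true", the first failure aborts the script.
aborting_status=0
aborting=$(sh -c 'set -e; false; echo "kept going"') || aborting_status=$?

echo "with || true:    output='$ignoring' status=$ignoring_status"
echo "without || true: output='$aborting' status=$aborting_status"
```

The first child shell reaches its `echo` and exits with status 0. The second exits with a non-zero status before reaching `echo`, which is the behaviour one would want when a required delete is forbidden.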

Bobgy commented Jul 31, 2020

cache-server is also no longer working properly: when I create new runs, they don't get cached results.
Also, from the cache-server logs:

$ kubectl logs cache-server-xxxx
2020/07/31 02:29:40 http: TLS handshake error from 10.48.2.1:45878: remote error: tls: bad certificate
2020/07/31 02:30:25 http: TLS handshake error from 10.140.15.196:34740: remote error: tls: bad certificate
2020/07/31 02:30:57 http: TLS handshake error from 10.140.15.196:34912: remote error: tls: bad certificate
2020/07/31 02:31:11 http: TLS handshake error from 10.140.15.196:35038: remote error: tls: bad certificate
2020/07/31 02:31:11 http: TLS handshake error from 10.140.15.194:42100: remote error: tls: bad certificate

I'm getting these error messages.

Bobgy added the status/triaged label Jul 31, 2020

Ark-kun commented Aug 4, 2020

Thank you for finding and reporting this issue.
I will fix it ASAP.
I think the main error is this:

Error from server (Forbidden): mutatingwebhookconfigurations.admissionregistration.k8s.io "cache-webhook-kfp2" is forbidden: User "system:serviceaccount:kfp2:kubeflow-pipelines-cache-deployer-sa" cannot delete resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope

The deployer could not re-install the config, so the secret and the config became out of sync.

It's pretty easy to fix - we just need to add the proper roles to the deployer.
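
As a rough illustration of what "adding the proper roles" could look like, here is a sketch that writes a cluster-scoped RBAC manifest to a temporary file. It is not the content of the actual fix. The ClusterRole and ClusterRoleBinding names and the exact verb list are assumptions; only the service account, namespace, API group, resource, and the need for `delete` and `patch` come from the error messages in this issue.

```shell
#!/bin/sh
# Sketch only: role names and the verb list are assumptions, not the real fix.
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cache-deployer-webhook-clusterrole   # hypothetical name
rules:
- apiGroups: ["admissionregistration.k8s.io"]
  resources: ["mutatingwebhookconfigurations"]
  verbs: ["get", "create", "delete", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cache-deployer-webhook-clusterrolebinding   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cache-deployer-webhook-clusterrole
subjects:
- kind: ServiceAccount
  name: kubeflow-pipelines-cache-deployer-sa
  namespace: kfp2
EOF

# Print the manifest for review.
cat "$manifest"

# On a real cluster, with sufficient rights, one could then run:
#   kubectl apply -f "$manifest"
#   kubectl auth can-i delete mutatingwebhookconfigurations \
#     --as=system:serviceaccount:kfp2:kubeflow-pipelines-cache-deployer-sa
```

MutatingWebhookConfiguration is a cluster-scoped resource, so a namespaced Role would not cover it. That is why the sketch uses a ClusterRole and ClusterRoleBinding.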

k8s-ci-robot pushed a commit that referenced this issue Aug 5, 2020
)

* Backend - Cache - Fixed reinstallation by adding missing roles

* Stop ignoring the deletion errors

* Added patch permission as well

It should not be triggered, but might be useful in the future.
chensun pushed a commit to chensun/pipelines that referenced this issue Aug 7, 2020
…4299 (kubeflow#4320)
Ark-kun added a commit to Ark-kun/pipelines that referenced this issue Aug 17, 2020
…4299 (kubeflow#4320)
Bobgy pushed a commit that referenced this issue Aug 18, 2020
Jeffwan pushed a commit to Jeffwan/pipelines that referenced this issue Dec 9, 2020
…4299 (kubeflow#4320)