
Make controller clients more resilient to token changes #3498

Open
liggitt opened this issue Jun 29, 2015 · 19 comments
Labels
area/security, component/auth, help wanted, kind/feature, lifecycle/frozen, priority/P2

Comments

@liggitt
Contributor

liggitt commented Jun 29, 2015

Controllers are using service account tokens to make API calls for builds/deployments/replication controllers.

There are circumstances where their API tokens are no longer valid:

  1. Token is deleted/rotated
  2. Signing key is changed

Under those circumstances, the infrastructure components should be able to obtain an updated token without requiring the admin to manually delete/recreate tokens or restart the server.

@smarterclayton
Contributor

I question that we view this as p2 1.5 years later :)

@liggitt
Contributor Author

liggitt commented Jan 23, 2017

more things are using controllers now. I'd call it p1

@soltysh added this to the 1.6.0 milestone Feb 3, 2017
@liggitt modified the milestone: 3.6.0 Jun 14, 2017
@enj added the help wanted label Oct 9, 2017
@adelton
Contributor

adelton commented Oct 16, 2017

Could you provide steps to reproduce the situation which currently leads to failure?

@liggitt
Contributor Author

liggitt commented Oct 16, 2017

With the controllers running, remove the service account tokens in the openshift-infra or kube-system namespaces. The controllers start encountering 401 errors but do not exit; exiting would let them pick up new, valid credentials on restart.

@adelton
Contributor

adelton commented Oct 17, 2017

I have an OpenShift 3.7 all-in-one setup.

I've created a deployment via

oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git

and got

# oc status
In project innovation-2017 on server https://qe-blade-11.idmqe.lab.eng.bos.redhat.com:8443

http://ruby-ex.example.test to pod port 8080-tcp (svc/ruby-ex)
  dc/ruby-ex deploys istag/ruby-ex:latest <-
    bc/ruby-ex source builds https://github.com/openshift/ruby-ex.git on istag/ruby-22-centos7:latest 
    deployment #1 deployed 20 hours ago - 1 pod

View details with 'oc describe <resource>/<name>' or list everything with 'oc get all'.

There are no secrets in

# oc get secrets -n kuby-system
No resources found.

but there's a bunch of them in openshift-infra:

# oc get secrets -n openshift-infra
NAME                                                      TYPE                                  DATA      AGE
[...]
service-serving-cert-controller-dockercfg-3jls5           kubernetes.io/dockercfg               1         20h
service-serving-cert-controller-token-0vlf6               kubernetes.io/service-account-token   4         20h
service-serving-cert-controller-token-p6672               kubernetes.io/service-account-token   4         20h
serviceaccount-controller-dockercfg-xt97c                 kubernetes.io/dockercfg               1         20h
serviceaccount-controller-token-5tp8n                     kubernetes.io/service-account-token   4         20h
serviceaccount-controller-token-zmnf9                     kubernetes.io/service-account-token   4         20h
serviceaccount-pull-secrets-controller-dockercfg-1clmr    kubernetes.io/dockercfg               1         20h
serviceaccount-pull-secrets-controller-token-41gmz        kubernetes.io/service-account-token   4         20h
serviceaccount-pull-secrets-controller-token-7p29z        kubernetes.io/service-account-token   4         20h
[...]

I've deleted the two serviceaccount-controller-tokens:

# oc delete secret/serviceaccount-controller-token-5tp8n -n openshift-infra
secret "serviceaccount-controller-token-5tp8n" deleted
# oc delete secret/serviceaccount-controller-token-zmnf9 -n openshift-infra
secret "serviceaccount-controller-token-zmnf9" deleted

I have redeployed with

# oc rollout latest dc/ruby-ex
deploymentconfig "ruby-ex" rolled out

but things are still working:

# oc status
In project innovation-2017 on server https://qe-blade-11.idmqe.lab.eng.bos.redhat.com:8443

http://ruby-ex.example.test to pod port 8080-tcp (svc/ruby-ex)
  dc/ruby-ex deploys istag/ruby-ex:latest <-
    bc/ruby-ex source builds https://github.com/openshift/ruby-ex.git on istag/ruby-22-centos7:latest 
    deployment #2 deployed 4 minutes ago - 1 pod
    deployment #1 deployed 20 hours ago

View details with 'oc describe <resource>/<name>' or list everything with 'oc get all'.

with no (relevant) errors in the node log.

What should I have done differently to reproduce the issue?

@liggitt
Contributor Author

liggitt commented Oct 17, 2017

There are no secrets in

# oc get secrets -n kuby-system

kube-system :)

The errors would appear in the controller logs. For example, deleting the build-controller-* secrets would disrupt the build controller.

@adelton
Contributor

adelton commented Oct 17, 2017

kube-system :)

Ouch. Sorry about that.

The errors would appear in the controller logs. For example, deleting the build-controller-* secrets would disrupt the build controller.

I've deleted

# oc delete secret/replication-controller-token-8zfdh secret/replication-controller-token-v41d4 -n kube-system

and that seems to have disrupted the rc/ruby-ex:

ruby-ex-5-deploy   0/1       Error       0          1h

Where do I find the controller logs? In the -deploy pod log, there does not seem to be anything specific about the 401 errors:

# oc logs ruby-ex-5-deploy
--> Scaling up ruby-ex-5 from 0 to 1, scaling down ruby-ex-4 from 1 to 0 (keep 1 pods available, don't exceed 2 pods)
    Scaling ruby-ex-5 up to 1
error: timed out waiting for "ruby-ex-5" to be synced

@liggitt
Contributor Author

liggitt commented Oct 17, 2017

in whatever process is running the openshift controllers, which depends on your setup method:

  • stdout of openshift start master
  • output of the apiserver container with oc cluster up
  • journalctl -u atomic-openshift-controllers.service on an installed system
  • etc

@adelton
Contributor

adelton commented Oct 20, 2017

Thanks, I've got the reproducer and the error messages now.

As for

Under those circumstances, the infrastructure components should be able to obtain an updated token without requiring the admin to manually delete/recreate tokens or restart the server.

is the re-retrieval of the token already done somewhere in the code base so that we could reuse the same code, or is this the first time something like this is implemented in Kubernetes/OpenShift?

@adelton
Contributor

adelton commented Oct 24, 2017

Observation on Kubernetes master: when secret/replication-controller-token-* is deleted, it gets automatically recreated:

$ cluster/kubectl.sh get secrets -n kube-system | grep replication-controller-token
replication-controller-token-czbln       kubernetes.io/service-account-token   3         3m
$ cluster/kubectl.sh delete secret replication-controller-token-czbln -n kube-system
secret "replication-controller-token-czbln" deleted
$ cluster/kubectl.sh get secrets -n kube-system | grep replication-controller-token
replication-controller-token-fkfbn       kubernetes.io/service-account-token   3         2s
$ 

If I keep creating pods from

apiVersion: v1
kind: Pod
metadata:
  generateName: test-security-context-
spec:
  restartPolicy: Never
  containers:
  - name: test-security-context
    image: centos:latest
    command:
    - "sleep"
    - "infinity"

via cluster/kubectl.sh create -f pod.yaml in a loop, while also repeatedly deleting secret/replication-controller-token-*, the containers still get started and run.

Does this mean that Kubernetes addressed (at least some of) the issue in the latest version and OpenShift will get it in 3.8? Or am I way off in trying to reproduce and observe the behaviour?

@liggitt
Contributor Author

liggitt commented Oct 24, 2017

Does this mean that Kubernetes addressed (at least some of) the issue in the latest version and OpenShift will get it in 3.8?

No. The tokens get recreated, but a running controller manager keeps trying to use the old tokens until restarted.

@adelton
Contributor

adelton commented Oct 24, 2017

Ah, mea culpa, I was using a pod file when I wanted to work with a replication controller file.

With

apiVersion: v1
kind: ReplicationController
metadata:
  name: test-token-removal-rc-1
spec:
  replicas: 1
  selector:
    app: test-token-removal
  template:
    metadata:
      name: test-token-removal
      labels:
        app: test-token-removal
    spec:
      containers:
      - name: test-token-removal-pod-1
        image: centos:latest
        command:
        - "sleep"
        - "infinity"

After the token is rotated, I see replication_controller.go:422] Unauthorized and the pod does not get created.

@adelton
Contributor

adelton commented Oct 26, 2017

What's the preferred method to go about it?

Should the individual controllers exit when they start getting errors.IsUnauthorized(err), and should the controller manager notice that they are gone and restart them?

Or should we add a way for the controller manager to call into the controllers and pass the new tokens to the client objects?
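
For concreteness, a rough sketch of the first alternative (the function and its names are hypothetical, not existing controller code; errors.IsUnauthorized is the existing helper in k8s.io/apimachinery/pkg/api/errors). The loop bails out after a run of consecutive 401s so the controller manager (or the process supervisor) can restart the controller with a freshly minted token:

package example

import (
	"fmt"
	"log"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// runLoop is a stand-in for a single controller's work loop; syncOnce is
// whatever the controller does per iteration. It returns once it sees a run
// of consecutive 401s, so the caller can restart the controller with new
// credentials.
func runLoop(syncOnce func() error) error {
	const maxConsecutive401s = 5
	consecutive401s := 0

	for {
		err := syncOnce()
		switch {
		case err == nil:
			consecutive401s = 0
		case apierrors.IsUnauthorized(err):
			consecutive401s++
			if consecutive401s >= maxConsecutive401s {
				// Token deleted or signing key rotated; give up and let the
				// caller restart this controller with fresh credentials.
				return fmt.Errorf("persistent 401s, credentials look revoked: %v", err)
			}
		default:
			log.Printf("sync error: %v", err)
		}
		time.Sleep(time.Second)
	}
}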

@liggitt
Contributor Author

liggitt commented Nov 2, 2017

not sure... if it were easy we would have done it earlier :)

options that immediately come to mind:

  • modify the controllers to exit on persistent 401s, and the controller manager to restart them (large, lots of upstream changes)
  • modify the clientbuilder to create clients for the controllers that embed the ability to fetch a new credential when a persistent 401 is encountered (would likely involve a custom WrapTransport in the config, rather than populating the BearerToken)
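
A rough, hypothetical sketch of what the second option could look like (this is not the actual clientbuilder code): a transport wrapper that injects the current token on each request and, on a 401, calls a caller-supplied getToken func for a fresh credential and retries once. It assumes requests without bodies, which are safe to replay; real code would need to handle request bodies, concurrency, and error cases more carefully.

package example

import (
	"net/http"
	"sync"

	"k8s.io/client-go/rest"
)

// refreshingTokenTransport sets the Authorization header from a token that
// can be refreshed once the API server starts answering 401 Unauthorized.
type refreshingTokenTransport struct {
	mu       sync.Mutex
	token    string
	getToken func() (string, error) // hypothetical: re-reads the SA token secret
	base     http.RoundTripper
}

func (t *refreshingTokenTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	t.mu.Lock()
	token := t.token
	t.mu.Unlock()

	authed := req.Clone(req.Context())
	authed.Header.Set("Authorization", "Bearer "+token)

	resp, err := t.base.RoundTrip(authed)
	if err != nil || resp.StatusCode != http.StatusUnauthorized {
		return resp, err
	}

	// Credential problem: try to fetch a new token and replay the request once.
	// (Only safe here because we assume there is no request body.)
	newToken, terr := t.getToken()
	if terr != nil || newToken == token {
		return resp, nil // keep the original 401
	}
	resp.Body.Close()

	t.mu.Lock()
	t.token = newToken
	t.mu.Unlock()

	retry := req.Clone(req.Context())
	retry.Header.Set("Authorization", "Bearer "+newToken)
	return t.base.RoundTrip(retry)
}

// withRefreshingToken returns a copy of cfg that authenticates through the
// wrapper above instead of a static BearerToken.
func withRefreshingToken(cfg *rest.Config, getToken func() (string, error)) *rest.Config {
	out := rest.CopyConfig(cfg)
	out.BearerToken = "" // the wrapped transport handles auth from now on
	out.WrapTransport = func(rt http.RoundTripper) http.RoundTripper {
		initial, _ := getToken()
		return &refreshingTokenTransport{token: initial, getToken: getToken, base: rt}
	}
	return out
}

The hand-wavy part in this sketch is the getToken callback: it needs some credential that stays valid (or a way to request a new token) in order to fetch the replacement.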

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label Feb 24, 2018
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Mar 26, 2018
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@enj added the lifecycle/frozen label and removed the lifecycle/rotten label Apr 26, 2018
@enj reopened this Apr 26, 2018
@liggitt
Contributor Author

liggitt commented Aug 30, 2018

/unassign

@openshift-ci-robot added the kind/feature label and removed the kind/enhancement label Apr 14, 2019
@enj
Contributor

enj commented Oct 16, 2019

/unassign

@stlaz @sttts @mfojtik
