
Make controller clients more resilient to token changes #3498

Open
liggitt opened this issue Jun 29, 2015 · 19 comments
Labels
area/security, component/auth, help wanted, kind/feature, lifecycle/frozen, priority/P2

Comments

@liggitt
Contributor

liggitt commented Jun 29, 2015

Controllers are using service account tokens to make API calls for builds/deployments/replication controllers.

There are circumstances where their API tokens are no longer valid:

  1. Token is deleted/rotated
  2. Signing key is changed

Under those circumstances, the infrastructure components should be able to obtain an updated token without requiring the admin to manually delete/recreate tokens or restart the server.

@smarterclayton
Contributor

I question that we view this as p2 1.5 years later :)

@liggitt
Contributor Author

liggitt commented Jan 23, 2017

more things are using controllers now. I'd call it p1

@soltysh added this to the 1.6.0 milestone Feb 3, 2017
@liggitt modified the milestone: 3.6.0 Jun 14, 2017
@enj added the help wanted label Oct 9, 2017
@adelton
Contributor

adelton commented Oct 16, 2017

Could you provide steps to reproduce the situation which currently leads to failure?

@liggitt
Contributor Author

liggitt commented Oct 16, 2017

With the controllers running, remove the service account tokens in the openshift-infra or kube-system namespaces. The controllers start encountering 401 errors but do not exit; exiting would let them pick up new, valid credentials on restart.

@adelton
Contributor

adelton commented Oct 17, 2017

I have an OpenShift 3.7 all-in-one setup.

I've created a deployment via

oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git

and got

# oc status
In project innovation-2017 on server https://qe-blade-11.idmqe.lab.eng.bos.redhat.com:8443

http://ruby-ex.example.test to pod port 8080-tcp (svc/ruby-ex)
  dc/ruby-ex deploys istag/ruby-ex:latest <-
    bc/ruby-ex source builds https://github.com/openshift/ruby-ex.git on istag/ruby-22-centos7:latest 
    deployment #1 deployed 20 hours ago - 1 pod

View details with 'oc describe <resource>/<name>' or list everything with 'oc get all'.

There are no secrets in

# oc get secrets -n kuby-system
No resources found.

but there's a bunch of them in openshift-infra:

# oc get secrets -n openshift-infra
NAME                                                      TYPE                                  DATA      AGE
[...]
service-serving-cert-controller-dockercfg-3jls5           kubernetes.io/dockercfg               1         20h
service-serving-cert-controller-token-0vlf6               kubernetes.io/service-account-token   4         20h
service-serving-cert-controller-token-p6672               kubernetes.io/service-account-token   4         20h
serviceaccount-controller-dockercfg-xt97c                 kubernetes.io/dockercfg               1         20h
serviceaccount-controller-token-5tp8n                     kubernetes.io/service-account-token   4         20h
serviceaccount-controller-token-zmnf9                     kubernetes.io/service-account-token   4         20h
serviceaccount-pull-secrets-controller-dockercfg-1clmr    kubernetes.io/dockercfg               1         20h
serviceaccount-pull-secrets-controller-token-41gmz        kubernetes.io/service-account-token   4         20h
serviceaccount-pull-secrets-controller-token-7p29z        kubernetes.io/service-account-token   4         20h
[...]

I've deleted the two serviceaccount-controller-tokens:

# oc delete secret/serviceaccount-controller-token-5tp8n -n openshift-infra
secret "serviceaccount-controller-token-5tp8n" deleted
# oc delete secret/serviceaccount-controller-token-zmnf9 -n openshift-infra
secret "serviceaccount-controller-token-zmnf9" deleted

I have redeployed with

# oc rollout latest dc/ruby-ex
deploymentconfig "ruby-ex" rolled out

but things are still working:

# oc status
In project innovation-2017 on server https://qe-blade-11.idmqe.lab.eng.bos.redhat.com:8443

http://ruby-ex.example.test to pod port 8080-tcp (svc/ruby-ex)
  dc/ruby-ex deploys istag/ruby-ex:latest <-
    bc/ruby-ex source builds https://github.com/openshift/ruby-ex.git on istag/ruby-22-centos7:latest 
    deployment #2 deployed 4 minutes ago - 1 pod
    deployment #1 deployed 20 hours ago

View details with 'oc describe <resource>/<name>' or list everything with 'oc get all'.

with no (relevant) errors in the node log.

What should I have done differently to reproduce the issue?

@liggitt
Contributor Author

liggitt commented Oct 17, 2017

There are no secrets in

# oc get secrets -n kuby-system

kube-system :)

The errors would appear in the controller logs. For example, deleting the build-controller-* secrets would disrupt the build controller.

@adelton
Contributor

adelton commented Oct 17, 2017

kube-system :)

Ouch. Sorry about that.

The errors would appear in the controller logs. For example, deleting the build-controller-* secrets would disrupt the build controller.

I've deleted

# oc delete secret/replication-controller-token-8zfdh secret/replication-controller-token-v41d4 -n kube-system

and that seems to have disrupted the rc/ruby-ex:

ruby-ex-5-deploy   0/1       Error       0          1h

Where do I find the controller logs? In the -deploy pod log, there does not seem to be anything specific about the 401 errors:

# oc logs ruby-ex-5-deploy
--> Scaling up ruby-ex-5 from 0 to 1, scaling down ruby-ex-4 from 1 to 0 (keep 1 pods available, don't exceed 2 pods)
    Scaling ruby-ex-5 up to 1
error: timed out waiting for "ruby-ex-5" to be synced

@liggitt
Contributor Author

liggitt commented Oct 17, 2017

in whatever process is running the openshift controllers, which depends on your setup method:

  • stdout of openshift start master
  • output of the apiserver container with oc cluster up
  • journalctl -u atomic-openshift-controllers.service on an installed system
  • etc

@adelton
Contributor

adelton commented Oct 20, 2017

Thanks, I've got the reproducer and the error messages now.

As for

Under those circumstances, the infrastructure components should be able to obtain an updated token without requiring the admin to manually delete/recreate tokens or restart the server.

is the re-retrieval of the token already done somewhere in the code base so that we could reuse the same code, or is this the first time something like this is implemented in Kubernetes/OpenShift?

@adelton
Contributor

adelton commented Oct 24, 2017

Observation on Kubernetes master: when secret/replication-controller-token-* is deleted, it gets automatically recreated:

$ cluster/kubectl.sh get secrets -n kube-system | grep replication-controller-token
replication-controller-token-czbln       kubernetes.io/service-account-token   3         3m
$ cluster/kubectl.sh delete secret replication-controller-token-czbln -n kube-system
secret "replication-controller-token-czbln" deleted
$ cluster/kubectl.sh get secrets -n kube-system | grep replication-controller-token
replication-controller-token-fkfbn       kubernetes.io/service-account-token   3         2s
$ 

If I keep creating pods from

apiVersion: v1
kind: Pod
metadata:
  generateName: test-security-context-
spec:
  restartPolicy: Never
  containers:
  - name: test-security-context
    image: centos:latest
    command:
    - "sleep"
    - "infinity"

via cluster/kubectl.sh create -f pod.yaml in a loop, while also repeatedly deleting secret/replication-controller-token-*, the containers still get started and run.

Does this mean that Kubernetes addressed (at least some of) the issue in the latest version and OpenShift will get it in 3.8? Or am I way off in trying to reproduce and observe the behaviour?

@liggitt
Contributor Author

liggitt commented Oct 24, 2017

Does this mean that Kubernetes addressed (at least some of) the issue in the latest version and OpenShift will get it in 3.8?

No. The tokens get recreated, but a running controller manager keeps trying to use the old tokens until restarted.

@adelton
Contributor

adelton commented Oct 24, 2017

Ah, mea culpa, I was using a pod file when I wanted to work with a replication controller file.

With

apiVersion: v1
kind: ReplicationController
metadata:
  name: test-token-removal-rc-1
spec:
  replicas: 1
  selector:
    app: test-token-removal
  template:
    metadata:
      name: test-token-removal
      labels:
        app: test-token-removal
    spec:
      containers:
      - name: test-token-removal-pod-1
        image: centos:latest
        command:
        - "sleep"
        - "infinity"

After the token is rotated, I see replication_controller.go:422] Unauthorized and the pod does not get created.

@adelton
Contributor

adelton commented Oct 26, 2017

What's the preferred method to go about it?

Should the individual controllers exit when they start getting errors.IsUnauthorized(err), and should the controller manager notice that they are gone and restart them?

Or should we add a way for the controller manager to call into the controllers and pass the new tokens to the client objects?
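
For concreteness, a rough sketch of the first alternative (the function and its names are hypothetical, not existing controller code; errors.IsUnauthorized is the existing helper in k8s.io/apimachinery/pkg/api/errors). The loop bails out after a run of consecutive 401s so the controller manager (or the process supervisor) can restart the controller with a freshly minted token:

package example

import (
	"fmt"
	"log"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// runLoop is a stand-in for a single controller's work loop; syncOnce is
// whatever the controller does per iteration. It returns once it sees a run
// of consecutive 401s, so the caller can restart the controller with new
// credentials.
func runLoop(syncOnce func() error) error {
	const maxConsecutive401s = 5
	consecutive401s := 0

	for {
		err := syncOnce()
		switch {
		case err == nil:
			consecutive401s = 0
		case apierrors.IsUnauthorized(err):
			consecutive401s++
			if consecutive401s >= maxConsecutive401s {
				// Token deleted or signing key rotated; give up and let the
				// caller restart this controller with fresh credentials.
				return fmt.Errorf("persistent 401s, credentials look revoked: %v", err)
			}
		default:
			log.Printf("sync error: %v", err)
		}
		time.Sleep(time.Second)
	}
}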

@liggitt
Contributor Author

liggitt commented Nov 2, 2017

not sure... if it were easy we would have done it earlier :)

options that immediately come to mind:

  • modify the controllers to exit on persistent 401s, and the controller manager to restart them (large, lots of upstream changes)
  • modify the clientbuilder to create clients for the controllers that embed the ability to fetch a new credential when a persistent 401 is encountered (would likely involve a custom WrapTransport in the config, rather than populating the BearerToken)
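
A rough, hypothetical sketch of what the second option could look like (this is not the actual clientbuilder code): a transport wrapper that injects the current token on each request and, on a 401, calls a caller-supplied getToken func for a fresh credential and retries once. It assumes requests without bodies, which are safe to replay; real code would need to handle request bodies, concurrency, and error cases more carefully.

package example

import (
	"net/http"
	"sync"

	"k8s.io/client-go/rest"
)

// refreshingTokenTransport sets the Authorization header from a token that
// can be refreshed once the API server starts answering 401 Unauthorized.
type refreshingTokenTransport struct {
	mu       sync.Mutex
	token    string
	getToken func() (string, error) // hypothetical: re-reads the SA token secret
	base     http.RoundTripper
}

func (t *refreshingTokenTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	t.mu.Lock()
	token := t.token
	t.mu.Unlock()

	authed := req.Clone(req.Context())
	authed.Header.Set("Authorization", "Bearer "+token)

	resp, err := t.base.RoundTrip(authed)
	if err != nil || resp.StatusCode != http.StatusUnauthorized {
		return resp, err
	}

	// Credential problem: try to fetch a new token and replay the request once.
	// (Only safe here because we assume there is no request body.)
	newToken, terr := t.getToken()
	if terr != nil || newToken == token {
		return resp, nil // keep the original 401
	}
	resp.Body.Close()

	t.mu.Lock()
	t.token = newToken
	t.mu.Unlock()

	retry := req.Clone(req.Context())
	retry.Header.Set("Authorization", "Bearer "+newToken)
	return t.base.RoundTrip(retry)
}

// withRefreshingToken returns a copy of cfg that authenticates through the
// wrapper above instead of a static BearerToken.
func withRefreshingToken(cfg *rest.Config, getToken func() (string, error)) *rest.Config {
	out := rest.CopyConfig(cfg)
	out.BearerToken = "" // the wrapped transport handles auth from now on
	out.WrapTransport = func(rt http.RoundTripper) http.RoundTripper {
		initial, _ := getToken()
		return &refreshingTokenTransport{token: initial, getToken: getToken, base: rt}
	}
	return out
}

The hand-wavy part in this sketch is the getToken callback: it needs some credential that stays valid (or a way to request a new token) in order to fetch the replacement.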

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label Feb 24, 2018
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Mar 26, 2018
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@enj added the lifecycle/frozen label and removed the lifecycle/rotten label Apr 26, 2018
@enj reopened this Apr 26, 2018
@liggitt
Contributor Author

liggitt commented Aug 30, 2018

/unassign

@openshift-ci-robot added the kind/feature label and removed the kind/enhancement label Apr 14, 2019
@enj
Contributor

enj commented Oct 16, 2019

/unassign

@stlaz @sttts @mfojtik
