-
Notifications
You must be signed in to change notification settings - Fork 5.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Propose rollout for Docker shared PID namespace #207
Changes from 3 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,78 @@ | ||
# Shared PID Namespace | ||
|
||
Pods share namespaces where possible, but a requirement for sharing the PID | ||
namespace has not been defined due to lack of support in Docker. Docker began | ||
supporting a shared PID namespace in 1.12, and other Kubernetes runtimes (rkt, | ||
cri-o, hyper) have already implemented a shared PID namespace. | ||
|
||
This proposal defines a shared PID namespace as a requirement of the Container | ||
Runtime Interface and links its rollout in Docker to that of the CRI. | ||
|
||
## Motivation | ||
|
||
Sharing a PID namespace is discussed in [#1615](https://issues.k8s.io/1615), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: add "sharing in a pod". |
||
and enables: | ||
|
||
1. signaling between containers, which is useful for side cars (e.g. for | ||
signaling a daemon process after rotating logs). | ||
2. easier troubleshooting of pods. | ||
3. addressing [Docker's zombie problem][1] by reaping orphaned zombies in the | ||
infra container. | ||
|
||
## Goals and Non-Goals | ||
|
||
Goals include: | ||
- Changing default behavior in the Docker runtime as implemented by the CRI | ||
- Making Docker behavior compatible with the other Kubernetes runtimes | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is the purpose of this goal? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1. I had questions about the goals. The first one sounds more like an approach to achieve a goal. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My primary goal is to enable the shared PID namespace in the docker runtime. That's what I want to happen. The secondary goal is to make this consistent across runtimes (where feasible). A shared PID namespace is consistent with the pod abstraction. That's the way I think things should be, but it wasn't my original goal. It was suggested by a reviewer. |
||
|
||
Non-goals include: | ||
- Creating an init solution that works for all runtimes | ||
- Supporting isolated PID namespace indefinitely | ||
|
||
## Modification to the Docker Runtime | ||
|
||
We will modify the Docker implementation of the CRI to use a shared PID | ||
namespace when running with a version of Docker >= 1.12. The legacy | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Didn't v1.12 have issues with reparenting? Should we wait for v1.13? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, 1.12 reparents orphans incorrectly and they are reaped outside of the container. That's a no-op in our use case since we don't really care what it is that reaps the zombies. Regular zombies are unaffected. Choosing whatever version > 1.12 fixes this bug would be a reasonable policy choice, but it's not technically necessary. Maybe we could target the next version of docker SIG Node wants to certify for Kubernetes, which I expect won't be until after the CRI because default. If we know that's 1.13, then I can add that here, otherwise we could use a placeholder until it's decided. |
||
`dockertools` implementation will not be changed. | ||
|
||
Linking this change to the CRI means that Kubernetes users who care to test such | ||
changes can test the combined changes at once. Users who do not care to test | ||
such changes will be insulated by Kubernetes not recommending Docker >= 1.12 | ||
until after switching to the CRI. | ||
|
||
Other changes that must be made to support this change: | ||
|
||
1. Ensure all containers restart if the infra container responsible for the | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure what this means. kubelet already restarts all containers if the sandbox/infra container is dead. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This means we should verify this behavior and maybe add a test. I added this item because a comment in another PR says differently. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is not a change in the code itself. Suggest making it clear by rewording it as "Adding a test to verify" There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Switched to "add a test", but maybe we can drop this entirely if it's inherent in the concept of the sandbox. |
||
PodSandbox dies. (Note: With Docker 1.12 if the source of the PID namespace | ||
dies all containers sharing that namespace are killed as well.) | ||
2. Modify the Infra container used by the Docker runtime to reap orphaned | ||
zombies ([#36853](https://pr.k8s.io/36853)). | ||
|
||
## Rollout Plan | ||
|
||
SIG Node is planning to switch to the CRI as a default in 1.6, at which point | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We agreed to have shared pid namespace by default for CRI v1alpha1 interface. In this case, if the user use docker 1.12 or above, enabling shared pid namespace means early exposure to the end users is good thing for us to catch any potential issues. On another side, I think there is a real concern that this is a behavior change for Kubernetes container runtime in production. Even other runtime implementations are having shared pid namespace by default today, but in production, the majority of Kubernetes users are not using such features, and we are not sure if this might cause any user issues. If any problem involving with shared pid namespace resulted in a CRI disabling might be too dramatic. Can we introduce a flag to disable the shared pid namespace instead of disabling entire CRI? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. SGTM, I'll update the proposal. |
||
users with Docker >= 1.12 will be able to test Shared namespaces. Switching | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. s/Shared namespaces/shared PID namespaces There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fixed, thanks. |
||
back to isolated PID namespaces will require disabling the CRI. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm slightly concerned about bundling the PID namespace with the CRI rollout, i.e., there are no way to disable shared PID namespaces in CRI. The majority of the CRI implementation should not change the behavior of the applications, but this (the shared pid namespace) does not fall into this category. This may affect the adoption of CRI in 1.6. @dchen1107, can we discuss more whether we want to bundle the two, as this is currently not included in the CRI rollout plan? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it's very safe to assume SIG Node will discuss this more. :) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The entire proposal revolves around the idea of bundling rollout, hence it deserves more discussion. |
||
|
||
At some point, say 1.7, SIG Node will remove support for disabling the CRI. | ||
After this point users must roll back to a previous version of Kubernetes or | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why rolling back? It seems to me that the users won't even choose to upgrade their kubernetes cluster if this is the case. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Choose not to upgrade in order to avoid the additional shared namespace? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, I didn't get why users would upgrade and then roll back for the shared pid namespace. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think they would roll back if they found their container images were incompatible with a new version of Kubernetes. |
||
Docker to achieve PID namespace isolation. This is acceptable because: | ||
|
||
* No one has been able to identify a concrete use case requiring isolated PID | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm....I am not sure what has been done to reach a broader audience. Could you elaborate? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No attempt has been made to reach a broader audience. I was referring to reviewers of this proposal. |
||
namespaces. | ||
* The lack of use cases means we can't justify the complexity required to make | ||
PID namespace type configurable. | ||
* Users will already be looking for issues due to the major version upgrade and | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What does "major version upgrade" mean? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 1.X -> 1.Y |
||
prepared for a rollback to the previous release. | ||
|
||
Alternatively, we could create a flag in the kublet to disable shared PID | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. s/kublet/kubelet There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fixed, thanks. |
||
namespace, but this wouldn't be especially useful to users of a hosted | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why? The maintainer of the hosted clusters can still choose to disable this feature.... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There would need a mechanism for user's to affect flags of the running kubelet. I'd be surprised if this was common. Or perhaps you mean they would choose to disable shared PID namespaces for all users on principle, which seems a stretch for a feature that so far no one opposes. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Having an escape hatch is better. The other option is to make a new patch release which can incur a lot of latency. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I can't comment on behalf of all the provider of hosted kubernetes clusters...
From past experiences, it's hard to gather feedback from users. Erring on the side of caution might be good. Is making shared pid namespace configurable for the transitional period really such a big hurdle? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nope, it's not a big hurdle. It can be done with an annotation and adding an option to the CRI. |
||
Kubernetes cluster. | ||
|
||
|
||
[1]: https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/ | ||
|
||
|
||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS --> | ||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/pod-pid-namespace.md?pixel)]() | ||
<!-- END MUNGE: GENERATED_ANALYTICS --> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think we should change an old design proposal. This should be documented somewhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the change should be reflected somewhere, and SIG Node pointed me here.
I guess the broader question is whether proposals are living documents, but that's out of scope for this PR. If you want me to make the change somewhere else please suggest a place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do update old proposals to reflect new features or changes to design originally proposed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's impossible to keep all the proposals up-to-date though. Adding a separate spec for CRI anyway has to be done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, since CRI v1alpha1 is not out yet, and shared pid namespace is what we agreed for v1alpha1, updating the proposal might be the least effort we could do?