-
Notifications
You must be signed in to change notification settings - Fork 5.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Propose rollout for Docker shared PID namespace #207
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,77 @@ | ||
# Shared PID Namespace | ||
|
||
Pods share namespaces where possible, but a requirement for sharing the PID | ||
namespace has not been defined due to lack of support in Docker. Docker began | ||
supporting a shared PID namespace in 1.12, and other Kubernetes runtimes (rkt, | ||
cri-o, hyper) have already implemented a shared PID namespace. | ||
|
||
This proposal defines a shared PID namespace as a requirement of the Container | ||
Runtime Interface and links its rollout in Docker to that of the CRI. | ||
|
||
## Motivation | ||
|
||
Sharing a PID namespace between containers in a pod is discussed in | ||
[#1615](https://issues.k8s.io/1615), and enables: | ||
|
||
1. signaling between containers, which is useful for side cars (e.g. for | ||
signaling a daemon process after rotating logs). | ||
2. easier troubleshooting of pods. | ||
3. addressing [Docker's zombie problem][1] by reaping orphaned zombies in the | ||
infra container. | ||
|
||
## Goals and Non-Goals | ||
|
||
Goals include: | ||
- Changing default behavior in the Docker runtime as implemented by the CRI | ||
- Making Docker behavior compatible with the other Kubernetes runtimes | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is the purpose of this goal? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1. I had questions about the goals. The first one sounds more like an approach to achieve a goal. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My primary goal is to enable the shared PID namespace in the docker runtime. That's what I want to happen. The secondary goal is to make this consistent across runtimes (where feasible). A shared PID namespace is consistent with the pod abstraction. That's the way I think things should be, but it wasn't my original goal. It was suggested by a reviewer. |
||
|
||
Non-goals include: | ||
- Creating an init solution that works for all runtimes | ||
- Supporting isolated PID namespace indefinitely | ||
|
||
## Modification to the Docker Runtime | ||
|
||
We will modify the Docker implementation of the CRI to use a shared PID | ||
namespace when running with a version of Docker >= 1.12. The legacy | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Didn't v1.12 have issues with reparenting? Should we wait for v1.13? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, 1.12 reparents orphans incorrectly and they are reaped outside of the container. That's a no-op in our use case since we don't really care what it is that reaps the zombies. Regular zombies are unaffected. Choosing whatever version > 1.12 fixes this bug would be a reasonable policy choice, but it's not technically necessary. Maybe we could target the next version of docker SIG Node wants to certify for Kubernetes, which I expect won't be until after the CRI because default. If we know that's 1.13, then I can add that here, otherwise we could use a placeholder until it's decided. |
||
`dockertools` implementation will not be changed. | ||
|
||
Linking this change to the CRI means that Kubernetes users who care to test such | ||
changes can test the combined changes at once. Users who do not care to test | ||
such changes will be insulated by Kubernetes not recommending Docker >= 1.12 | ||
until after switching to the CRI. | ||
|
||
Other changes that must be made to support this change: | ||
|
||
1. Add a test to verify all containers restart if the infra container | ||
responsible for the PodSandbox dies. (Note: With Docker 1.12 if the source | ||
of the PID namespace dies all containers sharing that namespace are killed | ||
as well.) | ||
2. Modify the Infra container used by the Docker runtime to reap orphaned | ||
zombies ([#36853](https://pr.k8s.io/36853)). | ||
|
||
## Rollout Plan | ||
|
||
SIG Node is planning to switch to the CRI as a default in 1.6, at which point | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We agreed to have shared pid namespace by default for CRI v1alpha1 interface. In this case, if the user use docker 1.12 or above, enabling shared pid namespace means early exposure to the end users is good thing for us to catch any potential issues. On another side, I think there is a real concern that this is a behavior change for Kubernetes container runtime in production. Even other runtime implementations are having shared pid namespace by default today, but in production, the majority of Kubernetes users are not using such features, and we are not sure if this might cause any user issues. If any problem involving with shared pid namespace resulted in a CRI disabling might be too dramatic. Can we introduce a flag to disable the shared pid namespace instead of disabling entire CRI? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. SGTM, I'll update the proposal. |
||
users with Docker >= 1.12 will receive a shared PID namespace by default. | ||
Cluster administrators will be able to disable this behavior by providing a flag | ||
to the kubelet which will cause the dockershim to revert to previous behavior. | ||
|
||
The ability to disable shared PID namespaces is intended as a way to roll back | ||
to prior behavior in the event of unforeseen problems. It won't be possible to | ||
configure the behavior per-pod. We believe this is acceptable because: | ||
|
||
* We have not identified a concrete use case requiring isolated PID namespaces. | ||
* Making PID namespace configurable requires changing the CRI, which we would | ||
like to avoid since there are no use cases. | ||
|
||
In a future release, SIG Node will recommend docker >= 1.12. Unless a compelling | ||
use case for isolated PID namespaces is discovered, we will remove the ability | ||
to disable the shared PID namespace in the subsequent release. | ||
|
||
|
||
[1]: https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/ | ||
|
||
|
||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS --> | ||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/pod-pid-namespace.md?pixel)]() | ||
<!-- END MUNGE: GENERATED_ANALYTICS --> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think we should change an old design proposal. This should be documented somewhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the change should be reflected somewhere, and SIG Node pointed me here.
I guess the broader question is whether proposals are living documents, but that's out of scope for this PR. If you want me to make the change somewhere else please suggest a place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do update old proposals to reflect new features or changes to design originally proposed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's impossible to keep all the proposals up-to-date though. Adding a separate spec for CRI anyway has to be done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, since CRI v1alpha1 is not out yet, and shared pid namespace is what we agreed for v1alpha1, updating the proposal might be the least effort we could do?