@kolyshkin
Contributor

@kolyshkin kolyshkin commented Jul 9, 2021

Based on (and currently includes) #3081 and #3067. Keeping this as a draft until those two are merged.

  1. Fix the inability to freeze the container using its Set() method (with r.Freezer set to Frozen). Add a test.
  2. Avoid unnecessary freeze/thaw from systemd v1 driver. Add a test.

Review commit-by-commit. Please see individual commits for details.

1.0 backport: #3093

@kolyshkin
Contributor Author

As suggested in #3065 (comment), this can be further improved to avoid the freeze entirely if we're sure systemd won't set the deny-all device rule.

Looking at
https://github.com/systemd/systemd/blob/dd376574fd62c6fcf446ee500ca85558a71d645e/src/core/cgroup.c#L1130-L1133
it seems that freeze can be skipped if DevicePolicy is auto (which is the default) AND DeviceAllow list is empty.

The problem is that, because of the SkipDevices use in update.go, we have to query the current/existing properties first, and only skip freezing if all of the following are true:

  1. Current properties have DevicePolicy=auto
  2. Current properties have empty DeviceAllow list
  3. SkipDevices is true.
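
The three conditions above can be sketched as a single predicate. This is only an illustration: `unitDeviceState` and `canSkipFreeze` are hypothetical names, and `DeviceAllow` is simplified to a slice of strings rather than the structured entries systemd reports over D-Bus.

```go
package main

import "fmt"

// unitDeviceState is a hypothetical snapshot of the device-related
// properties currently set on a systemd unit.
type unitDeviceState struct {
	DevicePolicy string   // "auto", "closed" or "strict"
	DeviceAllow  []string // simplified stand-in for the real entries
}

// canSkipFreeze reports whether the freeze around a property update can be
// skipped: the caller does not manage devices (SkipDevices) and the unit
// currently has the default policy with no allow list.
func canSkipFreeze(cur unitDeviceState, skipDevices bool) bool {
	return skipDevices && cur.DevicePolicy == "auto" && len(cur.DeviceAllow) == 0
}

func main() {
	pod := unitDeviceState{DevicePolicy: "auto"}
	fmt.Println(canSkipFreeze(pod, true))  // all three conditions hold
	fmt.Println(canSkipFreeze(pod, false)) // devices are managed, so freeze
}
```

Any single condition failing means the freeze is kept, which matches the conservative default described above.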

@odinuge
Contributor

odinuge commented Jul 9, 2021

Together with #3081 (review), I think this PR is pretty neat and good.

@kolyshkin in #3065 (comment)

So, one way to fix this would be to have code that tells whether systemd is going to do deny-all on SetUnitProperties, and skip the freeze if it won't.

If we want to do that for container as well, yeah. Although I don't know how often containers are allowed access to all devices?

For the k8s usage for managing a control group, a simple flag is ok though, but that is up to you and the other runc maintainers.

This is a kludge to a kludge and it's dependent upon systemd internals (which may change) but in these circumstances this may be the best way to proceed.

Yup, it becomes quite a mess... 🙃

The best fix is probably to just start using cgroup v2 instead. crun has opted to mostly skip dealing with systemd and use a delegated scope for cgroup v2, and that works kinda ok I guess, but it's not perfect. Not sure how they deal with cgroup v1 though.

@kolyshkin kolyshkin changed the title from "libct/cg/sd/v1: Set: optimize freeze" to "cgroups: Set: fix freeze, avoid unnecessary freeze from systemd v1" Jul 11, 2021
@kolyshkin kolyshkin force-pushed the freeze-less branch 3 times, most recently from 0eae769 to 5f5d5c4 Compare July 11, 2021 23:46
@kolyshkin
Contributor Author

If we want to do that for container as well, yeah. Although I don't know how often containers are allowed access to all devices?

A container run via docker run --privileged is allowed access to all devices. I guess podman and some other runtimes have the same or a similar option.

For the k8s usage for managing a control group, a simple flag is ok though, but that is up to you and the other runc maintainers.

Problem is, from the libcontainer/cgroup we do not know if it's a container or a pod cgroup. Even if it's a pod, it can still have some device access rules (kubernetes doesn't currently set any, but this may change), so having something like Pod: true or SkipFreeze: true is a no go.

The only thing I don't like about the current solution (as implemented by the "libct/cg/sd/v1: Set: avoid unnecessary freeze/thaw" commit in this PR) is that it makes two dbus calls to find out whether to skip the freeze, but as we don't have/know the current state, this is necessary.

@kolyshkin kolyshkin force-pushed the freeze-less branch 2 times, most recently from fdc81c3 to b1c14c7 Compare July 12, 2021 09:12
@cyphar
Member

cyphar commented Jul 13, 2021

@kolyshkin

Looking at
https://github.com/systemd/systemd/blob/dd376574fd62c6fcf446ee500ca85558a71d645e/src/core/cgroup.c#L1130-L1133
it seems that freeze can be skipped if DevicePolicy is auto (which is the default) AND DeviceAllow list is empty.

In that case the device rules would be allow-all, wouldn't they?

                if (c->device_allow || policy != CGROUP_DEVICE_POLICY_AUTO)
                        r = cg_set_attribute("devices", path, "devices.deny", "a");
                else
                        r = cg_set_attribute("devices", path, "devices.allow", "a");

If you have no device_allow list, and the device policy is auto then devices.allow gets set to a which will mean that a container could access devices it shouldn't during the race window (which is worse than the -EACCES inconvenience).

EDIT: Ah, your patch only does the skip if we have SkipDevices and the slice has been configured to be allow-all. Is that a common scenario?
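
For readers more at home in Go, the quoted systemd branch can be modeled roughly as follows. This is a sketch, not systemd's code: `initialDeviceWrite` and its parameters are hypothetical names, and the policy is reduced to a plain string.

```go
package main

import "fmt"

// initialDeviceWrite models the quoted branch: it returns the cgroup v1
// devices file to which "a" is written before any specific rules are applied.
// A non-empty allow list or a non-auto policy means deny-all first;
// otherwise everything is allowed.
func initialDeviceWrite(deviceAllowLen int, policy string) string {
	if deviceAllowLen > 0 || policy != "auto" {
		return "devices.deny"
	}
	return "devices.allow"
}

func main() {
	fmt.Println(initialDeviceWrite(0, "auto"))   // allow-all case
	fmt.Println(initialDeviceWrite(2, "auto"))   // allow list present
	fmt.Println(initialDeviceWrite(0, "strict")) // non-auto policy
}
```

Only the first case avoids the transient deny-all write, which is why the skip is limited to units that are already allow-all.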

@kolyshkin
Contributor Author

EDIT: Ah, your patch only does the skip if we have SkipDevices and the slice has been configured to be allow-all. Is that a common scenario?

@cyphar Yes, at least for kubernetes, which uses libcontainer/cgroup to configure pod cgroups. As the pod cgroup is a parent of a few containers' cgroups, we unnecessarily freeze all those containers on update, which is definitely not the way to go.

The pod tests that we've been adding recently try to emulate those kubernetes usage scenarios.

@kolyshkin kolyshkin marked this pull request as ready for review July 13, 2021 03:07
@kolyshkin kolyshkin added this to the 1.1.0 milestone Jul 13, 2021
@kolyshkin kolyshkin mentioned this pull request Jul 15, 2021
@kolyshkin kolyshkin added the backport/1.0-todo A PR in main branch which needs to be backported to release-1.0 label Jul 15, 2021
m.Freeze method changes m.cgroups.Resources.Freezer field, which should
not be done while we're temporarily freezing the cgroup in Set. If this
field is changed, and r == m.cgroups.Resources (as it often happens),
this results in inability to freeze the container using Set().

To fix, add and use a method which does not change r.Freezer field.

A test case for the bug will be added separately.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
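
A minimal model of the aliasing problem described in this commit message, under loud assumptions: `manager`, `doFreeze`, and `set` are hypothetical stand-ins, and the `kernel` field replaces the real freezer state file. It only illustrates why the temporary freeze must not write to the Freezer field.

```go
package main

import "fmt"

type FreezerState string

const (
	Undefined FreezerState = ""
	Frozen    FreezerState = "FROZEN"
	Thawed    FreezerState = "THAWED"
)

type Resources struct{ Freezer FreezerState }

type manager struct {
	resources *Resources   // the manager's own configuration
	kernel    FreezerState // stand-in for the real freezer state
}

// Freeze changes the state and records it in the configuration.
func (m *manager) Freeze(s FreezerState) {
	m.kernel = s
	m.resources.Freezer = s
}

// doFreeze changes the state without touching the configuration.
func (m *manager) doFreeze(s FreezerState) { m.kernel = s }

// set models Set: temporary freeze, update, temporary thaw, then apply
// the state requested in r.
func (m *manager) set(r *Resources, freeze func(FreezerState)) {
	freeze(Frozen)
	// ... unit properties would be updated here ...
	freeze(Thawed)
	if r.Freezer != Undefined {
		m.Freeze(r.Freezer)
	}
}

func main() {
	buggy := &manager{resources: &Resources{Freezer: Frozen}, kernel: Thawed}
	buggy.set(buggy.resources, buggy.Freeze) // r aliases the manager's config
	fmt.Println("with Freeze:  ", buggy.kernel)

	fixed := &manager{resources: &Resources{Freezer: Frozen}, kernel: Thawed}
	fixed.set(fixed.resources, fixed.doFreeze)
	fmt.Println("with doFreeze:", fixed.kernel)
}
```

In the first case the temporary thaw overwrites the requested Frozen state before it is applied, so the model ends up thawed; in the second the request survives.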
The t.Name() usage in libcontainer/integration prevented subtests
from being used, since for a subtest it returns a string containing "/",
which can't be used to name a container.

Fix this by replacing slashes with underscores where appropriate.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
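
The replacement described above might look like this. `containerName` is a hypothetical helper name; the commit may apply the substitution inline instead.

```go
package main

import (
	"fmt"
	"strings"
)

// containerName turns a (sub)test name such as "TestFreeze/systemd" into a
// string without slashes, so it can be used as a container name.
func containerName(testName string) string {
	return strings.ReplaceAll(testName, "/", "_")
}

func main() {
	fmt.Println(containerName("TestFreeze/systemd")) // prints "TestFreeze_systemd"
}
```

In a real test the argument would come from t.Name().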
In addition to freezing and thawing a container via Pause/Resume,
there is a way to also do so via Set.

This way was broken though and is being fixed by a few preceding
commits. The test is added to make sure this is fixed and won't regress.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Introduce freezeBeforeSet, which contains the logic of figuring out
whether we need to freeze/thaw around setting systemd unit properties.

In particular, if SkipDevices is set, and the current unit properties
allow all devices, there is no need to freeze and thaw, as systemd
won't write any device rules in this case.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This was initially added by commit 3e5c199 because Set (with
r.Freezer = Frozen) was not able to freeze a container.

Now (see a few previous commits) Set can do the freeze, so the explicit
Freeze is no longer needed.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
TestPodSkipDevicesUpdate checks that updating a pod having SkipDevices: true
does not result in spurious "permission denied" errors in a container
running under the pod. The test is somewhat similar in nature to the
@test "update devices [minimal transition rules]" in tests/integration,
but uses a pod.

This tests the validity of freezeBeforeSet in v1.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin
Contributor Author

Addressed @cyphar review comments, here are the changes I've made:

diff --git a/libcontainer/cgroups/systemd/v1.go b/libcontainer/cgroups/systemd/v1.go
index 2a567bb8..1a8e1e3c 100644
--- a/libcontainer/cgroups/systemd/v1.go
+++ b/libcontainer/cgroups/systemd/v1.go
@@ -341,7 +341,7 @@ func (m *legacyManager) GetStats() (*cgroups.Stats, error) {
 // (unlike our fs driver, they will happily write deny-all rules to running
 // containers). So we have to freeze the container to avoid the container get
 // an occasional "permission denied" error.
-func (m *legacyManager) freezeBeforeSet(unitName string, r *configs.Resources) (needsFreeze, needsThaw bool, Err error) {
+func (m *legacyManager) freezeBeforeSet(unitName string, r *configs.Resources) (needsFreeze, needsThaw bool, err error) {
        // Special case for SkipDevices, as used by Kubernetes to create pod
        // cgroups with allow-all device policy).
        if r.SkipDevices {
@@ -352,10 +352,13 @@ func (m *legacyManager) freezeBeforeSet(unitName string, r *configs.Resources) (
                // Interestingly, (1) and (2) are the same here because
                // a non-existent unit returns default properties,
                // and settings in (2) are the defaults.
-               devPolicy, err := getUnitProperty(m.dbus, unitName, "DevicePolicy")
-               if err == nil && devPolicy.Value == dbus.MakeVariant("auto") {
-                       devAllow, err := getUnitProperty(m.dbus, unitName, "DeviceAllow")
-                       if err == nil && devAllow.Value == dbus.MakeVariant([]deviceAllowEntry{}) {
+               //
+               // Do not return errors from getUnitProperty, as they alone
+               // should not prevent Set from working.
+               devPolicy, e := getUnitProperty(m.dbus, unitName, "DevicePolicy")
+               if e == nil && devPolicy.Value == dbus.MakeVariant("auto") {
+                       devAllow, e := getUnitProperty(m.dbus, unitName, "DeviceAllow")
+                       if e == nil && devAllow.Value == dbus.MakeVariant([]deviceAllowEntry{}) {
                                needsFreeze = false
                                needsThaw = false
                                return
@@ -367,8 +370,8 @@ func (m *legacyManager) freezeBeforeSet(unitName string, r *configs.Resources) (
        needsThaw = true
 
        // Check the current freezer state.
-       freezerState, Err := m.GetFreezerState()
-       if Err != nil {
+       freezerState, err := m.GetFreezerState()
+       if err != nil {
                return
        }
        if freezerState == configs.Frozen {
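
The variable naming in the diff above relies on Go's scoping rules for named results. A self-contained sketch of the same pattern, with hypothetical `probe`, `state`, and `decide` functions standing in for the D-Bus query, GetFreezerState, and freezeBeforeSet:

```go
package main

import (
	"errors"
	"fmt"
)

var errProbe = errors.New("probe failed")
var errState = errors.New("state unavailable")

func probe(fail bool) (string, error) {
	if fail {
		return "", errProbe
	}
	return "auto", nil
}

func state(fail bool) (string, error) {
	if fail {
		return "", errState
	}
	return "THAWED", nil
}

func decide(skip, probeFails, stateFails bool) (needsFreeze bool, err error) {
	if skip {
		// e is local to this block, so a failed probe is never reported
		// through the named result err; we just fall through.
		policy, e := probe(probeFails)
		if e == nil && policy == "auto" {
			return // needsFreeze == false, err == nil
		}
	}
	needsFreeze = true
	// At the top level of the function body, := declares st but reuses
	// the named result err, so the bare return below reports the error.
	st, err := state(stateFails)
	if err != nil {
		return
	}
	if st == "FROZEN" {
		needsFreeze = false
	}
	return
}

func main() {
	fmt.Println(decide(true, false, false))
	fmt.Println(decide(true, true, false))
	fmt.Println(decide(false, false, true))
}
```

Declaring `err` with := inside the nested block instead would shadow the named result, and the compiler rejects a bare return in that scope, which is one reason to pick a different name such as `e` there.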

@kolyshkin
Contributor Author

CI went south :(

Could not connect to azure.archive.ubuntu.com:80 (52.252.75.106), connection timed out

Member

@cyphar cyphar left a comment

LGTM.

@kolyshkin kolyshkin requested review from AkihiroSuda and mrunalp July 15, 2021 21:14
@mrunalp mrunalp merged commit 2749f1f into opencontainers:master Jul 15, 2021
@kolyshkin kolyshkin added backport/1.0-done A PR in main branch which has been backported to release-1.0 and removed backport/1.0-todo A PR in main branch which needs to be backported to release-1.0 labels Nov 30, 2021

Labels

area/cgroupv1 area/ci backport/1.0-done A PR in main branch which has been backported to release-1.0

4 participants