docs: Add background information on mutation (open-policy-agent#2387)
Signed-off-by: Max Smythe <smythe@google.com>
Co-authored-by: Sertaç Özercan <852750+sozercan@users.noreply.github.com>
maxsmythe and sozercan authored Feb 17, 2023
1 parent 476269f commit bf7b466
Showing 5 changed files with 155 additions and 0 deletions.
3 changes: 3 additions & 0 deletions docs/design/README.md
@@ -48,6 +48,9 @@ Generally, design docs are on Google docs:
These docs may not be design docs in themselves, but seek to influence design
decisions

* [Mutation Convergence Proof](https://docs.google.com/document/d/1mCHHhBABzUwP8FtUuEf_B-FX-HHgh_k4bwZcGUYm7Sw/edit#heading=h.j5thjfnqybpn)
* This looks at the elements of mutation that allow us to assert that any combination of mutators should converge
and should serve as a guide when considering changes in behavior to mutation.
* [Mutation Dynamics](https://docs.google.com/document/d/1WKj-9xYYVO9IckQVZHYVFCziRVS6zxCfvJxWbkbLcOM/view#)
* A dissection of the mutation problem space, attempting to break down the
salient behavioral elements and figure out what's needed for a
70 changes: 70 additions & 0 deletions website/docs/mutation-background.md
@@ -0,0 +1,70 @@
---
id: mutation-background
title: Background Information on Mutation
---

Mutating webhooks in Kubernetes are a nuanced concept with many gotchas. This
page explores some of the background of mutating webhooks in Kubernetes, their
operational and syntactical implications, and how Gatekeeper is trying to provide
value on top of the basic Kubernetes webhook ecosystem.

# Mutation Chaining

A key difference between mutating webhooks and validating webhooks is that
mutating webhooks are called in series, whereas validating webhooks are called in parallel.

This makes sense, since validating webhooks can only approve or deny (or warn) for a given
input and have no other side effects. This means that the result of one validating webhook
cannot impact the result of any other validating webhook, and it's trivial to aggregate
all of the validation responses as they come in: reject the request if at least one webhook denies it,
and return all warnings and denial messages encountered to the user.
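The aggregation rule described above does not depend on the order in which responses arrive, which is what makes parallel calls safe. A minimal sketch, assuming a simplified, hypothetical response shape `{ allowed, message, warnings }` rather than the real `AdmissionReview` API:

```javascript
// Hypothetical, simplified validator responses: { allowed, message, warnings }.
// Real admission responses use the AdmissionReview API; this is only a model.
function aggregateValidation(responses) {
  const denials = responses.filter((r) => !r.allowed).map((r) => r.message);
  const warnings = responses.flatMap((r) => r.warnings || []);
  // Reject if at least one validator denied; surface everything to the user.
  return { allowed: denials.length === 0, denials, warnings };
}

// Every validator receives the same, unmodified input, so they can run in any order.
const pod = { metadata: { name: "web" }, spec: { containers: [{ name: "app", image: "nginx" }] } };
const validators = [
  (_pod) => ({ allowed: true, warnings: ["no resource limits set"] }),
  (p) => ({
    allowed: !p.spec.containers.some((c) => !c.image.includes(":")),
    message: "images must be pinned to a tag",
  }),
];

const result = aggregateValidation(validators.map((v) => v(pod)));
console.log(result);
```

Because each validator only reads `pod`, shuffling the `validators` array cannot change `result.allowed`.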

Mutation, however, changes what the input resource looks like. This means that the output
of one mutating webhook can have an effect on the output of another mutating webhook.
For example, if one mutating webhook adds a sidecar container, and another webhook sets
`imagePullPolicy` to `Always`, then the new sidecar container means that this second webhook
has one more container to mutate.
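The sidecar example can be modeled with plain functions. This sketch treats each mutator as a hypothetical pure function from a pod object to a new pod object (real webhooks return JSON patches in an admission response) and shows that the order of the chain changes the result:

```javascript
// Two hypothetical mutators, modeled as pure functions from pod -> pod.
// Real mutating webhooks return JSON patches; this sketch skips that detail.
function addSidecar(pod) {
  const containers = [...pod.spec.containers, { name: "proxy", image: "proxy:1.0" }];
  return { ...pod, spec: { ...pod.spec, containers } };
}

function setPullPolicyAlways(pod) {
  const containers = pod.spec.containers.map((c) => ({ ...c, imagePullPolicy: "Always" }));
  return { ...pod, spec: { ...pod.spec, containers } };
}

// Mutating webhooks run in series: each one receives the previous one's output.
function applyInSeries(mutators, pod) {
  return mutators.reduce((current, mutate) => mutate(current), pod);
}

const pod = { spec: { containers: [{ name: "app", image: "app:1.0" }] } };

const sidecarFirst = applyInSeries([addSidecar, setPullPolicyAlways], pod);
const policyFirst = applyInSeries([setPullPolicyAlways, addSidecar], pod);

// With the sidecar added first, both containers get imagePullPolicy: Always.
console.log(sidecarFirst.spec.containers.map((c) => c.imagePullPolicy)); // [ 'Always', 'Always' ]
// In the opposite order, the sidecar is added after the policy was set and is missed.
console.log(policyFirst.spec.containers.map((c) => c.imagePullPolicy)); // [ 'Always', undefined ]
```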

The biggest practical issue with this call-in-sequence behavior is latency. Validating webhooks,
which are called in parallel, have a total latency equal to that of the slowest-responding webhook.
Mutating webhooks have a total latency equal to the sum of the latencies of all mutating webhooks that are called. This
makes mutation much more latency-sensitive.
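A back-of-the-envelope model of the difference, using made-up latencies and ignoring network overhead, timeouts, and reinvocation:

```javascript
// Hypothetical per-webhook response times in milliseconds.
const webhookLatenciesMs = [40, 120, 75];

// Validating webhooks run in parallel: the request waits for the slowest one.
const validatingLatencyMs = Math.max(...webhookLatenciesMs);

// Mutating webhooks run in series: each one waits for the previous one to finish.
const mutatingLatencyMs = webhookLatenciesMs.reduce((sum, ms) => sum + ms, 0);

console.log(validatingLatencyMs); // 120
console.log(mutatingLatencyMs); // 235
```

Adding a fourth webhook can only raise the mutating total, while it raises the validating total only if the new webhook is the slowest.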

This can be particularly harmful for something like external data, where a webhook reaches out to
a secondary service to gather necessary information. This extra network hop can be costly,
especially if the number of external calls is not minimized. Gatekeeper translates external data
references scattered across multiple mutators into a single batched call per external data provider,
and calls each provider in parallel, minimizing latency.
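The batching idea can be sketched as grouping all external data references by provider before making any calls. The data shape, provider names, and keys below are hypothetical and do not reflect Gatekeeper's internal implementation or its external data API:

```javascript
// Hypothetical shape: each mutator lists the external data it needs as
// { provider, key } pairs. This is not Gatekeeper's internal data model.
const mutatorReferences = [
  { mutator: "add-image-digest", refs: [{ provider: "registry", key: "nginx:1.25" }, { provider: "registry", key: "redis:7" }] },
  { mutator: "verify-owner", refs: [{ provider: "cmdb", key: "team-a" }] },
  { mutator: "pin-sidecar", refs: [{ provider: "registry", key: "nginx:1.25" }] },
];

// Group every reference by provider and drop duplicate keys, so that each
// provider is asked once for all the keys it needs.
function batchByProvider(mutators) {
  const batches = new Map();
  for (const { refs } of mutators) {
    for (const { provider, key } of refs) {
      if (!batches.has(provider)) batches.set(provider, new Set());
      batches.get(provider).add(key);
    }
  }
  return batches;
}

const batches = batchByProvider(mutatorReferences);
const naiveCalls = mutatorReferences.reduce((n, m) => n + m.refs.length, 0);

console.log(naiveCalls); // 4 calls if every reference is resolved separately
console.log(batches.size); // 2 batched calls, one per provider
console.log([...batches.get("registry")]); // [ 'nginx:1.25', 'redis:7' ]

// The batched calls could then be issued concurrently, e.g. with a hypothetical
// fetchFromProvider(provider, keys) function:
//   await Promise.all([...batches].map(([p, keys]) => fetchFromProvider(p, [...keys])));
```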

# Mutation Recursion

Not only are mutators chained, but they recurse as well. This is not only due to Kubernetes'
[reinvocation policy](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#reinvocation-policy),
but also due to the nature of the Kubernetes control plane itself, since controllers may modify resources periodically.
Whether because of the reinvocation policy or because of control plane behavior, mutators are likely to
operate on their own output. This carries some operational risk. Consider a mutating webhook that prepends a registry hostname to a Docker
image reference (e.g. prepends `gcr.io/`). If written naively, each successive mutation would add another prefix, leading to results
like `gcr.io/gcr.io/gcr.io/my-favorite-image:latest`. Because of this, Kubernetes requires mutation webhooks to be
[idempotent](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#idempotence).
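The difference between the naive and the idempotent version of this mutator can be shown with plain string functions (a simplification: a real webhook would patch every container image in the pod):

```javascript
// Naive mutator: always prepends the registry, so re-running it keeps stacking prefixes.
const naivePrefix = (image) => "gcr.io/" + image;

// Idempotent mutator: only prepends when the prefix is missing, so f(f(x)) === f(x).
const idempotentPrefix = (image) => (image.startsWith("gcr.io/") ? image : "gcr.io/" + image);

// Simulate the same mutator seeing its own output several times
// (reinvocation, or a controller updating the object later).
function applyRepeatedly(mutate, image, times) {
  let result = image;
  for (let i = 0; i < times; i++) result = mutate(result);
  return result;
}

console.log(applyRepeatedly(naivePrefix, "my-favorite-image:latest", 3));
// gcr.io/gcr.io/gcr.io/my-favorite-image:latest
console.log(applyRepeatedly(idempotentPrefix, "my-favorite-image:latest", 3));
// gcr.io/my-favorite-image:latest
```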

This is a good idea, but there is one problem: webhooks that are idempotent in isolation may not be idempotent as a group.
Let's take the above mutator and make it idempotent. We'll give it the following behavior: "if an image reference does
not start with `gcr.io/`, prepend `gcr.io/`". This certainly makes the webhook idempotent. But what if there is another
team working on the cluster, and they want their own image mutation rule: "if an image reference for the `billing`
namespace does not start with `billing.company.com/`, prepend `billing.company.com/`". Each of these webhooks would
be idempotent in isolation, but when chained together you'll see results like
`billing.company.com/gcr.io/billing.company.com/gcr.io/my-favorite-image:latest`.
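Composing two such rules reproduces the result above. The sketch assumes the pod is in the `billing` namespace so that both rules apply, and models one round of admission as calling the two mutators in a fixed order:

```javascript
// Build a mutator that prepends `prefix` only if it is missing.
// On its own, each such mutator is idempotent.
const prependIfMissing = (prefix) => (image) => (image.startsWith(prefix) ? image : prefix + image);

const platformTeam = prependIfMissing("gcr.io/");
const billingTeam = prependIfMissing("billing.company.com/"); // assume the pod is in `billing`

// One pass of the chain: each mutator in turn, as the API server would call them.
const runChain = (image) => billingTeam(platformTeam(image));

const once = runChain("my-favorite-image:latest");
const twice = runChain(once);

console.log(platformTeam(platformTeam("app")) === platformTeam("app")); // true
console.log(once); // billing.company.com/gcr.io/my-favorite-image:latest
console.log(twice); // billing.company.com/gcr.io/billing.company.com/gcr.io/my-favorite-image:latest
console.log(twice === once); // false: the chain as a whole is not idempotent
```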

At small scales, with small teams, it's relatively easy to ensure that mutations don't interfere with each other,
but at larger scales, or when multiple non-communicating parties have their own rules that they want to set, it
can be hard or even impossible to maintain this requirement of "global idempotence".

Gatekeeper attempts to make this easier by designing mutation in such a way that "global idempotence" is an
emergent property of all mutators, no matter how they are configured. Here is a [proof](https://docs.google.com/document/d/1mCHHhBABzUwP8FtUuEf_B-FX-HHgh_k4bwZcGUYm7Sw/edit#heading=h.j5thjfnqybpn), where we attempt to show that our language
for expressing mutation always converges on a stable result.
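One way to see what "global idempotence" asks for is to re-run the whole chain until its output stops changing. The sketch below contrasts the prefix rules above with hypothetical rules that *set* the registry (replacing any existing one) instead of prepending. It only illustrates the fixed-point idea; it is not Gatekeeper's mutation language or the linked proof, and the registry detection is a simplified version of Docker's image-reference heuristic:

```javascript
// Simplified registry detection (roughly Docker's heuristic): the part before the
// first "/" is a registry host if it contains "." or ":" or is "localhost".
function stripRegistry(image) {
  const slash = image.indexOf("/");
  if (slash === -1) return image;
  const first = image.slice(0, slash);
  const isHost = first.includes(".") || first.includes(":") || first === "localhost";
  return isHost ? image.slice(slash + 1) : image;
}

// "Set" semantics: the result does not depend on which registry was there before.
const setRegistry = (registry) => (image) => `${registry}/${stripRegistry(image)}`;
// "Prepend if missing" semantics, as in the earlier example.
const prependIfMissing = (prefix) => (image) => (image.startsWith(prefix) ? image : prefix + image);

// Re-apply the whole chain until its output stops changing, or give up.
function converge(mutators, input, maxRounds = 10) {
  let current = input;
  for (let round = 1; round <= maxRounds; round++) {
    const next = mutators.reduce((value, mutate) => mutate(value), current);
    if (next === current) return { converged: true, rounds: round, result: current };
    current = next;
  }
  return { converged: false, rounds: maxRounds, result: current };
}

const image = "my-favorite-image:latest";
const setRules = [setRegistry("gcr.io"), setRegistry("billing.company.com")];
const prependRules = [prependIfMissing("gcr.io/"), prependIfMissing("billing.company.com/")];

const settled = converge(setRules, image);
console.log(settled.converged, settled.rounds); // true 2
console.log(settled.result); // billing.company.com/my-favorite-image:latest
console.log(converge(prependRules, image, 5).converged); // false
```

Convergence is not the same as agreement: with set semantics the last rule in the chain wins, so the two teams' rules still conflict, but the image reference stops growing.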

# Summary

By using Gatekeeper for mutation, it is possible to reduce the number of mutating webhooks, which should reduce admission
latency. It should also help prevent mutation policies that are managed independently of one another from violating the
Kubernetes API server's requirement of idempotence.
6 changes: 6 additions & 0 deletions website/sidebars.js
@@ -45,6 +45,12 @@ module.exports = {
'performance-tuning'
],
},
{
type: 'category',
label: 'Concepts',
collapsed: false,
items: ['mutation-background']
},
{
type: 'category',
label: 'Contributing',
70 changes: 70 additions & 0 deletions website/versioned_docs/version-v3.10.x/mutation-background.md
@@ -0,0 +1,70 @@
---
id: mutation-background
title: Background Information on Mutation
---

Mutating webhooks in Kubernetes are a nuanced concept with many gotchas. This
page explores some of the background of mutating webhooks in Kubernetes, their
operational and syntactical implications, and how Gatekeeper is trying to provide
value on top of the basic Kubernetes webhook ecosystem.

# Mutation Chaining

A key difference between mutating webhooks and validating webhooks is that
mutating webhooks are called in series, whereas validating webhooks are called in parallel.

This makes sense, since validating webhooks can only approve or deny (or warn) for a given
input and have no other side effects. This means that the result of one validating webhook
cannot impact the result of any other validating webhook, and it's trivial to aggregate
all of the validation responses as they come in: reject the request if at least one webhook denies it,
and return all warnings and denial messages encountered to the user.

Mutation, however, changes what the input resource looks like. This means that the output
of one mutating webhook can have an effect on the output of another mutating webhook.
For example, if one mutating webhook adds a sidecar container, and another webhook sets
`imagePullPolicy` to `Always`, then the new sidecar container means that this second webhook
has one more container to mutate.

The biggest practical issue with this call-in-sequence behavior is latency. Validating webhooks,
which are called in parallel, have a total latency equal to that of the slowest-responding webhook.
Mutating webhooks have a total latency equal to the sum of the latencies of all mutating webhooks that are called. This
makes mutation much more latency-sensitive.

This can be particularly harmful for something like external data, where a webhook reaches out to
a secondary service to gather necessary information. This extra network hop can be costly,
especially if the number of external calls is not minimized. Gatekeeper translates external data
references scattered across multiple mutators into a single batched call per external data provider,
and calls each provider in parallel, minimizing latency.

# Mutation Recursion

Not only are mutators chained, but they recurse as well. This is not only due to Kubernetes'
[reinvocation policy](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#reinvocation-policy),
but also due to the nature of the Kubernetes control plane itself, since controllers may modify resources periodically.
Whether because of the reinvocation policy or because of control plane behavior, mutators are likely to
operate on their own output. This carries some operational risk. Consider a mutating webhook that prepends a registry hostname to a Docker
image reference (e.g. prepends `gcr.io/`). If written naively, each successive mutation would add another prefix, leading to results
like `gcr.io/gcr.io/gcr.io/my-favorite-image:latest`. Because of this, Kubernetes requires mutation webhooks to be
[idempotent](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#idempotence).

This is a good idea, but there is one problem: webhooks that are idempotent in isolation may not be idempotent as a group.
Let's take the above mutator and make it idempotent. We'll give it the following behavior: "if an image reference does
not start with `gcr.io/`, prepend `gcr.io/`". This certainly makes the webhook idempotent. But what if there is another
team working on the cluster, and they want their own image mutation rule: "if an image reference for the `billing`
namespace does not start with `billing.company.com/`, prepend `billing.company.com/`". Each of these webhooks would
be idempotent in isolation, but when chained together you'll see results like
`billing.company.com/gcr.io/billing.company.com/gcr.io/my-favorite-image:latest`.

At small scales, with small teams, it's relatively easy to ensure that mutations don't interfere with each other,
but at larger scales, or when multiple non-communicating parties have their own rules that they want to set, it
can be hard or even impossible to maintain this requirement of "global idempotence".

Gatekeeper attempts to make this easier by designing mutation in such a way that "global idempotence" is an
emergent property of all mutators, no matter how they are configured. Here is a [proof](https://docs.google.com/document/d/1mCHHhBABzUwP8FtUuEf_B-FX-HHgh_k4bwZcGUYm7Sw/edit#heading=h.j5thjfnqybpn), where we attempt to show that our language
for expressing mutation always converges on a stable result.

# Summary

By using Gatekeeper for mutation, it is possible to reduce the number of mutating webhooks, which should reduce admission
latency. It should also help prevent mutation policies that are managed independently of one another from violating the
Kubernetes API server's requirement of idempotence.
6 changes: 6 additions & 0 deletions website/versioned_sidebars/version-v3.10.x-sidebars.json
@@ -45,6 +45,12 @@
"performance-tuning"
]
},
{
"type": "category",
"label": "Concepts",
"collapsed": false,
"items": ["mutation-background"]
},
{
"type": "category",
"label": "Contributing",