forked from openshift/cluster-version-operator
docs/user/syncronization: Document release-image application
To make it easier for admins and support staff to troubleshoot when the cluster-version operator gets stuck.

The SVGs were generated with:

    $ go build ./hack/cluster-version-util
    $ mkdir /tmp/release
    $ oc image extract quay.io/openshift-release-dev/ocp-release:4.1.0[-1] --path /:/tmp/release
    $ mkdir /tmp/release/manifests
    $ ./cluster-version-util task-graph /tmp/release | dot -Tsvg >docs/user/tasks-by-number-and-component.svg
    $ ./cluster-version-util task-graph --parallel flatten-by-number-and-component /tmp/release | dot -Tsvg >docs/user/tasks-flatten-by-number-and-component.svg

using:

    $ dot -V
    dot - graphviz version 2.30.1 (20170916.1124)

I initially put the utility program in cmd/cluster-version-util, moving the main operator to cmd/cluster-version-operator. But Abhinav was concerned about implied support for tooling that is really just intended for exposing internal CVO logic, so this commit puts it under hack/ with a caveat in the main help text.

The PATH documentation is stuffed into Use because cobra lacks direct support for named positional arguments [1].

[1]: spf13/cobra#378
Showing 6 changed files with 1,846 additions and 32 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,142 @@ | ||
# Synchronization | ||
|
||
This document describes the cluster-version operator's synchronization logic and explains how the operator applies a release image to the cluster. | ||
|
||
## Release image content | ||
|
||
```console | ||
$ mkdir /tmp/release | ||
$ oc image extract quay.io/openshift-release-dev/ocp-release:4.1.0[-1] --path /:/tmp/release | ||
$ ls /tmp/release/release-manifests | ||
0000_03_authorization-openshift_01_rolebindingrestriction.crd.yaml | ||
0000_03_quota-openshift_01_clusterresourcequota.crd.yaml | ||
0000_03_security-openshift_01_scc.crd.yaml | ||
0000_05_config-operator_02_apiserver.cr.yaml | ||
0000_05_config-operator_02_authentication.cr.yaml | ||
... | ||
0000_90_openshift-controller-manager-operator_02_servicemonitor.yaml | ||
0000_90_openshift-controller-manager-operator_03_operand-servicemonitor.yaml | ||
image-references | ||
release-metadata | ||
$ cat /tmp/release/release-manifests/release-metadata | ||
{ | ||
"kind": "cincinnati-metadata-v0", | ||
"version": "4.1.0", | ||
"previous": [], | ||
"metadata": { | ||
"description": "", | ||
"url": "https://access.redhat.com/errata/RHBA-2019:0758" | ||
} | ||
} | ||
$ cat /tmp/release/release-manifests/image-references | ||
{ | ||
"kind": "ImageStream", | ||
"apiVersion": "image.openshift.io/v1", | ||
"metadata": { | ||
"name": "4.1.0", | ||
"creationTimestamp": "2019-06-03T14:49:14Z", | ||
"annotations": { | ||
"release.openshift.io/from-image-stream": "ocp/4.1-art-latest-2019-05-31-174150", | ||
"release.openshift.io/from-release": "registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-05-31-174150" | ||
} | ||
}, | ||
"spec": { | ||
"lookupPolicy": { | ||
"local": false | ||
}, | ||
"tags": [ | ||
{ | ||
"name": "aws-machine-controllers", | ||
"annotations": { | ||
"io.openshift.build.commit.id": "d8d8e285fc19920c3311e791f4fe22db7003588f", | ||
"io.openshift.build.commit.ref": "", | ||
"io.openshift.build.source-location": "https://github.com/openshift/cluster-api-provider-aws" | ||
}, | ||
"from": { | ||
"kind": "DockerImage", | ||
"name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7483248489c918e0c65a6b391bd171da0565cb9995b2acc61a1e517b6551e037" | ||
}, | ||
"generation": 2, | ||
"importPolicy": {}, | ||
"referencePolicy": { | ||
"type": "Source" | ||
} | ||
}, | ||
... | ||
] | ||
}, | ||
"status": { | ||
"dockerImageRepository": "" | ||
} | ||
} | ||
``` | ||
|
||
## Manifest graph | ||
|
||
The cluster-version operator unpacks the release image, ingests the manifests, and loads them into a graph. | ||
For upgrades, the graph is ordered by the number and component of the manifest file: | ||
|
||
<div style="text-align:center"> | ||
<img src="tasks-by-number-and-component.svg" width="100%" /> | ||
</div> | ||
|
||
The `0000_03_authorization-openshift_*` manifest gets its own node, the `0000_03_quota-openshift_01_*` manifest gets its own node, and the `0000_03_security-openshift_*` manifest gets its own node. | ||
The next group of manifests is under `0000_05_config-operator_*`. | ||
Because the number is bumped, the graph blocks until all of the previous `0000_03_*` nodes are complete before beginning the `0000_05_*` block. | ||
|
||
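The by-number-and-component grouping can be illustrated with a minimal Go sketch. It assumes manifest filenames follow the `0000_<number>_<component>_<rest>` pattern seen in the listing above and that numbers are zero-padded to the same width, so string comparison orders them. The names `taskKey`, `node`, `parseManifestName`, and `stages` are hypothetical and are not taken from the operator's source.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// taskKey identifies a graph node: manifests sharing a number and a
// component land in the same node. (Hypothetical type for illustration.)
type taskKey struct {
	Number    string
	Component string
}

// node holds the manifests grouped under one key.
type node struct {
	Key   taskKey
	Files []string
}

// parseManifestName splits a name such as
// "0000_03_quota-openshift_01_clusterresourcequota.crd.yaml" into number
// "03" and component "quota-openshift". ok is false for names that do not
// match the assumed pattern, for example "image-references".
func parseManifestName(name string) (key taskKey, ok bool) {
	parts := strings.SplitN(name, "_", 4)
	if len(parts) < 4 {
		return taskKey{}, false
	}
	return taskKey{Number: parts[1], Component: parts[2]}, true
}

// stages groups manifests into nodes and returns the nodes in stages
// ordered by number. All nodes in one stage share the same number.
func stages(names []string) [][]node {
	byKey := map[taskKey]*node{}
	for _, name := range names {
		key, ok := parseManifestName(name)
		if !ok {
			continue // not a manifest under the assumed naming scheme
		}
		n, seen := byKey[key]
		if !seen {
			n = &node{Key: key}
			byKey[key] = n
		}
		n.Files = append(n.Files, name)
	}

	keys := make([]taskKey, 0, len(byKey))
	for k := range byKey {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].Number != keys[j].Number {
			return keys[i].Number < keys[j].Number
		}
		return keys[i].Component < keys[j].Component
	})

	var out [][]node
	for i, k := range keys {
		if i == 0 || keys[i-1].Number != k.Number {
			out = append(out, nil) // a bumped number starts a new stage
		}
		out[len(out)-1] = append(out[len(out)-1], *byKey[k])
	}
	return out
}

func main() {
	names := []string{
		"0000_03_authorization-openshift_01_rolebindingrestriction.crd.yaml",
		"0000_03_quota-openshift_01_clusterresourcequota.crd.yaml",
		"0000_05_config-operator_02_apiserver.cr.yaml",
		"0000_05_config-operator_02_authentication.cr.yaml",
		"image-references",
	}
	for i, stage := range stages(names) {
		for _, n := range stage {
			fmt.Printf("stage %d: %s/%s (%d manifests)\n", i, n.Key.Number, n.Key.Component, len(n.Files))
		}
	}
}
```

In this sketch a stage boundary plays the role of the blocking edge in the graph: nothing in a later stage starts until the earlier stage is complete.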
We are more relaxed for the initial install because there is not yet any user data in the cluster to worry about. | ||
So the graph nodes are all parallelized, with the by-number ordering flattened out: | ||
|
||
<div style="text-align:center"> | ||
<img src="tasks-flatten-by-number-and-component.svg" width="100%" /> | ||
</div> | ||
|
||
For the usual reconciliation loop (neither an upgrade between releases nor a fresh install), the flattened graph is also randomly permuted to avoid hanging on ordering bugs. | ||
|
||
## Synchronizing the graph | ||
|
||
The cluster-version operator spawns worker goroutines that walk the graph, pushing manifests in their queue. | ||
For each manifest in a graph node, the worker synchronizes the cluster with the manifest using a resource builder. | ||
On error (or timeout), the worker abandons the manifest, the graph node, and any nodes that depend on that graph node. | ||
On success, the worker proceeds to the next manifest in the graph node. | ||
|
||
## Resource builders | ||
|
||
Resource builders synchronize the cluster with a manifest from the release image. | ||
The general approach is to generate a merged manifest combining critical spec properties from the release-image manifest with data from a preexisting in-cluster object, if any. | ||
If the merged manifest differs from the in-cluster object, the merged manifest is pushed back into the cluster. | ||
|
||
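The merge-then-compare approach can be sketched in Go under heavy simplification. Here a spec is modeled as a flat `map[string]string`, a `nil` map means the object does not yet exist in the cluster, and `merge` and `reconcile` are hypothetical helpers. Real builders work on typed Kubernetes objects and decide per type which properties count as critical.

```go
package main

import (
	"fmt"
	"reflect"
)

// merge returns a copy of the in-cluster spec with every property required
// by the release-image manifest overlaid on top. Properties that the
// manifest does not mention keep their in-cluster values.
func merge(existing, required map[string]string) map[string]string {
	merged := make(map[string]string, len(existing)+len(required))
	for k, v := range existing {
		merged[k] = v
	}
	for k, v := range required {
		merged[k] = v
	}
	return merged
}

// reconcile reports the merged spec and whether it must be pushed back to
// the cluster. A nil existing spec (no in-cluster object) always differs
// from the merged result, so the object gets created.
func reconcile(existing, required map[string]string) (map[string]string, bool) {
	merged := merge(existing, required)
	return merged, !reflect.DeepEqual(merged, existing)
}

func main() {
	inCluster := map[string]string{"image": "old", "replicas": "3"}
	fromRelease := map[string]string{"image": "new"}

	merged, push := reconcile(inCluster, fromRelease)
	fmt.Println(merged, push)

	// A second pass against the already-merged object has nothing to push.
	_, push = reconcile(merged, fromRelease)
	fmt.Println(push)
}
```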
Some types have additional logic, as described in the following subsections. | ||
Note that this logic only applies to manifests included in the release image itself. | ||
For example, only ClusterOperator manifests from the release image have the blocking logic described [below](#clusteroperator); if an admin or secondary operator pushes a ClusterOperator object, it does not affect the cluster-version operator's graph synchronization. | ||
|
||
### ClusterOperator | ||
|
||
After pushing the merged ClusterOperator into the cluster, the builder monitors the in-cluster object and blocks until it is: | ||
|
||
* Available | ||
* Either not progressing or listing at least one version. | ||
The progressing check is deprecated and will be removed once all operators are reporting versions. | ||
* Not degraded (except during initialization, where we ignore the degraded status) | ||
|
||
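The three conditions above can be written as a single predicate. The sketch below is hedged: `clusterOperatorStatus` is a hypothetical, flattened stand-in for the real ClusterOperator status (which reports conditions as a list), and `ready` is not the builder's actual function.

```go
package main

import "fmt"

// clusterOperatorStatus is a simplified stand-in for the status of an
// in-cluster ClusterOperator (hypothetical type for illustration).
type clusterOperatorStatus struct {
	Available   bool
	Progressing bool
	Degraded    bool
	Versions    []string
}

// ready reports whether the builder may stop blocking on the operator.
// initializing is true during the initial install, when the degraded
// status is ignored.
func ready(s clusterOperatorStatus, initializing bool) bool {
	if !s.Available {
		return false
	}
	// Either not progressing, or listing at least one version.
	if s.Progressing && len(s.Versions) == 0 {
		return false
	}
	if s.Degraded && !initializing {
		return false
	}
	return true
}

func main() {
	s := clusterOperatorStatus{Available: true, Progressing: true, Versions: []string{"4.1.0"}}
	fmt.Println(ready(s, false))

	s.Degraded = true
	fmt.Println(ready(s, false), ready(s, true))
}
```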
### CustomResourceDefinition | ||
|
||
After pushing the merged CustomResourceDefinition into the cluster, the builder monitors the in-cluster object and blocks until it is established. | ||
|
||
### DaemonSet | ||
|
||
After pushing the merged DaemonSet into the cluster, the builder monitors the in-cluster object and blocks until it: | ||
|
||
* FIXME: `UpdatedNumberScheduled == DesiredNumberScheduled` check? | ||
* Has no unavailable nodes. | ||
|
||
### Deployment | ||
|
||
After pushing the merged Deployment into the cluster (FIXME: `Generation > 1` check from 14fab0b2?), the builder monitors the in-cluster object and blocks until it: | ||
|
||
* FIXME: `UpdatedReplicas == Replicas` check? | ||
* Has no unavailable replicas. | ||
|
||
### Job | ||
|
||
After pushing the merged Job into the cluster, the builder blocks until the Job succeeds. |