202 changes: 77 additions & 125 deletions hips/hip-9999.md
@@ -1,81 +1,82 @@
---
hip: 9999
title: "New annotations for pre-install and pre-upgrade to fail fast and display output"
authors: [ "Ian Zink <ian@replicated.com>" ]
title: "New CLI switch for hooks to display output"
authors: [ "Ian Zink <ian@replicated.com>", "Xav Paice <xav@replicated.com>" ]
created: "2023-01-26"
type: "feature"
status: "draft"
---

## Abstract

This proposes two new annotations for hooks specific to jobs. One that will cause install, upgrade, or test hooks to block and fail fast. And a second annotation to indicate that the output from the job should be displayed to the user.
This proposes a new CLI switch to indicate that the output from a job executed as a hook should be displayed to the user.


## Motivation

The primary motivation for this HIP is the ability to run Preflight checks before a helm chart runs to verify it can successfully install. Preflight checks require a way to run and fail fast and to present that check that failed back to the user.
The primary motivation for this HIP is the ability to run Preflight checks before a helm chart attempts to install permanent resources in the cluster, to verify that it can install successfully. Preflight checks need a way to run, to fail before the installation, and to present the results of the failed check back to the user.

Often it is important to verify that the kubernetes cluster you are deploying a helm chart into has certain properties. You might need to know that the cluster is of a certain version to use various APIs. You might need to know that it has ingress available, a certain amount of ephemeral storage, memory, or CPUs available. You might want to validate the the service key they provided was correct or that that database they entered is reachable. Letting a chart deploy and then finding debugging to see why it failed is a poor user experience. These things can all be done with preflight checks enabled by the hooks proposed in this HIP.

In general, allowing chart developers to run jobs and present that feedback directly to the users could also open up additional use cases beyond just the preflight use case that motivated this HIP.
Often it is important to verify that the kubernetes cluster you are deploying a helm chart into has certain properties. You might need to know that the cluster is of a certain version to use various APIs. You might need to know that it has ingress available, or a certain amount of ephemeral storage, memory, or CPUs available. You might want to validate that the service key the user provided is correct or that the database they entered is reachable. Letting a chart deploy and then debugging to see why it failed is a poor user experience. These things can all be done with preflight checks enabled by the hooks proposed in this HIP.


## Rationale

There are other ways that this could be implemented. For example, we could have a separate preflight hook type. However, this new hook type wouldn't be handled at all by previous versions of helm. With this design, the new hooks will degrade into timeout errors instead of continuing to the install phase.

Another strategy could be for helm to include Troubleshoot.sh as a dependent library, but this could result in too tight of a coupling between the projects and lower overall flexibility and adaptability.
The problem with running preflight checks as hooks currently is that in order to read the logs from the job, you need to leave the resources created by the hook in place so that logs can be retrieved. Ideally, a failed preflight check would leave no trace of itself in the cluster. If hooks were to collect the output and display it to users via stdout, then install attempts could run using the `--atomic` switch along with settings to delete the resources, and folks would have useful output from the failed hook.

In general, allowing chart developers to run jobs and present that feedback directly to the users could also open up additional use cases beyond just the preflight use case that motivated this HIP.

## Specification

Templates could include the following annotations on Batch Jobs:

```yaml
"helm.sh/hook": pre-install, pre-upgrade
"helm.sh/hook-fail-fast": "true"
"helm.sh/show-output": "true"
```

`helm.sh/hook-fail-fast` would indicate that helm should wait for this job to complete and if it fails should immediately exit the install process.
`helm.sh/show-output` would indicate that helm should display the output of the job to the user.

Additionally a new user flag should be created `--ignore-fail-fast` that would ignore the results of the job and continue with the install process.
When calling `helm install`, an additional CLI switch `--show-hook-logs` causes the command to write the logs from any pods created during hook execution to stdout when the hook completes.

There should be no need to follow the logs in real time; printing the entire log at completion is acceptable.
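
The intended behaviour can be approximated by hand today, which may help make the specification concrete. The following is a minimal shell sketch, not how Helm would implement the switch: `show_hook_logs` is a hypothetical helper, and it assumes that the hook Job was left in place after it ran, that its pods carry the `job-name` label set by the Job controller, and that the namespace defaults to `default` when none is given.

```shell
# show_hook_logs is a hypothetical helper, not part of Helm or kubectl.
# Usage: show_hook_logs <job-name> [namespace]
show_hook_logs() {
  job="$1"
  namespace="${2:-default}"   # assumption: fall back to the "default" namespace

  echo "Job output for ${job}:"

  # Select every pod created by the Job via its job-name label.
  # --tail=-1 prints the entire log instead of only the last few lines.
  kubectl logs --namespace "${namespace}" --selector "job-name=${job}" \
    --all-containers=true --tail=-1

  # Remove the hook Job afterwards so the failed check leaves nothing behind.
  kubectl delete job "${job}" --namespace "${namespace}" --ignore-not-found
}
```

After a failed install, this could be called as `show_hook_logs my-release-false-job`. The proposed switch would make this manual step unnecessary, because Helm itself would print the logs before the hook delete policy removes the resources.
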
## Backwards compatibility

As helm charts added new fail-fast hooks, old versions of helm would process them as if they were normal hooks. If `--wait-for-jobs` was set, they would timeout and fail. If it was not set, they would continue on to the next hook.
The new switch would not be accepted by older versions of Helm.
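
Scripts that have to work across Helm versions could probe for the switch before using it. The following is a minimal sketch under the assumption that a supported flag appears in the output of `helm install --help`; `helm_supports_flag` is a hypothetical helper name, and the match is a plain substring search.

```shell
# helm_supports_flag is a hypothetical helper: it succeeds when the help text
# of `helm install` mentions the given flag (plain substring match).
helm_supports_flag() {
  helm install --help 2>/dev/null | grep -qF -e "$1"
}
```

A wrapper script might then add `--show-hook-logs` to its `helm install` arguments only when `helm_supports_flag --show-hook-logs` succeeds, and otherwise fall back to retrieving the logs manually.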

## Security implications

As jobs can already run arbitrary code, this HIP does not introduce any new security implications -- only the ability to fail fast and display output.
As jobs can already run arbitrary code, this HIP does not introduce any new security implications -- only the ability to display output.

Preflight checks could potentially detect security misconfigurations, which could improve the security of the cluster.

## How to teach this

For one an example template would be provided showing how to use the new feature with Troubleshoot.sh to provide preflight checks.
In the first instance, documentation plus the help text for `helm install` would explain the feature.

An example template could be provided in documentation showing how to use this feature with a generic command used in a hook.

A more advanced example showing how to use the new feature with Troubleshoot.sh to provide preflight checks could be linked in the documentation, provided directly in the documentation, or provided on the Troubleshoot.sh documentation site independently.

## Reference implementation

The `safe-install` plugin (link in references) demonstrates what running preflights could look like, but not in the fashion implemented in this HIP.
The [Troubleshoot Helm chart](https://github.com/xavpaice/helm-chart-troubleshoot) provides an example preflight, but because the new switch is not yet available it does not delete resources after running. This would be updated when the switch is implemented.

## Rejected ideas
N/A

There are other ways that this could be implemented. For example, we could have a separate preflight hook type. However, this new hook type wouldn't be handled at all by previous versions of helm.

Another strategy could be for helm to include Troubleshoot.sh as a dependent library, but this could result in too tight of a coupling between the projects and lower overall flexibility and adaptability.

Use of an extra annotation, e.g. `"helm.sh/hook-output-log-policy": hook-succeeded, hook-failed`, was considered; however, that puts the choice of viewing logs in the hands of the chart developer rather than the user executing the install.

## Open issues
N/A

## References
Two issues have been closed due to inactivity:

* [#2298](https://github.com/helm/helm/issues/2298)
* [#3481](https://github.com/helm/helm/issues/3481)

[Troubleshoot.sh](https://troubleshoot.sh/) - the tool that is the motivation for this HIP.
## References

[safe-install plugin](https://github.com/z4ce/helm-safe-install) - Plugin that provides a similar experience to what I hope this HIP will provide natively.
* [Troubleshoot.sh](https://troubleshoot.sh/) - the tool that is the motivation for this HIP.
* [safe-install plugin](https://github.com/z4ce/helm-safe-install) - Plugin that provides a similar experience to what I hope this HIP will provide natively.
* [Troubleshoot Helm chart](https://github.com/xavpaice/helm-chart-troubleshoot) - Example Helm chart with a pre-install hook including a Preflight check.
* [Prior code PR](https://github.com/helm/helm/pull/10309) & [associated Docs PR](https://github.com/helm/helm-www/pull/1242) - similar PRs covering a slightly different implementation of the same topic

# Reference - Examples Usage
## Reference - Examples Usage

## Example using `false`
### Example using `false`

Template:
```yaml
@@ -92,7 +93,6 @@ metadata:
# This is what defines this resource as a hook. Without this line, the
# job is considered part of the release.
"helm.sh/hook": pre-install, pre-upgrade
"helm.sh/hook-fail-fast": "true"
"helm.sh/show-output": "true"
"helm.sh/hook-weight": "-5"
"helm.sh/hook-delete-policy": hook-succeeded, hook-failed
@@ -110,17 +110,20 @@ spec:
containers:
- name: post-install-job
image: "alpine:3.3"
command: ["false"]
command: ["sh", "-c", "echo foo ; false"]
```

What it should look like when running:

```
$ helm install ./ my-release
Fail-fast job failed: my-release-false-job
Job my-release-false-job output:
$ helm install my-release ./ --atomic --show-hook-logs
Error: INSTALLATION FAILED: failed pre-install: job failed: BackoffLimitExceeded
Job output for my-release-false-job:
foo
```

## Example using Troubleshoot Preflight Checks
### Example using Troubleshoot Preflight Checks

```yaml
apiVersion: batch/v1
kind: Job
@@ -135,12 +138,10 @@ metadata:
# This is what defines this resource as a hook. Without this line, the
# job is considered part of the release.
"helm.sh/hook": pre-install, pre-upgrade
"helm.sh/hook-fail-fast": "true"
"helm.sh/show-output": "true"
"helm.sh/hook-weight": "-5"
"helm.sh/hook-delete-policy": hook-succeeded, hook-failed

"helm.sh/hook-delete-policy": before-hook-creation, hook-succeeded, hook-failed
spec:
backoffLimit: 0 # do not retry on failure
template:
metadata:
name: "{{ .Release.Name }}"
@@ -149,95 +150,46 @@ spec:
app.kubernetes.io/instance: {{ .Release.Name | quote }}
helm.sh/chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"
spec:
serviceAccountName: "{{ .Release.Name }}-preflight" # See references for full implementation
restartPolicy: Never
volumes:
- name: preflights
configMap:
name: "{{ .Release.Name }}-preflight-config"
secret:
secretName: "{{ .Release.Name }}-preflight-config" # See references for full implementation
- name: kube-api-token
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
containers:
- name: post-install-job
image: "replicated/preflight:latest"
command: ["preflight", "--interactive=false", "--format json", "/preflights/preflight.yaml"]
- name: pre-install-job
image: "{{ .Values.preflight.image }}"
command:
- "preflight"
- "--interactive=false"
- "/preflights/preflight.yaml"
volumeMounts:
- name: preflights
- name: preflights # See references for full implementation
mountPath: /preflights

---
apiVersion: v1
kind: ConfigMap
metadata:
annotations:
"helm.sh/hook": pre-install, pre-upgrade
"helm.sh/hook-weight": "-6"
"helm.sh/hook-delete-policy": hook-succeeded, hook-failed
labels:
app.kubernetes.io/managed-by: {{ .Release.Service | quote }}
app.kubernetes.io/instance: {{ .Release.Name | quote }}
app.kubernetes.io/version: {{ .Chart.AppVersion }}
helm.sh/chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"
name: "{{ .Release.Name }}-preflight-config"
data:
preflights.yaml: |
apiVersion: troubleshoot.sh/v1beta2
kind: Preflight
metadata:
name: preflight-tutorial
spec:
collectors:
{{ if eq .Values.mariadb.enabled false }}
- mysql:
collectorName: mysql
uri: '{{ .Values.externalDatabase.user }}:{{ .Values.externalDatabase.password }}@tcp({{ .Values.externalDatabase.host }}:{{ .Values.externalDatabase.port }})/{{ .Values.externalDatabase.database }}?tls=false'
{{ end }}
analyzers:
- clusterVersion:
outcomes:
- fail:
when: "< 1.16.0"
message: The application requires at least Kubernetes 1.16.0, and recommends 1.18.0.
uri: https://kubernetes.io
- warn:
when: "< 1.18.0"
message: Your cluster meets the minimum version of Kubernetes, but we recommend you update to 1.18.0 or later.
uri: https://kubernetes.io
- pass:
message: Your cluster meets the recommended and required versions of Kubernetes.
{{ if eq .Values.mariadb.enabled false }}
- mysql:
checkName: Must be MySQL 8.x or later
collectorName: mysql
outcomes:
- fail:
when: connected == false
message: Cannot connect to MySQL server
- fail:
when: version < 8.x
message: The MySQL server must be at least version 8
- pass:
message: The MySQL server is ready
{{ end }}
```
What it should loook when running:

What it should look like when running:

```
$ helm install ./ my-release
Fail-fast job failed: my-release-preflight-job
$ helm install my-release ./ --atomic --show-hook-logs
Error: INSTALLATION FAILED: failed pre-install: job failed: BackoffLimitExceeded
Job my-release-preflight-job output:
name: cluster-resources status: completed completed: 1 total: 3
name: mysql/mysql status: running completed: 1 total: 3
name: mysql/mysql status: completed completed: 2 total: 3
name: cluster-info status: running completed: 2 total: 3
{
"fail": [
{
"title": "Required Kubernetes Version",
"message": "The application requires at least Kubernetes 1.16.0, and recommends 1.18.0.",
"uri": "https://kubernetes.io"
},
{
"title": "Must be MySQL 8.x or later",
"message": "Cannot connect to MySQL server"
}
]
}
name: cluster-info status: completed completed: 3 total: 3
name: cluster-info status: running completed: 0 total: 2
name: cluster-info status: completed completed: 1 total: 2
name: cluster-resources status: running completed: 1 total: 2
name: cluster-resources status: completed completed: 2 total: 2

--- FAIL: Node Count Check
--- The cluster has less than 3 nodes.
--- PASS Required Kubernetes Version
--- Your cluster meets the recommended and required versions of Kubernetes.
--- FAIL preflight-tutorial
FAILED
```