
Job resources are not automatically cleaned up, preventing version upgrades #7118

Closed
marshall007 opened this issue May 8, 2021 · 19 comments
Labels
area/helm Issues or PRs related to helm charts help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.


@marshall007

marshall007 commented May 8, 2021

NGINX Ingress controller version: v0.45.0

Kubernetes version (use kubectl version): v1.19.0

Environment:

  • Cloud provider or hardware configuration: GKE / EKS
  • Install tools: kustomize and kpt live apply

What happened:

First deployment of ingress-nginx from the static manifests is successful. When attempting to upgrade to any newer version of the static manifests (in our case v0.44.0 -> v0.45.0), you will see errors indicating that the Job specs are immutable:

The Job "ingress-nginx-admission-create" is invalid: spec.template: Invalid value: core.PodTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:"ingress-nginx-admission-create", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"app.kubernetes.io/component":"admission-webhook", "app.kubernetes.io/instance":"ingress-nginx", "app.kubernetes.io/managed-by":"Helm", "app.kubernetes.io/name":"ingress-nginx", "app.kubernetes.io/version":"0.45.0", "controller-uid":"5d2d96c5-6c0c-434d-8bb6-c90124b5fbbf", "helm.sh/chart":"ingress-nginx-3.27.0", "job-name":"ingress-nginx-admission-create"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:core.PodSpec{Volumes:[]core.Volume(nil), InitContainers:[]core.Container(nil), Containers:[]core.Container{core.Container{Name:"create", Image:"docker.io/jettech/kube-webhook-certgen:v1.5.1", Command:[]string(nil), Args:[]string{"create", "--host=ingress-nginx-controller-admission,ingress-nginx-controller-admission.$(POD_NAMESPACE).svc", "--namespace=$(POD_NAMESPACE)", "--secret-name=ingress-nginx-admission"}, WorkingDir:"", Ports:[]core.ContainerPort(nil), EnvFrom:[]core.EnvFromSource(nil), Env:[]core.EnvVar{core.EnvVar{Name:"POD_NAMESPACE", Value:"", ValueFrom:(*core.EnvVarSource)(0xc02df04f20)}}, Resources:core.ResourceRequirements{Limits:core.ResourceList(nil), Requests:core.ResourceList(nil)}, VolumeMounts:[]core.VolumeMount(nil), VolumeDevices:[]core.VolumeDevice(nil), LivenessProbe:(*core.Probe)(nil), ReadinessProbe:(*core.Probe)(nil), StartupProbe:(*core.Probe)(nil), Lifecycle:(*core.Lifecycle)(nil), TerminationMessagePath:"/dev/termination-log", TerminationMessagePolicy:"File", ImagePullPolicy:"IfNotPresent", SecurityContext:(*core.SecurityContext)(nil), Stdin:false, StdinOnce:false, TTY:false}}, EphemeralContainers:[]core.EphemeralContainer(nil), RestartPolicy:"OnFailure", TerminationGracePeriodSeconds:(*int64)(0xc003fb51c0), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"ClusterFirst", NodeSelector:map[string]string(nil), ServiceAccountName:"ingress-nginx-admission", AutomountServiceAccountToken:(*bool)(nil), NodeName:"", SecurityContext:(*core.PodSecurityContext)(0xc03cd46800), ImagePullSecrets:[]core.LocalObjectReference(nil), Hostname:"", Subdomain:"", Affinity:(*core.Affinity)(nil), SchedulerName:"default-scheduler", Tolerations:[]core.Toleration(nil), HostAliases:[]core.HostAlias(nil), PriorityClassName:"", Priority:(*int32)(nil), PreemptionPolicy:(*core.PreemptionPolicy)(nil), DNSConfig:(*core.PodDNSConfig)(nil), ReadinessGates:[]core.PodReadinessGate(nil), RuntimeClassName:(*string)(nil), Overhead:core.ResourceList(nil), EnableServiceLinks:(*bool)(nil), TopologySpreadConstraints:[]core.TopologySpreadConstraint(nil)}}: field is immutable

I think this is simply because labels like app.kubernetes.io/version and helm.sh/chart change between versions, and they are set on the Job's immutable pod template (spec.template.metadata.labels).
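
For anyone who wants to confirm this on a live cluster, something like the following (illustrative only) prints the labels baked into the existing Job's immutable pod template:

# show the labels stamped into the Job's pod template
kubectl get job -n ingress-nginx ingress-nginx-admission-create -o jsonpath='{.spec.template.metadata.labels}'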

We use kpt pkg commands to pull the manifests maintained in deploy/static/provider/cloud into our own git repo, apply kustomizations on top of that, and deploy with kpt live commands.

What you expected to happen:

I believe helm users do not experience this problem because of the helm.sh/hook-delete-policy annotations specified on the various Job resources. These annotations are obviously not respected when deploying the static manifests with other tooling.

I would expect users deploying from static manifests to have a similar experience when upgrading between versions.

Rather than relying on proprietary helm hooks to clean up the Job resources, perhaps we can use spec.ttlSecondsAfterFinished so that these resources are pruned in a portable fashion.
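
Until something like that lands, one illustrative stopgap (not something from the project docs) is to delete the completed admission Jobs by hand before applying the newer manifests:

# remove the finished admission Jobs so the new manifests apply cleanly
kubectl delete job -n ingress-nginx ingress-nginx-admission-create ingress-nginx-admission-patch --ignore-not-found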

How to reproduce it:

This is not precisely what we are doing, but it should be reproducible like so:

# fetch the package
kpt pkg get github.com/kubernetes/ingress-nginx/deploy/static/provider/cloud?ref=controller-v0.44.0 ingress-nginx

# setup
pushd ingress-nginx
kpt live init .
kustomize edit add resource inventory-template.yaml
popd

# deploy v0.44.0
kustomize build ingress-nginx | kpt live apply --reconcile-timeout=15m

# upgrade package to v0.45.0
kpt pkg update ingress-nginx@controller-v0.45.0 --strategy force-delete-replace

# attempt to deploy v0.45.0
kustomize build ingress-nginx | kpt live apply

/kind bug

@marshall007 marshall007 added the kind/bug Categorizes issue or PR as related to a bug. label May 8, 2021
@marshall007
Author

I see that my suggestion of using spec.ttlSecondsAfterFinished is already implemented in the helm chart:

{{- if .Capabilities.APIVersions.Has "batch/v1alpha1" }}
# Alpha feature since k8s 1.12
ttlSecondsAfterFinished: 0
{{- end }}

However, this does not end up in the generated static manifests:

spec:
  template:
    metadata:
      name: ingress-nginx-admission-patch

In order for this to work, we'd need to start passing --api-versions to the helm template calls in the gen script, set to something reasonable:

cat << EOF | helm template $RELEASE_NAME ${DIR}/charts/ingress-nginx --namespace $NAMESPACE --values - | $DIR/hack/add-namespace.py $NAMESPACE > ${OUTPUT_FILE}
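
For illustration, a sketch of the same call with the flag added (batch/v1alpha1 chosen only because that is what the chart template currently checks for; the exact value is an open question):

cat << EOF | helm template $RELEASE_NAME ${DIR}/charts/ingress-nginx --namespace $NAMESPACE --api-versions batch/v1alpha1 --values - | $DIR/hack/add-namespace.py $NAMESPACE > ${OUTPUT_FILE}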

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 9, 2021
@NeckBeardPrince

Not stale

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 9, 2021
@NeckBeardPrince

Can we get some eyes on this?

@iamNoah1
Contributor

/lifecycle active
/area helm
/priority important-longterm

@NeckBeardPrince see this comment from @rikatz: #7128 (comment)

@k8s-ci-robot k8s-ci-robot added lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. area/helm Issues or PRs related to helm charts priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Nov 17, 2021
@blu3r4y

blu3r4y commented Jan 13, 2022

As a temporary workaround, I added the following steps after I deploy the controller:

kubectl patch -n ingress-nginx job ingress-nginx-admission-create -p '{"spec":{"ttlSecondsAfterFinished":0}}'
kubectl patch -n ingress-nginx job ingress-nginx-admission-patch -p '{"spec":{"ttlSecondsAfterFinished":0}}'

If you are using PowerShell on Windows, don't forget to escape the quotes differently:

kubectl patch -n ingress-nginx job ingress-nginx-admission-create -p '{\"spec\":{\"ttlSecondsAfterFinished\":0}}'
kubectl patch -n ingress-nginx job ingress-nginx-admission-patch -p '{\"spec\":{\"ttlSecondsAfterFinished\":0}}'

@iamNoah1
Contributor

/triage accepted
/help

@k8s-ci-robot
Contributor

@iamNoah1:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/triage accepted
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Apr 12, 2022
@numbsafari

In our config we are using kustomize. For this component we apply a patch, e.g.:

patches:
- target:
    kind: Job
  patch: |-
    - op: add
      path: /spec/ttlSecondsAfterFinished
      value: 0
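
A quick, illustrative check that the patch lands on both admission Jobs in the rendered output:

# should print 2 if both admission Jobs picked up the field
kustomize build . | grep -c ttlSecondsAfterFinished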

@robert-heinzmann-logmein

robert-heinzmann-logmein commented Jul 21, 2022

Looks like newer versions of k8s no longer ship the batch/v1alpha1 API, and therefore the capabilities check no longer works. The following patch should do the trick:

@@ -17,7 +17,7 @@
     {{- toYaml . | nindent 4 }}
     {{- end }}
 spec:
-{{- if .Capabilities.APIVersions.Has "batch/v1alpha1" }}
+{{- if or (.Capabilities.APIVersions.Has "batch/v1alpha1") (.Capabilities.APIVersions.Has "batch/v1beta1") (.Capabilities.APIVersions.Has "batch/v1") }}
   # Alpha feature since k8s 1.12, beta since 1.21, GA in 1.23
   ttlSecondsAfterFinished: 0
 {{- end }}
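
An illustrative way to sanity-check the change locally (assuming it is run from the repo root):

# render with batch/v1 advertised as available and confirm the field shows up
helm template ingress-nginx ./charts/ingress-nginx --api-versions batch/v1 | grep ttlSecondsAfterFinished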

@rikatz
Contributor

rikatz commented Jul 21, 2022

@robert-heinzmann-logmein wanna send the PR? Then just tag me and I can approve it.

Also, we don't support k8s < v1.20 anymore, so fully dropping this API check should be good.

@robert-heinzmann-logmein

Not sure, how about #7128? Following the discussions, I think as long as you support 1.20 (where this is still alpha), the toggle cannot be removed, right?

@rikatz
Contributor

rikatz commented Jul 21, 2022

Oh it is beta only on v1.21? :/

Yeah, we should wait until v1.25 and drop support for v1.20.

@rikatz
Contributor

rikatz commented Jul 21, 2022

Also apparently that PR is stale, so if you are willing to open a new one lmk.

@robert-heinzmann-logmein

It seems 1.25 is out, so this looks unblocked.

@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 18, 2024
niflostancu added a commit to niflostancu/k8s-service-templates that referenced this issue Mar 6, 2024
@longwuyuan
Contributor

Enough upgrades can safely be assumed to have been performed since the last activity on this issue, and the controller version and Kubernetes version referred to here are either no longer supported or too old to support.

Since nobody is reporting upgrade failures due to pending jobs, I will close the issue for now. If a user posts data proving that upgrades fail due to pending jobs, on a kind cluster or a minikube cluster, then we can reopen this issue.

/close

@k8s-ci-robot
Contributor

@longwuyuan: Closing this issue.

In response to this:

Enough upgrades can safely be assumed to have been performed since the last activity on this issue, and the controller version and Kubernetes version referred to here are either no longer supported or too old to support.

Since nobody is reporting upgrade failures due to pending jobs, I will close the issue for now. If a user posts data proving that upgrades fail due to pending jobs, on a kind cluster or a minikube cluster, then we can reopen this issue.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
