
node-installer job does not terminate properly #140

Open
@voigt

As part of #68, I investigated an issue in the containerd restart routine. When the node-installer installs a runtime and restarts containerd, the corresponding pod terminates with status Unknown.

Overview:

kubectl get job
NAME                            COMPLETIONS   DURATION   AGE
kwasm-worker-spin-v2-install    1/1           28s        21m
kubectl get po
NAME                                  READY   STATUS      RESTARTS   AGE
kwasm-worker-spin-v2-install-n82d9    0/1     Unknown     0          7m25s
kwasm-worker-spin-v2-install-rq78d    0/1     Completed   0          7m3s

Logs of Pod with status Unknown

kubectl logs kwasm-worker-spin-v2-install-n82d9 -c downloader
2024-05-20T20:49:40     INFO    start downloading shim from  https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz...
2024-05-20T20:49:42     INFO    download successful:
total 40M
drwxrwxrwx    1 root     root          46 May 20 20:49 .
drwxr-xr-x    1 root     root          48 May 20 20:49 ..
-rwxr-xr-x    1 1001     127        39.6M May  8 17:13 containerd-shim-spin-v2
kubectl logs kwasm-worker-spin-v2-install-n82d9 -c provisioner
2024/05/20 20:49:46 INFO shim installed shim=spin-v2 path=/opt/kwasm/bin/containerd-shim-spin-v2 new-version=true
2024/05/20 20:49:46 INFO shim configured shim=spin-v2 path=/etc/containerd/config.toml
2024/05/20 20:49:46 INFO restarting containerd

Logs of Pod with status Completed

kubectl logs kwasm-worker-spin-v2-install-rq78d -c downloader
2024-05-20T20:49:57     INFO    start downloading shim from  https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz...
2024-05-20T20:49:59     INFO    download successful:
total 40M
drwxrwxrwx    1 root     root          46 May 20 20:49 .
drwxr-xr-x    1 root     root          48 May 20 20:49 ..
-rwxr-xr-x    1 1001     127        39.6M May  8 17:13 containerd-shim-spin-v2
kubectl logs kwasm-worker-spin-v2-install-rq78d -c provisioner
2024/05/20 20:50:00 INFO shim installed shim=spin-v2 path=/opt/kwasm/bin/containerd-shim-spin-v2 new-version=false
2024/05/20 20:50:00 INFO runtime config already exists, skipping runtime=spin-v2
2024/05/20 20:50:00 INFO shim configured shim=spin-v2 path=/etc/containerd/config.toml
2024/05/20 20:50:00 INFO nothing changed, nothing more to do

The Completed pod is only scheduled because the first pod did not terminate successfully, even though the actual work (rewriting the containerd config and removing the binary) was already done. As a result, the second run of the job has nothing left to do.
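The difference between the two provisioner logs can be read as a simple decision: containerd is restarted only when the run changed something on the node. Below is a minimal Go sketch of that reading. The names `installResult` and `nextStep` are hypothetical and not taken from the node-installer source, and the rule that either a changed shim or a changed config triggers the restart is an assumption inferred from the logs above.

```go
package main

import "fmt"

// installResult is a hypothetical summary of one provisioner run.
// It is an illustration only, not a type from node-installer.
type installResult struct {
	shimChanged   bool // assumed to match new-version=true in the log
	configChanged bool // assumed false when the runtime config already exists
}

// nextStep returns the log message the run would end with, under the
// assumption that any change to the shim or the config forces a restart.
func nextStep(r installResult) string {
	if r.shimChanged || r.configChanged {
		return "restarting containerd"
	}
	return "nothing changed, nothing more to do"
}

func main() {
	// First pod: new shim binary installed (new-version=true).
	first := installResult{shimChanged: true, configChanged: true}
	// Second pod: shim and runtime config were already in place.
	second := installResult{}

	fmt.Println("first run: ", nextStep(first))
	fmt.Println("second run:", nextStep(second))
}
```

Under this reading, only the first pod ever reaches the restart, which matches the observation that only the first pod ends up in status Unknown.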

Description of Pod with Status Unknown

    State:          Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Mon, 20 May 2024 22:49:46 +0200
      Finished:     Mon, 20 May 2024 22:49:48 +0200
kubectl describe po kwasm-worker-spin-v2-install-n82d9
Name:             kwasm-worker-spin-v2-install-n82d9
Namespace:        default
Priority:         0
Service Account:  default
Node:             kwasm-worker/192.168.228.5
Start Time:       Mon, 20 May 2024 22:49:35 +0200
Labels:           batch.kubernetes.io/controller-uid=7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
                  batch.kubernetes.io/job-name=kwasm-worker-spin-v2-install
                  controller-uid=7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
                  job-name=kwasm-worker-spin-v2-install
Annotations:      <none>
Status:           Failed
IP:               10.244.2.2
IPs:
  IP:           10.244.2.2
Controlled By:  Job/kwasm-worker-spin-v2-install
Init Containers:
  downloader:
    Container ID:   containerd://7f63983e513efa392e3cc684bf53d2553aeb898b4bfe08fb22229fbae83406cb
    Image:          ghcr.io/spinkube/shim-downloader:latest-feat-add_shim_downloader
    Image ID:       ghcr.io/spinkube/shim-downloader@sha256:719f54c518fc0fc65abbe8ac27978ea188d13faee23530544faf9d622aa2be92
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 20 May 2024 22:49:40 +0200
      Finished:     Mon, 20 May 2024 22:49:42 +0200
    Ready:          True
    Restart Count:  0
    Environment:
      SHIM_NAME:      spin-v2
      SHIM_LOCATION:  https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz
    Mounts:
      /assets from shim-download (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wnr2x (ro)
Containers:
  provisioner:
    Container ID:  containerd://92dd4c994b2fc95d269b5de630c00f55fff233d04d1d649a6b69ce512936278b
    Image:         ghcr.io/spinkube/node-installer:latest-feat-add_shim_downloader
    Image ID:      ghcr.io/spinkube/node-installer@sha256:fcbfa4d8197d3de3b9953219af6a8784f23abf7d798150b2c2a606daaeebe6df
    Port:          <none>
    Host Port:     <none>
    Args:
      install
      -H
      /mnt/node-root
      -r
      spin-v2
    State:          Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Mon, 20 May 2024 22:49:46 +0200
      Finished:     Mon, 20 May 2024 22:49:47 +0200
    Ready:          False
    Restart Count:  0
    Environment:
      HOST_ROOT:            /mnt/node-root
      SHIM_FETCH_STRATEGY:  /mnt/node-root
    Mounts:
      /assets from shim-download (rw)
      /mnt/node-root from root-mount (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wnr2x (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  shim-download:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  root-mount:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  kube-api-access-wnr2x:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason   Age   From     Message
  ----    ------   ----  ----     -------
  Normal  Pulling  25m   kubelet  Pulling image "ghcr.io/spinkube/shim-downloader:latest-feat-add_shim_downloader"
  Normal  Pulled   25m   kubelet  Successfully pulled image "ghcr.io/spinkube/shim-downloader:latest-feat-add_shim_downloader" in 4.108s (4.108s including waiting)
  Normal  Created  25m   kubelet  Created container downloader
  Normal  Started  25m   kubelet  Started container downloader
  Normal  Pulling  25m   kubelet  Pulling image "ghcr.io/spinkube/node-installer:latest-feat-add_shim_downloader"
  Normal  Pulled   25m   kubelet  Successfully pulled image "ghcr.io/spinkube/node-installer:latest-feat-add_shim_downloader" in 3.105s (3.105s including waiting)
  Normal  Created  25m   kubelet  Created container provisioner
  Normal  Started  25m   kubelet  Started container provisioner
Entire resource of Job (e.g. for recreation of the bug)
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    kwasm.sh/nodeName: kwasm-worker
    kwasm.sh/operation: install
    kwasm.sh/shimName: spin-v2
  labels:
    kwasm-worker-spin-v2-install: "true"
    kwasm.sh/job: "true"
    kwasm.sh/operation: install
    kwasm.sh/shimName: spin-v2
  name: kwasm-worker-spin-v2-install
  namespace: default
spec:
  backoffLimit: 6
  completionMode: NonIndexed
  completions: 1
  manualSelector: false
  parallelism: 1
  podReplacementPolicy: TerminatingOrFailed
  selector:
    matchLabels:
      batch.kubernetes.io/controller-uid: 7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
  suspend: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        batch.kubernetes.io/controller-uid: 7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
        batch.kubernetes.io/job-name: kwasm-worker-spin-v2-install
        controller-uid: 7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
        job-name: kwasm-worker-spin-v2-install
    spec:
      containers:
      - args:
        - install
        - -H
        - /mnt/node-root
        - -r
        - spin-v2
        env:
        - name: HOST_ROOT
          value: /mnt/node-root
        - name: SHIM_FETCH_STRATEGY
          value: /mnt/node-root
        image: ghcr.io/spinkube/node-installer:latest-feat-add_shim_downloader
        imagePullPolicy: IfNotPresent
        name: provisioner
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /mnt/node-root
          name: root-mount
        - mountPath: /assets
          name: shim-download
      dnsPolicy: ClusterFirst
      hostPID: true
      initContainers:
      - env:
        - name: SHIM_NAME
          value: spin-v2
        - name: SHIM_LOCATION
          value: https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz
        image: ghcr.io/spinkube/shim-downloader:latest-feat-add_shim_downloader
        imagePullPolicy: IfNotPresent
        name: downloader
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /assets
          name: shim-download
      nodeName: kwasm-worker
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir: {}
        name: shim-download
      - hostPath:
          path: /
          type: ""
        name: root-mount
status:
  completionTime: "2024-05-20T20:50:03Z"
  conditions:
  - lastProbeTime: "2024-05-20T20:50:03Z"
    lastTransitionTime: "2024-05-20T20:50:03Z"
    status: "True"
    type: Complete
  failed: 1
  ready: 0
  startTime: "2024-05-20T20:49:35Z"
  succeeded: 1
  terminating: 0
  uncountedTerminatedPods: {}

While the goal of installing/uninstalling the shim is achieved, this is not the desired behavior and calls for a solution.
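The Job status above is what makes this easy to miss: the Job reports Complete even though one pod failed. As a rough illustration, the following Go sketch flags that combination. The `jobStatus` struct and `completedAfterRetry` function are hypothetical helpers that mirror only the `succeeded`, `failed`, and `Complete` fields shown in the status block; they are not part of any Kubernetes client library.

```go
package main

import "fmt"

// jobStatus is a hypothetical, reduced view of the Job status shown above.
type jobStatus struct {
	complete  bool // the Complete condition has status "True"
	succeeded int  // status.succeeded
	failed    int  // status.failed
}

// completedAfterRetry reports whether a Job finished successfully
// only after at least one of its pods had failed.
func completedAfterRetry(s jobStatus) bool {
	return s.complete && s.succeeded > 0 && s.failed > 0
}

func main() {
	// Values copied from the status block of kwasm-worker-spin-v2-install.
	reported := jobStatus{complete: true, succeeded: 1, failed: 1}
	fmt.Println("completed after retry:", completedAfterRetry(reported))
}
```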

Labels: kind/bug (Something isn't working)
