Description
As part of #68, I investigated an issue in the containerd restart routine. When the node-installer installs a runtime and restarts containerd, the corresponding pod terminates with status Unknown.
Overview:
kubectl get job
NAME COMPLETIONS DURATION AGE
kwasm-worker-spin-v2-install 1/1 28s 21m
kubectl get po
NAME READY STATUS RESTARTS AGE
kwasm-worker-spin-v2-install-n82d9 0/1 Unknown 0 7m25s
kwasm-worker-spin-v2-install-rq78d 0/1 Completed 0 7m3s
Logs of Pod with status Unknown
kubectl logs kwasm-worker-spin-v2-install-n82d9 -c downloader
2024-05-20T20:49:40 INFO start downloading shim from https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz...
2024-05-20T20:49:42 INFO download successful:
total 40M
drwxrwxrwx 1 root root 46 May 20 20:49 .
drwxr-xr-x 1 root root 48 May 20 20:49 ..
-rwxr-xr-x 1 1001 127 39.6M May 8 17:13 containerd-shim-spin-v2
kubectl logs kwasm-worker-spin-v2-install-n82d9 -c provisioner
2024/05/20 20:49:46 INFO shim installed shim=spin-v2 path=/opt/kwasm/bin/containerd-shim-spin-v2 new-version=true
2024/05/20 20:49:46 INFO shim configured shim=spin-v2 path=/etc/containerd/config.toml
2024/05/20 20:49:46 INFO restarting containerd
Logs of Pod with status Completed
kubectl logs kwasm-worker-spin-v2-install-rq78d -c downloader
2024-05-20T20:49:57 INFO start downloading shim from https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz...
2024-05-20T20:49:59 INFO download successful:
total 40M
drwxrwxrwx 1 root root 46 May 20 20:49 .
drwxr-xr-x 1 root root 48 May 20 20:49 ..
-rwxr-xr-x 1 1001 127 39.6M May 8 17:13 containerd-shim-spin-v2
kubectl logs kwasm-worker-spin-v2-install-rq78d -c provisioner
2024/05/20 20:50:00 INFO shim installed shim=spin-v2 path=/opt/kwasm/bin/containerd-shim-spin-v2 new-version=false
2024/05/20 20:50:00 INFO runtime config already exists, skipping runtime=spin-v2
2024/05/20 20:50:00 INFO shim configured shim=spin-v2 path=/etc/containerd/config.toml
2024/05/20 20:50:00 INFO nothing changed, nothing more to do
The Completed pod only gets scheduled because the first one did not terminate successfully, even though the actual work (rewriting the containerd config and installing the binary) was already done. As a result, the second run of the job has nothing left to do.
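The retry behavior described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the node-installer's actual code: `Node`, `run_install_pod`, and `run_job` are hypothetical names, the containerd restart is modeled as simply ending the first pod with exit code 255, and the Job controller is reduced to a plain retry loop.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for the host: which shims are installed and configured."""
    installed: set = field(default_factory=set)
    configured: set = field(default_factory=set)


def run_install_pod(node: Node, shim: str) -> int:
    """Return the exit code the provisioner container would report (modeled)."""
    changed = False
    if shim not in node.installed:
        node.installed.add(shim)      # binary copied to the host
        changed = True
    if shim not in node.configured:
        node.configured.add(shim)     # runtime entry written to the config
        changed = True
    if changed:
        # Assumption: the containerd restart ends this pod before it can exit
        # cleanly, which is modeled here as exit code 255.
        return 255
    # Nothing changed, so no restart happens and the pod exits cleanly.
    return 0


def run_job(node: Node, shim: str, backoff_limit: int = 6) -> dict:
    """Simplified retry loop for a Job with completions=1."""
    status = {"failed": 0, "succeeded": 0}
    while status["succeeded"] < 1 and status["failed"] <= backoff_limit:
        if run_install_pod(node, shim) == 0:
            status["succeeded"] += 1
        else:
            status["failed"] += 1
    return status


status = run_job(Node(), "spin-v2")
print(status)
```

In this model the first pod does all the work and is counted as failed, while the replacement pod finds nothing to change and completes, which matches the `failed: 1` and `succeeded: 1` counters in the Job status.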
Description of Pod with Status Unknown
State: Terminated
Reason: Unknown
Exit Code: 255
Started: Mon, 20 May 2024 22:49:46 +0200
Finished: Mon, 20 May 2024 22:49:48 +0200
kubectl describe po kwasm-worker-spin-v2-install-n82d9
Name: kwasm-worker-spin-v2-install-n82d9
Namespace: default
Priority: 0
Service Account: default
Node: kwasm-worker/192.168.228.5
Start Time: Mon, 20 May 2024 22:49:35 +0200
Labels: batch.kubernetes.io/controller-uid=7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
batch.kubernetes.io/job-name=kwasm-worker-spin-v2-install
controller-uid=7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
job-name=kwasm-worker-spin-v2-install
Annotations: <none>
Status: Failed
IP: 10.244.2.2
IPs:
IP: 10.244.2.2
Controlled By: Job/kwasm-worker-spin-v2-install
Init Containers:
downloader:
Container ID: containerd://7f63983e513efa392e3cc684bf53d2553aeb898b4bfe08fb22229fbae83406cb
Image: ghcr.io/spinkube/shim-downloader:latest-feat-add_shim_downloader
Image ID: ghcr.io/spinkube/shim-downloader@sha256:719f54c518fc0fc65abbe8ac27978ea188d13faee23530544faf9d622aa2be92
Port: <none>
Host Port: <none>
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 20 May 2024 22:49:40 +0200
Finished: Mon, 20 May 2024 22:49:42 +0200
Ready: True
Restart Count: 0
Environment:
SHIM_NAME: spin-v2
SHIM_LOCATION: https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz
Mounts:
/assets from shim-download (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wnr2x (ro)
Containers:
provisioner:
Container ID: containerd://92dd4c994b2fc95d269b5de630c00f55fff233d04d1d649a6b69ce512936278b
Image: ghcr.io/spinkube/node-installer:latest-feat-add_shim_downloader
Image ID: ghcr.io/spinkube/node-installer@sha256:fcbfa4d8197d3de3b9953219af6a8784f23abf7d798150b2c2a606daaeebe6df
Port: <none>
Host Port: <none>
Args:
install
-H
/mnt/node-root
-r
spin-v2
State: Terminated
Reason: Unknown
Exit Code: 255
Started: Mon, 20 May 2024 22:49:46 +0200
Finished: Mon, 20 May 2024 22:49:47 +0200
Ready: False
Restart Count: 0
Environment:
HOST_ROOT: /mnt/node-root
SHIM_FETCH_STRATEGY: /mnt/node-root
Mounts:
/assets from shim-download (rw)
/mnt/node-root from root-mount (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wnr2x (ro)
Conditions:
Type Status
PodReadyToStartContainers False
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
shim-download:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
root-mount:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
kube-api-access-wnr2x:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulling 25m kubelet Pulling image "ghcr.io/spinkube/shim-downloader:latest-feat-add_shim_downloader"
Normal Pulled 25m kubelet Successfully pulled image "ghcr.io/spinkube/shim-downloader:latest-feat-add_shim_downloader" in 4.108s (4.108s including waiting)
Normal Created 25m kubelet Created container downloader
Normal Started 25m kubelet Started container downloader
Normal Pulling 25m kubelet Pulling image "ghcr.io/spinkube/node-installer:latest-feat-add_shim_downloader"
Normal Pulled 25m kubelet Successfully pulled image "ghcr.io/spinkube/node-installer:latest-feat-add_shim_downloader" in 3.105s (3.105s including waiting)
Normal Created 25m kubelet Created container provisioner
Normal Started 25m kubelet Started container provisioner
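For anyone trying to reproduce this, the terminated state shown in the describe output above can also be read from the pod's JSON (`kubectl get pod <name> -o json`). The following is a minimal sketch: the embedded sample is hand-written and trimmed to the fields used here, with values taken from the output above, and `failed_containers` is a hypothetical helper name.

```python
import json

# Hand-written, trimmed sample shaped like `kubectl get pod <name> -o json`.
# Only the fields read below are included.
SAMPLE_POD_JSON = """
{
  "status": {
    "phase": "Failed",
    "containerStatuses": [
      {
        "name": "provisioner",
        "state": {"terminated": {"exitCode": 255, "reason": "Unknown"}}
      }
    ]
  }
}
"""


def failed_containers(pod: dict) -> list:
    """Return (name, reason, exit_code) for containers that exited non-zero."""
    results = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = cs.get("state", {}).get("terminated")
        if terminated and terminated.get("exitCode", 0) != 0:
            results.append(
                (cs["name"], terminated.get("reason"), terminated["exitCode"])
            )
    return results


pod = json.loads(SAMPLE_POD_JSON)
print(failed_containers(pod))
```

Piping real `kubectl` output into `json.load(sys.stdin)` instead of using the embedded sample should work the same way, assuming the pod still exists on the cluster.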
Full Job resource (e.g. for reproducing the bug)
apiVersion: batch/v1
kind: Job
metadata:
annotations:
kwasm.sh/nodeName: kwasm-worker
kwasm.sh/operation: install
kwasm.sh/shimName: spin-v2
labels:
kwasm-worker-spin-v2-install: "true"
kwasm.sh/job: "true"
kwasm.sh/operation: install
kwasm.sh/shimName: spin-v2
name: kwasm-worker-spin-v2-install
namespace: default
spec:
backoffLimit: 6
completionMode: NonIndexed
completions: 1
manualSelector: false
parallelism: 1
podReplacementPolicy: TerminatingOrFailed
selector:
matchLabels:
batch.kubernetes.io/controller-uid: 7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
suspend: false
template:
metadata:
creationTimestamp: null
labels:
batch.kubernetes.io/controller-uid: 7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
batch.kubernetes.io/job-name: kwasm-worker-spin-v2-install
controller-uid: 7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
job-name: kwasm-worker-spin-v2-install
spec:
containers:
- args:
- install
- -H
- /mnt/node-root
- -r
- spin-v2
env:
- name: HOST_ROOT
value: /mnt/node-root
- name: SHIM_FETCH_STRATEGY
value: /mnt/node-root
image: ghcr.io/spinkube/node-installer:latest-feat-add_shim_downloader
imagePullPolicy: IfNotPresent
name: provisioner
resources: {}
securityContext:
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /mnt/node-root
name: root-mount
- mountPath: /assets
name: shim-download
dnsPolicy: ClusterFirst
hostPID: true
initContainers:
- env:
- name: SHIM_NAME
value: spin-v2
- name: SHIM_LOCATION
value: https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz
image: ghcr.io/spinkube/shim-downloader:latest-feat-add_shim_downloader
imagePullPolicy: IfNotPresent
name: downloader
resources: {}
securityContext:
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /assets
name: shim-download
nodeName: kwasm-worker
restartPolicy: Never
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
volumes:
- emptyDir: {}
name: shim-download
- hostPath:
path: /
type: ""
name: root-mount
status:
completionTime: "2024-05-20T20:50:03Z"
conditions:
- lastProbeTime: "2024-05-20T20:50:03Z"
lastTransitionTime: "2024-05-20T20:50:03Z"
status: "True"
type: Complete
failed: 1
ready: 0
startTime: "2024-05-20T20:49:35Z"
succeeded: 1
terminating: 0
uncountedTerminatedPods: {}
While the goal of installing/uninstalling the shim is achieved, this behavior is not desirable and calls for a solution.