ibmcloud: Peer pods fail during CreateContainer #1882

Open
stevenhorsman opened this issue Jun 24, 2024 · 5 comments

@stevenhorsman
Member

When creating an ibmcloud setup on a self-managed cluster, with both the s390x and amd64 architectures, the tests fail.

The pod describe looks like:

Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Pulling  49m (x19 over 134m)     kubelet  Pulling image "quay.io/prometheus/busybox:latest"
  Warning  Failed   23m (x23 over 133m)     kubelet  Error: failed to create containerd task: failed to create shim task: context deadline exceeded
  Warning  BackOff  4m25s (x482 over 132m)  kubelet  Back-off restarting failed container busybox in pod simple-test_coco-pp-e2e-test-94b410d8(171202c1-07d7-4f95-b541-b9dadc10dbbe)

and the CAA log shows an error during CreateContainer (which includes the pull image step):

2024/06/24 16:26:45 [adaptor/proxy]     storages:
2024/06/24 16:26:45 [adaptor/proxy]         mount_point:/run/kata-containers/702a9450e5570a71633834ec4c5f6f407100921862a9015d96038de8518df2f2/rootfs source:quay.io/prometheus/busybox:latest fstype:overlay driver:image_guest_pull
2024/06/24 16:26:49 [adaptor/proxy] CreateContainer fails: context deadline exceeded
time="2024-06-24T16:26:49Z" level=error msg="ttrpc: received message on inactive stream" stream=3603

I need to dig into the kata-agent logs and see if there is any more information about this.
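
To confirm from the CAA log that a container is using in-guest image pull, the storage line above can be split into its fields. This is a minimal sketch, assuming the `key:value` pairs are space-separated exactly as printed in this excerpt; `parse_storage_fields` is a hypothetical helper and not part of CAA.

```python
def parse_storage_fields(line: str) -> dict:
    """Split a CAA storage log line into key/value pairs.

    Assumes space-separated `key:value` tokens, as in the log excerpt above.
    """
    fields = {}
    for token in line.split():
        # Split on the first colon only: values such as image references
        # contain colons of their own.
        key, sep, value = token.partition(":")
        if sep:
            fields[key] = value
    return fields


# Storage line copied from the CAA log in this issue.
line = (
    "mount_point:/run/kata-containers/"
    "702a9450e5570a71633834ec4c5f6f407100921862a9015d96038de8518df2f2/rootfs "
    "source:quay.io/prometheus/busybox:latest fstype:overlay driver:image_guest_pull"
)
storage = parse_storage_fields(line)
print(storage["source"], storage["driver"])
```

A `driver` value of `image_guest_pull` means the image is pulled inside the pod VM, so the pull time counts against the CreateContainer deadline.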

@stevenhorsman
Member Author

Looking in the kata-agent log, it has this info message:

{"msg":"pull image \"docker.io/library/nginx@sha256:9700d098d545f9d2ee0660dfb155fe64f4447720a0a763a93f2cf08997227279\", bundle path \"/run/kata-containers/3a9d18335128ca98c7d1f9d86aaad6922c063eeff135ab977ea164fa5ff60dcf/images\"","level":"INFO","ts":"2024-06-26T13:08:03.03219308Z","name":"kata-agent","subsystem":"image","source":"agent","pid":"810","version":"0.1.0"}

from https://github.com/kata-containers/kata-containers/blob/893fd2b59cc31518f8a127c9611e3e8265d9bdfd/src/agent/src/image.rs#L160

But we never get anything back from image-rs's pull image, and then after 60s the container fails with context deadline exceeded. Unfortunately, image-rs doesn't seem to have any logging, so I'm not sure how to get more information on what is going wrong 😞
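
Since the kata-agent writes structured JSON, the start of a pull can be picked out of the log programmatically. This is a minimal sketch, assuming each entry is a single JSON object shaped like the one above; `pulled_image` is a hypothetical helper, and the `pull image "..."` message pattern is taken only from this log line.

```python
import json
import re

# Log entry copied from the kata-agent output in this issue.
raw = (
    r'{"msg":"pull image \"docker.io/library/nginx@sha256:'
    r'9700d098d545f9d2ee0660dfb155fe64f4447720a0a763a93f2cf08997227279\", '
    r'bundle path \"/run/kata-containers/'
    r'3a9d18335128ca98c7d1f9d86aaad6922c063eeff135ab977ea164fa5ff60dcf/images\"",'
    r'"level":"INFO","ts":"2024-06-26T13:08:03.03219308Z","name":"kata-agent",'
    r'"subsystem":"image","source":"agent","pid":"810","version":"0.1.0"}'
)


def pulled_image(entry: dict):
    """Return the image reference if this entry marks the start of a pull."""
    if entry.get("subsystem") != "image":
        return None
    match = re.search(r'pull image "([^"]+)"', entry.get("msg", ""))
    return match.group(1) if match else None


entry = json.loads(raw)
image = pulled_image(entry)
print(entry["ts"], image)
```

Pairing this start entry with a later success entry for the same container gives the actual pull time, which is the number to compare against the 60s deadline.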

@bpradipt
Member

> Looking in the kata-agent log, it has this info message:
>
> {"msg":"pull image \"docker.io/library/nginx@sha256:9700d098d545f9d2ee0660dfb155fe64f4447720a0a763a93f2cf08997227279\", bundle path \"/run/kata-containers/3a9d18335128ca98c7d1f9d86aaad6922c063eeff135ab977ea164fa5ff60dcf/images\"","level":"INFO","ts":"2024-06-26T13:08:03.03219308Z","name":"kata-agent","subsystem":"image","source":"agent","pid":"810","version":"0.1.0"}
>
> from https://github.com/kata-containers/kata-containers/blob/893fd2b59cc31518f8a127c9611e3e8265d9bdfd/src/agent/src/image.rs#L160
>
> But we never get anything back from image-rs's pull image, and then after 60s the container fails with context deadline exceeded. Unfortunately, image-rs doesn't seem to have any logging, so I'm not sure how to get more information on what is going wrong 😞

If it's using in-guest image pull, then can you try increasing the remote hypervisor timeout and the create container timeout - https://github.com/kata-containers/kata-containers/blob/main/src/runtime/config/configuration-remote.toml.in#L298 ?

@stevenhorsman
Member Author

> If it's using in-guest image pull, then can you try increasing the remote hypervisor timeout and the create container timeout - https://github.com/kata-containers/kata-containers/blob/main/src/runtime/config/configuration-remote.toml.in#L298 ?

Yeah, that's a good idea, but just pulling nginx shouldn't take more than 60s. In the past, when I've seen the timeout, it's only been on the containerd side, so the kata-agent has still come back for the image pull afterwards, which doesn't seem to be happening here.

@stevenhorsman
Member Author

stevenhorsman commented Jun 26, 2024

Okay - I stand corrected. It appears that the nginx pull took over 2 minutes:

Jun 26 13:51:30 podvm-nginx-55954c7c66-vptr5-bc08413b kata-agent[811]: {"msg":"pull image \"docker.io/library/nginx@sha256:9700d098d545f9d2ee0660dfb155fe64f4447720a0a763a93f2cf08997227279\", bundle path \"/run/kata-containers/3c0fc9e0c3634183117f4078d7be48cd3fbb70a8ecc0ea4243cf7cbdf5613aff/images\"","level":"INFO","ts":"2024-06-26T13:51:30.399304775Z","version":"0.1.0","name":"kata-agent","pid":"811","source":"agent","subsystem":"image"}
...
Jun 26 13:53:44 podvm-nginx-55954c7c66-vptr5-bc08413b kata-agent[811]: {"msg":"pull and unpack image \"sha256:dd6c8d4a8748039368f97fd52156d3fadf0ee481dc97d3063d74d9bc38681757\", cid: \"3c0fc9e0c3634183117f4078d7be48cd3fbb70a8ecc0ea4243cf7cbdf5613aff\" succeeded.","level":"INFO","ts":"2024-06-26T13:53:44.042495884Z","name":"kata-agent","version":"0.1.0","pid":"811","source":"agent","subsystem":"image"}

So I might not have waited long enough, or the containerd request cancelled it or something? So we have an ibmcloud performance issue, rather than a functional one. Thanks for nudging me into trying the timeout, Pradipta!
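
The two `ts` values above give the pull time directly. This is a minimal sketch of that arithmetic; `parse_agent_ts` and `pull_duration_seconds` are hypothetical helpers, and the 60s figure is the deadline observed earlier in this issue, not a value read from any configuration.

```python
from datetime import datetime, timezone


def parse_agent_ts(ts: str) -> datetime:
    """Parse a kata-agent "ts" value such as 2024-06-26T13:51:30.399304775Z."""
    whole, _, frac = ts.rstrip("Z").partition(".")
    # datetime only stores microseconds, so truncate the nanosecond fraction.
    micros = int(frac[:6].ljust(6, "0")) if frac else 0
    base = datetime.strptime(whole, "%Y-%m-%dT%H:%M:%S")
    return base.replace(microsecond=micros, tzinfo=timezone.utc)


def pull_duration_seconds(start_ts: str, end_ts: str) -> float:
    """Seconds elapsed between the pull-start and pull-succeeded entries."""
    return (parse_agent_ts(end_ts) - parse_agent_ts(start_ts)).total_seconds()


# Timestamps copied from the two kata-agent entries above.
start = "2024-06-26T13:51:30.399304775Z"
end = "2024-06-26T13:53:44.042495884Z"

OBSERVED_DEADLINE_S = 60  # the timeout seen in this issue
elapsed = pull_duration_seconds(start, end)
print(f"pull took {elapsed:.1f}s; over the deadline: {elapsed > OBSERVED_DEADLINE_S}")
```

The pull took roughly 134 seconds, more than double the 60s deadline, which is consistent with CreateContainer failing even though the pull itself eventually succeeded.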

@stevenhorsman
Member Author

stevenhorsman commented Jul 8, 2024

I'll note that I've just tried the 0.8.2 version of the code, and that fails with the same issue. As it worked three months ago when 0.8.2 was tested, I think there are potentially some IaaS networking changes or account issues getting in the way, and not necessarily a code change.
