
ARM64 and ARM airgap image tarballs should not contain AMD64 images #1285

Closed
gbritton opened this issue Jan 9, 2020 · 14 comments

Comments

@gbritton

gbritton commented Jan 9, 2020

Version:
v1.17.0+k3s.1

Describe the bug
ARM64 and ARM airgap image tarballs contain some AMD64 images instead of the correct arch.

To Reproduce
Examine the released files. I unpacked k3s-airgap-images-arm.tar and examined its content with:
for d in *.json; do [ $d != manifest.json ] && jq .architecture < $d; done

Expected behavior
Should only see "arm" for all images.

Actual behavior
"amd64"
"arm"
"arm"
"arm"
"amd64"
"amd64"
"arm"

Additional context
This appears to have been broken for a while, maybe it never worked correctly on arm/arm64.

@erikwilson
Contributor

Thanks for the info @gbritton! We definitely need to dig deeper into this. Strangely, it appears that even though the metadata says amd64, the extracted binaries look to be the appropriate architecture. Since we are using a simple docker command to download and save those images, it appears to be an upstream docker issue (https://github.com/rancher/k3s/blob/master/scripts/package-airgap).
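
For context, that script's approach amounts to a pull-and-save loop; a minimal sketch under that assumption (the image list file and tarball name here are illustrative, not the script's actual contents):

  # Rough sketch of a docker-based airgap packaging loop (illustrative, not the real script)
  images=$(grep -v '^#' image-list.txt)          # hypothetical list of images to bundle
  for image in $images; do
    docker pull "$image"                         # pulls whatever variant docker selects for the daemon's platform
  done
  docker save $images -o k3s-airgap-images.tar   # reuses the locally stored manifests/configs as-is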

@gbritton
Author

These may be related: #1278 and #1094 (I've seen a few other complaints online, but these at least are filed here). They appear to be people hitting containerd errors when using the airgapped images on arm64. In my limited toying with "ctr i import airgap-images.tar", it fails on the mixed images. Curiously, this appears to succeed back on v0.9.1, and the few missing images were simply pulled from the internet. With newer versions I end up with containerd complaining about missing objects, I'm guessing because importing improper data left things in a bad state. I wish I had time to debug this further, but this is at least the impression I've gotten from the little digging I've done.
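
For reference, the failing import step can be reproduced by hand against k3s's embedded containerd; a minimal example, assuming the arm64 tarball from the releases page:

  # Import the airgap tarball into k3s's bundled containerd and watch for "not found" blob errors
  sudo k3s ctr images import ./k3s-airgap-images-arm64.tar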

@gbritton
Author

@erikwilson looking at package-airgap I see it uses docker pull <image> to grab things... should this be adding the --platform <arch> argument?
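
For what it's worth, docker pull does accept a --platform flag (experimental on some Docker releases from that era), so the pull could in principle be pinned to an architecture; a hedged example, using metrics-server purely as an illustration:

  # Explicitly request the arm64 variant rather than the daemon's default platform
  docker pull --platform linux/arm64 rancher/metrics-server:v0.3.6

Note this only changes which variant gets pulled; whether docker save then produces a self-consistent archive is a separate question, as discussed later in the thread.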

@gbritton
Author

[Screenshot attached: Screen Shot 2020-01-14 at 09 32 24]

This may be related... it looks like the metrics-server image, at least, is misbuilt; pulling that image shows it contains an amd64 rather than an arm64 binary.

@shariperryman

I am also having the same problem. The three images that have amd64 as the architecture are coredns, pause, and metrics-server. It causes container creation to hang because it can't resolve the platform type.

@gbritton
Author

I think I understand what the issue is (and have worked around it in my own setup, which gathers airgap images independently). The multi-arch manifests refer to, say, amd64, arm, and arm64 images. Doing just "docker save" simply dumps the existing manifest as-is into the archive along with the layers for the single architecture. When this is loaded into containerd, the manifest still refers to additional blobs that aren't needed, but containerd complains that they aren't found. My workaround is to use skopeo in my build system to pull images down into local storage and then export it all as an OCI bundle with self-consistent manifests. Unfortunately, it's not cleanly separable and isn't a drop-in fix for what the k3s release build does, but hopefully this helps others fix it.

  # Copy each image listed in ${AIRGAP_FILES} (comments and blank lines stripped)
  # into a local OCI layout containing only the host architecture's blobs.
  for ref in $(sed -e 's/#.*//;/^[[:space:]]*$/d' ${AIRGAP_FILES}); do
    skopeo --policy ${WORKDIR}/policy.json --override-arch ${HOST_GOARCH} copy \
      docker://$ref oci:${WORKDIR}/oci-images:$ref
  done
  # Pack the OCI layout into a reproducible tarball (fixed ownership and mtime).
  tar --numeric-owner --owner=0 --group=0 \
      --mtime @${REPRODUCIBLE_TIMESTAMP_ROOTFS} -C ${WORKDIR}/oci-images \
      -cf ${WORKDIR}/${AIRGAP_TAR} oci-layout blobs index.json
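
As a sanity check on the resulting OCI layout, something along these lines (reusing the same variables as above) can confirm which architecture actually got copied for each image:

  # Report the architecture recorded in each copied image's config
  for ref in $(sed -e 's/#.*//;/^[[:space:]]*$/d' ${AIRGAP_FILES}); do
    arch=$(skopeo --override-arch ${HOST_GOARCH} inspect oci:${WORKDIR}/oci-images:$ref | jq -r .Architecture)
    echo "$ref: $arch"
  done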

@shariperryman

shariperryman commented Feb 27, 2020 via email

@ilovemilk

@shariperryman Thanks a lot, this also worked for me.

Nevertheless, is there any news on this topic?

@bmarinov

I just faceplanted into this issue. The airgapped install seems to be broken and all pods in the system namespace are stuck at ContainerCreating, with "failed to resolve rootfs" errors and all.

1008 kuberuntime_sandbox.go:68] CreatePodSandbox for pod "coredns-8655855d6-lxb65_kube-system(a9bf61ea-c193-4fa9-944d-cd838424f0c2)" failed: rpc error: code = NotFound desc = failed to create containerd container: error unpacking image: failed to resolve rootfs: content digest sha256:e11a8cbeda8688fdc3a68dc246d6c945aa272b29f8dd94d0ea370d30c2674042: not found

Any updates or a recommended workaround?

@davidnuzik
Contributor

We were targeting a fix for v1.18.7 today, but due to complications we are going to postpone this, because the images we would need to repush to fix this are also used by RKE. To safely ensure we don't break production, we need a more methodical and thorough approach that encompasses testing BOTH RKE and K3s. HOWEVER, we may be able to fix this out of band, and it is on our radar to try to get to this soon, just not in time for today's new patch releases.

We believe we need to do the following:

  1. Repush metrics-server, coredns, and pause to a different Docker Hub org, say ranchertest
  2. Assert both normal and airgap install works for RKE1 with those new images
  3. Assert normal and airgap install works for K3s on armhf, arm64, and amd64 with those new images
  4. Repush the images again, but to the rancher org this time
  5. Run RKE1 and K3s smoke tests to ensure everything is in order

Our apologies for the delay on this, but we believe it's in everybody's best interests (we don't want to break people in production).

I have also taken note of this issue in our August patch releases issue here: #2113

@davidnuzik
Contributor

@rancher-max wait for #1908 to get in before testing. These are intertwined, so we should just wait.

@brandond
Member

I will note that several of the upstream images are still misconfigured and claim to be the wrong arch, but are in fact correct. We are working around this by telling containerd to not skip loading layers for images that are incorrectly configured.

@rancher-max
Contributor

Validated using v1.19.1-rc2+k3s1

OS used (uname -a): Linux ip-172-31-10-146 5.4.0-1022-aws #22-Ubuntu SMP Wed Aug 12 13:52:46 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux

  • Previously (v1.18.3+k3s1), the problematic images appeared to be: rancher/metrics-server:v0.3.6, rancher/coredns-coredns:1.6.3, and rancher/pause:3.1. Also, the image for busybox was not present. The result of the images having a problem was that a few pods would not come up successfully and would be stuck in ContainerCreating:
kube-system   helm-install-traefik-qg4hh               0/1     ContainerCreating   0          4m8s
kube-system   coredns-8655855d6-dktnp                  0/1     ContainerCreating   0          4m8s
kube-system   local-path-provisioner-6d59f47c7-cd9d5   0/1     ContainerCreating   0          4m8s
kube-system   metrics-server-7566d596c8-sznr2          0/1     ContainerCreating   0          4m8s
  • Validated cluster comes up successfully
$ k3s kubectl get nodes,pods -A
NAME                    STATUS   ROLES    AGE   VERSION
node/ip-172-31-10-146   Ready    master   46m   v1.19.1-rc2+k3s1
node/ip-172-31-12-195   Ready    <none>   45m   v1.19.1-rc2+k3s1

NAMESPACE     NAME                                         READY   STATUS      RESTARTS   AGE
kube-system   pod/helm-install-traefik-hdfnt               0/1     Completed   0          46m
kube-system   pod/local-path-provisioner-7ff9579c6-chxmr   1/1     Running     0          46m
kube-system   pod/coredns-66c464876b-fdh28                 1/1     Running     0          46m
kube-system   pod/metrics-server-7b4f8b595-tn6w4           1/1     Running     0          46m
kube-system   pod/svclb-traefik-qr82d                      2/2     Running     0          45m
kube-system   pod/svclb-traefik-hgzbz                      2/2     Running     0          45m
kube-system   pod/traefik-5dd496474-qrtz4                  1/1     Running     0          45m
  • Validated list of images present:
$ sudo k3s crictl images
IMAGE                                      TAG                 IMAGE ID            SIZE
docker.io/rancher/coredns-coredns          1.6.9               af51a588dff59       41MB
docker.io/rancher/klipper-helm             v0.3.0              4001cb2c385ce       140MB
docker.io/rancher/klipper-lb               v0.1.2              9be4f056f04b7       6.21MB
docker.io/rancher/library-busybox          1.31.1              19d689bc58fd6       1.6MB
docker.io/rancher/library-traefik          1.7.19              1cdb7e2bd5e25       83.6MB
docker.io/rancher/local-path-provisioner   v0.0.14             2b703ea309660       40.2MB
docker.io/rancher/metrics-server           v0.3.6              f9499facb1e8c       39.6MB
docker.io/rancher/pause                    3.1                 6cf7c80fe4444       529kB
  • Validated the architecture for each image. As @brandond noted, this is not necessarily all arm64 at this point, which is a bit confusing. (Did this by running cat manifest.json | grep -i config and then cat <configvalue> | grep architecture on all 8 results and checking whether each shows as arm64; a scripted version of this check is sketched after this list.)
    • rancher/library-busybox:1.31.1: arm64
    • rancher/library-traefik:1.7.19: arm64
    • rancher/local-path-provisioner:v0.0.14: arm64
    • rancher/metrics-server:v0.3.6: amd64
    • rancher/pause:3.1: amd64
    • rancher/coredns-coredns:1.6.9: arm64
    • rancher/klipper-helm:v0.3.0: arm64
    • rancher/klipper-lb:v0.1.2: arm64
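
A scripted version of the same architecture check against an extracted docker-save tarball (assuming the manifest.json layout that docker save produced at the time) might look like:

  # Extract the airgap archive and print the architecture recorded in each image config
  mkdir airgap && tar -xf k3s-airgap-images-arm64.tar -C airgap && cd airgap
  for cfg in $(jq -r '.[].Config' manifest.json); do
    echo "$cfg: $(jq -r .architecture < "$cfg")"
  done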

@brandond
Member

brandond commented Sep 16, 2020

That's correct @rancher-max, and is what I was getting at in #1285 (comment). Upstream managed to build them such that they have the correct arch in the manifest list, and the correct arch for the binaries, but the wrong arch in the image config json. We are fine just ignoring this for now. Upstream has fixed this in newer releases of the images, but we don't want to bump those at the moment.
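
To see the mismatch being described here, one way (just one possible tooling choice; docker manifest inspect may require the experimental CLI) is to compare the platform entries in the manifest list with the architecture recorded in the image config:

  # Architecture(s) advertised by the multi-arch manifest list
  docker manifest inspect rancher/metrics-server:v0.3.6 | jq '.manifests[].platform'
  # Architecture recorded in the arm64 image's config blob
  skopeo --override-arch arm64 inspect --config docker://rancher/metrics-server:v0.3.6 | jq .architecture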
