Skip to content

Fix build issues for Windows image. #1953

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Mar 5, 2025

Conversation

tonyzhc
Copy link
Contributor

@tonyzhc tonyzhc commented Feb 25, 2025

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind failing-test

What this PR does / why we need it:
Our Windows tests have been failing, because a newer version of buildx generates a manifest list for multi-arch images. Set this flag so that we keep the original behavior. This will help us verify that the upstream Kubernetes fix for kubernetes/kubernetes#129084 is working.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:
Note that this is the same issue as https://stackoverflow.com/questions/75521775/buildx-docker-image-claims-to-be-a-manifest-list, and we may benefit from upgrading to the new containerd image store (see https://docs.docker.com/desktop/features/containerd/) in the long term.

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Feb 25, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 25, 2025
@tonyzhc
Copy link
Contributor Author

tonyzhc commented Feb 25, 2025

/retest

@mauriciopoppe
Copy link
Member

I see that presubmits went beyond the manifest error, good job and thanks!

Before:

DOCKER_CLI_EXPERIMENTAL=enabled docker manifest create --amend gcr.io/k8s-infra-e2e-boskos-026/gcp-persistent-disk-csi-driver:433628cf-76f9-4ec6-9194-e01c8af005df gcr.io/k8s-infra-e2e-boskos-026/gcp-persistent-disk-csi-driver:433628cf-76f9-4ec6-9194-e01c8af005df_linux_amd64 gcr.io/k8s-infra-e2e-boskos-026/gcp-persistent-disk-csi-driver:433628cf-76f9-4ec6-9194-e01c8af005df_linux_arm64 gcr.io/k8s-infra-e2e-boskos-026/gcp-persistent-disk-csi-driver:433628cf-76f9-4ec6-9194-e01c8af005df_ltsc2019
gcr.io/k8s-infra-e2e-boskos-026/gcp-persistent-disk-csi-driver:433628cf-76f9-4ec6-9194-e01c8af005df_linux_amd64 is a manifest list
make: *** [Makefile:64: build-and-push-multi-arch] Error 1

After:

DOCKER_CLI_EXPERIMENTAL=enabled docker manifest create --amend gcr.io/k8s-infra-e2e-boskos-007/gcp-persistent-disk-csi-driver:6d8bebd4-0814-4b04-a5ee-6073fded8e23 gcr.io/k8s-infra-e2e-boskos-007/gcp-persistent-disk-csi-driver:6d8bebd4-0814-4b04-a5ee-6073fded8e23_linux_amd64 gcr.io/k8s-infra-e2e-boskos-007/gcp-persistent-disk-csi-driver:6d8bebd4-0814-4b04-a5ee-6073fded8e23_linux_arm64 gcr.io/k8s-infra-e2e-boskos-007/gcp-persistent-disk-csi-driver:6d8bebd4-0814-4b04-a5ee-6073fded8e23_ltsc2019
Created manifest list gcr.io/k8s-infra-e2e-boskos-007/gcp-persistent-disk-csi-driver:6d8bebd4-0814-4b04-a5ee-6073fded8e23
STAGINGIMAGE="gcr.io/k8s-infra-e2e-boskos-007/gcp-persistent-disk-csi-driver" STAGINGVERSION="6d8bebd4-0814-4b04-a5ee-6073fded8e23" WINDOWS_IMAGE_TAGS="ltsc2019" WINDOWS_BASE_IMAGES="mcr.microsoft.com/windows/servercore:ltsc2019" ./manifest_osversion.sh

The next error is on fixing the CSI Driver Windows deployment:

F0225 05:23:22.884359    8885 main.go:228] Failed to run integration test: failed to install CSI Driver: Windows deployment failed to come up: exit status 255

@mauriciopoppe
Copy link
Member

/uncc @leiyiz @tyuchn
/cc

@k8s-ci-robot k8s-ci-robot requested review from mauriciopoppe and removed request for leiyiz and tyuchn February 25, 2025 15:56
@tonyzhc
Copy link
Contributor Author

tonyzhc commented Feb 25, 2025

Thank you for looking @mauriciopoppe. I think I'm seeing a bunch of image pull errors in the PD CSI deployment:

  Normal   Pulling    13m (x4 over 15m)     kubelet            Pulling image "gcr.io/k8s-infra-e2e-boskos-007/gcp-persistent-disk-csi-driver:6d8bebd4-0814-4b04-a5ee-6073fded8e23"
  Warning  Failed     13m (x4 over 15m)     kubelet            Error: ErrImagePull
  Warning  BackOff    5m18s (x29 over 14m)  kubelet            Back-off restarting failed container csi-driver-registrar in pod csi-gce-pd-node-win-xldlj_gce-pd-csi-driver(f4b21d6a-3259-4da4-a409-0c55892b852b)
  Normal   Pulled     39s (x7 over 14m)     kubelet            Container image "registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.9.3" already present on machine
  Normal   BackOff    11s (x74 over 15m)    kubelet            Error: ImagePullBackOff

Do you know how I can gain access to that image registry? Would like to see if there are more error details in the GCP console.

@sunnylovestiramisu
Copy link
Contributor

sunnylovestiramisu commented Feb 25, 2025

I saw that Pulling image "gcr.io/k8s-infra-e2e-boskos-007/gcp-persistent-disk-csi-driver:6d8bebd4-0814-4b04-a5ee-6073fded8e23", this should be changed to registry.k8s.io already?

@mauriciopoppe
Copy link
Member

@sunnylovestiramisu I think that's a red herring, if the multiarch image gcr.io/k8s-infra-e2e-boskos-007/gcp-persistent-disk-csi-driver was a bad image then it'd be bad for both Linux and it seems that it came up just fine in Linux.

In Linux it comes up just fine:

  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  15m   default-scheduler  Successfully assigned gce-pd-csi-driver/csi-gce-pd-node-j9pn2 to e2e-test-prow-minion-group-jt6r
  Normal  Pulling    15m   kubelet            Pulling image "gcr.io/k8s-infra-e2e-boskos-007/gcp-persistent-disk-csi-driver:6d8bebd4-0814-4b04-a5ee-6073fded8e23"
  Normal  Pulled     15m   kubelet            Successfully pulled image "gcr.io/k8s-infra-e2e-boskos-007/gcp-persistent-disk-csi-driver:6d8bebd4-0814-4b04-a5ee-6073fded8e23" in 8.816s (8.816s including waiting). Image size: 121669079 bytes.
  Normal  Created    15m   kubelet            Created container: gce-pd-driver
  Normal  Started    15m   kubelet            Started container gce-pd-driver
  Normal  Pulling    15m   kubelet            Pulling image "registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.9.3"
  Normal  Pulled     15m   kubelet            Successfully pulled image "registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.9.3" in 3.573s (3.573s including waiting). Image size: 10762572 bytes.
  Normal  Created    15m   kubelet            Created container: csi-driver-registrar
  Normal  Started    15m   kubelet            Started container csi-driver-registrar

In Windows it fails:

  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  15m                   default-scheduler  Successfully assigned gce-pd-csi-driver/csi-gce-pd-node-win-xldlj to e2e-test-prow-windows-node-group-d03x
  Normal   Pulling    15m                   kubelet            Pulling image "registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.9.3"
  Normal   Pulled     15m                   kubelet            Successfully pulled image "registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.9.3" in 20.364s (20.364s including waiting). Image size: 111142168 bytes.
  Normal   Created    14m (x3 over 15m)     kubelet            Created container: csi-driver-registrar
  Normal   Started    14m (x3 over 15m)     kubelet            Started container csi-driver-registrar
  Normal   Pulling    13m (x4 over 15m)     kubelet            Pulling image "gcr.io/k8s-infra-e2e-boskos-007/gcp-persistent-disk-csi-driver:6d8bebd4-0814-4b04-a5ee-6073fded8e23"
  Warning  Failed     13m (x4 over 15m)     kubelet            Error: ErrImagePull
  Warning  BackOff    5m18s (x29 over 14m)  kubelet            Back-off restarting failed container csi-driver-registrar in pod csi-gce-pd-node-win-xldlj_gce-pd-csi-driver(f4b21d6a-3259-4da4-a409-0c55892b852b)
  Normal   Pulled     39s (x7 over 14m)     kubelet            Container image "registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.9.3" already present on machine
  Normal   BackOff    11s (x74 over 15m)    kubelet            Error: ImagePullBackOff

I tried building the image locally and got the following warning:

make build-and-push-multi-arch GCE_PD_CSI_STAGING_VERSION=mauricio-2025-02-25 GCE_PD_CSI_STAGING_IMAGE=us-central1-docker.pkg.dev/$USER-gke-dev/gcp-persistent-disk-csi-driver/gcp-persistent-disk-csi-driver
...
- Requested platform "linux/amd64" does not match result platform "linux/arm64"
...

I'm not sure if there's something going on with the way the manifest is built. Could you please add a line to inspect the manifest in the Makefile?

build-and-push-multi-arch: build-and-push-container-linux-amd64 build-and-push-container-linux-arm64 build-and-push-windows-container-ltsc2019
	$(DOCKER) manifest create --amend $(STAGINGIMAGE):$(STAGINGVERSION) $(STAGINGIMAGE):$(STAGINGVERSION)_linux_amd64 $(STAGINGIMAGE):$(STAGINGVERSION)_linux_arm64 $(STAGINGIMAGE):$(STAGINGVERSION)_ltsc2019
	STAGINGIMAGE="$(STAGINGIMAGE)" STAGINGVERSION="$(STAGINGVERSION)" WINDOWS_IMAGE_TAGS="$(WINDOWS_IMAGE_TAGS)" WINDOWS_BASE_IMAGES="$(WINDOWS_BASE_IMAGES)" ./manifest_osversion.sh
	$(DOCKER) manifest push -p $(STAGINGIMAGE):$(STAGINGVERSION)
	$(DOCKER) manifest inspect $(STAGINGIMAGE):$(STAGINGVERSION)

@sunnylovestiramisu
Copy link
Contributor

I am looking at test/run-windows-k8s-integration.sh, it is using E2E_GOOGLE_APPLICATION_CREDENTIALS same as in run-k8s-integration.sh.

@tonyzhc
Copy link
Contributor Author

tonyzhc commented Feb 25, 2025

The manifest looks like

{
   "schemaVersion": 2,
   "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
   "manifests": [
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 14589,
         "digest": "sha256:6bb59c27e7a2d49b77697ad65e0c2d2e85465fc31eb182da33d683e426b7ee15",
         "platform": {
            "architecture": "amd64",
            "os": "linux"
         }
      },
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 14588,
         "digest": "sha256:e3d3cd5f69178dbce1c01c4d2be6c484c4761e635269b904a61ab25a14b3d095",
         "platform": {
            "architecture": "arm64",
            "os": "linux"
         }
      },
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 905,
         "digest": "sha256:073747e08787fe4c2ca7a96a13eb9c3eb20808be5f8e0c77902416f2540af508",
         "platform": {
            "architecture": "amd64",
            "os": "windows"
         }
      }
   ]
}

@mauriciopoppe
Copy link
Member

Thanks, the manifest looks good, I was looking for it to show the same digest as the exported Windows image.

In the logs there's a line pushing the Windows image:

#12 pushing manifest for gcr.io/k8s-infra-e2e-boskos-078/gcp-persistent-disk-csi-driver:b2f855e6-c869-435f-85f0-de24261689e8_ltsc2019@sha256:073747e08787fe4c2ca7a96a13eb9c3eb20808be5f8e0c77902416f2540af508 0.7s done

And the sha matches the digest in #1953 (comment)

      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 905,
         "digest": "sha256:073747e08787fe4c2ca7a96a13eb9c3eb20808be5f8e0c77902416f2540af508",
         "platform": {
            "architecture": "amd64",
            "os": "windows"
         }
      }

I made some changes in #1951, I upgraded the buildkit version to match the one we use internally and I also used docker buildx imagetools create to make the manifest instead of docker manifest, that's because docker manifest is experimental API so it might change in a way that's not compatible with our setup anymore (in fact, that's why we have https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/manifest_osversion.sh). It didn't finish running tests yet so I'll see how it goes.

@tonyzhc
Copy link
Contributor Author

tonyzhc commented Feb 26, 2025

My locally built image (make build-and-push-multi-arch) produces the same manifest and was able to be pulled successfully (docker pull ...). I'm going to walk through the code to check if we are deploying the driver especially the node SAs differently between Linux and Windows.

@tonyzhc
Copy link
Contributor Author

tonyzhc commented Feb 26, 2025

Was able to find this from audit logs

ErrImagePull: rpc error: code = NotFound desc = failed to pull and unpack image ... : no match for platform in manifest: not found

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 26, 2025
@tonyzhc tonyzhc force-pushed the test-windows branch 3 times, most recently from a20de83 to fa4379c Compare February 27, 2025 00:59
@sunnylovestiramisu
Copy link
Contributor

From Kubernetes repo:

    # TODO(dims): Ensure newer versions get uploaded to
    # - https://console.cloud.google.com/storage/browser/gke-release/winnode/node-problem-detector
    # - https://gcsweb.k8s.io/gcs/kubernetes-release/node-problem-detector/
    # and then the following references get fixed.

And in https://console.cloud.google.com/storage/browser/gke-release/winnode/node-problem-detector there is no node-problem-detector:v0.8.20

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 28, 2025
@sunnylovestiramisu sunnylovestiramisu changed the title Fix build issues for WIndows image. Fix build issues for Windows image. Feb 28, 2025
@tonyzhc
Copy link
Contributor Author

tonyzhc commented Feb 28, 2025

/retest

@tonyzhc
Copy link
Contributor Author

tonyzhc commented Mar 1, 2025

/retest

@mattcary
Copy link
Contributor

mattcary commented Mar 3, 2025

FYI, talking with @sunnylovestiramisu , we will merge this change after kubernetes/kubernetes#130499 goes into k/k (so we don't have the hardcoding hack that's currently in this PR).

@sunnylovestiramisu
Copy link
Contributor

/test pull-gcp-compute-persistent-disk-csi-driver-kubernetes-integration

@tonyzhc
Copy link
Contributor Author

tonyzhc commented Mar 4, 2025

FYI, talking with @sunnylovestiramisu , we will merge this change after kubernetes/kubernetes#130499 goes into k/k (so we don't have the hardcoding hack that's currently in this PR).

Previous PR has been merged. Cleaning up the hardcoded hack.

@tonyzhc
Copy link
Contributor Author

tonyzhc commented Mar 4, 2025

/retest


build-and-push-multi-arch: build-and-push-container-linux-amd64 build-and-push-container-linux-arm64 build-and-push-windows-container-ltsc2019
$(DOCKER) manifest create --amend $(STAGINGIMAGE):$(STAGINGVERSION) $(STAGINGIMAGE):$(STAGINGVERSION)_linux_amd64 $(STAGINGIMAGE):$(STAGINGVERSION)_linux_arm64 $(STAGINGIMAGE):$(STAGINGVERSION)_ltsc2019
$(DOCKER) manifest create $(STAGINGIMAGE):$(STAGINGVERSION) $(STAGINGIMAGE):$(STAGINGVERSION)_linux_amd64 $(STAGINGIMAGE):$(STAGINGVERSION)_linux_arm64 $(STAGINGIMAGE):$(STAGINGVERSION)_ltsc2019
STAGINGIMAGE="$(STAGINGIMAGE)" STAGINGVERSION="$(STAGINGVERSION)" WINDOWS_IMAGE_TAGS="$(WINDOWS_IMAGE_TAGS)" WINDOWS_BASE_IMAGES="$(WINDOWS_BASE_IMAGES)" ./manifest_osversion.sh
$(DOCKER) manifest push -p $(STAGINGIMAGE):$(STAGINGVERSION)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also update the build-and-push-multi-arch-debug to the same build command

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -59,6 +59,8 @@ func clusterDownGKE(gceZone, gceRegion string) error {
return nil
}

const hardcodeBuildChange = `/template:/,/spec:/s/spec:$/spec:\n\ \ \ \ \ \ nodeSelector:\n\ \ \ \ \ \ \ \ kubernetes.io\/os: linux/`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably do not need these two lines either?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@k8s-ci-robot
Copy link
Contributor

@tonyzhc: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-gcp-compute-persistent-disk-csi-driver-e2e-windows-2019 66d41ff link false /test pull-gcp-compute-persistent-disk-csi-driver-e2e-windows-2019

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@sunnylovestiramisu
Copy link
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 5, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sunnylovestiramisu, tonyzhc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 5, 2025
@k8s-ci-robot k8s-ci-robot merged commit 759a080 into kubernetes-sigs:master Mar 5, 2025
7 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants