wait container unable to start in windows on version >2.12.6 #5376

Closed
mweibel opened this issue Mar 12, 2021 · 11 comments · Fixed by #5462
Labels
area/build Build or GithubAction/CI issues area/windows Windows Container support type/bug

@mweibel
Contributor

mweibel commented Mar 12, 2021

Summary

I'm currently running Argo Workflows 2.12.0-rc2 and tried to upgrade to 2.12.9 a few days ago.
The workflows started failing because of the wait container, even though they are submitted in exactly the same way.

After deploying each version from the working 2.12.0-rc2 up to 2.12.10, I found that the issue first appears in 2.12.7 (so 2.12.6 is the last working version).

The error is related to workflows running on Windows and can be reproduced by running:

$ kubectl run argotest10 \
  --image=argoproj/argoexec:v2.12.10 \
  --overrides='{ "apiVersion": "v1", "spec": { "template": { "spec": { "nodeSelector": { "kubernetes.io/os": "windows" } } } } }' \
  -- \
  version

Error message I see:

Error response from daemon: container argotest10 encountered an error during hcsshim::System::Start: failure in a Windows system call: The virtual machine or container exited unexpectedly. (0xc0370106)

I'm unsure what the issue could be, as not much changed in 2.12.7 (v2.12.6...v2.12.7; basically only #4946 seems remotely relevant, but I don't think it's the cause).
Has there been some update to the way Windows containers are built?

Diagnostics

What Kubernetes provider are you using?
Rancher


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@simster7
Member

Paging our Windows expert @lippertmarkus

@lippertmarkus
Member

lippertmarkus commented Mar 12, 2021

Just tried the most basic workflow (https://argoproj.github.io/argo-workflows/windows/#schedule-hybrid-workflows) on v2.12.7 without a problem. Does that also not work for you, or does it only happen with a more complex workflow? Maybe something with volumes?

Are you using the docker executor?

Also, could you please provide the Windows build and Docker version? To me the error looks more like a problem with the host/container build version or the setup in general. Here's what I tried with:

kubectl get node -o wide
NAME                          STATUS   ROLES   AGE    VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION     CONTAINER-RUNTIME
akswin000000                  Ready    agent   9m6s   v1.18.14   10.1.1.35     <none>        Windows Server 2019 Datacenter   10.0.17763.1757    docker://19.3.14

@mweibel
Contributor Author

mweibel commented Mar 16, 2021

It may well be a problem with the setup, as it's quite customized.

» k get node -o wide
NAME     STATUS   ROLES               AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                    KERNEL-VERSION    CONTAINER-RUNTIME
mynode   Ready    worker              4m12s   v1.19.4   10.30.0.42    <none>        Windows Server Datacenter   10.0.18363.1198   docker://19.3.14

The difference I see is mostly the OS Image. I'll investigate this part then.
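
For reference, the third component of KERNEL-VERSION is the Windows build number, and the two nodes shown so far are on different builds. A minimal sketch, assuming only the two builds seen in this thread (17763 corresponds to version 1809, 18363 to version 1909); `win_release` is a hypothetical helper name, not part of kubectl or Argo:

```shell
# Hypothetical helper: map the build number (third component of a Windows
# kernel version) to its release name. Only the two builds that appear in
# this thread are covered; anything else is reported as unknown.
win_release() {
  build=$(printf '%s\n' "$1" | cut -d. -f3)
  case "$build" in
    17763) echo 1809 ;;
    18363) echo 1909 ;;
    *)     echo "unknown ($build)" ;;
  esac
}

win_release 10.0.17763.1757   # KERNEL-VERSION of akswin000000
win_release 10.0.18363.1198   # KERNEL-VERSION of mynode
```

The input strings are copied from the two `get node -o wide` outputs rather than queried from a cluster, so the sketch runs without kubectl.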

@mweibel
Contributor Author

mweibel commented Mar 16, 2021

The question is why it works in 2.12.6 but not in 2.12.7.
Did anything change in the build process?

From what I can see, the Docker image is built on Windows version 1809, but that was also the case for 2.12.6.

@lippertmarkus
Member

@mweibel Microsoft also updates the 1809 images with (security) fixes. The old 2.12.6 image may not have had some of them at the time it was created, so you would also need to compare the full build and revision number of the two images.

@mweibel
Contributor Author

mweibel commented Mar 16, 2021

>  docker run --rm --platform windows --entrypoint cmd -it argoproj/argoexec:v2.12.6-windows
Microsoft Windows [Version 10.0.17763.1637]

>  docker run --rm --platform windows --entrypoint cmd -it argoproj/argoexec:v2.12.7-windows
Microsoft Windows [Version 10.0.17763.1697]

KB4598230 is the difference, seemingly. Wondering if that really is the issue or not.

Either I downgrade our nodes to a similar old version or try building argoexec using a newer version too.
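
The comparison of the two banners can be scripted. A minimal sketch, assuming the banner format printed by the two `docker run` commands above; `win_build_rev` is a hypothetical helper, and the banner strings are copied from that output rather than fetched from the images:

```shell
# Hypothetical helper: print "build.revision" from a cmd.exe banner such as
# "Microsoft Windows [Version 10.0.17763.1637]". Prints nothing if the
# input does not contain a banner in that format.
win_build_rev() {
  printf '%s\n' "$1" |
    sed -n 's/.*\[Version [0-9]*\.[0-9]*\.\([0-9]*\)\.\([0-9]*\)].*/\1.\2/p'
}

old=$(win_build_rev 'Microsoft Windows [Version 10.0.17763.1637]')  # v2.12.6-windows
new=$(win_build_rev 'Microsoft Windows [Version 10.0.17763.1697]')  # v2.12.7-windows

# ${x%%.*} keeps the build number, ${x#*.} keeps the revision.
if [ "${old%%.*}" = "${new%%.*}" ] && [ "${old#*.}" != "${new#*.}" ]; then
  echo "same build ${old%%.*}, revision ${old#*.} -> ${new#*.}"
fi
```

This only confirms that the two images share build 17763 and differ in revision; it says nothing about whether that revision change is the actual cause.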

@lippertmarkus
Member

good question, I'm curious 😄

@mweibel
Contributor Author

mweibel commented Mar 16, 2021

I have now built the images myself; I needed to upgrade Go to 1.13.6 to fix an issue with Go on Windows.
The newly built images report the following build number:

Microsoft Windows [Version 10.0.17763.1817]

Running those on my setup (using the kubectl run example above) does not work: same error for both 2.12.6 and 2.12.7.

This means v2.12.6 only worked because of an older Windows build version. I'll try building with 1909 as the base and report what happens. If that doesn't work either, I'll rebuild my nodes on newer versions and see how that goes...

Edit:

Microsoft Windows [Version 10.0.18363.1440]

$ k logs argotest7-1909-try2
argoexec: latest+5f51507.dirty
  BuildDate: 2021-03-16T13:35:58Z
  GitCommit: 5f5150730c644865a5867bf017100732f55811dd
  GitTreeState: dirty
  GitTag: v2.12.7
  GoVersion: go1.13.6
  Compiler: gc
  Platform: windows/amd64

With 1909 as the base it seems to work (I still need to verify by actually deploying Argo with that version). Seemingly 1809 and 1909 are not compatible, despite Microsoft saying it should not matter.

Could Argo additionally build 1909-based images?

@alexec alexec added this to the v2.12 milestone Mar 16, 2021
@lippertmarkus
Member

Rather difficult: GitHub Actions only provides a 1809 runner.

@mweibel
Contributor Author

mweibel commented Mar 18, 2021

Oh, true.

Windows containers are such a pain (especially if you're used to non-Windows systems ;)).
I didn't find a roadmap or anything similar indicating that they will add a newer runner.

I guess we'll stick to building our own images for the near future then.
I'm not sure what you'd like to do with this issue. From my side it can be closed, as I don't see anything Argo can do in this regard, except maybe document the supported Windows versions somewhere?

@lippertmarkus
Member

I can relate. According to Microsoft, differences in the revision shouldn't affect container functionality, but this is an example where that statement unfortunately doesn't hold.

Do you want to add that to the Limitations (https://argoproj.github.io/argo-workflows/windows/#limitations)?

mweibel added a commit to mweibel/argo that referenced this issue Mar 22, 2021
fixes argoproj#5376

Signed-off-by: Michael Weibel <michael@helio.exchange>
@agilgur5 agilgur5 added area/windows Windows Container support area/build Build or GithubAction/CI issues labels Jun 6, 2024