
test: Retry intra-pod networking test. #61


Merged: 1 commit, Feb 12, 2018

Conversation

@ijc (Contributor) commented Feb 12, 2018

It seems that pod DNS can take a while to settle, and trying too quickly after
the pods are up can fail with a DNS lookup failure.

So switch to a model where we retry for up to 60 seconds.

Also switch to curl for consistency with the external test; this also involves
an `apk add`, which is a nice test of external connectivity.

Signed-off-by: Ian Campbell <ijc@docker.com>

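The retry model described above can be sketched as a small shell helper. This is a hypothetical illustration, not the actual test harness: the `retry` function name and the 1-second poll interval are assumptions; only the 60-second overall deadline comes from the PR description.

```shell
#!/bin/sh
# Hypothetical sketch of the retry model: run a command repeatedly
# until it succeeds or a 60-second deadline passes.
retry() {
    deadline=$(( $(date +%s) + 60 ))
    until "$@"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "timed out waiting for: $*" >&2
            return 1
        fi
        sleep 1   # poll interval is an assumption
    done
}

# Usage inside the test pod might look like (illustrative only):
#   retry apk add curl            # also exercises external connectivity
#   retry curl -sSf http://peer-pod/
```

Wrapping both the `apk add` and the curl probe in the same retry loop tolerates slow-settling pod DNS as well as transient mirror errors.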
@ijc (Contributor, Author) commented Feb 12, 2018

With this and the fix for #60 the tests still fail, apparently spuriously, 15% of the time. Out of my 100 runs (results.zip) over the weekend I saw:

  • 6 boot failures, all producing no output from the linuxkit run, so not a simple "slow to boot" issue. (1x cri-bridge, 2x cri-weave, 1x docker-bridge, 2x docker-weave, which looks like a light correlation with weave, but I think that is likely to be a coincidence.)
  • 2 failures of the node to become ready (1x docker-weave, 1x cri-weave). These hit the 5-minute timeout, but interestingly in both cases the probing loop seems to have stopped well before then (after 57s and 27s); I'm unsure why this should be the case.
  • 7 failures of intra-pod networking, all on docker-weave, all actually failing the `apk add curl` with an `ERROR: http://dl-cdn.alpinelinux.org/alpine/v3.7/main: temporary error (try again later)`. Possibly mirror flakiness, although the failures were not obviously correlated in time:
    • 2018-02-09T19:30:18.791634567Z
    • 2018-02-10T05:33:36.773773384Z
    • 2018-02-10T11:57:40.60664368Z
    • 2018-02-10T13:39:11.198415771Z
    • 2018-02-10T14:12:12.241487352Z
    • 2018-02-10T17:00:17.277360205Z
    • 2018-02-10T19:31:43.52757571Z

I haven't actually gathered data with #60 but without this fix yet; it was failing a lot before (those tests are running now).

@ijc ijc merged commit cc58ae9 into linuxkit:master Feb 12, 2018
@ijc ijc deleted the workaround-dns-timeout-in-tests branch February 12, 2018 11:31
@ijc (Contributor, Author) commented Feb 12, 2018

> 6 boot failures, all producing no output from the linuxkit run,

I think I've just had this again while I was watching, and the strace is just a never-ending stream of

`[pid 18139] ioctl(22, KVM_RUN, 0)       = -1 EAGAIN (Resource temporarily unavailable)`

I suspect this is a problem with my host setup and not a problem with anything here.

@ijc (Contributor, Author) commented Feb 14, 2018

> I suspect this is a problem with my host setup and not a problem with anything here.

Just to close this one off, I updated my laptop (including kernel and QEMU) and I've now had 75 runs (of each of the 4 configs, so 300 boots) without seeing this boot hang.
