Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use envoy xds server type in daily e2e tests #4100

Merged

Conversation

sunjayBhatia
Copy link
Member

@sunjayBhatia sunjayBhatia commented Oct 12, 2021

Run e2e tests daily against envoy xds server type that uses
go-control-plane.

Once we are confident it is stable, we can flip the default to be the
envoy go-control-plane variant and run the daily tests with the contour
variant until we drop support for it.

@codecov
Copy link

codecov bot commented Oct 12, 2021

Codecov Report

Merging #4100 (a725b08) into main (1ee363b) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #4100   +/-   ##
=======================================
  Coverage   77.72%   77.72%           
=======================================
  Files         112      112           
  Lines        9995     9995           
=======================================
  Hits         7769     7769           
  Misses       2041     2041           
  Partials      185      185           

@sunjayBhatia sunjayBhatia force-pushed the xds-server-type-envoy branch 2 times, most recently from ed87ba5 to d57d0b7 Compare October 21, 2021 21:29
@sunjayBhatia
Copy link
Member Author

looks like envoy has a hard time sometimes connecting to the local contour in the e2e tests in envoy grpc server mode

@sunjayBhatia
Copy link
Member Author

[2021-10-21 21:34:29.342][1][warning][config] [source/common/config/grpc_subscription_impl.cc:118] gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.listener.v3.Listener

@sunjayBhatia
Copy link
Member Author

bumping the bootstrap connect timeout seems to have helped, it's hardcoded at 5s for now, maybe should be configurable

@sunjayBhatia
Copy link
Member Author

going back through git history, seems like a 5s connect timeout was somewhat arbitrarily chosen

begs the question why the go-control-plane version takes longer to connect to, but we could bump it or make it configurable

@sunjayBhatia
Copy link
Member Author

hm, even with bumping the connect timeout, seems fetches fail with the same Envoy log as above

@sunjayBhatia
Copy link
Member Author

sunjayBhatia commented Oct 22, 2021

default "initial fetch timeout" is 15s, should be enough to fetch initial clusters and listeners, increasing we still get intermittently stuck on the initial fetch

@sunjayBhatia
Copy link
Member Author

when leader election is disabled looks like the first e2e test always fails, we're not responding to the initial Cluster/Listener request at all, subsequent tests pass when the first Contour exits and Envoy reconnects to the instance that comes up for the next test

@sunjayBhatia
Copy link
Member Author

Looks like the state of the world in how Contour sets up xDS etc. is as below, which explains why the e2e tests fail when you just swap over the xDS server type:

  • With leader election enabled
    • The leader election process causes the event handler to be fired, even without any create/delete/change events to fire:
    • This ensures snapshots are generated for envoy xDS mode, everything "just works" bc the event handler populates the resources/cache the go-control-plane server watches are following
    • Envoy starts up just fine bc it gets the dynamic listener/route config needed to start up and healthcheck properly running in k8s
  • With leader election disabled
    • In contour xDS mode, the server has direct access to xDS resources and the stream loop starts off in a state that allows it to respond to the first discovery request immediately, without needing to wait for a dag rebuild/event handler update:
    • In envoy xDS mode, nothing actually triggers the event handler on startup, except changes to resources in the cluster
      • This means in the e2e tests, Envoy is stuck waiting for the dynamic configs it needs to start up to ensure the pods are healthy, which never happens since there is no dag rebuild etc. triggered
      • go-control-plane watches updates to resources via snapshot generation, snapshot generation never happens bc event handler is never triggered

@sunjayBhatia
Copy link
Member Author

in general contour startup with the envoy xDS server type is pretty racy, the server initialization has to happen strictly before a dag rebuild happens, otherwise any OnChange calls will not trigger a snapshot to be generated

it happens to work when leader election is enabled bc that coincidentally takes longer typically than xDS server initialization

@youngnick
Copy link
Member

This sounds like another reason we need to fix up the initialization process and have a way to ensure the ordering for some goroutines?

@github-actions
Copy link

Marking this PR stale since there has been no activity for 14 days. It will be closed if there is no activity for another 30 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 16, 2021
@sunjayBhatia sunjayBhatia removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 22, 2021
@github-actions
Copy link

github-actions bot commented Dec 7, 2021

Marking this PR stale since there has been no activity for 14 days. It will be closed if there is no activity for another 30 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 7, 2021
@sunjayBhatia sunjayBhatia removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 7, 2021
@sunjayBhatia
Copy link
Member Author

recent refactors have made #4100 (comment) obsolete, the event handler is now notified of leader election when it is enabled and disabled through the leadership notifier runnable that triggers it to run an update

@sunjayBhatia sunjayBhatia added the release-note/none-required Marks a PR as not requiring a release note. Should only be used for very small changes. label Dec 15, 2021
@sunjayBhatia sunjayBhatia changed the title WIP: Use envoy xds server type in e2e tests Use envoy xds server type in e2e tests Dec 15, 2021
@sunjayBhatia sunjayBhatia marked this pull request as ready for review December 15, 2021 22:39
@sunjayBhatia sunjayBhatia requested a review from a team as a code owner December 15, 2021 22:39
@sunjayBhatia sunjayBhatia requested review from skriss and youngnick and removed request for a team December 15, 2021 22:39
Copy link
Member

@skriss skriss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

One other idea here would be to set up a nightly run of the full E2E suite using the "envoy" control plane, to get regular full coverage on it without blowing up our E2E matrix for PRs.

@sunjayBhatia
Copy link
Member Author

LGTM.

One other idea here would be to set up a nightly run of the full E2E suite using the "envoy" control plane, to get regular full coverage on it without blowing up our E2E matrix for PRs.

yeah i like that better actually, going to update to do that instead of the smoke tests

@sunjayBhatia sunjayBhatia changed the title Use envoy xds server type in e2e tests Use envoy xds server type in nightly e2e tests Jan 4, 2022
@sunjayBhatia sunjayBhatia changed the title Use envoy xds server type in nightly e2e tests Use envoy xds server type in daily e2e tests Jan 4, 2022
Run e2e tests daily against envoy xds server type that uses
go-control-plane.

Once we are confident it is stable, we can flip the default to be the
envoy go-control-plane variant and run the daily tests with the contour
variant until we drop support for it.

Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
So we see a notification during the day

Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Copy link
Member

@skriss skriss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, LGTM.

I'm now thinking maybe we should re-jigger #4126 to also run its tests in this daily run, instead of running a small subset of the E2E's against an Envoy deployment for every PR, for similar reasons as for this one. Can ponder/tackle that separately.

Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
@sunjayBhatia sunjayBhatia merged commit 0f03a6f into projectcontour:main Jan 6, 2022
@sunjayBhatia sunjayBhatia deleted the xds-server-type-envoy branch January 6, 2022 18:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note/none-required Marks a PR as not requiring a release note. Should only be used for very small changes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants