Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce runtime of Antrea e2e tests #2014

Closed
antoninbas opened this issue Mar 30, 2021 · 14 comments · Fixed by #2245
Closed

Reduce runtime of Antrea e2e tests #2014

antoninbas opened this issue Mar 30, 2021 · 14 comments · Fixed by #2245
Assignees
Labels
area/test/e2e Issues or PRs related to Antrea specific end-to-end testing. area/test/infra Issues or PRs related to test infrastructure (Jenkins configuration, Ansible playbook, Kind wrappers kind/task Categorizes issue or PR as related to a routine task that needs to be performed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@antoninbas
Copy link
Contributor

As we add more features and more tests, the runtime of tests has been increasing steadily.
We should review the test suite and investigate the following:

  • are there some obvious changes that can be made to speed up tests, notably in setup & teardown?
  • can some tests be run in parallel, e.g. by using separate test namespace instead of a single one? Maybe tests can be grouped by feature (e.g. NodePortLocal) and tests for a given feature can run in parallel
  • should we run a subset of e2e tests in Kind instead of all of them? It takes about one hour to run the testsuite in Kind at the moment.
@antoninbas antoninbas added area/test/e2e Issues or PRs related to Antrea specific end-to-end testing. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. area/test/infra Issues or PRs related to test infrastructure (Jenkins configuration, Ansible playbook, Kind wrappers kind/task Categorizes issue or PR as related to a routine task that needs to be performed labels Mar 30, 2021
@jianjuns
Copy link
Contributor

jianjuns commented Apr 5, 2021

We should consider:

should we run a subset of e2e tests in Kind instead of all of them? It takes about one hour to run the testsuite in Kind at the moment.
It can be for test cases in a single test too.

And maybe no need to repeat some tests in all modes (noEncap, hybrid, encap, ANP, noProxy, etc.). For ANP, could we always enable it? Of course these might not really save runtime as they run in parallel.

@luolanzone
Copy link
Contributor

@antoninbas @jianjuns after check the e2e execution time, Antrea policy test cases use around 15mins which is the most time-consuming test cases, and there are five e2e in parallel, I am thinking is it possible to move Antrea policy out of these cases and make it run in parallel?

1. E2e tests on a Kind cluster on Linux
2. E2e tests on a Kind cluster on Linux with AntreaProxy disabled
3. E2e tests on a Kind cluster on Linux (noEncap)
4. E2e tests on a Kind cluster on Linux (hybrid)
5. E2e tests on a Kind cluster on Linux with Antrea-native policies disabled

the new list maybe like below:

1. E2e tests on a Kind cluster on Linux with Antrea-native policies only
2. E2e tests on a Kind cluster on Linux with Antrea-native policies disabled
3. E2e tests on a Kind cluster on Linux with AntreaProxy, Antrea-native policies disabled
4. E2e tests on a Kind cluster on Linux (noEncap) with Antrea-native policies only
5. E2e tests on a Kind cluster on Linux (noEncap) with Antrea-native policies disabled
6. E2e tests on a Kind cluster on Linux (hybrid) with Antrea-native policies only
7. E2e tests on a Kind cluster on Linux (hybrid) with Antrea-native policies disabled

@antoninbas
Copy link
Contributor Author

I don't think that creating more jobs that run in parallel is necessarily the solution. In the end the project only gets a limited number of concurrent workers, and adding more parallel jobs may actually make things slower across Pull Requests. Instead we should 1) see if improvements to the framework and test cases can be made to speed up tests, 2) consider reducing the number of tests we run (reduce redundancy). Basically, the improvements I suggested when I opened the issue.

I also want to point out that if one of the jobs fail, we have to run all the jobs again, which is one more reason why increasing parallelism can make running the tests slower. All jobs have a dependency on the job that builds the Antrea image (to avoid the extra work of building the image multiple times), so restarting one Kind job actually means re-running all of them. That's also a good idea for improvement: can we change the job configuration so that 1) the Antrea image is only built once (and not once per job), 2) it's possible to re-run a single (failed) job without re-running all of them.

@luolanzone
Copy link
Contributor

@antoninbas , I think there are actually two major issues in your comment, one is the overall e2e time is too long, another one is when we have a job failed, we have to rerun all jobs. I will figure out if there is a way to rerun one job only if any parallel job failed.
I compared a few e2e execution logs, most of test cases are around 1min or less, I doubt there is an efficient way to reduce the time, so I picked up the longest one TestAntreaPolicy and tried here: https://github.com/antrea-io/antrea/actions/runs/915168994, it can help to reduce the overall time to less than 1 hour. you may compare with this one: https://github.com/antrea-io/antrea/actions/runs/901797589.
I believe split them and run in parallel is at least an option for us to reduce overall e2e time, meanwhile, as you say, we can reduce redundancy and improve the framework.

@luolanzone
Copy link
Contributor

hmm, after check the github community, I am afraid there is no way to rerun single job at the moment: https://github.community/t/re-run-jobs/16145, they are using issue actions/runner#432 to track this requirement, but looks like it's still on going.
one option for us is to split workflow for the e2e.

@antoninbas
Copy link
Contributor Author

antoninbas commented Jun 8, 2021

I don't really see a significant difference between https://github.com/antrea-io/antrea/actions/runs/907309257 (recent run on master, 1h15) and https://github.com/antrea-io/antrea/actions/runs/915168994 (your PR, 1h13). Am I missing something?

https://github.com/antrea-io/antrea/actions/runs/915168994 does not seem to be "normal"; sometimes CI is slower I think

I compared a few e2e execution logs, most of test cases are around 1min or less, I doubt there is an efficient way to reduce the time

If we can reduce the execution time of every test by 30% by making some changes to setup / teardown / resource creation, I think it would make a difference. I wonder if we can run some tests in parallel.

@luolanzone
Copy link
Contributor

luolanzone commented Jun 8, 2021

ah, I am checking the each e2e execution item's time instead of total duration, hmm, you are right, there is no significant difference for the total duration. 😔
and yes, I believe we can run some tests in parallel which have multiple test data. it's on my todo list to refine those test cases.

@luolanzone
Copy link
Contributor

luolanzone commented Jun 9, 2021

After check the github doc, it shows the maximum jobs for free plan is 20, I think it's worthy to reduce the total jobs we used in our workflows which can help to release runner quicker when there are multiple PRs, here is a sample change: https://github.com/antrea-io/antrea/pull/2245/files#diff-678682767f2477de3e3c546746f8568b9a1942b2c647d32331d7e774b8ff8d9f

There are also a few areas I have made to reduce the e2e time:

  1. add t.Parallel() for those tests with multiple test data.
  2. use subtest to reorganize most e2e test files so they can share the same setup/teardown.
  3. Set NetworkPolicy and AntreaPolicy as long run test cases.
  4. Considering there are some test cases are actually no difference in encap/noencap/hybird, so I use a "skipIfHasEnvNotDefault" to skip them in noencap/hybird jobs. Here is the list, Could you help to double check it's a full list or not? thanks.
    • antctl_test
    • bandwidth_test
    • basic_test
    • flowaggregator_test
    • proxy_test
    • security_test
    • supportbuddle_test

for now, there will be 6 parallel jobs for e2e as below:

  1. E2e tests without long run cases on a Kind cluster on Linux
  2. E2e tests with long run cases only on a Kind cluster on Linux
  3. E2e tests on a Kind cluster on Linux with AntreaProxy disabled
  4. E2e tests on a Kind cluster on Linux (noEncap)
  5. E2e tests on a Kind cluster on Linux (hybrid)
  6. E2e tests on a Kind cluster on Linux with Antrea-native policies disabled

the original "E2e tests on a Kind cluster on Linux" e2e are split into above 1 and 2, so long run cases will be run separately.
for 3~6, they are just the same as before exclude those default cases which have no difference in different modes.

Here is an e2e job result with above implementation https://github.com/antrea-io/antrea/actions/runs/921247639, the longest e2e job is 45ms now, but for total duration, looks like it totally depends on how many PRs are running in a close window, unfortunately when this one is triggered, there are multiple PRs running. I don't think there is a way to avoid such cases unless we upgrade the github plan, eg: Pro plan has 40 concurrent jobs https://docs.github.com/en/actions/reference/usage-limits-billing-and-administration

@lzhecheng Could you also help to review above ways to reduce the e2e time? or let me know if you have more ideas about this area. thanks.

@antoninbas
Copy link
Contributor Author

I wouldn't necessary assume that tests for Service proxy or the Flow Aggregator have no dependency on the traffic mode (encap, noEncap). They don't have an explicit dependency that's true, but I'm not sure that means we shouldn't validate these features for difference traffic modes. There can always be unexpected dependencies or interactions.

@lzhecheng
Copy link
Contributor

I think these methods will help but we should be very careful about the fourth. I suggest now beginning with the first 3 and invite more to discuss the fourth.

Besides these, I think this will also help reduce e2e runtime. Skip method skipIfAntreaPolicyDisabled() like this one is called after setupTest(t), which is a waste of time. Currently, we get ConfigMap field value with API so this skip method cannot be put before setupTest(t). To improve this, I think you can parse antrea.yml at the beginning to tell if this testcase can be skipped or not.
https://github.com/antrea-io/antrea/blob/release-1.1/test/e2e/antreapolicy_test.go#L2709

@luolanzone
Copy link
Contributor

@lzhecheng I have moved some skip methods like skipIfAntreaPolicyDisabled() and skipIfProxyDisabled before setupTest(t)
@antoninbas I also moved Service proxy or the Flow Aggregator out of the default list.
The e2e result for the new implementation is here https://github.com/antrea-io/antrea/actions/runs/937670984, the total e2e time are still around 15m less comparing with other jobs eg: https://github.com/antrea-io/antrea/actions/runs/937012490. but the total concurrent jobs are less now in workflows.

@tnqn @jianjuns @mengdie-song @gran-vmv do you have any opinions on the item 4 in this comment? #2014 (comment). the full list of e2e case is here:

antctl_test.go
antreapolicy_test.go
bandwidth_test.go
basic_test.go
batch_test.go
clustergroup_test.go
connectivity_test.go
egress_test.go
flowaggregator_test.go
framework_test.go
ipsec_test.go
legacyantreapolicy_test.go
legacyclustergroup_test.go
main_test.go
networkpolicy_test.go
nodeportlocal_test.go
performance_test.go
prometheus_test.go
proxy_test.go
security_test.go
service_test.go
supportbundle_test.go
tls_test.go
traceflow_test.go
upgrade_test.go

@jianjuns
Copy link
Contributor

@tnqn @jianjuns @mengdie-song @gran-vmv do you have any opinions on the item 4 in this comment? #2014 (comment). the full list of e2e case is here:

I feel might be ok to skip some tests for other traffic mode, as long as they are covered by daily (?) and release tests. But might be better keep proxy_test for noEncap.

@gran-vmv
Copy link
Contributor

@tnqn @jianjuns @mengdie-song @gran-vmv do you have any opinions on the item 4 in this comment? #2014 (comment). the full list of e2e case is here:

Traceflow e2e should cover all the different environments, but I think you can try to run it in parallel, this is disabled in current version and not tested.

@luolanzone
Copy link
Contributor

I suppose the new default list can be as below items?

antctl_test
bandwidth_test
basic_test
batch_test
security_test
supportbuddle_test
tls_test

luolanzone added a commit to luolanzone/antrea that referenced this issue Jul 27, 2021
merge some jobs in workflow files go.yml so we can have
less concurrent jobs for one PR to reduce github runner waiting time.

Part of antrea-io#2014

Signed-off-by: Lan Luo <luola@vmware.com>
luolanzone added a commit to luolanzone/antrea that referenced this issue Jul 27, 2021
merge some jobs in workflow files go.yml so we can have
less concurrent jobs for one PR to reduce github runner waiting time.

Part of antrea-io#2014

Signed-off-by: Lan Luo <luola@vmware.com>
luolanzone added a commit to luolanzone/antrea that referenced this issue Jul 30, 2021
merge some jobs in workflow files go.yml so we can have
less concurrent jobs for one PR to reduce github runner waiting time.

Part of antrea-io#2014

Signed-off-by: Lan Luo <luola@vmware.com>
luolanzone added a commit to luolanzone/antrea that referenced this issue Aug 2, 2021
merge some jobs in workflow files go.yml so we can have
less concurrent jobs for one PR to reduce github runner waiting time.

Part of antrea-io#2014

Signed-off-by: Lan Luo <luola@vmware.com>
@tnqn tnqn closed this as completed in #2245 Aug 2, 2021
luolanzone added a commit to luolanzone/antrea that referenced this issue Aug 3, 2021
merge some jobs in workflow files go.yml so we can have
less concurrent jobs for one PR to reduce github runner waiting time.

Part of antrea-io#2014

Signed-off-by: Lan Luo <luola@vmware.com>
antoninbas pushed a commit that referenced this issue Aug 3, 2021
Merge some jobs in workflow files go.yml so we can have
less concurrent jobs for one PR to reduce github runner waiting time.

Part of #2014

Signed-off-by: Lan Luo <luola@vmware.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/test/e2e Issues or PRs related to Antrea specific end-to-end testing. area/test/infra Issues or PRs related to test infrastructure (Jenkins configuration, Ansible playbook, Kind wrappers kind/task Categorizes issue or PR as related to a routine task that needs to be performed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants