Reduce runtime of Antrea e2e tests #2014

antoninbas · 2021-03-30T23:17:20Z

As we add more features and more tests, the runtime of tests has been increasing steadily.
We should review the test suite and investigate the following:

are there some obvious changes that can be made to speed up tests, notably in setup & teardown?
can some tests be run in parallel, e.g. by using separate test namespace instead of a single one? Maybe tests can be grouped by feature (e.g. NodePortLocal) and tests for a given feature can run in parallel
should we run a subset of e2e tests in Kind instead of all of them? It takes about one hour to run the testsuite in Kind at the moment.

jianjuns · 2021-04-05T17:17:37Z

We should consider:

should we run a subset of e2e tests in Kind instead of all of them? It takes about one hour to run the testsuite in Kind at the moment.
It can be for test cases in a single test too.

And maybe no need to repeat some tests in all modes (noEncap, hybrid, encap, ANP, noProxy, etc.). For ANP, could we always enable it? Of course these might not really save runtime as they run in parallel.

luolanzone · 2021-06-07T02:12:43Z

@antoninbas @jianjuns after check the e2e execution time, Antrea policy test cases use around 15mins which is the most time-consuming test cases, and there are five e2e in parallel, I am thinking is it possible to move Antrea policy out of these cases and make it run in parallel?

1. E2e tests on a Kind cluster on Linux
2. E2e tests on a Kind cluster on Linux with AntreaProxy disabled
3. E2e tests on a Kind cluster on Linux (noEncap)
4. E2e tests on a Kind cluster on Linux (hybrid)
5. E2e tests on a Kind cluster on Linux with Antrea-native policies disabled

the new list maybe like below:

1. E2e tests on a Kind cluster on Linux with Antrea-native policies only
2. E2e tests on a Kind cluster on Linux with Antrea-native policies disabled
3. E2e tests on a Kind cluster on Linux with AntreaProxy, Antrea-native policies disabled
4. E2e tests on a Kind cluster on Linux (noEncap) with Antrea-native policies only
5. E2e tests on a Kind cluster on Linux (noEncap) with Antrea-native policies disabled
6. E2e tests on a Kind cluster on Linux (hybrid) with Antrea-native policies only
7. E2e tests on a Kind cluster on Linux (hybrid) with Antrea-native policies disabled

antoninbas · 2021-06-07T18:07:17Z

I don't think that creating more jobs that run in parallel is necessarily the solution. In the end the project only gets a limited number of concurrent workers, and adding more parallel jobs may actually make things slower across Pull Requests. Instead we should 1) see if improvements to the framework and test cases can be made to speed up tests, 2) consider reducing the number of tests we run (reduce redundancy). Basically, the improvements I suggested when I opened the issue.

I also want to point out that if one of the jobs fail, we have to run all the jobs again, which is one more reason why increasing parallelism can make running the tests slower. All jobs have a dependency on the job that builds the Antrea image (to avoid the extra work of building the image multiple times), so restarting one Kind job actually means re-running all of them. That's also a good idea for improvement: can we change the job configuration so that 1) the Antrea image is only built once (and not once per job), 2) it's possible to re-run a single (failed) job without re-running all of them.

luolanzone · 2021-06-08T01:35:25Z

@antoninbas , I think there are actually two major issues in your comment, one is the overall e2e time is too long, another one is when we have a job failed, we have to rerun all jobs. I will figure out if there is a way to rerun one job only if any parallel job failed.
I compared a few e2e execution logs, most of test cases are around 1min or less, I doubt there is an efficient way to reduce the time, so I picked up the longest one TestAntreaPolicy and tried here: https://github.com/antrea-io/antrea/actions/runs/915168994, it can help to reduce the overall time to less than 1 hour. you may compare with this one: https://github.com/antrea-io/antrea/actions/runs/901797589.
I believe split them and run in parallel is at least an option for us to reduce overall e2e time, meanwhile, as you say, we can reduce redundancy and improve the framework.

luolanzone · 2021-06-08T01:59:21Z

hmm, after check the github community, I am afraid there is no way to rerun single job at the moment: https://github.community/t/re-run-jobs/16145, they are using issue actions/runner#432 to track this requirement, but looks like it's still on going.
one option for us is to split workflow for the e2e.

antoninbas · 2021-06-08T02:02:52Z

I don't really see a significant difference between https://github.com/antrea-io/antrea/actions/runs/907309257 (recent run on master, 1h15) and https://github.com/antrea-io/antrea/actions/runs/915168994 (your PR, 1h13). Am I missing something?

https://github.com/antrea-io/antrea/actions/runs/915168994 does not seem to be "normal"; sometimes CI is slower I think

I compared a few e2e execution logs, most of test cases are around 1min or less, I doubt there is an efficient way to reduce the time

If we can reduce the execution time of every test by 30% by making some changes to setup / teardown / resource creation, I think it would make a difference. I wonder if we can run some tests in parallel.

luolanzone · 2021-06-08T02:09:11Z

ah, I am checking the each e2e execution item's time instead of total duration, hmm, you are right, there is no significant difference for the total duration. 😔
and yes, I believe we can run some tests in parallel which have multiple test data. it's on my todo list to refine those test cases.

luolanzone · 2021-06-09T08:26:05Z

After check the github doc, it shows the maximum jobs for free plan is 20, I think it's worthy to reduce the total jobs we used in our workflows which can help to release runner quicker when there are multiple PRs, here is a sample change: https://github.com/antrea-io/antrea/pull/2245/files#diff-678682767f2477de3e3c546746f8568b9a1942b2c647d32331d7e774b8ff8d9f

There are also a few areas I have made to reduce the e2e time:

add t.Parallel() for those tests with multiple test data.
use subtest to reorganize most e2e test files so they can share the same setup/teardown.
Set NetworkPolicy and AntreaPolicy as long run test cases.
Considering there are some test cases are actually no difference in encap/noencap/hybird, so I use a "skipIfHasEnvNotDefault" to skip them in noencap/hybird jobs. Here is the list, Could you help to double check it's a full list or not? thanks.
- antctl_test
- bandwidth_test
- basic_test
- flowaggregator_test
- proxy_test
- security_test
- supportbuddle_test

for now, there will be 6 parallel jobs for e2e as below:

E2e tests without long run cases on a Kind cluster on Linux
E2e tests with long run cases only on a Kind cluster on Linux
E2e tests on a Kind cluster on Linux with AntreaProxy disabled
E2e tests on a Kind cluster on Linux (noEncap)
E2e tests on a Kind cluster on Linux (hybrid)
E2e tests on a Kind cluster on Linux with Antrea-native policies disabled

the original "E2e tests on a Kind cluster on Linux" e2e are split into above 1 and 2, so long run cases will be run separately.
for 3~6, they are just the same as before exclude those default cases which have no difference in different modes.

Here is an e2e job result with above implementation https://github.com/antrea-io/antrea/actions/runs/921247639, the longest e2e job is 45ms now, but for total duration, looks like it totally depends on how many PRs are running in a close window, unfortunately when this one is triggered, there are multiple PRs running. I don't think there is a way to avoid such cases unless we upgrade the github plan, eg: Pro plan has 40 concurrent jobs https://docs.github.com/en/actions/reference/usage-limits-billing-and-administration

@lzhecheng Could you also help to review above ways to reduce the e2e time? or let me know if you have more ideas about this area. thanks.

antoninbas · 2021-06-09T20:19:27Z

I wouldn't necessary assume that tests for Service proxy or the Flow Aggregator have no dependency on the traffic mode (encap, noEncap). They don't have an explicit dependency that's true, but I'm not sure that means we shouldn't validate these features for difference traffic modes. There can always be unexpected dependencies or interactions.

lzhecheng · 2021-06-10T06:43:38Z

I think these methods will help but we should be very careful about the fourth. I suggest now beginning with the first 3 and invite more to discuss the fourth.

Besides these, I think this will also help reduce e2e runtime. Skip method skipIfAntreaPolicyDisabled() like this one is called after setupTest(t), which is a waste of time. Currently, we get ConfigMap field value with API so this skip method cannot be put before setupTest(t). To improve this, I think you can parse antrea.yml at the beginning to tell if this testcase can be skipped or not.
https://github.com/antrea-io/antrea/blob/release-1.1/test/e2e/antreapolicy_test.go#L2709

luolanzone · 2021-06-15T02:34:42Z

@lzhecheng I have moved some skip methods like skipIfAntreaPolicyDisabled() and skipIfProxyDisabled before setupTest(t)
@antoninbas I also moved Service proxy or the Flow Aggregator out of the default list.
The e2e result for the new implementation is here https://github.com/antrea-io/antrea/actions/runs/937670984, the total e2e time are still around 15m less comparing with other jobs eg: https://github.com/antrea-io/antrea/actions/runs/937012490. but the total concurrent jobs are less now in workflows.

@tnqn @jianjuns @mengdie-song @gran-vmv do you have any opinions on the item 4 in this comment? #2014 (comment). the full list of e2e case is here:

antctl_test.go
antreapolicy_test.go
bandwidth_test.go
basic_test.go
batch_test.go
clustergroup_test.go
connectivity_test.go
egress_test.go
flowaggregator_test.go
framework_test.go
ipsec_test.go
legacyantreapolicy_test.go
legacyclustergroup_test.go
main_test.go
networkpolicy_test.go
nodeportlocal_test.go
performance_test.go
prometheus_test.go
proxy_test.go
security_test.go
service_test.go
supportbundle_test.go
tls_test.go
traceflow_test.go
upgrade_test.go

jianjuns · 2021-06-15T18:15:42Z

@tnqn @jianjuns @mengdie-song @gran-vmv do you have any opinions on the item 4 in this comment? #2014 (comment). the full list of e2e case is here:

I feel might be ok to skip some tests for other traffic mode, as long as they are covered by daily (?) and release tests. But might be better keep proxy_test for noEncap.

gran-vmv · 2021-06-16T01:30:30Z

@tnqn @jianjuns @mengdie-song @gran-vmv do you have any opinions on the item 4 in this comment? #2014 (comment). the full list of e2e case is here:

Traceflow e2e should cover all the different environments, but I think you can try to run it in parallel, this is disabled in current version and not tested.

luolanzone · 2021-06-16T01:36:59Z

I suppose the new default list can be as below items?

antctl_test
bandwidth_test
basic_test
batch_test
security_test
supportbuddle_test
tls_test

merge some jobs in workflow files go.yml so we can have less concurrent jobs for one PR to reduce github runner waiting time. Part of antrea-io#2014 Signed-off-by: Lan Luo <luola@vmware.com>

Merge some jobs in workflow files go.yml so we can have less concurrent jobs for one PR to reduce github runner waiting time. Part of #2014 Signed-off-by: Lan Luo <luola@vmware.com>

tnqn mentioned this issue Apr 27, 2021

Improve Antrea-native Policy CRD schema verification #2125

Merged

luolanzone self-assigned this Jun 4, 2021

antoninbas mentioned this issue Jun 7, 2021

Reduce e2e overall time #2245

Merged

luolanzone mentioned this issue Jun 16, 2021

[CI]add a workflow to cancel previous running workflow #2279

Merged

luolanzone mentioned this issue Jul 27, 2021

merge jobs in go.yml #2474

Merged

tnqn closed this as completed in #2245 Aug 2, 2021

antoninbas pushed a commit that referenced this issue Aug 3, 2021

Merge jobs in go.yml (#2474)

57ea0c5

Merge some jobs in workflow files go.yml so we can have less concurrent jobs for one PR to reduce github runner waiting time. Part of #2014 Signed-off-by: Lan Luo <luola@vmware.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce runtime of Antrea e2e tests #2014

Reduce runtime of Antrea e2e tests #2014

antoninbas commented Mar 30, 2021

jianjuns commented Apr 5, 2021

luolanzone commented Jun 7, 2021

antoninbas commented Jun 7, 2021

luolanzone commented Jun 8, 2021

luolanzone commented Jun 8, 2021

antoninbas commented Jun 8, 2021 •

edited

Loading

luolanzone commented Jun 8, 2021 •

edited

Loading

luolanzone commented Jun 9, 2021 •

edited

Loading

antoninbas commented Jun 9, 2021

lzhecheng commented Jun 10, 2021

luolanzone commented Jun 15, 2021

jianjuns commented Jun 15, 2021

gran-vmv commented Jun 16, 2021

luolanzone commented Jun 16, 2021

Reduce runtime of Antrea e2e tests #2014

Reduce runtime of Antrea e2e tests #2014

Comments

antoninbas commented Mar 30, 2021

jianjuns commented Apr 5, 2021

luolanzone commented Jun 7, 2021

antoninbas commented Jun 7, 2021

luolanzone commented Jun 8, 2021

luolanzone commented Jun 8, 2021

antoninbas commented Jun 8, 2021 • edited Loading

luolanzone commented Jun 8, 2021 • edited Loading

luolanzone commented Jun 9, 2021 • edited Loading

antoninbas commented Jun 9, 2021

lzhecheng commented Jun 10, 2021

luolanzone commented Jun 15, 2021

jianjuns commented Jun 15, 2021

gran-vmv commented Jun 16, 2021

luolanzone commented Jun 16, 2021

antoninbas commented Jun 8, 2021 •

edited

Loading

luolanzone commented Jun 8, 2021 •

edited

Loading

luolanzone commented Jun 9, 2021 •

edited

Loading