Reduce runtime of Antrea e2e tests #2014
Comments
We should consider:
Also, there may be no need to repeat some tests in all modes (noEncap, hybrid, encap, ANP, noProxy, etc.). For ANP, could we always enable it? Of course, these changes might not really save runtime, as they run in parallel. |
@antoninbas @jianjuns After checking the e2e execution times, the Antrea policy test cases take around 15 minutes, which makes them the most time-consuming test cases, and there are five e2e jobs running in parallel. Would it be possible to move the Antrea policy tests out of these jobs and run them in parallel as a separate job?
The new list may look like the one below:
|
I don't think that creating more jobs that run in parallel is necessarily the solution. In the end the project only gets a limited number of concurrent workers, and adding more parallel jobs may actually make things slower across Pull Requests. Instead we should 1) see if improvements to the framework and test cases can be made to speed up tests, 2) consider reducing the number of tests we run (reduce redundancy). Basically, the improvements I suggested when I opened the issue. I also want to point out that if one of the jobs fails, we have to run all the jobs again, which is one more reason why increasing parallelism can make running the tests slower. All jobs have a dependency on the job that builds the Antrea image (to avoid the extra work of building the image multiple times), so restarting one Kind job actually means re-running all of them. That's also a good idea for improvement: can we change the job configuration so that 1) the Antrea image is only built once (and not once per job), 2) it's possible to re-run a single (failed) job without re-running all of them. |
@antoninbas, I think there are actually two major issues in your comment: one is that the overall e2e time is too long, the other is that when one job fails, we have to rerun all jobs. I will figure out whether there is a way to rerun only the failed job when one of the parallel jobs fails. |
Hmm, after checking the GitHub community forum, I am afraid there is no way to rerun a single job at the moment: https://github.community/t/re-run-jobs/16145. They are using issue actions/runner#432 to track this requirement, but it looks like it is still ongoing. |
I don't really see a significant difference between https://github.com/antrea-io/antrea/actions/runs/907309257 (recent run on master, 1h15) and https://github.com/antrea-io/antrea/actions/runs/915168994 (your PR, 1h13). Am I missing something? https://github.com/antrea-io/antrea/actions/runs/915168994 does not seem to be "normal"; sometimes CI is slower I think
If we can reduce the execution time of every test by 30% by making some changes to setup / teardown / resource creation, I think it would make a difference. I wonder if we can run some tests in parallel. |
Ah, I was checking the time of each individual e2e item instead of the total duration. Hmm, you are right, there is no significant difference in the total duration. 😔 |
After checking the GitHub docs, the maximum number of concurrent jobs for the free plan is 20. I think it's worth reducing the total number of jobs used in our workflows, which can help release runners more quickly when there are multiple PRs. Here is a sample change: https://github.com/antrea-io/antrea/pull/2245/files#diff-678682767f2477de3e3c546746f8568b9a1942b2c647d32331d7e774b8ff8d9f There are also a few changes I have made to reduce the e2e time:
For now, there will be 6 parallel jobs for e2e, as below:
The original "E2e tests on a Kind cluster on Linux" job is split into 1 and 2 above, so that long-running cases run separately. Here is an e2e job result with the above implementation: https://github.com/antrea-io/antrea/actions/runs/921247639. The longest e2e job now takes 45 minutes, but the total duration seems to depend entirely on how many PRs are running in a close window; unfortunately, when this one was triggered, multiple PRs were running. I don't think there is a way to avoid such cases unless we upgrade the GitHub plan, e.g. the Pro plan has 40 concurrent jobs: https://docs.github.com/en/actions/reference/usage-limits-billing-and-administration @lzhecheng Could you also help review the above ways to reduce the e2e time? Or let me know if you have more ideas in this area. Thanks. |
I wouldn't necessarily assume that tests for Service proxy or the Flow Aggregator have no dependency on the traffic mode (encap, noEncap). They don't have an explicit dependency, that's true, but I'm not sure that means we shouldn't validate these features for different traffic modes. There can always be unexpected dependencies or interactions. |
I think these methods will help, but we should be very careful about the fourth. I suggest beginning with the first 3 now and inviting more people to discuss the fourth. Besides these, I think this will also help reduce e2e runtime. Skip method |
@lzhecheng I have moved some skip methods, such as skipIfAntreaPolicyDisabled() and skipIfProxyDisabled(), before setupTest(t). @tnqn @jianjuns @mengdie-song @gran-vmv Do you have any opinions on item 4 in this comment? #2014 (comment). The full list of e2e cases is here:
|
I feel it might be OK to skip some tests for other traffic modes, as long as they are covered by daily (?) and release tests. But it might be better to keep proxy_test for noEncap. |
Traceflow e2e should cover all the different environments, but I think you can try to run it in parallel; this is disabled in the current version and not tested. |
I suppose the new default list could be the items below?
|
merge some jobs in workflow files go.yml so we can have less concurrent jobs for one PR to reduce github runner waiting time. Part of antrea-io#2014 Signed-off-by: Lan Luo <luola@vmware.com>
Merge some jobs in workflow files go.yml so we can have less concurrent jobs for one PR to reduce github runner waiting time. Part of #2014 Signed-off-by: Lan Luo <luola@vmware.com>
As we add more features and more tests, the runtime of tests has been increasing steadily.
We should review the test suite and investigate the following: