
Fix bug in FlowAggregator when sending template set #2039

Merged: 1 commit merged into antrea-io:main on May 28, 2021

Conversation

@srikartati (Member) commented on Apr 7, 2021

We need to reset the set and prepare it as a template set before sending the template record. Otherwise, the Flow Aggregator crashes when retrying to establish the connection to the flow collector. The bug was the missing reset operation.

In addition, during this failure the test takes a long time to finish. Fix the e2e flow aggregator test timeouts: the timeout values are quite high when the tests fail intermittently on dual-stack clusters. Fix that by reducing the poll timeout when checking for logs on the ipfix-collector pod.

Fixes #2035

@srikartati (Member, Author): /test-ipv6-e2e

@codecov-io commented on Apr 7, 2021

Codecov Report

Merging #2039 (92ac8ba) into main (9a9804a) will increase coverage by 13.89%.
The diff coverage is n/a.


@@             Coverage Diff             @@
##             main    #2039       +/-   ##
===========================================
+ Coverage   41.69%   55.58%   +13.89%     
===========================================
  Files         127      265      +138     
  Lines       15844    19773     +3929     
===========================================
+ Hits         6606    10991     +4385     
+ Misses       8684     7616     -1068     
- Partials      554     1166      +612     
Flag            Coverage  Δ
kind-e2e-tests  41.86%    <ø> (?)
unit-tests      41.68%    <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pkg/apiserver/handlers/webhook/mutation_labels.go 24.71% <0.00%> (ø)
pkg/agent/util/ipset/ipset.go 61.53% <0.00%> (ø)
pkg/agent/interfacestore/interface_cache.go 81.37% <0.00%> (ø)
pkg/monitor/controller.go 35.07% <0.00%> (ø)
pkg/ipfix/ipfix_set.go 100.00% <0.00%> (ø)
...erinformation/v1beta1/clusterinformation_client.go 50.00% <0.00%> (ø)
...s/externalversions/clusterinformation/interface.go 0.00% <0.00%> (ø)
pkg/monitor/agent.go 33.33% <0.00%> (ø)
...ntroller/crdmirroring/crdhandler/externalentity.go 0.00% <0.00%> (ø)
...versioned/typed/security/v1alpha1/networkpolicy.go 0.00% <0.00%> (ø)
... and 209 more

@srikartati (Member, Author): /test-ipv6-e2e

@srikartati (Member, Author):

/test-all
/test-ipv6-e2e

@srikartati (Member, Author):

/test-ipv6-e2e
/test-ipv6-only-e2e

@srikartati (Member, Author) commented on Apr 8, 2021

There was an error when running on dual-stack, but here the test before TestFlowAggregator failed too, so I am not sure if it is the same error. However, with this fix I do not see very high execution times for the flow aggregator tests, unlike those in issue #2035.
http://10.176.27.169:8080/view/dual-stack/job/antrea-ipv6-ds-e2e-for-pull-request/289/console

--- FAIL: TestFlowAggregator (290.24s)
    --- FAIL: TestFlowAggregator/IPv4 (68.28s)
        --- FAIL: TestFlowAggregator/IPv4/IntraNodeFlows (18.65s)
        --- PASS: TestFlowAggregator/IPv4/InterNodeFlows (12.01s)
        --- FAIL: TestFlowAggregator/IPv4/LocalServiceAccess (17.19s)
        --- FAIL: TestFlowAggregator/IPv4/RemoteServiceAccess (17.33s)
    --- FAIL: TestFlowAggregator/IPv6 (72.46s)
        --- FAIL: TestFlowAggregator/IPv6/IntraNodeFlows (17.23s)
        --- FAIL: TestFlowAggregator/IPv6/InterNodeFlows (17.52s)
        --- FAIL: TestFlowAggregator/IPv6/LocalServiceAccess (17.27s)
        --- FAIL: TestFlowAggregator/IPv6/RemoteServiceAccess (17.36s)

Here as well, I see the Antrea agent "container not found" error, like the one in #2035:

=== RUN   TestFlowAggregator/IPv6/IntraNodeFlows
    flowaggregator_test.go:387: Error when waiting for Network Policy to be realized: unable to upgrade connection: container not found ("antrea-agent")

@dreamtalen (Contributor):

LGTM, thanks for using a constant when setting up the flow aggregator.

@zyiou (Contributor) commented on Apr 8, 2021

I checked the antrea-agent logs for the failed tests. For the IntraNodeFlows test, I think the failure is caused by container not found ("antrea-agent"); the logs also show no sign that the network policies for IntraNodeFlows were realized (successful realization of the InterNodeFlows network policies is shown). For the tests that failed after the successful InterNodeFlows test, the problem likely lies on the flow aggregator side:

E0407 21:47:18.364077       1 flow_records.go:128] Error when executing callback for flow record
E0407 21:47:18.364422       1 exporter.go:177] Error when sending flow records: error when iterating flow records: error when sending data set: error when sending message on the connection: write tcp 172.17.0.1:43112->172.17.242.32:4739: write: broken pipe

and

E0407 21:47:19.374206       1 process.go:92] Cannot the create the tls connection to the Collector flow-aggregator.flow-aggregator.svc:4739: dial tcp 172.17.242.32:4739: connect: connection refused

Maybe we can print the log from the flow aggregator to see what happened. Is it possible to have a local Jenkins setup for us to test with?

@srikartati changed the title from "Fix the TestFlowAggregator e2e test when failing" to "Fix the TestFlowAggregator e2e test timeout when failing" on Apr 8, 2021
@srikartati (Member, Author) commented on Apr 8, 2021

> Checked antrea-agent logs for failure tests. [...] Maybe we can print the log from flow aggregator to see what happened. Is it possible to have local Jenkins setup for us to test out?

Thanks for looking further into the issue. It is essential to know the root cause of this intermittent failure, but the intention of this PR is to cut down the abnormal execution times when failures happen, not to fix the intermittent issue itself. I should say it partially fixes the issue rather than fixes it.

For triaging this, we are missing a reliable way to reproduce the issue, as it is intermittent and does not happen often. The reported issue is on dual-stack clusters, and I have also seen it only on dual-stack clusters; please let me know if we have seen it on other clusters. The local setup closest to Jenkins is Vagrant, but we do not have automation to deploy dual-stack clusters on Vagrant yet. It looks like the FlowAggregator pod is in CrashLoopBackOff, so fetching the flow aggregator logs will fail at the end of the test. We need to print the logs and the describe output during the test to figure out the reason.
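Since the aggregator pod may be restarting, a mid-test dump along these lines could capture its state before teardown. This is only a sketch: provider.RunCommandOnNode and controlPlaneNodeName are the e2e harness calls visible in the diff below, assumed to return (rc, stdout, stderr, err); the flow-aggregator namespace and label selector are guesses for illustration, not the actual manifests.

```go
// Hypothetical helper (not part of this PR): dump the Flow Aggregator pod's
// logs and describe output while the test is still running, since a pod in
// CrashLoopBackOff may be unreachable once the test tears down.
func dumpFlowAggregatorState(t *testing.T) {
	for _, cmd := range []string{
		// Namespace and label selector are assumptions for illustration.
		"kubectl -n flow-aggregator logs -l app=flow-aggregator --tail=50",
		"kubectl -n flow-aggregator describe pod -l app=flow-aggregator",
	} {
		rc, stdout, stderr, err := provider.RunCommandOnNode(controlPlaneNodeName(), cmd)
		if err != nil || rc != 0 {
			t.Logf("%q failed: rc=%d err=%v stderr=%s", cmd, rc, err, stderr)
			continue
		}
		t.Logf("%s:\n%s", cmd, stdout)
	}
}
```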

@srikartati (Member, Author): /test-ipv6-e2e

@@ -216,8 +215,8 @@ func checkRecordsForFlows(t *testing.T, data *TestData, srcIP string, dstIP stri

// Polling to make sure all the data records corresponding to the iperf flow
// are received.
-	err = wait.Poll(250*time.Millisecond, collectorCheckTimeout, func() (bool, error) {
 	rc, collectorOutput, _, err := provider.RunCommandOnNode(controlPlaneNodeName(), fmt.Sprintf("kubectl logs --since=%v ipfix-collector -n antrea-test", time.Since(timeStart).String()))
+	err = wait.PollImmediate(500*time.Millisecond, flowAggregatorExportInterval, func() (bool, error) {
@zyiou (Contributor) commented on Apr 8, 2021:

PR #1904 changed the wait condition in line 224 from `return strings.Contains(collectorOutput, srcIP) && strings.Contains(collectorOutput, dstIP), nil` to `return strings.Count(collectorOutput, srcIP) >= expectedNumDataRecords && strings.Count(collectorOutput, dstIP) >= expectedNumDataRecords, nil` to avoid the case where we receive only one record. So here we can use `flowAggregatorExportInterval * 2`.

@srikartati (Member, Author) replied:

Polling happens after the iperf runtime, which is currently 10s (flowAggregatorExportInterval * 2). The timeout will capture the final records if there are any, so I feel one cycle of flowAggregatorExportInterval is enough. What do you think?

@zyiou (Contributor) replied:

Makes sense to me. Thanks for clarifying.
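For reference, here is the polling pattern under discussion, reassembled as a sketch from the diff above plus the PR #1904 condition. It is not the verbatim test: flowAggregatorExportInterval, provider, controlPlaneNodeName, timeStart, srcIP, dstIP, and expectedNumDataRecords are the harness names from the diff, and wait is k8s.io/apimachinery/pkg/util/wait.

```go
// Check immediately, then every 500ms, for at most one export interval:
// PollImmediate skips the initial sleep that wait.Poll would add, and
// strings.Count requires the expected number of data records rather than a
// single occurrence of each IP (the PR #1904 change).
err = wait.PollImmediate(500*time.Millisecond, flowAggregatorExportInterval, func() (bool, error) {
	rc, collectorOutput, _, err := provider.RunCommandOnNode(controlPlaneNodeName(),
		fmt.Sprintf("kubectl logs --since=%v ipfix-collector -n antrea-test", time.Since(timeStart).String()))
	if err != nil || rc != 0 {
		// Treat a transient kubectl failure as "not done yet" instead of aborting.
		return false, nil
	}
	return strings.Count(collectorOutput, srcIP) >= expectedNumDataRecords &&
		strings.Count(collectorOutput, dstIP) >= expectedNumDataRecords, nil
})
```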

zyiou previously approved these changes on Apr 8, 2021
@srikartati force-pushed the fix_e2e_flowagg branch 3 times, most recently from 3aae049 to fd8de9b on May 20, 2021
@srikartati changed the title from "Fix the TestFlowAggregator e2e test timeout when failing" to "Fix bug in FlowAggregator when sending template set" on May 20, 2021
@srikartati requested a review from zyiou on May 20, 2021
Comment on lines +461 to +485
fa.set.ResetSet()
if err := fa.set.PrepareSet(ipfixentities.Template, templateID); err != nil {
	return 0, err
}
@zyiou (Contributor) commented:

Why do we need to do the reset and prepare here? I saw we had already done it here and here.

@srikartati (Member, Author) replied:

You are right. I encountered the issue when the exporter is restarted in the Flow Aggregator: the set still had a record of type DataType. Looking more closely at the existing code, we were missing the set reset before L296 here.

I think it is better to have it in one place, so I removed those statements in the callers of sendTemplateSet. I am making the same fix in the Flow Exporter too.
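To make the failure mode concrete, here is a minimal self-contained sketch. The Set type below is a toy stand-in that borrows only the ResetSet/PrepareSet names quoted above; it is a hypothetical simplification, not go-ipfix's real API. The point is that a set reused across sends keeps its previous records, so preparing it as a template set after a data export must be preceded by a reset.

```go
package main

import (
	"errors"
	"fmt"
)

// ContentType mirrors the idea of a set being either a template or a data set.
type ContentType int

const (
	Undefined ContentType = iota
	Template
	Data
)

// Set is a toy stand-in for the reusable IPFIX set: PrepareSet fails if the
// previous contents were not cleared with ResetSet (a simplification of the
// behavior described in this thread).
type Set struct {
	kind    ContentType
	records int
}

func (s *Set) ResetSet() { s.kind, s.records = Undefined, 0 }

func (s *Set) PrepareSet(kind ContentType, templateID uint16) error {
	if s.kind != Undefined || s.records != 0 {
		return errors.New("set not reset: records of the previous type remain")
	}
	s.kind = kind
	return nil
}

func (s *Set) AddRecord() { s.records++ }

func main() {
	set := &Set{}

	// A normal data export leaves the set holding data records.
	_ = set.PrepareSet(Data, 256)
	set.AddRecord()

	// Buggy retry path: on reconnect, prepare as a template set without resetting.
	if err := set.PrepareSet(Template, 256); err != nil {
		fmt.Println("retry without reset:", err)
	}

	// Fixed path (what this PR does): reset first, then prepare.
	set.ResetSet()
	if err := set.PrepareSet(Template, 256); err == nil {
		fmt.Println("retry after reset: template set prepared")
	}
}
```

Running it prints the error on the buggy path and success after the reset, mirroring the crash-on-reconnect this PR fixes.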

@srikartati force-pushed the fix_e2e_flowagg branch 3 times, most recently from a819f45 to 9e37dc8 on May 21, 2021
@codecov-commenter commented on May 21, 2021

Codecov Report

Merging #2039 (999a423) into main (d8d8626) will increase coverage by 3.66%.
The diff coverage is 64.28%.


@@            Coverage Diff             @@
##             main    #2039      +/-   ##
==========================================
+ Coverage   61.83%   65.49%   +3.66%     
==========================================
  Files         276      276              
  Lines       21145    21140       -5     
==========================================
+ Hits        13074    13845     +771     
+ Misses       6702     5908     -794     
- Partials     1369     1387      +18     
Flag            Coverage  Δ
e2e-tests       56.57%    <64.28%> (?)
kind-e2e-tests  53.19%    <64.28%> (+0.22%) ⬆️
unit-tests      41.30%    <42.85%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pkg/agent/flowexporter/exporter/exporter.go 77.85% <57.14%> (+0.55%) ⬆️
pkg/flowaggregator/flowaggregator.go 63.24% <71.42%> (+5.19%) ⬆️
...ntroller/networkpolicy/networkpolicy_controller.go 69.45% <0.00%> (-0.65%) ⬇️
pkg/agent/controller/networkpolicy/reconciler.go 77.04% <0.00%> (+0.23%) ⬆️
pkg/ovs/openflow/ofctrl_bridge.go 50.36% <0.00%> (+0.36%) ⬆️
pkg/agent/openflow/network_policy.go 76.41% <0.00%> (+0.59%) ⬆️
pkg/agent/proxy/proxier.go 64.53% <0.00%> (+0.63%) ⬆️
pkg/agent/cniserver/pod_configuration.go 54.68% <0.00%> (+0.78%) ⬆️
pkg/controller/egress/store/egressgroup.go 1.38% <0.00%> (+1.38%) ⬆️
pkg/apiserver/apiserver.go 91.12% <0.00%> (+1.61%) ⬆️
... and 48 more

@srikartati (Member, Author): /test-all

antoninbas previously approved these changes on May 26, 2021
zyiou previously approved these changes on May 26, 2021
@zyiou (Contributor) left a comment:

LGTM

@srikartati (Member, Author):

/test-ipv6-e2e
/test-ipv6-only-e2e

@srikartati (Member, Author): /test-conformance

1 similar comment

@srikartati (Member, Author): /test-all

@srikartati (Member, Author):
I had to add the signed-off snippet to the commit message. @antoninbas could you approve this again?

@srikartati (Member, Author) commented on May 27, 2021

> I had to add the signed-off snippet to the commit message. @antoninbas could you approve this again?

You could wait a while longer, as there will be conflicts because of other PRs getting merged. Thanks.

@srikartati force-pushed the fix_e2e_flowagg branch 2 times, most recently from c03c14c to c2fcf98 on May 27, 2021
@antoninbas (Contributor):

@srikartati there is a formatting issue

@srikartati (Member, Author): /test-all

antoninbas previously approved these changes on May 27, 2021

Commit message:
We need to prepare the set as template set before sending
template record. Otherwise, flow aggregator will crash when retrying
to establish the connection to flow collector.

In addition, fix the e2e flow aggregator test timeouts.
Timeout values are quite high when the tests are failing
intermittently on dual stack clusters.
Fix that by reducing poll timeout when checking for logs on
ipfix-collector pod.

Signed-off-by: Srikar Tati <stati@vmware.com>
@srikartati (Member, Author): /test-all

@srikartati (Member, Author) commented on May 28, 2021

All the tests have passed. I will wait for PR #2163 to be merged, as it is a higher priority for the v1.2 release, to make sure there are no conflicts for that PR. This PR can go in after the release too, IMO, as the bug is a rare corner case; it depends on the other PR and CI runs.
@antoninbas I am removing the milestone just to remove the dependency on the release.

@srikartati removed this from the Antrea v1.1 release milestone on May 28, 2021
@srikartati (Member, Author):

> All the tests have passed. I will wait for PR #2163 to be merged [...] I am removing the milestone just to remove the dependency on the release.

I tried cherry-picking the commit from PR #2163 onto my local branch, and there are no conflicts.
@antoninbas please give the approval again.

@antoninbas merged commit 34b981a into antrea-io:main on May 28, 2021
@srikartati deleted the fix_e2e_flowagg branch on May 28, 2021
Linked issue this PR may close: FlowAggregator e2e tests sometimes timed out (#2035).