
Fix bug in FlowAggregator when sending template set #2039

Merged: 1 commit merged into antrea-io:main on May 28, 2021

Conversation

@srikartati (Member) commented on Apr 7, 2021

We need to reset the set and prepare it as a template set before sending the template record. Otherwise, the Flow Aggregator crashes when retrying to establish the connection to the flow collector. The bug was the missing reset operation.

In addition, during this failure the test takes a long time to finish. Fix the e2e flow aggregator test timeouts: the timeout values are quite high when the tests fail intermittently on dual-stack clusters. Fix that by reducing the poll timeout when checking for logs on the ipfix-collector pod.

Fixes #2035

@srikartati (Member, Author): /test-ipv6-e2e

@codecov-io commented on Apr 7, 2021

Codecov Report

Merging #2039 (92ac8ba) into main (9a9804a) will increase coverage by 13.89%.
The diff coverage is n/a.


@@             Coverage Diff             @@
##             main    #2039       +/-   ##
===========================================
+ Coverage   41.69%   55.58%   +13.89%     
===========================================
  Files         127      265      +138     
  Lines       15844    19773     +3929     
===========================================
+ Hits         6606    10991     +4385     
+ Misses       8684     7616     -1068     
- Partials      554     1166      +612     
Flag            Coverage  Δ
kind-e2e-tests  41.86%    <ø> (?)
unit-tests      41.68%    <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pkg/apiserver/handlers/webhook/mutation_labels.go 24.71% <0.00%> (ø)
pkg/agent/util/ipset/ipset.go 61.53% <0.00%> (ø)
pkg/agent/interfacestore/interface_cache.go 81.37% <0.00%> (ø)
pkg/monitor/controller.go 35.07% <0.00%> (ø)
pkg/ipfix/ipfix_set.go 100.00% <0.00%> (ø)
...erinformation/v1beta1/clusterinformation_client.go 50.00% <0.00%> (ø)
...s/externalversions/clusterinformation/interface.go 0.00% <0.00%> (ø)
pkg/monitor/agent.go 33.33% <0.00%> (ø)
...ntroller/crdmirroring/crdhandler/externalentity.go 0.00% <0.00%> (ø)
...versioned/typed/security/v1alpha1/networkpolicy.go 0.00% <0.00%> (ø)
... and 209 more

@srikartati (Member, Author): /test-ipv6-e2e

@srikartati (Member, Author):

/test-all
/test-ipv6-e2e

@srikartati (Member, Author):

/test-ipv6-e2e
/test-ipv6-only-e2e

@srikartati (Member, Author) commented on Apr 8, 2021

There was an error when running on dual-stack, but here the test before TestFlowAggregator failed too, so I am not sure if it is the same error. However, with this fix I do not see very high execution times for the flow aggregator tests, unlike those in issue #2035.
http://10.176.27.169:8080/view/dual-stack/job/antrea-ipv6-ds-e2e-for-pull-request/289/console

--- FAIL: TestFlowAggregator (290.24s)
    --- FAIL: TestFlowAggregator/IPv4 (68.28s)
        --- FAIL: TestFlowAggregator/IPv4/IntraNodeFlows (18.65s)
        --- PASS: TestFlowAggregator/IPv4/InterNodeFlows (12.01s)
        --- FAIL: TestFlowAggregator/IPv4/LocalServiceAccess (17.19s)
        --- FAIL: TestFlowAggregator/IPv4/RemoteServiceAccess (17.33s)
    --- FAIL: TestFlowAggregator/IPv6 (72.46s)
        --- FAIL: TestFlowAggregator/IPv6/IntraNodeFlows (17.23s)
        --- FAIL: TestFlowAggregator/IPv6/InterNodeFlows (17.52s)
        --- FAIL: TestFlowAggregator/IPv6/LocalServiceAccess (17.27s)
        --- FAIL: TestFlowAggregator/IPv6/RemoteServiceAccess (17.36s)

Here as well, I see the Antrea agent "container not found" error, like the one in #2035:

=== RUN   TestFlowAggregator/IPv6/IntraNodeFlows
    flowaggregator_test.go:387: Error when waiting for Network Policy to be realized: unable to upgrade connection: container not found ("antrea-agent")

@dreamtalen (Contributor):

LGTM, thanks for using a constant when setting up the flow aggregator.

@zyiou (Contributor) commented on Apr 8, 2021

I checked the antrea-agent logs for the failed tests. For the IntraNodeFlows test, I think the failure is caused by container not found ("antrea-agent"); the logs also show no sign that the network policies for IntraNodeFlows were realized (successful realization of the InterNodeFlows network policies is shown). For the tests that failed after the successful InterNodeFlows test, the problem likely lies on the flow aggregator side:

E0407 21:47:18.364077       1 flow_records.go:128] Error when executing callback for flow record
E0407 21:47:18.364422       1 exporter.go:177] Error when sending flow records: error when iterating flow records: error when sending data set: error when sending message on the connection: write tcp 172.17.0.1:43112->172.17.242.32:4739: write: broken pipe

and

E0407 21:47:19.374206       1 process.go:92] Cannot the create the tls connection to the Collector flow-aggregator.flow-aggregator.svc:4739: dial tcp 172.17.242.32:4739: connect: connection refused

Maybe we can print the log from the flow aggregator to see what happened. Is it possible to have a local Jenkins setup for us to test with?

@srikartati changed the title from "Fix the TestFlowAggregator e2e test when failing" to "Fix the TestFlowAggregator e2e test timeout when failing" on Apr 8, 2021
@srikartati (Member, Author) commented on Apr 8, 2021

> Checked antrea-agent logs for failure tests. [...] Maybe we can print the log from flow aggregator to see what happened. Is it possible to have local Jenkins setup for us to test out?

Thanks for looking further into the issue. It is essential to know the root cause of this intermittent failure, but the intention of this PR is to cut down the abnormal execution times when failures happen, not to fix the intermittent issue itself. I should say it partially fixes the issue rather than fixes it.

For triaging this, we are missing a reliable way to reproduce the issue, as it is intermittent and does not happen often. The reported issue is on dual-stack clusters, and I have also seen it only on dual-stack clusters; please let me know if we have seen it on other clusters. The local setup closest to Jenkins is Vagrant, but we do not have automation to deploy dual-stack clusters on Vagrant yet. It looks like the FlowAggregator pod is in CrashLoopBackOff, so fetching the flow aggregator logs will fail at the end of the test. We need to print the logs and the describe output during the test to figure out the reason.
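Since the aggregator pod may be restarting, a mid-test dump along these lines could capture its state before teardown. This is only a sketch: provider.RunCommandOnNode and controlPlaneNodeName are the e2e harness calls visible in the diff below, assumed to return (rc, stdout, stderr, err); the flow-aggregator namespace and label selector are guesses for illustration, not the actual manifests.

```go
// Hypothetical helper (not part of this PR): dump the Flow Aggregator pod's
// logs and describe output while the test is still running, since a pod in
// CrashLoopBackOff may be unreachable once the test tears down.
func dumpFlowAggregatorState(t *testing.T) {
	for _, cmd := range []string{
		// Namespace and label selector are assumptions for illustration.
		"kubectl -n flow-aggregator logs -l app=flow-aggregator --tail=50",
		"kubectl -n flow-aggregator describe pod -l app=flow-aggregator",
	} {
		rc, stdout, stderr, err := provider.RunCommandOnNode(controlPlaneNodeName(), cmd)
		if err != nil || rc != 0 {
			t.Logf("%q failed: rc=%d err=%v stderr=%s", cmd, rc, err, stderr)
			continue
		}
		t.Logf("%s:\n%s", cmd, stdout)
	}
}
```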

@srikartati (Member, Author): /test-ipv6-e2e

@@ -216,8 +215,8 @@ func checkRecordsForFlows(t *testing.T, data *TestData, srcIP string, dstIP stri

// Polling to make sure all the data records corresponding to the iperf flow
// are received.
-	err = wait.Poll(250*time.Millisecond, collectorCheckTimeout, func() (bool, error) {
 	rc, collectorOutput, _, err := provider.RunCommandOnNode(controlPlaneNodeName(), fmt.Sprintf("kubectl logs --since=%v ipfix-collector -n antrea-test", time.Since(timeStart).String()))
+	err = wait.PollImmediate(500*time.Millisecond, flowAggregatorExportInterval, func() (bool, error) {
@zyiou (Contributor) commented on Apr 8, 2021:

PR #1904 changed the wait condition in line 224 from `return strings.Contains(collectorOutput, srcIP) && strings.Contains(collectorOutput, dstIP), nil` to `return strings.Count(collectorOutput, srcIP) >= expectedNumDataRecords && strings.Count(collectorOutput, dstIP) >= expectedNumDataRecords, nil` to avoid the case where we receive only one record. So here we can use `flowAggregatorExportInterval * 2`.

@srikartati (Member, Author) replied:

Polling happens after the iperf runtime, which is currently 10s (flowAggregatorExportInterval * 2). The timeout will capture the final records if there are any, so I feel one cycle of flowAggregatorExportInterval is enough. What do you think?

@zyiou (Contributor) replied:

Makes sense to me. Thanks for clarifying.
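For reference, here is the polling pattern under discussion, reassembled as a sketch from the diff above plus the PR #1904 condition. It is not the verbatim test: flowAggregatorExportInterval, provider, controlPlaneNodeName, timeStart, srcIP, dstIP, and expectedNumDataRecords are the harness names from the diff, and wait is k8s.io/apimachinery/pkg/util/wait.

```go
// Check immediately, then every 500ms, for at most one export interval:
// PollImmediate skips the initial sleep that wait.Poll would add, and
// strings.Count requires the expected number of data records rather than a
// single occurrence of each IP (the PR #1904 change).
err = wait.PollImmediate(500*time.Millisecond, flowAggregatorExportInterval, func() (bool, error) {
	rc, collectorOutput, _, err := provider.RunCommandOnNode(controlPlaneNodeName(),
		fmt.Sprintf("kubectl logs --since=%v ipfix-collector -n antrea-test", time.Since(timeStart).String()))
	if err != nil || rc != 0 {
		// Treat a transient kubectl failure as "not done yet" instead of aborting.
		return false, nil
	}
	return strings.Count(collectorOutput, srcIP) >= expectedNumDataRecords &&
		strings.Count(collectorOutput, dstIP) >= expectedNumDataRecords, nil
})
```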

zyiou previously approved these changes on Apr 8, 2021
@srikartati force-pushed the fix_e2e_flowagg branch 3 times, most recently from 3aae049 to fd8de9b on May 20, 2021
@srikartati changed the title from "Fix the TestFlowAggregator e2e test timeout when failing" to "Fix bug in FlowAggregator when sending template set" on May 20, 2021
@srikartati requested a review from zyiou on May 20, 2021
Comment on lines +461 to +485
fa.set.ResetSet()
if err := fa.set.PrepareSet(ipfixentities.Template, templateID); err != nil {
	return 0, err
}
@zyiou (Contributor) commented:

Why do we need to do the reset and prepare here? I saw we had already done it here and here.

@srikartati (Member, Author) replied:

You are right. I encountered the issue when the exporter is restarted in the Flow Aggregator: the set still had a record of type DataType. Looking more closely at the existing code, we were missing the set reset before L296 here.

I think it is better to have it in one place, so I removed those statements in the callers of sendTemplateSet. I am making the same fix in the Flow Exporter too.
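To make the failure mode concrete, here is a minimal self-contained sketch. The Set type below is a toy stand-in that borrows only the ResetSet/PrepareSet names quoted above; it is a hypothetical simplification, not go-ipfix's real API. The point is that a set reused across sends keeps its previous records, so preparing it as a template set after a data export must be preceded by a reset.

```go
package main

import (
	"errors"
	"fmt"
)

// ContentType mirrors the idea of a set being either a template or a data set.
type ContentType int

const (
	Undefined ContentType = iota
	Template
	Data
)

// Set is a toy stand-in for the reusable IPFIX set: PrepareSet fails if the
// previous contents were not cleared with ResetSet (a simplification of the
// behavior described in this thread).
type Set struct {
	kind    ContentType
	records int
}

func (s *Set) ResetSet() { s.kind, s.records = Undefined, 0 }

func (s *Set) PrepareSet(kind ContentType, templateID uint16) error {
	if s.kind != Undefined || s.records != 0 {
		return errors.New("set not reset: records of the previous type remain")
	}
	s.kind = kind
	return nil
}

func (s *Set) AddRecord() { s.records++ }

func main() {
	set := &Set{}

	// A normal data export leaves the set holding data records.
	_ = set.PrepareSet(Data, 256)
	set.AddRecord()

	// Buggy retry path: on reconnect, prepare as a template set without resetting.
	if err := set.PrepareSet(Template, 256); err != nil {
		fmt.Println("retry without reset:", err)
	}

	// Fixed path (what this PR does): reset first, then prepare.
	set.ResetSet()
	if err := set.PrepareSet(Template, 256); err == nil {
		fmt.Println("retry after reset: template set prepared")
	}
}
```

Running it prints the error on the buggy path and success after the reset, mirroring the crash-on-reconnect this PR fixes.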

@srikartati force-pushed the fix_e2e_flowagg branch 3 times, most recently from a819f45 to 9e37dc8 on May 21, 2021
@codecov-commenter commented on May 21, 2021

Codecov Report

Merging #2039 (999a423) into main (d8d8626) will increase coverage by 3.66%.
The diff coverage is 64.28%.


@@            Coverage Diff             @@
##             main    #2039      +/-   ##
==========================================
+ Coverage   61.83%   65.49%   +3.66%     
==========================================
  Files         276      276              
  Lines       21145    21140       -5     
==========================================
+ Hits        13074    13845     +771     
+ Misses       6702     5908     -794     
- Partials     1369     1387      +18     
Flag            Coverage  Δ
e2e-tests       56.57%    <64.28%> (?)
kind-e2e-tests  53.19%    <64.28%> (+0.22%) ⬆️
unit-tests      41.30%    <42.85%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pkg/agent/flowexporter/exporter/exporter.go 77.85% <57.14%> (+0.55%) ⬆️
pkg/flowaggregator/flowaggregator.go 63.24% <71.42%> (+5.19%) ⬆️
...ntroller/networkpolicy/networkpolicy_controller.go 69.45% <0.00%> (-0.65%) ⬇️
pkg/agent/controller/networkpolicy/reconciler.go 77.04% <0.00%> (+0.23%) ⬆️
pkg/ovs/openflow/ofctrl_bridge.go 50.36% <0.00%> (+0.36%) ⬆️
pkg/agent/openflow/network_policy.go 76.41% <0.00%> (+0.59%) ⬆️
pkg/agent/proxy/proxier.go 64.53% <0.00%> (+0.63%) ⬆️
pkg/agent/cniserver/pod_configuration.go 54.68% <0.00%> (+0.78%) ⬆️
pkg/controller/egress/store/egressgroup.go 1.38% <0.00%> (+1.38%) ⬆️
pkg/apiserver/apiserver.go 91.12% <0.00%> (+1.61%) ⬆️
... and 48 more

@srikartati (Member, Author): /test-all

antoninbas previously approved these changes on May 26, 2021
zyiou previously approved these changes on May 26, 2021
@zyiou (Contributor) left a comment:

LGTM

@srikartati (Member, Author):

/test-ipv6-e2e
/test-ipv6-only-e2e

@srikartati (Member, Author): /test-conformance

1 similar comment

@srikartati (Member, Author): /test-all

@srikartati (Member, Author):
I had to add the signed-off snippet to the commit message. @antoninbas could you approve this again?

@srikartati (Member, Author) commented on May 27, 2021

> I had to add the signed-off snippet to the commit message. @antoninbas could you approve this again?

You could wait a while longer, as there will be conflicts because of other PRs getting merged. Thanks.

@srikartati force-pushed the fix_e2e_flowagg branch 2 times, most recently from c03c14c to c2fcf98 on May 27, 2021
@antoninbas (Contributor):

@srikartati there is a formatting issue

@srikartati (Member, Author): /test-all

antoninbas previously approved these changes on May 27, 2021

Commit message:
We need to prepare the set as template set before sending
template record. Otherwise, flow aggregator will crash when retrying
to establish the connection to flow collector.

In addition, fix the e2e flow aggregator test timeouts.
Timeout values are quite high when the tests are failing
intermittently on dual stack clusters.
Fix that by reducing poll timeout when checking for logs on
ipfix-collector pod.

Signed-off-by: Srikar Tati <stati@vmware.com>
@srikartati (Member, Author): /test-all

@srikartati (Member, Author) commented on May 28, 2021

All the tests have passed. I will wait for PR #2163 to be merged, as it is a higher priority for the v1.2 release, to make sure there are no conflicts for that PR. This PR can go in after the release too, IMO, as the bug is a rare corner case; it depends on the other PR and CI runs.
@antoninbas I am removing the milestone just to remove the dependency on the release.

@srikartati removed this from the Antrea v1.1 release milestone on May 28, 2021
@srikartati (Member, Author):

> All the tests have passed. I will wait for PR #2163 to be merged [...] I am removing the milestone just to remove the dependency on the release.

I tried cherry-picking the commit from PR #2163 onto my local branch, and there are no conflicts.
@antoninbas please give the approval again.

@antoninbas merged commit 34b981a into antrea-io:main on May 28, 2021
@srikartati deleted the fix_e2e_flowagg branch on May 28, 2021
Linked issue this PR may close: FlowAggregator e2e tests sometimes timed out (#2035).