
Fix flakiness in Kind e2e flow aggregator tests #2308

Merged (1 commit) on Jul 7, 2021

Conversation

srikartati
Member

@srikartati commented Jun 24, 2021

  • Flows are not expiring as expected because the start time is zero and the last export time is initialized to the start time. This commit fixes that bug by setting the current time as the start time if it is zero when a new connection is added (see the sketch after this list). Fixes: IPFIX start and stop timestamps are wrong/too large (#1417)

  • The idle flow export timeout is not configured properly in the e2e test; there is a bug in the test, and this commit fixes it. The bug was introduced in PR: Fortify the bandwidth e2e test of flow aggregator (#1802). Ref: https://github.com/antrea-io/antrea/blame/main/test/e2e/framework.go#L572

  • The expected number of records is not correct in Kind clusters, as packet counters are not supported in conntrack entries for the OVS userspace datapath. The flow exporter cannot send records after the active timeout expires because we cannot consider flows to be active without those statistics. Therefore, we expect only 2 records for any given flow in Kind clusters and 3 flow records in other clusters. We expect this limitation to be resolved soon, once the OVS userspace datapath supports statistics in conntrack entries. This was missed when the bandwidth tests were modified for Kind clusters in PR: Fortify the bandwidth e2e test of flow aggregator (#1802).

  • Additionally, we take the source port from the iperf command output to ignore similar flow records belonging to the iperf control flows. Because of those extra records, the tests were passing in Kind clusters even though there were only two flow records while three were expected. The source-port gating also fixes flakiness in the bandwidth tests. Fixes: FlowAggregator bandwidth tests are too flaky (#2283)

  • The kubectl logs command with the --since option does not output the IPFIX collector logs as expected, which causes intermittent failures. We remove the --since option, since the source port now serves as the gating parameter when processing flow records.
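
For reference, the start-time fix in the first bullet boils down to a zero-value guard when a new connection is added. Here is a minimal self-contained sketch; the Connection struct and addConnection function are simplified stand-ins rather than Antrea's actual types, and only the IsZero guard mirrors this PR's diff (quoted later in this thread):

```go
package main

import (
	"fmt"
	"time"
)

// Connection is a simplified stand-in for the flow exporter's connection
// struct; only the fields relevant to the start-time bug are shown.
type Connection struct {
	StartTime      time.Time
	LastExportTime time.Time
}

// addConnection applies the fix: if conntrack reported a zero start time,
// fall back to the current time so that the last export time (initialized
// from the start time) is sane and timeout-based expiry works as expected.
func addConnection(conn *Connection) {
	if conn.StartTime.IsZero() {
		conn.StartTime = time.Now()
	}
	conn.LastExportTime = conn.StartTime
}

func main() {
	conn := &Connection{} // zero-valued StartTime, as in the reported bug
	addConnection(conn)
	fmt.Println(conn.StartTime, conn.LastExportTime)
}
```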

@codecov-commenter commented Jun 24, 2021

Codecov Report

Merging #2308 (03d1031) into main (d84da6c) will increase coverage by 23.81%.
The diff coverage is 100.00%.


@@             Coverage Diff             @@
##             main    #2308       +/-   ##
===========================================
+ Coverage   41.66%   65.47%   +23.81%     
===========================================
  Files         144      281      +137     
  Lines       17810    21771     +3961     
===========================================
+ Hits         7420    14254     +6834     
+ Misses       9732     6071     -3661     
- Partials      658     1446      +788     
Flag            Coverage Δ
e2e-tests       55.51% <80.00%> (?)
kind-e2e-tests  52.47% <80.00%> (?)
unit-tests      41.65% <60.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pkg/agent/flowexporter/connections/conntrack.go 85.71% <ø> (+24.84%) ⬆️
.../flowexporter/connections/conntrack_connections.go 78.68% <100.00%> (+33.68%) ⬆️
pkg/agent/flowexporter/exporter/exporter.go 79.44% <100.00%> (+22.00%) ⬆️
pkg/apiserver/handlers/webhook/mutation_labels.go 24.71% <0.00%> (ø)
pkg/ovs/ovsctl/ofctl.go 38.27% <0.00%> (ø)
pkg/util/runtime/runtime.go 100.00% <0.00%> (ø)
pkg/legacyapis/controlplane/register.go 88.23% <0.00%> (ø)
pkg/signals/signals.go 100.00% <0.00%> (ø)
pkg/apiserver/apiserver.go 90.00% <0.00%> (ø)
...set/versioned/typed/stats/v1alpha1/stats_client.go 45.45% <0.00%> (ø)
... and 241 more

Contributor

@antoninbas left a comment


Thanks for looking into the test failures

Comment on lines 529 to 537
// Check the source port along with source and destination IPs as there
// are flow records for control flows during iperf traffics with same IPs
// and destination port.
if strings.Contains(record, srcIP) && strings.Contains(record, dstIP) && strings.Contains(record, srcPort) {
Contributor


That's the other bug, right? Please make sure it's mentioned in the commit message / PR description.

Member Author


This is needed here as well because, prior to this PR, we were accounting for incorrect records (corresponding to the control flow) and passing the test. Therefore, we use the source port to account only for the correct records. I added this point to the PR description and will also mention the bandwidth test issue there.

@@ -994,3 +1014,22 @@ func deletePerftestServices(t *testing.T, data *TestData) {
}
}
}

// getBandwidthAndSourcePort parses iperf commands output and returns bandwidth
// and source port.
Contributor


maybe add that the bandwidth is returned as a slice containing 2 strings (bandwidth value and bandwidth unit)

@srikartati
Member Author

/test-ipv6-e2e
/test-ipv6-only-e2e

@srikartati
Member Author

srikartati commented Jun 30, 2021

Kind e2e tests are often failing because of the following Antrea network policy tests:

```
   --- FAIL: TestAntreaPolicy/TestGroupDefaultDENY (66.21s)
        --- FAIL: TestAntreaPolicy/TestGroupDefaultDENY/Case=ACNPAllowXBtoA (15.87s)
        --- PASS: TestAntreaPolicy/TestGroupDefaultDENY/Case=ACNPAllowXBtoYA (16.17s)
        --- PASS: TestAntreaPolicy/TestGroupDefaultDENY/Case=ACNPPriorityOverrideDefaultDeny (16.20s)
--- FAIL: TestAntreaPolicy/TestGroupNoK8sNP (742.02s)
    --- PASS: TestAntreaPolicy/TestGroupNoK8sNP/Case=ACNPAllowNoDefaultIsolationTCP (12.83s)
    ........
    --- FAIL: TestAntreaPolicy/TestGroupNoK8sNP/Case=AppliedToPerRule (29.28s)
```
The dual-stack test fails because of the following test:
```
    --- FAIL: TestLegacyAntreaPolicy/TestGroupNoK8sNP (870.73s)
        --- PASS: TestLegacyAntreaPolicy/TestGroupNoK8sNP/Case=LegacyACNPAllowNoDefaultIsolationTCP (16.60s)
        --- PASS: TestLegacyAntreaPolicy/TestGroupNoK8sNP/Case=LegacyACNPAllowNoDefaultIsolationUDP (16.49s)
        --- SKIP: TestLegacyAntreaPolicy/TestGroupNoK8sNP/Case=LegacyACNPAllowNoDefaultIsolationSCTP (0.00s)
        --- PASS: TestLegacyAntreaPolicy/TestGroupNoK8sNP/Case=LegacyACNPDropEgress (19.96s)
        --- PASS: TestLegacyAntreaPolicy/TestGroupNoK8sNP/Case=LegacyACNPDropEgressUDP (19.24s)
        --- SKIP: TestLegacyAntreaPolicy/TestGroupNoK8sNP/Case=LegacyACNPDropEgressSCTP (0.00s)
        --- PASS: TestLegacyAntreaPolicy/TestGroupNoK8sNP/Case=LegacyACNPPortRange (76.52s)
        --- PASS: TestLegacyAntreaPolicy/TestGroupNoK8sNP/Case=LegacyACNPRejectEgress (16.81s)
        --- PASS: TestLegacyAntreaPolicy/TestGroupNoK8sNP/Case=LegacyACNPRejectIngress (16.51s)
        --- PASS: TestLegacyAntreaPolicy/TestGroupNoK8sNP/Case=LegacyACNPRejectIngressUDP (16.92s)
        --- PASS: TestLegacyAntreaPolicy/TestGroupNoK8sNP/Case=LegacyACNPNoEffectOnOtherProtocols (31.96s)
        --- PASS: TestLegacyAntreaPolicy/TestGroupNoK8sNP/Case=LegacyACNPBaselinePolicy (22.04s)
        --- PASS: TestLegacyAntreaPolicy/TestGroupNoK8sNP/Case=LegacyACNPPrioirtyOverride (39.55s)
        --- PASS: TestLegacyAntreaPolicy/TestGroupNoK8sNP/Case=LegacyACNPTierOverride (39.42s)
        --- FAIL: TestLegacyAntreaPolicy/TestGroupNoK8sNP/Case=LegacyACNPCustomTiers (20.48s)
```

Comment on lines +229 to +231
if conn.StartTime.IsZero() {
conn.StartTime = time.Now()
}
Contributor


I feel we can include the fix for #1417 in this PR too. It is also about the start time, but the default value is different (flowStartSeconds: 2288912640, i.e. StartTime: Jul 14, 2042 03:04:00.000000000 CEST).

Member Author


Done

Comment on lines 601 to 607
cmd := fmt.Sprintf("kubectl -n %s logs -c antrea-agent %s", nsName, podName)
_, antreaLogs, _, err := provider.RunCommandOnNode(controlPlaneNodeName(), cmd)
t.Logf("Logs for %s: %v err: %v", podName, antreaLogs, err)
return err
})
Contributor


When the test fails, we can view the antrea-agent logs from the GitHub workflow. Is this log for debugging the Jenkins issue?

Member Author


This is a temporary addition for this PR, to see if there are any intermittent Kind CI failures and debug them. I will remove it before merging.

// getBandwidthAndSourcePort parses iperf commands output and returns bandwidth
// and source port. Bandwidth is returned as a slice containing two strings (bandwidth
// value and bandwidth unit).
func getBandwidthAndSourcePort(iperfStdout string) ([]string, string) {
Contributor


Maybe we can add a sample output here for elaboration. Also, if iperf doesn't send the traffic correctly, will the result be the same (srcPort returning an empty string, etc.)?

Member Author


We check for an error after executing the iperf command, before parsing the output on stdout; that takes care of the mentioned scenario.
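
For the sample output the reviewer asked about, a parser satisfying the documented contract could look roughly like this; the iperf3 output lines and the regular expressions below are illustrative assumptions, not Antrea's actual implementation:

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// e.g. "[  5] local 10.10.0.5 port 43526 connected to 10.10.1.7 port 5201"
	portRe = regexp.MustCompile(`port (\d+) connected to`)
	// e.g. "[  5]   0.00-10.00  sec  10.9 GBytes  9.42 Gbits/sec    0   sender"
	bandwidthRe = regexp.MustCompile(`([\d.]+) ([KMG]?bits/sec).*sender`)
)

// getBandwidthAndSourcePort parses iperf command output and returns the
// bandwidth as a slice of two strings (value and unit) along with the
// client's source port, matching the contract in the doc comment above.
func getBandwidthAndSourcePort(iperfStdout string) ([]string, string) {
	var bandwidth []string
	var srcPort string
	if m := portRe.FindStringSubmatch(iperfStdout); m != nil {
		srcPort = m[1]
	}
	if m := bandwidthRe.FindStringSubmatch(iperfStdout); m != nil {
		bandwidth = []string{m[1], m[2]}
	}
	return bandwidth, srcPort
}

func main() {
	out := "[  5] local 10.10.0.5 port 43526 connected to 10.10.1.7 port 5201\n" +
		"[  5]   0.00-10.00  sec  10.9 GBytes  9.42 Gbits/sec    0   sender\n"
	bw, port := getBandwidthAndSourcePort(out)
	fmt.Println(bw, port) // prints: [9.42 Gbits/sec] 43526
}
```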

@srikartati
Member Author

I ran the Kind tests on CI (more than 10 times) on this PR over the last 2-3 days and did not see any Flow Aggregator test failures. However, there were some repeated failures, as mentioned here: #2308 (comment)

dataRecordsCount := 0
var octetTotalCount uint64
for _, record := range recordSlices {
if strings.Contains(record, srcIP) && strings.Contains(record, dstIP) {
// Check the source port along with source and destination IPs as there
// are flow records for control flows during iperf traffics with same IPs
Contributor


s/iperf traffics/iperf connections ?

traffic can never be plural

Member Author

@srikartati commented Jul 2, 2021


It was an unintentional typo :)

@antoninbas previously approved these changes Jul 2, 2021
Contributor

@antoninbas left a comment


LGTM

@antoninbas
Contributor

@srikartati maybe you just got very unlucky but all the FlowAggregator tests failed in Jenkins:

```
--- FAIL: TestFlowAggregator (254.27s)
    --- FAIL: TestFlowAggregator/IPv4 (151.47s)
        --- FAIL: TestFlowAggregator/IPv4/IntraNodeFlows (20.67s)
        --- FAIL: TestFlowAggregator/IPv4/IntraNodeDenyConnIngressANP (9.23s)
        --- FAIL: TestFlowAggregator/IPv4/IntraNodeDenyConnEgressANP (9.15s)
        --- FAIL: TestFlowAggregator/IPv4/IntraNodeDenyConnNP (11.16s)
        --- FAIL: TestFlowAggregator/IPv4/InterNodeFlows (19.29s)
        --- FAIL: TestFlowAggregator/IPv4/InterNodeDenyConnIngressANP (9.17s)
        --- FAIL: TestFlowAggregator/IPv4/InterNodeDenyConnEgressANP (9.19s)
        --- FAIL: TestFlowAggregator/IPv4/InterNodeDenyConnNP (11.14s)
        --- FAIL: TestFlowAggregator/IPv4/ToExternalFlows (11.02s)
        --- FAIL: TestFlowAggregator/IPv4/LocalServiceAccess (19.16s)
        --- FAIL: TestFlowAggregator/IPv4/RemoteServiceAccess (19.18s)
```

https://jenkins.antrea-ci.rocks/job/antrea-e2e-for-pull-request/2815/consoleFull

@srikartati
Member Author

srikartati commented Jul 3, 2021

> @srikartati maybe you just got very unlucky but all the FlowAggregator tests failed in Jenkins:
>
> https://jenkins.antrea-ci.rocks/job/antrea-e2e-for-pull-request/2815/consoleFull

Yeah, I looked through the collected logs. However, this is a very different error from the issues this PR is solving: there is a DNS lookup (net.LookupIP) failure when initializing the flow aggregator as part of creating the server certificate.

```
I0702 23:23:58.241360       6 log_file.go:99] Set log file max size to 104857600
I0702 23:23:58.243962       6 flow-aggregator.go:74] Flow aggregator starting...
I0702 23:23:58.255937       6 flow-aggregator.go:94] Flow aggregator Observation Domain ID: 743104829
F0702 23:24:18.434942       6 main.go:59] Error running flow aggregator: error when creating collecting process: error when creating server certificate: lookup flow-aggregator.flow-aggregator.svc on 100.64.0.10:53: read udp 100.96.1.55:53497->100.64.0.10:53: i/o timeout
goroutine 20 [running]:
k8s.io/klog/v2.stacks(0xc0003f2c01, 0xc000022d80, 0x10f, 0x228)
	/go/pkg/mod/k8s.io/klog/v2@v2.8.0/klog.go:1021 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x263a900, 0xc000000003, 0x0, 0x0, 0xc0003e2b60, 0x2589ce2, 0x7, 0x3b, 0x0)
	/go/pkg/mod/k8s.io/klog/v2@v2.8.0/klog.go:970 +0x19b
k8s.io/klog/v2.(*loggingT).printf(0x263a900, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x19e24bb, 0x21, 0xc0004c6140, 0x1, ...)
	/go/pkg/mod/k8s.io/klog/v2@v2.8.0/klog.go:751 +0x191
k8s.io/klog/v2.Fatalf(...)
	/go/pkg/mod/k8s.io/klog/v2@v2.8.0/klog.go:1509
antrea.io/antrea/cmd/flow-aggregator.newFlowAggregatorCommand.func1(0xc0004782c0, 0xc0003e2af0, 0x0, 0x7)
	/antrea/cmd/flow-aggregator/main.go:59 +0x32f
github.com/spf13/cobra.(*Command).execute(0xc0004782c0, 0xc0003dc810, 0x7, 0x7, 0xc0004782c0, 0xc0003dc810)
	/go/pkg/mod/github.com/spf13/cobra@v1.1.1/command.go:854 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0xc0004782c0, 0xc0003dc780, 0x70, 0xc0003dc780)
	/go/pkg/mod/github.com/spf13/cobra@v1.1.1/command.go:958 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
	/go/pkg/mod/github.com/spf13/cobra@v1.1.1/command.go:895
antrea.io/antrea/cmd/flow-aggregator.main()
	/antrea/cmd/flow-aggregator/main.go:38 +0x52
github.com/confluentinc/bincover.RunTest(0x1a7cd28)
	/go/pkg/mod/github.com/confluentinc/bincover@v0.1.0/instrument_bin.go:93 +0x176
antrea.io/antrea/cmd/flow-aggregator.TestBincoverRunMain(0xc00047c000)
	/antrea/cmd/flow-aggregator/bincover_run_main_test.go:12 +0x2d
testing.tRunner(0xc00047c000, 0x1a7cd20)
	/usr/local/go/src/testing/testing.go:1123 +0xef
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:1168 +0x2b3
```

I ran the Jenkins tests multiple times over the last few days, including the v6 tests, and did not hit this. I think the proper way to solve this is to add retry logic instead of failing immediately, assuming there is no issue with the Kubernetes cluster itself, which seems unlikely here since all the other tests passed.

(EDIT)
I plan to resolve this in a separate PR; a sketch of the retry idea is below.
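
A minimal sketch of that retry idea, assuming the wait.PollImmediate helper from k8s.io/apimachinery/pkg/util/wait and the klog logger that Antrea already uses elsewhere; this is not the actual follow-up fix:

```go
package main

import (
	"fmt"
	"net"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/klog/v2"
)

// lookupWithRetry retries the DNS lookup instead of failing immediately,
// tolerating transient resolver timeouts like the one in the log above.
// The 5s interval and 2m timeout are illustrative choices.
func lookupWithRetry(name string) ([]net.IP, error) {
	var ips []net.IP
	err := wait.PollImmediate(5*time.Second, 2*time.Minute, func() (bool, error) {
		var lookupErr error
		ips, lookupErr = net.LookupIP(name)
		if lookupErr != nil {
			klog.V(2).InfoS("DNS lookup failed, retrying", "name", name, "err", lookupErr)
			return false, nil // transient failure: retry until timeout
		}
		return true, nil
	})
	return ips, err
}

func main() {
	// Service name taken from the failure log above.
	ips, err := lookupWithRetry("flow-aggregator.flow-aggregator.svc")
	fmt.Println(ips, err)
}
```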

@@ -340,6 +342,7 @@ func (exp *flowExporter) sendFlowRecords() error {
return err
}
exp.numDataSetsSent = exp.numDataSetsSent + 1
klog.V(4).InfoS("Deny connection sent successfully", "flowKey", connKey, "record", conn)
Member


s/record/connection/?

Member Author


Yes, that's correct. We need to keep it consistent with L334 but missed that here. I rephrased the log for correctness.

@srikartati
Member Author

/test-e2e
/test-ipv6-e2e

- Flows are not expiring as expected because the start time is zero and
the last export time is initialized to the start time. This commit fixes
that bug by setting the current time as the start time if it is zero when
a new connection is added.

- The idle flow export timeout is not configured properly in the e2e test.
There is a bug in the test, and this commit fixes it.

- The expected number of records is not correct in Kind clusters,
as packet counters are not supported in conntrack entries for the
OVS userspace datapath. The flow exporter cannot send records after
the active timeout expires, as we cannot consider flows to be active
without the statistics. Therefore, we expect only 2 records for any
given flow in Kind clusters and 3 flow records in other clusters.
We expect this limitation to be resolved once the OVS
userspace datapath supports statistics in conntrack entries.

- Additionally, we consider the source port from iperf flows to ignore
other similar flow records from the iperf control flows. Because of this,
we were passing the tests in Kind clusters even though there were only
two flow records.

- The kubectl logs command with the --since option is not outputting the
IPFIX collector logs as expected, and this causes intermittent failures.
We remove the --since option, as we have the source port as the gating
parameter when processing flow records.

Signed-off-by: Srikar Tati <stati@vmware.com>
@srikartati
Member Author

/test-all
/test-ipv6-e2e
/test-ipv6-only-e2e

Member

@tnqn left a comment


LGTM

@srikartati
Member Author

/test-ipv6-e2e

@zyiou added the area/flow-visibility and area/test/e2e labels on Jul 7, 2021
@antoninbas merged commit a8c7970 into antrea-io:main on Jul 7, 2021
@srikartati deleted the e2e_test_failure branch on July 7, 2021 at 22:57
Successfully merging this pull request may close these issues:

  • FlowAggregator bandwidth tests are too flaky (#2283)
  • IPFIX start and stop timestamps are wrong/too large (#1417)