Fix bug in FlowAggregator when sending template set #2039
Conversation
/test-ipv6-e2e
Codecov Report
@@ Coverage Diff @@
## main #2039 +/- ##
===========================================
+ Coverage 41.69% 55.58% +13.89%
===========================================
Files 127 265 +138
Lines 15844 19773 +3929
===========================================
+ Hits 6606 10991 +4385
+ Misses 8684 7616 -1068
- Partials 554 1166 +612
Flags with carried forward coverage won't be shown.
/test-ipv6-e2e
/test-all
/test-ipv6-e2e
There was an error when running in dual-stack, but here the test before TestFlowAggregator failed too, so I am not sure if it's the same error. However, with this fix I don't see the very high execution times for the flow aggregator tests, unlike the ones in issue #2035.
Here as well I see the "Antrea agent not found" error, like the one in #2035.
LGTM, thanks for using a constant when setting up the flow aggregator.
Maybe we can print the log from the flow aggregator to see what happened. Is it possible to have a local Jenkins setup for us to test this out?
Thanks for looking further into the issue. It is essential to know the root cause of this intermittent failure, but the intention of this PR is to cut down the abnormal execution times when failures happen, not to fix the intermittent issue itself. I should say it partially fixes the issue instead of fixing it. For triaging this, we are missing a reliable way to reproduce the issue, as it is intermittent and does not happen often. The reported issue is on dual-stack clusters, and I have also seen it only on dual-stack clusters; please let me know if we have seen it on other clusters. The local setup that is closest to Jenkins is Vagrant, but we do not have automation to deploy dual-stack clusters on Vagrant yet. It looks like the FlowAggregator Pod is in a CrashLoopBackOff state, so fetching the flow aggregator logs will fail at the end of the test. We need to print the logs and the describe output during the test to figure out the reason.
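For reference, a minimal sketch of what printing the logs and describe output during the test could look like, reusing the provider.RunCommandOnNode pattern visible in the diff below. The namespace and label selector here are assumptions for illustration, not the actual antrea test code.

// dumpFlowAggregatorDiagnostics prints the Flow Aggregator logs and describe
// output while the test is still running, so the information is captured even
// if the Pod later enters CrashLoopBackOff. The namespace and label selector
// are assumed values, not taken from the antrea manifests.
func dumpFlowAggregatorDiagnostics(t *testing.T) {
	cmds := []string{
		"kubectl logs -n flow-aggregator -l app=flow-aggregator --tail=100",
		"kubectl describe pod -n flow-aggregator -l app=flow-aggregator",
	}
	for _, cmd := range cmds {
		// Run each command on the control-plane Node via the test provider.
		rc, stdout, stderr, err := provider.RunCommandOnNode(controlPlaneNodeName(), cmd)
		if err != nil || rc != 0 {
			t.Logf("error running %q: rc=%d, err=%v, stderr=%s", cmd, rc, err, stderr)
			continue
		}
		t.Logf("output of %q:\n%s", cmd, stdout)
	}
}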
/test-ipv6-e2e
test/e2e/flowaggregator_test.go (Outdated)
@@ -216,8 +215,8 @@ func checkRecordsForFlows(t *testing.T, data *TestData, srcIP string, dstIP stri
 	// Polling to make sure all the data records corresponding to the iperf flow
 	// are received.
-	err = wait.Poll(250*time.Millisecond, collectorCheckTimeout, func() (bool, error) {
+	err = wait.PollImmediate(500*time.Millisecond, flowAggregatorExportInterval, func() (bool, error) {
 		rc, collectorOutput, _, err := provider.RunCommandOnNode(controlPlaneNodeName(), fmt.Sprintf("kubectl logs --since=%v ipfix-collector -n antrea-test", time.Since(timeStart).String()))
PR #1904 changed the wait condition on line 224 from return strings.Contains(collectorOutput, srcIP) && strings.Contains(collectorOutput, dstIP), nil to return strings.Count(collectorOutput, srcIP) >= expectedNumDataRecords && strings.Count(collectorOutput, dstIP) >= expectedNumDataRecords, nil to avoid the case where we only receive one record. So here we can use flowAggregatorExportInterval * 2.
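For context, this is roughly what the polling loop looks like after the change, combining the diff above with the strings.Count condition from PR #1904. A sketch, not the verbatim test code; the surrounding identifiers (provider, timeStart, srcIP, dstIP, expectedNumDataRecords) come from the test harness.

err = wait.PollImmediate(500*time.Millisecond, flowAggregatorExportInterval, func() (bool, error) {
	rc, collectorOutput, _, err := provider.RunCommandOnNode(controlPlaneNodeName(),
		fmt.Sprintf("kubectl logs --since=%v ipfix-collector -n antrea-test", time.Since(timeStart).String()))
	if err != nil || rc != 0 {
		return false, err
	}
	// Wait until all the expected data records, not just the first one, show
	// up in the collector output for both the source and destination IPs.
	return strings.Count(collectorOutput, srcIP) >= expectedNumDataRecords &&
		strings.Count(collectorOutput, dstIP) >= expectedNumDataRecords, nil
})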
Polling happens after the iperf runtime, which is currently 10s (flowAggregatorExportInterval * 2). The timeout will capture the final records if there are any, so I feel one cycle of flowAggregatorExportInterval is enough. What do you think?
Makes sense to me. Thanks for clarifying.
Force-pushed from 3aae049 to fd8de9b
Force-pushed from fd8de9b to 3b0f6e9
fa.set.ResetSet()
if err := fa.set.PrepareSet(ipfixentities.Template, templateID); err != nil {
	return 0, err
}
You are right. I encountered the issue when the exporter is restarted in the Flow Aggregator, and I see the set still has a record of type DataType. Looking more closely at the existing code, we are missing the set reset before L296 here. I think it is better to have it in one place, so I removed those statements in the callers of sendTemplateSet. I am making the same fix in the Flow Exporter too.
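For illustration, a minimal sketch of the resulting shape of sendTemplateSet with the reset folded into one place. The record-building step and the SendSet call are placeholders standing in for the actual antrea/go-ipfix code, not a verbatim copy of it.

func (fa *flowAggregator) sendTemplateSet(templateID uint16) (int, error) {
	// Always start from a clean set: after the exporter reconnects, the
	// shared set may still hold a Data-type record from a previous export.
	fa.set.ResetSet()
	if err := fa.set.PrepareSet(ipfixentities.Template, templateID); err != nil {
		return 0, err
	}
	// ... build the template record from the registered information elements
	// and add it to fa.set here ...
	return fa.exportingProcess.SendSet(fa.set)
}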
Force-pushed from a819f45 to 9e37dc8
Codecov Report
@@ Coverage Diff @@
## main #2039 +/- ##
==========================================
+ Coverage 61.83% 65.49% +3.66%
==========================================
Files 276 276
Lines 21145 21140 -5
==========================================
+ Hits 13074 13845 +771
+ Misses 6702 5908 -794
- Partials 1369 1387 +18
Flags with carried forward coverage won't be shown.
/test-all
LGTM
/test-ipv6-e2e
/test-conformance
/test-conformance
Force-pushed from 6966fe0 to e4c7703
/test-all
I had to add the Signed-off-by snippet to the commit message. @antoninbas could you approve this again?
You could wait for some more time, as there will be conflicts because of other PRs getting merged. Thanks.
Force-pushed from c03c14c to c2fcf98
@srikartati there is a formatting issue
Force-pushed from c2fcf98 to bad90a4
/test-all
We need to prepare the set as a template set before sending a template record. Otherwise, the flow aggregator will crash when retrying to establish the connection to the flow collector.

In addition, fix the e2e flow aggregator test timeouts. Timeout values are quite high when the tests fail intermittently on dual-stack clusters. Fix that by reducing the poll timeout when checking for logs on the ipfix-collector Pod.

Signed-off-by: Srikar Tati <stati@vmware.com>
Force-pushed from bad90a4 to 999a423
/test-all
All the tests have passed. I will wait for PR #2163 to be merged, as it is a higher priority for the v1.2 release, to make sure there are no conflicts for that PR.
Tried cherry-picking the commit from PR #2163 onto my local branch, and there are no conflicts.
We need to reset the set and prepare it as a template set before sending the template record. Otherwise, the flow aggregator will crash when retrying to establish the connection to the flow collector. The bug was the missing set reset operation.

In addition, during this failure the test takes a long time to finish. Fix the e2e flow aggregator test timeouts: the timeout values are quite high when the tests fail intermittently on dual-stack clusters. Fix that by reducing the poll timeout when checking for logs on the ipfix-collector Pod.
Fixes #2035
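To make the crash scenario concrete, a hypothetical sketch of the retry path: without the ResetSet call, the second sendTemplateSet after a disconnect operates on a set that still contains a Data record. All names here (initExportingProcess, retryInterval, templateID) are illustrative, not the actual antrea code.

// Retry loop run when the connection to the flow collector is lost.
for {
	if err := fa.initExportingProcess(); err != nil {
		klog.Errorf("error when initializing exporting process: %v, retrying", err)
		time.Sleep(retryInterval) // retryInterval: assumed constant
		continue
	}
	// Without the reset, this second attempt reuses a set that still holds a
	// Data record, so preparing it as a Template set crashes the aggregator.
	if _, err := fa.sendTemplateSet(templateID); err != nil {
		klog.Errorf("error when sending template set: %v", err)
		fa.exportingProcess.CloseConnToCollector()
		time.Sleep(retryInterval)
		continue
	}
	break
}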