Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the deadlock between the exporter and the conntrack polling go routines #2429

Merged
merged 1 commit into from
Jul 21, 2021

Conversation

srikartati
Copy link
Member

@srikartati srikartati commented Jul 17, 2021

Deadlock is due to the access of connection map from exporter goroutine waiting to acquire connection map lock to update the "DoneExport" flag of the stored connection. At the same time, the connection polling goroutine acquires the connection map lock waiting to acquire flow record map, which is acquired by the exporter goroutine.
This was caught in scale testing.
Resolved this through a temporary fix by adding the same flag in the flow record data struct.

The connection and record deletion logic will be re-evaluated through PR #2360 as it refactors the related code quite a bit.

@srikartati srikartati added the area/flow-visibility/exporter Issues or PRs related to the Flow Exporter functions in the Agent label Jul 17, 2021
zyiou
zyiou previously approved these changes Jul 17, 2021
Copy link
Contributor

@zyiou zyiou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix. The logic is much cleaner. This solution LGTM. Just some tiny comments.

pkg/agent/flowexporter/exporter/exporter.go Outdated Show resolved Hide resolved
}
} else {
// If the connection is in dying state and the corresponding flow records are already
// exported, then update the DoneExport flag on the connection.
if flowexporter.IsConnectionDying(conn) && record.DoneExport {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check whether connection is dying here again?

klog.V(2).Infof("Deleting the inactive flow records with key: %v from record map", key)
if err := exp.flowRecords.DeleteFlowRecordWithoutLock(key); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this function DeleteFlowRecordWithoutLock(key) can be removed, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Done.

@zyiou
Copy link
Contributor

zyiou commented Jul 17, 2021

Sorry I was intended to select Comment instead of Approve, please ignore it.

@codecov-commenter
Copy link

codecov-commenter commented Jul 17, 2021

Codecov Report

Merging #2429 (c3cd2f4) into main (7469d7e) will increase coverage by 5.32%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2429      +/-   ##
==========================================
+ Coverage   59.63%   64.96%   +5.32%     
==========================================
  Files         284      284              
  Lines       22178    25417    +3239     
==========================================
+ Hits        13226    16512    +3286     
+ Misses       7527     7358     -169     
- Partials     1425     1547     +122     
Flag Coverage Δ
e2e-tests 55.95% <100.00%> (?)
kind-e2e-tests 47.05% <100.00%> (+0.19%) ⬆️
unit-tests 42.03% <66.66%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
.../flowexporter/connections/conntrack_connections.go 81.30% <100.00%> (+2.61%) ⬆️
pkg/agent/flowexporter/exporter/exporter.go 82.36% <100.00%> (+3.15%) ⬆️
pkg/agent/flowexporter/flowrecords/flow_records.go 81.96% <100.00%> (+3.53%) ⬆️
pkg/controller/egress/ipallocator/allocator.go 67.82% <0.00%> (-15.16%) ⬇️
pkg/controller/networkpolicy/endpoint_querier.go 77.64% <0.00%> (-13.79%) ⬇️
pkg/apis/controlplane/v1beta1/conversion.go 72.44% <0.00%> (-11.89%) ⬇️
pkg/legacyapis/core/v1alpha2/register.go 69.23% <0.00%> (-10.77%) ⬇️
pkg/controller/egress/controller.go 76.76% <0.00%> (-10.44%) ⬇️
pkg/apis/stats/register.go 71.42% <0.00%> (-10.39%) ⬇️
pkg/legacyapis/stats/register.go 71.42% <0.00%> (-10.39%) ⬇️
... and 271 more

@srikartati srikartati force-pushed the fix_deadlock_in_flow_exporter branch 3 times, most recently from 34e5dd3 to 2d2b19d Compare July 19, 2021 18:54
klog.V(2).Infof("Deleting the inactive flow records with key: %v from record map", key)
if err := exp.flowRecords.DeleteFlowRecordWithoutLock(key); err != nil {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Done.

func (fr *FlowRecords) AddOrUpdateFlowRecord(key flowexporter.ConnectionKey, conn *flowexporter.Connection) error {
// If the connection is in dying state and the corresponding flow records are already
// exported, then there is no need to add or update the record.
if flowexporter.IsConnectionDying(conn) && conn.DoneExport {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check whether connection is dying here again?

Changed the name of the flag. Removed this extra check here and other places as well.

@srikartati srikartati requested a review from zyiou July 19, 2021 19:05
zyiou
zyiou previously approved these changes Jul 19, 2021
Copy link
Contributor

@zyiou zyiou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines 120 to 128
// SetDyingAndDoneExport sets the appropriate field of the record to true given
// the connection key and record. It also updates the record map. Caller is expected
// to grab lock to the record map.
func (fr *FlowRecords) SetDyingAndDoneExport(connKey flowexporter.ConnectionKey, record flowexporter.FlowRecord) {
record.DyingAndDoneExport = true

fr.recordsMap[connKey] = record
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this function a bit confusing, and the name is not really helping. Since you only use it in one place, do you think you could inline the code instead, and not define this function at all?

Copy link
Member Author

@srikartati srikartati Jul 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to do that initially but could not as the flow record map is not public. I modified it in a different way.

Deadlock is due to the access of connection map from exporter goroutine waiting
to acquire connection map lock to update the "DoneExport" flag of the stored
connection. At the same time, the connection polling goroutine acquires the
connection map lock waiting to acquire flow record map, which is acquired by the
exporter goroutine.
This was caught in scale testing.
Resolved this through a temporary fix by adding the same flag in the flow record data struct.

The connection and record deletion logic will be re-evaluated through PR antrea-io#2360 as it refactors the related code quite a bit.

Signed-off-by: Srikar Tati <stati@vmware.com>
@srikartati
Copy link
Member Author

/test-all

@srikartati
Copy link
Member Author

/test-ipv6-e2e /test-ipv6-only-e2e

@srikartati
Copy link
Member Author

/test-e2e /test-conformance /test-networkpolicy

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once merged, please cherry pick to the release-1.2 branch if appropriate (see https://github.com/antrea-io/antrea/blob/main/docs/contributors/cherry-picks.md)

}
// If the connection is in dying state or connection is not in conntrack
// table, we set the DyingAndDoneExport flag to do the deletion later.
record.DyingAndDoneExport = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a future improvement, maybe we should just change ForAllFlowRecordsDo so that updateOrSendFlowRecord uses a flow record pointer instead of a copy of stored flow record. This whole code is executed with the lock any way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Or I was thinking to store the pointer to the record in the record map instead of value.
As this code is being refactored and the flow record map would be mostly removed in PR #2360, I tried to keep the changes to a minimum.

@@ -83,6 +94,12 @@ func (fr *FlowRecords) AddFlowRecordToMap(connKey *flowexporter.ConnectionKey, r
fr.recordsMap[*connKey] = *record
}

// AddFlowRecordWithoutLock adds the flow record from record map given connection key.
// Caller is expected to grab the lock the record map.
func (fr *FlowRecords) AddFlowRecordWithoutLock(connKey *flowexporter.ConnectionKey, record *flowexporter.FlowRecord) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you can address naming consistency with the function above (AddFlowRecordToMap) in a separate PR, or implement the approach I suggested above if it is acceptable

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this can be done. Hope you are ok with the explanation in the other comment.

Copy link
Member Author

@srikartati srikartati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once merged, please cherry pick to the release-1.2 branch if appropriate (see https://github.com/antrea-io/antrea/blob/main/docs/contributors/cherry-picks.md)

Thanks for the review.
Sure, will do that. We could cherrypick to the release-1.1 branch as well as bug is applicable in that branch too. Not sure if there is any user who may want to pick that patch release on release-1.1 branch.

}
// If the connection is in dying state or connection is not in conntrack
// table, we set the DyingAndDoneExport flag to do the deletion later.
record.DyingAndDoneExport = true
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Or I was thinking to store the pointer to the record in the record map instead of value.
As this code is being refactored and the flow record map would be mostly removed in PR #2360, I tried to keep the changes to a minimum.

@@ -83,6 +94,12 @@ func (fr *FlowRecords) AddFlowRecordToMap(connKey *flowexporter.ConnectionKey, r
fr.recordsMap[*connKey] = *record
}

// AddFlowRecordWithoutLock adds the flow record from record map given connection key.
// Caller is expected to grab the lock the record map.
func (fr *FlowRecords) AddFlowRecordWithoutLock(connKey *flowexporter.ConnectionKey, record *flowexporter.FlowRecord) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this can be done. Hope you are ok with the explanation in the other comment.

@antoninbas
Copy link
Contributor

Don't think this will affect integration tests (the testbed has been fixed now but I don't know of an easy way to re-trigger the tests), so I'll go ahead and merge this

@antoninbas antoninbas merged commit 365623f into antrea-io:main Jul 21, 2021
@antoninbas
Copy link
Contributor

@srikartati please cherry-pick to release branches as needed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/flow-visibility/exporter Issues or PRs related to the Flow Exporter functions in the Agent
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants