Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Multicast] Clone remoteMembers when creating an updated GroupMemberStatus #4903

Conversation

ceclinux
Copy link
Contributor

@ceclinux ceclinux commented Apr 26, 2023

This patch is to resolve a concurrency issue in Multicast feature with encap mode.

The root cause is the existing code doesn't create a new object for field "remoteMembers" before updating it into cache.

Fixed: #4904

@codecov
Copy link

codecov bot commented Apr 26, 2023

Codecov Report

Merging #4903 (5354f38) into main (97b8e4a) will decrease coverage by 0.78%.
The diff coverage is 66.66%.

❗ Current head 5354f38 differs from pull request most recent head 044084a. Consider uploading reports for the commit 044084a to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4903      +/-   ##
==========================================
- Coverage   72.13%   71.36%   -0.78%     
==========================================
  Files         409      409              
  Lines       61161    59956    -1205     
==========================================
- Hits        44121    42788    -1333     
- Misses      14127    14286     +159     
+ Partials     2913     2882      -31     
Flag Coverage Δ
e2e-tests ?
integration-tests 33.79% <0.00%> (-0.08%) ⬇️
kind-e2e-tests 46.83% <0.00%> (-0.48%) ⬇️
unit-tests 62.75% <66.66%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pkg/agent/multicast/mcast_controller.go 69.12% <66.66%> (-0.16%) ⬇️

... and 56 files with indirect coverage changes

@luolanzone luolanzone added the action/backport Indicates a PR that requires backports. label Apr 26, 2023
@@ -82,12 +82,14 @@ type GroupMemberStatus struct {
group net.IP
// localMembers is a map for the local Pod member and its last update time, key is the Pod's interface name,
// and value is its last update time.
localMembers map[string]time.Time
localMembers map[string]time.Time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is necessary to introduce Mutexes for localMembers and remoteMembers. The root cause of the issue is we didn't create a new object for remoteMembers before updating the cache. The change in line 137 is enough.

@wenyingd wenyingd force-pushed the concurrency-control-for-members-in-GroupMemberStatus branch from 63d63d1 to 5354f38 Compare April 26, 2023 06:48
@@ -131,7 +131,7 @@ func (c *Controller) updateGroupMemberStatus(obj interface{}, e *mcastGroupEvent
newStatus := &GroupMemberStatus{
group: status.group,
localMembers: make(map[string]time.Time),
remoteMembers: status.remoteMembers,
remoteMembers: sets.NewString(status.remoteMembers.List()...),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wenyingd could you provide a few background for this issue in the PR summary?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would create an issue for it.

@luolanzone
Copy link
Contributor

/test-multicast-e2e

luolanzone
luolanzone previously approved these changes Apr 26, 2023
Copy link
Contributor

@luolanzone luolanzone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -131,7 +131,7 @@ func (c *Controller) updateGroupMemberStatus(obj interface{}, e *mcastGroupEvent
newStatus := &GroupMemberStatus{
group: status.group,
localMembers: make(map[string]time.Time),
remoteMembers: status.remoteMembers,
remoteMembers: sets.NewString(status.remoteMembers.List()...),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
remoteMembers: sets.NewString(status.remoteMembers.List()...),
remoteMembers: status.remoteMembers.Union(),

List performs unnecessary sorting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

pkg/agent/multicast/mcast_controller.go Outdated Show resolved Hide resolved
Comment on lines 632 to 634
wg.Add(4)
// Below func adds local group join events.
go func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below code is too redundant, can we reduce redundancy?

And there seems no point to generate the events concurrently as they are eventually consumed by a single event handler. The race is not between the event generator and others, but the eventHandler and worker

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

Comment on lines 711 to 719
for i := 0; i < numEvents; i++ {
mockIfaceStore.EXPECT().GetInterfaceByName(fmt.Sprintf("local-interfaceName-%d", i)).Return(&interfacestore.InterfaceConfig{
Type: interfacestore.ContainerInterface,
InterfaceName: fmt.Sprintf("local-interfaceName-%d", i),
OVSPortConfig: &interfacestore.OVSPortConfig{
OFPort: int32(i),
},
}, true).AnyTimes()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's contradictory to assert a function is called "anytimes" many times.
If you want accurate assertion, it shouldn't use Anytimes.
If you want it to be called anytimes, no need to have a for loop here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a valid interfaceStore to avoid such mock.

for i := 0; i < 2; i++ {
go wait.Until(c.worker, time.Second, stopCh)
}
wg.Wait()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no guaratee that when this call finishes, eventHander and worker have processed the events. Could we use some determinstic input and decide when the goroutines finish their job by polling the result, which could be more determinstic?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check on the size of eventCh and queue after all events have been sent.

@wenyingd wenyingd force-pushed the concurrency-control-for-members-in-GroupMemberStatus branch from 5354f38 to 044084a Compare April 26, 2023 11:36
@@ -131,7 +131,7 @@ func (c *Controller) updateGroupMemberStatus(obj interface{}, e *mcastGroupEvent
newStatus := &GroupMemberStatus{
group: status.group,
localMembers: make(map[string]time.Time),
remoteMembers: status.remoteMembers,
remoteMembers: sets.NewString().Union(status.remoteMembers),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This writing initializes an extra set. Why don't just use status.remoteMembers.Union(nil) which is quite common in our code base?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

wg.Add(4)

eventFunc := func(eType eventType, isLocal bool) {
LeastSignificantByteArr := rand.Perm(numEvents)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LeastSignificantByteArr := rand.Perm(numEvents)
leastSignificantByteArr := rand.Perm(numEvents)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

stopCh := make(chan struct{})
defer close(stopCh)
groupIP := net.ParseIP("224.3.4.5")
numEvents := 250
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this number, the test's logging could flood the output and even make the race stack being cut, making it hard to figure out what causes the failure. You can try removing the fix and runing the test alone. I tested a few times setting it to 10 could always trigger the race and only produced reasonable logs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@ceclinux ceclinux force-pushed the concurrency-control-for-members-in-GroupMemberStatus branch from 044084a to 60559d0 Compare April 27, 2023 02:10
tnqn
tnqn previously approved these changes Apr 27, 2023
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tnqn
Copy link
Member

tnqn commented Apr 27, 2023

/test-all

@tnqn
Copy link
Member

tnqn commented Apr 27, 2023

/test-multicast-e2e

@luolanzone
Copy link
Contributor

@tnqn can you change the commit title when you merge it? the commit title doesn't match the change. Thanks.

@ceclinux ceclinux changed the title [Multicast] Add concurrency control in GroupMemberStatus [Multicast] Clone new remoteMembers when creating GroupMemberStatus Apr 27, 2023
@ceclinux ceclinux changed the title [Multicast] Clone new remoteMembers when creating GroupMemberStatus [Multicast] Clone remoteMembers when creating GroupMemberStatus Apr 27, 2023
@ceclinux ceclinux changed the title [Multicast] Clone remoteMembers when creating GroupMemberStatus [Multicast] Clone remoteMembers when creating an updated GroupMemberStatus Apr 27, 2023
@ceclinux ceclinux force-pushed the concurrency-control-for-members-in-GroupMemberStatus branch from 60559d0 to 2ddfcb8 Compare April 27, 2023 07:55
When updating GroupMemberStatus, a clone of remoteMembers should assign to the
updated GroupMemberStatus. Simply setting it with original remoteMembers
will cause race conditions.

Fixes antrea-io#4904

Signed-off-by: ceclinux <src655@gmail.com>
@ceclinux ceclinux force-pushed the concurrency-control-for-members-in-GroupMemberStatus branch from 2ddfcb8 to 039f0c8 Compare April 27, 2023 08:15
@ceclinux
Copy link
Contributor Author

@tnqn can you change the commit title when you merge it? the commit title doesn't match the change. Thanks.

I have updated the title and commit message.

@luolanzone
Copy link
Contributor

@ceclinux I recalled that you said this issue is also found in v1.9? I suppose we should backport this to v1.9 and v1.10? @tnqn

@tnqn
Copy link
Member

tnqn commented Apr 27, 2023

/skip-all

@tnqn tnqn merged commit a1e96e4 into antrea-io:main Apr 27, 2023
@tnqn
Copy link
Member

tnqn commented Apr 27, 2023

@ceclinux I recalled that you said this issue is also found in v1.9? I suppose we should backport this to v1.9 and v1.10? @tnqn

The releases in maintenance are 1.10 and v1.11, so let's backport to these two releases.

jainpulkit22 pushed a commit to urharshitha/antrea that referenced this pull request Apr 28, 2023
…trea-io#4903)

When updating GroupMemberStatus, a clone of remoteMembers should assign to the
updated GroupMemberStatus. Simply setting it with original remoteMembers
will cause race conditions.

Fixes antrea-io#4904

Signed-off-by: ceclinux <src655@gmail.com>
rajnkamr pushed a commit to rajnkamr/antrea that referenced this pull request May 4, 2023
…trea-io#4903)

When updating GroupMemberStatus, a clone of remoteMembers should assign to the
updated GroupMemberStatus. Simply setting it with original remoteMembers
will cause race conditions.

Fixes antrea-io#4904

Signed-off-by: ceclinux <src655@gmail.com>
@wenyingd wenyingd mentioned this pull request May 24, 2023
ceclinux added a commit to ceclinux/antrea that referenced this pull request Jun 5, 2023
…trea-io#4903)

When updating GroupMemberStatus, a clone of remoteMembers should assign to the
updated GroupMemberStatus. Simply setting it with original remoteMembers
will cause race conditions.

Fixes antrea-io#4904

Signed-off-by: ceclinux <src655@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/backport Indicates a PR that requires backports.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multicast concurrency issue
4 participants