-
Notifications
You must be signed in to change notification settings - Fork 366
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid ServiceCIDR flapping on agent start #5017
Conversation
90d5e82
to
206e5c3
Compare
return | ||
} | ||
svcs, _ := d.serviceLister.List(labels.Everything()) | ||
d.updateServiceCIDR(svcs...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here all existing Services are listed, and a calculated Service CIDR will be got, how could we clean the stale Service routes? Previously, we use the first ClusterIP to collect all routes whose destination includes the ClusterIP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previous cleanup logic is not reliable, there is no guarantee the first clusterIP has been covered the previous routes. Please check the latest code about how stale routes are collected.
569025d
to
00ec216
Compare
00ec216
to
d39f100
Compare
The previous implementation always generated intermediate values for ServiceCIDR on agent start, which may interrupt the Service traffic and causes difficulty for cleaning up stale routes as the value calculated at one point may not be reliable to identify all stale routes. This commit waits for the Service Informer to be synced first, and calculates the ServiceCIDR based on all Services. Ideally the Service route won't change in most cases, and hence avoid the above issues. Besides, it fixes an issue that stale routes on Linux were not cleaned up correctly due to incorrect check. Signed-off-by: Quan Tian <qtian@vmware.com>
d39f100
to
741df1b
Compare
/test-all |
|
||
func (d *Discoverer) updateServiceCIDR(svcs ...*corev1.Service) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me that the current implementation of updateServiceCIDR
will always be correct if multiple workers / goroutines can call updateServiceCIDR
simultaneously?
That being said, it should not matter given that you start a single worker goroutine in Run
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may not work if it's called simultaneously. There is no need to have multiple workers for this task as it just checks if an IP is in a subnet in most cases (CIDR expansion can only happen few times).
@hongliangl do you have other comments for this one? |
The previous implementation always generated intermediate values for ServiceCIDR on agent start, which may interrupt the Service traffic and causes difficulty for cleaning up stale routes as the value calculated at one point may not be reliable to identify all stale routes. This commit waits for the Service Informer to be synced first, and calculates the ServiceCIDR based on all Services. Ideally the Service route won't change in most cases, and hence avoid the above issues. Besides, it fixes an issue that stale routes on Linux were not cleaned up correctly due to incorrect check. Signed-off-by: Quan Tian <qtian@vmware.com>
The previous implementation always generated intermediate values for ServiceCIDR on agent start, which may interrupt the Service traffic and causes difficulty for cleaning up stale routes as the value calculated at one point may not be reliable to identify all stale routes.
This commit waits for the Service Informer to be synced first, and calculates the ServiceCIDR based on all Services. Ideally the Service route won't change in most cases, and hence avoid the above issues.
Besides, it fixes an issue that stale routes on Linux were not cleaned up correctly due to incorrect check.