-
Notifications
You must be signed in to change notification settings - Fork 366
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve AntreaProxy route syncing on Windows #4941
Conversation
d711902
to
2be46e2
Compare
2be46e2
to
5ffc476
Compare
5ffc476
to
dd08f4e
Compare
/test-windows-all |
/test-windows-all |
dd08f4e
to
b71e850
Compare
b71e850
to
204864c
Compare
204864c
to
f679376
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/test-all |
pkg/agent/route/route_windows.go
Outdated
if iputil.IsSubCIDR(serviceCIDR, rt.DestinationSubnet) { | ||
staleRoutes[rt.String()] = rt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why it lists routes twice? couldn't get the undesired routes in the above else branch by a unified condition?
// Not the routes we are interested in.
if !rt.GatewayAddress.Equal(gw) {
continue
}
// It's the latest route we just installed.
if net.IPNetEqual(rt.DestinationSubnet, serviceCIDR) {
continue
}
// The route covers the desired route. It was installed when the calculated ServiceCIDR is larger than the current one, which could happen after some Services are deleted.
if rt.DestinationSubnet.Contains(serviceCIDR.IP) {
staleRoutes = append(staleRoutes, rt)
}
// The desired route covers the route. It was either installed when the calculated ServiceCIDR is smaller than the current one, or a per-IP route generated before v1.12.0.
if serviceCIDR.Contains(rt.DestinationSubnet.IP) {
staleRoutes = append(staleRoutes, rt)
}
It doesn't need to change staleRoutes to map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It lists routes twice only when installing the first Service. For the first listing of routes, it is used to consistent with Linux code. Besides, the logic is used to find the existing routes whose destinations contains the first ClusterIP (serviceCIDR.IP), and it is only called when syncing the first ClusterIP.
For the second listing of routes, it is used to collect the per-IP route generated before v1.12.0. The reason to change staleRoutes to map is that:
- Assuming the first ClusterIP is 10.96.0.1, and the second one is 10.96.0.2
- After syncing the second ClusterIP, the route for 10.96.0.1/32 will be added to
staleRoutes
according to L284-L288 - The route 10.96.0.1 will be also listed in
routes
in L304 - The route 10.96.0.1 will be added to
staleRoutes
according to L311-L314 again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- They have different cases to handle and their code are different anyway, I don't feel it's important to use partially consistent code but sacrifice extra system calls.
- It doesn't make sense to relist the routes everytime the ServiceCIDR is updated, I guess the problem is the implementation of the discover generating intermediate values. I think there are more problems in it, Avoid ServiceCIDR flapping on agent start #5017 should fix it.
- Even the Linux implementation seems wrong as it can only clean up larger CIDR but not the smaller ones.
f679376
to
9a586dd
Compare
9a586dd
to
7b2e3fc
Compare
@@ -27,7 +27,7 @@ func RunCommand(cmd string) (string, error) { | |||
// https://stackoverflow.com/questions/19282870/how-can-i-use-try-catch-and-get-my-script-to-stop-if-theres-an-error/19285405 | |||
psCmd := exec.Command("powershell.exe", "-NoLogo", "-NoProfile", "-NonInteractive", "-Command", | |||
fmt.Sprintf(`$ErrorActionPreference="Stop";try {%s} catch {Write-Host $_;os.Exit(1)}`, cmd)) // #nosec G204 | |||
stdout, err := psCmd.Output() | |||
stdout, err := psCmd.CombinedOutput() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I tested this powershell function with CombinedOutput support, I got this error: os.Exit : The term ‘os.Exit’ is not recognized as the name of a cmdlet, function, script file, or operable program.
Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:211
+ catch {Write-Host $_;os.Exit(1)}
+ ~~~~~~~
+ CategoryInfo : ObjectNotFound: (os.Exit:String) [], ParentContainsErrorRecordException
+ FullyQualifiedErrorId : CommandNotFoundException
exit status 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which command? CombinedOutput
is used to output error info, while Output
ignores error info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I tested a netnatstaticmapping function that includes the runCommand(cmd). When I replaced Output with CombinedOutput, I got this error. I think this issue has always been there, previously we didn't get it because psCmd.Output() ignore this error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got that, then I think it's another issue. We could resolve the issue after merging this. BTW, have you tested the command manually on powershell? If the output of the command in powershell still has the error, maybe we should double check the command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g.
try { Get-Process} catch { Write-Host $; os.Exit(1) }
pass
try { Get-Process error} catch { Write-Host $; os.Exit(1) }
Cannot find a process with the name "error". Verify the process name and call the cmdlet again.
os.Exit : The term 'os.Exit' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path
is correct and try again.
My concern is after you enable CombinedOutput in RunCommand function, all error commands may bring this same error log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean that we need to add some code to ignore the error which was ignored by Output
previously?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with either keeping the use of Output in RunCommand or ignoring the error previously, as long as we don't get this error in the agent log.
In the long term, we may need to addressing the os.Exit issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tnqn After syncing with @wenyingd @XinShuYang , the error output is caused by os.Exit(1)
since its not powershell command. We can fix it by changing os.Exit(1)
to Exit(1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, need to confirm @XinShuYang's comment before merging
Should the description be updated? |
I'll check the description. |
7b2e3fc
to
0042779
Compare
Updated |
/test-windows-all |
This PR fixes the following issues: 1. AntreaAgent logs "Failed to sync route" when attempting to sync route entries every time. 2. AntreaAgent logs "Failed to install route for Service CIDR" err="failed to delete stale Service CIDR route" during startup. For the first issue, previously, to recover the connected route of antrea-gw0 (assuming the IP address is 10.10.0.1/24) that may have been deleted by mistake, a route with a destination 10.10.0.1/24 and gateway 10.10.0.1 was periodically synced. However, this caused an error because an existing active route with the same destination but a different gateway 0.0.0.0 should have already been automatically installed when antrea-gw0 was created. To address this issue, this PR changes the gateway of the recover route from 10.10.0.1 to 0.0.0.0, which matches the existing installed route. This ensures that the periodic sync will not cause any errors. For the second issue, previously, when syncing the second ClusterIP, the stale route entry installed for the first ClusterIP is added to the stale routes twice. This results in the error log. Signed-off-by: Hongliang Liu <lhongliang@vmware.com>
0042779
to
044a7dc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/test-all |
/test-e2e |
1 similar comment
/test-e2e |
/test-windows-proxyall-e2e |
1 similar comment
/test-windows-proxyall-e2e |
I saw that the failed case in windows-proxyall-e2e is the well-known flaky test TestNodePortLocal/testNPLChangePortRangeAgentRestart. Do we still need any tests for this PR? @tnqn |
It can be ignored |
Aren't the issues introduced in release-1.12? Do them really need backport? |
@@ -383,7 +394,7 @@ func (c *Client) syncRoute() error { | |||
gwAutoconfRoute := &util.Route{ | |||
LinkIndex: c.nodeConfig.GatewayConfig.LinkIndex, | |||
DestinationSubnet: c.nodeConfig.PodIPv4CIDR, | |||
GatewayAddress: c.nodeConfig.GatewayConfig.IPv4, | |||
GatewayAddress: net.IPv4zero, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Holder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should backport.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's introduced by 6143e78
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I forgot that. Thanks for reminding.
fmt.Sprintf(`$ErrorActionPreference="Stop";try {%s} catch {Write-Host $_;os.Exit(1)}`, cmd)) // #nosec G204 | ||
stdout, err := psCmd.Output() | ||
fmt.Sprintf(`$ErrorActionPreference="Stop";try {%s} catch {Write-Host $_;Exit(1)}`, cmd)) // #nosec G204 | ||
stdout, err := psCmd.CombinedOutput() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Holder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
This PR fixes the following issues: 1. AntreaAgent logs "Failed to sync route" when attempting to sync route entries every time. 2. AntreaAgent logs "Failed to install route for Service CIDR" err="failed to delete stale Service CIDR route" during startup. For the first issue, previously, to recover the connected route of antrea-gw0 (assuming the IP address is 10.10.0.1/24) that may have been deleted by mistake, a route with a destination 10.10.0.1/24 and gateway 10.10.0.1 was periodically synced. However, this caused an error because an existing active route with the same destination but a different gateway 0.0.0.0 should have already been automatically installed when antrea-gw0 was created. To address this issue, this PR changes the gateway of the recover route from 10.10.0.1 to 0.0.0.0, which matches the existing installed route. This ensures that the periodic sync will not cause any errors. For the second issue, previously, when syncing the second ClusterIP, the stale route entry installed for the first ClusterIP is added to the stale routes twice. This results in the error log. Signed-off-by: Hongliang Liu <lhongliang@vmware.com>
This PR fixes the following issues:
every time.
delete stale Service CIDR route" during startup.
For the first issue, previously, to recover the connected route of antrea-gw0
(assuming the IP address is 10.10.0.1/24) that may have been deleted by mistake,
a route with a destination 10.10.0.1/24 and gateway 10.10.0.1 was periodically
synced. However, this caused an error because an existing active route with the
same destination but a different gateway 0.0.0.0 should have already been
automatically installed when antrea-gw0 was created. To address this issue, this
PR changes the gateway of the recover route from 10.10.0.1 to 0.0.0.0, which
matches the existing installed route. This ensures that the periodic sync will
not cause any errors.
For the second issue, previously, when syncing the second ClusterIP, the stale
route entry installed for the first ClusterIP is added to the stale routes twice.
This results in the error log.
Signed-off-by: Hongliang Liu lhongliang@vmware.com