Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve AntreaProxy route syncing on Windows #4941

Merged
merged 1 commit into from
May 24, 2023

Conversation

hongliangl
Copy link
Contributor

@hongliangl hongliangl commented May 5, 2023

This PR fixes the following issues:

  1. AntreaAgent logs "Failed to sync route" when attempting to sync route entries
    every time.
  2. AntreaAgent logs "Failed to install route for Service CIDR" err="failed to
    delete stale Service CIDR route" during startup.

For the first issue, previously, to recover the connected route of antrea-gw0
(assuming the IP address is 10.10.0.1/24) that may have been deleted by mistake,
a route with a destination 10.10.0.1/24 and gateway 10.10.0.1 was periodically
synced. However, this caused an error because an existing active route with the
same destination but a different gateway 0.0.0.0 should have already been
automatically installed when antrea-gw0 was created. To address this issue, this
PR changes the gateway of the recover route from 10.10.0.1 to 0.0.0.0, which
matches the existing installed route. This ensures that the periodic sync will
not cause any errors.

For the second issue, previously, when syncing the second ClusterIP, the stale
route entry installed for the first ClusterIP is added to the stale routes twice.
This results in the error log.

Signed-off-by: Hongliang Liu lhongliang@vmware.com

@hongliangl hongliangl requested review from tnqn and wenyingd May 5, 2023 12:51
@hongliangl hongliangl added the area/OS/windows Issues or PRs related to the Windows operating system. label May 5, 2023
@hongliangl hongliangl changed the title Fix Windows antrea-gw0 connected route syncing Improve AntreaProxy route syncing on Windows May 6, 2023
@hongliangl hongliangl added area/proxy Issues or PRs related to proxy functions in Antrea action/backport Indicates a PR that requires backports. labels May 6, 2023
@hongliangl
Copy link
Contributor Author

/test-windows-all

pkg/agent/route/route_windows.go Outdated Show resolved Hide resolved
pkg/agent/route/route_windows.go Outdated Show resolved Hide resolved
@hongliangl
Copy link
Contributor Author

/test-windows-all

pkg/agent/route/route_windows.go Outdated Show resolved Hide resolved
pkg/util/ip/ip.go Outdated Show resolved Hide resolved
wenyingd
wenyingd previously approved these changes May 12, 2023
Copy link
Contributor

@wenyingd wenyingd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tnqn tnqn added this to the Antrea v1.12 release milestone May 19, 2023
@hongliangl
Copy link
Contributor Author

/test-all
/test-windows-all

Comment on lines 311 to 312
if iputil.IsSubCIDR(serviceCIDR, rt.DestinationSubnet) {
staleRoutes[rt.String()] = rt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why it lists routes twice? couldn't get the undesired routes in the above else branch by a unified condition?

// Not the routes we are interested in.
if !rt.GatewayAddress.Equal(gw) {
        continue
}
// It's the latest route we just installed.
if net.IPNetEqual(rt.DestinationSubnet, serviceCIDR) {
        continue
}
// The route covers the desired route. It was installed when the calculated ServiceCIDR is larger than the current one, which could happen after some Services are deleted.
if rt.DestinationSubnet.Contains(serviceCIDR.IP) {
        staleRoutes = append(staleRoutes, rt)
}
// The desired route covers the route. It was either installed when the calculated ServiceCIDR is smaller than the current one, or a per-IP route generated before v1.12.0.
if serviceCIDR.Contains(rt.DestinationSubnet.IP) {
        staleRoutes = append(staleRoutes, rt)
}

It doesn't need to change staleRoutes to map.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It lists routes twice only when installing the first Service. For the first listing of routes, it is used to consistent with Linux code. Besides, the logic is used to find the existing routes whose destinations contains the first ClusterIP (serviceCIDR.IP), and it is only called when syncing the first ClusterIP.

For the second listing of routes, it is used to collect the per-IP route generated before v1.12.0. The reason to change staleRoutes to map is that:

  • Assuming the first ClusterIP is 10.96.0.1, and the second one is 10.96.0.2
  • After syncing the second ClusterIP, the route for 10.96.0.1/32 will be added to staleRoutes according to L284-L288
  • The route 10.96.0.1 will be also listed in routes in L304
  • The route 10.96.0.1 will be added to staleRoutes according to L311-L314 again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. They have different cases to handle and their code are different anyway, I don't feel it's important to use partially consistent code but sacrifice extra system calls.
  2. It doesn't make sense to relist the routes everytime the ServiceCIDR is updated, I guess the problem is the implementation of the discover generating intermediate values. I think there are more problems in it, Avoid ServiceCIDR flapping on agent start #5017 should fix it.
  3. Even the Linux implementation seems wrong as it can only clean up larger CIDR but not the smaller ones.

pkg/agent/route/route_windows.go Show resolved Hide resolved
@@ -27,7 +27,7 @@ func RunCommand(cmd string) (string, error) {
// https://stackoverflow.com/questions/19282870/how-can-i-use-try-catch-and-get-my-script-to-stop-if-theres-an-error/19285405
psCmd := exec.Command("powershell.exe", "-NoLogo", "-NoProfile", "-NonInteractive", "-Command",
fmt.Sprintf(`$ErrorActionPreference="Stop";try {%s} catch {Write-Host $_;os.Exit(1)}`, cmd)) // #nosec G204
stdout, err := psCmd.Output()
stdout, err := psCmd.CombinedOutput()
Copy link
Contributor

@XinShuYang XinShuYang May 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I tested this powershell function with CombinedOutput support, I got this error: os.Exit : The term ‘os.Exit’ is not recognized as the name of a cmdlet, function, script file, or operable program.
Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:211
+ catch {Write-Host $_;os.Exit(1)}
+ ~~~~~~~
+ CategoryInfo : ObjectNotFound: (os.Exit:String) [], ParentContainsErrorRecordException
+ FullyQualifiedErrorId : CommandNotFoundException
exit status 1

Copy link
Contributor Author

@hongliangl hongliangl May 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which command? CombinedOutput is used to output error info, while Output ignores error info.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I tested a netnatstaticmapping function that includes the runCommand(cmd). When I replaced Output with CombinedOutput, I got this error. I think this issue has always been there, previously we didn't get it because psCmd.Output() ignore this error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got that, then I think it's another issue. We could resolve the issue after merging this. BTW, have you tested the command manually on powershell? If the output of the command in powershell still has the error, maybe we should double check the command.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g.
try { Get-Process} catch { Write-Host $; os.Exit(1) }
pass
try { Get-Process error} catch { Write-Host $
; os.Exit(1) }
Cannot find a process with the name "error". Verify the process name and call the cmdlet again.
os.Exit : The term 'os.Exit' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path
is correct and try again.

My concern is after you enable CombinedOutput in RunCommand function, all error commands may bring this same error log.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean that we need to add some code to ignore the error which was ignored by Output previously?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with either keeping the use of Output in RunCommand or ignoring the error previously, as long as we don't get this error in the agent log.
In the long term, we may need to addressing the os.Exit issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tnqn After syncing with @wenyingd @XinShuYang , the error output is caused by os.Exit(1) since its not powershell command. We can fix it by changing os.Exit(1) to Exit(1)

tnqn
tnqn previously approved these changes May 23, 2023
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, need to confirm @XinShuYang's comment before merging

@tnqn
Copy link
Member

tnqn commented May 23, 2023

Should the description be updated?

@hongliangl
Copy link
Contributor Author

Should the description be updated?

I'll check the description.

@hongliangl
Copy link
Contributor Author

Should the description be updated?

Updated

@hongliangl
Copy link
Contributor Author

/test-windows-all

This PR fixes the following issues:

1. AntreaAgent logs "Failed to sync route" when attempting to sync route entries
  every time.
2. AntreaAgent logs "Failed to install route for Service CIDR" err="failed to
  delete stale Service CIDR route" during startup.

For the first issue, previously, to recover the connected route of antrea-gw0
(assuming the IP address is 10.10.0.1/24) that may have been deleted by mistake,
a route with a destination 10.10.0.1/24 and gateway 10.10.0.1 was periodically
synced. However, this caused an error because an existing active route with the
same destination but a different gateway 0.0.0.0 should have already been
automatically installed when antrea-gw0 was created. To address this issue, this
PR changes the gateway of the recover route from 10.10.0.1 to 0.0.0.0, which
matches the existing installed route. This ensures that the periodic sync will
not cause any errors.

For the second issue, previously, when syncing the second ClusterIP, the stale
route entry installed for the first ClusterIP is added to the stale routes twice.
This results in the error log.

Signed-off-by: Hongliang Liu <lhongliang@vmware.com>
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tnqn
Copy link
Member

tnqn commented May 23, 2023

/test-all
/test-windows-all

@hongliangl
Copy link
Contributor Author

/test-e2e

1 similar comment
@hongliangl
Copy link
Contributor Author

/test-e2e

@hongliangl
Copy link
Contributor Author

/test-windows-proxyall-e2e

1 similar comment
@hongliangl
Copy link
Contributor Author

/test-windows-proxyall-e2e

@hongliangl
Copy link
Contributor Author

I saw that the failed case in windows-proxyall-e2e is the well-known flaky test TestNodePortLocal/testNPLChangePortRangeAgentRestart. Do we still need any tests for this PR? @tnqn

@tnqn
Copy link
Member

tnqn commented May 24, 2023

I saw that the failed case in windows-proxyall-e2e is the well-known flaky test TestNodePortLocal/testNPLChangePortRangeAgentRestart. Do we still need any tests for this PR? @tnqn

It can be ignored

@tnqn tnqn merged commit b16e4e9 into antrea-io:main May 24, 2023
@tnqn
Copy link
Member

tnqn commented May 24, 2023

Aren't the issues introduced in release-1.12? Do them really need backport?

@hongliangl hongliangl deleted the 20230505-windows-log-issue branch May 24, 2023 04:59
@@ -383,7 +394,7 @@ func (c *Client) syncRoute() error {
gwAutoconfRoute := &util.Route{
LinkIndex: c.nodeConfig.GatewayConfig.LinkIndex,
DestinationSubnet: c.nodeConfig.PodIPv4CIDR,
GatewayAddress: c.nodeConfig.GatewayConfig.IPv4,
GatewayAddress: net.IPv4zero,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Holder

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should backport.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's introduced by 6143e78

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I forgot that. Thanks for reminding.

fmt.Sprintf(`$ErrorActionPreference="Stop";try {%s} catch {Write-Host $_;os.Exit(1)}`, cmd)) // #nosec G204
stdout, err := psCmd.Output()
fmt.Sprintf(`$ErrorActionPreference="Stop";try {%s} catch {Write-Host $_;Exit(1)}`, cmd)) // #nosec G204
stdout, err := psCmd.CombinedOutput()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Holder

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

@hongliangl hongliangl removed the action/backport Indicates a PR that requires backports. label May 24, 2023
@wenyingd wenyingd mentioned this pull request May 24, 2023
ceclinux pushed a commit to ceclinux/antrea that referenced this pull request Jun 5, 2023
This PR fixes the following issues:

1. AntreaAgent logs "Failed to sync route" when attempting to sync route entries
  every time.
2. AntreaAgent logs "Failed to install route for Service CIDR" err="failed to
  delete stale Service CIDR route" during startup.

For the first issue, previously, to recover the connected route of antrea-gw0
(assuming the IP address is 10.10.0.1/24) that may have been deleted by mistake,
a route with a destination 10.10.0.1/24 and gateway 10.10.0.1 was periodically
synced. However, this caused an error because an existing active route with the
same destination but a different gateway 0.0.0.0 should have already been
automatically installed when antrea-gw0 was created. To address this issue, this
PR changes the gateway of the recover route from 10.10.0.1 to 0.0.0.0, which
matches the existing installed route. This ensures that the periodic sync will
not cause any errors.

For the second issue, previously, when syncing the second ClusterIP, the stale
route entry installed for the first ClusterIP is added to the stale routes twice.
This results in the error log.

Signed-off-by: Hongliang Liu <lhongliang@vmware.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/OS/windows Issues or PRs related to the Windows operating system. area/proxy Issues or PRs related to proxy functions in Antrea
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants