Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix tunnel interface name issue #2486

Merged
merged 1 commit into from
Aug 6, 2021

Conversation

luolanzone
Copy link
Contributor

the purpose of this commit is to fix a tunnel interface issue founded during some IPSec PoC verification:

  1. when the node name is like 'lan-k8s-0-0', the IPsec tunnel interface name will be
    like '-k8s-0-2-e8dbe6', then it will failed to run command like
    ipsec up '-k8s-0-2-e8dbe6' inside a Antrea agent with error /usr/lib/ipsec/stroke: invalid option -- 'k'
    due to the first char is '-', ipsec command interrupted it as an
    option. so changed the generateInterfaceName method to use strings.TrimLeft() to remove '-' in left.

  2. createIPSecTunnelPort method can't handle the case when tunnel
    interface name changed when there is cache matched for the node. it
    will reuse the existing tunnel name without creating a new one with new
    name which means any change in generateInterfaceName won't take
    affect. so refine it with more checks.

and also add some unit test cases.

Signed-off-by: Lan Luo luola@vmware.com

@codecov-commenter
Copy link

codecov-commenter commented Jul 28, 2021

Codecov Report

Merging #2486 (e35417e) into main (42077cb) will decrease coverage by 14.66%.
The diff coverage is 5.55%.

Impacted file tree graph

@@             Coverage Diff             @@
##             main    #2486       +/-   ##
===========================================
- Coverage   61.70%   47.04%   -14.67%     
===========================================
  Files         276      278        +2     
  Lines       21317    21364       +47     
===========================================
- Hits        13154    10051     -3103     
- Misses       6782     9985     +3203     
+ Partials     1381     1328       -53     
Flag Coverage Δ
kind-e2e-tests 47.04% <5.55%> (-5.88%) ⬇️
unit-tests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...gent/controller/noderoute/node_route_controller.go 40.13% <0.00%> (-6.95%) ⬇️
pkg/agent/util/net.go 51.28% <50.00%> (ø)
pkg/agent/controller/egress/id_allocator.go 0.00% <0.00%> (-100.00%) ⬇️
pkg/controller/networkpolicy/endpoint_querier.go 2.85% <0.00%> (-88.58%) ⬇️
pkg/controller/egress/controller.go 0.00% <0.00%> (-83.88%) ⬇️
pkg/agent/util/iptables/lock.go 0.00% <0.00%> (-81.82%) ⬇️
pkg/controller/networkpolicy/crd_utils.go 10.84% <0.00%> (-80.73%) ⬇️
pkg/apis/controlplane/v1beta1/conversion.go 1.20% <0.00%> (-79.81%) ⬇️
pkg/controller/networkpolicy/clustergroup.go 3.27% <0.00%> (-79.44%) ⬇️
pkg/cni/client.go 0.00% <0.00%> (-75.52%) ⬇️
... and 117 more

Comment on lines 540 to 543
if ok && (interfaceConfig.InterfaceName != portName ||
interfaceConfig.PSK != c.networkConfig.IPSecPSK ||
!interfaceConfig.RemoteIP.Equal(nodeIP) ||
interfaceConfig.TunnelInterfaceConfig.Type != c.networkConfig.TunnelType) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems to be duplicated from removeStaleTunnelPorts, can we define a separate function for tunnel config comparison?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, will do.

Comment on lines 547 to 544
} else {
stalePortRemoved = true
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious as to why we don't just return an error here and let the Node be requeued by the controller?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I will leave it to reconcile

pkg/agent/util/net.go Show resolved Hide resolved
@luolanzone luolanzone force-pushed the tunnel-bug-fix branch 2 times, most recently from 17c50ac to 0e978e4 Compare July 29, 2021 02:56
luolanzone added a commit to luolanzone/antrea that referenced this pull request Jul 29, 2021
this PR is based on antrea-io#2486 and I verified all tunnel modes with
IPSec in K8s Cluster, it all works fine now, so I remove the limitation
on our docs and the check in the code.

Signed-off-by: Lan Luo <luola@vmware.com>
luolanzone added a commit to luolanzone/antrea that referenced this pull request Jul 29, 2021
this PR is based on antrea-io#2486 and I verified all tunnel modes with
IPSec in K8s Cluster, it all works fine now, so I remove the limitation
on our docs and the check in the code.

Signed-off-by: Lan Luo <luola@vmware.com>
luolanzone added a commit to luolanzone/antrea that referenced this pull request Jul 29, 2021
this PR is based on antrea-io#2486 and I verified all tunnel modes with
IPSec in K8s Cluster, it all works fine now, so I remove the limitation
on our docs and the check in the code.

Signed-off-by: Lan Luo <luola@vmware.com>
Copy link
Contributor

@abhiraut abhiraut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nits

if !c.validateInterfaceConfig(interfaceConfig, nodeIP, portName) {
return 0, fmt.Errorf("IPSec tunnel interface config mismatched with cached one,stale IPSec tunnel port %s", interfaceConfig.InterfaceName)
} else {
klog.V(2).Infof("find cached IPSec tunnel interface %s with port %d for Node %s", interfaceConfig.InterfaceName, interfaceConfig.OFPort, nodeName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
klog.V(2).Infof("find cached IPSec tunnel interface %s with port %d for Node %s", interfaceConfig.InterfaceName, interfaceConfig.OFPort, nodeName)
klog.V(2).Infof("Found cached IPSec tunnel interface %s with port %d for Node %s", interfaceConfig.InterfaceName, interfaceConfig.OFPort, nodeName)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

return interfaceConfig.OFPort, nil
if !c.validateInterfaceConfig(interfaceConfig, nodeIP, portName) {
return 0, fmt.Errorf("IPSec tunnel interface config mismatched with cached one,stale IPSec tunnel port %s", interfaceConfig.InterfaceName)
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else block is not needed since you return in the if statement

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

pkg/agent/util/net.go Show resolved Hide resolved
abhiraut
abhiraut previously approved these changes Jul 30, 2021
@luolanzone
Copy link
Contributor Author

@antoninbas @abhiraut Could you help to take a look again? thanks!

@@ -258,6 +256,14 @@ func (c *Controller) removeStaleTunnelPorts() error {
return nil
}

func (c *Controller) validateInterfaceConfig(interfaceConfig *interfacestore.InterfaceConfig,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/validateInterfaceConfig/compareInterfaceConfig

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

interfaceConfig, ok := c.interfaceStore.GetNodeTunnelInterface(nodeName)
// check if Node IP, PSK, or tunnel type changes. This can
// happen if removeStaleTunnelPorts fails to remove a "stale"
// tunnel port for which the configuration has changed, return error to requeue the node.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/node/Node

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// happen if removeStaleTunnelPorts fails to remove a "stale"
// tunnel port for which the configuration has changed.
if !c.validateInterfaceConfig(interfaceConfig, nodeIP, portName) {
return 0, fmt.Errorf("IPSec tunnel interface config mismatched with cached one, stale IPSec tunnel port %s", interfaceConfig.InterfaceName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/mismatched with cached one/doesn't match cached one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if !c.validateInterfaceConfig(interfaceConfig, nodeIP, portName) {
return 0, fmt.Errorf("IPSec tunnel interface config mismatched with cached one, stale IPSec tunnel port %s", interfaceConfig.InterfaceName)
}
klog.V(2).Infof("Found cached IPSec tunnel interface %s with port %d for Node %s", interfaceConfig.InterfaceName, interfaceConfig.OFPort, nodeName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use structured logging: klog.V(2).InfoS

@@ -23,6 +23,7 @@ import (
"github.com/containernetworking/plugins/pkg/ip"
"github.com/golang/mock/gomock"
"github.com/stretchr/testify/assert"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't typically split these into 2 separate import groups IIRC

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -38,10 +39,14 @@ func generateInterfaceName(key string, name string, useHead bool) string {
interfaceKey := hex.EncodeToString(hash.Sum(nil))
prefix := name
if len(name) > interfacePrefixLength {
// We use Node/Pod name to generate the interface name,
// valid chars for Node/Pod name are ASCII letters from a to z,
// the digits from 0 to 9, and the hyphen (-), hyphen (-) is the only char
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be a new sentence:

Hyphen (-) is the only char which will impact command-line interpretation if the interface name starts with one, so we remove it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated it.

Comment on lines 237 to 270
c, closeFn := newController(t, &config.NetworkConfig{
TrafficEncapMode: 0,
TunnelType: ovsconfig.TunnelType("vxlan"),
EnableIPSecTunnel: true,
IPSecPSK: "changeme",
})

c.interfaceStore.AddInterface(&interfacestore.InterfaceConfig{
Type: interfacestore.TunnelInterface,
InterfaceName: util.GenerateNodeTunnelInterfaceName("xyz-k8s-0-1"),
TunnelInterfaceConfig: &interfacestore.TunnelInterfaceConfig{
NodeName: "xyz-k8s-0-1",
Type: ovsconfig.TunnelType("vxlan"),
PSK: "mismatchpsk",
RemoteIP: nodeIP1,
},
OVSPortConfig: &interfacestore.OVSPortConfig{
PortUUID: "123",
},
})

defer closeFn()
defer c.queue.ShutDown()
stopCh := make(chan struct{})
defer close(stopCh)
c.informerFactory.Start(stopCh)
c.informerFactory.WaitForCacheSync(stopCh)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe unify this set up code in a separate function:

func setup(ifaces ...*interfacestore.InterfaceConfig) cleanupFn {
    ...
}

then in the test:

cleanup := setup(ifaceConfig)
defer cleanup()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 285 to 299
tests := []struct {
name string
wantErr bool
}{
{
name: "good path with IPSec enabled",
wantErr: false,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
if err := c.removeStaleTunnelPorts(); (err != nil) != tt.wantErr {
t.Errorf("Controller.removeStaleTunnelPorts() error = %v, wantErr %v", err, tt.wantErr)
}
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you think there will be more tests added in the future? maybe since you only have one test right now there is no need to use a table pattern

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed table pattern

@luolanzone luolanzone force-pushed the tunnel-bug-fix branch 2 times, most recently from 16a97ce to 39e711d Compare August 4, 2021 03:38
Comment on lines 373 to 380
if (err != nil) != tt.wantErr {
t.Errorf("Controller.createIPSecTunnelPort() error = %v, wantErr %v", err, tt.wantErr)
return
}
if got != tt.want {
t.Errorf("Controller.createIPSecTunnelPort() got %v,want = %v", got, tt.want)
return
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use testify assertions for these tests like we do in other Antrea unit tests? assert assertions will continue the test on error, while require assertions will abort the test on error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

want int32
}{
{
name: "good path",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"good path" is not a great name, maybe "create new port"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

the purpose of this commit is to fix a tunnel interface issue
founded during some IPSec PoC verification:

1. when the node name is like 'lan-k8s-0-0', the IPsec tunnel interface name will be
like '-k8s-0-2-e8dbe6', then it will failed to run command like
`ipsec up '-k8s-0-2-e8dbe6'` with error `/usr/lib/ipsec/stroke: invalid option -- 'k'`
due to the first char is '-', ipsec command interrupted it as an
option. so changed the `generateInterfaceName` method to use `strings.TrimLeft()` to remove '-' in left.

2. `createIPSecTunnelPort` method can't handle the case when tunnel
interface name changed when there is cache matched for the node. it
will reuse the existing tunnel name without creating a new one with new
name which means any change in `generateInterfaceName` won't take
affect.

and also add some unit test cases.

Signed-off-by: Lan Luo <luola@vmware.com>
@antoninbas
Copy link
Contributor

/test-all
/test-ipv6-e2e
/test-ipv6-only-e2e

@luolanzone
Copy link
Contributor Author

/test-ipv6-only-e2e

@antoninbas antoninbas merged commit 09fc82f into antrea-io:main Aug 6, 2021
@antoninbas
Copy link
Contributor

@luolanzone please cherry-pick to release-1.2

luolanzone added a commit to luolanzone/antrea that referenced this pull request Aug 9, 2021
this PR is based on antrea-io#2486 and I verified all tunnel modes with
IPSec in K8s Cluster, it all works fine now, so I remove the limitation
on our docs and the check in the code.

Signed-off-by: Lan Luo <luola@vmware.com>
luolanzone added a commit to luolanzone/antrea that referenced this pull request Aug 9, 2021
this PR is based on antrea-io#2486 and I verified all tunnel modes with
IPSec in K8s Cluster, it all works fine now, so I remove the limitation
on our docs and the check in the code.

Signed-off-by: Lan Luo <luola@vmware.com>
@luolanzone luolanzone deleted the tunnel-bug-fix branch August 9, 2021 02:32
luolanzone added a commit to luolanzone/antrea that referenced this pull request Aug 10, 2021
this PR is based on antrea-io#2486 and I verified all tunnel modes with
IPSec in K8s Cluster, it all works fine now, so I remove the limitation
on our docs and the check in the code.

Signed-off-by: Lan Luo <luola@vmware.com>
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for late review, I notice this change via another PR and have a question.

interfaceConfig, ok := c.interfaceStore.GetNodeTunnelInterface(nodeName)
// check if Node IP, PSK, or tunnel type changes. This can
// happen if removeStaleTunnelPorts fails to remove a "stale"
// tunnel port for which the configuration has changed, return error to requeue the Node.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you removed the TODO, and according to the commit message, I suppose this is fixed. However, how requeuing the Node helps? I think the next attemp will still get here and requeue again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tnqn, I think the TODO is highlighting it doesn't check the configuration change here but return directly. it's an assumption that it's a temporary issue if last remove action failed in removeStaleTunnelPorts, so re-queue will help because removeStaleTunnelPorts also checks configuration difference. but it indeed will go back here if failed again, is there any case in your mind port can't be removed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding was that we would call c.ovsBridgeClient.DeletePort and c.interfaceStore.DeleteInterface here, and in case of error during port removal, return an error and let the Node be requeued. removeStaleTunnelPorts is only called on controller start up and transient errors are always possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the issue is not really fixed. @luolanzone would you have a follow-up PR to fix it? and maybe have a test covering this scenario.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I will fix this.

annakhm pushed a commit to annakhm/antrea that referenced this pull request Aug 16, 2021
The purpose of this commit is to fix a tunnel interface issue
founded during some IPSec PoC verification:

1. when the node name is like 'lan-k8s-0-0', the IPsec tunnel interface name will be
like '-k8s-0-2-e8dbe6', then it will failed to run command like
`ipsec up '-k8s-0-2-e8dbe6'` with error `/usr/lib/ipsec/stroke: invalid option -- 'k'`
due to the first char is '-', ipsec command interrupted it as an
option. so changed the `generateInterfaceName` method to use `strings.TrimLeft()` to remove '-' in left.

2. `createIPSecTunnelPort` method can't handle the case when tunnel
interface name changed when there is cache matched for the node. it
will reuse the existing tunnel name without creating a new one with new
name which means any change in `generateInterfaceName` won't take
affect.

and also add some unit test cases.

Signed-off-by: Lan Luo <luola@vmware.com>
luolanzone added a commit to luolanzone/antrea that referenced this pull request Aug 17, 2021
this PR is based on antrea-io#2486 and I verified all tunnel modes with
IPSec in K8s Cluster, it all works fine now, so I remove the limitation
on our docs and the check in the code.

Signed-off-by: Lan Luo <luola@vmware.com>
luolanzone added a commit to luolanzone/antrea that referenced this pull request Aug 18, 2021
this PR is based on antrea-io#2486 and I verified all tunnel modes with
IPSec in K8s Cluster, it all works fine now, so I remove the limitation
on our docs and the check in the code.

Signed-off-by: Lan Luo <luola@vmware.com>
luolanzone added a commit to luolanzone/antrea that referenced this pull request Aug 18, 2021
this PR is based on antrea-io#2486 and I verified all tunnel modes with
IPSec in K8s Cluster, it all works fine now, so I remove the limitation
on our docs and the check in the code.

Signed-off-by: Lan Luo <luola@vmware.com>
luolanzone added a commit to luolanzone/antrea that referenced this pull request Aug 19, 2021
this PR is based on antrea-io#2486 and I verified all tunnel modes with
IPSec in K8s Cluster, it all works fine now, so I remove the limitation
on our docs and the check in the code.

Signed-off-by: Lan Luo <luola@vmware.com>
luolanzone added a commit to luolanzone/antrea that referenced this pull request Sep 13, 2021
this PR is based on antrea-io#2486 and I verified all tunnel modes with
IPSec in K8s Cluster, it all works fine now, so I remove the limitation
on our docs and the check in the code.

Signed-off-by: Lan Luo <luola@vmware.com>
luolanzone added a commit to luolanzone/antrea that referenced this pull request Sep 16, 2021
this PR is based on antrea-io#2486 and I verified all tunnel modes with
IPSec in K8s Cluster, it all works fine now, so I remove the limitation
on our docs and the check in the code.

Signed-off-by: Lan Luo <luola@vmware.com>
tnqn pushed a commit that referenced this pull request Sep 16, 2021
this PR is based on #2486 and I verified all tunnel modes with
IPSec in K8s Cluster, it all works fine now, so I remove the limitation
on our docs and the check in the code.

Signed-off-by: Lan Luo <luola@vmware.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants