Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Antrea Agent sometimes (rarely) fails to query supported OVS datapath features #6471

Closed
antoninbas opened this issue Jun 20, 2024 · 1 comment · Fixed by #6472
Closed

Antrea Agent sometimes (rarely) fails to query supported OVS datapath features #6471

antoninbas opened this issue Jun 20, 2024 · 1 comment · Fixed by #6472
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@antoninbas
Copy link
Contributor

Describe the bug
On 2 separate occasions, I have observed the antrea-agent container crashing shortly after starting.
In both cases, this was on a new Kind cluster, and Antrea was being installed for the first time in the cluster.
The contents of the logs showed that the Agent had failed to query the supported datapath features from OVS, and hence had exited early with an error:

I0618 18:39:11.300984       1 log_file.go:93] Set log file max size to 104857600
I0618 18:39:11.303689       1 options.go:504] "The Egress feature gate is enabled, but it won't work because it is only applicable to the encap mode"
I0618 18:39:11.304221       1 agent.go:105] "Starting Antrea agent" version="v2.0.0"
I0618 18:39:11.304266       1 client.go:89] No kubeconfig file was specified. Falling back to in-cluster config
I0618 18:39:11.308585       1 prometheus.go:189] Initializing prometheus metrics
I0618 18:39:11.315568       1 ovs_client.go:70] Connecting to OVSDB at address /var/run/openvswitch/db.sock
I0618 18:39:12.323252       1 ovs_client.go:89] Not connected yet, will try again in 2s
I0618 18:39:12.344775       1 agent.go:377] Setting up node network
I0618 18:39:12.353225       1 discoverer.go:82] Starting ServiceCIDRDiscoverer
I0618 18:39:12.456571       1 agent.go:966] "Updating Node MAC annotation"
I0618 18:39:12.649350       1 agent.go:991] "Got Interface MTU" MTU=1500
I0618 18:39:12.658499       1 ovs_client.go:137] Created bridge: 640294ff-f1c4-4816-93f7-72c3fb730dc9
F0618 18:39:13.014321       1 main.go:53] Error running agent: error initializing agent: error listing DP features: ExecError: exit status 2, output: "dpif/show-dp-features" is not a valid command (use "list-commands" to see a list of valid commands)
ovs-appctl: /var/run/openvswitch/ovs-vswitchd.98.ctl: server returned an error
goroutine 1 [running]:
k8s.io/klog/v2/internal/dbg.Stacks(0x0)
	k8s.io/klog/v2@v2.110.1/internal/dbg/dbg.go:35 +0x85
k8s.io/klog/v2.(*loggingT).output(0x4953b00, 0x3, 0x0, 0xc000512af0, 0x1, {0x39b5a60?, 0x1?}, 0x10?, 0x0)
	k8s.io/klog/v2@v2.110.1/klog.go:957 +0x6fa
k8s.io/klog/v2.(*loggingT).printfDepth(0x0?, 0x0?, 0x0, {0x0, 0x0}, 0x0?, {0x2c88231, 0x17}, {0xc0007d6060, 0x1, ...})
	k8s.io/klog/v2@v2.110.1/klog.go:750 +0x1dd
k8s.io/klog/v2.(*loggingT).printf(...)
	k8s.io/klog/v2@v2.110.1/klog.go:727
k8s.io/klog/v2.Fatalf(...)
	k8s.io/klog/v2@v2.110.1/klog.go:1661
main.newAgentCommand.func1(0xc000278300?, {0xc000682e40, 0x0, 0x6})
	antrea.io/antrea/cmd/antrea-agent/main.go:53 +0x2d1
github.com/spf13/cobra.(*Command).execute(0xc000616000, {0xc000146010, 0x6, 0x6})
	github.com/spf13/cobra@v1.8.0/command.go:987 +0xaa3
github.com/spf13/cobra.(*Command).ExecuteC(0xc000616000)
	github.com/spf13/cobra@v1.8.0/command.go:1115 +0x3ff
github.com/spf13/cobra.(*Command).Execute(...)
	github.com/spf13/cobra@v1.8.0/command.go:1039
main.main()
	antrea.io/antrea/cmd/antrea-agent/main.go:32 +0x18

After the automatic restart, the Agent was running correctly as expected. So this is not a major issue in any way, just a small inconvenience.

To Reproduce
It doesn't happen very often, but this was observed during a standard install:

kind create cluster --config ci/kind/config-3nodes.yml
helm install -n kube-system antrea antrea/antrea

You may notice that the logs above show that the Agent was installed in noEncap mode. However, I have observed this same issue once in encap mode and once in noEncap mode, during 2 separate installations.

Versions:
Antrea v2.0.0

Additional context
The matching ovs-vswitchd logs:

root@kind-worker2:/# cat /var/log/antrea/openvswitch/ovs-vswitchd.log
2024-06-18T18:39:12.870Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2024-06-18T18:39:12.965Z|00002|ovs_numa|INFO|Discovered 8 CPU cores on NUMA node 0
2024-06-18T18:39:12.965Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 8 CPU cores
2024-06-18T18:39:12.972Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2024-06-18T18:39:12.972Z|00005|netlink_socket|INFO|netlink: could not enable listening to all nsid (Operation not permitted)
2024-06-18T18:39:12.973Z|00006|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2024-06-18T18:39:13.163Z|00007|dpif_netlink|INFO|Datapath dispatch mode: per-cpu
2024-06-18T18:39:13.166Z|00008|ofproto_dpif|INFO|system@ovs-system: Datapath supports recirculation
2024-06-18T18:39:13.166Z|00009|ofproto_dpif|INFO|system@ovs-system: VLAN header stack length probed as 2
2024-06-18T18:39:13.166Z|00010|ofproto_dpif|INFO|system@ovs-system: MPLS label stack length probed as 3
2024-06-18T18:39:13.167Z|00011|ofproto_dpif|INFO|system@ovs-system: Datapath supports truncate action
2024-06-18T18:39:13.167Z|00012|ofproto_dpif|INFO|system@ovs-system: Datapath supports unique flow ids
2024-06-18T18:39:13.167Z|00013|ofproto_dpif|INFO|system@ovs-system: Datapath supports clone action
2024-06-18T18:39:13.167Z|00014|ofproto_dpif|INFO|system@ovs-system: Max sample nesting level probed as 10
2024-06-18T18:39:13.167Z|00015|ofproto_dpif|INFO|system@ovs-system: Datapath supports eventmask in conntrack action
2024-06-18T18:39:13.168Z|00016|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_clear action
2024-06-18T18:39:13.168Z|00017|ofproto_dpif|INFO|system@ovs-system: Max dp_hash algorithm probed to be 1
2024-06-18T18:39:13.168Z|00018|ofproto_dpif|INFO|system@ovs-system: Datapath supports check_pkt_len action
2024-06-18T18:39:13.168Z|00019|ofproto_dpif|INFO|system@ovs-system: Datapath supports timeout policy in conntrack action
2024-06-18T18:39:13.168Z|00020|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_zero_snat
2024-06-18T18:39:13.168Z|00021|ofproto_dpif|INFO|system@ovs-system: Datapath supports add_mpls action
2024-06-18T18:39:13.168Z|00022|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_state
2024-06-18T18:39:13.169Z|00023|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_zone
2024-06-18T18:39:13.169Z|00024|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_mark
2024-06-18T18:39:13.171Z|00025|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_label
2024-06-18T18:39:13.171Z|00026|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_state_nat
2024-06-18T18:39:13.172Z|00027|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_orig_tuple
2024-06-18T18:39:13.172Z|00028|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_orig_tuple6
2024-06-18T18:39:13.172Z|00029|ofproto_dpif|INFO|system@ovs-system: Datapath does not support IPv6 ND Extensions
2024-06-18T18:39:13.195Z|00030|bridge|INFO|bridge br-int: using datapath ID 0000fe9402641648
2024-06-18T18:39:13.195Z|00031|connmgr|INFO|br-int: added service controller "punix:/var/run/openvswitch/br-int.mgmt"
2024-06-18T18:39:13.209Z|00002|vlog(monitor)|INFO|closing log file
2024-06-18T18:39:13.209Z|00032|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.17.7
2024-06-18T18:39:16.072Z|00033|bridge|INFO|bridge br-int: added interface antrea-gw0 on port 32769
2024-06-18T18:39:18.042Z|00034|ofproto_dpif_upcall|INFO|Overriding n-handler-threads to 8, setting n-revalidator-threads to 3
2024-06-18T18:39:18.043Z|00035|ofproto_dpif_upcall|INFO|Starting 11 threads
2024-06-18T18:39:23.040Z|00036|memory|INFO|116644 kB peak resident set size after 10.2 seconds
2024-06-18T18:39:23.041Z|00037|memory|INFO|handlers:8 idl-cells:96 ofconns:1 ports:1 revalidators:3 rules:92
2024-06-18T18:39:27.010Z|00038|bridge|INFO|bridge br-int: added interface coredns--d86282 on port 1
2024-06-18T18:39:27.022Z|00001|ofproto_dpif_xlate(handler8)|INFO|/proc/sys/net/core/netdev_max_backlog: open failed (No such file or directory)
2024-06-18T18:39:27.054Z|00039|bridge|INFO|bridge br-int: added interface coredns--49f73e on port 2
2024-06-18T18:39:27.054Z|00040|bridge|INFO|bridge br-int: added interface local-pa-0db489 on port 3
2024-06-18T18:39:27.055Z|00041|bridge|INFO|bridge br-int: added interface antrea-t-e136b1 on port 4
2024-06-18T18:39:27.318Z|00042|connmgr|INFO|br-int<->unix#2: 118 flow_mods in the 9 s starting 10 s ago (113 adds, 1 deletes, 4 modifications)
2024-06-18T18:40:27.304Z|00043|connmgr|INFO|br-int<->unix#2: 14 flow_mods in the 19 s starting 48 s ago (14 adds)
@antoninbas antoninbas added the kind/bug Categorizes issue or PR as related to a bug. label Jun 20, 2024
antoninbas added a commit to antoninbas/antrea that referenced this issue Jun 20, 2024
We wait (for a maximum of 5s) for the datapath_id of the br-int OVS
bridge to be reported in OVSDB, after creating the bridge and before
checking supported datapath features. This prevents errors when querying
the supported features before the ofproto-dpif provider has been
initialized.

Fixes antrea-io#6471

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
@antoninbas
Copy link
Contributor Author

Some more details here:

As you can see from the logs, the CLI call to check supported DP features (error is reported at 18:39:13.014321) happened before the ofproto-dpif provider started checking for supported features (2024-06-18T18:39:13.166Z). Obviously this is an issue in itself. However, the error reported by the Agent indicates that while the CLI server has been started, the CLI command we are invoking has not been registered yet.

This would indicate that the init factory function for ofproto-dpif has not been called yet: https://github.com/openvswitch/ovs/blob/f15de6508ce2ef60bb1a9d5baf0bf4a393a70d37/ofproto/ofproto-dpif.c#L267
I believe that indicates that bridge_init_ofproto has not been called yet: https://github.com/openvswitch/ovs/blob/f15de6508ce2ef60bb1a9d5baf0bf4a393a70d37/vswitchd/bridge.c#L3301-L3305
One possible explanation for this is that ovs-vswitchd is not done retrieving OVSDB contents: https://github.com/openvswitch/ovs/blob/f15de6508ce2ef60bb1a9d5baf0bf4a393a70d37/vswitchd/bridge.c#L3287-L3291

I believe that the solution proposed in #6472 will resolve the issue: it waits for the bridge (br-int) datapath ID to become available in OVSDB before querying supported datapath features.
Here is why:

I think that's a better solution that checking the OVS datapath features in a loop until it succeeds, even though that's debatable :P

antoninbas added a commit to antoninbas/antrea that referenced this issue Jun 20, 2024
We wait (for a maximum of 5s) for the datapath_id of the br-int OVS
bridge to be reported in OVSDB, after creating the bridge and before
checking supported datapath features. This prevents errors when querying
the supported features before the ofproto-dpif provider has been
initialized.

Fixes antrea-io#6471

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
@antoninbas antoninbas self-assigned this Jun 21, 2024
@antoninbas antoninbas added this to the Antrea v2.1 release milestone Jun 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant