Skip to content

freebsd (and linux) dco: event handler mess #947

@cron2

Description

@cron2

So, the initial observation was "the community VPN had intermediate network issues, and after a few ping timeout -> restart events, the openvpn client crashed with DCO key setup errors".

Trying to reproduce this, I have at least something that is not working right, and is easy to reproduce.

Start with a generic client config with persist-tun

client
dev tun
proto udp
remote ubuntu2004.ov.greenie.net 51194
resolv-retry infinite
auth-retry none
nobind
persist-key
persist-tun
remote-cert-tls server
verb 6
# ca, key, cert as needed
reneg-sec 300

connect, see it working, and on server timeout (kill $CN on the server management interface, which kills the instance hard without(!) EEN), events are properly delivered. I have instrumented dco_event_set() and kernel if_ovpn to see what is going on:

Dec 30 10:30:14 fbsd14 openvpn[11834]: Data Channel: cipher 'AES-256-GCM', peer-id: 1
Dec 30 10:30:14 fbsd14 openvpn[11834]: Timers: ping 10, ping-restart 30
Dec 30 10:30:14 fbsd14 openvpn[11834]: Protocol options: protocol-flags cc-exit tls-ekm dyn-tls-crypt
Dec 30 10:30:14 fbsd14 openvpn[11834]: dco_event_set
Dec 30 10:30:14 fbsd14 openvpn[11834]: dco_event_set, pending=0
Dec 30 10:30:14 fbsd14 openvpn[11834]: UDPv6 WRITE [34] to [AF_INET6]2001:608:1:995a:250:56ff:febb:2084:51194: P_ACK_V1 kid=0 [ 4 3 2 1 ] DATA len=0
Dec 30 10:30:14 fbsd14 openvpn[11834]: dco_event_set
Dec 30 10:30:14 fbsd14 openvpn[11834]: dco_event_set, pending=0
Dec 30 10:30:24 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:30:29 fbsd14 openvpn[11834]: dco_event_set
Dec 30 10:30:29 fbsd14 openvpn[11834]: dco_event_set, pending=0
Dec 30 10:30:34 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:30:44 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:30:44 fbsd14 openvpn[11834]: dco_event_set
Dec 30 10:30:44 fbsd14 openvpn[11834]: dco_event_set, pending=0
Dec 30 10:30:54 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:31:00 fbsd14 openvpn[11834]: dco_event_set
Dec 30 10:31:00 fbsd14 openvpn[11834]: dco_event_set, pending=0
Dec 30 10:31:04 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:31:14 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:31:15 fbsd14 openvpn[11834]: dco_event_set
Dec 30 10:31:15 fbsd14 openvpn[11834]: dco_event_set, pending=0
Dec 30 10:31:24 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:31:30 fbsd14 openvpn[11834]: dco_event_set
Dec 30 10:31:30 fbsd14 openvpn[11834]: dco_event_set, pending=0
Dec 30 10:31:34 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:31:34 fbsd14 kernel: if_ovpn: ovpn_notify_del_peer: NOTIF_DEL_PEER for peer 1, reason=1
Dec 30 10:31:34 fbsd14 openvpn[11834]: UDPv6 READ [0] from [AF_INET6][undef]:51194: DATA UNDEF len=-1
Dec 30 10:31:34 fbsd14 openvpn[11834]: dco_event_set
Dec 30 10:31:34 fbsd14 openvpn[11834]: dco_event_set, pending=1
Dec 30 10:31:34 fbsd14 openvpn[11834]: dco_read_and_process
Dec 30 10:31:34 fbsd14 openvpn[11834]: dco_read_and_process: received NOTIF_DEL_PEER for peer-id=1, reason=0
Dec 30 10:31:34 fbsd14 openvpn[11834]: GERT: process_incoming_dco, dco_message_type=1
Dec 30 10:31:34 fbsd14 openvpn[11834]: process_incoming_dco: received peer expired notification of for peer-id 1
Dec 30 10:31:34 fbsd14 openvpn[11834]: [server] Inactivity timeout (--ping-restart), restarting
Dec 30 10:31:34 fbsd14 openvpn[11834]: register signal: SIGUSR1 (ping-restart)
Dec 30 10:31:34 fbsd14 openvpn[11834]: TCP/UDP: Closing socket
Dec 30 10:31:34 fbsd14 openvpn[11834]: SIGUSR1[soft,ping-restart] received, process restarting

(the different "reason" values are propably remapped in dco_freebsd.c, but that is not of interest here)

After the reconnect, we see...

Dec 30 10:31:36 fbsd14 openvpn[11834]: dco_set_peer: peer-id 1, ping interval 10, ping timeout 30
Dec 30 10:31:36 fbsd14 openvpn[11834]: dco_install_key: peer_id=1 keyid=0 epoch=0, currently 0 keys installed
Dec 30 10:31:36 fbsd14 openvpn[11834]: dco_new_key: slot 0, key-id 0, peer-id 1, cipher AES-256-GCM, epoch 0
Dec 30 10:31:36 fbsd14 openvpn[11834]: Initialization Sequence Completed
Dec 30 10:31:36 fbsd14 openvpn[11834]: Data Channel: cipher 'AES-256-GCM', peer-id: 1
Dec 30 10:31:36 fbsd14 openvpn[11834]: Timers: ping 10, ping-restart 30
Dec 30 10:31:36 fbsd14 openvpn[11834]: Protocol options: protocol-flags cc-exit tls-ekm dyn-tls-crypt
Dec 30 10:31:36 fbsd14 openvpn[11834]: UDPv6 WRITE [34] to [AF_INET6]2001:608:1:995a:250:56ff:febb:2084:51194: P_ACK_V1 kid=0 [ 4 3 2 1 ] DATA len=0
Dec 30 10:31:46 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:31:56 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:32:07 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:32:17 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:32:27 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:32:37 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:32:47 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:32:57 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:33:07 fbsd14 kernel: if_ovpn: ping peer_id 1

note the complete absence of calls to dco_event_set(). Killing the client instance on the server:

Dec 30 10:33:17 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:33:27 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:33:37 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:33:47 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:33:56 fbsd14 openvpn[11834]: UDPv6 READ [0] from [AF_INET6][undef]:51194: DATA UNDEF len=-1
Dec 30 10:33:56 fbsd14 kernel: if_ovpn: ovpn_notify_del_peer: NOTIF_DEL_PEER for peer 1, reason=1

so the kernel side properly ping-timeouts the server and claims to have generated the notification, but all userland got was a read error on the socket... at this point the tun is broken - userland thinks "all is fine" and kernel returns "network unreachable" because the peer is gone:

gert@fbsd14:~ $ ifconfig tun0
tun0: flags=1008043<UP,BROADCAST,RUNNING,MULTICAST,LOWER_UP> metric 0 mtu 1500
        options=80000<LINKSTATE>
        inet 10.220.2.4 netmask 0xffffff00 broadcast 10.220.2.255
        inet6 fd00:abcd:220:2::1002 prefixlen 64
        inet6 fe80::be24:11ff:fe19:c3d6%tun0 prefixlen 64 scopeid 0x4
        groups: openvpn
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
gert@fbsd14:~ $ ping 10.220.2.1
PING 10.220.2.1 (10.220.2.1): 56 data bytes
ping: sendto: Network is down

5 Minutes later (reneg-sec) we see that kernel and userland are indeed desynced and the server instance is, indeed, gone...

Dec 30 10:36:37 fbsd14 openvpn[11834]: TLS: soft reset sec=301/300 bytes=0/-1 pkts=0/0 aead_limit_send=0/60129542137 aead_limit_recv=0/60129542137
Dec 30 10:36:37 fbsd14 openvpn[11834]: UDPv6 WRITE [54] to [AF_INET6]2001:608:1:995a:250:56ff:febb:2084:51194: P_CONTROL_SOFT_RESET_V1 kid=1 [ ] pid=361 DATA len=40
Dec 30 10:36:39 fbsd14 openvpn[11834]: UDPv6 WRITE [54] to [AF_INET6]2001:608:1:995a:250:56ff:febb:2084:51194: P_CONTROL_SOFT_RESET_V1 kid=1 [ ] pid=617 DATA len=40
Dec 30 10:36:43 fbsd14 openvpn[11834]: UDPv6 WRITE [54] to [AF_INET6]2001:608:1:995a:250:56ff:febb:2084:51194: P_CONTROL_SOFT_RESET_V1 kid=1 [ ] pid=873 DATA len=40
Dec 30 10:36:51 fbsd14 openvpn[11834]: UDPv6 WRITE [54] to [AF_INET6]2001:608:1:995a:250:56ff:febb:2084:51194: P_CONTROL_SOFT_RESET_V1 kid=1 [ ] pid=1129 DATA len=40
Dec 30 10:37:08 fbsd14 openvpn[11834]: UDPv6 WRITE [54] to [AF_INET6]2001:608:1:995a:250:56ff:febb:2084:51194: P_CONTROL_SOFT_RESET_V1 kid=1 [ ] pid=1385 DATA len=40
Dec 30 10:37:37 fbsd14 openvpn[11834]: TLS Error: TLS key negotiation failed to occur within 60 seconds (check your network connectivity)
Dec 30 10:37:37 fbsd14 kernel: 
Dec 30 10:37:37 fbsd14 openvpn[11834]: TLS Error: TLS handshake failed
Dec 30 10:37:37 fbsd14 openvpn[11834]: TLS: move_session: dest=TM_LAME_DUCK src=TM_ACTIVE reinit_src=1
...
Dec 30 10:37:37 fbsd14 openvpn[11834]: PUSH: Received control message: 'PUSH_REPLY,route 10.220.0.0 255.255.0.0,route-ipv6 fd00:abcd:220::/48,tun-ipv6,route-gateway 10.220.2.1,topology subnet,ping 10,ping-restart 30,ifconfig-ipv6 fd00:abcd:220:2::1002/64 fd00:abcd:220:2::1,ifconfig 10.220.2.4 255.255.255.0,peer-id 1,cipher AES-256-GCM,protocol-flags cc-exit tls-ekm dyn-tls-crypt,tun-mtu 1500'
Dec 30 10:37:37 fbsd14 openvpn[11834]: dco_del_peer: peer-id 1
Dec 30 10:37:37 fbsd14 openvpn[11834]: Failed to delete peer: No such file or directory (errno=2)
Dec 30 10:37:37 fbsd14 openvpn[11834]: dco_new_peer: peer-id 1, fd 4

at this point the tunnel is working, but then it breaks apart

Dec 30 10:37:37 fbsd14 openvpn[11834]: dco_new_key: slot 1, key-id 0, peer-id 1, cipher AES-256-GCM, epoch 0
Dec 30 10:37:37 fbsd14 openvpn[11834]: Initialization Sequence Completed
Dec 30 10:37:37 fbsd14 openvpn[11834]: Data Channel: cipher 'AES-256-GCM', peer-id: 1
Dec 30 10:37:37 fbsd14 openvpn[11834]: Timers: ping 10, ping-restart 30
Dec 30 10:37:37 fbsd14 openvpn[11834]: Protocol options: protocol-flags cc-exit tls-ekm dyn-tls-crypt
Dec 30 10:37:37 fbsd14 openvpn[11834]: UDPv6 WRITE [38] to [AF_INET6]2001:608:1:995a:250:56ff:febb:2084:51194: P_ACK_V1 kid=0 [ 4 3 2 1 0 ] DATA len=0
Dec 30 10:37:48 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:37:57 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:38:07 fbsd14 kernel: if_ovpn: ovpn_notify_del_peer: NOTIF_DEL_PEER for peer 1, reason=1
Dec 30 10:38:07 fbsd14 openvpn[11834]: UDPv6 READ [0] from [AF_INET6][undef]:51194: DATA UNDEF len=-1
Dec 30 10:38:17 fbsd14 openvpn[11834]: UDPv6 READ [40] from [AF_INET6]2001:608:1:995a:250:56ff:febb:2084:51194: P_DATA_V2 kid=0 DATA len=39
Dec 30 10:38:27 fbsd14 openvpn[11834]: UDPv6 READ [40] from [AF_INET6]2001:608:1:995a:250:56ff:febb:2084:51194: P_DATA_V2 kid=0 DATA len=39
Dec 30 10:38:38 fbsd14 openvpn[11834]: Swapping primary and secondary keys to primary-id=0 secondary-id=0
Dec 30 10:38:38 fbsd14 openvpn[11834]: dco_swap_keys: peer-id 1
Dec 30 10:38:38 fbsd14 openvpn[11834]: Failed to swap keys: No such file or directory (errno=2)
Dec 30 10:38:38 fbsd14 openvpn[11834]: Cannot swap keys: Operation not permitted (-1)
Dec 30 10:38:38 fbsd14 openvpn[11834]: register signal: SIGUSR1 (dco update keys error)
Dec 30 10:38:38 fbsd14 openvpn[11834]: dco_del_peer: peer-id 1
Dec 30 10:38:38 fbsd14 openvpn[11834]: Failed to delete peer: No such file or directory (errno=2)
Dec 30 10:38:38 fbsd14 openvpn[11834]: TCP/UDP: Closing socket
Dec 30 10:38:38 fbsd14 openvpn[11834]: SIGUSR1[soft,dco update keys error] received, process restarting
Dec 30 10:38:38 fbsd14 openvpn[11834]: signal_reset: signal SIGUSR1 is cleared
Dec 30 10:38:38 fbsd14 openvpn[11834]: Restart pause, 2 second(s)

(interesting enough, my test instance just restarts, while my community VPN instance FATAL'ed at this point - which I can't reproduce right now)

Even after the restart, we still are not polling events...

So, I think we have two issues here

Major:

  • after a restart/reconnect, the cyclic calling of dco_event_set() is gone, so notification delivery breaks, things desynchronize, and do weird "it does not work until I restart openvpn, with nothing in the log!" things

Minor:

  • the whole "we only check every 15 seconds for kernel-side notifications, and then we queue a pipefd write so we can listen for file descriptor events from DCO and act on them" is not really a good hack. I think we should change this to do a timer-based callout (coarse timer handler?) which will poll-and-process the events, and do away with the pipe stuff.

I will proceed to bisect this, and see where we broke this... assigning to Giaan, Ralf ("you understands event") and me ("FreeBSD")...

Metadata

Metadata

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions