-
Notifications
You must be signed in to change notification settings - Fork 3.2k
Description
So, the initial observation was "the community VPN had intermediate network issues, and after a few ping timeout -> restart events, the openvpn client crashed with DCO key setup errors".
Trying to reproduce this, I have at least something that is not working right, and is easy to reproduce.
Start with a generic client config with persist-tun
client
dev tun
proto udp
remote ubuntu2004.ov.greenie.net 51194
resolv-retry infinite
auth-retry none
nobind
persist-key
persist-tun
remote-cert-tls server
verb 6
# ca, key, cert as needed
reneg-sec 300
connect, see it working, and on server timeout (kill $CN on the server management interface, which kills the instance hard without(!) EEN), events are properly delivered. I have instrumented dco_event_set() and kernel if_ovpn to see what is going on:
Dec 30 10:30:14 fbsd14 openvpn[11834]: Data Channel: cipher 'AES-256-GCM', peer-id: 1
Dec 30 10:30:14 fbsd14 openvpn[11834]: Timers: ping 10, ping-restart 30
Dec 30 10:30:14 fbsd14 openvpn[11834]: Protocol options: protocol-flags cc-exit tls-ekm dyn-tls-crypt
Dec 30 10:30:14 fbsd14 openvpn[11834]: dco_event_set
Dec 30 10:30:14 fbsd14 openvpn[11834]: dco_event_set, pending=0
Dec 30 10:30:14 fbsd14 openvpn[11834]: UDPv6 WRITE [34] to [AF_INET6]2001:608:1:995a:250:56ff:febb:2084:51194: P_ACK_V1 kid=0 [ 4 3 2 1 ] DATA len=0
Dec 30 10:30:14 fbsd14 openvpn[11834]: dco_event_set
Dec 30 10:30:14 fbsd14 openvpn[11834]: dco_event_set, pending=0
Dec 30 10:30:24 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:30:29 fbsd14 openvpn[11834]: dco_event_set
Dec 30 10:30:29 fbsd14 openvpn[11834]: dco_event_set, pending=0
Dec 30 10:30:34 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:30:44 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:30:44 fbsd14 openvpn[11834]: dco_event_set
Dec 30 10:30:44 fbsd14 openvpn[11834]: dco_event_set, pending=0
Dec 30 10:30:54 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:31:00 fbsd14 openvpn[11834]: dco_event_set
Dec 30 10:31:00 fbsd14 openvpn[11834]: dco_event_set, pending=0
Dec 30 10:31:04 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:31:14 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:31:15 fbsd14 openvpn[11834]: dco_event_set
Dec 30 10:31:15 fbsd14 openvpn[11834]: dco_event_set, pending=0
Dec 30 10:31:24 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:31:30 fbsd14 openvpn[11834]: dco_event_set
Dec 30 10:31:30 fbsd14 openvpn[11834]: dco_event_set, pending=0
Dec 30 10:31:34 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:31:34 fbsd14 kernel: if_ovpn: ovpn_notify_del_peer: NOTIF_DEL_PEER for peer 1, reason=1
Dec 30 10:31:34 fbsd14 openvpn[11834]: UDPv6 READ [0] from [AF_INET6][undef]:51194: DATA UNDEF len=-1
Dec 30 10:31:34 fbsd14 openvpn[11834]: dco_event_set
Dec 30 10:31:34 fbsd14 openvpn[11834]: dco_event_set, pending=1
Dec 30 10:31:34 fbsd14 openvpn[11834]: dco_read_and_process
Dec 30 10:31:34 fbsd14 openvpn[11834]: dco_read_and_process: received NOTIF_DEL_PEER for peer-id=1, reason=0
Dec 30 10:31:34 fbsd14 openvpn[11834]: GERT: process_incoming_dco, dco_message_type=1
Dec 30 10:31:34 fbsd14 openvpn[11834]: process_incoming_dco: received peer expired notification of for peer-id 1
Dec 30 10:31:34 fbsd14 openvpn[11834]: [server] Inactivity timeout (--ping-restart), restarting
Dec 30 10:31:34 fbsd14 openvpn[11834]: register signal: SIGUSR1 (ping-restart)
Dec 30 10:31:34 fbsd14 openvpn[11834]: TCP/UDP: Closing socket
Dec 30 10:31:34 fbsd14 openvpn[11834]: SIGUSR1[soft,ping-restart] received, process restarting
(the different "reason" values are propably remapped in dco_freebsd.c, but that is not of interest here)
After the reconnect, we see...
Dec 30 10:31:36 fbsd14 openvpn[11834]: dco_set_peer: peer-id 1, ping interval 10, ping timeout 30
Dec 30 10:31:36 fbsd14 openvpn[11834]: dco_install_key: peer_id=1 keyid=0 epoch=0, currently 0 keys installed
Dec 30 10:31:36 fbsd14 openvpn[11834]: dco_new_key: slot 0, key-id 0, peer-id 1, cipher AES-256-GCM, epoch 0
Dec 30 10:31:36 fbsd14 openvpn[11834]: Initialization Sequence Completed
Dec 30 10:31:36 fbsd14 openvpn[11834]: Data Channel: cipher 'AES-256-GCM', peer-id: 1
Dec 30 10:31:36 fbsd14 openvpn[11834]: Timers: ping 10, ping-restart 30
Dec 30 10:31:36 fbsd14 openvpn[11834]: Protocol options: protocol-flags cc-exit tls-ekm dyn-tls-crypt
Dec 30 10:31:36 fbsd14 openvpn[11834]: UDPv6 WRITE [34] to [AF_INET6]2001:608:1:995a:250:56ff:febb:2084:51194: P_ACK_V1 kid=0 [ 4 3 2 1 ] DATA len=0
Dec 30 10:31:46 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:31:56 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:32:07 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:32:17 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:32:27 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:32:37 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:32:47 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:32:57 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:33:07 fbsd14 kernel: if_ovpn: ping peer_id 1
note the complete absence of calls to dco_event_set(). Killing the client instance on the server:
Dec 30 10:33:17 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:33:27 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:33:37 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:33:47 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:33:56 fbsd14 openvpn[11834]: UDPv6 READ [0] from [AF_INET6][undef]:51194: DATA UNDEF len=-1
Dec 30 10:33:56 fbsd14 kernel: if_ovpn: ovpn_notify_del_peer: NOTIF_DEL_PEER for peer 1, reason=1
so the kernel side properly ping-timeouts the server and claims to have generated the notification, but all userland got was a read error on the socket... at this point the tun is broken - userland thinks "all is fine" and kernel returns "network unreachable" because the peer is gone:
gert@fbsd14:~ $ ifconfig tun0
tun0: flags=1008043<UP,BROADCAST,RUNNING,MULTICAST,LOWER_UP> metric 0 mtu 1500
options=80000<LINKSTATE>
inet 10.220.2.4 netmask 0xffffff00 broadcast 10.220.2.255
inet6 fd00:abcd:220:2::1002 prefixlen 64
inet6 fe80::be24:11ff:fe19:c3d6%tun0 prefixlen 64 scopeid 0x4
groups: openvpn
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
gert@fbsd14:~ $ ping 10.220.2.1
PING 10.220.2.1 (10.220.2.1): 56 data bytes
ping: sendto: Network is down
5 Minutes later (reneg-sec) we see that kernel and userland are indeed desynced and the server instance is, indeed, gone...
Dec 30 10:36:37 fbsd14 openvpn[11834]: TLS: soft reset sec=301/300 bytes=0/-1 pkts=0/0 aead_limit_send=0/60129542137 aead_limit_recv=0/60129542137
Dec 30 10:36:37 fbsd14 openvpn[11834]: UDPv6 WRITE [54] to [AF_INET6]2001:608:1:995a:250:56ff:febb:2084:51194: P_CONTROL_SOFT_RESET_V1 kid=1 [ ] pid=361 DATA len=40
Dec 30 10:36:39 fbsd14 openvpn[11834]: UDPv6 WRITE [54] to [AF_INET6]2001:608:1:995a:250:56ff:febb:2084:51194: P_CONTROL_SOFT_RESET_V1 kid=1 [ ] pid=617 DATA len=40
Dec 30 10:36:43 fbsd14 openvpn[11834]: UDPv6 WRITE [54] to [AF_INET6]2001:608:1:995a:250:56ff:febb:2084:51194: P_CONTROL_SOFT_RESET_V1 kid=1 [ ] pid=873 DATA len=40
Dec 30 10:36:51 fbsd14 openvpn[11834]: UDPv6 WRITE [54] to [AF_INET6]2001:608:1:995a:250:56ff:febb:2084:51194: P_CONTROL_SOFT_RESET_V1 kid=1 [ ] pid=1129 DATA len=40
Dec 30 10:37:08 fbsd14 openvpn[11834]: UDPv6 WRITE [54] to [AF_INET6]2001:608:1:995a:250:56ff:febb:2084:51194: P_CONTROL_SOFT_RESET_V1 kid=1 [ ] pid=1385 DATA len=40
Dec 30 10:37:37 fbsd14 openvpn[11834]: TLS Error: TLS key negotiation failed to occur within 60 seconds (check your network connectivity)
Dec 30 10:37:37 fbsd14 kernel:
Dec 30 10:37:37 fbsd14 openvpn[11834]: TLS Error: TLS handshake failed
Dec 30 10:37:37 fbsd14 openvpn[11834]: TLS: move_session: dest=TM_LAME_DUCK src=TM_ACTIVE reinit_src=1
...
Dec 30 10:37:37 fbsd14 openvpn[11834]: PUSH: Received control message: 'PUSH_REPLY,route 10.220.0.0 255.255.0.0,route-ipv6 fd00:abcd:220::/48,tun-ipv6,route-gateway 10.220.2.1,topology subnet,ping 10,ping-restart 30,ifconfig-ipv6 fd00:abcd:220:2::1002/64 fd00:abcd:220:2::1,ifconfig 10.220.2.4 255.255.255.0,peer-id 1,cipher AES-256-GCM,protocol-flags cc-exit tls-ekm dyn-tls-crypt,tun-mtu 1500'
Dec 30 10:37:37 fbsd14 openvpn[11834]: dco_del_peer: peer-id 1
Dec 30 10:37:37 fbsd14 openvpn[11834]: Failed to delete peer: No such file or directory (errno=2)
Dec 30 10:37:37 fbsd14 openvpn[11834]: dco_new_peer: peer-id 1, fd 4
at this point the tunnel is working, but then it breaks apart
Dec 30 10:37:37 fbsd14 openvpn[11834]: dco_new_key: slot 1, key-id 0, peer-id 1, cipher AES-256-GCM, epoch 0
Dec 30 10:37:37 fbsd14 openvpn[11834]: Initialization Sequence Completed
Dec 30 10:37:37 fbsd14 openvpn[11834]: Data Channel: cipher 'AES-256-GCM', peer-id: 1
Dec 30 10:37:37 fbsd14 openvpn[11834]: Timers: ping 10, ping-restart 30
Dec 30 10:37:37 fbsd14 openvpn[11834]: Protocol options: protocol-flags cc-exit tls-ekm dyn-tls-crypt
Dec 30 10:37:37 fbsd14 openvpn[11834]: UDPv6 WRITE [38] to [AF_INET6]2001:608:1:995a:250:56ff:febb:2084:51194: P_ACK_V1 kid=0 [ 4 3 2 1 0 ] DATA len=0
Dec 30 10:37:48 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:37:57 fbsd14 kernel: if_ovpn: ping peer_id 1
Dec 30 10:38:07 fbsd14 kernel: if_ovpn: ovpn_notify_del_peer: NOTIF_DEL_PEER for peer 1, reason=1
Dec 30 10:38:07 fbsd14 openvpn[11834]: UDPv6 READ [0] from [AF_INET6][undef]:51194: DATA UNDEF len=-1
Dec 30 10:38:17 fbsd14 openvpn[11834]: UDPv6 READ [40] from [AF_INET6]2001:608:1:995a:250:56ff:febb:2084:51194: P_DATA_V2 kid=0 DATA len=39
Dec 30 10:38:27 fbsd14 openvpn[11834]: UDPv6 READ [40] from [AF_INET6]2001:608:1:995a:250:56ff:febb:2084:51194: P_DATA_V2 kid=0 DATA len=39
Dec 30 10:38:38 fbsd14 openvpn[11834]: Swapping primary and secondary keys to primary-id=0 secondary-id=0
Dec 30 10:38:38 fbsd14 openvpn[11834]: dco_swap_keys: peer-id 1
Dec 30 10:38:38 fbsd14 openvpn[11834]: Failed to swap keys: No such file or directory (errno=2)
Dec 30 10:38:38 fbsd14 openvpn[11834]: Cannot swap keys: Operation not permitted (-1)
Dec 30 10:38:38 fbsd14 openvpn[11834]: register signal: SIGUSR1 (dco update keys error)
Dec 30 10:38:38 fbsd14 openvpn[11834]: dco_del_peer: peer-id 1
Dec 30 10:38:38 fbsd14 openvpn[11834]: Failed to delete peer: No such file or directory (errno=2)
Dec 30 10:38:38 fbsd14 openvpn[11834]: TCP/UDP: Closing socket
Dec 30 10:38:38 fbsd14 openvpn[11834]: SIGUSR1[soft,dco update keys error] received, process restarting
Dec 30 10:38:38 fbsd14 openvpn[11834]: signal_reset: signal SIGUSR1 is cleared
Dec 30 10:38:38 fbsd14 openvpn[11834]: Restart pause, 2 second(s)
(interesting enough, my test instance just restarts, while my community VPN instance FATAL'ed at this point - which I can't reproduce right now)
Even after the restart, we still are not polling events...
So, I think we have two issues here
Major:
- after a restart/reconnect, the cyclic calling of
dco_event_set()is gone, so notification delivery breaks, things desynchronize, and do weird "it does not work until I restart openvpn, with nothing in the log!" things
Minor:
- the whole "we only check every 15 seconds for kernel-side notifications, and then we queue a pipefd write so we can listen for file descriptor events from DCO and act on them" is not really a good hack. I think we should change this to do a timer-based callout (coarse timer handler?) which will poll-and-process the events, and do away with the pipe stuff.
I will proceed to bisect this, and see where we broke this... assigning to Giaan, Ralf ("you understands event") and me ("FreeBSD")...