DCO Netlink Message Mix-up on 2.6 (Fixed in 2.7 with OpenVPN#793) #883

@stefan-baranoff

Describe the bug
We are seeing one of the behaviors described in #793 as a possible additional symptom of the same "netlink message async receive mixup". With OpenVPN 2.6.15 (the dsommers RPM) running DCO on Alma 9.6, when clients disconnect from the server there is a chance that the stats receive path intercepts the netlink messages related to the disconnect, so user space never processes the disconnect properly. As a result, disconnect scripts are never invoked, the corresponding log messages are never written, and user space still considers the client connected.

We see kernel DCO logs indicating that the peer was deleted on the kernel side, but user space keeps the connection around indefinitely. A netlink error (-4): Try again is also reported, likely because the stats path does a send/recv that receives all pending netlink messages, while poll has already flagged the netlink socket as readable for the disconnect, so that read happens as well. By then there is no message left to read and -NLE_AGAIN is returned. Below are OpenVPN logs at verb 4 interleaved with kernel logs (tail on both) while 5 UDP clients disconnect and their keepalive pings then time out:

...
2025-10-30 13:30:09 us=577611 device-116/147.136.246.4:45347 Data Channel: cipher 'AES-256-GCM', peer-id: 0
2025-10-30 13:30:09 us=577684 device-116/147.136.246.4:45347 Timers: ping 2, ping-restart 20
2025-10-30 13:30:09 us=577703 device-116/147.136.246.4:45347 Protocol options: explicit-exit-notify 1, protocol-flags cc-exit tls-ekm dyn-tls-crypt
2025-10-30 13:30:10 us=151461 device-117/147.136.246.4:26209 Data Channel: cipher 'AES-256-GCM', peer-id: 0
2025-10-30 13:30:10 us=151537 device-117/147.136.246.4:26209 Timers: ping 2, ping-restart 20
2025-10-30 13:30:10 us=151552 device-117/147.136.246.4:26209 Protocol options: explicit-exit-notify 1, protocol-flags cc-exit tls-ekm dyn-tls-crypt
2025-10-30 13:30:33 us=953242 dco_do_read: netlink reports error (-4): Try again
Oct 30 13:30:33 tc3-vpn-0003 kernel: tun15: deleting peer with id 3, reason 2
2025-10-30 13:30:34 us=976885 device-115/147.136.246.4:19379 SIGTERM[soft,ovpn-dco: ping expired] received, client-instance exiting
Oct 30 13:30:34 tc3-vpn-0003 kernel: tun15: deleting peer with id 4, reason 2
2025-10-30 13:30:35 us=489924 device-114/147.136.246.4:54055 SIGTERM[soft,ovpn-dco: ping expired] received, client-instance exiting
Oct 30 13:30:35 tc3-vpn-0003 kernel: tun15: deleting peer with id 6, reason 2
Oct 30 13:30:35 tc3-vpn-0003 kernel: tun15: deleting peer with id 0, reason 2
2025-10-30 13:30:36 us=932 device-117/147.136.246.4:26209 SIGTERM[soft,ovpn-dco: ping expired] received, client-instance exiting
Oct 30 13:30:36 tc3-vpn-0003 kernel: tun15: deleting peer with id 2, reason 2

There are only 3 ping expired notices but 5 peers deleted, and the status file still shows 2 connected devices that will never be cleaned up. With more verbose logging it is clear that dco_do_read is called right before the Try again message but fails. Adding loops to retry the read or increasing message buffer sizes did not resolve the issue. Extra logging shows the stats gathering receive callback taking unusual error paths that produce no log output.
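To make the suspected sequence concrete, here is a toy model of it in Python. This is only a sketch of the hypothesis in this report, not OpenVPN code: the ToyNetlinkSocket class, the handler functions, and the DEL_PEER / GET_PEER labels are all made up for illustration, and the only value taken from the logs is the -4 (NLE_AGAIN) return code.

```python
from collections import deque

NLE_AGAIN = 4  # libnl error code; the log above shows it negated as -4


class ToyNetlinkSocket:
    """Hypothetical stand-in for one netlink socket shared by two readers."""

    def __init__(self):
        self.queue = deque()

    def recv_all(self, handler):
        """Hand every queued message to a single handler."""
        if not self.queue:
            return -NLE_AGAIN  # nothing left to read
        while self.queue:
            handler(self.queue.popleft())
        return 0


def simulate():
    sock = ToyNetlinkSocket()
    connected = {3}  # peers that user space believes are connected
    stats = {}

    def stats_handler(msg):
        # Only understands stats replies; anything else is dropped silently.
        if msg["cmd"] == "GET_PEER":
            stats[msg["peer"]] = msg["bytes"]

    def event_handler(msg):
        # The normal notification path that would clean up the client.
        if msg["cmd"] == "DEL_PEER":
            connected.discard(msg["peer"])

    # 1. The kernel deletes peer 3 and queues a notification. poll marks the
    #    socket readable, but the event loop has not read it yet.
    sock.queue.append({"cmd": "DEL_PEER", "peer": 3})
    # 2. The status timer fires first: the stats request is answered and the
    #    stats receive drains the whole queue, including the notification.
    sock.queue.append({"cmd": "GET_PEER", "peer": 3, "bytes": 1234})
    sock.recv_all(stats_handler)
    # 3. The event loop now acts on the earlier poll result and finds nothing.
    rc = sock.recv_all(event_handler)
    return rc, connected


rc, connected = simulate()
print(rc, connected)
```

In this model the second read returns -4 and peer 3 stays in the connected set, which matches what we observe: a Try again error followed by a client that user space never cleans up.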

To Reproduce

  1. Set up OpenVPN DCO on a Linux server; we have status output every 10s and relatively aggressive keepalives:
local 100.97.0.3
port 1194
proto udp
dev tun15
topology subnet
tun-mtu 1400
dh none
keepalive 2 10
max-clients 511
user openvpn
group openvpn
persist-key
persist-tun
status /var/log/openvpn/status/openvpn-status.log 10
verb 4
explicit-exit-notify 1
server 172.22.0.0 255.255.254.0 nopool
client-config-dir /etc/openvpn/ccd/
push "redirect-gateway def1 bypass-dhcp ipv6"
push "block-ipv6"
push "dhcp-option DNS 8.8.8.8"
push "dhcp-option DNS 1.1.1.1"
push "dhcp-option DNS 8.8.4.4"
push "dhcp-option DNS 1.0.0.1"
ccd-exclusive
management /run/openvpn-server/management.sock unix
<cert>
...
  2. Connect multiple clients simultaneously in a loop (we used docker containers), then disconnect them all at about the same time (we just ran "sleep 5" and then killed the client process)
  3. Watch logs and status files
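For the last step, a small script can tally the mismatch instead of counting by eye. This is a minimal sketch that assumes the merged log looks like the excerpt above; the regular expressions only cover the ping-expired case shown there, and the tally function and its key names are invented for this example.

```python
import re

# Patterns taken from the log excerpt in this report.
KERNEL_DELETE = re.compile(r"kernel: \S+: deleting peer with id (\d+)")
USER_EXIT = re.compile(r"SIGTERM\[soft,ovpn-dco: ping expired\] received")
NETLINK_AGAIN = re.compile(r"netlink reports error \(-4\)")


def tally(lines):
    """Count kernel-side peer deletions against user-space client exits."""
    counts = {"kernel_deletes": 0, "user_exits": 0, "netlink_again": 0}
    for line in lines:
        if KERNEL_DELETE.search(line):
            counts["kernel_deletes"] += 1
        elif USER_EXIT.search(line):
            counts["user_exits"] += 1
        elif NETLINK_AGAIN.search(line):
            counts["netlink_again"] += 1
    # Peers the kernel removed that user space never noticed.
    counts["stuck"] = counts["kernel_deletes"] - counts["user_exits"]
    return counts


SAMPLE = """\
2025-10-30 13:30:33 us=953242 dco_do_read: netlink reports error (-4): Try again
Oct 30 13:30:33 tc3-vpn-0003 kernel: tun15: deleting peer with id 3, reason 2
2025-10-30 13:30:34 us=976885 device-115/147.136.246.4:19379 SIGTERM[soft,ovpn-dco: ping expired] received, client-instance exiting
Oct 30 13:30:34 tc3-vpn-0003 kernel: tun15: deleting peer with id 4, reason 2
2025-10-30 13:30:35 us=489924 device-114/147.136.246.4:54055 SIGTERM[soft,ovpn-dco: ping expired] received, client-instance exiting
Oct 30 13:30:35 tc3-vpn-0003 kernel: tun15: deleting peer with id 6, reason 2
Oct 30 13:30:35 tc3-vpn-0003 kernel: tun15: deleting peer with id 0, reason 2
2025-10-30 13:30:36 us=932 device-117/147.136.246.4:26209 SIGTERM[soft,ovpn-dco: ping expired] received, client-instance exiting
Oct 30 13:30:36 tc3-vpn-0003 kernel: tun15: deleting peer with id 2, reason 2
""".splitlines()

print(tally(SAMPLE))
```

On the excerpt from this report it counts 5 kernel deletions, 3 user-space exits, and 1 Try again error, leaving 2 stuck peers, the same number of stale entries we see in the status file.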

Expected behavior
All devices to be disconnected in user space properly.

Version information

  • OS: Alma 9.6
  • OpenVPN version: 2.6.15-1 (dsommers RPM from Copr)

Additional context
This was briefly discussed on IRC with @cron2, who asked for a ticket and for @ordex to get involved. I will try cherry-picking two commits (a699681 and f353b71) to see whether that produces a compiling, working binary, since f353b71 already resolved this same issue for the 2.7 code base, and will report back on the result.
