-
Notifications
You must be signed in to change notification settings - Fork 146
xdp: introduce bulking for page_pool tx return path #272
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Introduce bulking capability in xdp tx return path (XDP_REDIRECT). xdp_return_frame is usually run inside the driver NAPI tx completion loop so it is possible batch it. Current implementation considers only page_pool memory model. Convert mvneta driver to xdp_return_frame_bulk APIs. Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Introduce the capability to batch page_pool ptr_ring refill since it is usually run inside the driver NAPI tx completion loop. Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Convert mvpp2 driver to xdp_return_frame_bulk APIs. XDP_REDIRECT (upstream codepath): 1.79Mpps XDP_REDIRECT (upstream codepath + bulking APIs): 1.93Mpps Tested-by: Matteo Croce <mcroce@microsoft.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Convert mlx5 driver to xdp_return_frame_bulk APIs. XDP_REDIRECT (upstream codepath): 8.5Mpps XDP_REDIRECT (upstream codepath + bulking APIs): 10.1Mpps Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Master branch: 3cb12d2 |
At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=373421 irrelevant now. Closing PR. |
kernel-patches-bot
pushed a commit
that referenced
this pull request
Aug 11, 2021
…CKOPT Add verifier ctx test to call bpf_get_netns_cookie from cgroup/setsockopt. #269/p pass ctx or null check, 1: ctx Did not run the program (not supported) OK #270/p pass ctx or null check, 2: null Did not run the program (not supported) OK #271/p pass ctx or null check, 3: 1 OK #272/p pass ctx or null check, 4: ctx - const OK #273/p pass ctx or null check, 5: null (connect) Did not run the program (not supported) OK #274/p pass ctx or null check, 6: null (bind) Did not run the program (not supported) OK #275/p pass ctx or null check, 7: ctx (bind) Did not run the program (not supported) OK #276/p pass ctx or null check, 8: null (bind) OK #277/p pass ctx or null check, 9: ctx (cgroup/setsockopt) Did not run the program (not supported) OK #278/p pass ctx or null check, 10: null (cgroup/setsockopt) Did not run the program (not supported) OK Signed-off-by: Stanislav Fomichev <sdf@google.com>
kernel-patches-bot
pushed a commit
that referenced
this pull request
Aug 12, 2021
…CKOPT Add verifier ctx test to call bpf_get_netns_cookie from cgroup/setsockopt. #269/p pass ctx or null check, 1: ctx Did not run the program (not supported) OK #270/p pass ctx or null check, 2: null Did not run the program (not supported) OK #271/p pass ctx or null check, 3: 1 OK #272/p pass ctx or null check, 4: ctx - const OK #273/p pass ctx or null check, 5: null (connect) Did not run the program (not supported) OK #274/p pass ctx or null check, 6: null (bind) Did not run the program (not supported) OK #275/p pass ctx or null check, 7: ctx (bind) Did not run the program (not supported) OK #276/p pass ctx or null check, 8: null (bind) OK #277/p pass ctx or null check, 9: ctx (cgroup/setsockopt) Did not run the program (not supported) OK #278/p pass ctx or null check, 10: null (cgroup/setsockopt) Did not run the program (not supported) OK Signed-off-by: Stanislav Fomichev <sdf@google.com>
kernel-patches-bot
pushed a commit
that referenced
this pull request
Aug 12, 2021
…CKOPT Add verifier ctx test to call bpf_get_netns_cookie from cgroup/setsockopt. #269/p pass ctx or null check, 1: ctx Did not run the program (not supported) OK #270/p pass ctx or null check, 2: null Did not run the program (not supported) OK #271/p pass ctx or null check, 3: 1 OK #272/p pass ctx or null check, 4: ctx - const OK #273/p pass ctx or null check, 5: null (connect) Did not run the program (not supported) OK #274/p pass ctx or null check, 6: null (bind) Did not run the program (not supported) OK #275/p pass ctx or null check, 7: ctx (bind) Did not run the program (not supported) OK #276/p pass ctx or null check, 8: null (bind) OK #277/p pass ctx or null check, 9: ctx (cgroup/setsockopt) Did not run the program (not supported) OK #278/p pass ctx or null check, 10: null (cgroup/setsockopt) Did not run the program (not supported) OK Signed-off-by: Stanislav Fomichev <sdf@google.com>
kernel-patches-bot
pushed a commit
that referenced
this pull request
Aug 13, 2021
…CKOPT Add verifier ctx test to call bpf_get_netns_cookie from cgroup/setsockopt. #269/p pass ctx or null check, 1: ctx Did not run the program (not supported) OK #270/p pass ctx or null check, 2: null Did not run the program (not supported) OK #271/p pass ctx or null check, 3: 1 OK #272/p pass ctx or null check, 4: ctx - const OK #273/p pass ctx or null check, 5: null (connect) Did not run the program (not supported) OK #274/p pass ctx or null check, 6: null (bind) Did not run the program (not supported) OK #275/p pass ctx or null check, 7: ctx (bind) Did not run the program (not supported) OK #276/p pass ctx or null check, 8: null (bind) OK #277/p pass ctx or null check, 9: ctx (cgroup/setsockopt) Did not run the program (not supported) OK #278/p pass ctx or null check, 10: null (cgroup/setsockopt) Did not run the program (not supported) OK Signed-off-by: Stanislav Fomichev <sdf@google.com>
kernel-patches-bot
pushed a commit
that referenced
this pull request
Aug 13, 2021
…CKOPT Add verifier ctx test to call bpf_get_netns_cookie from cgroup/setsockopt. #269/p pass ctx or null check, 1: ctx Did not run the program (not supported) OK #270/p pass ctx or null check, 2: null Did not run the program (not supported) OK #271/p pass ctx or null check, 3: 1 OK #272/p pass ctx or null check, 4: ctx - const OK #273/p pass ctx or null check, 5: null (connect) Did not run the program (not supported) OK #274/p pass ctx or null check, 6: null (bind) Did not run the program (not supported) OK #275/p pass ctx or null check, 7: ctx (bind) Did not run the program (not supported) OK #276/p pass ctx or null check, 8: null (bind) OK #277/p pass ctx or null check, 9: ctx (cgroup/setsockopt) Did not run the program (not supported) OK #278/p pass ctx or null check, 10: null (cgroup/setsockopt) Did not run the program (not supported) OK Signed-off-by: Stanislav Fomichev <sdf@google.com>
kernel-patches-daemon-bpf bot
pushed a commit
that referenced
this pull request
Apr 2, 2024
When testing send_signal and stacktrace_build_id_nmi using the riscv sbi pmu driver without the sscofpmf extension or the riscv legacy pmu driver, they encountered failures as follows: test_send_signal_common:FAIL:perf_event_open unexpected perf_event_open: actual -1 < expected 0 #272/3 send_signal/send_signal_nmi:FAIL test_stacktrace_build_id_nmi:FAIL:perf_event_open err -1 errno 95 #304 stacktrace_build_id_nmi:FAIL The reason is that the above pmu driver or hardware does not support sampling events, that is, PERF_PMU_CAP_NO_INTERRUPT is set to pmu capabilities, and then perf_event_open returns EOPNOTSUPP. Since PERF_PMU_CAP_NO_INTERRUPT is not only set in the riscv-related pmu driver, it is better to skip testing when this capability is set. Signed-off-by: Pu Lehui <pulehui@huawei.com>
kernel-patches-daemon-bpf bot
pushed a commit
that referenced
this pull request
Apr 2, 2024
When testing send_signal and stacktrace_build_id_nmi using the riscv sbi pmu driver without the sscofpmf extension or the riscv legacy pmu driver, they encountered failures as follows: test_send_signal_common:FAIL:perf_event_open unexpected perf_event_open: actual -1 < expected 0 #272/3 send_signal/send_signal_nmi:FAIL test_stacktrace_build_id_nmi:FAIL:perf_event_open err -1 errno 95 #304 stacktrace_build_id_nmi:FAIL The reason is that the above pmu driver or hardware does not support sampling events, that is, PERF_PMU_CAP_NO_INTERRUPT is set to pmu capabilities, and then perf_event_open returns EOPNOTSUPP. Since PERF_PMU_CAP_NO_INTERRUPT is not only set in the riscv-related pmu driver, it is better to skip testing when this capability is set. Signed-off-by: Pu Lehui <pulehui@huawei.com>
kernel-patches-daemon-bpf bot
pushed a commit
that referenced
this pull request
Apr 2, 2024
When testing send_signal and stacktrace_build_id_nmi using the riscv sbi pmu driver without the sscofpmf extension or the riscv legacy pmu driver, they encountered failures as follows: test_send_signal_common:FAIL:perf_event_open unexpected perf_event_open: actual -1 < expected 0 #272/3 send_signal/send_signal_nmi:FAIL test_stacktrace_build_id_nmi:FAIL:perf_event_open err -1 errno 95 #304 stacktrace_build_id_nmi:FAIL The reason is that the above pmu driver or hardware does not support sampling events, that is, PERF_PMU_CAP_NO_INTERRUPT is set to pmu capabilities, and then perf_event_open returns EOPNOTSUPP. Since PERF_PMU_CAP_NO_INTERRUPT is not only set in the riscv-related pmu driver, it is better to skip testing when this capability is set. Signed-off-by: Pu Lehui <pulehui@huawei.com>
kernel-patches-daemon-bpf bot
pushed a commit
that referenced
this pull request
Apr 2, 2024
When testing send_signal and stacktrace_build_id_nmi using the riscv sbi pmu driver without the sscofpmf extension or the riscv legacy pmu driver, then failures as follows are encountered: test_send_signal_common:FAIL:perf_event_open unexpected perf_event_open: actual -1 < expected 0 #272/3 send_signal/send_signal_nmi:FAIL test_stacktrace_build_id_nmi:FAIL:perf_event_open err -1 errno 95 #304 stacktrace_build_id_nmi:FAIL The reason is that the above pmu driver or hardware does not support sampling events, that is, PERF_PMU_CAP_NO_INTERRUPT is set to pmu capabilities, and then perf_event_open returns EOPNOTSUPP. Since PERF_PMU_CAP_NO_INTERRUPT is not only set in the riscv-related pmu driver, it is better to skip testing when this capability is set. Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240402073029.1299085-1-pulehui@huaweicloud.com
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
May 9, 2025
In case of UDP peer timeout, an openvpn client (userspace) performs the following actions: 1. receives the peer deletion notification (reason=timeout) 2. closes the socket Upon 1. we have the following: - ovpn_peer_keepalive_work() - ovpn_socket_release() - synchronize_rcu() At this point, 2. gets a chance to complete and ovpn_sock->sock->sk becomes NULL. ovpn_socket_release() will then attempt dereferencing it, resulting in the following crash log: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) Reason for accessing sk is ithat we need to retrieve its protocol and continue the cleanup routine accordingly. Fix the crash by grabbing a reference to sk before proceeding with the cleanup. If the refcounter has reached zero, we know that the socket is being destroyed and thus we skip the cleanup in ovpn_socket_release(). Fixes: 11851cb ("ovpn: implement TCP transport") Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Dumazet <edumazet@google.com> Reported-by: Qingfang Deng <dqfext@gmail.com> Tested-By: Gert Doering <gert@greenie.muc.de> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
May 9, 2025
In case of UDP peer timeout, an openvpn client (userspace) performs the following actions: 1. receives the peer deletion notification (reason=timeout) 2. closes the socket Upon 1. we have the following: - ovpn_peer_keepalive_work() - ovpn_socket_release() - synchronize_rcu() At this point, 2. gets a chance to complete and ovpn_sock->sock->sk becomes NULL. ovpn_socket_release() will then attempt dereferencing it, resulting in the following crash log: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) Reason for accessing sk is ithat we need to retrieve its protocol and continue the cleanup routine accordingly. Fix the crash by grabbing a reference to sk before proceeding with the cleanup. If the refcounter has reached zero, we know that the socket is being destroyed and thus we skip the cleanup in ovpn_socket_release(). Fixes: 11851cb ("ovpn: implement TCP transport") Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Dumazet <edumazet@google.com> Reported-by: Qingfang Deng <dqfext@gmail.com> Tested-By: Gert Doering <gert@greenie.muc.de> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
May 10, 2025
In case of UDP peer timeout, an openvpn client (userspace) performs the following actions: 1. receives the peer deletion notification (reason=timeout) 2. closes the socket Upon 1. we have the following: - ovpn_peer_keepalive_work() - ovpn_socket_release() - synchronize_rcu() At this point, 2. gets a chance to complete and ovpn_sock->sock->sk becomes NULL. ovpn_socket_release() will then attempt dereferencing it, resulting in the following crash log: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) Reason for accessing sk is ithat we need to retrieve its protocol and continue the cleanup routine accordingly. Fix the crash by grabbing a reference to sk before proceeding with the cleanup. If the refcounter has reached zero, we know that the socket is being destroyed and thus we skip the cleanup in ovpn_socket_release(). Fixes: 11851cb ("ovpn: implement TCP transport") Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Dumazet <edumazet@google.com> Reported-by: Qingfang Deng <dqfext@gmail.com> Tested-By: Gert Doering <gert@greenie.muc.de> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
May 10, 2025
In case of UDP peer timeout, an openvpn client (userspace) performs the following actions: 1. receives the peer deletion notification (reason=timeout) 2. closes the socket Upon 1. we have the following: - ovpn_peer_keepalive_work() - ovpn_socket_release() - synchronize_rcu() At this point, 2. gets a chance to complete and ovpn_sock->sock->sk becomes NULL. ovpn_socket_release() will then attempt dereferencing it, resulting in the following crash log: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) Reason for accessing sk is ithat we need to retrieve its protocol and continue the cleanup routine accordingly. Fix the crash by grabbing a reference to sk before proceeding with the cleanup. If the refcounter has reached zero, we know that the socket is being destroyed and thus we skip the cleanup in ovpn_socket_release(). Fixes: 11851cb ("ovpn: implement TCP transport") Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Dumazet <edumazet@google.com> Reported-by: Qingfang Deng <dqfext@gmail.com> Tested-By: Gert Doering <gert@greenie.muc.de> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
May 10, 2025
In case of UDP peer timeout, an openvpn client (userspace) performs the following actions: 1. receives the peer deletion notification (reason=timeout) 2. closes the socket Upon 1. we have the following: - ovpn_peer_keepalive_work() - ovpn_socket_release() - synchronize_rcu() At this point, 2. gets a chance to complete and ovpn_sock->sock->sk becomes NULL. ovpn_socket_release() will then attempt dereferencing it, resulting in the following crash log: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) Reason for accessing sk is ithat we need to retrieve its protocol and continue the cleanup routine accordingly. Fix the crash by grabbing a reference to sk before proceeding with the cleanup. If the refcounter has reached zero, we know that the socket is being destroyed and thus we skip the cleanup in ovpn_socket_release(). Fixes: 11851cb ("ovpn: implement TCP transport") Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Dumazet <edumazet@google.com> Reported-by: Qingfang Deng <dqfext@gmail.com> Tested-By: Gert Doering <gert@greenie.muc.de> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 1, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 1, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 2, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 2, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 2, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 2, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 2, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 2, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 2, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 2, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 3, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 3, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 3, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 3, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 3, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 3, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 3, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 4, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 4, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 4, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 4, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 4, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 4, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 4, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 4, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 5, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 5, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 5, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 5, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Signed-off-by: NipaLocal <nipa@local>
kuba-moo
pushed a commit
to linux-netdev/testing-bpf-ci
that referenced
this pull request
Jun 5, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [kernel-patches#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 kernel-patches#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN/ovpn-net-next#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Pull request for series with
subject: xdp: introduce bulking for page_pool tx return path
version: 1
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=371737