loss and delay without reordering causes very slow transfer #6

Closed
matttbe opened this issue Mar 23, 2020 · 15 comments

matttbe commented Mar 23, 2020

When running the mptcp_connect kselftest, we can hit timeouts when losses and delays are high but there is no re-ordering, e.g.:

00:28:22.462 # INFO: Using loss of 0.52% delay 381 ms on ns3eth4
00:28:22.513 # ns1 MPTCP -> ns1 (10.0.1.1:10000      ) MPTCP	(duration   293ms) [ OK ]
00:28:23.075 # ns1 MPTCP -> ns1 (10.0.1.1:10001      ) TCP  	(duration   178ms) [ OK ]
00:28:23.492 # ns1 TCP   -> ns1 (10.0.1.1:10002      ) MPTCP	(duration   166ms) [ OK ]
00:28:23.909 # ns1 MPTCP -> ns1 (dead:beef:1::1:10003) MPTCP	(duration   231ms) [ OK ]
00:28:24.386 # ns1 MPTCP -> ns1 (dead:beef:1::1:10004) TCP  	(duration   203ms) [ OK ]
00:28:24.834 # ns1 TCP   -> ns1 (dead:beef:1::1:10005) MPTCP	(duration   167ms) [ OK ]
00:28:25.253 # ns1 MPTCP -> ns2 (10.0.1.2:10006      ) MPTCP	(duration   454ms) [ OK ]
00:28:25.944 # ns1 MPTCP -> ns2 (dead:beef:1::2:10007) MPTCP	(duration   465ms) [ OK ]
00:28:26.651 # ns1 MPTCP -> ns2 (10.0.2.1:10008      ) MPTCP	(duration   464ms) [ OK ]
00:28:27.363 # ns1 MPTCP -> ns2 (dead:beef:2::1:10009) MPTCP	(duration   476ms) [ OK ]
00:28:28.087 # ns1 MPTCP -> ns3 (10.0.2.2:10010      ) MPTCP	(duration   718ms) [ OK ]
00:28:29.063 # ns1 MPTCP -> ns3 (dead:beef:2::2:10011) MPTCP	(duration   728ms) [ OK ]
00:28:30.038 # ns1 MPTCP -> ns3 (10.0.3.2:10012      ) MPTCP	(duration   674ms) [ OK ]
00:28:30.972 # ns1 MPTCP -> ns3 (dead:beef:3::2:10013) MPTCP	(duration   742ms) [ OK ]
00:28:31.964 # ns1 MPTCP -> ns4 (10.0.3.1:10014      ) MPTCP	(duration 54811ms) [ OK ]
00:29:27.032 # ns1 MPTCP -> ns4 (dead:beef:3::1:10015) MPTCP	(duration 58387ms) [ OK ]
00:30:25.712 # ns2 MPTCP -> ns1 (10.0.1.1:10016      ) MPTCP	(duration   551ms) [ OK ]
00:30:26.542 # ns2 MPTCP -> ns1 (dead:beef:1::1:10017) MPTCP	(duration   590ms) [ OK ]
00:30:27.419 # ns2 MPTCP -> ns3 (10.0.2.2:10018      ) MPTCP	(duration   590ms) [ OK ]
00:30:28.281 # ns2 MPTCP -> ns3 (dead:beef:2::2:10019) MPTCP	(duration   735ms) [ OK ]
00:30:29.290 # ns2 MPTCP -> ns3 (10.0.3.2:10020      ) MPTCP	(duration   662ms) [ OK ]
00:30:30.239 # ns2 MPTCP -> ns3 (dead:beef:3::2:10021) MPTCP	(duration   633ms) [ OK ]
00:30:31.144 # ns2 MPTCP -> ns4 (10.0.3.1:10022      ) MPTCP	(duration 47109ms) [ OK ]
00:31:18.538 # ns2 MPTCP -> ns4 (dead:beef:3::1:10023) MPTCP	(duration 64366ms) [ OK ]
00:32:23.171 # ns3 MPTCP -> ns1 (10.0.1.1:10024      ) MPTCP	(duration   509ms) [ OK ]
00:32:23.925 # ns3 MPTCP -> ns1 (dead:beef:1::1:10025) MPTCP	(duration   564ms) [ OK ]
00:32:24.733 # ns3 MPTCP -> ns2 (10.0.1.2:10026      ) MPTCP	(duration   467ms) [ OK ]
00:32:25.450 # ns3 MPTCP -> ns2 (dead:beef:1::2:10027) MPTCP	(duration   473ms) [ OK ]
00:32:26.178 # ns3 MPTCP -> ns2 (10.0.2.1:10028      ) MPTCP	(duration   456ms) [ OK ]
00:32:31.022 # ns3 MPTCP -> ns2 (dead:beef:2::1:10029) MPTCP	(duration   466ms) [ OK ]
00:32:31.022 # ns3 MPTCP -> ns4 (10.0.3.1:10030      ) MPTCP	(duration 43213ms) [ OK ]
00:33:11.030 # ns3 MPTCP -> ns4 (dead:beef:3::1:10031) MPTCP	(duration 43608ms) [ OK ]
00:33:54.904 # ns4 MPTCP -> ns1 (10.0.1.1:10032      ) MPTCP	(duration 42663ms) [ OK ]
00:34:37.837 # ns4 MPTCP -> ns1 (dead:beef:1::1:10033) MPTCP	(duration 43245ms) [ OK ]
00:35:21.343 # ns4 MPTCP -> ns2 (10.0.1.2:10034      ) MPTCP	./mptcp_connect.sh: line 114:  1124 Terminated              ip netns exec ${listener_ns} ./mptcp_connect -t $timeout -l -p $port -s ${srv_proto} $extra_args $local_addr < "$sin" > "$sout"
00:35:46.159 # ./mptcp_connect.sh: line 114:  1130 Terminated              ip netns exec ${connector_ns} ./mptcp_connect -t $timeout -p $port -s ${cl_proto} $extra_args $connect_addr < "$cin" > "$cout"
00:35:46.413 #
00:35:46.415 not ok 1 selftests: net/mptcp: mptcp_connect.sh # TIMEOUT

We only hit this issue when no re-ordering is added with tc netem.

Note: once this is fixed, it would be good to reduce the default timeout; linked to https://patchwork.ozlabs.org/patch/1196109/
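
For reference, the "loss of 0.52% delay 381 ms on ns3eth4" line in the log above comes from a netem qdisc that adds loss and delay but no reorder parameter. A rough, illustrative sketch of an equivalent manual setup (the namespace, interface name and values are taken from the log or assumed; the exact commands used by mptcp_connect.sh may differ):

  # emulate 0.52% loss and 381 ms delay, without any re-ordering
  ip netns exec ns3 tc qdisc add dev ns3eth4 root netem delay 381ms loss 0.52%

  # the non-problematic configuration additionally re-orders packets, e.g.:
  ip netns exec ns3 tc qdisc change dev ns3eth4 root netem delay 381ms loss 0.52% reorder 25% 50%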

matttbe commented Mar 23, 2020

I don't know whether it is just bad luck, but I got a new timeout again; I hadn't seen one for weeks:

00:09:39.764 # INFO: Using loss of 0.50% delay 294 ms on ns3eth4
00:09:39.812 # ns1 MPTCP -> ns1 (10.0.1.1:10000      ) MPTCP	(duration   257ms) [ OK ]
00:09:40.327 # ns1 MPTCP -> ns1 (10.0.1.1:10001      ) TCP  	(duration   167ms) [ OK ]
00:09:40.739 # ns1 TCP   -> ns1 (10.0.1.1:10002      ) MPTCP	(duration   155ms) [ OK ]
00:09:41.134 # ns1 MPTCP -> ns1 (dead:beef:1::1:10003) MPTCP	(duration   243ms) [ OK ]
00:09:41.627 # ns1 MPTCP -> ns1 (dead:beef:1::1:10004) TCP  	(duration   168ms) [ OK ]
00:09:42.040 # ns1 TCP   -> ns1 (dead:beef:1::1:10005) MPTCP	(duration   154ms) [ OK ]
00:09:42.433 # ns1 MPTCP -> ns2 (10.0.1.2:10006      ) MPTCP	(duration   407ms) [ OK ]
00:09:43.168 # ns1 MPTCP -> ns2 (dead:beef:1::2:10007) MPTCP	(duration   447ms) [ OK ]
00:09:43.840 # ns1 MPTCP -> ns2 (10.0.2.1:10008      ) MPTCP	(duration   410ms) [ OK ]
00:09:44.497 # ns1 MPTCP -> ns2 (dead:beef:2::1:10009) MPTCP	(duration   653ms) [ OK ]
00:09:45.402 # ns1 MPTCP -> ns3 (10.0.2.2:10010      ) MPTCP	(duration   892ms) [ OK ]
00:09:46.545 # ns1 MPTCP -> ns3 (dead:beef:2::2:10011) MPTCP	(duration   886ms) [ OK ]
00:09:47.679 # ns1 MPTCP -> ns3 (10.0.3.2:10012      ) MPTCP	(duration   861ms) [ OK ]
00:09:48.782 # ns1 MPTCP -> ns3 (dead:beef:3::2:10013) MPTCP	(duration   864ms) [ OK ]
00:09:49.895 # ns1 MPTCP -> ns4 (10.0.3.1:10014      ) MPTCP	(duration 33113ms) [ OK ]
00:10:23.260 # ns1 MPTCP -> ns4 (dead:beef:3::1:10015) MPTCP	(duration 33693ms) [ OK ]
00:10:57.195 # ns2 MPTCP -> ns1 (10.0.1.1:10016      ) MPTCP	(duration   419ms) [ OK ]
00:10:57.867 # ns2 MPTCP -> ns1 (dead:beef:1::1:10017) MPTCP	(duration   480ms) [ OK ]
00:10:58.594 # ns2 MPTCP -> ns3 (10.0.2.2:10018      ) MPTCP	(duration   732ms) [ OK ]
00:10:59.573 # ns2 MPTCP -> ns3 (dead:beef:2::2:10019) MPTCP	(duration   991ms) [ OK ]
00:11:00.809 # ns2 MPTCP -> ns3 (10.0.3.2:10020      ) MPTCP	(duration   736ms) [ OK ]
00:11:01.795 # ns2 MPTCP -> ns3 (dead:beef:3::2:10021) MPTCP	(duration   741ms) [ OK ]
00:11:02.784 # ns2 MPTCP -> ns4 (10.0.3.1:10022      ) MPTCP	(duration 33075ms) [ OK ]
00:11:36.105 # ns2 MPTCP -> ns4 (dead:beef:3::1:10023) MPTCP	(duration 33417ms) [ OK ]
00:12:09.776 # ns3 MPTCP -> ns1 (10.0.1.1:10024      ) MPTCP	(duration   659ms) [ OK ]
00:12:10.679 # ns3 MPTCP -> ns1 (dead:beef:1::1:10025) MPTCP	(duration   722ms) [ OK ]
00:12:11.646 # ns3 MPTCP -> ns2 (10.0.1.2:10026      ) MPTCP	(duration   569ms) [ OK ]
00:12:12.454 # ns3 MPTCP -> ns2 (dead:beef:1::2:10027) MPTCP	(duration  1228ms) [ OK ]
00:12:13.925 # ns3 MPTCP -> ns2 (10.0.2.1:10028      ) MPTCP	(duration   775ms) [ OK ]
00:12:14.944 # ns3 MPTCP -> ns2 (dead:beef:2::1:10029) MPTCP	(duration   626ms) [ OK ]
00:12:15.808 # ns3 MPTCP -> ns4 (10.0.3.1:10030      ) MPTCP	(duration 33110ms) [ OK ]
00:12:49.167 # ns3 MPTCP -> ns4 (dead:beef:3::1:10031) MPTCP	(duration 33660ms) [ OK ]
00:13:23.073 # ns4 MPTCP -> ns1 (10.0.1.1:10032      ) MPTCP	(duration 53653ms) [ OK ]
00:14:16.973 # ns4 MPTCP -> ns1 (dead:beef:1::1:10033) MPTCP	(duration 46157ms) [ OK ]
00:15:03.372 # ns4 MPTCP -> ns2 (10.0.1.2:10034      ) MPTCP	(duration 37611ms) [ OK ]
00:15:41.232 # ns4 MPTCP -> ns2 (dead:beef:1::2:10035) MPTCP	(duration 38234ms) [ OK ]
00:16:19.712 # ns4 MPTCP -> ns2 (10.0.2.1:10036      ) MPTCP	(duration 39697ms) [ OK ]
00:16:59.648 # ns4 MPTCP -> ns2 (dead:beef:2::1:10037) MPTCP	./mptcp_connect.sh: line 114:  1162 Terminated              ip netns exec ${listener_ns} ./mptcp_connect -t $timeout -l -p $port -s ${srv_proto} $extra_args $local_addr < "$sin" > "$sout"
00:17:03.832 # ./mptcp_connect.sh: line 114:  1168 Terminated              ip netns exec ${connector_ns} ./mptcp_connect -t $timeout -p $port -s ${cl_proto} $extra_args $connect_addr < "$cin" > "$cout"
00:17:04.079 #
00:17:04.080 not ok 1 selftests: net/mptcp: mptcp_connect.sh # TIMEOUT

matttbe commented Mar 26, 2020

Just to track the frequency: I got a new warning last night:

00:31:58.802 # INFO: Using loss of 0.89% delay 290 ms on ns3eth4
00:31:58.846 # ns1 MPTCP -> ns1 (10.0.1.1:10000      ) MPTCP	(duration   500ms) [ OK ]
00:31:59.629 # ns1 MPTCP -> ns1 (10.0.1.1:10001      ) TCP  	(duration   206ms) [ OK ]
00:32:00.092 # ns1 TCP   -> ns1 (10.0.1.1:10002      ) MPTCP	(duration   196ms) [ OK ]
00:32:00.542 # ns1 MPTCP -> ns1 (dead:beef:1::1:10003) MPTCP	(duration   289ms) [ OK ]
00:32:01.083 # ns1 MPTCP -> ns1 (dead:beef:1::1:10004) TCP  	(duration   223ms) [ OK ]
00:32:01.574 # ns1 TCP   -> ns1 (dead:beef:1::1:10005) MPTCP	(duration   195ms) [ OK ]
00:32:02.031 # ns1 MPTCP -> ns2 (10.0.1.2:10006      ) MPTCP	(duration   541ms) [ OK ]
00:32:02.827 # ns1 MPTCP -> ns2 (dead:beef:1::2:10007) MPTCP	(duration   591ms) [ OK ]
00:32:03.673 # ns1 MPTCP -> ns2 (10.0.2.1:10008      ) MPTCP	(duration   563ms) [ OK ]
00:32:04.499 # ns1 MPTCP -> ns2 (dead:beef:2::1:10009) MPTCP	(duration   585ms) [ OK ]
00:32:05.355 # ns1 MPTCP -> ns3 (10.0.2.2:10010      ) MPTCP	(duration  1070ms) [ OK ]
00:32:06.702 # ns1 MPTCP -> ns3 (dead:beef:2::2:10011) MPTCP	(duration  1480ms) [ OK ]
00:32:08.458 # ns1 MPTCP -> ns3 (10.0.3.2:10012      ) MPTCP	(duration  1041ms) [ OK ]
00:32:09.762 # ns1 MPTCP -> ns3 (dead:beef:3::2:10013) MPTCP	(duration  1185ms) [ OK ]
00:32:11.209 # ns1 MPTCP -> ns4 (10.0.3.1:10014      ) MPTCP	(duration 57810ms) [ OK ]
00:33:09.282 # ns1 MPTCP -> ns4 (dead:beef:3::1:10015) MPTCP	(duration 79353ms) [ OK ]
00:34:28.890 # ns2 MPTCP -> ns1 (10.0.1.1:10016      ) MPTCP	(duration   577ms) [ OK ]
00:34:29.742 # ns2 MPTCP -> ns1 (dead:beef:1::1:10017) MPTCP	(duration   576ms) [ OK ]
00:34:30.581 # ns2 MPTCP -> ns3 (10.0.2.2:10018      ) MPTCP	(duration  1138ms) [ OK ]
00:34:31.989 # ns2 MPTCP -> ns3 (dead:beef:2::2:10019) MPTCP	(duration   992ms) [ OK ]
00:34:33.243 # ns2 MPTCP -> ns3 (10.0.3.2:10020      ) MPTCP	(duration   936ms) [ OK ]
00:34:34.448 # ns2 MPTCP -> ns3 (dead:beef:3::2:10021) MPTCP	(duration  1048ms) [ OK ]
00:34:35.754 # ns2 MPTCP -> ns4 (10.0.3.1:10022      ) MPTCP	(duration 62675ms) [ OK ]
00:35:38.677 # ns2 MPTCP -> ns4 (dead:beef:3::1:10023) MPTCP	(duration 62678ms) [ OK ]
00:36:41.613 # ns3 MPTCP -> ns1 (10.0.1.1:10024      ) MPTCP	(duration  1162ms) [ OK ]
00:36:43.030 # ns3 MPTCP -> ns1 (dead:beef:1::1:10025) MPTCP	(duration  1342ms) [ OK ]
00:36:44.639 # ns3 MPTCP -> ns2 (10.0.1.2:10026      ) MPTCP	(duration  1000ms) [ OK ]
00:36:45.903 # ns3 MPTCP -> ns2 (dead:beef:1::2:10027) MPTCP	(duration  1063ms) [ OK ]
00:36:47.221 # ns3 MPTCP -> ns2 (10.0.2.1:10028      ) MPTCP	(duration  1025ms) [ OK ]
00:36:48.503 # ns3 MPTCP -> ns2 (dead:beef:2::1:10029) MPTCP	(duration  1079ms) [ OK ]
00:36:49.833 # ns3 MPTCP -> ns4 (10.0.3.1:10030      ) MPTCP	(duration 39549ms) [ OK ]
00:37:29.638 # ns3 MPTCP -> ns4 (dead:beef:3::1:10031) MPTCP	(duration 40083ms) [ OK ]
00:38:09.982 # ns4 MPTCP -> ns1 (10.0.1.1:10032      ) MPTCP	(duration 46505ms) [ OK ]
00:38:56.746 # ns4 MPTCP -> ns1 (dead:beef:1::1:10033) MPTCP	./mptcp_connect.sh: line 114:  1112 Terminated              ip netns exec ${listener_ns} ./mptcp_connect -t $timeout -l -p $port -s ${srv_proto} $extra_args $local_addr < "$sin" > "$sout"
00:39:22.289 # ./mptcp_connect.sh: line 114:  1118 Terminated              ip netns exec ${connector_ns} ./mptcp_connect -t $timeout -p $port -s ${cl_proto} $extra_args $connect_addr < "$cin" > "$cout"
00:39:22.562 #
00:39:22.562 not ok 1 selftests: net/mptcp: mptcp_connect.sh # TIMEOUT

jenkins-tessares pushed a commit that referenced this issue Mar 26, 2020
When experimenting with bpf_send_signal() helper in our production
environment (5.2 based), we experienced a deadlock in NMI mode:
   #5 [ffffc9002219f770] queued_spin_lock_slowpath at ffffffff8110be24
   #6 [ffffc9002219f770] _raw_spin_lock_irqsave at ffffffff81a43012
   #7 [ffffc9002219f780] try_to_wake_up at ffffffff810e7ecd
   #8 [ffffc9002219f7e0] signal_wake_up_state at ffffffff810c7b55
   #9 [ffffc9002219f7f0] __send_signal at ffffffff810c8602
  #10 [ffffc9002219f830] do_send_sig_info at ffffffff810ca31a
  #11 [ffffc9002219f868] bpf_send_signal at ffffffff8119d227
  #12 [ffffc9002219f988] bpf_overflow_handler at ffffffff811d4140
  #13 [ffffc9002219f9e0] __perf_event_overflow at ffffffff811d68cf
  #14 [ffffc9002219fa10] perf_swevent_overflow at ffffffff811d6a09
  #15 [ffffc9002219fa38] ___perf_sw_event at ffffffff811e0f47
  #16 [ffffc9002219fc30] __schedule at ffffffff81a3e04d
  #17 [ffffc9002219fc90] schedule at ffffffff81a3e219
  #18 [ffffc9002219fca0] futex_wait_queue_me at ffffffff8113d1b9
  #19 [ffffc9002219fcd8] futex_wait at ffffffff8113e529
  #20 [ffffc9002219fdf0] do_futex at ffffffff8113ffbc
  #21 [ffffc9002219fec0] __x64_sys_futex at ffffffff81140d1c
  #22 [ffffc9002219ff38] do_syscall_64 at ffffffff81002602
  #23 [ffffc9002219ff50] entry_SYSCALL_64_after_hwframe at ffffffff81c00068

The above call stack is actually very similar to an issue
reported by Commit eac9153 ("bpf/stackmap: Fix deadlock with
rq_lock in bpf_get_stack()") by Song Liu. The only difference is the
bpf_send_signal() helper instead of the bpf_get_stack() helper.

The above deadlock is triggered with a perf_sw_event.
Similar to Commit eac9153, the almost identical reproducer below
uses the tracepoint sched/sched_switch so the issue can be caught easily.
  /* stress_test.c */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <pthread.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>

  #define THREAD_COUNT 1000
  char *filename;
  void *worker(void *p)
  {
        void *ptr;
        int fd;
        char *pptr;

        fd = open(filename, O_RDONLY);
        if (fd < 0)
                return NULL;
        while (1) {
                struct timespec ts = {0, 1000 + rand() % 2000};

                ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
                usleep(1);
                if (ptr == MAP_FAILED) {
                        printf("failed to mmap\n");
                        break;
                }
                munmap(ptr, 4096 * 64);
                usleep(1);
                pptr = malloc(1);
                usleep(1);
                pptr[0] = 1;
                usleep(1);
                free(pptr);
                usleep(1);
                nanosleep(&ts, NULL);
        }
        close(fd);
        return NULL;
  }

  int main(int argc, char *argv[])
  {
        void *ptr;
        int i;
        pthread_t threads[THREAD_COUNT];

        if (argc < 2)
                return 0;

        filename = argv[1];

        for (i = 0; i < THREAD_COUNT; i++) {
                if (pthread_create(threads + i, NULL, worker, NULL)) {
                        fprintf(stderr, "Error creating thread\n");
                        return 0;
                }
        }

        for (i = 0; i < THREAD_COUNT; i++)
                pthread_join(threads[i], NULL);
        return 0;
  }
and the following steps:
  1. run `stress_test /bin/ls` in one window
  2. hack bcc trace.py with the following change:
     --- a/tools/trace.py
     +++ b/tools/trace.py
     @@ -513,6 +513,7 @@ BPF_PERF_OUTPUT(%s);
              __data.tgid = __tgid;
              __data.pid = __pid;
              bpf_get_current_comm(&__data.comm, sizeof(__data.comm));
     +        bpf_send_signal(10);
      %s
      %s
              %s.perf_submit(%s, &__data, sizeof(__data));
  3. in a different window run
     ./trace.py -p $(pidof stress_test) t:sched:sched_switch

The deadlock can be reproduced in our production system.

Similar to Song's fix, the fix here is to delay sending the signal if
irqs are disabled, to avoid deadlocks involving rq_lock.
With this change, my above stress test no longer causes a deadlock
in our production system.

I also implemented a scaled-down version of the reproducer in the
selftest (a subsequent commit). With the latest bpf-next,
it complains about the following potential deadlock.
  [   32.832450] -> #1 (&p->pi_lock){-.-.}:
  [   32.833100]        _raw_spin_lock_irqsave+0x44/0x80
  [   32.833696]        task_rq_lock+0x2c/0xa0
  [   32.834182]        task_sched_runtime+0x59/0xd0
  [   32.834721]        thread_group_cputime+0x250/0x270
  [   32.835304]        thread_group_cputime_adjusted+0x2e/0x70
  [   32.835959]        do_task_stat+0x8a7/0xb80
  [   32.836461]        proc_single_show+0x51/0xb0
  ...
  [   32.839512] -> #0 (&(&sighand->siglock)->rlock){....}:
  [   32.840275]        __lock_acquire+0x1358/0x1a20
  [   32.840826]        lock_acquire+0xc7/0x1d0
  [   32.841309]        _raw_spin_lock_irqsave+0x44/0x80
  [   32.841916]        __lock_task_sighand+0x79/0x160
  [   32.842465]        do_send_sig_info+0x35/0x90
  [   32.842977]        bpf_send_signal+0xa/0x10
  [   32.843464]        bpf_prog_bc13ed9e4d3163e3_send_signal_tp_sched+0x465/0x1000
  [   32.844301]        trace_call_bpf+0x115/0x270
  [   32.844809]        perf_trace_run_bpf_submit+0x4a/0xc0
  [   32.845411]        perf_trace_sched_switch+0x10f/0x180
  [   32.846014]        __schedule+0x45d/0x880
  [   32.846483]        schedule+0x5f/0xd0
  ...

  [   32.853148] Chain exists of:
  [   32.853148]   &(&sighand->siglock)->rlock --> &p->pi_lock --> &rq->lock
  [   32.853148]
  [   32.854451]  Possible unsafe locking scenario:
  [   32.854451]
  [   32.855173]        CPU0                    CPU1
  [   32.855745]        ----                    ----
  [   32.856278]   lock(&rq->lock);
  [   32.856671]                                lock(&p->pi_lock);
  [   32.857332]                                lock(&rq->lock);
  [   32.857999]   lock(&(&sighand->siglock)->rlock);

  Deadlock happens on CPU0 when it tries to acquire &sighand->siglock
  but it has been held by CPU1 and CPU1 tries to grab &rq->lock
  and cannot get it.

  This is not exactly the call stack seen in our production environment,
  but the symptom is similar: both locks are acquired with spin_lock_irqsave()
  and both involve rq_lock. The fix to delay sending the signal when irqs
  are disabled also fixed this issue.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200304191104.2796501-1-yhs@fb.com
jenkins-tessares pushed a commit that referenced this issue Mar 27, 2020
Ido Schimmel says:

====================
mlxsw: Offload TC action pedit munge dsfield

Petr says:

The Spectrum switches allow packet prioritization based on DSCP on ingress,
and update of DSCP on egress. This is configured through the DCB APP rules.
For some use cases, assigning a custom DSCP value based on an ACL match is
a better tool. To that end, offload FLOW_ACTION_MANGLE to permit changing
of dsfield as a whole, or DSCP and ECN values in isolation.
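
As a concrete illustration of the kind of rule this enables, a hedged sketch using tc's pedit "ex munge" syntax (the port name and values are made up, not taken from the patches):

  # rewrite the whole dsfield of matching IPv4 packets at ingress
  tc filter add dev swp1 ingress protocol ip flower skip_sw \
      action pedit ex munge ip dsfield set 0x24

  # rewrite only the DSCP bits, leaving the ECN bits untouched
  tc filter add dev swp1 ingress protocol ip flower skip_sw \
      action pedit ex munge ip dsfield set 0x20 retain 0xfc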

After fixing a commentary nit in patch #1, and mlxsw naming in patch #2,
patches #3 and #4 add the offload to mlxsw.

Patch #5 adds a forwarding selftest for pedit dsfield, applicable to SW as
well as HW datapaths. Patch #6 adds a mlxsw-specific test to verify DSCP
rewrite due to DCB APP rules is not performed on pedited packets.

The tests only cover IPv4 dsfield setting. We have tests for IPv6 as well,
but would like to postpone their contribution until the corresponding
iproute patches have been accepted.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
matttbe commented Mar 27, 2020

01:01:20.795 # INFO: Using loss of 0.53% delay 280 ms on ns3eth4
01:01:20.837 # ns1 MPTCP -> ns1 (10.0.1.1:10000      ) MPTCP	(duration  1348ms) [ OK ]
01:01:22.433 # ns1 MPTCP -> ns1 (10.0.1.1:10001      ) TCP  	(duration   272ms) [ OK ]
01:01:22.938 # ns1 TCP   -> ns1 (10.0.1.1:10002      ) MPTCP	(duration   164ms) [ OK ]
01:01:23.347 # ns1 MPTCP -> ns1 (dead:beef:1::1:10003) MPTCP	(duration   463ms) [ OK ]
01:01:24.050 # ns1 MPTCP -> ns1 (dead:beef:1::1:10004) TCP  	(duration   174ms) [ OK ]
01:01:24.469 # ns1 TCP   -> ns1 (dead:beef:1::1:10005) MPTCP	(duration   175ms) [ OK ]
01:01:24.899 # ns1 MPTCP -> ns2 (10.0.1.2:10006      ) MPTCP	(duration   670ms) [ OK ]
01:01:25.810 # ns1 MPTCP -> ns2 (dead:beef:1::2:10007) MPTCP	(duration   511ms) [ OK ]
01:01:26.563 # ns1 MPTCP -> ns2 (10.0.2.1:10008      ) MPTCP	(duration   441ms) [ OK ]
01:01:27.250 # ns1 MPTCP -> ns2 (dead:beef:2::1:10009) MPTCP	(duration   660ms) [ OK ]
01:01:28.146 # ns1 MPTCP -> ns3 (10.0.2.2:10010      ) MPTCP	(duration  1043ms) [ OK ]
01:01:29.428 # ns1 MPTCP -> ns3 (dead:beef:2::2:10011) MPTCP	(duration   888ms) [ OK ]
01:01:30.578 # ns1 MPTCP -> ns3 (10.0.3.2:10012      ) MPTCP	(duration  1182ms) [ OK ]
01:01:32.007 # ns1 MPTCP -> ns3 (dead:beef:3::2:10013) MPTCP	(duration   970ms) [ OK ]
01:01:33.215 # ns1 MPTCP -> ns4 (10.0.3.1:10014      ) MPTCP	(duration 30081ms) [ OK ]
01:02:03.540 # ns1 MPTCP -> ns4 (dead:beef:3::1:10015) MPTCP	(duration 35176ms) [ OK ]
01:02:38.959 # ns2 MPTCP -> ns1 (10.0.1.1:10016      ) MPTCP	(duration  3851ms) [ OK ]
01:02:43.061 # ns2 MPTCP -> ns1 (dead:beef:1::1:10017) MPTCP	(duration   724ms) [ OK ]
01:02:44.025 # ns2 MPTCP -> ns3 (10.0.2.2:10018      ) MPTCP	(duration   975ms) [ OK ]
01:02:45.236 # ns2 MPTCP -> ns3 (dead:beef:2::2:10019) MPTCP	(duration  1165ms) [ OK ]
01:02:46.648 # ns2 MPTCP -> ns3 (10.0.3.2:10020      ) MPTCP	(duration   764ms) [ OK ]
01:02:47.664 # ns2 MPTCP -> ns3 (dead:beef:3::2:10021) MPTCP	(duration   802ms) [ OK ]
01:02:48.702 # ns2 MPTCP -> ns4 (10.0.3.1:10022      ) MPTCP	(duration 33128ms) [ OK ]
01:03:22.085 # ns2 MPTCP -> ns4 (dead:beef:3::1:10023) MPTCP	(duration 33594ms) [ OK ]
01:03:55.924 # ns3 MPTCP -> ns1 (10.0.1.1:10024      ) MPTCP	(duration  1105ms) [ OK ]
01:03:57.280 # ns3 MPTCP -> ns1 (dead:beef:1::1:10025) MPTCP	(duration  1282ms) [ OK ]
01:03:58.802 # ns3 MPTCP -> ns2 (10.0.1.2:10026      ) MPTCP	(duration   574ms) [ OK ]
01:03:59.635 # ns3 MPTCP -> ns2 (dead:beef:1::2:10027) MPTCP	(duration  1112ms) [ OK ]
01:04:00.987 # ns3 MPTCP -> ns2 (10.0.2.1:10028      ) MPTCP	(duration  1001ms) [ OK ]
01:04:02.232 # ns3 MPTCP -> ns2 (dead:beef:2::1:10029) MPTCP	(duration  1129ms) [ OK ]
01:04:03.602 # ns3 MPTCP -> ns4 (10.0.3.1:10030      ) MPTCP	(duration 27538ms) [ OK ]
01:04:31.395 # ns3 MPTCP -> ns4 (dead:beef:3::1:10031) MPTCP	(duration 27845ms) [ OK ]
01:04:59.479 # ns4 MPTCP -> ns1 (10.0.1.1:10032      ) MPTCP	(duration 35166ms) [ OK ]
01:05:34.907 # ns4 MPTCP -> ns1 (dead:beef:1::1:10033) MPTCP	(duration 33306ms) [ OK ]
01:06:08.472 # ns4 MPTCP -> ns2 (10.0.1.2:10034      ) MPTCP	(duration 35082ms) [ OK ]
01:06:43.798 # ns4 MPTCP -> ns2 (dead:beef:1::2:10035) MPTCP	(duration 39561ms) [ OK ]
01:07:23.601 # ns4 MPTCP -> ns2 (10.0.2.1:10036      ) MPTCP	(duration 35211ms) [ OK ]
01:07:59.061 # ns4 MPTCP -> ns2 (dead:beef:2::1:10037) MPTCP	(duration 34521ms) [ OK ]
01:08:33.833 # ns4 MPTCP -> ns3 (10.0.2.2:10038      ) MPTCP	./mptcp_connect.sh: line 114:  1177 Terminated              ip netns exec ${listener_ns} ./mptcp_connect -t $timeout -l -p $port -s ${srv_proto} $extra_args $local_addr < "$sin" > "$sout"
01:08:44.770 # ./mptcp_connect.sh: line 114:  1183 Terminated              ip netns exec ${connector_ns} ./mptcp_connect -t $timeout -p $port -s ${cl_proto} $extra_args $connect_addr < "$cin" > "$cout"
01:08:45.014 #
01:08:45.014 not ok 1 selftests: net/mptcp: mptcp_connect.sh # TIMEOUT

jenkins-tessares pushed a commit that referenced this issue Mar 28, 2020
Ido Schimmel says:

====================
mlxsw: Various static checkers fixes

Jakub told me he gets some warnings with W=1, so I decided to check with
sparse, smatch and coccinelle as well. This patch set fixes all the
issues found. None are actual bugs / regressions, and therefore the set is not
targeted at net.

Patches #1-#2 add missing kernel-doc comments.

Patch #3 removes dead code.

Patch #4 reworks the ACL code to avoid defining a static variable in a
header file.

Patch #5 removes unnecessary conversion to bool that coccinelle warns
about.

Patch #6 avoids false-positive uninitialized symbol errors emitted by
smatch.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
jenkins-tessares pushed a commit that referenced this issue Mar 31, 2020
Ido Schimmel says:

====================
Add packet trap policers support

Background
==========

Devices capable of offloading the kernel's datapath and performing
functions such as bridging and routing must also be able to send (trap)
specific packets to the kernel (i.e., the CPU) for processing.

For example, a device acting as a multicast-aware bridge must be able to
trap IGMP membership reports to the kernel for processing by the bridge
module.

Motivation
==========

In most cases, the underlying device is capable of handling packet rates
that are several orders of magnitude higher compared to those that can
be handled by the CPU.

Therefore, in order to prevent the underlying device from overwhelming
the CPU, devices usually include packet trap policers that are able to
police the trapped packets to rates that can be handled by the CPU.

Proposed solution
=================

This patch set allows capable device drivers to register their supported
packet trap policers with devlink. User space can then tune the
parameters of these policers (currently, rate and burst size) and read
from the device the number of packets that were dropped by the policer,
if supported.

These packet trap policers can then be bound to existing packet trap
groups, which are used to aggregate logically related packet traps. As a
result, trapped packets are policed to rates that can be handled by the
host CPU.

Example usage
=============

Instantiate netdevsim:

Dump available packet trap policers:
netdevsim/netdevsim10:
  policer 1 rate 1000 burst 128
  policer 2 rate 2000 burst 256
  policer 3 rate 3000 burst 512

Change the parameters of a packet trap policer:

Bind a packet trap policer to a packet trap group:

Dump parameters and statistics of a packet trap policer:
netdevsim/netdevsim10:
  policer 3 rate 100 burst 16
    stats:
        rx:
          dropped 92

Unbind a packet trap policer from a packet trap group:
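
The shell commands were stripped from the quoted cover letter above; as a hedged reconstruction, the devlink invocations for those steps typically look like this (the netdevsim bus ID and the trap group name are assumptions):

  # instantiate netdevsim (device id 10 with 1 port)
  echo "10 1" > /sys/bus/netdevsim/new_device

  # dump available packet trap policers
  devlink trap policer show

  # change the parameters of a packet trap policer
  devlink trap policer set netdevsim/netdevsim10 policer 3 rate 100 burst 16

  # bind a packet trap policer to a packet trap group
  devlink trap group set netdevsim/netdevsim10 group l2_drops policer 3

  # dump parameters and statistics of a packet trap policer
  devlink -s trap policer show netdevsim/netdevsim10 policer 3

  # unbind a packet trap policer from a packet trap group
  devlink trap group set netdevsim/netdevsim10 group l2_drops nopolicer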

Patch set overview
==================

Patch #1 adds the core infrastructure in devlink which allows capable
device drivers to register their supported packet trap policers with
devlink.

Patch #2 extends the existing devlink-trap documentation.

Patch #3 extends netdevsim to register a few dummy packet trap policers
with devlink. Used later on to selftest the core infrastructure.

Patches #4-#5 add infrastructure in devlink to allow binding of packet
trap policers to packet trap groups.

Patch #6 extends netdevsim to allow such binding.

Patch #7 adds a selftest over netdevsim that verifies the core
devlink-trap policers functionality.

Patches #8-#14 gradually add devlink-trap policers support in mlxsw.

Patch #15 adds a selftest over mlxsw. All registered packet trap
policers are verified to handle the configured rate and burst size.

Future plans
============

* Allow changing default association between packet traps and packet
  trap groups
* Add more packet traps. For example, for control packets (e.g., IGMP)

v3:
* Rebase

v2 (address comments from Jiri and Jakub):
* Patch #1: Add 'strict_start_type' in devlink policy
* Patch #1: Have device drivers provide max/min rate/burst size for each
  policer. Use them to check validity of user provided parameters
* Patch #3: Remove check about burst size being a power of 2 and instead
  add a debugfs knob to fail the operation
* Patch #3: Provide max/min rate/burst size when registering policers
  and remove the validity checks from nsim_dev_devlink_trap_policer_set()
* Patch #5: Check for presence of 'DEVLINK_ATTR_TRAP_POLICER_ID' in
  devlink_trap_group_set() and bail if not present
* Patch #5: Add extack error message in case trap group was partially
  modified
* Patch #7: Add test case with new 'fail_trap_policer_set' knob
* Patch #7: Add test case for partially modified trap group
* Patch #10: Provide max/min rate/burst size when registering policers
* Patch #11: Remove the max/min validity checks from
  __mlxsw_sp_trap_policer_set()
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
matttbe commented Apr 1, 2020

00:33:55.761 # INFO: Using loss of 0.73% delay 194 ms on ns3eth4
00:33:55.805 # ns1 MPTCP -> ns1 (10.0.1.1:10000      ) MPTCP	(duration   463ms) [ OK ]
00:33:56.554 # ns1 MPTCP -> ns1 (10.0.1.1:10001      ) TCP  	(duration   207ms) [ OK ]
00:33:57.032 # ns1 TCP   -> ns1 (10.0.1.1:10002      ) MPTCP	(duration   205ms) [ OK ]
00:33:57.515 # ns1 MPTCP -> ns1 (dead:beef:1::1:10003) MPTCP	(duration   379ms) [ OK ]
00:33:58.153 # ns1 MPTCP -> ns1 (dead:beef:1::1:10004) TCP  	(duration   211ms) [ OK ]
00:33:58.622 # ns1 TCP   -> ns1 (dead:beef:1::1:10005) MPTCP	(duration   193ms) [ OK ]
00:33:59.073 # ns1 MPTCP -> ns2 (10.0.1.2:10006      ) MPTCP	(duration   351ms) [ OK ]
00:33:59.685 # ns1 MPTCP -> ns2 (dead:beef:1::2:10007) MPTCP	(duration   979ms) [ OK ]
00:34:00.924 # ns1 MPTCP -> ns2 (10.0.2.1:10008      ) MPTCP	(duration  1069ms) [ OK ]
00:34:02.260 # ns1 MPTCP -> ns2 (dead:beef:2::1:10009) MPTCP	(duration   986ms) [ OK ]
00:34:03.521 # ns1 MPTCP -> ns3 (10.0.2.2:10010      ) MPTCP	(duration  2260ms) [ OK ]
00:34:06.058 # ns1 MPTCP -> ns3 (dead:beef:2::2:10011) MPTCP	(duration   826ms) [ OK ]
00:34:07.168 # ns1 MPTCP -> ns3 (10.0.3.2:10012      ) MPTCP	(duration   624ms) [ OK ]
00:34:08.058 # ns1 MPTCP -> ns3 (dead:beef:3::2:10013) MPTCP	(duration  1290ms) [ OK ]
00:34:09.609 # ns1 MPTCP -> ns4 (10.0.3.1:10014      ) MPTCP	(duration 29502ms) [ OK ]
00:34:39.378 # ns1 MPTCP -> ns4 (dead:beef:3::1:10015) MPTCP	(duration 26776ms) [ OK ]
00:35:06.414 # ns2 MPTCP -> ns1 (10.0.1.1:10016      ) MPTCP	(duration   634ms) [ OK ]
00:35:07.306 # ns2 MPTCP -> ns1 (dead:beef:1::1:10017) MPTCP	(duration   863ms) [ OK ]
00:35:08.432 # ns2 MPTCP -> ns3 (10.0.2.2:10018      ) MPTCP	(duration   838ms) [ OK ]
00:35:09.521 # ns2 MPTCP -> ns3 (dead:beef:2::2:10019) MPTCP	(duration  1310ms) [ OK ]
00:35:11.098 # ns2 MPTCP -> ns3 (10.0.3.2:10020      ) MPTCP	(duration   377ms) [ OK ]
00:35:11.768 # ns2 MPTCP -> ns3 (dead:beef:3::2:10021) MPTCP	(duration  1042ms) [ OK ]
00:35:13.065 # ns2 MPTCP -> ns4 (10.0.3.1:10022      ) MPTCP	(duration 29459ms) [ OK ]
00:35:42.789 # ns2 MPTCP -> ns4 (dead:beef:3::1:10023) MPTCP	(duration 27490ms) [ OK ]
00:36:10.537 # ns3 MPTCP -> ns1 (10.0.1.1:10024      ) MPTCP	(duration   554ms) [ OK ]
00:36:11.352 # ns3 MPTCP -> ns1 (dead:beef:1::1:10025) MPTCP	(duration  1317ms) [ OK ]
00:36:12.938 # ns3 MPTCP -> ns2 (10.0.1.2:10026      ) MPTCP	(duration   647ms) [ OK ]
00:36:13.858 # ns3 MPTCP -> ns2 (dead:beef:1::2:10027) MPTCP	(duration   972ms) [ OK ]
00:36:15.094 # ns3 MPTCP -> ns2 (10.0.2.1:10028      ) MPTCP	(duration  3448ms) [ OK ]
00:36:18.814 # ns3 MPTCP -> ns2 (dead:beef:2::1:10029) MPTCP	(duration   606ms) [ OK ]
00:36:19.698 # ns3 MPTCP -> ns4 (10.0.3.1:10030      ) MPTCP	(duration 25988ms) [ OK ]
00:36:45.948 # ns3 MPTCP -> ns4 (dead:beef:3::1:10031) MPTCP	(duration 26135ms) [ OK ]
00:37:12.351 # ns4 MPTCP -> ns1 (10.0.1.1:10032      ) MPTCP	(duration 55200ms) [ OK ]
00:38:07.812 # ns4 MPTCP -> ns1 (dead:beef:1::1:10033) MPTCP	(duration 38853ms) [ OK ]
00:38:46.937 # ns4 MPTCP -> ns2 (10.0.1.2:10034      ) MPTCP	(duration 43675ms) [ OK ]
00:39:30.877 # ns4 MPTCP -> ns2 (dead:beef:1::2:10035) MPTCP	(duration 42223ms) [ OK ]
00:40:13.370 # ns4 MPTCP -> ns2 (10.0.2.1:10036      ) MPTCP	(duration 46038ms) [ OK ]
00:40:59.675 # ns4 MPTCP -> ns2 (dead:beef:2::1:10037) MPTCP	./mptcp_connect.sh: line 114:  1140 Terminated              ip netns exec ${listener_ns} ./mptcp_connect -t $timeout -l -p $port -s ${srv_proto} $extra_args $local_addr < "$sin" > "$sout"
00:41:19.435 # ./mptcp_connect.sh: line 114:  1146 Terminated              ip netns exec ${connector_ns} ./mptcp_connect -t $timeout -p $port -s ${cl_proto} $extra_args $connect_addr < "$cin" > "$cout"
00:41:19.686 #
00:41:19.687 not ok 1 selftests: net/mptcp: mptcp_connect.sh # TIMEOUT

jenkins-tessares pushed a commit that referenced this issue Apr 8, 2020
An undefined rproc_ops .kick method in a remoteproc driver will result in an
"Unable to handle kernel NULL pointer dereference" in rproc_virtio_notify
after firmware loading if:

 1) the .kick method wasn't defined in the driver
 2) a resource_table exists in the firmware and has a "Virtio device entry" defined

Let's refuse to register an rproc-induced virtio device if no kick method was
defined for the rproc.

[   13.180049][  T415] 8<--- cut here ---
[   13.190558][  T415] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[   13.212544][  T415] pgd = (ptrval)
[   13.217052][  T415] [00000000] *pgd=00000000
[   13.224692][  T415] Internal error: Oops: 80000005 [#1] PREEMPT SMP ARM
[   13.231318][  T415] Modules linked in: rpmsg_char imx_rproc virtio_rpmsg_bus rpmsg_core [last unloaded: imx_rproc]
[   13.241687][  T415] CPU: 0 PID: 415 Comm: unload-load.sh Not tainted 5.5.2-00002-g707df13bbbdd #6
[   13.250561][  T415] Hardware name: Freescale i.MX7 Dual (Device Tree)
[   13.257009][  T415] PC is at 0x0
[   13.260249][  T415] LR is at rproc_virtio_notify+0x2c/0x54
[   13.265738][  T415] pc : [<00000000>]    lr : [<8050f6b0>]    psr: 60010113
[   13.272702][  T415] sp : b8d47c48  ip : 00000001  fp : bc04de00
[   13.278625][  T415] r10: bc04c000  r9 : 00000cc0  r8 : b8d46000
[   13.284548][  T415] r7 : 00000000  r6 : b898f200  r5 : 00000000  r4 : b8a29800
[   13.291773][  T415] r3 : 00000000  r2 : 990a3ad4  r1 : 00000000  r0 : b8a29800
[   13.299000][  T415] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[   13.306833][  T415] Control: 10c5387d  Table: b8b4806a  DAC: 00000051
[   13.313278][  T415] Process unload-load.sh (pid: 415, stack limit = 0x(ptrval))
[   13.320591][  T415] Stack: (0xb8d47c48 to 0xb8d48000)
[   13.325651][  T415] 7c40:                   b895b680 00000001 b898f200 803c6430 b895bc80 7f00ae18
[   13.334531][  T415] 7c60: 00000035 00000000 00000000 b9393200 80b3ed80 00004000 b9393268 bbf5a9a2
[   13.343410][  T415] 7c80: 00000e00 00000200 00000000 7f00aff0 7f00a014 b895b680 b895b800 990a3ad4
[   13.352290][  T415] 7ca0: 00000001 b898f210 b898f200 00000000 00000000 7f00e000 00000001 00000000
[   13.361170][  T415] 7cc0: 00000000 803c62e0 80b2169c 802a0924 b898f210 00000000 00000000 b898f210
[   13.370049][  T415] 7ce0: 80b9ba44 00000000 80b9ba48 00000000 7f00e000 00000008 80b2169c 80400114
[   13.378929][  T415] 7d00: 80b2169c 8061fd64 b898f210 7f00e000 80400744 b8d46000 80b21634 80b21634
[   13.387809][  T415] 7d20: 80b2169c 80400614 80b21634 80400718 7f00e000 00000000 b8d47d7c 80400744
[   13.396689][  T415] 7d40: b8d46000 80b21634 80b21634 803fe338 b898f254 b80fe76c b8d32e38 990a3ad4
[   13.405569][  T415] 7d60: fffffff3 b898f210 b8d46000 00000001 b898f254 803ffe7c 80857a90 b898f210
[   13.414449][  T415] 7d80: 00000001 990a3ad4 b8d46000 b898f210 b898f210 80b17aec b8a29c20 803ff0a4
[   13.423328][  T415] 7da0: b898f210 00000000 b8d46000 803fb8e0 b898f200 00000000 80b17aec b898f210
[   13.432209][  T415] 7dc0: b8a29c20 990a3ad4 b895b900 b898f200 8050fb7c 80b17aec b898f210 b8a29c20
[   13.441088][  T415] 7de0: b8a29800 b895b900 b8a29a04 803c5ec0 b8a29c00 b898f200 b8a29a20 00000007
[   13.449968][  T415] 7e00: b8a29c20 8050fd78 b8a29800 00000000 b8a29a20 b8a29c04 b8a29820 b8a299d0
[   13.458848][  T415] 7e20: b895b900 8050e5a4 b8a29800 b8a299d8 b8d46000 b8a299e0 b8a29820 b8a299d0
[   13.467728][  T415] 7e40: b895b900 8050e008 000041ed 00000000 b8b8c440 b8a299d8 b8a299e0 b8a299d8
[   13.476608][  T415] 7e60: b8b8c440 990a3ad4 00000000 b8a29820 b8b8c400 00000006 b8a29800 b895b880
[   13.485487][  T415] 7e80: b8d47f78 00000000 00000000 8050f4b4 00000006 b895b890 b8b8c400 008fbea0
[   13.494367][  T415] 7ea0: b895b880 8029f530 00000000 00000000 b8d46000 00000006 b8d46000 008fbea0
[   13.503246][  T415] 7ec0: 8029f434 00000000 b8d46000 00000000 00000000 8021e2e4 0000000a 8061fd0c
[   13.512125][  T415] 7ee0: 0000000a b8af0c00 0000000a b8af0c40 00000001 b8af0c40 00000000 8061f910
[   13.521005][  T415] 7f00: 0000000a 80240af4 00000002 b8d46000 00000000 8061fd0c 00000002 80232d7c
[   13.529884][  T415] 7f20: 00000000 b8d46000 00000000 990a3ad4 00000000 00000006 b8a62d80 008fbea0
[   13.538764][  T415] 7f40: b8d47f78 00000000 b8d46000 00000000 00000000 802210c0 b88f2900 00000000
[   13.547644][  T415] 7f60: b8a62d80 b8a62d80 b8d46000 00000006 008fbea0 80221320 00000000 00000000
[   13.556524][  T415] 7f80: b8af0c00 990a3ad4 0000006c 008fbea0 76f1cda0 00000004 80101204 00000004
[   13.565403][  T415] 7fa0: 00000000 80101000 0000006c 008fbea0 00000001 008fbea0 00000006 00000000
[   13.574283][  T415] 7fc0: 0000006c 008fbea0 76f1cda0 00000004 00000006 00000006 00000000 00000000
[   13.583162][  T415] 7fe0: 00000004 7ebaf7d0 76eb4c0b 76e3f206 600d0030 00000001 00000000 00000000
[   13.592056][  T415] [<8050f6b0>] (rproc_virtio_notify) from [<803c6430>] (virtqueue_notify+0x1c/0x34)
[   13.601298][  T415] [<803c6430>] (virtqueue_notify) from [<7f00ae18>] (rpmsg_probe+0x280/0x380 [virtio_rpmsg_bus])
[   13.611663][  T415] [<7f00ae18>] (rpmsg_probe [virtio_rpmsg_bus]) from [<803c62e0>] (virtio_dev_probe+0x1f8/0x2c4)
[   13.622022][  T415] [<803c62e0>] (virtio_dev_probe) from [<80400114>] (really_probe+0x200/0x450)
[   13.630817][  T415] [<80400114>] (really_probe) from [<80400614>] (driver_probe_device+0x16c/0x1ac)
[   13.639873][  T415] [<80400614>] (driver_probe_device) from [<803fe338>] (bus_for_each_drv+0x84/0xc8)
[   13.649102][  T415] [<803fe338>] (bus_for_each_drv) from [<803ffe7c>] (__device_attach+0xd4/0x164)
[   13.658069][  T415] [<803ffe7c>] (__device_attach) from [<803ff0a4>] (bus_probe_device+0x84/0x8c)
[   13.666950][  T415] [<803ff0a4>] (bus_probe_device) from [<803fb8e0>] (device_add+0x444/0x768)
[   13.675572][  T415] [<803fb8e0>] (device_add) from [<803c5ec0>] (register_virtio_device+0xa4/0xfc)
[   13.684541][  T415] [<803c5ec0>] (register_virtio_device) from [<8050fd78>] (rproc_add_virtio_dev+0xcc/0x1b8)
[   13.694466][  T415] [<8050fd78>] (rproc_add_virtio_dev) from [<8050e5a4>] (rproc_start+0x148/0x200)
[   13.703521][  T415] [<8050e5a4>] (rproc_start) from [<8050e008>] (rproc_boot+0x384/0x5c0)
[   13.711708][  T415] [<8050e008>] (rproc_boot) from [<8050f4b4>] (state_store+0x3c/0xc8)
[   13.719723][  T415] [<8050f4b4>] (state_store) from [<8029f530>] (kernfs_fop_write+0xfc/0x214)
[   13.728348][  T415] [<8029f530>] (kernfs_fop_write) from [<8021e2e4>] (__vfs_write+0x30/0x1cc)
[   13.736971][  T415] [<8021e2e4>] (__vfs_write) from [<802210c0>] (vfs_write+0xac/0x17c)
[   13.744985][  T415] [<802210c0>] (vfs_write) from [<80221320>] (ksys_write+0x64/0xe4)
[   13.752825][  T415] [<80221320>] (ksys_write) from [<80101000>] (ret_fast_syscall+0x0/0x54)
[   13.761178][  T415] Exception stack(0xb8d47fa8 to 0xb8d47ff0)
[   13.766932][  T415] 7fa0:                   0000006c 008fbea0 00000001 008fbea0 00000006 00000000
[   13.775811][  T415] 7fc0: 0000006c 008fbea0 76f1cda0 00000004 00000006 00000006 00000000 00000000
[   13.784687][  T415] 7fe0: 00000004 7ebaf7d0 76eb4c0b 76e3f206
[   13.790442][  T415] Code: bad PC value
[   13.839214][  T415] ---[ end trace 1fe21ecfc9f28852 ]---

Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Nikita Shubin <NShubin@topcon.com>
Fixes: 7a18694 ("remoteproc: remove the single rpmsg vdev limitation")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200306072452.24743-1-NShubin@topcon.com
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
jenkins-tessares pushed a commit that referenced this issue Apr 17, 2020
Fix tcon use-after-free and NULL ptr deref.

Customer system crashes with the following kernel log:

[462233.169868] CIFS VFS: Cancelling wait for mid 4894753 cmd: 14       => a QUERY DIR
[462233.228045] CIFS VFS: cifs_put_smb_ses: Session Logoff failure rc=-4
[462233.305922] CIFS VFS: cifs_put_smb_ses: Session Logoff failure rc=-4
[462233.306205] CIFS VFS: cifs_put_smb_ses: Session Logoff failure rc=-4
[462233.347060] CIFS VFS: cifs_put_smb_ses: Session Logoff failure rc=-4
[462233.347107] CIFS VFS: Close unmatched open
[462233.347113] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
...
    [exception RIP: cifs_put_tcon+0xa0] (this is doing tcon->ses->server)
 #6 [...] smb2_cancelled_close_fid at ... [cifs]
 #7 [...] process_one_work at ...
 #8 [...] worker_thread at ...
 #9 [...] kthread at ...

The most likely explanation we have is:

* When we put the last reference of a tcon (refcount=0), we close the
  cached share root handle.
* If closing a handle is interrupted, SMB2_close() will
  queue a SMB2_close() in a work thread.
* The queued object keeps a tcon ref so we bump the tcon
  refcount, jumping from 0 to 1.
* We reach the end of cifs_put_tcon() and free the tcon object despite
  it now having a refcount of 1.
* The queued work now runs, but the tcon, ses & server were freed in
  the meantime, resulting in a crash.

THREAD 1
========
cifs_put_tcon                 => tcon refcount reach 0
  SMB2_tdis
   close_shroot_lease
    close_shroot_lease_locked => if cached root has lease && refcount = 0
     smb2_close_cached_fid    => if cached root valid
      SMB2_close              => retry close in a thread if interrupted
       smb2_handle_cancelled_close
        __smb2_handle_cancelled_close    => !! tcon refcount bump 0 => 1 !!
         INIT_WORK(&cancelled->work, smb2_cancelled_close_fid);
         queue_work(cifsiod_wq, &cancelled->work) => queue work
 tconInfoFree(tcon);    ==> freed!
 cifs_put_smb_ses(ses); ==> freed!

THREAD 2 (workqueue)
========
smb2_cancelled_close_fid
  SMB2_close(0, cancelled->tcon, ...); => use-after-free of tcon
  cifs_put_tcon(cancelled->tcon);      => tcon refcount reach 0 second time
  *CRASH*

Fixes: d919131 ("CIFS: Close cached root handle only if it has a lease")
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
matttbe commented Apr 22, 2020

00:25:21.726 # INFO: Using loss of 0.71% delay 399 ms on ns3eth4
00:25:21.757 # ns1 MPTCP -> ns1 (10.0.1.1:10000      ) MPTCP	(duration  1670ms) [ OK ]
00:25:23.495 # ns1 MPTCP -> ns1 (10.0.1.1:10001      ) TCP  	(duration    37ms) [ OK ]
00:25:23.573 # ns1 TCP   -> ns1 (10.0.1.1:10002      ) MPTCP	(duration    28ms) [ OK ]
00:25:23.644 # ns1 MPTCP -> ns1 (dead:beef:1::1:10003) MPTCP	(duration    34ms) [ OK ]
00:25:23.722 # ns1 MPTCP -> ns1 (dead:beef:1::1:10004) TCP  	(duration    30ms) [ OK ]
00:25:23.794 # ns1 TCP   -> ns1 (dead:beef:1::1:10005) MPTCP	(duration    29ms) [ OK ]
00:25:23.866 # ns1 MPTCP -> ns2 (10.0.1.2:10006      ) MPTCP	(duration  7511ms) [ OK ]
00:25:31.420 # ns1 MPTCP -> ns2 (dead:beef:1::2:10007) MPTCP	(duration   485ms) [ OK ]
00:25:31.947 # ns1 MPTCP -> ns2 (10.0.2.1:10008      ) MPTCP	(duration  6372ms) [ OK ]
00:25:38.415 # ns1 MPTCP -> ns2 (dead:beef:2::1:10009) MPTCP	(duration  5986ms) [ OK ]
00:25:44.391 # ns1 MPTCP -> ns3 (10.0.2.2:10010      ) MPTCP	(duration  7307ms) [ OK ]
00:25:51.742 # ns1 MPTCP -> ns3 (dead:beef:2::2:10011) MPTCP	(duration  6079ms) [ OK ]
00:25:57.863 # ns1 MPTCP -> ns3 (10.0.3.2:10012      ) MPTCP	(duration  6107ms) [ OK ]
00:26:04.011 # ns1 MPTCP -> ns3 (dead:beef:3::2:10013) MPTCP	(duration    46ms) [ OK ]
00:26:04.099 # ns1 MPTCP -> ns4 (10.0.3.1:10014      ) MPTCP	(duration 45165ms) [ OK ]
00:26:49.312 # ns1 MPTCP -> ns4 (dead:beef:3::1:10015) MPTCP	(duration 45927ms) [ OK ]
00:27:35.279 # ns2 MPTCP -> ns1 (10.0.1.1:10016      ) MPTCP	(duration  7100ms) [ OK ]
00:27:42.422 # ns2 MPTCP -> ns1 (dead:beef:1::1:10017) MPTCP	(duration  8915ms) [ OK ]
00:27:51.382 # ns2 MPTCP -> ns3 (10.0.2.2:10018      ) MPTCP	(duration  7693ms) [ OK ]
00:27:59.117 # ns2 MPTCP -> ns3 (dead:beef:2::2:10019) MPTCP	(duration   247ms) [ OK ]
00:27:59.406 # ns2 MPTCP -> ns3 (10.0.3.2:10020      ) MPTCP	(duration  9663ms) [ OK ]
00:28:09.118 # ns2 MPTCP -> ns3 (dead:beef:3::2:10021) MPTCP	(duration    41ms) [ OK ]
00:28:09.202 # ns2 MPTCP -> ns4 (10.0.3.1:10022      ) MPTCP	(duration 44969ms) [ OK ]
00:28:54.209 # ns2 MPTCP -> ns4 (dead:beef:3::1:10023) MPTCP	(duration 45924ms) [ OK ]
00:29:40.172 # ns3 MPTCP -> ns1 (10.0.1.1:10024      ) MPTCP	(duration  5669ms) [ OK ]
00:29:45.885 # ns3 MPTCP -> ns1 (dead:beef:1::1:10025) MPTCP	(duration 13057ms) [ OK ]
00:29:58.982 # ns3 MPTCP -> ns2 (10.0.1.2:10026      ) MPTCP	(duration  6148ms) [ OK ]
00:30:05.181 # ns3 MPTCP -> ns2 (dead:beef:1::2:10027) MPTCP	(duration   248ms) [ OK ]
00:30:05.474 # ns3 MPTCP -> ns2 (10.0.2.1:10028      ) MPTCP	(duration  6724ms) [ OK ]
00:30:12.239 # ns3 MPTCP -> ns2 (dead:beef:2::1:10029) MPTCP	(duration  6784ms) [ OK ]
00:30:19.070 # ns3 MPTCP -> ns4 (10.0.3.1:10030      ) MPTCP	(duration 45526ms) [ OK ]
00:31:04.635 # ns3 MPTCP -> ns4 (dead:beef:3::1:10031) MPTCP	(duration 46004ms) [ OK ]
00:31:50.678 # ns4 MPTCP -> ns1 (10.0.1.1:10032      ) MPTCP	./mptcp_connect.sh: line 114:  1013 Terminated              ip netns exec ${listener_ns} ./mptcp_connect -t $timeout -l -p $port -s ${srv_proto} $extra_args $local_addr < "$sin" > "$sout"
00:32:48.166 # ./mptcp_connect.sh: line 114:  1019 Terminated              ip netns exec ${connector_ns} ./mptcp_connect -t $timeout -p $port -s ${cl_proto} $extra_args $connect_addr < "$cin" > "$cout"

jenkins-tessares pushed a commit that referenced this issue May 1, 2020
Ido Schimmel says:

====================
mlxsw: Prepare SPAN API for upcoming changes

Switched port analyzer (SPAN) is used for packet mirroring. Over mlxsw
this is achieved by attaching tc-mirred action to either matchall or
flower classifier.
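
For context, a hedged sketch of what such a mirroring rule looks like from user space (the port and tunnel device names are illustrative):

  # mirror everything ingressing swp1 to the analyzer port swp2 (matchall)
  tc qdisc add dev swp1 clsact
  tc filter add dev swp1 ingress matchall skip_sw \
      action mirred egress mirror dev swp2

  # the same idea with a flower match, mirroring into a gretap device
  tc filter add dev swp1 ingress protocol ip flower skip_sw dst_ip 192.0.2.1 \
      action mirred egress mirror dev gt4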

The current API used to configure SPAN consists of two functions:
mlxsw_sp_span_mirror_add() and mlxsw_sp_span_mirror_del().

These two functions pack a lot of different operations:

* SPAN agent configuration: Determining the egress port and optional
  headers that need to encapsulate the mirrored packet (when mirroring
  to a gretap, for example)

* Egress mirror buffer configuration: Allocating / freeing a buffer when
  port is analyzed (inspected) at egress

* SPAN agent binding: Binding the SPAN agent to a trigger, if any. The
  current triggers are incoming / outgoing packet and they are only used
  for matchall-based mirroring

This non-modular design makes it difficult to extend the API for future
changes, such as new mirror targets (CPU) and new global triggers (early
dropped packets, for example).

Therefore, this patch set gradually adds APIs for the above-mentioned
operations and then converts the two existing users to use them instead of
the old API. No functional changes intended. Tested with existing
mirroring selftests.

Patch set overview:

Patches #1-#5 gradually add the new API
Patches #6-#8 convert existing users to use the new API
Patch #9 removes the old API
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
matttbe commented May 5, 2020

(again just for the record, from the last build today)

00:18:30.746 # ns1 MPTCP -> ns1 (10.0.1.1:10000      ) MPTCP	(duration   174ms) [ OK ]
00:18:31.198 # ns1 MPTCP -> ns1 (10.0.1.1:10001      ) TCP  	(duration   150ms) [ OK ]
00:18:31.597 # ns1 TCP   -> ns1 (10.0.1.1:10002      ) MPTCP	(duration   145ms) [ OK ]
00:18:31.998 # ns1 MPTCP -> ns1 (dead:beef:1::1:10003) MPTCP	(duration  1460ms) [ OK ]
00:18:33.712 # ns1 MPTCP -> ns1 (dead:beef:1::1:10004) TCP  	(duration   158ms) [ OK ]
00:18:34.129 # ns1 TCP   -> ns1 (dead:beef:1::1:10005) MPTCP	(duration   152ms) [ OK ]
00:18:34.534 # ns1 MPTCP -> ns2 (10.0.1.2:10006      ) MPTCP	(duration  1436ms) [ OK ]
00:18:36.231 # ns1 MPTCP -> ns2 (dead:beef:1::2:10007) MPTCP	(duration   334ms) [ OK ]
00:18:36.816 # ns1 MPTCP -> ns2 (10.0.2.1:10008      ) MPTCP	(duration  4616ms) [ OK ]
00:18:41.685 # ns1 MPTCP -> ns2 (dead:beef:2::1:10009) MPTCP	(duration  3296ms) [ OK ]
00:18:45.237 # ns1 MPTCP -> ns3 (10.0.2.2:10010      ) MPTCP	(duration   568ms) [ OK ]
00:18:46.071 # ns1 MPTCP -> ns3 (dead:beef:2::2:10011) MPTCP	(duration  1297ms) [ OK ]
00:18:47.629 # ns1 MPTCP -> ns3 (10.0.3.2:10012      ) MPTCP	(duration   969ms) [ OK ]
00:18:48.858 # ns1 MPTCP -> ns3 (dead:beef:3::2:10013) MPTCP	(duration   246ms) [ OK ]
00:18:49.371 # ns1 MPTCP -> ns4 (10.0.3.1:10014      ) MPTCP	(duration 38070ms) [ OK ]
00:19:27.689 # ns1 MPTCP -> ns4 (dead:beef:3::1:10015) MPTCP	(duration 38185ms) [ OK ]
00:20:06.135 # ns2 MPTCP -> ns1 (10.0.1.1:10016      ) MPTCP	(duration  1308ms) [ OK ]
00:20:07.688 # ns2 MPTCP -> ns1 (dead:beef:1::1:10017) MPTCP	(duration   573ms) [ OK ]
00:20:08.511 # ns2 MPTCP -> ns3 (10.0.2.2:10018      ) MPTCP	(duration   429ms) [ OK ]
00:20:09.179 # ns2 MPTCP -> ns3 (dead:beef:2::2:10019) MPTCP	(duration   636ms) [ OK ]
00:20:10.067 # ns2 MPTCP -> ns3 (10.0.3.2:10020      ) MPTCP	(duration  4633ms) [ OK ]
00:20:14.946 # ns2 MPTCP -> ns3 (dead:beef:3::2:10021) MPTCP	(duration  6334ms) [ OK ]
00:20:21.524 # ns2 MPTCP -> ns4 (10.0.3.1:10022      ) MPTCP	(duration 37689ms) [ OK ]
00:20:59.460 # ns2 MPTCP -> ns4 (dead:beef:3::1:10023) MPTCP	(duration 38293ms) [ OK ]
00:21:38.002 # ns3 MPTCP -> ns1 (10.0.1.1:10024      ) MPTCP	(duration  1229ms) [ OK ]
00:21:39.493 # ns3 MPTCP -> ns1 (dead:beef:1::1:10025) MPTCP	(duration   349ms) [ OK ]
00:21:40.097 # ns3 MPTCP -> ns2 (10.0.1.2:10026      ) MPTCP	(duration   350ms) [ OK ]
00:21:40.695 # ns3 MPTCP -> ns2 (dead:beef:1::2:10027) MPTCP	(duration  5556ms) [ OK ]
00:21:46.501 # ns3 MPTCP -> ns2 (10.0.2.1:10028      ) MPTCP	(duration   676ms) [ OK ]
00:21:47.429 # ns3 MPTCP -> ns2 (dead:beef:2::1:10029) MPTCP	(duration   491ms) [ OK ]
00:21:48.168 # ns3 MPTCP -> ns4 (10.0.3.1:10030      ) MPTCP	(duration 37835ms) [ OK ]
00:22:26.256 # ns3 MPTCP -> ns4 (dead:beef:3::1:10031) MPTCP	(duration 38384ms) [ OK ]
00:23:04.890 # ns4 MPTCP -> ns1 (10.0.1.1:10032      ) MPTCP	(duration 52943ms) [ OK ]
00:23:58.077 # ns4 MPTCP -> ns1 (dead:beef:1::1:10033) MPTCP	(duration 57209ms) [ OK ]
00:24:55.534 # ns4 MPTCP -> ns2 (10.0.1.2:10034      ) MPTCP	copyfd_io_poll: poll timed out (events: POLLIN 1, POLLOUT 0)
00:26:23.176 #
00:26:23.177 not ok 1 selftests: net/mptcp: mptcp_connect.sh # TIMEOUT

matttbe commented May 15, 2020

(again just for the record, from the last build today)

00:40:10.051 # INFO: Using loss of 0.82% delay 302 ms on ns3eth4
00:40:10.105 # ns1 MPTCP -> ns1 (10.0.1.1:10000      ) MPTCP	(duration  1804ms) [ OK ]
00:40:12.186 # ns1 MPTCP -> ns1 (10.0.1.1:10001      ) TCP  	(duration   199ms) [ OK ]
00:40:12.653 # ns1 TCP   -> ns1 (10.0.1.1:10002      ) MPTCP	(duration   183ms) [ OK ]
00:40:13.096 # ns1 MPTCP -> ns1 (dead:beef:1::1:10003) MPTCP	(duration  1760ms) [ OK ]
00:40:15.110 # ns1 MPTCP -> ns1 (dead:beef:1::1:10004) TCP  	(duration   181ms) [ OK ]
00:40:15.546 # ns1 TCP   -> ns1 (dead:beef:1::1:10005) MPTCP	(duration   175ms) [ OK ]
00:40:15.984 # ns1 MPTCP -> ns2 (10.0.1.2:10006      ) MPTCP	(duration   615ms) [ OK ]
00:40:16.856 # ns1 MPTCP -> ns2 (dead:beef:1::2:10007) MPTCP	(duration   643ms) [ OK ]
00:40:17.761 # ns1 MPTCP -> ns2 (10.0.2.1:10008      ) MPTCP	(duration   607ms) [ OK ]
00:40:18.631 # ns1 MPTCP -> ns2 (dead:beef:2::1:10009) MPTCP	(duration   663ms) [ OK ]
00:40:19.553 # ns1 MPTCP -> ns3 (10.0.2.2:10010      ) MPTCP	(duration  1056ms) [ OK ]
00:40:20.876 # ns1 MPTCP -> ns3 (dead:beef:2::2:10011) MPTCP	(duration  3978ms) [ OK ]
00:40:25.108 # ns1 MPTCP -> ns3 (10.0.3.2:10012      ) MPTCP	(duration  1001ms) [ OK ]
00:40:26.376 # ns1 MPTCP -> ns3 (dead:beef:3::2:10013) MPTCP	(duration   954ms) [ OK ]
00:40:27.594 # ns1 MPTCP -> ns4 (10.0.3.1:10014      ) MPTCP	(duration 34148ms) [ OK ]
00:41:02.005 # ns1 MPTCP -> ns4 (dead:beef:3::1:10015) MPTCP	(duration 46205ms) [ OK ]
00:41:48.484 # ns2 MPTCP -> ns1 (10.0.1.1:10016      ) MPTCP	(duration  5254ms) [ OK ]
00:41:54.001 # ns2 MPTCP -> ns1 (dead:beef:1::1:10017) MPTCP	(duration   684ms) [ OK ]
00:41:54.944 # ns2 MPTCP -> ns3 (10.0.2.2:10018      ) MPTCP	(duration  1042ms) [ OK ]
00:41:56.244 # ns2 MPTCP -> ns3 (dead:beef:2::2:10019) MPTCP	(duration  1717ms) [ OK ]
00:41:58.215 # ns2 MPTCP -> ns3 (10.0.3.2:10020      ) MPTCP	(duration   628ms) [ OK ]
00:41:59.111 # ns2 MPTCP -> ns3 (dead:beef:3::2:10021) MPTCP	(duration   884ms) [ OK ]
00:42:00.268 # ns2 MPTCP -> ns4 (10.0.3.1:10022      ) MPTCP	(duration 46580ms) [ OK ]
00:42:47.112 # ns2 MPTCP -> ns4 (dead:beef:3::1:10023) MPTCP	(duration 54620ms) [ OK ]
00:43:41.989 # ns3 MPTCP -> ns1 (10.0.1.1:10024      ) MPTCP	(duration  1140ms) [ OK ]
00:43:43.398 # ns3 MPTCP -> ns1 (dead:beef:1::1:10025) MPTCP	(duration  1206ms) [ OK ]
00:43:44.861 # ns3 MPTCP -> ns2 (10.0.1.2:10026      ) MPTCP	(duration  1000ms) [ OK ]
00:43:46.129 # ns3 MPTCP -> ns2 (dead:beef:1::2:10027) MPTCP	(duration  1083ms) [ OK ]
00:43:47.485 # ns3 MPTCP -> ns2 (10.0.2.1:10028      ) MPTCP	(duration   973ms) [ OK ]
00:43:48.735 # ns3 MPTCP -> ns2 (dead:beef:2::1:10029) MPTCP	(duration  1079ms) [ OK ]
00:43:50.082 # ns3 MPTCP -> ns4 (10.0.3.1:10030      ) MPTCP	(duration 31956ms) [ OK ]
00:44:22.295 # ns3 MPTCP -> ns4 (dead:beef:3::1:10031) MPTCP	(duration 32188ms) [ OK ]
00:44:54.746 # ns4 MPTCP -> ns1 (10.0.1.1:10032      ) MPTCP	(duration 31592ms) [ OK ]
00:45:26.591 # ns4 MPTCP -> ns1 (dead:beef:1::1:10033) MPTCP	(duration 31916ms) [ OK ]
00:45:58.776 # ns4 MPTCP -> ns2 (10.0.1.2:10034      ) MPTCP	(duration 33395ms) [ OK ]
00:46:32.445 # ns4 MPTCP -> ns2 (dead:beef:1::2:10035) MPTCP	(duration 31892ms) [ OK ]
00:47:04.586 # ns4 MPTCP -> ns2 (10.0.2.1:10036      ) MPTCP	copyfd_io_poll: poll timed out (events: POLLIN 0, POLLOUT 4)
00:48:03.743 # copyfd_io_poll: poll timed out (events: POLLIN 1, POLLOUT 0)
00:48:03.777 #
00:48:03.777 not ok 1 selftests: net/mptcp: mptcp_connect.sh # TIMEOUT

jenkins-tessares pushed a commit that referenced this issue May 25, 2020
This BUG halt was reported a while back, but the patch somehow got
missed:

PID: 2879   TASK: c16adaa0  CPU: 1   COMMAND: "sctpn"
 #0 [f418dd28] crash_kexec at c04a7d8c
 #1 [f418dd7c] oops_end at c0863e02
 #2 [f418dd90] do_invalid_op at c040aaca
 #3 [f418de28] error_code (via invalid_op) at c08631a5
    EAX: f34baac0  EBX: 00000090  ECX: f418deb0  EDX: f5542950  EBP: 00000000
    DS:  007b      ESI: f34ba800  ES:  007b      EDI: f418dea0  GS:  00e0
    CS:  0060      EIP: c046fa5e  ERR: ffffffff  EFLAGS: 00010286
 #4 [f418de5c] add_timer at c046fa5e
 #5 [f418de68] sctp_do_sm at f8db8c77 [sctp]
 #6 [f418df30] sctp_primitive_SHUTDOWN at f8dcc1b5 [sctp]
 #7 [f418df48] inet_shutdown at c080baf9
 #8 [f418df5c] sys_shutdown at c079eedf
 #9 [f418df70] sys_socketcall at c079fe88
    EAX: ffffffda  EBX: 0000000d  ECX: bfceea90  EDX: 0937af98
    DS:  007b      ESI: 0000000c  ES:  007b      EDI: b7150ae4
    SS:  007b      ESP: bfceea7c  EBP: bfceeaa8  GS:  0033
    CS:  0073      EIP: b775c424  ERR: 00000066  EFLAGS: 00000282

It appears that the side effect that starts the shutdown timer was processed
multiple times, which can happen as multiple paths can trigger it.  This of
course leads to the BUG halt in add_timer getting called.

The fix seems pretty straightforward: just check, before the timer is added,
whether it has already been started. If it has, mod the timer instead to
min(current expiration, new expiration).

It's been tested but not confirmed to fix the problem, as the issue has only
occurred in production environments where test kernels are enjoined from being
installed. It appears to be a sane fix to me, though. Also, recently,
Jere found a reproducer posted on the list to confirm that this resolves the
issue.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Vlad Yasevich <vyasevich@gmail.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: jere.leppanen@nokia.com
CC: marcelo.leitner@gmail.com
CC: netdev@vger.kernel.org
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
matttbe pushed a commit that referenced this issue Jun 2, 2020
Pablo Neira Ayuso says:

====================
the indirect flow_block infrastructure, revisited

This series fixes b5140a3 ("netfilter: flowtable: add indr block
setup support"), which added support for the indirect block for the
flowtable. That patch crashes the kernel with the TC CT action.

[  630.908086] BUG: kernel NULL pointer dereference, address: 00000000000000f0
[  630.908233] #PF: error_code(0x0000) - not-present page
[  630.908304] PGD 800000104addd067 P4D 800000104addd067 PUD 104311d067 PMD 0
[  630.908380] Oops: 0000 [#1] SMP PTI [  630.908615] RIP: 0010:nf_flow_table_indr_block_cb+0xc0/0x190 [nf_flow_table]
[  630.908690] Code: 5b 41 5c 41 5d 41 5e 41 5f 5d c3 4c 89 75 a0 4c 89 65 a8 4d 89 ee 49 89 dd 4c 89 fe 48 c7 c7 b7 64 36 a0 31 c0 e8 ce ed d8 e0 <49> 8b b7 f0 00 00 00 48 c7 c7 c8 64 36 a0 31 c0 e8 b9 ed d8 e0 49
[  630.908790] RSP: 0018:ffffc9000895f8c0 EFLAGS: 00010246
[...]
[  630.910774] Call Trace:
[  630.911192]  ? mlx5e_rep_indr_setup_block+0x270/0x270 [mlx5_core]
[  630.911621]  ? mlx5e_rep_indr_setup_block+0x270/0x270 [mlx5_core]
[  630.912040]  ? mlx5e_rep_indr_setup_block+0x270/0x270 [mlx5_core]
[  630.912443]  flow_block_cmd+0x51/0x80
[  630.912844]  __flow_indr_block_cb_register+0x26c/0x510
[  630.913265]  mlx5e_nic_rep_netdevice_event+0x9e/0x110 [mlx5_core]
[  630.913665]  notifier_call_chain+0x53/0xa0
[  630.914063]  raw_notifier_call_chain+0x16/0x20
[  630.914466]  call_netdevice_notifiers_info+0x39/0x90
[  630.914859]  register_netdevice+0x484/0x550
[  630.915256]  __ip_tunnel_create+0x12b/0x1f0 [ip_tunnel]
[  630.915661]  ip_tunnel_init_net+0x116/0x180 [ip_tunnel]
[  630.916062]  ipgre_tap_init_net+0x22/0x30 [ip_gre]
[  630.916458]  ops_init+0x44/0x110
[  630.916851]  register_pernet_operations+0x112/0x200

A workaround patch to cure this crash has been proposed. However, there
is another problem: The indirect flow_block still does not work for the
new TC CT action. The problem is that the existing flow_indr_block_entry
callback assumes you can look up the flowtable from the netdevice to
get the flow_block. This flow_block allows you to offload the flows via
TC_SETUP_CLSFLOWER. Unfortunately, it is not possible to get the
flow_block from the TC CT flowtables because they are _not_ bound to any
specific netdevice.

= What is the indirect flow_block infrastructure?

The indirect flow_block infrastructure allows drivers to offload
tc/netfilter rules that belong to software tunnel netdevices, e.g.
vxlan.

This indirect flow_block infrastructure relates tunnel netdevices with
drivers because there is no obvious way to relate these two things
from the control plane.

= How does the indirect flow_block work before this patchset?

Front-ends register the indirect block callback through
flow_indr_add_block_cb() if they support offloading tunnel
netdevices.

== Setting up an indirect block

1) Drivers track tunnel netdevices via NETDEV_{REGISTER,UNREGISTER} events.
   If there is a new tunnel netdevice that the driver can offload, then the
   driver invokes __flow_indr_block_cb_register() with the new tunnel
   netdevice and the driver callback. The __flow_indr_block_cb_register()
   call iterates over the list of the front-end callbacks.

2) The front-end callback sets up the flow_block_offload structure and it
   invokes the driver callback to set up the flow_block.

3) The driver callback now registers the flow_block structure and it
   returns the flow_block back to the front-end.

4) The front-end gets the flow_block object and it is now ready to
   offload rules for this tunnel netdevice.

A simplified callgraph is represented below.

        Front-end                      Driver

                                   NETDEV_REGISTER
                                         |
                     __flow_indr_block_cb_register(netdev, cb_priv, driver_cb)
                                         | [1]
            .--------------frontend_indr_block_cb(cb_priv, driver_cb)
            |
            .
   setup_flow_block_offload(bo)
            | [2]
       driver_cb(bo, cb_priv) -----------.
                                         |
                                         \/
                                  set up flow_blocks [3]
                                         |
      add rules to flow_block <----------
      TC_SETUP_CLSFLOWER [4]

== Releasing the indirect flow_block

There are two possibilities: either the tunnel netdevice is removed or
a netdevice (port representor) is removed.

=== Tunnel netdevice is removed

Driver waits for the NETDEV_UNREGISTER event that announces the tunnel
netdevice removal. Then, it calls __flow_indr_block_cb_unregister() to
remove the flow_block and rules.  Callgraph is very similar to the one
described above.

=== Netdevice is removed (port representor)

Driver calls __flow_indr_block_cb_unregister() to remove the existing
netfilter/tc rules that belong to the tunnel netdevice.

= How does the indirect flow_block work after this patchset?

Drivers register the indirect flow_block setup callback through
flow_indr_dev_register() if they support offloading tunnel
netdevices.

== Setting up an indirect flow_block

1) Frontends check if dev->netdev_ops->ndo_setup_tc is unset. If so,
   frontends call flow_indr_dev_setup_offload(). This call invokes
   the drivers' indirect flow_block setup callback.

2) The indirect flow_block setup callback sets up a flow_block structure
   which relates the tunnel netdevice and the driver.

3) The front-end uses the flow_block and offloads the rules.

Note that the procedure to set up a (non-indirect) flow_block is very
similar.
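
A rough sketch of the front-end side of this decision (simplified and
hypothetical: the real flow_indr_dev_setup_offload() takes more arguments
than shown here):

  /* Simplified front-end dispatch; the argument lists are trimmed down
   * and do not match the exact upstream prototypes. */
  static int frontend_setup_block(struct net_device *dev,
                                  struct flow_block_offload *bo)
  {
          if (dev->netdev_ops->ndo_setup_tc)
                  /* usual case: the device driver handles the block itself */
                  return dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_BLOCK, bo);

          /* no ndo_setup_tc: go through the indirect flow_block setup
           * callbacks registered by the drivers */
          return flow_indr_dev_setup_offload(dev, TC_SETUP_BLOCK, bo);
  }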

== Releasing the indirect flow_block

=== Tunnel netdevice is removed

The front-end calls flow_indr_dev_setup_offload() to tear down the flow_block
and remove the offloaded rules. This alternate path is exercised if
dev->netdev_ops->ndo_setup_tc is unset.

=== Netdevice is removed (port representor)

If a netdevice is removed, then it might be necessary to clean up the
offloaded tc/netfilter rules that belong to the tunnel netdevice:

1) The driver invokes flow_indr_dev_unregister() when a netdevice is
   removed.

2) This call iterates over the existing indirect flow_blocks
   and it invokes the cleanup callback to let the front-end remove the
   tc/netfilter rules. The cleanup callback already provides the
   flow_block that the front-end needs to clean up.

        Front-end                      Driver

                                         |
                            flow_indr_dev_unregister(...)
                                         |
                         iterate over list of indirect flow_block
                               and invoke cleanup callback
                                         |
            .-----------------------------
            |
            .
   frontend_flow_block_cleanup(flow_block)
            .
            |
           \/
   remove rules from flow_block
      TC_SETUP_CLSFLOWER

= About this patchset

This patchset aims to address the existing TC CT problem while
simplifying the indirect flow_block infrastructure, saving 300 LoC in
the flow_offload core and the drivers. The operation gets aligned with
the (non-indirect) flow_block logic. The patchset is composed of:

Patch #1 adds nf_flow_table_gc_cleanup(), which is required by the
         netfilter flowtable's new indirect flow_block approach.

Patch #2 adds the flow_block_indr object, which is actually part
         of the flow_block object. This stores the indirect flow_block
         metadata such as the tunnel netdevice owner and the cleanup
         callback (in case the tunnel netdevice goes away).

         This patch adds flow_indr_dev_{un}register() to allow drivers
         to offer netdevice tunnel hardware offload to the front-ends.
         Then, front-ends call flow_indr_dev_setup_offload() to invoke
         the drivers to set up the (indirect) flow_block.

Patch #3 adds the tcf_block_offload_init() helper function; this is
         a preparation patch to adapt the tc front-end to use this
         new indirect flow_block infrastructure.

Patch #4 updates the tc and netfilter front-ends to use the new
         indirect flow_block infrastructure.

Patch #5 updates the mlx5 driver to use the new indirect flow_block
         infrastructure.

Patch #6 updates the nfp driver to use the new indirect flow_block
         infrastructure.

Patch #7 updates the bnxt driver to use the new indirect flow_block
         infrastructure.

Patch #8 removes the indirect flow_block infrastructure version 1,
         now that frontends and drivers have been translated to
         version 2 (coming in this patchset).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
matttbe pushed a commit that referenced this issue Jun 2, 2020
Ido Schimmel says:

====================
devlink: Add support for control packet traps

So far device drivers were only able to register drop and exception
packet traps with devlink. These traps are used for packets that were
either dropped by the underlying device or encountered an exception
(e.g., missing neighbour entry) during forwarding.

However, in the steady state, the majority of the packets being trapped
to the CPU are packets that are required for the correct functioning of
the control plane. For example, ARP request and IGMP query packets.

This patch set allows device drivers to register such control traps with
devlink and expose their default control plane policy to user space.
User space can then tune the packet trap policer settings according to
its needs, as with existing packet traps.

In a similar fashion to exception traps, the action associated with such
traps cannot be changed as it can easily break the control plane. Unlike
drop and exception traps, packets trapped via control traps are not
reported to the kernel's drop monitor as they are not indicative of any
problem.

Patch set overview:

Patches #1-#3 break out layer 3 exceptions to a different group to
provide better granularity. A future patch set will make this completely
configurable.

Patch #4 adds a new trap action ('mirror') that is used for packets that
are forwarded by the device and sent to the CPU. Such packets are marked
by device drivers with 'skb->offload_fwd_mark = 1' in order to prevent
the kernel from forwarding them again.
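
For illustration, a hypothetical driver receive path (not code from this
patch set) would mark such a packet like this:

  /* The hardware already forwarded this packet; hand it to the stack for
   * the control plane only and tell the kernel not to forward it again. */
  skb->offload_fwd_mark = 1;
  netif_receive_skb(skb);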

Patch #5 adds the new trap type, 'control'.

Patches #6-#8 gradually add various control traps to devlink with proper
documentation.

Patch #9 adds a few control traps to netdevsim, which are automatically
exercised by existing devlink-trap selftest.

Patch #10 performs a small refactoring in mlxsw.

Patches #11-#13 change mlxsw to register its existing control traps with
devlink.

Patch #14 adds a selftest over mlxsw that exercises all the registered
control traps.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
jenkins-tessares pushed a commit that referenced this issue Jun 9, 2020
Implement rtas_call_reentrant() for reentrant rtas-calls:
"ibm,int-on", "ibm,int-off",ibm,get-xive" and  "ibm,set-xive".

On LoPAPR Version 1.1 (March 24, 2016), from 7.3.10.1 to 7.3.10.4,
items 2 and 3 say:

2 - For the PowerPC External Interrupt option: The * call must be
reentrant to the number of processors on the platform.
3 - For the PowerPC External Interrupt option: The * argument call
buffer for each simultaneous call must be physically unique.

So, these rtas-calls can be made in a lockless way, if a
different buffer is used for each cpu doing such an rtas call.

For this, it was suggested to add the buffer (struct rtas_args)
to the PACA struct, so each cpu can have its own buffer.
The PACA struct received a pointer to the rtas buffer, which is
allocated in the memory range available to rtas 32-bit.

Reentrant rtas calls are useful to avoid deadlocks while crashing,
where rtas-calls are needed but some other thread crashed while holding
the rtas.lock.
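
A minimal sketch of the per-cpu buffer idea (illustrative only; the actual
patch keeps the buffer behind a pointer in the PACA rather than in a
generic per-cpu variable, and fills in the RTAS arguments for real):

  #include <linux/percpu.h>
  #include <linux/preempt.h>
  #include <asm/rtas.h>

  /* One rtas_args buffer per cpu, so no global lock is needed while the
   * arguments are filled in and RTAS is entered. */
  static DEFINE_PER_CPU(struct rtas_args, reentrant_rtas_args);

  static void rtas_reentrant_call_sketch(void)
  {
          struct rtas_args *args;

          preempt_disable();              /* stay on this cpu's buffer */
          args = this_cpu_ptr(&reentrant_rtas_args);
          /* ... fill *args and enter RTAS without taking rtas.lock ... */
          preempt_enable();
  }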

This is a backtrace of a deadlock from a kdump testing environment:

  #0 arch_spin_lock
  #1  lock_rtas ()
  #2  rtas_call (token=8204, nargs=1, nret=1, outputs=0x0)
  #3  ics_rtas_mask_real_irq (hw_irq=4100)
  #4  machine_kexec_mask_interrupts
  #5  default_machine_crash_shutdown
  #6  machine_crash_shutdown
  #7  __crash_kexec
  #8  crash_kexec
  #9  oops_end

Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
[mpe: Move under #ifdef PSERIES to avoid build breakage]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200518234245.200672-3-leobras.c@gmail.com
jenkins-tessares pushed a commit that referenced this issue Jun 14, 2020
The first version of Clang that supports -tsan-distinguish-volatile will
be able to support KCSAN. The first Clang release to do so will be
Clang 11. This is due to satisfying all the following requirements:

1. Never emit calls to __tsan_func_{entry,exit}.

2. __no_kcsan functions should not call anything, not even
   kcsan_{enable,disable}_current(), when using __{READ,WRITE}_ONCE => Requires
   leaving them plain!

3. Support atomic_{read,set}*() with KCSAN, which rely on
   arch_atomic_{read,set}*() using __{READ,WRITE}_ONCE() => Because of
   #2, rely on Clang 11's -tsan-distinguish-volatile support. We will
   double-instrument atomic_{read,set}*(), but that's reasonable given
   it's still lower cost than the data_race() variant due to avoiding 2
   extra calls (kcsan_{en,dis}able_current() calls).

4. __always_inline functions inlined into __no_kcsan functions are never
   instrumented.

5. __always_inline functions inlined into instrumented functions are
   instrumented.

6. __no_kcsan_or_inline functions may be inlined into __no_kcsan functions =>
   Implies leaving 'noinline' off of __no_kcsan_or_inline.

7. Because of #6, __no_kcsan and __no_kcsan_or_inline functions should never be
   spuriously inlined into instrumented functions, causing the accesses of the
   __no_kcsan function to be instrumented.

Older versions of Clang do not satisfy #3. The latest GCC currently
doesn't support at least #1, #3, and #7.
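
As an illustration of requirement 2 (an assumed example, not code from
this patch):

  #include <linux/compiler.h>

  /* Inside a __no_kcsan function the double-underscore accessors must stay
   * plain loads/stores: no kcsan_{enable,disable}_current() calls around
   * them. */
  static __no_kcsan int read_flag_plain(const int *flag)
  {
          return __READ_ONCE(*flag);
  }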

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/CANpmjNMTsY_8241bS7=XAfqvZHFLrVEkv_uM4aDUWE_kh3Rvbw@mail.gmail.com
Link: https://lkml.kernel.org/r/20200521142047.169334-7-elver@google.com
@matttbe
Member Author

matttbe commented Jun 15, 2020

@pabeni I got the issue even after having applied "mptcp: add receive buffer auto-tuning":

# ./mptcp_connect.sh -r 0
[ 1002.231694] IPv6: ADDRCONF(NETDEV_CHANGE): ns1eth2: link becomes ready
[ 1002.420845] IPv6: ADDRCONF(NETDEV_CHANGE): ns2eth3: link becomes ready
[ 1002.635760] IPv6: ADDRCONF(NETDEV_CHANGE): ns3eth4: link becomes ready
INFO: set ns3-5ee0e8b5-nEz9hH dev ns3eth2: ethtool -K tso off
INFO: set ns4-5ee0e8b5-nEz9hH dev ns4eth3: ethtool -K  gro off
[ 1002.895137] IPv6: ADDRCONF(NETDEV_CHANGE): ns2eth1: link becomes ready
Created /tmp/tmp.hedfopzWJM (size 7533596       /tmp/tmp.hedfopzWJM) containing data sent by client
Created /tmp/tmp.h9rDBjuvQw (size 3219484       /tmp/tmp.h9rDBjuvQw) containing data sent by server
New MPTCP socket can be blocked via sysctl              [ OK ]
setsockopt(..., TCP_ULP, "mptcp", ...) blocked  [ OK ]
INFO: validating network environment with pings
INFO: Using loss of 0.67% delay 287 ms on ns3eth4
ns1 MPTCP -> ns1 (10.0.1.1:10000      ) MPTCP   (duration   397ms) [ OK ]
ns1 MPTCP -> ns1 (10.0.1.1:10001      ) TCP     (duration    66ms) [ OK ]
ns1 TCP   -> ns1 (10.0.1.1:10002      ) MPTCP   (duration    66ms) [ OK ]
ns1 MPTCP -> ns1 (dead:beef:1::1:10003) MPTCP   (duration    60ms) [ OK ]
ns1 MPTCP -> ns1 (dead:beef:1::1:10004) TCP     (duration    64ms) [ OK ]
ns1 TCP   -> ns1 (dead:beef:1::1:10005) MPTCP   (duration    65ms) [ OK ]
ns1 MPTCP -> ns2 (10.0.1.2:10006      ) MPTCP   (duration    80ms) [ OK ]
ns1 MPTCP -> ns2 (dead:beef:1::2:10007) MPTCP   (duration   747ms) [ OK ]
ns1 MPTCP -> ns2 (10.0.2.1:10008      ) MPTCP   (duration    75ms) [ OK ]
ns1 MPTCP -> ns2 (dead:beef:2::1:10009) MPTCP   (duration    89ms) [ OK ]
ns1 MPTCP -> ns3 (10.0.2.2:10010      ) MPTCP   (duration    99ms) [ OK ]
ns1 MPTCP -> ns3 (dead:beef:2::2:10011) MPTCP   (duration    94ms) [ OK ]
ns1 MPTCP -> ns3 (10.0.3.2:10012      ) MPTCP   (duration    90ms) [ OK ]
ns1 MPTCP -> ns3 (dead:beef:3::2:10013) MPTCP   (duration    99ms) [ OK ]
ns1 MPTCP -> ns4 (10.0.3.1:10014      ) MPTCP   (duration 37475ms) [ OK ]
ns1 MPTCP -> ns4 (dead:beef:3::1:10015) MPTCP   (duration 22472ms) [ OK ]
ns2 MPTCP -> ns1 (10.0.1.1:10016      ) MPTCP   (duration    74ms) [ OK ]
ns2 MPTCP -> ns1 (dead:beef:1::1:10017) MPTCP   (duration  5204ms) [ OK ]
ns2 MPTCP -> ns3 (10.0.2.2:10018      ) MPTCP   (duration    77ms) [ OK ]
ns2 MPTCP -> ns3 (dead:beef:2::2:10019) MPTCP   (duration    86ms) [ OK ]
ns2 MPTCP -> ns3 (10.0.3.2:10020      ) MPTCP   (duration   468ms) [ OK ]
ns2 MPTCP -> ns3 (dead:beef:3::2:10021) MPTCP   (duration    84ms) [ OK ]
ns2 MPTCP -> ns4 (10.0.3.1:10022      ) MPTCP   (duration 51300ms) [ OK ]
ns2 MPTCP -> ns4 (dead:beef:3::1:10023) MPTCP   (duration 37310ms) [ OK ]
ns3 MPTCP -> ns1 (10.0.1.1:10024      ) MPTCP   (duration    95ms) [ OK ]
ns3 MPTCP -> ns1 (dead:beef:1::1:10025) MPTCP   (duration  1214ms) [ OK ]
ns3 MPTCP -> ns2 (10.0.1.2:10026      ) MPTCP   (duration    92ms) [ OK ]
ns3 MPTCP -> ns2 (dead:beef:1::2:10027) MPTCP   (duration    89ms) [ OK ]
ns3 MPTCP -> ns2 (10.0.2.1:10028      ) MPTCP   (duration    87ms) [ OK ]
ns3 MPTCP -> ns2 (dead:beef:2::1:10029) MPTCP   (duration    92ms) [ OK ]
ns3 MPTCP -> ns4 (10.0.3.1:10030      ) MPTCP   (duration  3785ms) [ OK ]
ns3 MPTCP -> ns4 (dead:beef:3::1:10031) MPTCP   (duration  3785ms) [ OK ]
ns4 MPTCP -> ns1 (10.0.1.1:10032      ) MPTCP   (duration  5798ms) [ OK ]
ns4 MPTCP -> ns1 (dead:beef:1::1:10033) MPTCP   (duration  9826ms) [ OK ]
ns4 MPTCP -> ns2 (10.0.1.2:10034      ) MPTCP   (duration  4070ms) [ OK ]
ns4 MPTCP -> ns2 (dead:beef:1::2:10035) MPTCP   (duration 19215ms) [ OK ]
ns4 MPTCP -> ns2 (10.0.2.1:10036      ) MPTCP   (duration 10105ms) [ OK ]
ns4 MPTCP -> ns2 (dead:beef:2::1:10037) MPTCP   (duration 12409ms) [ OK ]
ns4 MPTCP -> ns3 (10.0.2.2:10038      ) MPTCP   (duration  3502ms) [ OK ]
ns4 MPTCP -> ns3 (dead:beef:2::2:10039) MPTCP   (duration  3258ms) [ OK ]
ns4 MPTCP -> ns3 (10.0.3.2:10040      ) MPTCP   (duration  3264ms) [ OK ]
ns4 MPTCP -> ns3 (dead:beef:3::2:10041) MPTCP   (duration  3219ms) [ OK ]
Time: 251 seconds                                              

(not detected by my CI but manually launched on my VM using ./mptcp_connect.sh -r 0)

matttbe pushed a commit that referenced this issue Jun 15, 2020
When an MPTCP client tries to connect to itself, tcp_finish_connect() is
never reached. Because of this, depending on the current socket state,
multiple faulty behaviours can be observed:

1) a WARN_ON() in subflow_data_ready() is hit
 WARNING: CPU: 2 PID: 882 at net/mptcp/subflow.c:911 subflow_data_ready+0x18b/0x230
 [...]
 CPU: 2 PID: 882 Comm: gh35 Not tainted 5.7.0+ #187
 [...]
 RIP: 0010:subflow_data_ready+0x18b/0x230
 [...]
 Call Trace:
  tcp_data_queue+0xd2f/0x4250
  tcp_rcv_state_process+0xb1c/0x49d3
  tcp_v4_do_rcv+0x2bc/0x790
  __release_sock+0x153/0x2d0
  release_sock+0x4f/0x170
  mptcp_shutdown+0x167/0x4e0
  __sys_shutdown+0xe6/0x180
  __x64_sys_shutdown+0x50/0x70
  do_syscall_64+0x9a/0x370
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

2) client is stuck forever in mptcp_sendmsg() because the socket is not
   TCP_ESTABLISHED

 crash> bt 4847
 PID: 4847   TASK: ffff88814b2fb100  CPU: 1   COMMAND: "gh35"
  #0 [ffff8881376ff680] __schedule at ffffffff97248da4
  #1 [ffff8881376ff778] schedule at ffffffff9724a34f
  #2 [ffff8881376ff7a0] schedule_timeout at ffffffff97252ba0
  #3 [ffff8881376ff8a8] wait_woken at ffffffff958ab4ba
  #4 [ffff8881376ff940] sk_stream_wait_connect at ffffffff96c2d859
  #5 [ffff8881376ffa28] mptcp_sendmsg at ffffffff97207fca
  #6 [ffff8881376ffbc0] sock_sendmsg at ffffffff96be1b5b
  #7 [ffff8881376ffbe8] sock_write_iter at ffffffff96be1daa
  #8 [ffff8881376ffce8] new_sync_write at ffffffff95e5cb52
  #9 [ffff8881376ffe50] vfs_write at ffffffff95e6547f
 #10 [ffff8881376ffe90] ksys_write at ffffffff95e65d26
 #11 [ffff8881376fff28] do_syscall_64 at ffffffff956088ba
 #12 [ffff8881376fff50] entry_SYSCALL_64_after_hwframe at ffffffff9740008c
     RIP: 00007f126f6956ed  RSP: 00007ffc2a320278  RFLAGS: 00000217
     RAX: ffffffffffffffda  RBX: 0000000020000044  RCX: 00007f126f6956ed
     RDX: 0000000000000004  RSI: 00000000004007b8  RDI: 0000000000000003
     RBP: 00007ffc2a3202a0   R8: 0000000000400720   R9: 0000000000400720
     R10: 0000000000400720  R11: 0000000000000217  R12: 00000000004004b0
     R13: 00007ffc2a320380  R14: 0000000000000000  R15: 0000000000000000
     ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

3) tcpdump captures show that DSS is exchanged even when MP_CAPABLE handshake
   didn't complete.

 $ tcpdump -tnnr bad.pcap
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S], seq 3208913911, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291694721,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S.], seq 3208913911, ack 3208913912, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291706876,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 1, win 512, options [nop,nop,TS val 3291706876 ecr 3291706876], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 3291707876 ecr 3291706876,mptcp dss fin seq 0 subseq 0 len 1,nop,nop], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 2, win 512, options [nop,nop,TS val 3291707876 ecr 3291707876], length 0

Force a fallback to TCP in these cases, and adjust the main socket
state to avoid hanging in mptcp_sendmsg().

Closes: #35
Reported-by: Christoph Paasch <cpaasch@apple.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
jenkins-tessares pushed a commit that referenced this issue Jun 15, 2020
@matttbe
Member Author

matttbe commented Jun 15, 2020

I reproduced the issue with the capture option:

# ./mptcp_connect.sh -r 0 -c
INFO: set ns3-5ee79a56-X4O6gS dev ns3eth2: ethtool -K tso off gro off
INFO: set ns4-5ee79a56-X4O6gS dev ns4eth3: ethtool -K  gso off gro off
Created /tmp/tmp.3ghGaS9IRK (size 8003612       /tmp/tmp.3ghGaS9IRK) containing data sent by client
Created /tmp/tmp.dOnJLeBfMQ (size 1786908       /tmp/tmp.dOnJLeBfMQ) containing data sent by server
New MPTCP socket can be blocked via sysctl              [ OK ]
setsockopt(..., TCP_ULP, "mptcp", ...) blocked  [ OK ]
INFO: validating network environment with pings
INFO: Using loss of 0.33% delay 187 ms on ns3eth4
ns1 MPTCP -> ns1 (10.0.1.1:10000      ) MPTCP   (duration   120ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
470 packets captured
939 packets received by filter
0 packets dropped by kernel
ns1 MPTCP -> ns1 (10.0.1.1:10001      ) TCP     (duration    94ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
943 packets captured
1886 packets received by filter
0 packets dropped by kernel
ns1 TCP   -> ns1 (10.0.1.1:10002      ) MPTCP   (duration    88ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
481 packets captured
962 packets received by filter
0 packets dropped by kernel
ns1 MPTCP -> ns1 (dead:beef:1::1:10003) MPTCP   (duration    89ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
469 packets captured
946 packets received by filter
0 packets dropped by kernel
ns1 MPTCP -> ns1 (dead:beef:1::1:10004) TCP     (duration   116ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
935 packets captured
1870 packets received by filter
0 packets dropped by kernel
ns1 TCP   -> ns1 (dead:beef:1::1:10005) MPTCP   (duration    87ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
462 packets captured
922 packets received by filter
0 packets dropped by kernel
ns1 MPTCP -> ns2 (10.0.1.2:10006      ) MPTCP   (duration   163ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
3450 packets captured
3450 packets received by filter
0 packets dropped by kernel
ns1 MPTCP -> ns2 (dead:beef:1::2:10007) MPTCP   (duration   144ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
3443 packets captured
3443 packets received by filter
0 packets dropped by kernel
ns1 MPTCP -> ns2 (10.0.2.1:10008      ) MPTCP   (duration   138ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
3451 packets captured
3451 packets received by filter
0 packets dropped by kernel
ns1 MPTCP -> ns2 (dead:beef:2::1:10009) MPTCP   (duration   138ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
3365 packets captured
3365 packets received by filter
0 packets dropped by kernel
ns1 MPTCP -> ns3 (10.0.2.2:10010      ) MPTCP   (duration   215ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
6087 packets captured
6100 packets received by filter
0 packets dropped by kernel
ns1 MPTCP -> ns3 (dead:beef:2::2:10011) MPTCP   (duration   177ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
5527 packets captured
5527 packets received by filter
0 packets dropped by kernel
ns1 MPTCP -> ns3 (10.0.3.2:10012      ) MPTCP   (duration   165ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
5565 packets captured
5565 packets received by filter
0 packets dropped by kernel
ns1 MPTCP -> ns3 (dead:beef:3::2:10013) MPTCP   (duration   177ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
5587 packets captured
5615 packets received by filter
0 packets dropped by kernel
ns1 MPTCP -> ns4 (10.0.3.1:10014      ) MPTCP   (duration  2875ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
3042 packets captured
3042 packets received by filter
0 packets dropped by kernel
ns1 MPTCP -> ns4 (dead:beef:3::1:10015) MPTCP   (duration  2873ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
2846 packets captured
2905 packets received by filter
0 packets dropped by kernel
ns2 MPTCP -> ns1 (10.0.1.1:10016      ) MPTCP   (duration   149ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
3506 packets captured
3593 packets received by filter
0 packets dropped by kernel
ns2 MPTCP -> ns1 (dead:beef:1::1:10017) MPTCP   (duration   140ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
3320 packets captured
3355 packets received by filter
0 packets dropped by kernel
ns2 MPTCP -> ns3 (10.0.2.2:10018      ) MPTCP   (duration   151ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
5263 packets captured
5266 packets received by filter
0 packets dropped by kernel
ns2 MPTCP -> ns3 (dead:beef:2::2:10019) MPTCP   (duration   275ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
6426 packets captured
6426 packets received by filter
0 packets dropped by kernel
ns2 MPTCP -> ns3 (10.0.3.2:10020      ) MPTCP   (duration   164ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
5845 packets captured
5845 packets received by filter
0 packets dropped by kernel
ns2 MPTCP -> ns3 (dead:beef:3::2:10021) MPTCP   (duration   162ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
5482 packets captured
5492 packets received by filter
0 packets dropped by kernel
ns2 MPTCP -> ns4 (10.0.3.1:10022      ) MPTCP   (duration 25512ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
6570 packets captured
6571 packets received by filter
0 packets dropped by kernel
ns2 MPTCP -> ns4 (dead:beef:3::1:10023) MPTCP   (duration  2680ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
3087 packets captured
3087 packets received by filter
0 packets dropped by kernel
ns3 MPTCP -> ns1 (10.0.1.1:10024      ) MPTCP   (duration   202ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
9519 packets captured
9540 packets received by filter
0 packets dropped by kernel
ns3 MPTCP -> ns1 (dead:beef:1::1:10025) MPTCP   (duration  1136ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
9620 packets captured
9620 packets received by filter
0 packets dropped by kernel
ns3 MPTCP -> ns2 (10.0.1.2:10026      ) MPTCP   (duration   171ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
9376 packets captured
9597 packets received by filter
0 packets dropped by kernel
ns3 MPTCP -> ns2 (dead:beef:1::2:10027) MPTCP   (duration   189ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
9606 packets captured
9606 packets received by filter
0 packets dropped by kernel
ns3 MPTCP -> ns2 (10.0.2.1:10028      ) MPTCP   (duration   599ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
9569 packets captured
9598 packets received by filter
0 packets dropped by kernel
ns3 MPTCP -> ns2 (dead:beef:2::1:10029) MPTCP   (duration   172ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
9728 packets captured
9728 packets received by filter
0 packets dropped by kernel
ns3 MPTCP -> ns4 (10.0.3.1:10030      ) MPTCP   (duration  2488ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
3435 packets captured
3435 packets received by filter
0 packets dropped by kernel
ns3 MPTCP -> ns4 (dead:beef:3::1:10031) MPTCP   (duration  2311ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
3454 packets captured
3454 packets received by filter
0 packets dropped by kernel
ns4 MPTCP -> ns1 (10.0.1.1:10032      ) MPTCP   (duration  2679ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
9629 packets captured
9629 packets received by filter
0 packets dropped by kernel
ns4 MPTCP -> ns1 (dead:beef:1::1:10033) MPTCP   (duration  4174ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
10047 packets captured
10047 packets received by filter
0 packets dropped by kernel
ns4 MPTCP -> ns2 (10.0.1.2:10034      ) MPTCP   (duration  3047ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
9816 packets captured
9816 packets received by filter
0 packets dropped by kernel
ns4 MPTCP -> ns2 (dead:beef:1::2:10035) MPTCP   (duration  6600ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
10161 packets captured
10161 packets received by filter
0 packets dropped by kernel
ns4 MPTCP -> ns2 (10.0.2.1:10036      ) MPTCP   (duration  2301ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
9375 packets captured
9375 packets received by filter
0 packets dropped by kernel
ns4 MPTCP -> ns2 (dead:beef:2::1:10037) MPTCP   (duration  2711ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
9628 packets captured
9628 packets received by filter
0 packets dropped by kernel
ns4 MPTCP -> ns3 (10.0.2.2:10038      ) MPTCP   (duration  3248ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
3968 packets captured
3968 packets received by filter
0 packets dropped by kernel
ns4 MPTCP -> ns3 (dead:beef:2::2:10039) MPTCP   (duration  2141ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
3457 packets captured
3639 packets received by filter
0 packets dropped by kernel
ns4 MPTCP -> ns3 (10.0.3.2:10040      ) MPTCP   (duration  2145ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
3400 packets captured
3669 packets received by filter
0 packets dropped by kernel
ns4 MPTCP -> ns3 (dead:beef:3::2:10041) MPTCP   (duration  2136ms) [ OK ]
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
3373 packets captured
3641 packets received by filter
0 packets dropped by kernel
Time: 171 seconds

Here is the capture for the transfer which took 25 seconds (in a zip file for GitHub...):

ns2 MPTCP -> ns4 (10.0.3.1:10022 ) MPTCP (duration 25512ms) [ OK ]

ns4-5ee79a56-X4O6gS-ns2-5ee79a56-X4O6gS-MPTCP-MPTCP-10.0.3.1.zip

(please tell me if you need the other ones)

@cpaasch
Member

cpaasch commented Jun 16, 2020

I changed mptcp_connect.sh to get me a sender-side pcap. From what I see, this is just due to unfortunate packet loss (early on in the connection) and a high RTT. It does not look like an MPTCP issue but rather simply congestion control kicking in.

Are we sure that these timeouts are not also happening with regular TCP? If they only happen with MPTCP, some interaction between TCP and MPTCP is making congestion control slower to increase the congestion window.

jenkins-tessares pushed a commit that referenced this issue Jun 16, 2020
jenkins-tessares pushed a commit that referenced this issue Jun 16, 2020
jenkins-tessares pushed a commit that referenced this issue Jun 18, 2020
@matttbe
Member Author

matttbe commented Jun 18, 2020

@cpaasch indeed, you are right: I have the same issue with TCP!
Thank you for having looked at this!

# ./mptcp_connect.sh -r 0 -c -t -t
[231933.822463] IPv6: ADDRCONF(NETDEV_CHANGE): ns1eth2: link becomes ready
[231934.040915] IPv6: ADDRCONF(NETDEV_CHANGE): ns2eth3: link becomes ready
[231934.261250] IPv6: ADDRCONF(NETDEV_CHANGE): ns3eth4: link becomes ready
INFO: set ns3-5eeb212b-8upuZx dev ns3eth2: ethtool -K  gso off
[231934.510852] IPv6: ADDRCONF(NETDEV_CHANGE): ns2eth1: link becomes ready
Created /tmp/tmp.hpljXvNHWt (size 7634972       /tmp/tmp.hpljXvNHWt) containing data sent by client
Created /tmp/tmp.0tHo7yf9kV (size 1398812       /tmp/tmp.0tHo7yf9kV) containing data sent by server
New MPTCP socket can be blocked via sysctl              [ OK ]
setsockopt(..., TCP_ULP, "mptcp", ...) blocked  [ OK ]
INFO: validating network environment with pings
INFO: Using loss of 0.37% delay 356 ms on ns3eth4
ns1 MPTCP -> ns1 (10.0.1.1:10000      ) MPTCP   (duration   243ms) [ OK ]
ns1 MPTCP -> ns1 (10.0.1.1:10001      ) TCP     (duration   136ms) [ OK ]
ns1 TCP   -> ns1 (10.0.1.1:10002      ) MPTCP   (duration   124ms) [ OK ]
ns1 TCP   -> ns1 (10.0.1.1:10003      ) TCP     (duration   147ms) [ OK ]
ns1 MPTCP -> ns1 (dead:beef:1::1:10004) MPTCP   (duration   133ms) [ OK ]
ns1 MPTCP -> ns1 (dead:beef:1::1:10005) TCP     (duration   140ms) [ OK ]
ns1 TCP   -> ns1 (dead:beef:1::1:10006) MPTCP   (duration   119ms) [ OK ]
ns1 TCP   -> ns1 (dead:beef:1::1:10007) TCP     (duration   146ms) [ OK ]
ns1 MPTCP -> ns2 (10.0.1.2:10008      ) MPTCP   (duration   330ms) [ OK ]
ns1 MPTCP -> ns2 (10.0.1.2:10009      ) TCP     (duration   267ms) [ OK ]
ns1 TCP   -> ns2 (10.0.1.2:10010      ) MPTCP   (duration   241ms) [ OK ]
ns1 TCP   -> ns2 (10.0.1.2:10011      ) TCP     (duration   245ms) [ OK ]
ns1 MPTCP -> ns2 (dead:beef:1::2:10012) MPTCP   (duration   251ms) [ OK ]
ns1 MPTCP -> ns2 (dead:beef:1::2:10013) TCP     (duration   249ms) [ OK ]
ns1 TCP   -> ns2 (dead:beef:1::2:10014) MPTCP   (duration   255ms) [ OK ]
ns1 TCP   -> ns2 (dead:beef:1::2:10015) TCP     (duration   251ms) [ OK ]
ns1 MPTCP -> ns2 (10.0.2.1:10016      ) MPTCP   (duration   244ms) [ OK ]
ns1 MPTCP -> ns2 (10.0.2.1:10017      ) TCP     (duration   199ms) [ OK ]
ns1 TCP   -> ns2 (10.0.2.1:10018      ) MPTCP   (duration   236ms) [ OK ]
ns1 TCP   -> ns2 (10.0.2.1:10019      ) TCP     (duration   224ms) [ OK ]
ns1 MPTCP -> ns2 (dead:beef:2::1:10020) MPTCP   (duration   523ms) [ OK ]
ns1 MPTCP -> ns2 (dead:beef:2::1:10021) TCP     (duration   243ms) [ OK ]
ns1 TCP   -> ns2 (dead:beef:2::1:10022) MPTCP   (duration   244ms) [ OK ]
ns1 TCP   -> ns2 (dead:beef:2::1:10023) TCP     (duration   287ms) [ OK ]
ns1 MPTCP -> ns3 (10.0.2.2:10024      ) MPTCP   (duration   489ms) [ OK ]
ns1 MPTCP -> ns3 (10.0.2.2:10025      ) TCP     (duration   437ms) [ OK ]
ns1 TCP   -> ns3 (10.0.2.2:10026      ) MPTCP   (duration   317ms) [ OK ]
ns1 TCP   -> ns3 (10.0.2.2:10027      ) TCP     (duration   171ms) [ OK ]
ns1 MPTCP -> ns3 (dead:beef:2::2:10028) MPTCP   (duration   323ms) [ OK ]
ns1 MPTCP -> ns3 (dead:beef:2::2:10029) TCP     (duration   331ms) [ OK ]
ns1 TCP   -> ns3 (dead:beef:2::2:10030) MPTCP   (duration   206ms) [ OK ]
ns1 TCP   -> ns3 (dead:beef:2::2:10031) TCP     (duration   294ms) [ OK ]
ns1 MPTCP -> ns3 (10.0.3.2:10032      ) MPTCP   (duration   397ms) [ OK ]
ns1 MPTCP -> ns3 (10.0.3.2:10033      ) TCP     (duration   295ms) [ OK ]
ns1 TCP   -> ns3 (10.0.3.2:10034      ) MPTCP   (duration   294ms) [ OK ]
ns1 TCP   -> ns3 (10.0.3.2:10035      ) TCP     (duration   247ms) [ OK ]
ns1 MPTCP -> ns3 (dead:beef:3::2:10036) MPTCP   (duration   308ms) [ OK ]
ns1 MPTCP -> ns3 (dead:beef:3::2:10037) TCP     (duration   389ms) [ OK ]
ns1 TCP   -> ns3 (dead:beef:3::2:10038) MPTCP   (duration   241ms) [ OK ]
ns1 TCP   -> ns3 (dead:beef:3::2:10039) TCP     (duration   237ms) [ OK ]
ns1 MPTCP -> ns4 (10.0.3.1:10040      ) MPTCP   (duration 39996ms) [ OK ]
ns1 MPTCP -> ns4 (10.0.3.1:10041      ) TCP     (duration  6135ms) [ OK ]
ns1 TCP   -> ns4 (10.0.3.1:10042      ) MPTCP   (duration 13592ms) [ OK ]
ns1 TCP   -> ns4 (10.0.3.1:10043      ) TCP     (duration 13625ms) [ OK ]
ns1 MPTCP -> ns4 (dead:beef:3::1:10044) MPTCP   (duration 27709ms) [ OK ]
ns1 MPTCP -> ns4 (dead:beef:3::1:10045) TCP     (duration 37940ms) [ OK ]
ns1 TCP   -> ns4 (dead:beef:3::1:10046) MPTCP   (duration 15018ms) [ OK ]
ns1 TCP   -> ns4 (dead:beef:3::1:10047) TCP     (duration 26065ms) [ OK ]
ns2 MPTCP -> ns1 (10.0.1.1:10048      ) MPTCP   (duration   208ms) [ OK ]
ns2 MPTCP -> ns1 (10.0.1.1:10049      ) TCP     (duration   263ms) [ OK ]
ns2 TCP   -> ns1 (10.0.1.1:10050      ) MPTCP   (duration   228ms) [ OK ]
ns2 TCP   -> ns1 (10.0.1.1:10051      ) TCP     (duration   250ms) [ OK ]
ns2 MPTCP -> ns1 (dead:beef:1::1:10052) MPTCP   (duration   218ms) [ OK ]
ns2 MPTCP -> ns1 (dead:beef:1::1:10053) TCP     (duration   229ms) [ OK ]
ns2 TCP   -> ns1 (dead:beef:1::1:10054) MPTCP   (duration   255ms) [ OK ]
ns2 TCP   -> ns1 (dead:beef:1::1:10055) TCP     (duration   249ms) [ OK ]
ns2 MPTCP -> ns3 (10.0.2.2:10056      ) MPTCP   (duration   277ms) [ OK ]
ns2 MPTCP -> ns3 (10.0.2.2:10057      ) TCP     (duration   208ms) [ OK ]
ns2 TCP   -> ns3 (10.0.2.2:10058      ) MPTCP   (duration   247ms) [ OK ]
ns2 TCP   -> ns3 (10.0.2.2:10059      ) TCP     (duration   246ms) [ OK ]
ns2 MPTCP -> ns3 (dead:beef:2::2:10060) MPTCP   (duration   253ms) [ OK ]
ns2 MPTCP -> ns3 (dead:beef:2::2:10061) TCP     (duration   305ms) [ OK ]
ns2 TCP   -> ns3 (dead:beef:2::2:10062) MPTCP   (duration   263ms) [ OK ]
ns2 TCP   -> ns3 (dead:beef:2::2:10063) TCP     (duration   259ms) [ OK ]
ns2 MPTCP -> ns3 (10.0.3.2:10064      ) MPTCP   (duration   244ms) [ OK ]
ns2 MPTCP -> ns3 (10.0.3.2:10065      ) TCP     (duration   247ms) [ OK ]
ns2 TCP   -> ns3 (10.0.3.2:10066      ) MPTCP   (duration   370ms) [ OK ]
ns2 TCP   -> ns3 (10.0.3.2:10067      ) TCP     (duration   254ms) [ OK ]
ns2 MPTCP -> ns3 (dead:beef:3::2:10068) MPTCP   (duration   260ms) [ OK ]
ns2 MPTCP -> ns3 (dead:beef:3::2:10069) TCP     (duration   225ms) [ OK ]
ns2 TCP   -> ns3 (dead:beef:3::2:10070) MPTCP   (duration   276ms) [ OK ]
ns2 TCP   -> ns3 (dead:beef:3::2:10071) TCP     (duration   225ms) [ OK ]
ns2 MPTCP -> ns4 (10.0.3.1:10072      ) MPTCP   (duration 21828ms) [ OK ]
ns2 MPTCP -> ns4 (10.0.3.1:10073      ) TCP     (duration 40111ms) [ OK ]
ns2 TCP   -> ns4 (10.0.3.1:10074      ) MPTCP   (duration 20722ms) [ OK ]
ns2 TCP   -> ns4 (10.0.3.1:10075      ) TCP     (duration 23566ms) [ OK ]
ns2 MPTCP -> ns4 (dead:beef:3::1:10076) MPTCP   (duration 21086ms) [ OK ]
ns2 MPTCP -> ns4 (dead:beef:3::1:10077) TCP     (duration 37818ms) [ OK ]
ns2 TCP   -> ns4 (dead:beef:3::1:10078) MPTCP   (duration 15744ms) [ OK ]
ns2 TCP   -> ns4 (dead:beef:3::1:10079) TCP     (duration  5406ms) [ OK ]
ns3 MPTCP -> ns1 (10.0.1.1:10080      ) MPTCP   (duration   289ms) [ OK ]
ns3 MPTCP -> ns1 (10.0.1.1:10081      ) TCP     (duration   296ms) [ OK ]
ns3 TCP   -> ns1 (10.0.1.1:10082      ) MPTCP   (duration   275ms) [ OK ]
ns3 TCP   -> ns1 (10.0.1.1:10083      ) TCP     (duration   227ms) [ OK ]
ns3 MPTCP -> ns1 (dead:beef:1::1:10084) MPTCP   (duration   302ms) [ OK ]
ns3 MPTCP -> ns1 (dead:beef:1::1:10085) TCP     (duration   275ms) [ OK ]
ns3 TCP   -> ns1 (dead:beef:1::1:10086) MPTCP   (duration   199ms) [ OK ]
ns3 TCP   -> ns1 (dead:beef:1::1:10087) TCP     (duration   265ms) [ OK ]
ns3 MPTCP -> ns2 (10.0.1.2:10088      ) MPTCP   (duration   401ms) [ OK ]
ns3 MPTCP -> ns2 (10.0.1.2:10089      ) TCP     (duration   241ms) [ OK ]
ns3 TCP   -> ns2 (10.0.1.2:10090      ) MPTCP   (duration   199ms) [ OK ]
ns3 TCP   -> ns2 (10.0.1.2:10091      ) TCP     (duration   250ms) [ OK ]
ns3 MPTCP -> ns2 (dead:beef:1::2:10092) MPTCP   (duration   322ms) [ OK ]
ns3 MPTCP -> ns2 (dead:beef:1::2:10093) TCP     (duration   262ms) [ OK ]
ns3 TCP   -> ns2 (dead:beef:1::2:10094) MPTCP   (duration   230ms) [ OK ]
ns3 TCP   -> ns2 (dead:beef:1::2:10095) TCP     (duration   245ms) [ OK ]
ns3 MPTCP -> ns2 (10.0.2.1:10096      ) MPTCP   (duration   324ms) [ OK ]
ns3 MPTCP -> ns2 (10.0.2.1:10097      ) TCP     (duration   348ms) [ OK ]
ns3 TCP   -> ns2 (10.0.2.1:10098      ) MPTCP   (duration   183ms) [ OK ]
ns3 TCP   -> ns2 (10.0.2.1:10099      ) TCP     (duration   208ms) [ OK ]
ns3 MPTCP -> ns2 (dead:beef:2::1:10100) MPTCP   (duration   341ms) [ OK ]
ns3 MPTCP -> ns2 (dead:beef:2::1:10101) TCP     (duration   254ms) [ OK ]
ns3 TCP   -> ns2 (dead:beef:2::1:10102) MPTCP   (duration   253ms) [ OK ]
ns3 TCP   -> ns2 (dead:beef:2::1:10103) TCP     (duration   248ms) [ OK ]
ns3 MPTCP -> ns4 (10.0.3.1:10104      ) MPTCP   (duration  4345ms) [ OK ]
ns3 MPTCP -> ns4 (10.0.3.1:10105      ) TCP     (duration  4345ms) [ OK ]
ns3 TCP   -> ns4 (10.0.3.1:10106      ) MPTCP   (duration  4689ms) [ OK ]
ns3 TCP   -> ns4 (10.0.3.1:10107      ) TCP     (duration  4352ms) [ OK ]
ns3 MPTCP -> ns4 (dead:beef:3::1:10108) MPTCP   (duration  4364ms) [ OK ]
ns3 MPTCP -> ns4 (dead:beef:3::1:10109) TCP     (duration  4342ms) [ OK ]
ns3 TCP   -> ns4 (dead:beef:3::1:10110) MPTCP   (duration  4338ms) [ OK ]
ns3 TCP   -> ns4 (dead:beef:3::1:10111) TCP     (duration  4344ms) [ OK ]
ns4 MPTCP -> ns1 (10.0.1.1:10112      ) MPTCP   (duration  4003ms) [ OK ]
ns4 MPTCP -> ns1 (10.0.1.1:10113      ) TCP     (duration  9310ms) [ OK ]
ns4 TCP   -> ns1 (10.0.1.1:10114      ) MPTCP   (duration  8597ms) [ OK ]
ns4 TCP   -> ns1 (10.0.1.1:10115      ) TCP     (duration 10743ms) [ OK ]
ns4 MPTCP -> ns1 (dead:beef:1::1:10116) MPTCP   (duration  4000ms) [ OK ]
ns4 MPTCP -> ns1 (dead:beef:1::1:10117) TCP     (duration  3976ms) [ OK ]
ns4 TCP   -> ns1 (dead:beef:1::1:10118) MPTCP   (duration 11459ms) [ OK ]
ns4 TCP   -> ns1 (dead:beef:1::1:10119) TCP     (duration 11097ms) [ OK ]
ns4 MPTCP -> ns2 (10.0.1.2:10120      ) MPTCP   (duration  4681ms) [ OK ]
ns4 MPTCP -> ns2 (10.0.1.2:10121      ) TCP     (duration  4009ms) [ OK ]
ns4 TCP   -> ns2 (10.0.1.2:10122      ) MPTCP   (duration  3996ms) [ OK ]
ns4 TCP   -> ns2 (10.0.1.2:10123      ) TCP     (duration  4330ms) [ OK ]
ns4 MPTCP -> ns2 (dead:beef:1::2:10124) MPTCP   (duration  3988ms) [ OK ]
ns4 MPTCP -> ns2 (dead:beef:1::2:10125) TCP     (duration  4037ms) [ OK ]
ns4 TCP   -> ns2 (dead:beef:1::2:10126) MPTCP   (duration  3997ms) [ OK ]
ns4 TCP   -> ns2 (dead:beef:1::2:10127) TCP     (duration  3650ms) [ OK ]
ns4 MPTCP -> ns2 (10.0.2.1:10128      ) MPTCP   (duration  4014ms) [ OK ]
ns4 MPTCP -> ns2 (10.0.2.1:10129      ) TCP     (duration  3992ms) [ OK ]
ns4 TCP   -> ns2 (10.0.2.1:10130      ) MPTCP   (duration  3993ms) [ OK ]
ns4 TCP   -> ns2 (10.0.2.1:10131      ) TCP     (duration  3989ms) [ OK ]
ns4 MPTCP -> ns2 (dead:beef:2::1:10132) MPTCP   (duration 18298ms) [ OK ]
ns4 MPTCP -> ns2 (dead:beef:2::1:10133) TCP     (duration  3995ms) [ OK ]
ns4 TCP   -> ns2 (dead:beef:2::1:10134) MPTCP   (duration  7228ms) [ OK ]
ns4 TCP   -> ns2 (dead:beef:2::1:10135) TCP     (duration  3658ms) [ OK ]
ns4 MPTCP -> ns3 (10.0.2.2:10136      ) MPTCP   (duration  4017ms) [ OK ]
ns4 MPTCP -> ns3 (10.0.2.2:10137      ) TCP     (duration  3651ms) [ OK ]
ns4 TCP   -> ns3 (10.0.2.2:10138      ) MPTCP   (duration  3976ms) [ OK ]
ns4 TCP   -> ns3 (10.0.2.2:10139      ) TCP     (duration  3655ms) [ OK ]
ns4 MPTCP -> ns3 (dead:beef:2::2:10140) MPTCP   (duration  3643ms) [ OK ]
ns4 MPTCP -> ns3 (dead:beef:2::2:10141) TCP     (duration  3984ms) [ OK ]
ns4 TCP   -> ns3 (dead:beef:2::2:10142) MPTCP   (duration  3631ms) [ OK ]
ns4 TCP   -> ns3 (dead:beef:2::2:10143) TCP     (duration  3986ms) [ OK ]
ns4 MPTCP -> ns3 (10.0.3.2:10144      ) MPTCP   (duration  4332ms) [ OK ]
ns4 MPTCP -> ns3 (10.0.3.2:10145      ) TCP     (duration  3968ms) [ OK ]
ns4 TCP   -> ns3 (10.0.3.2:10146      ) MPTCP   (duration  3973ms) [ OK ]
ns4 TCP   -> ns3 (10.0.3.2:10147      ) TCP     (duration  3634ms) [ OK ]
ns4 MPTCP -> ns3 (dead:beef:3::2:10148) MPTCP   (duration  3667ms) [ OK ]
ns4 MPTCP -> ns3 (dead:beef:3::2:10149) TCP     (duration  3981ms) [ OK ]
ns4 TCP   -> ns3 (dead:beef:3::2:10150) MPTCP   (duration  3629ms) [ OK ]
ns4 TCP   -> ns3 (dead:beef:3::2:10151) TCP     (duration  3673ms) [ OK ]
Time: 967 seconds

That's quite visible for the transfer between ns2 and ns4 in both v4 and v6:

ns2 MPTCP -> ns4 (10.0.3.1:10072      ) MPTCP   (duration 21828ms) [ OK ]
ns2 MPTCP -> ns4 (10.0.3.1:10073      ) TCP     (duration 40111ms) [ OK ]
ns2 TCP   -> ns4 (10.0.3.1:10074      ) MPTCP   (duration 20722ms) [ OK ]
ns2 TCP   -> ns4 (10.0.3.1:10075      ) TCP     (duration 23566ms) [ OK ]
ns2 MPTCP -> ns4 (dead:beef:3::1:10076) MPTCP   (duration 21086ms) [ OK ]
ns2 MPTCP -> ns4 (dead:beef:3::1:10077) TCP     (duration 37818ms) [ OK ]
ns2 TCP   -> ns4 (dead:beef:3::1:10078) MPTCP   (duration 15744ms) [ OK ]
ns2 TCP   -> ns4 (dead:beef:3::1:10079) TCP     (duration  5406ms) [ OK ]

The delay is high (using a loss of 0.37% and a delay of 356 ms on ns3eth4), but why do we only see this when no reordering is added with tc?

Anyway, it looks like it is not linked to MPTCP. Should we close this issue? Or, before that, should we avoid test configurations without re-ordering? Should we report this somewhere?
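
To make the comparison concrete, here is a minimal netem sketch of the two setups being compared (illustrative commands only, not the exact ones the selftest script generates; the interface name and values are taken from the log above):

  # loss + delay, no reordering: every packet is held for the full delay
  tc qdisc replace dev ns3eth4 root netem delay 356ms loss random 0.37%

  # same delay and loss, but with netem reordering: 25% of the packets skip
  # the delay and are sent immediately, which reorders the stream
  tc qdisc replace dev ns3eth4 root netem delay 356ms reorder 25% 50% loss random 0.37%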

@cpaasch
Member

cpaasch commented Jun 18, 2020

From the pcap, it looks like it is simply congestion control kicking in. Thus, no concern regarding TCP.

I don't know why this is not happening when reordering is enabled. Is the file-size in that case the same?

jenkins-tessares pushed a commit that referenced this issue Jun 19, 2020
when an MPTCP client tries to connect to itself, tcp_finish_connect() is
never reached. Because of this, depending on the socket current state,
multiple faulty behaviours can be observed:

1) a WARN_ON() in subflow_data_ready() is hit
 WARNING: CPU: 2 PID: 882 at net/mptcp/subflow.c:911 subflow_data_ready+0x18b/0x230
 [...]
 CPU: 2 PID: 882 Comm: gh35 Not tainted 5.7.0+ #187
 [...]
 RIP: 0010:subflow_data_ready+0x18b/0x230
 [...]
 Call Trace:
  tcp_data_queue+0xd2f/0x4250
  tcp_rcv_state_process+0xb1c/0x49d3
  tcp_v4_do_rcv+0x2bc/0x790
  __release_sock+0x153/0x2d0
  release_sock+0x4f/0x170
  mptcp_shutdown+0x167/0x4e0
  __sys_shutdown+0xe6/0x180
  __x64_sys_shutdown+0x50/0x70
  do_syscall_64+0x9a/0x370
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

2) client is stuck forever in mptcp_sendmsg() because the socket is not
   TCP_ESTABLISHED

 crash> bt 4847
 PID: 4847   TASK: ffff88814b2fb100  CPU: 1   COMMAND: "gh35"
  #0 [ffff8881376ff680] __schedule at ffffffff97248da4
  #1 [ffff8881376ff778] schedule at ffffffff9724a34f
  #2 [ffff8881376ff7a0] schedule_timeout at ffffffff97252ba0
  #3 [ffff8881376ff8a8] wait_woken at ffffffff958ab4ba
  #4 [ffff8881376ff940] sk_stream_wait_connect at ffffffff96c2d859
  #5 [ffff8881376ffa28] mptcp_sendmsg at ffffffff97207fca
  #6 [ffff8881376ffbc0] sock_sendmsg at ffffffff96be1b5b
  #7 [ffff8881376ffbe8] sock_write_iter at ffffffff96be1daa
  #8 [ffff8881376ffce8] new_sync_write at ffffffff95e5cb52
  #9 [ffff8881376ffe50] vfs_write at ffffffff95e6547f
 #10 [ffff8881376ffe90] ksys_write at ffffffff95e65d26
 #11 [ffff8881376fff28] do_syscall_64 at ffffffff956088ba
 #12 [ffff8881376fff50] entry_SYSCALL_64_after_hwframe at ffffffff9740008c
     RIP: 00007f126f6956ed  RSP: 00007ffc2a320278  RFLAGS: 00000217
     RAX: ffffffffffffffda  RBX: 0000000020000044  RCX: 00007f126f6956ed
     RDX: 0000000000000004  RSI: 00000000004007b8  RDI: 0000000000000003
     RBP: 00007ffc2a3202a0   R8: 0000000000400720   R9: 0000000000400720
     R10: 0000000000400720  R11: 0000000000000217  R12: 00000000004004b0
     R13: 00007ffc2a320380  R14: 0000000000000000  R15: 0000000000000000
     ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

3) tcpdump captures show that DSS is exchanged even when MP_CAPABLE handshake
   didn't complete.

 $ tcpdump -tnnr bad.pcap
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S], seq 3208913911, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291694721,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S.], seq 3208913911, ack 3208913912, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291706876,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 1, win 512, options [nop,nop,TS val 3291706876 ecr 3291706876], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 3291707876 ecr 3291706876,mptcp dss fin seq 0 subseq 0 len 1,nop,nop], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 2, win 512, options [nop,nop,TS val 3291707876 ecr 3291707876], length 0

force a fallback to TCP in these cases, and adjust the main socket
state to avoid hanging in mptcp_sendmsg().

Closes: #35
Reported-by: Christoph Paasch <cpaasch@apple.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
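
For illustration, a self-connect like the one in the traces above can be provoked by shrinking the ephemeral port range to a single port (sysctl -w net.ipv4.ip_local_port_range="20000 20000"), so that connect() to that same port on 127.0.0.1 turns into a TCP simultaneous open towards itself. A hedged sketch, not the actual reproducer from the report:

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <stdio.h>
  #include <sys/socket.h>
  #include <unistd.h>

  #ifndef IPPROTO_MPTCP
  #define IPPROTO_MPTCP 262
  #endif

  int main(void)
  {
          struct sockaddr_in addr = {
                  .sin_family = AF_INET,
                  .sin_port = htons(20000),
          };
          int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

          if (fd < 0)
                  return 1;
          inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
          if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)))
                  perror("connect");
          /* before the fix above, shutdown()/send() on such a self-connected
           * socket could trip the WARN_ON() or hang in mptcp_sendmsg() */
          shutdown(fd, SHUT_WR);
          close(fd);
          return 0;
  }
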
matttbe pushed a commit that referenced this issue Sep 11, 2024
Daniel Machon says:

====================
net: lan966x: use the newly introduced FDMA library

This patch series is the second of a 2-part series [1], that adds a new
common FDMA library for Microchip switch chips Sparx5 and lan966x. These
chips share the same FDMA engine, and as such will benefit from a common
library with a common implementation.  This also has the benefit of
removing a lot of open-coded bookkeeping and duplicate code for the two
drivers.

In this second series, the FDMA library will be taken into use by the
lan966x switch driver.

 ###################
 # Example of use: #
 ###################

- Initialize the rx and tx fdma structs with values for: number of
  DCB's, number of DB's, channel ID, DB size (data buffer size), and
  total size of the requested memory. Also provide two callbacks:
  nextptr_cb() and dataptr_cb() for getting the nextptr and dataptr.

- Allocate memory using fdma_alloc_phys() or fdma_alloc_coherent().

- Initialize the DCB's with fdma_dcb_init().

- Add new DCB's with fdma_dcb_add().

- Free memory with fdma_free_phys() or fdma_free_coherent().

 #####################
 # Patch  breakdown: #
 #####################

Patch #1:  select FDMA library for lan966x.

Patch #2:  includes the fdma_api.h header and removes old symbols.

Patch #3:  replaces old rx and tx variables with equivalent ones from the
           fdma struct. Only the variables that can be changed without
           breaking traffic are changed in this patch.

Patch #4:  uses the library for allocation of rx buffers. This requires
           quite a bit of refactoring in this single patch.

Patch #5:  uses the library for adding DCB's in the rx path.

Patch #6:  uses the library for freeing rx buffers.

Patch #7:  uses the library for allocation of tx buffers. This requires
           quite a bit of refactoring in this single patch.

Patch #8:  uses the library for adding DCB's in the tx path.

Patch #9:  uses the library helpers in the tx path.

Patch #10: ditch last_in_use variable and use library instead.

Patch #11: uses library helpers throughout.

Patch #12: refactor lan966x_fdma_reload() function.

[1] https://lore.kernel.org/netdev/20240902-fdma-sparx5-v1-0-1e7d5e5a9f34@microchip.com/

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
====================

Link: https://patch.msgid.link/20240905-fdma-lan966x-v1-0-e083f8620165@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
PlaidCat added a commit to ctrliq/kernel-src-tree that referenced this issue Sep 12, 2024
jira LE-1907
Rebuild_History Non-Buildable kernel-4.18.0-294.el8
commit-author Christoph Paasch <cpaasch@apple.com>
commit e548465

The delay was intended to be configured to "simulate" a high(er) BDP
link. As such, it needs to be set as part of the loss-configuration and
not as part of the netem reordering configuration.

The reordering config also requires a delay, but that delay is the
reordering extent. So, a good approach is to set the reordering extent
as a function of the configured latency, e.g. 25% of the overall
latency.

To speed up the selftests, we limit the delay to 50ms maximum to avoid
having the selftests run for too long.

Finally, the intention of tc_reorder was that when it is unset, the test
picks a random configuration. However, currently it is always initialized
and thus the random config won't be picked up.

Closes: multipath-tcp/mptcp_net-next#6
Reported-and-reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
	Signed-off-by: Christoph Paasch <cpaasch@apple.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit e548465)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>
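
A rough shell sketch of the approach described above (illustrative only; the variable names and percentages are assumptions, not the actual selftest code):

  # the delay belongs to the loss configuration, capped at 50ms
  delay=$((RANDOM % 50))
  tc qdisc replace dev "$dev" root netem delay ${delay}ms loss random 0.37%

  # when reordering is requested instead, derive the reordering extent
  # from the configured latency, e.g. 25% of it
  reorder_delay=$((delay / 4))
  tc qdisc replace dev "$dev" root netem delay ${reorder_delay}ms reorder 25% 50%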
matttbe pushed a commit that referenced this issue Sep 16, 2024
Ido Schimmel says:

====================
net: fib_rules: Add DSCP selector support

Currently, the kernel rejects IPv4 FIB rules that try to match on the
upper three DSCP bits:

 # ip -4 rule add tos 0x1c table 100
 # ip -4 rule add tos 0x3c table 100
 Error: Invalid tos.

The reason for that is that historically users of the FIB lookup API
only populated the lower three DSCP bits in the TOS field of the IPv4
flow key ('flowi4_tos'), which fits the TOS definition from the initial
IPv4 specification (RFC 791).

This is not very useful nowadays and instead some users want to be able
to match on the six-bit DSCP field, which replaced the TOS and IP
precedence fields over 25 years ago (RFC 2474). In addition, the current
behavior differs between IPv4 and IPv6, which does allow users to match
on the entire DSCP field using the TOS selector.

Recent patchsets made sure that callers of the FIB lookup API now
populate the entire DSCP field in the IPv4 flow key. Therefore, it is
now possible to extend FIB rules to match on DSCP.

This is done by adding a new DSCP attribute which is implemented for
both IPv4 and IPv6 to provide user space programs a consistent behavior
between both address families.

The behavior of the old TOS selector is unchanged and IPv4 FIB rules
using it will only match on the lower three DSCP bits. The kernel will
reject rules that try to use both selectors.

Patch #1 adds the new DSCP attribute but rejects its usage.

Patches #2-#3 implement IPv4 and IPv6 support.

Patch #4 allows user space to use the new attribute.

Patches #5-#6 add selftests.
====================

Link: https://patch.msgid.link/20240911093748.3662015-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
matttbe pushed a commit that referenced this issue Oct 3, 2024
Syzkaller reported a lockdep splat:

============================================
WARNING: possible recursive locking detected
6.11.0-rc6-syzkaller-00019-g67784a74e258 #0 Not tainted
--------------------------------------------
syz-executor364/5113 is trying to acquire lock:
ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

but task is already holding lock:
ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(k-slock-AF_INET);
  lock(k-slock-AF_INET);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

7 locks held by syz-executor364/5113:
 #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
 #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0x153/0x1b10 net/mptcp/protocol.c:1806
 #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
 #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg_fastopen+0x11f/0x530 net/mptcp/protocol.c:1727
 #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
 #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
 #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x5f/0x1b80 net/ipv4/ip_output.c:470
 #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
 #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
 #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x45f/0x1390 net/ipv4/ip_output.c:228
 #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: local_lock_acquire include/linux/local_lock_internal.h:29 [inline]
 #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: process_backlog+0x33b/0x15b0 net/core/dev.c:6104
 #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
 #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
 #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0x230/0x5f0 net/ipv4/ip_input.c:232
 #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
 #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

stack backtrace:
CPU: 0 UID: 0 PID: 5113 Comm: syz-executor364 Not tainted 6.11.0-rc6-syzkaller-00019-g67784a74e258 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:93 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119
 check_deadlock kernel/locking/lockdep.c:3061 [inline]
 validate_chain+0x15d3/0x5900 kernel/locking/lockdep.c:3855
 __lock_acquire+0x137a/0x2040 kernel/locking/lockdep.c:5142
 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5759
 __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
 _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
 spin_lock include/linux/spinlock.h:351 [inline]
 sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328
 mptcp_sk_clone_init+0x32/0x13c0 net/mptcp/protocol.c:3279
 subflow_syn_recv_sock+0x931/0x1920 net/mptcp/subflow.c:874
 tcp_check_req+0xfe4/0x1a20 net/ipv4/tcp_minisocks.c:853
 tcp_v4_rcv+0x1c3e/0x37f0 net/ipv4/tcp_ipv4.c:2267
 ip_protocol_deliver_rcu+0x22e/0x440 net/ipv4/ip_input.c:205
 ip_local_deliver_finish+0x341/0x5f0 net/ipv4/ip_input.c:233
 NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
 NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
 __netif_receive_skb_one_core net/core/dev.c:5661 [inline]
 __netif_receive_skb+0x2bf/0x650 net/core/dev.c:5775
 process_backlog+0x662/0x15b0 net/core/dev.c:6108
 __napi_poll+0xcb/0x490 net/core/dev.c:6772
 napi_poll net/core/dev.c:6841 [inline]
 net_rx_action+0x89b/0x1240 net/core/dev.c:6963
 handle_softirqs+0x2c4/0x970 kernel/softirq.c:554
 do_softirq+0x11b/0x1e0 kernel/softirq.c:455
 </IRQ>
 <TASK>
 __local_bh_enable_ip+0x1bb/0x200 kernel/softirq.c:382
 local_bh_enable include/linux/bottom_half.h:33 [inline]
 rcu_read_unlock_bh include/linux/rcupdate.h:908 [inline]
 __dev_queue_xmit+0x1763/0x3e90 net/core/dev.c:4450
 dev_queue_xmit include/linux/netdevice.h:3105 [inline]
 neigh_hh_output include/net/neighbour.h:526 [inline]
 neigh_output include/net/neighbour.h:540 [inline]
 ip_finish_output2+0xd41/0x1390 net/ipv4/ip_output.c:235
 ip_local_out net/ipv4/ip_output.c:129 [inline]
 __ip_queue_xmit+0x118c/0x1b80 net/ipv4/ip_output.c:535
 __tcp_transmit_skb+0x2544/0x3b30 net/ipv4/tcp_output.c:1466
 tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:6542 [inline]
 tcp_rcv_state_process+0x2c32/0x4570 net/ipv4/tcp_input.c:6729
 tcp_v4_do_rcv+0x77d/0xc70 net/ipv4/tcp_ipv4.c:1934
 sk_backlog_rcv include/net/sock.h:1111 [inline]
 __release_sock+0x214/0x350 net/core/sock.c:3004
 release_sock+0x61/0x1f0 net/core/sock.c:3558
 mptcp_sendmsg_fastopen+0x1ad/0x530 net/mptcp/protocol.c:1733
 mptcp_sendmsg+0x1884/0x1b10 net/mptcp/protocol.c:1812
 sock_sendmsg_nosec net/socket.c:730 [inline]
 __sock_sendmsg+0x1a6/0x270 net/socket.c:745
 ____sys_sendmsg+0x525/0x7d0 net/socket.c:2597
 ___sys_sendmsg net/socket.c:2651 [inline]
 __sys_sendmmsg+0x3b2/0x740 net/socket.c:2737
 __do_sys_sendmmsg net/socket.c:2766 [inline]
 __se_sys_sendmmsg net/socket.c:2763 [inline]
 __x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2763
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f04fb13a6b9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 01 1a 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffd651f42d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f04fb13a6b9
RDX: 0000000000000001 RSI: 0000000020000d00 RDI: 0000000000000004
RBP: 00007ffd651f4310 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000020000080 R11: 0000000000000246 R12: 00000000000f4240
R13: 00007f04fb187449 R14: 00007ffd651f42f4 R15: 00007ffd651f4300
 </TASK>

As noted by Cong Wang, the splat is a false positive, but the code
path leading to the report is an unexpected one: a client is
attempting an MPC handshake towards the in-kernel listener created
by the in-kernel PM for a port-based signal endpoint.

Such a connection will never be accepted; many of them can fill the
listener queue, preventing the creation of MPJ subflows via such a
listener - its intended role.

Explicitly detect this scenario at initial-syn time and drop the
incoming MPC request.

Fixes: 1729cf1 ("mptcp: create the listening socket for new port")
Reported-by: syzbot+f4aacdfef2c6a6529c3e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f4aacdfef2c6a6529c3e
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Message-ID: <833cae5982ac5d5b3236845c6db4315e634f5705.1727974826.git.pabeni@redhat.com>
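
For context, the in-kernel listener mentioned above is the one the in-kernel path manager creates when a port-based signal endpoint is configured, e.g. (illustrative address and port):

  ip mptcp endpoint add 10.0.2.1 port 10100 signal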
matttbe pushed a commit that referenced this issue Oct 4, 2024
Use a dedicated mutex to guard kvm_usage_count to fix a potential deadlock
on x86 due to a chain of locks and SRCU synchronizations.  Translating the
below lockdep splat, CPU1 #6 will wait on CPU0 #1, CPU0 #8 will wait on
CPU2 #3, and CPU2 #7 will wait on CPU1 #4 (if there's a writer, due to the
fairness of r/w semaphores).

    CPU0                     CPU1                     CPU2
1   lock(&kvm->slots_lock);
2                                                     lock(&vcpu->mutex);
3                                                     lock(&kvm->srcu);
4                            lock(cpu_hotplug_lock);
5                            lock(kvm_lock);
6                            lock(&kvm->slots_lock);
7                                                     lock(cpu_hotplug_lock);
8   sync(&kvm->srcu);

Note, there are likely more potential deadlocks in KVM x86, e.g. the same
pattern of taking cpu_hotplug_lock outside of kvm_lock likely exists with
__kvmclock_cpufreq_notifier():

  cpuhp_cpufreq_online()
  |
  -> cpufreq_online()
     |
     -> cpufreq_gov_performance_limits()
        |
        -> __cpufreq_driver_target()
           |
           -> __target_index()
              |
              -> cpufreq_freq_transition_begin()
                 |
                 -> cpufreq_notify_transition()
                    |
                    -> ... __kvmclock_cpufreq_notifier()

But, actually triggering such deadlocks is beyond rare due to the
combination of dependencies and timings involved.  E.g. the cpufreq
notifier is only used on older CPUs without a constant TSC, mucking with
the NX hugepage mitigation while VMs are running is very uncommon, and
doing so while also onlining/offlining a CPU (necessary to generate
contention on cpu_hotplug_lock) would be even more unusual.

The most robust solution to the general cpu_hotplug_lock issue is likely
to switch vm_list to be an RCU-protected list, e.g. so that x86's cpufreq
notifier doesn't need to take kvm_lock.  For now, settle for fixing the most
blatant deadlock, as switching to an RCU-protected list is a much more
involved change, but add a comment in locking.rst to call out that care
needs to be taken when holding kvm_lock and walking vm_list.

  ======================================================
  WARNING: possible circular locking dependency detected
  6.10.0-smp--c257535a0c9d-pip #330 Tainted: G S         O
  ------------------------------------------------------
  tee/35048 is trying to acquire lock:
  ff6a80eced71e0a8 (&kvm->slots_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x179/0x1e0 [kvm]

  but task is already holding lock:
  ffffffffc07abb08 (kvm_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x14a/0x1e0 [kvm]

  which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:

  -> #3 (kvm_lock){+.+.}-{3:3}:
         __mutex_lock+0x6a/0xb40
         mutex_lock_nested+0x1f/0x30
         kvm_dev_ioctl+0x4fb/0xe50 [kvm]
         __se_sys_ioctl+0x7b/0xd0
         __x64_sys_ioctl+0x21/0x30
         x64_sys_call+0x15d0/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #2 (cpu_hotplug_lock){++++}-{0:0}:
         cpus_read_lock+0x2e/0xb0
         static_key_slow_inc+0x16/0x30
         kvm_lapic_set_base+0x6a/0x1c0 [kvm]
         kvm_set_apic_base+0x8f/0xe0 [kvm]
         kvm_set_msr_common+0x9ae/0xf80 [kvm]
         vmx_set_msr+0xa54/0xbe0 [kvm_intel]
         __kvm_set_msr+0xb6/0x1a0 [kvm]
         kvm_arch_vcpu_ioctl+0xeca/0x10c0 [kvm]
         kvm_vcpu_ioctl+0x485/0x5b0 [kvm]
         __se_sys_ioctl+0x7b/0xd0
         __x64_sys_ioctl+0x21/0x30
         x64_sys_call+0x15d0/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #1 (&kvm->srcu){.+.+}-{0:0}:
         __synchronize_srcu+0x44/0x1a0
         synchronize_srcu_expedited+0x21/0x30
         kvm_swap_active_memslots+0x110/0x1c0 [kvm]
         kvm_set_memslot+0x360/0x620 [kvm]
         __kvm_set_memory_region+0x27b/0x300 [kvm]
         kvm_vm_ioctl_set_memory_region+0x43/0x60 [kvm]
         kvm_vm_ioctl+0x295/0x650 [kvm]
         __se_sys_ioctl+0x7b/0xd0
         __x64_sys_ioctl+0x21/0x30
         x64_sys_call+0x15d0/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #0 (&kvm->slots_lock){+.+.}-{3:3}:
         __lock_acquire+0x15ef/0x2e30
         lock_acquire+0xe0/0x260
         __mutex_lock+0x6a/0xb40
         mutex_lock_nested+0x1f/0x30
         set_nx_huge_pages+0x179/0x1e0 [kvm]
         param_attr_store+0x93/0x100
         module_attr_store+0x22/0x40
         sysfs_kf_write+0x81/0xb0
         kernfs_fop_write_iter+0x133/0x1d0
         vfs_write+0x28d/0x380
         ksys_write+0x70/0xe0
         __x64_sys_write+0x1f/0x30
         x64_sys_call+0x281b/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

Cc: Chao Gao <chao.gao@intel.com>
Fixes: 0bf5049 ("KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock")
Cc: stable@vger.kernel.org
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240830043600.127750-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
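
As a generic illustration of the pattern described in the commit above (a sketch with made-up names, not the real KVM symbols): the usage count gets its own narrow mutex, so bumping it no longer nests inside the broader lock that other paths combine with cpu_hotplug_lock and SRCU.

  #include <linux/mutex.h>

  static DEFINE_MUTEX(usage_lock);        /* guards only usage_count */
  static int usage_count;

  static int usage_get(void)
  {
          int r = 0;

          mutex_lock(&usage_lock);
          if (!usage_count++) {
                  r = enable_hardware();  /* hypothetical helper */
                  if (r)
                          usage_count--;
          }
          mutex_unlock(&usage_lock);
          return r;
  }

  static void usage_put(void)
  {
          mutex_lock(&usage_lock);
          if (!--usage_count)
                  disable_hardware();     /* hypothetical helper */
          mutex_unlock(&usage_lock);
  }
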
matttbe pushed a commit that referenced this issue Oct 7, 2024
Tariq Toukan says:

====================
net/mlx5: hw counters refactor

This is a patchset re-post, see:
https://lore.kernel.org/20240815054656.2210494-7-tariqt@nvidia.com

In this patchset, Cosmin refactors hw counters and solves perf scaling
issue.

Series generated against:
commit c824deb ("cxgb4: clip_tbl: Fix spelling mistake "wont" -> "won't"")

HW counters are central to mlx5 driver operations. They are hardware
objects created and used alongside most steering operations, and queried
from a variety of places. Most counters are queried in bulk from a
periodic task in fs_counters.c.

Counter performance is important and as such, a variety of improvements
have been done over the years. Currently, counters are allocated from
pools, which are bulk allocated to amortize the cost of firmware
commands. Counters are managed through an IDR, a doubly linked list and
two atomic single linked lists. Adding/removing counters is a complex
dance between user contexts requesting it and the mlx5_fc_stats_work
task which does most of the work.

Under high load (e.g. from connection tracking flow insertion/deletion),
the counter code becomes a bottleneck, as seen on flame graphs. Whenever
a counter is deleted, it gets added to a list and the wq task is
scheduled to run immediately to actually delete it. This is done via
mod_delayed_work which uses an internal spinlock. In some tests, waiting
for this spinlock took up to 66% of all samples.

This series refactors the counter code to use a more straight-forward
approach, avoiding the mod_delayed_work problem and making the code
easier to understand. For that:

- patch #1 moves counters data structs to a more appropriate place.
- patch #2 simplifies the bulk query allocation scheme by using vmalloc.
- patch #3 replaces the IDR+3 lists with an xarray. This is the main
  patch of the series, solving the spinlock congestion issue.
- patch #4 removes an unnecessary cacheline alignment causing a lot of
  memory to be wasted.
- patches #5 and #6 are small cleanups enabled by the refactoring.
====================

Link: https://patch.msgid.link/20241001103709.58127-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
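
For patch #3, the shape of the xarray-based scheme is roughly as follows (a generic sketch with assumed names, not the actual mlx5 code): insertion and removal go straight into a single structure with lock-free reads, and the periodic task simply iterates it, so no mod_delayed_work spinlock is needed on the removal path.

  #include <linux/xarray.h>

  /* sketch only: illustrative types, not the driver's real ones */
  struct counter {
          u32 id;
          u64 packets;
          u64 bytes;
  };

  static DEFINE_XARRAY_ALLOC(counter_xa);

  static int counter_add(struct counter *c)
  {
          /* one insertion point replaces the IDR plus the pending lists */
          return xa_alloc(&counter_xa, &c->id, c, xa_limit_32b, GFP_KERNEL);
  }

  static void counter_del(struct counter *c)
  {
          xa_erase(&counter_xa, c->id);
  }

  /* the periodic bulk-query task walks the xarray directly */
  static void counters_poll(void)
  {
          struct counter *c;
          unsigned long id;

          xa_for_each(&counter_xa, id, c)
                  query_hw_counter(c);    /* hypothetical firmware query */
  }
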
matttbe pushed a commit that referenced this issue Oct 7, 2024
Edward Cree says:

====================
sfc: per-queue stats

This series implements the netdev_stat_ops interface for per-queue
 statistics in the sfc driver, partly using existing counters that
 were originally added for ethtool -S output.

Changed in v4:
* remove RFC tags

Changed in v3:
* make TX stats count completions rather than enqueues
* add new patch #4 to account for XDP TX separately from netdev
  traffic and include it in base_stats
* move the tx_queue->old_* members out of the fastpath cachelines
* note on patch #6 that our hw_gso stats still count enqueues
* RFC since net-next is closed right now

Changed in v2:
* exclude (dedicated) XDP TXQ stats from per-queue TX stats
* explain patch #3 better
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
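
The interface being implemented is the netdev_stat_ops callbacks; roughly (a hedged sketch with made-up driver types, not the sfc patches themselves -- see include/net/netdev_queues.h for the authoritative definitions):

  #include <net/netdev_queues.h>

  /* sketch: a hypothetical driver exposing per-queue counters it already
   * keeps for ethtool -S */
  static void my_get_queue_stats_rx(struct net_device *dev, int idx,
                                    struct netdev_queue_stats_rx *stats)
  {
          struct my_priv *priv = netdev_priv(dev);  /* hypothetical */

          stats->packets = priv->rxq[idx].rx_packets;
          stats->bytes = priv->rxq[idx].rx_bytes;
  }

  static void my_get_base_stats(struct net_device *dev,
                                struct netdev_queue_stats_rx *rx,
                                struct netdev_queue_stats_tx *tx)
  {
          /* totals not covered by the per-queue callbacks, e.g. traffic on
           * queues that no longer exist (and, per patch #4, XDP TX) */
          rx->packets = 0;
          rx->bytes = 0;
          tx->packets = 0;
          tx->bytes = 0;
  }

  static const struct netdev_stat_ops my_stat_ops = {
          .get_queue_stats_rx = my_get_queue_stats_rx,
          .get_base_stats = my_get_base_stats,
  };

  /* at probe time: dev->stat_ops = &my_stat_ops; */
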
MPTCPimporter pushed a commit that referenced this issue Oct 7, 2024
Syzkaller reported a lockdep splat:

  ============================================
  WARNING: possible recursive locking detected
  6.11.0-rc6-syzkaller-00019-g67784a74e258 #0 Not tainted
  --------------------------------------------
  syz-executor364/5113 is trying to acquire lock:
  ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
  ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  but task is already holding lock:
  ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
  ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(k-slock-AF_INET);
    lock(k-slock-AF_INET);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  7 locks held by syz-executor364/5113:
   #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
   #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0x153/0x1b10 net/mptcp/protocol.c:1806
   #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
   #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg_fastopen+0x11f/0x530 net/mptcp/protocol.c:1727
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x5f/0x1b80 net/ipv4/ip_output.c:470
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x45f/0x1390 net/ipv4/ip_output.c:228
   #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: local_lock_acquire include/linux/local_lock_internal.h:29 [inline]
   #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: process_backlog+0x33b/0x15b0 net/core/dev.c:6104
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0x230/0x5f0 net/ipv4/ip_input.c:232
   #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
   #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  stack backtrace:
  CPU: 0 UID: 0 PID: 5113 Comm: syz-executor364 Not tainted 6.11.0-rc6-syzkaller-00019-g67784a74e258 #0
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
  Call Trace:
   <IRQ>
   __dump_stack lib/dump_stack.c:93 [inline]
   dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119
   check_deadlock kernel/locking/lockdep.c:3061 [inline]
   validate_chain+0x15d3/0x5900 kernel/locking/lockdep.c:3855
   __lock_acquire+0x137a/0x2040 kernel/locking/lockdep.c:5142
   lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5759
   __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
   _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
   spin_lock include/linux/spinlock.h:351 [inline]
   sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328
   mptcp_sk_clone_init+0x32/0x13c0 net/mptcp/protocol.c:3279
   subflow_syn_recv_sock+0x931/0x1920 net/mptcp/subflow.c:874
   tcp_check_req+0xfe4/0x1a20 net/ipv4/tcp_minisocks.c:853
   tcp_v4_rcv+0x1c3e/0x37f0 net/ipv4/tcp_ipv4.c:2267
   ip_protocol_deliver_rcu+0x22e/0x440 net/ipv4/ip_input.c:205
   ip_local_deliver_finish+0x341/0x5f0 net/ipv4/ip_input.c:233
   NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
   NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
   __netif_receive_skb_one_core net/core/dev.c:5661 [inline]
   __netif_receive_skb+0x2bf/0x650 net/core/dev.c:5775
   process_backlog+0x662/0x15b0 net/core/dev.c:6108
   __napi_poll+0xcb/0x490 net/core/dev.c:6772
   napi_poll net/core/dev.c:6841 [inline]
   net_rx_action+0x89b/0x1240 net/core/dev.c:6963
   handle_softirqs+0x2c4/0x970 kernel/softirq.c:554
   do_softirq+0x11b/0x1e0 kernel/softirq.c:455
   </IRQ>
   <TASK>
   __local_bh_enable_ip+0x1bb/0x200 kernel/softirq.c:382
   local_bh_enable include/linux/bottom_half.h:33 [inline]
   rcu_read_unlock_bh include/linux/rcupdate.h:908 [inline]
   __dev_queue_xmit+0x1763/0x3e90 net/core/dev.c:4450
   dev_queue_xmit include/linux/netdevice.h:3105 [inline]
   neigh_hh_output include/net/neighbour.h:526 [inline]
   neigh_output include/net/neighbour.h:540 [inline]
   ip_finish_output2+0xd41/0x1390 net/ipv4/ip_output.c:235
   ip_local_out net/ipv4/ip_output.c:129 [inline]
   __ip_queue_xmit+0x118c/0x1b80 net/ipv4/ip_output.c:535
   __tcp_transmit_skb+0x2544/0x3b30 net/ipv4/tcp_output.c:1466
   tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:6542 [inline]
   tcp_rcv_state_process+0x2c32/0x4570 net/ipv4/tcp_input.c:6729
   tcp_v4_do_rcv+0x77d/0xc70 net/ipv4/tcp_ipv4.c:1934
   sk_backlog_rcv include/net/sock.h:1111 [inline]
   __release_sock+0x214/0x350 net/core/sock.c:3004
   release_sock+0x61/0x1f0 net/core/sock.c:3558
   mptcp_sendmsg_fastopen+0x1ad/0x530 net/mptcp/protocol.c:1733
   mptcp_sendmsg+0x1884/0x1b10 net/mptcp/protocol.c:1812
   sock_sendmsg_nosec net/socket.c:730 [inline]
   __sock_sendmsg+0x1a6/0x270 net/socket.c:745
   ____sys_sendmsg+0x525/0x7d0 net/socket.c:2597
   ___sys_sendmsg net/socket.c:2651 [inline]
   __sys_sendmmsg+0x3b2/0x740 net/socket.c:2737
   __do_sys_sendmmsg net/socket.c:2766 [inline]
   __se_sys_sendmmsg net/socket.c:2763 [inline]
   __x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2763
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7f04fb13a6b9
  Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 01 1a 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
  RSP: 002b:00007ffd651f42d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
  RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f04fb13a6b9
  RDX: 0000000000000001 RSI: 0000000020000d00 RDI: 0000000000000004
  RBP: 00007ffd651f4310 R08: 0000000000000001 R09: 0000000000000001
  R10: 0000000020000080 R11: 0000000000000246 R12: 00000000000f4240
  R13: 00007f04fb187449 R14: 00007ffd651f42f4 R15: 00007ffd651f4300
   </TASK>

As noted by Cong Wang, the splat is a false positive, but the code
path leading to the report is an unexpected one: a client is
attempting an MPC handshake towards the in-kernel listener created
by the in-kernel PM for a port-based signal endpoint.

Such a connection will never be accepted; many of them can fill the
listener queue, preventing the creation of MPJ subflows via such a
listener - its intended role.

Explicitly detect this scenario at initial-syn time and drop the
incoming MPC request.

Fixes: 1729cf1 ("mptcp: create the listening socket for new port")
Reported-by: syzbot+f4aacdfef2c6a6529c3e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f4aacdfef2c6a6529c3e
Cc: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Message-Id: <20241007-mpc-hs-port-v2-1-0c9e7827bd0f@kernel.org>
MPTCPimporter pushed a commit that referenced this issue Oct 8, 2024
Syzkaller reported a lockdep splat:

  ============================================
  WARNING: possible recursive locking detected
  6.11.0-rc6-syzkaller-00019-g67784a74e258 #0 Not tainted
  --------------------------------------------
  syz-executor364/5113 is trying to acquire lock:
  ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
  ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  but task is already holding lock:
  ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
  ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(k-slock-AF_INET);
    lock(k-slock-AF_INET);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  7 locks held by syz-executor364/5113:
   #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
   #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0x153/0x1b10 net/mptcp/protocol.c:1806
   #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
   #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg_fastopen+0x11f/0x530 net/mptcp/protocol.c:1727
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x5f/0x1b80 net/ipv4/ip_output.c:470
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x45f/0x1390 net/ipv4/ip_output.c:228
   #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: local_lock_acquire include/linux/local_lock_internal.h:29 [inline]
   #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: process_backlog+0x33b/0x15b0 net/core/dev.c:6104
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0x230/0x5f0 net/ipv4/ip_input.c:232
   #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
   #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  stack backtrace:
  CPU: 0 UID: 0 PID: 5113 Comm: syz-executor364 Not tainted 6.11.0-rc6-syzkaller-00019-g67784a74e258 #0
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
  Call Trace:
   <IRQ>
   __dump_stack lib/dump_stack.c:93 [inline]
   dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119
   check_deadlock kernel/locking/lockdep.c:3061 [inline]
   validate_chain+0x15d3/0x5900 kernel/locking/lockdep.c:3855
   __lock_acquire+0x137a/0x2040 kernel/locking/lockdep.c:5142
   lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5759
   __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
   _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
   spin_lock include/linux/spinlock.h:351 [inline]
   sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328
   mptcp_sk_clone_init+0x32/0x13c0 net/mptcp/protocol.c:3279
   subflow_syn_recv_sock+0x931/0x1920 net/mptcp/subflow.c:874
   tcp_check_req+0xfe4/0x1a20 net/ipv4/tcp_minisocks.c:853
   tcp_v4_rcv+0x1c3e/0x37f0 net/ipv4/tcp_ipv4.c:2267
   ip_protocol_deliver_rcu+0x22e/0x440 net/ipv4/ip_input.c:205
   ip_local_deliver_finish+0x341/0x5f0 net/ipv4/ip_input.c:233
   NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
   NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
   __netif_receive_skb_one_core net/core/dev.c:5661 [inline]
   __netif_receive_skb+0x2bf/0x650 net/core/dev.c:5775
   process_backlog+0x662/0x15b0 net/core/dev.c:6108
   __napi_poll+0xcb/0x490 net/core/dev.c:6772
   napi_poll net/core/dev.c:6841 [inline]
   net_rx_action+0x89b/0x1240 net/core/dev.c:6963
   handle_softirqs+0x2c4/0x970 kernel/softirq.c:554
   do_softirq+0x11b/0x1e0 kernel/softirq.c:455
   </IRQ>
   <TASK>
   __local_bh_enable_ip+0x1bb/0x200 kernel/softirq.c:382
   local_bh_enable include/linux/bottom_half.h:33 [inline]
   rcu_read_unlock_bh include/linux/rcupdate.h:908 [inline]
   __dev_queue_xmit+0x1763/0x3e90 net/core/dev.c:4450
   dev_queue_xmit include/linux/netdevice.h:3105 [inline]
   neigh_hh_output include/net/neighbour.h:526 [inline]
   neigh_output include/net/neighbour.h:540 [inline]
   ip_finish_output2+0xd41/0x1390 net/ipv4/ip_output.c:235
   ip_local_out net/ipv4/ip_output.c:129 [inline]
   __ip_queue_xmit+0x118c/0x1b80 net/ipv4/ip_output.c:535
   __tcp_transmit_skb+0x2544/0x3b30 net/ipv4/tcp_output.c:1466
   tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:6542 [inline]
   tcp_rcv_state_process+0x2c32/0x4570 net/ipv4/tcp_input.c:6729
   tcp_v4_do_rcv+0x77d/0xc70 net/ipv4/tcp_ipv4.c:1934
   sk_backlog_rcv include/net/sock.h:1111 [inline]
   __release_sock+0x214/0x350 net/core/sock.c:3004
   release_sock+0x61/0x1f0 net/core/sock.c:3558
   mptcp_sendmsg_fastopen+0x1ad/0x530 net/mptcp/protocol.c:1733
   mptcp_sendmsg+0x1884/0x1b10 net/mptcp/protocol.c:1812
   sock_sendmsg_nosec net/socket.c:730 [inline]
   __sock_sendmsg+0x1a6/0x270 net/socket.c:745
   ____sys_sendmsg+0x525/0x7d0 net/socket.c:2597
   ___sys_sendmsg net/socket.c:2651 [inline]
   __sys_sendmmsg+0x3b2/0x740 net/socket.c:2737
   __do_sys_sendmmsg net/socket.c:2766 [inline]
   __se_sys_sendmmsg net/socket.c:2763 [inline]
   __x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2763
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7f04fb13a6b9
  Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 01 1a 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
  RSP: 002b:00007ffd651f42d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
  RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f04fb13a6b9
  RDX: 0000000000000001 RSI: 0000000020000d00 RDI: 0000000000000004
  RBP: 00007ffd651f4310 R08: 0000000000000001 R09: 0000000000000001
  R10: 0000000020000080 R11: 0000000000000246 R12: 00000000000f4240
  R13: 00007f04fb187449 R14: 00007ffd651f42f4 R15: 00007ffd651f4300
   </TASK>

As noted by Cong Wang, the splat is a false positive, but the code
path leading to the report is an unexpected one: a client is
attempting an MPC handshake towards the in-kernel listener created
by the in-kernel PM for a port-based signal endpoint.

Such a connection will never be accepted; many of them can fill the
listener queue, preventing the creation of MPJ subflows via such a
listener - its intended role.

Explicitly detect this scenario at initial-syn time and drop the
incoming MPC request.

Fixes: 1729cf1 ("mptcp: create the listening socket for new port")
Reported-by: syzbot+f4aacdfef2c6a6529c3e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f4aacdfef2c6a6529c3e
Cc: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Message-Id: <20241008-mpc-hs-port-v3-1-cec1363f0353@kernel.org>
matttbe pushed a commit that referenced this issue Oct 9, 2024
Syzkaller reported a lockdep splat:

  ============================================
  WARNING: possible recursive locking detected
  6.11.0-rc6-syzkaller-00019-g67784a74e258 #0 Not tainted
  --------------------------------------------
  syz-executor364/5113 is trying to acquire lock:
  ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
  ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  but task is already holding lock:
  ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
  ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(k-slock-AF_INET);
    lock(k-slock-AF_INET);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  7 locks held by syz-executor364/5113:
   #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
   #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0x153/0x1b10 net/mptcp/protocol.c:1806
   #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
   #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg_fastopen+0x11f/0x530 net/mptcp/protocol.c:1727
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x5f/0x1b80 net/ipv4/ip_output.c:470
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x45f/0x1390 net/ipv4/ip_output.c:228
   #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: local_lock_acquire include/linux/local_lock_internal.h:29 [inline]
   #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: process_backlog+0x33b/0x15b0 net/core/dev.c:6104
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0x230/0x5f0 net/ipv4/ip_input.c:232
   #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
   #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  stack backtrace:
  CPU: 0 UID: 0 PID: 5113 Comm: syz-executor364 Not tainted 6.11.0-rc6-syzkaller-00019-g67784a74e258 #0
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
  Call Trace:
   <IRQ>
   __dump_stack lib/dump_stack.c:93 [inline]
   dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119
   check_deadlock kernel/locking/lockdep.c:3061 [inline]
   validate_chain+0x15d3/0x5900 kernel/locking/lockdep.c:3855
   __lock_acquire+0x137a/0x2040 kernel/locking/lockdep.c:5142
   lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5759
   __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
   _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
   spin_lock include/linux/spinlock.h:351 [inline]
   sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328
   mptcp_sk_clone_init+0x32/0x13c0 net/mptcp/protocol.c:3279
   subflow_syn_recv_sock+0x931/0x1920 net/mptcp/subflow.c:874
   tcp_check_req+0xfe4/0x1a20 net/ipv4/tcp_minisocks.c:853
   tcp_v4_rcv+0x1c3e/0x37f0 net/ipv4/tcp_ipv4.c:2267
   ip_protocol_deliver_rcu+0x22e/0x440 net/ipv4/ip_input.c:205
   ip_local_deliver_finish+0x341/0x5f0 net/ipv4/ip_input.c:233
   NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
   NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
   __netif_receive_skb_one_core net/core/dev.c:5661 [inline]
   __netif_receive_skb+0x2bf/0x650 net/core/dev.c:5775
   process_backlog+0x662/0x15b0 net/core/dev.c:6108
   __napi_poll+0xcb/0x490 net/core/dev.c:6772
   napi_poll net/core/dev.c:6841 [inline]
   net_rx_action+0x89b/0x1240 net/core/dev.c:6963
   handle_softirqs+0x2c4/0x970 kernel/softirq.c:554
   do_softirq+0x11b/0x1e0 kernel/softirq.c:455
   </IRQ>
   <TASK>
   __local_bh_enable_ip+0x1bb/0x200 kernel/softirq.c:382
   local_bh_enable include/linux/bottom_half.h:33 [inline]
   rcu_read_unlock_bh include/linux/rcupdate.h:908 [inline]
   __dev_queue_xmit+0x1763/0x3e90 net/core/dev.c:4450
   dev_queue_xmit include/linux/netdevice.h:3105 [inline]
   neigh_hh_output include/net/neighbour.h:526 [inline]
   neigh_output include/net/neighbour.h:540 [inline]
   ip_finish_output2+0xd41/0x1390 net/ipv4/ip_output.c:235
   ip_local_out net/ipv4/ip_output.c:129 [inline]
   __ip_queue_xmit+0x118c/0x1b80 net/ipv4/ip_output.c:535
   __tcp_transmit_skb+0x2544/0x3b30 net/ipv4/tcp_output.c:1466
   tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:6542 [inline]
   tcp_rcv_state_process+0x2c32/0x4570 net/ipv4/tcp_input.c:6729
   tcp_v4_do_rcv+0x77d/0xc70 net/ipv4/tcp_ipv4.c:1934
   sk_backlog_rcv include/net/sock.h:1111 [inline]
   __release_sock+0x214/0x350 net/core/sock.c:3004
   release_sock+0x61/0x1f0 net/core/sock.c:3558
   mptcp_sendmsg_fastopen+0x1ad/0x530 net/mptcp/protocol.c:1733
   mptcp_sendmsg+0x1884/0x1b10 net/mptcp/protocol.c:1812
   sock_sendmsg_nosec net/socket.c:730 [inline]
   __sock_sendmsg+0x1a6/0x270 net/socket.c:745
   ____sys_sendmsg+0x525/0x7d0 net/socket.c:2597
   ___sys_sendmsg net/socket.c:2651 [inline]
   __sys_sendmmsg+0x3b2/0x740 net/socket.c:2737
   __do_sys_sendmmsg net/socket.c:2766 [inline]
   __se_sys_sendmmsg net/socket.c:2763 [inline]
   __x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2763
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7f04fb13a6b9
  Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 01 1a 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
  RSP: 002b:00007ffd651f42d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
  RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f04fb13a6b9
  RDX: 0000000000000001 RSI: 0000000020000d00 RDI: 0000000000000004
  RBP: 00007ffd651f4310 R08: 0000000000000001 R09: 0000000000000001
  R10: 0000000020000080 R11: 0000000000000246 R12: 00000000000f4240
  R13: 00007f04fb187449 R14: 00007ffd651f42f4 R15: 00007ffd651f4300
   </TASK>

As noted by Cong Wang, the splat is a false positive, but the code
path leading to the report is an unexpected one: a client is
attempting an MPC handshake towards the in-kernel listener created
by the in-kernel PM for a port-based signal endpoint.

Such a connection will never be accepted; many of them can fill the
listener queue, preventing the creation of MPJ subflows via such a
listener - its intended role.

Explicitly detect this scenario at initial-syn time and drop the
incoming MPC request.

Fixes: 1729cf1 ("mptcp: create the listening socket for new port")
Reported-by: syzbot+f4aacdfef2c6a6529c3e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f4aacdfef2c6a6529c3e
Cc: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
matttbe pushed a commit that referenced this issue Oct 9, 2024
Syzkaller reported a lockdep splat:

  ============================================
  WARNING: possible recursive locking detected
  6.11.0-rc6-syzkaller-00019-g67784a74e258 #0 Not tainted
  --------------------------------------------
  syz-executor364/5113 is trying to acquire lock:
  ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
  ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  but task is already holding lock:
  ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
  ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(k-slock-AF_INET);
    lock(k-slock-AF_INET);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  7 locks held by syz-executor364/5113:
   #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
   #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0x153/0x1b10 net/mptcp/protocol.c:1806
   #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
   #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg_fastopen+0x11f/0x530 net/mptcp/protocol.c:1727
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x5f/0x1b80 net/ipv4/ip_output.c:470
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x45f/0x1390 net/ipv4/ip_output.c:228
   #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: local_lock_acquire include/linux/local_lock_internal.h:29 [inline]
   #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: process_backlog+0x33b/0x15b0 net/core/dev.c:6104
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0x230/0x5f0 net/ipv4/ip_input.c:232
   #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
   #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  stack backtrace:
  CPU: 0 UID: 0 PID: 5113 Comm: syz-executor364 Not tainted 6.11.0-rc6-syzkaller-00019-g67784a74e258 #0
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
  Call Trace:
   <IRQ>
   __dump_stack lib/dump_stack.c:93 [inline]
   dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119
   check_deadlock kernel/locking/lockdep.c:3061 [inline]
   validate_chain+0x15d3/0x5900 kernel/locking/lockdep.c:3855
   __lock_acquire+0x137a/0x2040 kernel/locking/lockdep.c:5142
   lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5759
   __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
   _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
   spin_lock include/linux/spinlock.h:351 [inline]
   sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328
   mptcp_sk_clone_init+0x32/0x13c0 net/mptcp/protocol.c:3279
   subflow_syn_recv_sock+0x931/0x1920 net/mptcp/subflow.c:874
   tcp_check_req+0xfe4/0x1a20 net/ipv4/tcp_minisocks.c:853
   tcp_v4_rcv+0x1c3e/0x37f0 net/ipv4/tcp_ipv4.c:2267
   ip_protocol_deliver_rcu+0x22e/0x440 net/ipv4/ip_input.c:205
   ip_local_deliver_finish+0x341/0x5f0 net/ipv4/ip_input.c:233
   NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
   NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
   __netif_receive_skb_one_core net/core/dev.c:5661 [inline]
   __netif_receive_skb+0x2bf/0x650 net/core/dev.c:5775
   process_backlog+0x662/0x15b0 net/core/dev.c:6108
   __napi_poll+0xcb/0x490 net/core/dev.c:6772
   napi_poll net/core/dev.c:6841 [inline]
   net_rx_action+0x89b/0x1240 net/core/dev.c:6963
   handle_softirqs+0x2c4/0x970 kernel/softirq.c:554
   do_softirq+0x11b/0x1e0 kernel/softirq.c:455
   </IRQ>
   <TASK>
   __local_bh_enable_ip+0x1bb/0x200 kernel/softirq.c:382
   local_bh_enable include/linux/bottom_half.h:33 [inline]
   rcu_read_unlock_bh include/linux/rcupdate.h:908 [inline]
   __dev_queue_xmit+0x1763/0x3e90 net/core/dev.c:4450
   dev_queue_xmit include/linux/netdevice.h:3105 [inline]
   neigh_hh_output include/net/neighbour.h:526 [inline]
   neigh_output include/net/neighbour.h:540 [inline]
   ip_finish_output2+0xd41/0x1390 net/ipv4/ip_output.c:235
   ip_local_out net/ipv4/ip_output.c:129 [inline]
   __ip_queue_xmit+0x118c/0x1b80 net/ipv4/ip_output.c:535
   __tcp_transmit_skb+0x2544/0x3b30 net/ipv4/tcp_output.c:1466
   tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:6542 [inline]
   tcp_rcv_state_process+0x2c32/0x4570 net/ipv4/tcp_input.c:6729
   tcp_v4_do_rcv+0x77d/0xc70 net/ipv4/tcp_ipv4.c:1934
   sk_backlog_rcv include/net/sock.h:1111 [inline]
   __release_sock+0x214/0x350 net/core/sock.c:3004
   release_sock+0x61/0x1f0 net/core/sock.c:3558
   mptcp_sendmsg_fastopen+0x1ad/0x530 net/mptcp/protocol.c:1733
   mptcp_sendmsg+0x1884/0x1b10 net/mptcp/protocol.c:1812
   sock_sendmsg_nosec net/socket.c:730 [inline]
   __sock_sendmsg+0x1a6/0x270 net/socket.c:745
   ____sys_sendmsg+0x525/0x7d0 net/socket.c:2597
   ___sys_sendmsg net/socket.c:2651 [inline]
   __sys_sendmmsg+0x3b2/0x740 net/socket.c:2737
   __do_sys_sendmmsg net/socket.c:2766 [inline]
   __se_sys_sendmmsg net/socket.c:2763 [inline]
   __x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2763
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7f04fb13a6b9
  Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 01 1a 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
  RSP: 002b:00007ffd651f42d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
  RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f04fb13a6b9
  RDX: 0000000000000001 RSI: 0000000020000d00 RDI: 0000000000000004
  RBP: 00007ffd651f4310 R08: 0000000000000001 R09: 0000000000000001
  R10: 0000000020000080 R11: 0000000000000246 R12: 00000000000f4240
  R13: 00007f04fb187449 R14: 00007ffd651f42f4 R15: 00007ffd651f4300
   </TASK>

As noted by Cong Wang, the splat is a false positive, but the code
path leading to the report is an unexpected one: a client is
attempting an MPC handshake towards the in-kernel listener created
by the in-kernel PM for a port-based signal endpoint.

Such a connection will never be accepted; enough of them can fill the
listener queue, preventing the creation of MPJ subflows via that
listener - its intended role.

Explicitly detect this scenario at initial-SYN time and drop the
incoming MPC request.

Fixes: 1729cf1 ("mptcp: create the listening socket for new port")
Reported-by: syzbot+f4aacdfef2c6a6529c3e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f4aacdfef2c6a6529c3e
Cc: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
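
In concept, the fix boils down to an early check on the listener side: if the
socket that received the SYN is the TCP listener that the in-kernel PM created
for a port-based signal endpoint, an MP_CAPABLE request is refused right away
instead of lingering in an accept queue it can never leave. The sketch below
only illustrates that idea under stated assumptions; subflow_listener_is_pm_port_based()
is a hypothetical helper, and the real patch lives in the existing MPTCP
subflow/PM code rather than in a standalone function like this.

  /* Illustrative sketch, not the upstream patch: refuse an MPC handshake
   * at initial-SYN time when the listener is the in-kernel PM's
   * port-based signal-endpoint listener, whose only job is MP_JOIN.
   */
  static bool mpc_request_allowed_sketch(const struct sock *listener_sk,
                                         const struct sk_buff *skb)
  {
          struct mptcp_options_received mp_opt;

          mptcp_get_options(skb, &mp_opt);

          /* No MP_CAPABLE in the SYN: nothing special to do. */
          if (!(mp_opt.suboptions & OPTION_MPTCP_MPC_SYN))
                  return true;

          /* Hypothetical helper: does this listener belong to the in-kernel
           * PM's port-based signal endpoint? Such a listener never accepts
           * MPC connections, so drop the request instead of queueing it.
           */
          if (subflow_listener_is_pm_port_based(listener_sk))
                  return false;

          return true;
  }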
matttbe pushed a commit that referenced this issue Oct 9, 2024
matttbe pushed a commit that referenced this issue Oct 10, 2024
matttbe pushed a commit that referenced this issue Oct 10, 2024
matttbe pushed a commit that referenced this issue Oct 11, 2024
matttbe pushed a commit that referenced this issue Oct 11, 2024
matttbe pushed a commit that referenced this issue Oct 14, 2024
Eric Dumazet says:

====================
net: remove RTNL from fib_seq_sum()

This series is inspired by a syzbot report showing
rtnl contention and one thread blocked in:

7 locks held by syz-executor/10835:
  #0: ffff888033390420 (sb_writers#8){.+.+}-{0:0}, at: file_start_write include/linux/fs.h:2931 [inline]
  #0: ffff888033390420 (sb_writers#8){.+.+}-{0:0}, at: vfs_write+0x224/0xc90 fs/read_write.c:679
  #1: ffff88806df6bc88 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x1ea/0x500 fs/kernfs/file.c:325
  #2: ffff888026fcf3c8 (kn->active#50){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x20e/0x500 fs/kernfs/file.c:326
  #3: ffffffff8f56f848 (nsim_bus_dev_list_lock){+.+.}-{3:3}, at: new_device_store+0x1b4/0x890 drivers/net/netdevsim/bus.c:166
  #4: ffff88805e0140e8 (&dev->mutex){....}-{3:3}, at: device_lock include/linux/device.h:1014 [inline]
  #4: ffff88805e0140e8 (&dev->mutex){....}-{3:3}, at: __device_attach+0x8e/0x520 drivers/base/dd.c:1005
  #5: ffff88805c5fb250 (&devlink->lock_key#55){+.+.}-{3:3}, at: nsim_drv_probe+0xcb/0xb80 drivers/net/netdevsim/dev.c:1534
  #6: ffffffff8fcd1748 (rtnl_mutex){+.+.}-{3:3}, at: fib_seq_sum+0x31/0x290 net/core/fib_notifier.c:46
====================

Link: https://patch.msgid.link/20241009184405.3752829-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
matttbe pushed a commit that referenced this issue Oct 14, 2024
…ation

When testing the XDP_REDIRECT feature on the LS1028A platform, we
found a highly reproducible issue where Tx frames can no longer be
sent out even after XDP_REDIRECT is turned off. Specifically, if there
is a lot of traffic in the Rx direction while XDP_REDIRECT is turned
on, the console may display warnings like "timeout for tx ring #6
clear", and all redirected frames are dropped; the detailed log is
as follows.

root@ls1028ardb:~# ./xdp-bench redirect eno0 eno2
Redirecting from eno0 (ifindex 3; driver fsl_enetc) to eno2 (ifindex 4; driver fsl_enetc)
[203.849809] fsl_enetc 0000:00:00.2 eno2: timeout for tx ring #5 clear
[204.006051] fsl_enetc 0000:00:00.2 eno2: timeout for tx ring #6 clear
[204.161944] fsl_enetc 0000:00:00.2 eno2: timeout for tx ring #7 clear
eno0->eno2     1420505 rx/s       1420590 err,drop/s      0 xmit/s
  xmit eno0->eno2    0 xmit/s     1420590 drop/s     0 drv_err/s     15.71 bulk-avg
eno0->eno2     1420484 rx/s       1420485 err,drop/s      0 xmit/s
  xmit eno0->eno2    0 xmit/s     1420485 drop/s     0 drv_err/s     15.71 bulk-avg

Analyzing the XDP_REDIRECT implementation of the enetc driver shows
that the driver reconfigures the Tx and Rx BD rings when a bpf program
is installed or uninstalled, but there is no mechanism to block
redirected frames while the enetc driver reconfigures the rings.
Similarly, XDP_TX verdicts on received frames can also lead to frames
being enqueued in the Tx rings. Because XDP ignores the state set by
the netif_tx_wake_queue() API, introduce the ENETC_TX_DOWN flag to
suppress transmission of XDP frames.

Fixes: c33bfaf ("net: enetc: set up XDP program under enetc_reconfigure()")
Cc: stable@vger.kernel.org
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241010092056.298128-3-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
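
In practice the flag described above acts as a software gate on the XDP
transmit paths, which do not look at the state set via netif_tx_stop_queue()
and netif_tx_wake_queue(): it is set before the BD rings are torn down for
reconfiguration and cleared once they are usable again, and both the
ndo_xdp_xmit path and XDP_TX enqueueing check it. The following is only a
hedged sketch of that pattern; the function name and the exact location of the
flag (assumed here to live in priv->flags) are illustrative, not the verbatim
enetc code.

  /* Sketch of the ENETC_TX_DOWN gate; names and details are illustrative. */
  static int enetc_xdp_xmit_sketch(struct net_device *ndev, int num_frames,
                                   struct xdp_frame **frames, u32 flags)
  {
          struct enetc_ndev_priv *priv = netdev_priv(ndev);

          /* XDP ignores netif_tx_stop_queue(), so redirected frames must be
           * refused explicitly while the BD rings are being reconfigured.
           */
          if (test_bit(ENETC_TX_DOWN, &priv->flags))
                  return -ENETDOWN;

          /* ... enqueue the frames on a Tx BD ring as usual ... */
          return num_frames;
  }

  /* Reconfiguration path: set_bit(ENETC_TX_DOWN, &priv->flags) before the
   * rings are stopped, clear_bit() only after they are set up again.
   */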
matttbe pushed a commit that referenced this issue Oct 14, 2024
The Tx BD rings are disabled first in enetc_stop() and the driver
waits for them to become empty. This operation is not safe while
the ring is actively transmitting frames: the ring will never become
empty and a hardware exception can occur. As described in the NETC
block guide, software should only disable an active Tx ring after
all pending ring entries have been consumed (i.e. when PI = CI).
Disabling a transmit ring that is actively processing BDs risks
a HW-SW race hazard whereby a hardware resource becomes assigned
to work on one or more ring entries only to have those entries be
removed due to the ring becoming disabled.

When testing the XDP_REDIRECT feature, although all frames were blocked
from being put into the Tx rings during ring reconfiguration, a similar
warning log was still encountered:

fsl_enetc 0000:00:00.2 eno2: timeout for tx ring #6 clear
fsl_enetc 0000:00:00.2 eno2: timeout for tx ring #7 clear

The reason is that when there are still unsent frames in the Tx ring,
disabling the ring makes those remaining frames impossible to send
out, and the Tx ring cannot be restored afterwards, which means that
even if the xdp program is uninstalled, Tx frames can no longer be
sent out. Therefore, correct the operation order in enetc_start() and
enetc_stop().

Fixes: ff58fda ("net: enetc: prioritize ability to go down over packet processing")
Cc: stable@vger.kernel.org
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241010092056.298128-4-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
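
The ordering rule from the NETC block guide translates into a straightforward
stop sequence: quiesce all producers first (the stack's Tx queues and the XDP
paths), wait until each ring has drained (PI == CI), and only then disable the
Tx BD rings in hardware. The sketch below is a hedged illustration of that
order, assuming the ENETC_TX_DOWN flag from the previous patch;
enetc_tx_ring_empty() and enetc_disable_tx_ring() are hypothetical helpers,
not the actual enetc functions.

  /* Illustrative stop-path ordering; helper names are hypothetical. */
  static void enetc_stop_sketch(struct net_device *ndev)
  {
          struct enetc_ndev_priv *priv = netdev_priv(ndev);
          int i;

          /* 1. Stop feeding the Tx rings: stack queues and XDP paths. */
          netif_tx_disable(ndev);
          set_bit(ENETC_TX_DOWN, &priv->flags);

          /* 2. Wait for every ring to drain so no BD is still owned by
           *    hardware (producer index == consumer index).
           */
          for (i = 0; i < priv->num_tx_rings; i++)
                  while (!enetc_tx_ring_empty(priv, i))   /* hypothetical */
                          usleep_range(10, 20);

          /* 3. Only now disable the Tx BD rings in hardware. */
          for (i = 0; i < priv->num_tx_rings; i++)
                  enetc_disable_tx_ring(priv, i);         /* hypothetical */
  }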
matttbe pushed a commit that referenced this issue Oct 14, 2024
matttbe pushed a commit that referenced this issue Oct 14, 2024
matttbe pushed a commit that referenced this issue Oct 14, 2024
matttbe pushed a commit that referenced this issue Oct 14, 2024
matttbe pushed a commit that referenced this issue Oct 14, 2024
matttbe pushed a commit that referenced this issue Oct 14, 2024
matttbe pushed a commit that referenced this issue Oct 14, 2024
matttbe pushed a commit that referenced this issue Oct 15, 2024
matttbe pushed a commit that referenced this issue Oct 15, 2024
  R10: 0000000020000080 R11: 0000000000000246 R12: 00000000000f4240
  R13: 00007f04fb187449 R14: 00007ffd651f42f4 R15: 00007ffd651f4300
   </TASK>

As noted by Cong Wang, the splat is a false positive, but the code
path leading to the report is an unexpected one: a client is
attempting an MPC handshake towards the in-kernel listener created
by the in-kernel PM for a port-based signal endpoint.

Such connections will never be accepted; enough of them can fill the
listener queue and prevent the creation of MPJ subflows via that
listener - its intended role.

Explicitly detect this scenario at initial-syn time and drop the
incoming MPC request.

Fixes: 1729cf1 ("mptcp: create the listening socket for new port")
Reported-by: syzbot+f4aacdfef2c6a6529c3e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f4aacdfef2c6a6529c3e
Cc: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
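
For illustration only, a minimal C sketch of the approach described above: the in-kernel PM marks the listening socket it creates for a port-based signal endpoint, and the SYN-processing path refuses MP_CAPABLE towards such a listener so that only MP_JOIN requests can reach it. The helper name pm_listener_rejects_mpc() and the pm_listener flag are assumptions made for this sketch; it is not the literal upstream patch, and it presumes the MPTCP-internal definitions from net/mptcp/protocol.h (e.g. mptcp_subflow_ctx()).

  /* Sketch only, not the upstream patch: assumes "pm_listener" is a flag
   * the in-kernel PM sets on the subflow context of the listening socket
   * it creates for a port-based signal endpoint.
   */
  static bool pm_listener_rejects_mpc(const struct sock *sk_listener,
                                      bool syn_has_mp_capable)
  {
          const struct mptcp_subflow_context *listener =
                  mptcp_subflow_ctx(sk_listener);

          /* This listener only exists to accept MPJ subflows; an MPC
           * handshake towards it would never be accepted and would only
           * occupy slots in the accept queue.
           */
          return syn_has_mp_capable && listener->pm_listener;
  }

A check along these lines, evaluated while processing the initial SYN (for instance from subflow_check_req()), lets the kernel drop the MPC request before it can consume a slot in the listener's queue.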
matttbe pushed a commit that referenced this issue Oct 16, 2024
Syzkaller reported a lockdep splat:

  ============================================
  WARNING: possible recursive locking detected
  6.11.0-rc6-syzkaller-00019-g67784a74e258 #0 Not tainted
  --------------------------------------------
  syz-executor364/5113 is trying to acquire lock:
  ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
  ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  but task is already holding lock:
  ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
  ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(k-slock-AF_INET);
    lock(k-slock-AF_INET);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  7 locks held by syz-executor364/5113:
   #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
   #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0x153/0x1b10 net/mptcp/protocol.c:1806
   #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
   #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg_fastopen+0x11f/0x530 net/mptcp/protocol.c:1727
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x5f/0x1b80 net/ipv4/ip_output.c:470
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x45f/0x1390 net/ipv4/ip_output.c:228
   #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: local_lock_acquire include/linux/local_lock_internal.h:29 [inline]
   #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: process_backlog+0x33b/0x15b0 net/core/dev.c:6104
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0x230/0x5f0 net/ipv4/ip_input.c:232
   #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
   #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  stack backtrace:
  CPU: 0 UID: 0 PID: 5113 Comm: syz-executor364 Not tainted 6.11.0-rc6-syzkaller-00019-g67784a74e258 #0
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
  Call Trace:
   <IRQ>
   __dump_stack lib/dump_stack.c:93 [inline]
   dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119
   check_deadlock kernel/locking/lockdep.c:3061 [inline]
   validate_chain+0x15d3/0x5900 kernel/locking/lockdep.c:3855
   __lock_acquire+0x137a/0x2040 kernel/locking/lockdep.c:5142
   lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5759
   __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
   _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
   spin_lock include/linux/spinlock.h:351 [inline]
   sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328
   mptcp_sk_clone_init+0x32/0x13c0 net/mptcp/protocol.c:3279
   subflow_syn_recv_sock+0x931/0x1920 net/mptcp/subflow.c:874
   tcp_check_req+0xfe4/0x1a20 net/ipv4/tcp_minisocks.c:853
   tcp_v4_rcv+0x1c3e/0x37f0 net/ipv4/tcp_ipv4.c:2267
   ip_protocol_deliver_rcu+0x22e/0x440 net/ipv4/ip_input.c:205
   ip_local_deliver_finish+0x341/0x5f0 net/ipv4/ip_input.c:233
   NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
   NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
   __netif_receive_skb_one_core net/core/dev.c:5661 [inline]
   __netif_receive_skb+0x2bf/0x650 net/core/dev.c:5775
   process_backlog+0x662/0x15b0 net/core/dev.c:6108
   __napi_poll+0xcb/0x490 net/core/dev.c:6772
   napi_poll net/core/dev.c:6841 [inline]
   net_rx_action+0x89b/0x1240 net/core/dev.c:6963
   handle_softirqs+0x2c4/0x970 kernel/softirq.c:554
   do_softirq+0x11b/0x1e0 kernel/softirq.c:455
   </IRQ>
   <TASK>
   __local_bh_enable_ip+0x1bb/0x200 kernel/softirq.c:382
   local_bh_enable include/linux/bottom_half.h:33 [inline]
   rcu_read_unlock_bh include/linux/rcupdate.h:908 [inline]
   __dev_queue_xmit+0x1763/0x3e90 net/core/dev.c:4450
   dev_queue_xmit include/linux/netdevice.h:3105 [inline]
   neigh_hh_output include/net/neighbour.h:526 [inline]
   neigh_output include/net/neighbour.h:540 [inline]
   ip_finish_output2+0xd41/0x1390 net/ipv4/ip_output.c:235
   ip_local_out net/ipv4/ip_output.c:129 [inline]
   __ip_queue_xmit+0x118c/0x1b80 net/ipv4/ip_output.c:535
   __tcp_transmit_skb+0x2544/0x3b30 net/ipv4/tcp_output.c:1466
   tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:6542 [inline]
   tcp_rcv_state_process+0x2c32/0x4570 net/ipv4/tcp_input.c:6729
   tcp_v4_do_rcv+0x77d/0xc70 net/ipv4/tcp_ipv4.c:1934
   sk_backlog_rcv include/net/sock.h:1111 [inline]
   __release_sock+0x214/0x350 net/core/sock.c:3004
   release_sock+0x61/0x1f0 net/core/sock.c:3558
   mptcp_sendmsg_fastopen+0x1ad/0x530 net/mptcp/protocol.c:1733
   mptcp_sendmsg+0x1884/0x1b10 net/mptcp/protocol.c:1812
   sock_sendmsg_nosec net/socket.c:730 [inline]
   __sock_sendmsg+0x1a6/0x270 net/socket.c:745
   ____sys_sendmsg+0x525/0x7d0 net/socket.c:2597
   ___sys_sendmsg net/socket.c:2651 [inline]
   __sys_sendmmsg+0x3b2/0x740 net/socket.c:2737
   __do_sys_sendmmsg net/socket.c:2766 [inline]
   __se_sys_sendmmsg net/socket.c:2763 [inline]
   __x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2763
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7f04fb13a6b9
  Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 01 1a 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
  RSP: 002b:00007ffd651f42d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
  RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f04fb13a6b9
  RDX: 0000000000000001 RSI: 0000000020000d00 RDI: 0000000000000004
  RBP: 00007ffd651f4310 R08: 0000000000000001 R09: 0000000000000001
  R10: 0000000020000080 R11: 0000000000000246 R12: 00000000000f4240
  R13: 00007f04fb187449 R14: 00007ffd651f42f4 R15: 00007ffd651f4300
   </TASK>

As noted by Cong Wang, the splat is a false positive, but the code
path leading to the report is an unexpected one: a client is
attempting an MPC handshake towards the in-kernel listener created
by the in-kernel PM for a port-based signal endpoint.

Such connections will never be accepted; enough of them can fill the
listener queue and prevent the creation of MPJ subflows via that
listener - its intended role.

Explicitly detect this scenario at initial-syn time and drop the
incoming MPC request.

Fixes: 1729cf1 ("mptcp: create the listening socket for new port")
Cc: stable@vger.kernel.org
Reported-by: syzbot+f4aacdfef2c6a6529c3e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f4aacdfef2c6a6529c3e
Cc: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241014-net-mptcp-mpc-port-endp-v2-1-7faea8e6b6ae@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
matttbe pushed a commit that referenced this issue Oct 18, 2024
On an NFS client node, some files saved under the mountpoint of the
NFS server were copied to another location on the same NFS server.
Unexpectedly, nfs42_complete_copies() hit a NULL-pointer dereference
crash with the following syslog:

[232064.838881] NFSv4: state recovery failed for open file nfs/pvc-12b5200d-cd0f-46a3-b9f0-af8f4fe0ef64.qcow2, error = -116
[232064.839360] NFSv4: state recovery failed for open file nfs/pvc-12b5200d-cd0f-46a3-b9f0-af8f4fe0ef64.qcow2, error = -116
[232066.588183] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058
[232066.588586] Mem abort info:
[232066.588701]   ESR = 0x0000000096000007
[232066.588862]   EC = 0x25: DABT (current EL), IL = 32 bits
[232066.589084]   SET = 0, FnV = 0
[232066.589216]   EA = 0, S1PTW = 0
[232066.589340]   FSC = 0x07: level 3 translation fault
[232066.589559] Data abort info:
[232066.589683]   ISV = 0, ISS = 0x00000007
[232066.589842]   CM = 0, WnR = 0
[232066.589967] user pgtable: 64k pages, 48-bit VAs, pgdp=00002000956ff400
[232066.590231] [0000000000000058] pgd=08001100ae100003, p4d=08001100ae100003, pud=08001100ae100003, pmd=08001100b3c00003, pte=0000000000000000
[232066.590757] Internal error: Oops: 96000007 [#1] SMP
[232066.590958] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm vhost_net vhost vhost_iotlb tap tun ipt_rpfilter xt_multiport ip_set_hash_ip ip_set_hash_net xfrm_interface xfrm6_tunnel tunnel4 tunnel6 esp4 ah4 wireguard libcurve25519_generic veth xt_addrtype xt_set nf_conntrack_netlink ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_bitmap_port ip_set_hash_ipport dummy ip_set ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs iptable_filter sch_ingress nfnetlink_cttimeout vport_gre ip_gre ip_tunnel gre vport_geneve geneve vport_vxlan vxlan ip6_udp_tunnel udp_tunnel openvswitch nf_conncount dm_round_robin dm_service_time dm_multipath xt_nat xt_MASQUERADE nft_chain_nat nf_nat xt_mark xt_conntrack xt_comment nft_compat nft_counter nf_tables nfnetlink ocfs2 ocfs2_nodemanager ocfs2_stackglue iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_ssif nbd overlay 8021q garp mrp bonding tls rfkill sunrpc ext4 mbcache jbd2
[232066.591052]  vfat fat cas_cache cas_disk ses enclosure scsi_transport_sas sg acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler ip_tables vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_type1 vfio dm_mirror dm_region_hash dm_log dm_mod nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc fuse xfs libcrc32c ast drm_vram_helper qla2xxx drm_kms_helper syscopyarea crct10dif_ce sysfillrect ghash_ce sysimgblt sha2_ce fb_sys_fops cec sha256_arm64 sha1_ce drm_ttm_helper ttm nvme_fc igb sbsa_gwdt nvme_fabrics drm nvme_core i2c_algo_bit i40e scsi_transport_fc megaraid_sas aes_neon_bs
[232066.596953] CPU: 6 PID: 4124696 Comm: 10.253.166.125- Kdump: loaded Not tainted 5.15.131-9.cl9_ocfs2.aarch64 #1
[232066.597356] Hardware name: Great Wall .\x93\x8e...RF6260 V5/GWMSSE2GL1T, BIOS T656FBE_V3.0.18 2024-01-06
[232066.597721] pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[232066.598034] pc : nfs4_reclaim_open_state+0x220/0x800 [nfsv4]
[232066.598327] lr : nfs4_reclaim_open_state+0x12c/0x800 [nfsv4]
[232066.598595] sp : ffff8000f568fc70
[232066.598731] x29: ffff8000f568fc70 x28: 0000000000001000 x27: ffff21003db33000
[232066.599030] x26: ffff800005521ae0 x25: ffff0100f98fa3f0 x24: 0000000000000001
[232066.599319] x23: ffff800009920008 x22: ffff21003db33040 x21: ffff21003db33050
[232066.599628] x20: ffff410172fe9e40 x19: ffff410172fe9e00 x18: 0000000000000000
[232066.599914] x17: 0000000000000000 x16: 0000000000000004 x15: 0000000000000000
[232066.600195] x14: 0000000000000000 x13: ffff800008e685a8 x12: 00000000eac0c6e6
[232066.600498] x11: 0000000000000000 x10: 0000000000000008 x9 : ffff8000054e5828
[232066.600784] x8 : 00000000ffffffbf x7 : 0000000000000001 x6 : 000000000a9eb14a
[232066.601062] x5 : 0000000000000000 x4 : ffff70ff8a14a800 x3 : 0000000000000058
[232066.601348] x2 : 0000000000000001 x1 : 54dce46366daa6c6 x0 : 0000000000000000
[232066.601636] Call trace:
[232066.601749]  nfs4_reclaim_open_state+0x220/0x800 [nfsv4]
[232066.601998]  nfs4_do_reclaim+0x1b8/0x28c [nfsv4]
[232066.602218]  nfs4_state_manager+0x928/0x10f0 [nfsv4]
[232066.602455]  nfs4_run_state_manager+0x78/0x1b0 [nfsv4]
[232066.602690]  kthread+0x110/0x114
[232066.602830]  ret_from_fork+0x10/0x20
[232066.602985] Code: 1400000d f9403f20 f9402e61 91016003 (f9402c00)
[232066.603284] SMP: stopping secondary CPUs
[232066.606936] Starting crashdump kernel...
[232066.607146] Bye!

Analysing the vmcore, we know that the nfs4_copy_state listed on the
destination's nfs_server->ss_copies was added through the field 'copies'
in handle_async_copy(), and we found a waiting copy process with the
following stack:
PID: 3511963  TASK: ffff710028b47e00  CPU: 0   COMMAND: "cp"
 #0 [ffff8001116ef740] __switch_to at ffff8000081b92f4
 #1 [ffff8001116ef760] __schedule at ffff800008dd0650
 #2 [ffff8001116ef7c0] schedule at ffff800008dd0a00
 #3 [ffff8001116ef7e0] schedule_timeout at ffff800008dd6aa0
 #4 [ffff8001116ef860] __wait_for_common at ffff800008dd166c
 #5 [ffff8001116ef8e0] wait_for_completion_interruptible at ffff800008dd1898
 #6 [ffff8001116ef8f0] handle_async_copy at ffff8000055142f4 [nfsv4]
 #7 [ffff8001116ef970] _nfs42_proc_copy at ffff8000055147c8 [nfsv4]
 #8 [ffff8001116efa80] nfs42_proc_copy at ffff800005514cf0 [nfsv4]
 #9 [ffff8001116efc50] __nfs4_copy_file_range.constprop.0 at ffff8000054ed694 [nfsv4]

The NULL-pointer dereference happened because nfs42_complete_copies()
walked nfs_server->ss_copies through the ss_copies field of
nfs4_copy_state, so the nfs4_copy_state address ffff0100f98fa3f0 was
offset by 0x10 and the data accessed through this pointer was also
incorrect. Generally, the ordered list nfs4_state_owner->so_states
ensures that open(O_RDWR) or open(O_WRITE) states are reclaimed first
by nfs4_reclaim_open_state(). When reclaim of the destination state
fails with NFS_STATE_RECOVERY_FAILED and the copies are not removed
from nfs_server->ss_copies, the source state may be passed to
nfs42_complete_copies() earlier, eventually leading to this crash.
To solve this issue, add a dedicated list_head,
nfs_server->ss_src_copies, for server-to-server copies.

Fixes: 0e65a32 ("NFS: handle source server reboot")
Signed-off-by: Yanjun Zhang <zhangyanjun@cestc.cn>
Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
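
For illustration only, a minimal C sketch of the data-structure idea behind this fix, using simplified stand-in types (copy_state_example, nfs_server_example and complete_source_copies() are assumptions for this sketch, not the upstream NFS code): an entry linked through two different list_head members must be walked with the member matching the list it was added to, otherwise the resulting pointer is skewed by the offset between the members - the 0x10 skew seen in the vmcore. Giving source-server copies a dedicated list keeps each traversal paired with the right link field.

  /* Sketch only, with stand-in types: a copy state is linked into
   * per-server lists through two different list_head members, and each
   * list must be walked with its matching member.
   */
  #include <linux/list.h>

  struct copy_state_example {             /* stands in for nfs4_copy_state */
          struct list_head copies;        /* link used for destination copies */
          struct list_head ss_copies;     /* link used for source copies */
  };

  struct nfs_server_example {             /* stands in for nfs_server */
          struct list_head ss_copies;     /* destination-side copies */
          struct list_head ss_src_copies; /* source-side copies (the new list) */
  };

  static void complete_source_copies(struct nfs_server_example *server)
  {
          struct copy_state_example *copy;

          /* Walk the source list through its matching link member only. */
          list_for_each_entry(copy, &server->ss_src_copies, ss_copies) {
                  /* ... error out / wake up the waiting copy ... */
          }
  }

In such a layout, both server lists would be initialized with INIT_LIST_HEAD() when the server structure is set up, so destination-state reclaim never walks entries that were linked through the other member.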