BPF: packet scheduler #75

matttbe · 2020-08-07T20:57:34Z

Extending MPTCP with BPF is clearly something we want.

It looks like the upstream MPTCP kernel will need to be extended to allow some packet scheduling decisions to be taken with BPF, and that this should be done with a higher priority than #74.

I think the implementation would be similar to what is done in the kernel with BPF TCP CC: the ability to write a congestion control protocol in BPF with BPF_STRUCT_OPS, see: https://linuxplumbersconf.org/event/7/contributions/687/

Or check these files:

BPF "kernelspace": net/ipv4/bpf_tcp_ca.c
BPF "userspace": tools/testing/selftests/bpf/progs/bpf_cubic.c

From what I saw, the kernel side is a bit tricky. It looks like the TCP CC solution is designed that way because a new TCP CC is normally added as a new TCP CC kernel module. For BPF TCP CC, this module can be controlled via BPF.

On our side with MPTCP, we currently don't have the ability to create other packet schedulers (or path managers).

Maybe a first step would be to add the ability to select between different packet schedulers implemented in the kernel.
Or maybe the current scheduler could be given the ability to be controlled via BPF. But in this case, can we easily have both: a single packet scheduler that can do the job with and without a BPF program controlling it?
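
To make the first option more concrete, here is a minimal sketch in plain userspace C of what "selecting between different schedulers" could look like if it followed the same table-of-callbacks pattern as TCP CC. Every name in it (`sched_ops`, `sched_register`, `sched_find`, the `get_subflow` callback, the "first" scheduler) is hypothetical and only there for illustration; it is not the kernel API and not a proposal for the final design.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SCHED_NAME_MAX 16
#define SCHED_MAX      8

/* Hypothetical table of callbacks describing one packet scheduler. */
struct sched_ops {
	char name[SCHED_NAME_MAX];
	/* Return the index of the subflow to use, or -1 if there is none. */
	int (*get_subflow)(int nr_subflows);
};

/* Hypothetical registry of the schedulers that can be selected by name. */
static const struct sched_ops *registered[SCHED_MAX];
static size_t nr_registered;

/* Look up a scheduler by name; returns NULL when it is not registered. */
static const struct sched_ops *sched_find(const char *name)
{
	assert(name != NULL);

	for (size_t i = 0; i < nr_registered; i++)
		if (strcmp(registered[i]->name, name) == 0)
			return registered[i];
	return NULL;
}

/* Register a scheduler; returns 0 on success, -1 on error. */
static int sched_register(const struct sched_ops *ops)
{
	if (!ops || !ops->get_subflow || nr_registered >= SCHED_MAX)
		return -1;
	if (sched_find(ops->name))
		return -1; /* this name is already taken */
	registered[nr_registered++] = ops;
	return 0;
}

/* Trivial example scheduler: always use the first subflow. */
static int first_get_subflow(int nr_subflows)
{
	return nr_subflows > 0 ? 0 : -1;
}

static const struct sched_ops sched_first = {
	.name        = "first",
	.get_subflow = first_get_subflow,
};
```

With such a table, the default scheduler and a BPF-controlled one would simply be two entries registered under different names, which is one possible answer to the "with and without a BPF program" question, under the assumptions above.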

Issues:

Issues with BPF packet scheduler → Issues with BPF packet scheduler #336

Linked to #350:

Ability to write data in dedicated socket structure: MPTCP and subflow levels → scheduler: area in the socket structures reserved for schedulers #342
New callback to initiate optimisations → scheduler: new callback to initiate optimisations #344
Ability to penalise some subflows (and remove that) → scheduler: "penalise" some subflows by sending less than their cwnd #345
Ability to initiate opportunistic retransmissions → scheduler: implement a "opportunistic retransmission" #332
Ability to (un)mark a subflow as "stale" → scheduler: (un)mark a subflow as "stale" #349
Ability to change the behaviour depending on the backup flag (and start/stop probing if not only managed by the core → scheduler: frequently probe "stale" subflow with reinjected data #348)
BPF selftests: use a dedicated netns for each test, see 02d6a05


matttbe · 2021-04-19T15:19:19Z

@geliangtang I just updated the description following the discussion we had.

geliangtang · 2021-05-15T13:13:23Z

Round-robin packet scheduler support #194

geliangtang · 2021-10-01T00:28:29Z

Hi Matt, I just assigned this issue to myself. I'll try to implement the round-robin scheduler using BPF.

matttbe · 2021-10-07T10:16:44Z

(PS: I don't know if notifications are sent when I move items in the GitHub project, but just in case: I'm moving all assigned tickets from "Future" to "Next". It doesn't mean they have to be implemented for the next version, it just makes tracking easier when generating a changelog ;-) )


matttbe · 2022-09-08T16:07:24Z

Status update:

some patches are already in our 'export' branch
but still in development, e.g. patches

matttbe · 2022-09-19T13:51:24Z

Some feedback from LPC 2022:

BPF development is going to be similar to working on kernel modules, but with help from the verifier and other tooling
Using BPF STRUCT_OPS seems to be the right direction
BPF code depends on the kernel version: unlike the UAPI, it is not an interface exposed to userspace that can never be changed. So we can change the callbacks, kfuncs, etc.
It is possible to mark an API as unstable/stable
There are techniques to have BPF code working on multiple kernels (CO-RE: Compile Once, Run Everywhere) but supporting that might require specific modifications
READ_ONCE(), WRITE_ONCE(), etc. should be supported by BPF: to be tested (but maybe not needed?)
Regarding security (e.g. access to the token), the best approach is to mention it clearly in cover letters
Not all the smart stuff should be done in kfuncs: a "userspace" (BPF) scheduler should be able to iterate over all subflows and take decisions itself, not just ask the kernel to use one mode or another.
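
As an illustration of the last point, here is a rough sketch, again in plain userspace C rather than BPF, of a scheduler that iterates over all subflows and takes the decision itself: a round-robin pick that skips "stale" subflows and only falls back to backup ones. The `struct subflow` layout and the `rr_pick()` helper are assumptions made up for this example; the real socket structures and callbacks are different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one subflow as seen by a scheduler. */
struct subflow {
	bool stale;  /* marked "stale": do not use it for new data */
	bool backup; /* backup flag set: use it only as a last resort */
};

/*
 * Round-robin pick: walk all subflows, starting right after the last
 * used one, and return the index of the first usable one. Non-backup
 * subflows are preferred; a backup subflow is returned only when no
 * non-backup subflow is usable. Returns -1 when nothing is usable.
 *
 * 'last' is the index returned by the previous call, or -1 initially.
 */
static int rr_pick(const struct subflow *sf, int nr, int last)
{
	int fallback = -1;

	assert(nr >= 0 && last >= -1 && last < nr);

	for (int i = 1; i <= nr; i++) {
		int idx = (last + i) % nr;

		if (sf[idx].stale)
			continue;
		if (!sf[idx].backup)
			return idx;
		if (fallback < 0)
			fallback = idx; /* remember the first usable backup */
	}

	return fallback;
}
```

A caller would keep the returned index and pass it back as `last` on the next call, so that consecutive picks alternate between the usable subflows. The policy itself (what "stale" or "backup" means for the choice) lives entirely in the scheduler code, which is the kind of split the feedback above suggests.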

The slides and the video are available online: https://lpc.events/event/16/contributions/1354/

VenkateswaranJ · 2022-11-30T10:07:32Z

Does this task implement the redundant scheduler?

matttbe · 2022-11-30T10:15:00Z

@VenkateswaranJ not yet but it is in development to validate the API, see https://lore.kernel.org/all/cover.1669605531.git.geliang.tang@suse.com/


matttbe · 2023-02-23T11:39:46Z

(I just updated the description to add this: )

Issues:

Issues with BPF packet scheduler → Issues with BPF packet scheduler #336

matttbe · 2023-04-17T12:31:00Z

I just added one item to the TODO list:

BPF selftests: use a dedicated netns for each test, see 02d6a05

geliangtang · 2023-05-31T10:09:31Z

@matttbe Matt, the task "BPF selftests: use a dedicated netns for each test" has been completed and can be closed now.


matttbe added the enhancement label Aug 7, 2020

matttbe mentioned this issue Apr 19, 2021

BPF: path manager #74

Open

matttbe mentioned this issue Sep 30, 2021

Round-robin packet scheduler support #194

Closed

geliangtang self-assigned this Oct 1, 2021

matttbe mentioned this issue May 4, 2022

MPTCP is enabled but not working #272

Closed

matttbe mentioned this issue Aug 31, 2022

Meaning of newly added sysctl fields in upstream kernel #297

Closed

matttbe mentioned this issue Sep 8, 2022

Scheduler, pathmanager, congestion control #300

Closed

matttbe mentioned this issue Jan 16, 2023

Issues with BPF packet scheduler #336

Open

matttbe added the sched packets scheduler label Feb 1, 2023

matttbe mentioned this issue Feb 1, 2023

scheduler: API changes (tasks) #350

Open

8 tasks

matttbe mentioned this issue Apr 21, 2023

Backup function in BPF schedulers #393

Closed

matttbe mentioned this issue Jun 2, 2023

Issues with backup flow #315

Open

geliangtang added the bpf label Aug 4, 2023

matttbe mentioned this issue Sep 6, 2023

Supports redundant backup transmission of multiple subflows #436

Closed

mjmartineau mentioned this issue Dec 8, 2023

Scheduler: add redundant scheduler support in BPF #467

Open

matttbe mentioned this issue Apr 5, 2024

MPTCP Linux Kernel specifics multipath-tcp/mptcp#523

Closed

matttbe mentioned this issue Sep 11, 2024

Round-Robin Packet Scheduler Support #517

Closed

matttbe mentioned this issue Oct 14, 2024

Test and Add MPTCP Schedulers #522

Closed
