-
Notifications
You must be signed in to change notification settings - Fork 58.9k
[CP 5.3.1] Custom update version #721
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @phungmanhcuong! Thanks for your contribution to the Linux kernel! Linux kernel development happens on mailing lists, rather than on GitHub - this GitHub repository is a read-only mirror that isn't used for accepting contributions. So that your change can become part of Linux, please email it to us as a patch. Sending patches isn't quite as simple as sending a pull request, but fortunately it is a well documented process. Here's what to do:
How do I format my contribution?The Linux kernel community is notoriously picky about how contributions are formatted and sent. Fortunately, they have documented their expectations. Firstly, all contributions need to be formatted as patches. A patch is a plain text document showing the change you want to make to the code, and documenting why it is a good idea. You can create patches with Secondly, patches need 'commit messages', which is the human-friendly documentation explaining what the change is and why it's necessary. Thirdly, changes have some technical requirements. There is a Linux kernel coding style, and there are licensing requirements you need to comply with. Both of these are documented in the Submitting Patches documentation that is part of the kernel. Note that you will almost certainly have to modify your existing git commits to satisfy these requirements. Don't worry: there are many guides on the internet for doing this. Where do I send my contribution?The Linux kernel is composed of a number of subsystems. These subsystems are maintained by different people, and have different mailing lists where they discuss proposed changes. If you don't already know what subsystem your change belongs to, the
Make sure that your list of recipients includes a mailing list. If you can't find a more specific mailing list, then LKML - the Linux Kernel Mailing List - is the place to send your patches. It's not usually necessary to subscribe to the mailing list before you send the patches, but if you're interested in kernel development, subscribing to a subsystem mailing list is a good idea. (At this point, you probably don't need to subscribe to LKML - it is a very high traffic list with about a thousand messages per day, which is often not useful for beginners.) How do I send my contribution?Use For more information about using How do I get help if I'm stuck?Firstly, don't get discouraged! There are an enormous number of resources on the internet, and many kernel developers who would like to see you succeed. Many issues - especially about how to use certain tools - can be resolved by using your favourite internet search engine. If you can't find an answer, there are a few places you can turn:
If you get really, really stuck, you could try the owners of this bot, @daxtens and @ajdlinux. Please be aware that we do have full-time jobs, so we are almost certainly the slowest way to get answers! I sent my patch - now what?You wait. You can check that your email has been received by checking the mailing list archives for the mailing list you sent your patch to. Messages may not be received instantly, so be patient. Kernel developers are generally very busy people, so it may take a few weeks before your patch is looked at. Then, you keep waiting. Three things may happen:
Further information
Happy hacking! This message was posted by a bot - if you have any questions or suggestions, please talk to my owners, @ajdlinux and @daxtens, or raise an issue at https://github.com/ajdlinux/KernelPRBot. |
Recent changes that removed rtnl dependency from rules update path of tc also made tcf_block_put() function sleeping. This function is called from ops->destroy() of several Qdisc implementations, which in turn is called by qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock, which results sleeping-while-atomic BUG. Steps to reproduce for htb: tc qdisc add dev ens1f0 root handle 1: htb default 12 tc class add dev ens1f0 parent 1: classid 1:1 htb rate 100kbps ceil 100kbps tc qdisc add dev ens1f0 parent 1:1 handle 40: sfq perturb 10 tc class add dev ens1f0 parent 1:1 classid 1:2 htb rate 100kbps ceil 100kbps Resulting dmesg: [ 4791.148551] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909 [ 4791.151354] in_atomic(): 1, irqs_disabled(): 0, pid: 27273, name: tc [ 4791.152805] INFO: lockdep is turned off. [ 4791.153605] CPU: 19 PID: 27273 Comm: tc Tainted: G W 5.3.0-rc8+ torvalds#721 [ 4791.154336] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 4791.155075] Call Trace: [ 4791.155803] dump_stack+0x85/0xc0 [ 4791.156529] ___might_sleep.cold+0xac/0xbc [ 4791.157251] __mutex_lock+0x5b/0x960 [ 4791.157966] ? console_unlock+0x363/0x5d0 [ 4791.158676] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 4791.159395] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 4791.160103] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 4791.160815] tcf_block_put_ext.part.0+0x21/0x50 [ 4791.161530] tcf_block_put+0x50/0x70 [ 4791.162233] sfq_destroy+0x15/0x50 [sch_sfq] [ 4791.162936] qdisc_destroy+0x5f/0x160 [ 4791.163642] htb_change_class.cold+0x5df/0x69d [sch_htb] [ 4791.164505] tc_ctl_tclass+0x19d/0x480 [ 4791.165360] rtnetlink_rcv_msg+0x170/0x4b0 [ 4791.166191] ? netlink_deliver_tap+0x95/0x400 [ 4791.166907] ? rtnl_dellink+0x2d0/0x2d0 [ 4791.167625] netlink_rcv_skb+0x49/0x110 [ 4791.168345] netlink_unicast+0x171/0x200 [ 4791.169058] netlink_sendmsg+0x224/0x3f0 [ 4791.169771] sock_sendmsg+0x5e/0x60 [ 4791.170475] ___sys_sendmsg+0x2ae/0x330 [ 4791.171183] ? ___sys_recvmsg+0x159/0x1f0 [ 4791.171894] ? do_wp_page+0x9c/0x790 [ 4791.172595] ? __handle_mm_fault+0xcd3/0x19e0 [ 4791.173309] __sys_sendmsg+0x59/0xa0 [ 4791.174024] do_syscall_64+0x5c/0xb0 [ 4791.174725] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 4791.175435] RIP: 0033:0x7f0aa41497b8 [ 4791.176129] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 4791.177532] RSP: 002b:00007fff4e37d588 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 4791.178243] RAX: ffffffffffffffda RBX: 000000005d8132f7 RCX: 00007f0aa41497b8 [ 4791.178947] RDX: 0000000000000000 RSI: 00007fff4e37d5f0 RDI: 0000000000000003 [ 4791.179662] RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000020149a0 [ 4791.180382] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 4791.181100] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000 Extend sch API with new qdisc_put_empty() function that has same implementation as regular qdisc_put() but skips parts that reset qdisc and free all packet buffers from gso_skb and skb_bad_txq queues. In htb_change_class() function save parent->leaf.q to local temporary variable and put reference to it after sch tree lock is released in order not to call potentially sleeping cls API in atomic section. This is safe to do because Qdisc has already been reset by qdisc_purge_queue() inside sch tree lock critical section. Fixes: c266f64 ("net: sched: protect block state with mutex") Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Recent changes that removed rtnl dependency from rules update path of tc also made tcf_block_put() function sleeping. This function is called from ops->destroy() of several Qdisc implementations, which in turn is called by qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock, which results sleeping-while-atomic BUG. Steps to reproduce for multiq: tc qdisc add dev ens1f0 root handle 1: multiq tc qdisc add dev ens1f0 parent 1:10 handle 50: sfq perturb 10 ethtool -L ens1f0 combined 2 tc qdisc change dev ens1f0 root handle 1: multiq Resulting dmesg: [ 5539.419344] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909 [ 5539.420945] in_atomic(): 1, irqs_disabled(): 0, pid: 27658, name: tc [ 5539.422435] INFO: lockdep is turned off. [ 5539.423904] CPU: 21 PID: 27658 Comm: tc Tainted: G W 5.3.0-rc8+ torvalds#721 [ 5539.425400] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 5539.426911] Call Trace: [ 5539.428380] dump_stack+0x85/0xc0 [ 5539.429823] ___might_sleep.cold+0xac/0xbc [ 5539.431262] __mutex_lock+0x5b/0x960 [ 5539.432682] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 5539.434103] ? __nla_validate_parse+0x51/0x840 [ 5539.435493] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 5539.436903] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 5539.438327] tcf_block_put_ext.part.0+0x21/0x50 [ 5539.439752] tcf_block_put+0x50/0x70 [ 5539.441165] sfq_destroy+0x15/0x50 [sch_sfq] [ 5539.442570] qdisc_destroy+0x5f/0x160 [ 5539.444000] multiq_tune+0x14a/0x420 [sch_multiq] [ 5539.445421] tc_modify_qdisc+0x324/0x840 [ 5539.446841] rtnetlink_rcv_msg+0x170/0x4b0 [ 5539.448269] ? netlink_deliver_tap+0x95/0x400 [ 5539.449691] ? rtnl_dellink+0x2d0/0x2d0 [ 5539.451116] netlink_rcv_skb+0x49/0x110 [ 5539.452522] netlink_unicast+0x171/0x200 [ 5539.453914] netlink_sendmsg+0x224/0x3f0 [ 5539.455304] sock_sendmsg+0x5e/0x60 [ 5539.456686] ___sys_sendmsg+0x2ae/0x330 [ 5539.458071] ? ___sys_recvmsg+0x159/0x1f0 [ 5539.459461] ? do_wp_page+0x9c/0x790 [ 5539.460846] ? __handle_mm_fault+0xcd3/0x19e0 [ 5539.462263] __sys_sendmsg+0x59/0xa0 [ 5539.463661] do_syscall_64+0x5c/0xb0 [ 5539.465044] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 5539.466454] RIP: 0033:0x7f1fe08177b8 [ 5539.467863] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 5539.470906] RSP: 002b:00007ffe812de5d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 5539.472483] RAX: ffffffffffffffda RBX: 000000005d8135e3 RCX: 00007f1fe08177b8 [ 5539.474069] RDX: 0000000000000000 RSI: 00007ffe812de640 RDI: 0000000000000003 [ 5539.475655] RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000182e9b0 [ 5539.477203] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 5539.478699] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000 Rearrange locking in multiq_tune() in following ways: - In loop that removes Qdiscs from disabled queues, call qdisc_purge_queue() instead of qdisc_tree_flush_backlog() on Qdisc that is being destroyed. Save the Qdisc in temporary allocated array and call qdisc_put_empty() on each element of the array after sch tree lock is released. This is safe to do because Qdiscs have already been reset by qdisc_purge_queue() inside sch tree lock critical section. - Do the same change for second loop that initializes Qdiscs for newly enabled queues in multiq_tune() function. Since sch tree lock is obtained and released on each iteration of this loop, just call qdisc_put_empty() directly outside of critical section. Don't verify that old Qdisc is not noop_qdisc before releasing reference to it because such check is already performed by qdisc_put*() functions. Fixes: c266f64 ("net: sched: protect block state with mutex") Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Recent changes that removed rtnl dependency from rules update path of tc also made tcf_block_put() function sleeping. This function is called from ops->destroy() of several Qdisc implementations, which in turn is called by qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock, which results sleeping-while-atomic BUG. Steps to reproduce for sfb: tc qdisc add dev ens1f0 handle 1: root sfb tc qdisc add dev ens1f0 parent 1:10 handle 50: sfq perturb 10 tc qdisc change dev ens1f0 root handle 1: sfb Resulting dmesg: [ 7265.938717] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909 [ 7265.940152] in_atomic(): 1, irqs_disabled(): 0, pid: 28579, name: tc [ 7265.941455] INFO: lockdep is turned off. [ 7265.942744] CPU: 11 PID: 28579 Comm: tc Tainted: G W 5.3.0-rc8+ torvalds#721 [ 7265.944065] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 7265.945396] Call Trace: [ 7265.946709] dump_stack+0x85/0xc0 [ 7265.947994] ___might_sleep.cold+0xac/0xbc [ 7265.949282] __mutex_lock+0x5b/0x960 [ 7265.950543] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.951803] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.953022] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.954248] tcf_block_put_ext.part.0+0x21/0x50 [ 7265.955478] tcf_block_put+0x50/0x70 [ 7265.956694] sfq_destroy+0x15/0x50 [sch_sfq] [ 7265.957898] qdisc_destroy+0x5f/0x160 [ 7265.959099] sfb_change+0x175/0x330 [sch_sfb] [ 7265.960304] tc_modify_qdisc+0x324/0x840 [ 7265.961503] rtnetlink_rcv_msg+0x170/0x4b0 [ 7265.962692] ? netlink_deliver_tap+0x95/0x400 [ 7265.963876] ? rtnl_dellink+0x2d0/0x2d0 [ 7265.965064] netlink_rcv_skb+0x49/0x110 [ 7265.966251] netlink_unicast+0x171/0x200 [ 7265.967427] netlink_sendmsg+0x224/0x3f0 [ 7265.968595] sock_sendmsg+0x5e/0x60 [ 7265.969753] ___sys_sendmsg+0x2ae/0x330 [ 7265.970916] ? ___sys_recvmsg+0x159/0x1f0 [ 7265.972074] ? do_wp_page+0x9c/0x790 [ 7265.973233] ? __handle_mm_fault+0xcd3/0x19e0 [ 7265.974407] __sys_sendmsg+0x59/0xa0 [ 7265.975591] do_syscall_64+0x5c/0xb0 [ 7265.976753] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 7265.977938] RIP: 0033:0x7f229069f7b8 [ 7265.979117] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 7265.981681] RSP: 002b:00007ffd7ed2d158 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 7265.983001] RAX: ffffffffffffffda RBX: 000000005d813ca1 RCX: 00007f229069f7b8 [ 7265.984336] RDX: 0000000000000000 RSI: 00007ffd7ed2d1c0 RDI: 0000000000000003 [ 7265.985682] RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000165c9a0 [ 7265.987021] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 7265.988309] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000 In sfb_change() function use qdisc_purge_queue() instead of qdisc_tree_flush_backlog() to properly reset old child Qdisc and save pointer to it into local temporary variable. Put reference to Qdisc after sch tree lock is released in order not to call potentially sleeping cls API in atomic section. This is safe to do because Qdisc has already been reset by qdisc_purge_queue() inside sch tree lock critical section. Reported-by: syzbot+ac54455281db908c581e@syzkaller.appspotmail.com Fixes: c266f64 ("net: sched: protect block state with mutex") Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Recent changes that removed rtnl dependency from rules update path of tc also made tcf_block_put() function sleeping. This function is called from ops->destroy() of several Qdisc implementations, which in turn is called by qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock, which results sleeping-while-atomic BUG. Steps to reproduce for htb: tc qdisc add dev ens1f0 root handle 1: htb default 12 tc class add dev ens1f0 parent 1: classid 1:1 htb rate 100kbps ceil 100kbps tc qdisc add dev ens1f0 parent 1:1 handle 40: sfq perturb 10 tc class add dev ens1f0 parent 1:1 classid 1:2 htb rate 100kbps ceil 100kbps Resulting dmesg: [ 4791.148551] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909 [ 4791.151354] in_atomic(): 1, irqs_disabled(): 0, pid: 27273, name: tc [ 4791.152805] INFO: lockdep is turned off. [ 4791.153605] CPU: 19 PID: 27273 Comm: tc Tainted: G W 5.3.0-rc8+ torvalds#721 [ 4791.154336] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 4791.155075] Call Trace: [ 4791.155803] dump_stack+0x85/0xc0 [ 4791.156529] ___might_sleep.cold+0xac/0xbc [ 4791.157251] __mutex_lock+0x5b/0x960 [ 4791.157966] ? console_unlock+0x363/0x5d0 [ 4791.158676] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 4791.159395] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 4791.160103] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 4791.160815] tcf_block_put_ext.part.0+0x21/0x50 [ 4791.161530] tcf_block_put+0x50/0x70 [ 4791.162233] sfq_destroy+0x15/0x50 [sch_sfq] [ 4791.162936] qdisc_destroy+0x5f/0x160 [ 4791.163642] htb_change_class.cold+0x5df/0x69d [sch_htb] [ 4791.164505] tc_ctl_tclass+0x19d/0x480 [ 4791.165360] rtnetlink_rcv_msg+0x170/0x4b0 [ 4791.166191] ? netlink_deliver_tap+0x95/0x400 [ 4791.166907] ? rtnl_dellink+0x2d0/0x2d0 [ 4791.167625] netlink_rcv_skb+0x49/0x110 [ 4791.168345] netlink_unicast+0x171/0x200 [ 4791.169058] netlink_sendmsg+0x224/0x3f0 [ 4791.169771] sock_sendmsg+0x5e/0x60 [ 4791.170475] ___sys_sendmsg+0x2ae/0x330 [ 4791.171183] ? ___sys_recvmsg+0x159/0x1f0 [ 4791.171894] ? do_wp_page+0x9c/0x790 [ 4791.172595] ? __handle_mm_fault+0xcd3/0x19e0 [ 4791.173309] __sys_sendmsg+0x59/0xa0 [ 4791.174024] do_syscall_64+0x5c/0xb0 [ 4791.174725] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 4791.175435] RIP: 0033:0x7f0aa41497b8 [ 4791.176129] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 4791.177532] RSP: 002b:00007fff4e37d588 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 4791.178243] RAX: ffffffffffffffda RBX: 000000005d8132f7 RCX: 00007f0aa41497b8 [ 4791.178947] RDX: 0000000000000000 RSI: 00007fff4e37d5f0 RDI: 0000000000000003 [ 4791.179662] RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000020149a0 [ 4791.180382] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 4791.181100] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000 In htb_change_class() function save parent->leaf.q to local temporary variable and put reference to it after sch tree lock is released in order not to call potentially sleeping cls API in atomic section. This is safe to do because Qdisc has already been reset by qdisc_purge_queue() inside sch tree lock critical section. Fixes: c266f64 ("net: sched: protect block state with mutex") Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Recent changes that removed rtnl dependency from rules update path of tc also made tcf_block_put() function sleeping. This function is called from ops->destroy() of several Qdisc implementations, which in turn is called by qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock, which results sleeping-while-atomic BUG. Steps to reproduce for multiq: tc qdisc add dev ens1f0 root handle 1: multiq tc qdisc add dev ens1f0 parent 1:10 handle 50: sfq perturb 10 ethtool -L ens1f0 combined 2 tc qdisc change dev ens1f0 root handle 1: multiq Resulting dmesg: [ 5539.419344] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909 [ 5539.420945] in_atomic(): 1, irqs_disabled(): 0, pid: 27658, name: tc [ 5539.422435] INFO: lockdep is turned off. [ 5539.423904] CPU: 21 PID: 27658 Comm: tc Tainted: G W 5.3.0-rc8+ torvalds#721 [ 5539.425400] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 5539.426911] Call Trace: [ 5539.428380] dump_stack+0x85/0xc0 [ 5539.429823] ___might_sleep.cold+0xac/0xbc [ 5539.431262] __mutex_lock+0x5b/0x960 [ 5539.432682] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 5539.434103] ? __nla_validate_parse+0x51/0x840 [ 5539.435493] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 5539.436903] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 5539.438327] tcf_block_put_ext.part.0+0x21/0x50 [ 5539.439752] tcf_block_put+0x50/0x70 [ 5539.441165] sfq_destroy+0x15/0x50 [sch_sfq] [ 5539.442570] qdisc_destroy+0x5f/0x160 [ 5539.444000] multiq_tune+0x14a/0x420 [sch_multiq] [ 5539.445421] tc_modify_qdisc+0x324/0x840 [ 5539.446841] rtnetlink_rcv_msg+0x170/0x4b0 [ 5539.448269] ? netlink_deliver_tap+0x95/0x400 [ 5539.449691] ? rtnl_dellink+0x2d0/0x2d0 [ 5539.451116] netlink_rcv_skb+0x49/0x110 [ 5539.452522] netlink_unicast+0x171/0x200 [ 5539.453914] netlink_sendmsg+0x224/0x3f0 [ 5539.455304] sock_sendmsg+0x5e/0x60 [ 5539.456686] ___sys_sendmsg+0x2ae/0x330 [ 5539.458071] ? ___sys_recvmsg+0x159/0x1f0 [ 5539.459461] ? do_wp_page+0x9c/0x790 [ 5539.460846] ? __handle_mm_fault+0xcd3/0x19e0 [ 5539.462263] __sys_sendmsg+0x59/0xa0 [ 5539.463661] do_syscall_64+0x5c/0xb0 [ 5539.465044] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 5539.466454] RIP: 0033:0x7f1fe08177b8 [ 5539.467863] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 5539.470906] RSP: 002b:00007ffe812de5d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 5539.472483] RAX: ffffffffffffffda RBX: 000000005d8135e3 RCX: 00007f1fe08177b8 [ 5539.474069] RDX: 0000000000000000 RSI: 00007ffe812de640 RDI: 0000000000000003 [ 5539.475655] RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000182e9b0 [ 5539.477203] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 5539.478699] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000 Rearrange locking in multiq_tune() in following ways: - In loop that removes Qdiscs from disabled queues, call qdisc_purge_queue() instead of qdisc_tree_flush_backlog() on Qdisc that is being destroyed. Save the Qdisc in temporary allocated array and call qdisc_put() on each element of the array after sch tree lock is released. This is safe to do because Qdiscs have already been reset by qdisc_purge_queue() inside sch tree lock critical section. - Do the same change for second loop that initializes Qdiscs for newly enabled queues in multiq_tune() function. Since sch tree lock is obtained and released on each iteration of this loop, just call qdisc_put() directly outside of critical section. Don't verify that old Qdisc is not noop_qdisc before releasing reference to it because such check is already performed by qdisc_put*() functions. Fixes: c266f64 ("net: sched: protect block state with mutex") Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Recent changes that removed rtnl dependency from rules update path of tc also made tcf_block_put() function sleeping. This function is called from ops->destroy() of several Qdisc implementations, which in turn is called by qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock, which results sleeping-while-atomic BUG. Steps to reproduce for sfb: tc qdisc add dev ens1f0 handle 1: root sfb tc qdisc add dev ens1f0 parent 1:10 handle 50: sfq perturb 10 tc qdisc change dev ens1f0 root handle 1: sfb Resulting dmesg: [ 7265.938717] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909 [ 7265.940152] in_atomic(): 1, irqs_disabled(): 0, pid: 28579, name: tc [ 7265.941455] INFO: lockdep is turned off. [ 7265.942744] CPU: 11 PID: 28579 Comm: tc Tainted: G W 5.3.0-rc8+ torvalds#721 [ 7265.944065] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 7265.945396] Call Trace: [ 7265.946709] dump_stack+0x85/0xc0 [ 7265.947994] ___might_sleep.cold+0xac/0xbc [ 7265.949282] __mutex_lock+0x5b/0x960 [ 7265.950543] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.951803] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.953022] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.954248] tcf_block_put_ext.part.0+0x21/0x50 [ 7265.955478] tcf_block_put+0x50/0x70 [ 7265.956694] sfq_destroy+0x15/0x50 [sch_sfq] [ 7265.957898] qdisc_destroy+0x5f/0x160 [ 7265.959099] sfb_change+0x175/0x330 [sch_sfb] [ 7265.960304] tc_modify_qdisc+0x324/0x840 [ 7265.961503] rtnetlink_rcv_msg+0x170/0x4b0 [ 7265.962692] ? netlink_deliver_tap+0x95/0x400 [ 7265.963876] ? rtnl_dellink+0x2d0/0x2d0 [ 7265.965064] netlink_rcv_skb+0x49/0x110 [ 7265.966251] netlink_unicast+0x171/0x200 [ 7265.967427] netlink_sendmsg+0x224/0x3f0 [ 7265.968595] sock_sendmsg+0x5e/0x60 [ 7265.969753] ___sys_sendmsg+0x2ae/0x330 [ 7265.970916] ? ___sys_recvmsg+0x159/0x1f0 [ 7265.972074] ? do_wp_page+0x9c/0x790 [ 7265.973233] ? __handle_mm_fault+0xcd3/0x19e0 [ 7265.974407] __sys_sendmsg+0x59/0xa0 [ 7265.975591] do_syscall_64+0x5c/0xb0 [ 7265.976753] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 7265.977938] RIP: 0033:0x7f229069f7b8 [ 7265.979117] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 7265.981681] RSP: 002b:00007ffd7ed2d158 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 7265.983001] RAX: ffffffffffffffda RBX: 000000005d813ca1 RCX: 00007f229069f7b8 [ 7265.984336] RDX: 0000000000000000 RSI: 00007ffd7ed2d1c0 RDI: 0000000000000003 [ 7265.985682] RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000165c9a0 [ 7265.987021] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 7265.988309] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000 In sfb_change() function use qdisc_purge_queue() instead of qdisc_tree_flush_backlog() to properly reset old child Qdisc and save pointer to it into local temporary variable. Put reference to Qdisc after sch tree lock is released in order not to call potentially sleeping cls API in atomic section. This is safe to do because Qdisc has already been reset by qdisc_purge_queue() inside sch tree lock critical section. Reported-by: syzbot+ac54455281db908c581e@syzkaller.appspotmail.com Fixes: c266f64 ("net: sched: protect block state with mutex") Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Recent changes that removed rtnl dependency from rules update path of tc also made tcf_block_put() function sleeping. This function is called from ops->destroy() of several Qdisc implementations, which in turn is called by qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock, which results sleeping-while-atomic BUG. Steps to reproduce for htb: tc qdisc add dev ens1f0 root handle 1: htb default 12 tc class add dev ens1f0 parent 1: classid 1:1 htb rate 100kbps ceil 100kbps tc qdisc add dev ens1f0 parent 1:1 handle 40: sfq perturb 10 tc class add dev ens1f0 parent 1:1 classid 1:2 htb rate 100kbps ceil 100kbps Resulting dmesg: [ 4791.148551] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909 [ 4791.151354] in_atomic(): 1, irqs_disabled(): 0, pid: 27273, name: tc [ 4791.152805] INFO: lockdep is turned off. [ 4791.153605] CPU: 19 PID: 27273 Comm: tc Tainted: G W 5.3.0-rc8+ #721 [ 4791.154336] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 4791.155075] Call Trace: [ 4791.155803] dump_stack+0x85/0xc0 [ 4791.156529] ___might_sleep.cold+0xac/0xbc [ 4791.157251] __mutex_lock+0x5b/0x960 [ 4791.157966] ? console_unlock+0x363/0x5d0 [ 4791.158676] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 4791.159395] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 4791.160103] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 4791.160815] tcf_block_put_ext.part.0+0x21/0x50 [ 4791.161530] tcf_block_put+0x50/0x70 [ 4791.162233] sfq_destroy+0x15/0x50 [sch_sfq] [ 4791.162936] qdisc_destroy+0x5f/0x160 [ 4791.163642] htb_change_class.cold+0x5df/0x69d [sch_htb] [ 4791.164505] tc_ctl_tclass+0x19d/0x480 [ 4791.165360] rtnetlink_rcv_msg+0x170/0x4b0 [ 4791.166191] ? netlink_deliver_tap+0x95/0x400 [ 4791.166907] ? rtnl_dellink+0x2d0/0x2d0 [ 4791.167625] netlink_rcv_skb+0x49/0x110 [ 4791.168345] netlink_unicast+0x171/0x200 [ 4791.169058] netlink_sendmsg+0x224/0x3f0 [ 4791.169771] sock_sendmsg+0x5e/0x60 [ 4791.170475] ___sys_sendmsg+0x2ae/0x330 [ 4791.171183] ? ___sys_recvmsg+0x159/0x1f0 [ 4791.171894] ? do_wp_page+0x9c/0x790 [ 4791.172595] ? __handle_mm_fault+0xcd3/0x19e0 [ 4791.173309] __sys_sendmsg+0x59/0xa0 [ 4791.174024] do_syscall_64+0x5c/0xb0 [ 4791.174725] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 4791.175435] RIP: 0033:0x7f0aa41497b8 [ 4791.176129] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 4791.177532] RSP: 002b:00007fff4e37d588 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 4791.178243] RAX: ffffffffffffffda RBX: 000000005d8132f7 RCX: 00007f0aa41497b8 [ 4791.178947] RDX: 0000000000000000 RSI: 00007fff4e37d5f0 RDI: 0000000000000003 [ 4791.179662] RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000020149a0 [ 4791.180382] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 4791.181100] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000 In htb_change_class() function save parent->leaf.q to local temporary variable and put reference to it after sch tree lock is released in order not to call potentially sleeping cls API in atomic section. This is safe to do because Qdisc has already been reset by qdisc_purge_queue() inside sch tree lock critical section. Fixes: c266f64 ("net: sched: protect block state with mutex") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Recent changes that removed rtnl dependency from rules update path of tc also made tcf_block_put() function sleeping. This function is called from ops->destroy() of several Qdisc implementations, which in turn is called by qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock, which results sleeping-while-atomic BUG. Steps to reproduce for multiq: tc qdisc add dev ens1f0 root handle 1: multiq tc qdisc add dev ens1f0 parent 1:10 handle 50: sfq perturb 10 ethtool -L ens1f0 combined 2 tc qdisc change dev ens1f0 root handle 1: multiq Resulting dmesg: [ 5539.419344] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909 [ 5539.420945] in_atomic(): 1, irqs_disabled(): 0, pid: 27658, name: tc [ 5539.422435] INFO: lockdep is turned off. [ 5539.423904] CPU: 21 PID: 27658 Comm: tc Tainted: G W 5.3.0-rc8+ #721 [ 5539.425400] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 5539.426911] Call Trace: [ 5539.428380] dump_stack+0x85/0xc0 [ 5539.429823] ___might_sleep.cold+0xac/0xbc [ 5539.431262] __mutex_lock+0x5b/0x960 [ 5539.432682] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 5539.434103] ? __nla_validate_parse+0x51/0x840 [ 5539.435493] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 5539.436903] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 5539.438327] tcf_block_put_ext.part.0+0x21/0x50 [ 5539.439752] tcf_block_put+0x50/0x70 [ 5539.441165] sfq_destroy+0x15/0x50 [sch_sfq] [ 5539.442570] qdisc_destroy+0x5f/0x160 [ 5539.444000] multiq_tune+0x14a/0x420 [sch_multiq] [ 5539.445421] tc_modify_qdisc+0x324/0x840 [ 5539.446841] rtnetlink_rcv_msg+0x170/0x4b0 [ 5539.448269] ? netlink_deliver_tap+0x95/0x400 [ 5539.449691] ? rtnl_dellink+0x2d0/0x2d0 [ 5539.451116] netlink_rcv_skb+0x49/0x110 [ 5539.452522] netlink_unicast+0x171/0x200 [ 5539.453914] netlink_sendmsg+0x224/0x3f0 [ 5539.455304] sock_sendmsg+0x5e/0x60 [ 5539.456686] ___sys_sendmsg+0x2ae/0x330 [ 5539.458071] ? ___sys_recvmsg+0x159/0x1f0 [ 5539.459461] ? do_wp_page+0x9c/0x790 [ 5539.460846] ? __handle_mm_fault+0xcd3/0x19e0 [ 5539.462263] __sys_sendmsg+0x59/0xa0 [ 5539.463661] do_syscall_64+0x5c/0xb0 [ 5539.465044] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 5539.466454] RIP: 0033:0x7f1fe08177b8 [ 5539.467863] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 5539.470906] RSP: 002b:00007ffe812de5d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 5539.472483] RAX: ffffffffffffffda RBX: 000000005d8135e3 RCX: 00007f1fe08177b8 [ 5539.474069] RDX: 0000000000000000 RSI: 00007ffe812de640 RDI: 0000000000000003 [ 5539.475655] RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000182e9b0 [ 5539.477203] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 5539.478699] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000 Rearrange locking in multiq_tune() in following ways: - In loop that removes Qdiscs from disabled queues, call qdisc_purge_queue() instead of qdisc_tree_flush_backlog() on Qdisc that is being destroyed. Save the Qdisc in temporary allocated array and call qdisc_put() on each element of the array after sch tree lock is released. This is safe to do because Qdiscs have already been reset by qdisc_purge_queue() inside sch tree lock critical section. - Do the same change for second loop that initializes Qdiscs for newly enabled queues in multiq_tune() function. Since sch tree lock is obtained and released on each iteration of this loop, just call qdisc_put() directly outside of critical section. Don't verify that old Qdisc is not noop_qdisc before releasing reference to it because such check is already performed by qdisc_put*() functions. Fixes: c266f64 ("net: sched: protect block state with mutex") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Recent changes that removed rtnl dependency from rules update path of tc also made tcf_block_put() function sleeping. This function is called from ops->destroy() of several Qdisc implementations, which in turn is called by qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock, which results sleeping-while-atomic BUG. Steps to reproduce for sfb: tc qdisc add dev ens1f0 handle 1: root sfb tc qdisc add dev ens1f0 parent 1:10 handle 50: sfq perturb 10 tc qdisc change dev ens1f0 root handle 1: sfb Resulting dmesg: [ 7265.938717] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909 [ 7265.940152] in_atomic(): 1, irqs_disabled(): 0, pid: 28579, name: tc [ 7265.941455] INFO: lockdep is turned off. [ 7265.942744] CPU: 11 PID: 28579 Comm: tc Tainted: G W 5.3.0-rc8+ #721 [ 7265.944065] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 7265.945396] Call Trace: [ 7265.946709] dump_stack+0x85/0xc0 [ 7265.947994] ___might_sleep.cold+0xac/0xbc [ 7265.949282] __mutex_lock+0x5b/0x960 [ 7265.950543] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.951803] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.953022] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.954248] tcf_block_put_ext.part.0+0x21/0x50 [ 7265.955478] tcf_block_put+0x50/0x70 [ 7265.956694] sfq_destroy+0x15/0x50 [sch_sfq] [ 7265.957898] qdisc_destroy+0x5f/0x160 [ 7265.959099] sfb_change+0x175/0x330 [sch_sfb] [ 7265.960304] tc_modify_qdisc+0x324/0x840 [ 7265.961503] rtnetlink_rcv_msg+0x170/0x4b0 [ 7265.962692] ? netlink_deliver_tap+0x95/0x400 [ 7265.963876] ? rtnl_dellink+0x2d0/0x2d0 [ 7265.965064] netlink_rcv_skb+0x49/0x110 [ 7265.966251] netlink_unicast+0x171/0x200 [ 7265.967427] netlink_sendmsg+0x224/0x3f0 [ 7265.968595] sock_sendmsg+0x5e/0x60 [ 7265.969753] ___sys_sendmsg+0x2ae/0x330 [ 7265.970916] ? ___sys_recvmsg+0x159/0x1f0 [ 7265.972074] ? do_wp_page+0x9c/0x790 [ 7265.973233] ? __handle_mm_fault+0xcd3/0x19e0 [ 7265.974407] __sys_sendmsg+0x59/0xa0 [ 7265.975591] do_syscall_64+0x5c/0xb0 [ 7265.976753] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 7265.977938] RIP: 0033:0x7f229069f7b8 [ 7265.979117] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 7265.981681] RSP: 002b:00007ffd7ed2d158 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 7265.983001] RAX: ffffffffffffffda RBX: 000000005d813ca1 RCX: 00007f229069f7b8 [ 7265.984336] RDX: 0000000000000000 RSI: 00007ffd7ed2d1c0 RDI: 0000000000000003 [ 7265.985682] RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000165c9a0 [ 7265.987021] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 7265.988309] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000 In sfb_change() function use qdisc_purge_queue() instead of qdisc_tree_flush_backlog() to properly reset old child Qdisc and save pointer to it into local temporary variable. Put reference to Qdisc after sch tree lock is released in order not to call potentially sleeping cls API in atomic section. This is safe to do because Qdisc has already been reset by qdisc_purge_queue() inside sch tree lock critical section. Reported-by: syzbot+ac54455281db908c581e@syzkaller.appspotmail.com Fixes: c266f64 ("net: sched: protect block state with mutex") Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Recent changes that removed rtnl dependency from rules update path of tc also made tcf_block_put() function sleeping. This function is called from ops->destroy() of several Qdisc implementations, which in turn is called by qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock, which results sleeping-while-atomic BUG. Steps to reproduce for htb: tc qdisc add dev ens1f0 root handle 1: htb default 12 tc class add dev ens1f0 parent 1: classid 1:1 htb rate 100kbps ceil 100kbps tc qdisc add dev ens1f0 parent 1:1 handle 40: sfq perturb 10 tc class add dev ens1f0 parent 1:1 classid 1:2 htb rate 100kbps ceil 100kbps Resulting dmesg: [ 4791.148551] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909 [ 4791.151354] in_atomic(): 1, irqs_disabled(): 0, pid: 27273, name: tc [ 4791.152805] INFO: lockdep is turned off. [ 4791.153605] CPU: 19 PID: 27273 Comm: tc Tainted: G W 5.3.0-rc8+ torvalds#721 [ 4791.154336] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 4791.155075] Call Trace: [ 4791.155803] dump_stack+0x85/0xc0 [ 4791.156529] ___might_sleep.cold+0xac/0xbc [ 4791.157251] __mutex_lock+0x5b/0x960 [ 4791.157966] ? console_unlock+0x363/0x5d0 [ 4791.158676] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 4791.159395] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 4791.160103] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 4791.160815] tcf_block_put_ext.part.0+0x21/0x50 [ 4791.161530] tcf_block_put+0x50/0x70 [ 4791.162233] sfq_destroy+0x15/0x50 [sch_sfq] [ 4791.162936] qdisc_destroy+0x5f/0x160 [ 4791.163642] htb_change_class.cold+0x5df/0x69d [sch_htb] [ 4791.164505] tc_ctl_tclass+0x19d/0x480 [ 4791.165360] rtnetlink_rcv_msg+0x170/0x4b0 [ 4791.166191] ? netlink_deliver_tap+0x95/0x400 [ 4791.166907] ? rtnl_dellink+0x2d0/0x2d0 [ 4791.167625] netlink_rcv_skb+0x49/0x110 [ 4791.168345] netlink_unicast+0x171/0x200 [ 4791.169058] netlink_sendmsg+0x224/0x3f0 [ 4791.169771] sock_sendmsg+0x5e/0x60 [ 4791.170475] ___sys_sendmsg+0x2ae/0x330 [ 4791.171183] ? ___sys_recvmsg+0x159/0x1f0 [ 4791.171894] ? do_wp_page+0x9c/0x790 [ 4791.172595] ? __handle_mm_fault+0xcd3/0x19e0 [ 4791.173309] __sys_sendmsg+0x59/0xa0 [ 4791.174024] do_syscall_64+0x5c/0xb0 [ 4791.174725] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 4791.175435] RIP: 0033:0x7f0aa41497b8 [ 4791.176129] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 4791.177532] RSP: 002b:00007fff4e37d588 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 4791.178243] RAX: ffffffffffffffda RBX: 000000005d8132f7 RCX: 00007f0aa41497b8 [ 4791.178947] RDX: 0000000000000000 RSI: 00007fff4e37d5f0 RDI: 0000000000000003 [ 4791.179662] RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000020149a0 [ 4791.180382] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 4791.181100] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000 In htb_change_class() function save parent->leaf.q to local temporary variable and put reference to it after sch tree lock is released in order not to call potentially sleeping cls API in atomic section. This is safe to do because Qdisc has already been reset by qdisc_purge_queue() inside sch tree lock critical section. Fixes: c266f64 ("net: sched: protect block state with mutex") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Recent changes that removed rtnl dependency from rules update path of tc also made tcf_block_put() function sleeping. This function is called from ops->destroy() of several Qdisc implementations, which in turn is called by qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock, which results sleeping-while-atomic BUG. Steps to reproduce for multiq: tc qdisc add dev ens1f0 root handle 1: multiq tc qdisc add dev ens1f0 parent 1:10 handle 50: sfq perturb 10 ethtool -L ens1f0 combined 2 tc qdisc change dev ens1f0 root handle 1: multiq Resulting dmesg: [ 5539.419344] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909 [ 5539.420945] in_atomic(): 1, irqs_disabled(): 0, pid: 27658, name: tc [ 5539.422435] INFO: lockdep is turned off. [ 5539.423904] CPU: 21 PID: 27658 Comm: tc Tainted: G W 5.3.0-rc8+ torvalds#721 [ 5539.425400] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 5539.426911] Call Trace: [ 5539.428380] dump_stack+0x85/0xc0 [ 5539.429823] ___might_sleep.cold+0xac/0xbc [ 5539.431262] __mutex_lock+0x5b/0x960 [ 5539.432682] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 5539.434103] ? __nla_validate_parse+0x51/0x840 [ 5539.435493] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 5539.436903] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 5539.438327] tcf_block_put_ext.part.0+0x21/0x50 [ 5539.439752] tcf_block_put+0x50/0x70 [ 5539.441165] sfq_destroy+0x15/0x50 [sch_sfq] [ 5539.442570] qdisc_destroy+0x5f/0x160 [ 5539.444000] multiq_tune+0x14a/0x420 [sch_multiq] [ 5539.445421] tc_modify_qdisc+0x324/0x840 [ 5539.446841] rtnetlink_rcv_msg+0x170/0x4b0 [ 5539.448269] ? netlink_deliver_tap+0x95/0x400 [ 5539.449691] ? rtnl_dellink+0x2d0/0x2d0 [ 5539.451116] netlink_rcv_skb+0x49/0x110 [ 5539.452522] netlink_unicast+0x171/0x200 [ 5539.453914] netlink_sendmsg+0x224/0x3f0 [ 5539.455304] sock_sendmsg+0x5e/0x60 [ 5539.456686] ___sys_sendmsg+0x2ae/0x330 [ 5539.458071] ? ___sys_recvmsg+0x159/0x1f0 [ 5539.459461] ? do_wp_page+0x9c/0x790 [ 5539.460846] ? __handle_mm_fault+0xcd3/0x19e0 [ 5539.462263] __sys_sendmsg+0x59/0xa0 [ 5539.463661] do_syscall_64+0x5c/0xb0 [ 5539.465044] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 5539.466454] RIP: 0033:0x7f1fe08177b8 [ 5539.467863] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 5539.470906] RSP: 002b:00007ffe812de5d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 5539.472483] RAX: ffffffffffffffda RBX: 000000005d8135e3 RCX: 00007f1fe08177b8 [ 5539.474069] RDX: 0000000000000000 RSI: 00007ffe812de640 RDI: 0000000000000003 [ 5539.475655] RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000182e9b0 [ 5539.477203] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 5539.478699] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000 Rearrange locking in multiq_tune() in following ways: - In loop that removes Qdiscs from disabled queues, call qdisc_purge_queue() instead of qdisc_tree_flush_backlog() on Qdisc that is being destroyed. Save the Qdisc in temporary allocated array and call qdisc_put() on each element of the array after sch tree lock is released. This is safe to do because Qdiscs have already been reset by qdisc_purge_queue() inside sch tree lock critical section. - Do the same change for second loop that initializes Qdiscs for newly enabled queues in multiq_tune() function. Since sch tree lock is obtained and released on each iteration of this loop, just call qdisc_put() directly outside of critical section. Don't verify that old Qdisc is not noop_qdisc before releasing reference to it because such check is already performed by qdisc_put*() functions. Fixes: c266f64 ("net: sched: protect block state with mutex") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Recent changes that removed rtnl dependency from rules update path of tc also made tcf_block_put() function sleeping. This function is called from ops->destroy() of several Qdisc implementations, which in turn is called by qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock, which results sleeping-while-atomic BUG. Steps to reproduce for sfb: tc qdisc add dev ens1f0 handle 1: root sfb tc qdisc add dev ens1f0 parent 1:10 handle 50: sfq perturb 10 tc qdisc change dev ens1f0 root handle 1: sfb Resulting dmesg: [ 7265.938717] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909 [ 7265.940152] in_atomic(): 1, irqs_disabled(): 0, pid: 28579, name: tc [ 7265.941455] INFO: lockdep is turned off. [ 7265.942744] CPU: 11 PID: 28579 Comm: tc Tainted: G W 5.3.0-rc8+ torvalds#721 [ 7265.944065] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 7265.945396] Call Trace: [ 7265.946709] dump_stack+0x85/0xc0 [ 7265.947994] ___might_sleep.cold+0xac/0xbc [ 7265.949282] __mutex_lock+0x5b/0x960 [ 7265.950543] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.951803] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.953022] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.954248] tcf_block_put_ext.part.0+0x21/0x50 [ 7265.955478] tcf_block_put+0x50/0x70 [ 7265.956694] sfq_destroy+0x15/0x50 [sch_sfq] [ 7265.957898] qdisc_destroy+0x5f/0x160 [ 7265.959099] sfb_change+0x175/0x330 [sch_sfb] [ 7265.960304] tc_modify_qdisc+0x324/0x840 [ 7265.961503] rtnetlink_rcv_msg+0x170/0x4b0 [ 7265.962692] ? netlink_deliver_tap+0x95/0x400 [ 7265.963876] ? rtnl_dellink+0x2d0/0x2d0 [ 7265.965064] netlink_rcv_skb+0x49/0x110 [ 7265.966251] netlink_unicast+0x171/0x200 [ 7265.967427] netlink_sendmsg+0x224/0x3f0 [ 7265.968595] sock_sendmsg+0x5e/0x60 [ 7265.969753] ___sys_sendmsg+0x2ae/0x330 [ 7265.970916] ? ___sys_recvmsg+0x159/0x1f0 [ 7265.972074] ? do_wp_page+0x9c/0x790 [ 7265.973233] ? __handle_mm_fault+0xcd3/0x19e0 [ 7265.974407] __sys_sendmsg+0x59/0xa0 [ 7265.975591] do_syscall_64+0x5c/0xb0 [ 7265.976753] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 7265.977938] RIP: 0033:0x7f229069f7b8 [ 7265.979117] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 7265.981681] RSP: 002b:00007ffd7ed2d158 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 7265.983001] RAX: ffffffffffffffda RBX: 000000005d813ca1 RCX: 00007f229069f7b8 [ 7265.984336] RDX: 0000000000000000 RSI: 00007ffd7ed2d1c0 RDI: 0000000000000003 [ 7265.985682] RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000165c9a0 [ 7265.987021] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 7265.988309] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000 In sfb_change() function use qdisc_purge_queue() instead of qdisc_tree_flush_backlog() to properly reset old child Qdisc and save pointer to it into local temporary variable. Put reference to Qdisc after sch tree lock is released in order not to call potentially sleeping cls API in atomic section. This is safe to do because Qdisc has already been reset by qdisc_purge_queue() inside sch tree lock critical section. Reported-by: syzbot+ac54455281db908c581e@syzkaller.appspotmail.com Fixes: c266f64 ("net: sched: protect block state with mutex") Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
commit e3ae1f9 upstream. Recent changes that removed rtnl dependency from rules update path of tc also made tcf_block_put() function sleeping. This function is called from ops->destroy() of several Qdisc implementations, which in turn is called by qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock, which results sleeping-while-atomic BUG. Steps to reproduce for sfb: tc qdisc add dev ens1f0 handle 1: root sfb tc qdisc add dev ens1f0 parent 1:10 handle 50: sfq perturb 10 tc qdisc change dev ens1f0 root handle 1: sfb Resulting dmesg: [ 7265.938717] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909 [ 7265.940152] in_atomic(): 1, irqs_disabled(): 0, pid: 28579, name: tc [ 7265.941455] INFO: lockdep is turned off. [ 7265.942744] CPU: 11 PID: 28579 Comm: tc Tainted: G W 5.3.0-rc8+ torvalds#721 [ 7265.944065] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 7265.945396] Call Trace: [ 7265.946709] dump_stack+0x85/0xc0 [ 7265.947994] ___might_sleep.cold+0xac/0xbc [ 7265.949282] __mutex_lock+0x5b/0x960 [ 7265.950543] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.951803] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.953022] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.954248] tcf_block_put_ext.part.0+0x21/0x50 [ 7265.955478] tcf_block_put+0x50/0x70 [ 7265.956694] sfq_destroy+0x15/0x50 [sch_sfq] [ 7265.957898] qdisc_destroy+0x5f/0x160 [ 7265.959099] sfb_change+0x175/0x330 [sch_sfb] [ 7265.960304] tc_modify_qdisc+0x324/0x840 [ 7265.961503] rtnetlink_rcv_msg+0x170/0x4b0 [ 7265.962692] ? netlink_deliver_tap+0x95/0x400 [ 7265.963876] ? rtnl_dellink+0x2d0/0x2d0 [ 7265.965064] netlink_rcv_skb+0x49/0x110 [ 7265.966251] netlink_unicast+0x171/0x200 [ 7265.967427] netlink_sendmsg+0x224/0x3f0 [ 7265.968595] sock_sendmsg+0x5e/0x60 [ 7265.969753] ___sys_sendmsg+0x2ae/0x330 [ 7265.970916] ? ___sys_recvmsg+0x159/0x1f0 [ 7265.972074] ? do_wp_page+0x9c/0x790 [ 7265.973233] ? __handle_mm_fault+0xcd3/0x19e0 [ 7265.974407] __sys_sendmsg+0x59/0xa0 [ 7265.975591] do_syscall_64+0x5c/0xb0 [ 7265.976753] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 7265.977938] RIP: 0033:0x7f229069f7b8 [ 7265.979117] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 7265.981681] RSP: 002b:00007ffd7ed2d158 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 7265.983001] RAX: ffffffffffffffda RBX: 000000005d813ca1 RCX: 00007f229069f7b8 [ 7265.984336] RDX: 0000000000000000 RSI: 00007ffd7ed2d1c0 RDI: 0000000000000003 [ 7265.985682] RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000165c9a0 [ 7265.987021] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 7265.988309] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000 In sfb_change() function use qdisc_purge_queue() instead of qdisc_tree_flush_backlog() to properly reset old child Qdisc and save pointer to it into local temporary variable. Put reference to Qdisc after sch tree lock is released in order not to call potentially sleeping cls API in atomic section. This is safe to do because Qdisc has already been reset by qdisc_purge_queue() inside sch tree lock critical section. Reported-by: syzbot+ac54455281db908c581e@syzkaller.appspotmail.com Fixes: c266f64 ("net: sched: protect block state with mutex") Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
BugLink: https://bugs.launchpad.net/bugs/1851550 commit e3ae1f9 upstream. Recent changes that removed rtnl dependency from rules update path of tc also made tcf_block_put() function sleeping. This function is called from ops->destroy() of several Qdisc implementations, which in turn is called by qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock, which results sleeping-while-atomic BUG. Steps to reproduce for sfb: tc qdisc add dev ens1f0 handle 1: root sfb tc qdisc add dev ens1f0 parent 1:10 handle 50: sfq perturb 10 tc qdisc change dev ens1f0 root handle 1: sfb Resulting dmesg: [ 7265.938717] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909 [ 7265.940152] in_atomic(): 1, irqs_disabled(): 0, pid: 28579, name: tc [ 7265.941455] INFO: lockdep is turned off. [ 7265.942744] CPU: 11 PID: 28579 Comm: tc Tainted: G W 5.3.0-rc8+ torvalds#721 [ 7265.944065] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 7265.945396] Call Trace: [ 7265.946709] dump_stack+0x85/0xc0 [ 7265.947994] ___might_sleep.cold+0xac/0xbc [ 7265.949282] __mutex_lock+0x5b/0x960 [ 7265.950543] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.951803] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.953022] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.954248] tcf_block_put_ext.part.0+0x21/0x50 [ 7265.955478] tcf_block_put+0x50/0x70 [ 7265.956694] sfq_destroy+0x15/0x50 [sch_sfq] [ 7265.957898] qdisc_destroy+0x5f/0x160 [ 7265.959099] sfb_change+0x175/0x330 [sch_sfb] [ 7265.960304] tc_modify_qdisc+0x324/0x840 [ 7265.961503] rtnetlink_rcv_msg+0x170/0x4b0 [ 7265.962692] ? netlink_deliver_tap+0x95/0x400 [ 7265.963876] ? rtnl_dellink+0x2d0/0x2d0 [ 7265.965064] netlink_rcv_skb+0x49/0x110 [ 7265.966251] netlink_unicast+0x171/0x200 [ 7265.967427] netlink_sendmsg+0x224/0x3f0 [ 7265.968595] sock_sendmsg+0x5e/0x60 [ 7265.969753] ___sys_sendmsg+0x2ae/0x330 [ 7265.970916] ? ___sys_recvmsg+0x159/0x1f0 [ 7265.972074] ? do_wp_page+0x9c/0x790 [ 7265.973233] ? __handle_mm_fault+0xcd3/0x19e0 [ 7265.974407] __sys_sendmsg+0x59/0xa0 [ 7265.975591] do_syscall_64+0x5c/0xb0 [ 7265.976753] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 7265.977938] RIP: 0033:0x7f229069f7b8 [ 7265.979117] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 7265.981681] RSP: 002b:00007ffd7ed2d158 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 7265.983001] RAX: ffffffffffffffda RBX: 000000005d813ca1 RCX: 00007f229069f7b8 [ 7265.984336] RDX: 0000000000000000 RSI: 00007ffd7ed2d1c0 RDI: 0000000000000003 [ 7265.985682] RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000165c9a0 [ 7265.987021] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 7265.988309] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000 In sfb_change() function use qdisc_purge_queue() instead of qdisc_tree_flush_backlog() to properly reset old child Qdisc and save pointer to it into local temporary variable. Put reference to Qdisc after sch tree lock is released in order not to call potentially sleeping cls API in atomic section. This is safe to do because Qdisc has already been reset by qdisc_purge_queue() inside sch tree lock critical section. Reported-by: syzbot+ac54455281db908c581e@syzkaller.appspotmail.com Fixes: c266f64 ("net: sched: protect block state with mutex") Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Running the checkpatch.pl script on the file for which patch was created, the following error was found to exist. ERROR: space required after that ',' (ctx:VxV) Fixed the above error which was found on line torvalds#721 by inserting a blank space at the appropriate position. Signed-off-by: Anant Thazhemadam <anant.thazhemadam@gmail.com>
Running the checkpatch.pl script on the file for which patch was created, the following error was found to exist. ERROR: space required after that ',' (ctx:VxV) Fixed the above error which was found on line torvalds#721 by inserting a blank space at the appropriate position. Signed-off-by: Anant Thazhemadam <anant.thazhemadam@gmail.com> Link: https://lore.kernel.org/r/20200725122041.5663-1-anant.thazhemadam@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Recent changes that removed rtnl dependency from rules update path of tc also made tcf_block_put() function sleeping. This function is called from ops->destroy() of several Qdisc implementations, which in turn is called by qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock, which results sleeping-while-atomic BUG. Steps to reproduce for htb: tc qdisc add dev ens1f0 root handle 1: htb default 12 tc class add dev ens1f0 parent 1: classid 1:1 htb rate 100kbps ceil 100kbps tc qdisc add dev ens1f0 parent 1:1 handle 40: sfq perturb 10 tc class add dev ens1f0 parent 1:1 classid 1:2 htb rate 100kbps ceil 100kbps Resulting dmesg: [ 4791.148551] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909 [ 4791.151354] in_atomic(): 1, irqs_disabled(): 0, pid: 27273, name: tc [ 4791.152805] INFO: lockdep is turned off. [ 4791.153605] CPU: 19 PID: 27273 Comm: tc Tainted: G W 5.3.0-rc8+ torvalds#721 [ 4791.154336] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 4791.155075] Call Trace: [ 4791.155803] dump_stack+0x85/0xc0 [ 4791.156529] ___might_sleep.cold+0xac/0xbc [ 4791.157251] __mutex_lock+0x5b/0x960 [ 4791.157966] ? console_unlock+0x363/0x5d0 [ 4791.158676] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 4791.159395] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 4791.160103] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 4791.160815] tcf_block_put_ext.part.0+0x21/0x50 [ 4791.161530] tcf_block_put+0x50/0x70 [ 4791.162233] sfq_destroy+0x15/0x50 [sch_sfq] [ 4791.162936] qdisc_destroy+0x5f/0x160 [ 4791.163642] htb_change_class.cold+0x5df/0x69d [sch_htb] [ 4791.164505] tc_ctl_tclass+0x19d/0x480 [ 4791.165360] rtnetlink_rcv_msg+0x170/0x4b0 [ 4791.166191] ? netlink_deliver_tap+0x95/0x400 [ 4791.166907] ? rtnl_dellink+0x2d0/0x2d0 [ 4791.167625] netlink_rcv_skb+0x49/0x110 [ 4791.168345] netlink_unicast+0x171/0x200 [ 4791.169058] netlink_sendmsg+0x224/0x3f0 [ 4791.169771] sock_sendmsg+0x5e/0x60 [ 4791.170475] ___sys_sendmsg+0x2ae/0x330 [ 4791.171183] ? ___sys_recvmsg+0x159/0x1f0 [ 4791.171894] ? do_wp_page+0x9c/0x790 [ 4791.172595] ? __handle_mm_fault+0xcd3/0x19e0 [ 4791.173309] __sys_sendmsg+0x59/0xa0 [ 4791.174024] do_syscall_64+0x5c/0xb0 [ 4791.174725] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 4791.175435] RIP: 0033:0x7f0aa41497b8 [ 4791.176129] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 4791.177532] RSP: 002b:00007fff4e37d588 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 4791.178243] RAX: ffffffffffffffda RBX: 000000005d8132f7 RCX: 00007f0aa41497b8 [ 4791.178947] RDX: 0000000000000000 RSI: 00007fff4e37d5f0 RDI: 0000000000000003 [ 4791.179662] RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000020149a0 [ 4791.180382] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 4791.181100] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000 In htb_change_class() function save parent->leaf.q to local temporary variable and put reference to it after sch tree lock is released in order not to call potentially sleeping cls API in atomic section. This is safe to do because Qdisc has already been reset by qdisc_purge_queue() inside sch tree lock critical section. Fixes: c266f64 ("net: sched: protect block state with mutex") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Recent changes that removed rtnl dependency from rules update path of tc also made tcf_block_put() function sleeping. This function is called from ops->destroy() of several Qdisc implementations, which in turn is called by qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock, which results sleeping-while-atomic BUG. Steps to reproduce for multiq: tc qdisc add dev ens1f0 root handle 1: multiq tc qdisc add dev ens1f0 parent 1:10 handle 50: sfq perturb 10 ethtool -L ens1f0 combined 2 tc qdisc change dev ens1f0 root handle 1: multiq Resulting dmesg: [ 5539.419344] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909 [ 5539.420945] in_atomic(): 1, irqs_disabled(): 0, pid: 27658, name: tc [ 5539.422435] INFO: lockdep is turned off. [ 5539.423904] CPU: 21 PID: 27658 Comm: tc Tainted: G W 5.3.0-rc8+ torvalds#721 [ 5539.425400] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 5539.426911] Call Trace: [ 5539.428380] dump_stack+0x85/0xc0 [ 5539.429823] ___might_sleep.cold+0xac/0xbc [ 5539.431262] __mutex_lock+0x5b/0x960 [ 5539.432682] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 5539.434103] ? __nla_validate_parse+0x51/0x840 [ 5539.435493] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 5539.436903] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 5539.438327] tcf_block_put_ext.part.0+0x21/0x50 [ 5539.439752] tcf_block_put+0x50/0x70 [ 5539.441165] sfq_destroy+0x15/0x50 [sch_sfq] [ 5539.442570] qdisc_destroy+0x5f/0x160 [ 5539.444000] multiq_tune+0x14a/0x420 [sch_multiq] [ 5539.445421] tc_modify_qdisc+0x324/0x840 [ 5539.446841] rtnetlink_rcv_msg+0x170/0x4b0 [ 5539.448269] ? netlink_deliver_tap+0x95/0x400 [ 5539.449691] ? rtnl_dellink+0x2d0/0x2d0 [ 5539.451116] netlink_rcv_skb+0x49/0x110 [ 5539.452522] netlink_unicast+0x171/0x200 [ 5539.453914] netlink_sendmsg+0x224/0x3f0 [ 5539.455304] sock_sendmsg+0x5e/0x60 [ 5539.456686] ___sys_sendmsg+0x2ae/0x330 [ 5539.458071] ? ___sys_recvmsg+0x159/0x1f0 [ 5539.459461] ? do_wp_page+0x9c/0x790 [ 5539.460846] ? __handle_mm_fault+0xcd3/0x19e0 [ 5539.462263] __sys_sendmsg+0x59/0xa0 [ 5539.463661] do_syscall_64+0x5c/0xb0 [ 5539.465044] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 5539.466454] RIP: 0033:0x7f1fe08177b8 [ 5539.467863] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 5539.470906] RSP: 002b:00007ffe812de5d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 5539.472483] RAX: ffffffffffffffda RBX: 000000005d8135e3 RCX: 00007f1fe08177b8 [ 5539.474069] RDX: 0000000000000000 RSI: 00007ffe812de640 RDI: 0000000000000003 [ 5539.475655] RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000182e9b0 [ 5539.477203] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 5539.478699] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000 Rearrange locking in multiq_tune() in following ways: - In loop that removes Qdiscs from disabled queues, call qdisc_purge_queue() instead of qdisc_tree_flush_backlog() on Qdisc that is being destroyed. Save the Qdisc in temporary allocated array and call qdisc_put() on each element of the array after sch tree lock is released. This is safe to do because Qdiscs have already been reset by qdisc_purge_queue() inside sch tree lock critical section. - Do the same change for second loop that initializes Qdiscs for newly enabled queues in multiq_tune() function. Since sch tree lock is obtained and released on each iteration of this loop, just call qdisc_put() directly outside of critical section. Don't verify that old Qdisc is not noop_qdisc before releasing reference to it because such check is already performed by qdisc_put*() functions. Fixes: c266f64 ("net: sched: protect block state with mutex") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Recent changes that removed rtnl dependency from rules update path of tc also made tcf_block_put() function sleeping. This function is called from ops->destroy() of several Qdisc implementations, which in turn is called by qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock, which results sleeping-while-atomic BUG. Steps to reproduce for sfb: tc qdisc add dev ens1f0 handle 1: root sfb tc qdisc add dev ens1f0 parent 1:10 handle 50: sfq perturb 10 tc qdisc change dev ens1f0 root handle 1: sfb Resulting dmesg: [ 7265.938717] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909 [ 7265.940152] in_atomic(): 1, irqs_disabled(): 0, pid: 28579, name: tc [ 7265.941455] INFO: lockdep is turned off. [ 7265.942744] CPU: 11 PID: 28579 Comm: tc Tainted: G W 5.3.0-rc8+ torvalds#721 [ 7265.944065] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 7265.945396] Call Trace: [ 7265.946709] dump_stack+0x85/0xc0 [ 7265.947994] ___might_sleep.cold+0xac/0xbc [ 7265.949282] __mutex_lock+0x5b/0x960 [ 7265.950543] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.951803] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.953022] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.954248] tcf_block_put_ext.part.0+0x21/0x50 [ 7265.955478] tcf_block_put+0x50/0x70 [ 7265.956694] sfq_destroy+0x15/0x50 [sch_sfq] [ 7265.957898] qdisc_destroy+0x5f/0x160 [ 7265.959099] sfb_change+0x175/0x330 [sch_sfb] [ 7265.960304] tc_modify_qdisc+0x324/0x840 [ 7265.961503] rtnetlink_rcv_msg+0x170/0x4b0 [ 7265.962692] ? netlink_deliver_tap+0x95/0x400 [ 7265.963876] ? rtnl_dellink+0x2d0/0x2d0 [ 7265.965064] netlink_rcv_skb+0x49/0x110 [ 7265.966251] netlink_unicast+0x171/0x200 [ 7265.967427] netlink_sendmsg+0x224/0x3f0 [ 7265.968595] sock_sendmsg+0x5e/0x60 [ 7265.969753] ___sys_sendmsg+0x2ae/0x330 [ 7265.970916] ? ___sys_recvmsg+0x159/0x1f0 [ 7265.972074] ? do_wp_page+0x9c/0x790 [ 7265.973233] ? __handle_mm_fault+0xcd3/0x19e0 [ 7265.974407] __sys_sendmsg+0x59/0xa0 [ 7265.975591] do_syscall_64+0x5c/0xb0 [ 7265.976753] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 7265.977938] RIP: 0033:0x7f229069f7b8 [ 7265.979117] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 7265.981681] RSP: 002b:00007ffd7ed2d158 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 7265.983001] RAX: ffffffffffffffda RBX: 000000005d813ca1 RCX: 00007f229069f7b8 [ 7265.984336] RDX: 0000000000000000 RSI: 00007ffd7ed2d1c0 RDI: 0000000000000003 [ 7265.985682] RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000165c9a0 [ 7265.987021] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 7265.988309] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000 In sfb_change() function use qdisc_purge_queue() instead of qdisc_tree_flush_backlog() to properly reset old child Qdisc and save pointer to it into local temporary variable. Put reference to Qdisc after sch tree lock is released in order not to call potentially sleeping cls API in atomic section. This is safe to do because Qdisc has already been reset by qdisc_purge_queue() inside sch tree lock critical section. Reported-by: syzbot+ac54455281db908c581e@syzkaller.appspotmail.com Fixes: c266f64 ("net: sched: protect block state with mutex") Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The qcom_lmh_dcvs_notify() will get the dev_pm_opp instance for throttling, but will not put it, ending up with leaking a reference count and the following backtrace when putting the CPU offline. Correctly put the reference count of the returned opp instance. [ 84.418025] ------------[ cut here ]------------ [ 84.422770] WARNING: CPU: 7 PID: 43 at drivers/opp/core.c:1396 _opp_table_kref_release+0x188/0x190 [ 84.431966] Modules linked in: [ 84.435106] CPU: 7 PID: 43 Comm: cpuhp/7 Tainted: G S 5.17.0-rc6-00388-g7cf3c0d89c44-dirty torvalds#721 [ 84.451631] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--) [ 84.458781] pc : _opp_table_kref_release+0x188/0x190 [ 84.463878] lr : _opp_table_kref_release+0x78/0x190 [ 84.468885] sp : ffff80000841bc70 [ 84.472294] x29: ffff80000841bc70 x28: ffff6664afe3d000 x27: ffff1db6729e5908 [ 84.479621] x26: 0000000000000000 x25: 0000000000000000 x24: ffff1db6729e58e0 [ 84.486946] x23: ffff8000080a5000 x22: ffff1db40aad80e0 x21: ffff1db4002fec80 [ 84.494277] x20: ffff1db40aad8000 x19: ffffb751c3186300 x18: ffffffffffffffff [ 84.501603] x17: 5300326563697665 x16: 645f676e696c6f6f x15: 00001186c1df5448 [ 84.508928] x14: 00000000000002e9 x13: 0000000000000000 x12: 0000000000000000 [ 84.516256] x11: ffffb751c3186368 x10: ffffb751c39a2a70 x9 : 0000000000000000 [ 84.523585] x8 : ffff1db4008edf00 x7 : ffffb751c328c000 x6 : 0000000000000001 [ 84.530916] x5 : 0000000000040000 x4 : 0000000000000001 x3 : ffff1db4008edf00 [ 84.538247] x2 : 0000000000000000 x1 : ffff1db400aa6100 x0 : ffff1db40aad80d0 [ 84.545579] Call trace: [ 84.548101] _opp_table_kref_release+0x188/0x190 [ 84.552842] dev_pm_opp_remove_all_dynamic+0x8c/0xc0 [ 84.557949] qcom_cpufreq_hw_cpu_exit+0x30/0xdc [ 84.562608] cpufreq_offline.isra.0+0x1b4/0x1d8 [ 84.567270] cpuhp_cpufreq_offline+0x10/0x6c [ 84.571663] cpuhp_invoke_callback+0x16c/0x2b0 [ 84.576231] cpuhp_thread_fun+0x190/0x250 [ 84.580353] smpboot_thread_fn+0x12c/0x230 [ 84.584568] kthread+0xfc/0x100 [ 84.587810] ret_from_fork+0x10/0x20 [ 84.591490] irq event stamp: 3482 [ 84.594901] hardirqs last enabled at (3481): [<ffffb751c13c3db0>] call_rcu+0x39c/0x50c [ 84.603119] hardirqs last disabled at (3482): [<ffffb751c236b518>] el1_dbg+0x24/0x8c [ 84.611074] softirqs last enabled at (310): [<ffffb751c1290410>] _stext+0x410/0x588 [ 84.619028] softirqs last disabled at (305): [<ffffb751c131bf68>] __irq_exit_rcu+0x158/0x174 [ 84.627691] ---[ end trace 0000000000000000 ]--- Fixes: 275157b ("cpufreq: qcom-cpufreq-hw: Add dcvs interrupt support") Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
The qcom_lmh_dcvs_notify() will get the dev_pm_opp instance for throttling, but will not put it, ending up with leaking a reference count and the following backtrace when putting the CPU offline. Correctly put the reference count of the returned opp instance. [ 84.418025] ------------[ cut here ]------------ [ 84.422770] WARNING: CPU: 7 PID: 43 at drivers/opp/core.c:1396 _opp_table_kref_release+0x188/0x190 [ 84.431966] Modules linked in: [ 84.435106] CPU: 7 PID: 43 Comm: cpuhp/7 Tainted: G S 5.17.0-rc6-00388-g7cf3c0d89c44-dirty torvalds#721 [ 84.451631] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--) [ 84.458781] pc : _opp_table_kref_release+0x188/0x190 [ 84.463878] lr : _opp_table_kref_release+0x78/0x190 [ 84.468885] sp : ffff80000841bc70 [ 84.472294] x29: ffff80000841bc70 x28: ffff6664afe3d000 x27: ffff1db6729e5908 [ 84.479621] x26: 0000000000000000 x25: 0000000000000000 x24: ffff1db6729e58e0 [ 84.486946] x23: ffff8000080a5000 x22: ffff1db40aad80e0 x21: ffff1db4002fec80 [ 84.494277] x20: ffff1db40aad8000 x19: ffffb751c3186300 x18: ffffffffffffffff [ 84.501603] x17: 5300326563697665 x16: 645f676e696c6f6f x15: 00001186c1df5448 [ 84.508928] x14: 00000000000002e9 x13: 0000000000000000 x12: 0000000000000000 [ 84.516256] x11: ffffb751c3186368 x10: ffffb751c39a2a70 x9 : 0000000000000000 [ 84.523585] x8 : ffff1db4008edf00 x7 : ffffb751c328c000 x6 : 0000000000000001 [ 84.530916] x5 : 0000000000040000 x4 : 0000000000000001 x3 : ffff1db4008edf00 [ 84.538247] x2 : 0000000000000000 x1 : ffff1db400aa6100 x0 : ffff1db40aad80d0 [ 84.545579] Call trace: [ 84.548101] _opp_table_kref_release+0x188/0x190 [ 84.552842] dev_pm_opp_remove_all_dynamic+0x8c/0xc0 [ 84.557949] qcom_cpufreq_hw_cpu_exit+0x30/0xdc [ 84.562608] cpufreq_offline.isra.0+0x1b4/0x1d8 [ 84.567270] cpuhp_cpufreq_offline+0x10/0x6c [ 84.571663] cpuhp_invoke_callback+0x16c/0x2b0 [ 84.576231] cpuhp_thread_fun+0x190/0x250 [ 84.580353] smpboot_thread_fn+0x12c/0x230 [ 84.584568] kthread+0xfc/0x100 [ 84.587810] ret_from_fork+0x10/0x20 [ 84.591490] irq event stamp: 3482 [ 84.594901] hardirqs last enabled at (3481): [<ffffb751c13c3db0>] call_rcu+0x39c/0x50c [ 84.603119] hardirqs last disabled at (3482): [<ffffb751c236b518>] el1_dbg+0x24/0x8c [ 84.611074] softirqs last enabled at (310): [<ffffb751c1290410>] _stext+0x410/0x588 [ 84.619028] softirqs last disabled at (305): [<ffffb751c131bf68>] __irq_exit_rcu+0x158/0x174 [ 84.627691] ---[ end trace 0000000000000000 ]--- Fixes: 275157b ("cpufreq: qcom-cpufreq-hw: Add dcvs interrupt support") Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
…only ANBZ: torvalds#721 commit f10fb85 upstream. fw_reset_no_pci_access is only applicable for MFI controllers and is not used for Fusion controllers. For all Fusion controllers, driver can check reset adapter bit in status register before performing a chip reset without setting "fw_reset_no_pci_access". Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Liu Song <liusong@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/598
ANBZ: torvalds#721 commit 44e8d69 upstream. No functional change. This patch reworks code around controller reset path which gets rid of a couple of goto labels. This is in preparation for the next patch which adds PCI config space access locking while controller reset is in progress. Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Liu Song <liusong@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/598
…ng OCR ANBZ: torvalds#721 commit 78409d4 upstream. While an online controller reset(OCR) is in progress, there is short duration where all access to controller's PCI config space from the host needs to be blocked. This is due to a hardware limitation of MegaRAID controllers. With this patch, driver will block all access to controller's config space from userland applications by calling pci_cfg_access_lock() while OCR is in progress and unlocking after controller comes back to ready state. Added helper function which locks the config space before initiating OCR and wait for controller to become READY. Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Liu Song <liusong@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/598
ANBZ: torvalds#721 commit 3f6194a upstream. Currently driver checks for Firmware state change from ISR context, and only when there are interrupts tied with no I/O completions. We have seen multiple cases where doorbell interrupts sent by firmware to indicate FW state change are not processed by driver and it takes long time for driver to trigger OCR. And if there are no IOs running, since we only check the FW state as part of ISR code, fault goes undetected by driver and OCR will not be triggered. This patch introduces a separate workqueue that runs every one second to detect Firmware FAULT state and trigger reset immediately. As an additional gain, removing PCI reads from ISR to check FW state results in improved performance as well. Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com> Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Liu Song <liusong@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/598
ANBZ: torvalds#721 commit 34bd9f2 upstream. Optimization: No need to hold hba_lock in dpc context for reading atomic variable. Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Liu Song <liusong@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/598
ANBZ: torvalds#721 commit 62a04f8 upstream. Issue Description: We have seen cpu lock up issues from field if system has a large (more than 96) logical cpu count. SAS3.0 controller (Invader series) supports max 96 MSI-X vector and SAS3.5 product (Ventura) supports max 128 MSI-X vectors. This may be a generic issue (if PCI device support completion on multiple reply queues). Let me explain it w.r.t megaraid_sas supported h/w just to simplify the problem and possible changes to handle such issues. MegaRAID controller supports multiple reply queues in completion path. Driver creates MSI-X vectors for controller as "minimum of (FW supported Reply queues, Logical CPUs)". If submitter is not interrupted via completion on same CPU, there is a loop in the IO path. This behavior can cause hard/soft CPU lockups, IO timeout, system sluggish etc. Example - one CPU (e.g. CPU A) is busy submitting the IOs and another CPU (e.g. CPU B) is busy with processing the corresponding IO's reply descriptors from reply descriptor queue upon receiving the interrupts from HBA. If CPU A is continuously pumping the IOs then always CPU B (which is executing the ISR) will see the valid reply descriptors in the reply descriptor queue and it will be continuously processing those reply descriptor in a loop without quitting the ISR handler. megaraid_sas driver will exit ISR handler if it finds unused reply descriptor in the reply descriptor queue. Since CPU A will be continuously sending the IOs, CPU B may always see a valid reply descriptor (posted by HBA Firmware after processing the IO) in the reply descriptor queue. In worst case, driver will not quit from this loop in the ISR handler. Eventually, CPU lockup will be detected by watchdog. Above mentioned behavior is not common if "rq_affinity" set to 2 or affinity_hint is honored by irqbalancer as "exact". If rq_affinity is set to 2, submitter will be always interrupted via completion on same CPU. If irqbalancer is using "exact" policy, interrupt will be delivered to submitter CPU. Problem statement: If CPU count to MSI-X vectors (reply descriptor Queues) count ratio is not 1:1, we still have exposure of issue explained above and for that we don't have any solution. Exposure of soft/hard lockup is seen if CPU count is more than MSI-X supported by device. If CPUs count to MSI-X vectors count ratio is not 1:1, (Other way, if CPU counts to MSI-X vector count ratio is something like X:1, where X > 1) then 'exact' irqbalance policy OR rq_affinity = 2 won't help to avoid CPU hard/soft lockups. There won't be any one to one mapping between CPU to MSI-X vector instead one MSI-X interrupt (or reply descriptor queue) is shared with group/set of CPUs and there is a possibility of having a loop in the IO path within that CPU group and may observe lockups. For example: Consider a system having two NUMA nodes and each node having four logical CPUs and also consider that number of MSI-X vectors enabled on the HBA is two, then CPUs count to MSI-X vector count ratio as 4:1. e.g. MSI-X vector 0 is affinity to CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA node 0 and MSI-X vector 1 is affinity to CPU 4, CPU 5, CPU 6 & CPU 7 of NUMA node 1. numactl --hardware available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 --> MSI-X 0 node 0 size: 65536 MB node 0 free: 63176 MB node 1 cpus: 4 5 6 7 --> MSI-X 1 node 1 size: 65536 MB node 1 free: 63176 MB Assume that user started an application which uses all the CPUs of NUMA node 0 for issuing the IOs. Only one CPU from affinity list (it can be any cpu since this behavior depends upon irqbalance) CPU0 will receive the interrupts from MSI-X 0 for all the IOs. Eventually, CPU 0 IO submission percentage will be decreasing and ISR processing percentage will be increasing as it is more busy with processing the interrupts. Gradually IO submission percentage on CPU 0 will be zero and it's ISR processing percentage will be 100% as IO loop has already formed within the NUMA node 0, i.e. CPU 1, CPU 2 & CPU 3 will be continuously busy with submitting the heavy IOs and only CPU 0 is busy in the ISR path as it always find the valid reply descriptor in the reply descriptor queue. Eventually, we will observe the hard lockup here. Chances of occurring of hard/soft lockups are directly proportional to value of X. If value of X is high, then chances of observing CPU lockups is high. Solution: Use IRQ poll interface defined in "irq_poll.c". megaraid_sas driver will execute ISR routine in softirq context and it will always quit the loop based on budget provided in IRQ poll interface. Driver will switch to IRQ poll only when more than a threshold number of reply descriptors are handled in one ISR. Currently threshold is set as 1/4th of HBA queue depth. In these scenarios (i.e. where CPUs count to MSI-X vectors count ratio is X:1 (where X > 1)), IRQ poll interface will avoid CPU hard lockups due to voluntary exit from the reply queue processing based on budget. Note - Only one MSI-X vector is busy doing processing. Select CONFIG_IRQ_POLL from driver Kconfig for driver compilation. Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Liu Song <liusong@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/598
…eout ANBZ: torvalds#721 commit 7fa3174 upstream. Issue: There is possibility of few DCMDs timing out with 'reset_mutex' lock held. As part of DCMD timeout handling, driver calls function megasas_reset_fusion which also tries to acquire same lock 'reset_mutex' and end up with deadlock. Fix: Upon timeout of DCMDs (which are fired with 'reset_mutex' lock held), driver will release 'reset_mutex' before calling OCR function and will acquire lock again after OCR function returns. Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com> Signed-off-by: Chandrakanth Patil <chandrakanth.patil@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Liu Song <liusong@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/598
… is in fault ANBZ: torvalds#721 commit ccf6c1f upstream. Issue: Under certain conditions, controller goes in FAULT state after IOC INIT fired to firmware. Such Fault can be recovered through controller reset. Fix: In driver probe context, if firmware fault is observed post IOC INIT, driver would do controller reset followed by retry logic for IOC INIT command. Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> Signed-off-by: Chandrakanth Patil <chandrakanth.patil@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Liu Song <liusong@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/598
ANBZ: torvalds#721 commit 9e77018 upstream. No functional change. Use local variables when accessing raid context in IO path. Improves code readability. Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Liu Song <liusong@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/598
…ters ANBZ: torvalds#721 commit 5885571 upstream. Aero adapters provides Atomic Request Descriptor as an alternative method for posting an entry onto a request queue. The posting of an Atomic Request Descriptor is an atomic operation, providing a safe mechanism for multiple processors on the host to post requests without synchronization. This Atomic Request Descriptor format is identical to first 32 bits of Default Request Descriptor and uses only 32 bits. If Aero adapters support Atomic descriptor, driver should use it for posting IOs and DCMDs to firmware. Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com> Signed-off-by: Chandrakanth Patil <chandrakanth.patil@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/614
ANBZ: torvalds#721 commit dd80769 upstream. This patch will add support for non-secure Aero adapter PCI IDs. Driver will throw an error message when a non-secure type controller is detected. Purpose of this interface is to avoid interacting with any firmware which is not secured/signed by Broadcom. Any tampering on Firmware component will be detected by hardware and it will be communicated to the driver to avoid any further interaction with that component. Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com> Signed-off-by: Chandrakanth Patil <chandrakanth.patil@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/614
ANBZ: torvalds#721 commit 2181aac upstream. Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Chandrakanth Patil <chandrakanth.patil@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/614
ANBZ: torvalds#721 commit 798d44b upstream. Firmware does not expect FastPath IO sent through Region Lock Bypass queue. Though firmware never exposes such settings when fastpath IO can be sent to RL bypass queue but it's safer to remove dead code which directs fastpath IO to RL Bypass queue. Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com> Signed-off-by: Chandrakanth Patil <chandrakanth.patil@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/614
ANBZ: torvalds#721 commit 59db5a9 upstream. Issue: This issue is applicable to scenario when JBOD sequence map is unavailable (memory allocation for JBOD sequence map failed) to driver but feature is supported by firmware. If the driver sends a JBOD IO by not adding 255 (MAX_PHYSICAL_DEVICES - 1) to device ID when underlying firmware supports JBOD sequence map, it will lead to the IO failure. Fix: For JBOD IOs, driver will not use the RAID map to fetch the devhandle if JBOD sequence map is unavailable. Driver will set Devhandle to 0xffff and Target ID to 'device ID + 255 (MAX_PHYSICAL_DEVICES - 1)'. Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com> Signed-off-by: Chandrakanth Patil <chandrakanth.patil@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/614
ANBZ: torvalds#721 commit 7fc5570 upstream. For RAID5/RAID6 volumes configured behind Aero, driver will be doing 64bit division operations on behalf of firmware as controller's ARM CPU is very slow in this division. Later, driver calculates Q-ARM, P-ARM and Log-ARM and passes those values to firmware by writing these values to RAID_CONTEXT. Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com> Signed-off-by: Chandrakanth Patil <chandrakanth.patil@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/614
ANBZ: torvalds#721 commit 1d15d90 upstream. Driver will use "reply descriptor post queues" in round robin fashion when the combined MSI-X mode is not enabled. With this IO completions are distributed and load balanced across all the available reply descriptor post queues equally. This is enabled only if combined MSI-X mode is not enabled in firmware. This improves performance and also fixes soft lockups. When load balancing is enabled, IRQ affinity from driver needs to be disabled. Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/614
ANBZ: torvalds#721 commit 132147d upstream. Aero controllers support balanced performance mode through the ability to configure queues with different properties. Reply queues with interrupt coalescing enabled are called "high iops reply queues" and reply queues with interrupt coalescing disabled are called "low latency reply queues". The driver configures a combination of high iops and low latency reply queues if: - HBA is an AERO controller; - MSI-X vectors supported by the HBA is 128; - Total CPU count in the system more than high iops queue count; - Driver is loaded with default max_msix_vectors module parameter; and - System booted in non-kdump mode. Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Chandrakanth Patil <chandrakanth.patil@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/614
ANBZ: torvalds#721 commit ea836f4 upstream. Driver should enable interrupt coalescing (during driver load and after Controller Reset) for High IOPS queues by masking appropriate bits in IOC INIT frame. Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Chandrakanth Patil <chandrakanth.patil@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/614
ANBZ: torvalds#721 commit f0b9e7b upstream. High iops queues are mapped to non-managed IRQs. Set affinity of non-managed irqs to local numa node. Low latency queues are mapped to managed IRQs. Driver reserves some reply queues for high IOPS queues (through pci_alloc_irq_vectors_affinity and .pre_vectors interface). The rest of queues are for low latency. Based on IO workload, driver will decide which group of reply queues (either high IOPS queues or low latency queues) to be used. High IOPS queues will be mapped to local numa node of controller and low latency queues will be mapped to CPUs across numa nodes. In general, high IOPS and low latency queues should fit into 128 reply queues which is the max number of reply queues supported by Aero adapters. Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Chandrakanth Patil <chandrakanth.patil@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/614
ANBZ: torvalds#721 commit f39e5e5 upstream. The driver will use round-robin method for IO submission in batches within the high IOPS queues when the number of in-flight ios on the target device is larger than 8. Otherwise the driver will use low latency reply queues. Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Chandrakanth Patil <chandrakanth.patil@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/614
ANBZ: torvalds#721 commit 299ee42 upstream. For Aero adapters, driver provides three different performance modes controlled through module parameter named 'perf_mode'. Below are those performance modes: 0: Balanced - Additional high IOPS reply queues will be enabled along with low latency queues. Interrupt coalescing will be enabled only for these high IOPS reply queues. 1: IOPS - No additional high IOPS queues are enabled. Interrupt coalescing will be enabled on all reply queues. 2: Latency - No additional high IOPS queues are enabled. Interrupt coalescing will be disabled on all reply queues. This is a legacy behavior similar to Ventura & Invader Series. Default performance mode settings: - Performance mode set to 'Balanced', if Aero controller is working in 16GT/s PCIe speed. - Performance mode will be set to 'Latency' mode for all other cases. Through module parameter 'perf_mode', user can override default performance mode to desired one. Captured some performance numbers with these performance modes. 4k Random Read IO performance numbers on 24 SAS SSD drives for above three performance modes. Performance data is from Intel Skylake and HGST SS300 (drive model SDLL1DLR400GCCA1). IOPS: ----------------------------------------------------------------------- |perf_mode | qd = 1 | qd = 64 | note | |-------------|--------|---------|------------------------------------- |balanced | 259K | 3061k | Provides max performance numbers | | | | | both on lower QD workload & | | | | | also on higher QD workload | |-------------|--------|---------|------------------------------------- |iops | 220K | 3100k | Provides max performance numbers | | | | | only on higher QD workload. | |-------------|--------|---------|------------------------------------- |latency | 246k | 2226k | Provides good performance numbers | | | | | only on lower QD worklaod. | ----------------------------------------------------------------------- Average Latency: ----------------------------------------------------- |perf_mode | qd = 1 | qd = 64 | |-------------|--------------|----------------------| |balanced | 92.05 usec | 501.12 usec | |-------------|--------------|----------------------| |iops | 108.40 usec | 498.10 usec | |-------------|--------------|----------------------| |latency | 97.10 usec | 689.26 usec | ----------------------------------------------------- Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com> Signed-off-by: Chandrakanth Patil <chandrakanth.patil@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/614
ANBZ: torvalds#721 commit 92b4f9d upstream. Streamline resume workflow by using the same functions for enabling MSIx interrupts as used during initialisation. Without it the driver might crash during resume with: WARNING: CPU: 2 PID: 4306 at ../drivers/pci/msi.c:1303 pci_irq_get_affinity+0x3b/0x90 Link: https://lore.kernel.org/r/20200113132609.69536-1-hare@suse.de Signed-off-by: Hannes Reinecke <hare@suse.de> Acked-by: Sumit Saxena <sumit.saxena@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/622 Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
…ntrollers ANBZ: torvalds#721 commit 1175b88 upstream. Load balancing IO completions across all available MSI-X vectors should be enabled for Invader and later generation controllers only. This needs to be disabled for older controllers. Add an adapter type check before setting msix_load_balance flag. Fixes: 1d15d90 ("scsi: megaraid_sas: Load balance completions across all MSI-X") Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/622 Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
…rough firmware ANBZ: torvalds#721 commit 79db830 upstream. The driver issues all non-ReadWrite I/Os for TYPE_ENCLOSURE devices through the fast path with invalid dev handle. Fast path in turn directs all the I/Os to the firmware. As firmware stopped handling those I/Os from SAS3.5 generation of controllers (Ventura generation and onwards) this will lead to I/O failures. Switch the driver to issue all the non-ReadWrite I/Os for TYPE_ENCLOSURE devices directly to firmware for SAS3.5 generation of controllers and later. Link: https://lore.kernel.org/r/20210528131307.25683-2-chandrakanth.patil@broadcom.com Cc: <stable@vger.kernel.org> # v5.11+ Signed-off-by: Chandrakanth Patil <chandrakanth.patil@broadcom.com> Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/622 Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
…evice resume ANBZ: torvalds#721 commit 499e724 upstream. After device resume we expect the firmware to be in READY state. Transition to READY might fail due to unhandled exceptions, such as an internal error or a hardware failure. Retry initiating chip reset and wait for the controller to come to ready state. Link: https://lore.kernel.org/r/1579000882-20246-2-git-send-email-anand.lodnoor@broadcom.com Signed-off-by: Chandrakanth Patil <chandrakanth.patil@broadcom.com> Signed-off-by: Anand Lodnoor <anand.lodnoor@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/622 Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
ANBZ: torvalds#721 commit b5438f4 upstream. The driver doesn't clean up all the allocated resources properly when scsi_add_host(), megasas_start_aen() function fails during the PCI device probe. Clean up all those resources. Link: https://lore.kernel.org/r/20210528131307.25683-3-chandrakanth.patil@broadcom.com Signed-off-by: Chandrakanth Patil <chandrakanth.patil@broadcom.com> Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/622 Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
ANBZ: torvalds#721 commit 6e73550 upstream. Ideally, optimal queue depth will be provided by firmware. The driver defines will be used as a fallback mechanism in case the FW assisted QD is not supported. The driver defined values provide optimal queue depth for most of the drives and the workloads, as is learned from the firmware assisted QD results. Link: https://lore.kernel.org/r/1579000882-20246-4-git-send-email-anand.lodnoor@broadcom.com Signed-off-by: Chandrakanth Patil <chandrakanth.patil@broadcom.com> Signed-off-by: Anand Lodnoor <anand.lodnoor@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/622 Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
With below scripts, it will trigger panic in f2fs: mkfs.f2fs -f /dev/vdd mount /dev/vdd /mnt/f2fs touch /mnt/f2fs/foo sync echo 111 >> /mnt/f2fs/foo f2fs_io fsync /mnt/f2fs/foo f2fs_io shutdown 2 /mnt/f2fs umount /mnt/f2fs mount -o ro,norecovery /dev/vdd /mnt/f2fs or mount -o ro,disable_roll_forward /dev/vdd /mnt/f2fs F2FS-fs (vdd): f2fs_recover_fsync_data: recovery fsync data, check_only: 0 F2FS-fs (vdd): Mounted with checkpoint version = 7f5c361f F2FS-fs (vdd): Stopped filesystem due to reason: 0 F2FS-fs (vdd): f2fs_recover_fsync_data: recovery fsync data, check_only: 1 Filesystem f2fs get_tree() didn't set fc->root, returned 1 ------------[ cut here ]------------ kernel BUG at fs/super.c:1761! Oops: invalid opcode: 0000 [#1] SMP PTI CPU: 3 UID: 0 PID: 722 Comm: mount Not tainted 6.18.0-rc2+ torvalds#721 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:vfs_get_tree.cold+0x18/0x1a Call Trace: <TASK> fc_mount+0x13/0xa0 path_mount+0x34e/0xc50 __x64_sys_mount+0x121/0x150 do_syscall_64+0x84/0x800 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fa6cc126cfe The root cause is we missed to handle error number returned from f2fs_recover_fsync_data() when mounting image w/ ro,norecovery or ro,disable_roll_forward mount option, result in returning a positive error number to vfs_get_tree(), fix it. Cc: stable@kernel.org Fixes: 6781eab ("f2fs: give -EINVAL for norecovery and rw mount") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
With below scripts, it will trigger panic in f2fs: mkfs.f2fs -f /dev/vdd mount /dev/vdd /mnt/f2fs touch /mnt/f2fs/foo sync echo 111 >> /mnt/f2fs/foo f2fs_io fsync /mnt/f2fs/foo f2fs_io shutdown 2 /mnt/f2fs umount /mnt/f2fs mount -o ro,norecovery /dev/vdd /mnt/f2fs or mount -o ro,disable_roll_forward /dev/vdd /mnt/f2fs F2FS-fs (vdd): f2fs_recover_fsync_data: recovery fsync data, check_only: 0 F2FS-fs (vdd): Mounted with checkpoint version = 7f5c361f F2FS-fs (vdd): Stopped filesystem due to reason: 0 F2FS-fs (vdd): f2fs_recover_fsync_data: recovery fsync data, check_only: 1 Filesystem f2fs get_tree() didn't set fc->root, returned 1 ------------[ cut here ]------------ kernel BUG at fs/super.c:1761! Oops: invalid opcode: 0000 [#1] SMP PTI CPU: 3 UID: 0 PID: 722 Comm: mount Not tainted 6.18.0-rc2+ torvalds#721 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:vfs_get_tree.cold+0x18/0x1a Call Trace: <TASK> fc_mount+0x13/0xa0 path_mount+0x34e/0xc50 __x64_sys_mount+0x121/0x150 do_syscall_64+0x84/0x800 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fa6cc126cfe The root cause is we missed to handle error number returned from f2fs_recover_fsync_data() when mounting image w/ ro,norecovery or ro,disable_roll_forward mount option, result in returning a positive error number to vfs_get_tree(), fix it. Cc: stable@kernel.org Fixes: 6781eab ("f2fs: give -EINVAL for norecovery and rw mount") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
With below scripts, it will trigger panic in f2fs: mkfs.f2fs -f /dev/vdd mount /dev/vdd /mnt/f2fs touch /mnt/f2fs/foo sync echo 111 >> /mnt/f2fs/foo f2fs_io fsync /mnt/f2fs/foo f2fs_io shutdown 2 /mnt/f2fs umount /mnt/f2fs mount -o ro,norecovery /dev/vdd /mnt/f2fs or mount -o ro,disable_roll_forward /dev/vdd /mnt/f2fs F2FS-fs (vdd): f2fs_recover_fsync_data: recovery fsync data, check_only: 0 F2FS-fs (vdd): Mounted with checkpoint version = 7f5c361f F2FS-fs (vdd): Stopped filesystem due to reason: 0 F2FS-fs (vdd): f2fs_recover_fsync_data: recovery fsync data, check_only: 1 Filesystem f2fs get_tree() didn't set fc->root, returned 1 ------------[ cut here ]------------ kernel BUG at fs/super.c:1761! Oops: invalid opcode: 0000 [#1] SMP PTI CPU: 3 UID: 0 PID: 722 Comm: mount Not tainted 6.18.0-rc2+ torvalds#721 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:vfs_get_tree.cold+0x18/0x1a Call Trace: <TASK> fc_mount+0x13/0xa0 path_mount+0x34e/0xc50 __x64_sys_mount+0x121/0x150 do_syscall_64+0x84/0x800 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fa6cc126cfe The root cause is we missed to handle error number returned from f2fs_recover_fsync_data() when mounting image w/ ro,norecovery or ro,disable_roll_forward mount option, result in returning a positive error number to vfs_get_tree(), fix it. Cc: stable@kernel.org Fixes: 6781eab ("f2fs: give -EINVAL for norecovery and rw mount") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
No description provided.