-
Notifications
You must be signed in to change notification settings - Fork 58.5k
Dev39 #112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Dev39 #112
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Author
|
I'm really sorry, this was created by mistake. Please ignore! |
cyndis
pushed a commit
to cyndis/linux
that referenced
this pull request
Aug 5, 2014
Set disconnected flag in struct usbhid when a usb device is removed. Check for disconnected flag before sending urb requests. This prevents a kernel panic when a hid driver calls hid_hw_request() after removing a usb device. BUG: unable to handle kernel NULL pointer dereference at 0000000000000058 IP: [<ffffffff8161746f>] hid_submit_ctrl+0x7f/0x290 PGD 0 Oops: 0002 [#1] PREEMPT SMP CPU: 2 PID: 39 Comm: khubd Tainted: G IO 3.16.0-rc5+ torvalds#112 Hardware name: Microsoft Corporation Surface Pro 2/Surface Pro 2, BIOS 2.03.0250 09/06/2013 task: ffff880118aba6e0 ti: ffff8800daf80000 task.ti: ffff8800daf80000 RIP: 0010:[<ffffffff8161746f>] [<ffffffff8161746f>] hid_submit_ctrl+0x7f/0x290 RSP: 0018:ffff8800daf83750 EFLAGS: 00010086 RAX: 0000000080000300 RBX: ffff88003f60c000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff880117f78000 RBP: ffff8800daf83788 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000000000 R12: ffff880117f78000 R13: ffff88003f11a290 R14: 000000000000000c R15: ffff880091cb3ab8 FS: 0000000000000000(0000) GS:ffff88011b000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000058 CR3: 0000000001c11000 CR4: 00000000001407e0 Stack: ffff880117f3dcd0 ffff880117f78000 ffff88003f60c000 ffff880117f78000 ffff880117f78000 ffff88003f11a290 0000000000000000 ffff8800daf837b0 ffffffff81617707 ffff880117f78000 ffff88003f60c000 0000000000000013 Call Trace: [<ffffffff81617707>] usbhid_restart_ctrl_queue+0x87/0x140 [<ffffffff81617a88>] usbhid_submit_report+0x2c8/0x370 [<ffffffff81617b4a>] usbhid_request+0x1a/0x30 [<ffffffffa020edfb>] sensor_hub_set_feature+0x8b/0xd0 [hid_sensor_hub] [<ffffffffa02d9084>] hid_sensor_power_state+0x84/0x110 [hid_sensor_trigger] [<ffffffffa02d9129>] hid_sensor_data_rdy_trigger_set_state+0x19/0x20 [hid_sensor_trigger] [<ffffffffa034d5b7>] iio_triggered_buffer_predisable+0xa7/0xb0 [industrialio] [<ffffffffa034cc4a>] iio_disable_all_buffers+0x3a/0xc0 [industrialio] [<ffffffffa03487d3>] iio_device_unregister+0x53/0x80 [industrialio] [<ffffffffa026c06a>] hid_accel_3d_remove+0x2a/0x50 [hid_sensor_accel_3d] [<ffffffff814f433d>] platform_drv_remove+0x1d/0x40 [<ffffffff814f18bf>] __device_release_driver+0x7f/0xf0 [<ffffffff814f1955>] device_release_driver+0x25/0x40 [<ffffffff814f121c>] bus_remove_device+0x11c/0x1a0 [<ffffffff814ed7d6>] device_del+0x136/0x1e0 [<ffffffff81512190>] ? mfd_cell_disable+0x80/0x80 [<ffffffff814f41d1>] platform_device_del+0x21/0xc0 [<ffffffff814f4282>] platform_device_unregister+0x12/0x30 [<ffffffff815121d3>] mfd_remove_devices_fn+0x43/0x50 [<ffffffff814ed3e3>] device_for_each_child+0x43/0x70 [<ffffffff81512105>] mfd_remove_devices+0x25/0x30 [<ffffffffa020ebd7>] sensor_hub_remove+0x87/0x140 [hid_sensor_hub] [<ffffffff81607c5b>] hid_device_remove+0x6b/0xd0 [<ffffffff814f18bf>] __device_release_driver+0x7f/0xf0 [<ffffffff814f1955>] device_release_driver+0x25/0x40 [<ffffffff814f121c>] bus_remove_device+0x11c/0x1a0 [<ffffffff814ed7d6>] device_del+0x136/0x1e0 [<ffffffff81607d47>] hid_destroy_device+0x27/0x60 [<ffffffff81616972>] usbhid_disconnect+0x22/0x50 [<ffffffff81568597>] usb_unbind_interface+0x77/0x2b0 [<ffffffff814f18bf>] __device_release_driver+0x7f/0xf0 [<ffffffff814f1955>] device_release_driver+0x25/0x40 [<ffffffff814f121c>] bus_remove_device+0x11c/0x1a0 [<ffffffff814ed7d6>] device_del+0x136/0x1e0 [<ffffffff81565cd1>] usb_disable_device+0x91/0x2a0 [<ffffffff8155b046>] usb_disconnect+0x96/0x2e0 [<ffffffff8155d74a>] hub_thread+0xb5a/0x1840 Signed-off-by: Reyad Attiyat <reyad.attiyat@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
swarren
pushed a commit
to swarren/linux-tegra
that referenced
this pull request
Oct 16, 2014
WARNING: 'retuned' may be misspelled - perhaps 'returned'? torvalds#112: FILE: fs/ocfs2/file.c:2329: + * If generic_file_buffered_write() retuned a synchronous error total: 0 errors, 1 warnings, 121 lines checked ./patches/ocfs2-do-not-fallback-to-buffer-i-o-write-if-fill-holes.patch has style problems, please review. If any of these errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. Please run checkpatch prior to sending patches Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Weiwei Wang <wangww631@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
aryabinin
pushed a commit
to aryabinin/linux
that referenced
this pull request
Oct 27, 2014
WARNING: 'retuned' may be misspelled - perhaps 'returned'? torvalds#112: FILE: fs/ocfs2/file.c:2329: + * If generic_file_buffered_write() retuned a synchronous error total: 0 errors, 1 warnings, 121 lines checked ./patches/ocfs2-do-not-fallback-to-buffer-i-o-write-if-fill-holes.patch has style problems, please review. If any of these errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. Please run checkpatch prior to sending patches Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Weiwei Wang <wangww631@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
tobetter
pushed a commit
to tobetter/linux
that referenced
this pull request
Sep 22, 2015
Remove excessive kernel log of amvideocap driver
0day-ci
pushed a commit
to 0day-ci/linux
that referenced
this pull request
Jun 4, 2016
Commit e8d975e ("fixing infinite OPEN loop in 4.0 stateid recovery") introduced access to state after it was just potentially freed by nfs4_put_open_state leading to a random data corruption somewhere. BUG: unable to handle kernel paging request at ffff88004941ee40 IP: [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740 PGD 3501067 PUD 3504067 PMD 6ff37067 PTE 800000004941e060 Oops: 0002 [#1] SMP DEBUG_PAGEALLOC Modules linked in: loop rpcsec_gss_krb5 acpi_cpufreq tpm_tis joydev i2c_piix4 pcspkr tpm virtio_console nfsd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops floppy serio_raw virtio_blk drm CPU: 6 PID: 2161 Comm: 192.168.10.253- Not tainted 4.7.0-rc1-vm-nfs+ torvalds#112 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff8800463dcd00 ti: ffff88003ff48000 task.ti: ffff88003ff48000 RIP: 0010:[<ffffffff813baf01>] [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740 RSP: 0018:ffff88003ff4bd68 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffffff81a49900 RCX: 00000000000000e8 RDX: 00000000000000e8 RSI: ffff8800418b9930 RDI: ffff880040c96c88 RBP: ffff88003ff4bdf8 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff880040c96c98 R13: ffff88004941ee20 R14: ffff88004941ee40 R15: ffff88004941ee00 FS: 0000000000000000(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff88004941ee40 CR3: 0000000060b0b000 CR4: 00000000000006e0 Stack: ffffffff813baad5 ffff8800463dcd00 ffff880000000001 ffffffff810e6b68 ffff880043ddbc88 ffff8800418b9800 ffff8800418b98c8 ffff88004941ee48 ffff880040c96c90 ffff880040c96c00 ffff880040c96c20 ffff880040c96c40 Call Trace: [<ffffffff813baad5>] ? nfs4_do_reclaim+0x35/0x740 [<ffffffff810e6b68>] ? trace_hardirqs_on_caller+0x128/0x1b0 [<ffffffff813bb7cd>] nfs4_run_state_manager+0x5ed/0xa40 [<ffffffff813bb1e0>] ? nfs4_do_reclaim+0x740/0x740 [<ffffffff813bb1e0>] ? nfs4_do_reclaim+0x740/0x740 [<ffffffff810af0d1>] kthread+0x101/0x120 [<ffffffff810e6b68>] ? trace_hardirqs_on_caller+0x128/0x1b0 [<ffffffff818843af>] ret_from_fork+0x1f/0x40 [<ffffffff810aefd0>] ? kthread_create_on_node+0x250/0x250 Code: 65 80 4c 8b b5 78 ff ff ff e8 fc 88 4c 00 48 8b 7d 88 e8 13 67 d2 ff 49 8b 47 40 a8 02 0f 84 d3 01 00 00 4c 89 ff e8 7f f9 ff ff <f0> 41 80 26 7f 48 8b 7d c8 e8 b1 84 4c 00 e9 39 fd ff ff 3d e6 RIP [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740 RSP <ffff88003ff4bd68> CR2: ffff88004941ee40 Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
0day-ci
pushed a commit
to 0day-ci/linux
that referenced
this pull request
Jun 15, 2016
Commit e8d975e ("fixing infinite OPEN loop in 4.0 stateid recovery") introduced access to state after it was just potentially freed by nfs4_put_open_state leading to a random data corruption somewhere. BUG: unable to handle kernel paging request at ffff88004941ee40 IP: [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740 PGD 3501067 PUD 3504067 PMD 6ff37067 PTE 800000004941e060 Oops: 0002 [#1] SMP DEBUG_PAGEALLOC Modules linked in: loop rpcsec_gss_krb5 acpi_cpufreq tpm_tis joydev i2c_piix4 pcspkr tpm virtio_console nfsd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops floppy serio_raw virtio_blk drm CPU: 6 PID: 2161 Comm: 192.168.10.253- Not tainted 4.7.0-rc1-vm-nfs+ torvalds#112 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff8800463dcd00 ti: ffff88003ff48000 task.ti: ffff88003ff48000 RIP: 0010:[<ffffffff813baf01>] [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740 RSP: 0018:ffff88003ff4bd68 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffffff81a49900 RCX: 00000000000000e8 RDX: 00000000000000e8 RSI: ffff8800418b9930 RDI: ffff880040c96c88 RBP: ffff88003ff4bdf8 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff880040c96c98 R13: ffff88004941ee20 R14: ffff88004941ee40 R15: ffff88004941ee00 FS: 0000000000000000(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff88004941ee40 CR3: 0000000060b0b000 CR4: 00000000000006e0 Stack: ffffffff813baad5 ffff8800463dcd00 ffff880000000001 ffffffff810e6b68 ffff880043ddbc88 ffff8800418b9800 ffff8800418b98c8 ffff88004941ee48 ffff880040c96c90 ffff880040c96c00 ffff880040c96c20 ffff880040c96c40 Call Trace: [<ffffffff813baad5>] ? nfs4_do_reclaim+0x35/0x740 [<ffffffff810e6b68>] ? trace_hardirqs_on_caller+0x128/0x1b0 [<ffffffff813bb7cd>] nfs4_run_state_manager+0x5ed/0xa40 [<ffffffff813bb1e0>] ? nfs4_do_reclaim+0x740/0x740 [<ffffffff813bb1e0>] ? nfs4_do_reclaim+0x740/0x740 [<ffffffff810af0d1>] kthread+0x101/0x120 [<ffffffff810e6b68>] ? trace_hardirqs_on_caller+0x128/0x1b0 [<ffffffff818843af>] ret_from_fork+0x1f/0x40 [<ffffffff810aefd0>] ? kthread_create_on_node+0x250/0x250 Code: 65 80 4c 8b b5 78 ff ff ff e8 fc 88 4c 00 48 8b 7d 88 e8 13 67 d2 ff 49 8b 47 40 a8 02 0f 84 d3 01 00 00 4c 89 ff e8 7f f9 ff ff <f0> 41 80 26 7f 48 8b 7d c8 e8 b1 84 4c 00 e9 39 fd ff ff 3d e6 RIP [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740 RSP <ffff88003ff4bd68> CR2: ffff88004941ee40 Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
torvalds
pushed a commit
that referenced
this pull request
Jun 29, 2016
Commit e8d975e ("fixing infinite OPEN loop in 4.0 stateid recovery") introduced access to state after it was just potentially freed by nfs4_put_open_state leading to a random data corruption somewhere. BUG: unable to handle kernel paging request at ffff88004941ee40 IP: [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740 PGD 3501067 PUD 3504067 PMD 6ff37067 PTE 800000004941e060 Oops: 0002 [#1] SMP DEBUG_PAGEALLOC Modules linked in: loop rpcsec_gss_krb5 acpi_cpufreq tpm_tis joydev i2c_piix4 pcspkr tpm virtio_console nfsd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops floppy serio_raw virtio_blk drm CPU: 6 PID: 2161 Comm: 192.168.10.253- Not tainted 4.7.0-rc1-vm-nfs+ #112 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff8800463dcd00 ti: ffff88003ff48000 task.ti: ffff88003ff48000 RIP: 0010:[<ffffffff813baf01>] [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740 RSP: 0018:ffff88003ff4bd68 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffffff81a49900 RCX: 00000000000000e8 RDX: 00000000000000e8 RSI: ffff8800418b9930 RDI: ffff880040c96c88 RBP: ffff88003ff4bdf8 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff880040c96c98 R13: ffff88004941ee20 R14: ffff88004941ee40 R15: ffff88004941ee00 FS: 0000000000000000(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff88004941ee40 CR3: 0000000060b0b000 CR4: 00000000000006e0 Stack: ffffffff813baad5 ffff8800463dcd00 ffff880000000001 ffffffff810e6b68 ffff880043ddbc88 ffff8800418b9800 ffff8800418b98c8 ffff88004941ee48 ffff880040c96c90 ffff880040c96c00 ffff880040c96c20 ffff880040c96c40 Call Trace: [<ffffffff813baad5>] ? nfs4_do_reclaim+0x35/0x740 [<ffffffff810e6b68>] ? trace_hardirqs_on_caller+0x128/0x1b0 [<ffffffff813bb7cd>] nfs4_run_state_manager+0x5ed/0xa40 [<ffffffff813bb1e0>] ? nfs4_do_reclaim+0x740/0x740 [<ffffffff813bb1e0>] ? nfs4_do_reclaim+0x740/0x740 [<ffffffff810af0d1>] kthread+0x101/0x120 [<ffffffff810e6b68>] ? trace_hardirqs_on_caller+0x128/0x1b0 [<ffffffff818843af>] ret_from_fork+0x1f/0x40 [<ffffffff810aefd0>] ? kthread_create_on_node+0x250/0x250 Code: 65 80 4c 8b b5 78 ff ff ff e8 fc 88 4c 00 48 8b 7d 88 e8 13 67 d2 ff 49 8b 47 40 a8 02 0f 84 d3 01 00 00 4c 89 ff e8 7f f9 ff ff <f0> 41 80 26 7f 48 8b 7d c8 e8 b1 84 4c 00 e9 39 fd ff ff 3d e6 RIP [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740 RSP <ffff88003ff4bd68> CR2: ffff88004941ee40 Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
CkNoSFeRaTU
pushed a commit
to CkNoSFeRaTU/linux
that referenced
this pull request
Aug 25, 2016
TBS 5220, 5881 support.
laijs
pushed a commit
to laijs/linux
that referenced
this pull request
Feb 13, 2017
Revert "lkl: Fix missing screen_info during compilation."
fengguang
pushed a commit
to 0day-ci/linux
that referenced
this pull request
May 11, 2017
Shubham was recently asking on netdev why in arm64 JIT we don't multiply
the index for accessing the tail call map by 8. That led me into testing
out arm64 JIT wrt tail calls and it turned out I got a NULL pointer
dereference on the tail call.
The buggy access is at:
prog = array->ptrs[index];
if (prog == NULL)
goto out;
[...]
00000060: d2800e0a mov x10, #0x70 // torvalds#112
00000064: f86a682a ldr x10, [x1,x10]
00000068: f862694b ldr x11, [x10,x2]
0000006c: b40000ab cbz x11, 0x00000080
[...]
The code triggering the crash is f862694b. x1 at the time contains the
address of the bpf array, x10 offsetof(struct bpf_array, ptrs). Meaning,
above we load the pointer to the program at map slot 0 into x10. x10
can then be NULL if the slot is not occupied, which we later on try to
access with a user given offset in x2 that is the map index.
Fix this by emitting the following instead:
[...]
00000060: d2800e0a mov x10, #0x70 // torvalds#112
00000064: 8b0a002a add x10, x1, x10
00000068: d37df04b lsl x11, x2, #3
0000006c: f86b694b ldr x11, [x10,x11]
00000070: b40000ab cbz x11, 0x00000084
[...]
This basically adds the offset to ptrs to the base address of the bpf
array we got and we later on access the map with an index * 8 offset
relative to that. The tail call map itself is basically one large area
with meta data at the head followed by the array of prog pointers.
This makes tail calls working again, tested on Cavium ThunderX ARMv8.
Fixes: ddb5599 ("arm64: bpf: implement bpf_tail_call() helper")
Reported-by: Shubham Bansal <illusionist.neo@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
fengguang
pushed a commit
to 0day-ci/linux
that referenced
this pull request
May 11, 2017
Shubham was recently asking on netdev why in arm64 JIT we don't multiply
the index for accessing the tail call map by 8. That led me into testing
out arm64 JIT wrt tail calls and it turned out I got a NULL pointer
dereference on the tail call.
The buggy access is at:
prog = array->ptrs[index];
if (prog == NULL)
goto out;
[...]
00000060: d2800e0a mov x10, #0x70 // torvalds#112
00000064: f86a682a ldr x10, [x1,x10]
00000068: f862694b ldr x11, [x10,x2]
0000006c: b40000ab cbz x11, 0x00000080
[...]
The code triggering the crash is f862694b. x1 at the time contains the
address of the bpf array, x10 offsetof(struct bpf_array, ptrs). Meaning,
above we load the pointer to the program at map slot 0 into x10. x10
can then be NULL if the slot is not occupied, which we later on try to
access with a user given offset in x2 that is the map index.
Fix this by emitting the following instead:
[...]
00000060: d2800e0a mov x10, #0x70 // torvalds#112
00000064: 8b0a002a add x10, x1, x10
00000068: d37df04b lsl x11, x2, #3
0000006c: f86b694b ldr x11, [x10,x11]
00000070: b40000ab cbz x11, 0x00000084
[...]
This basically adds the offset to ptrs to the base address of the bpf
array we got and we later on access the map with an index * 8 offset
relative to that. The tail call map itself is basically one large area
with meta data at the head followed by the array of prog pointers.
This makes tail calls working again, tested on Cavium ThunderX ARMv8.
Fixes: ddb5599 ("arm64: bpf: implement bpf_tail_call() helper")
Reported-by: Shubham Bansal <illusionist.neo@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
fengguang
pushed a commit
to 0day-ci/linux
that referenced
this pull request
May 21, 2017
GIT dac94e29110cd606dec37673644caf2cf6fd1dde
commit 7e1b9521f5a8356553f5e58b07952bf346632ea4
Author: Colin Ian King <colin.king@canonical.com>
Date: Sat Mar 11 19:09:45 2017 +0000
dm cache: handle kmalloc failure allocating background_tracker struct
Currently there is no kmalloc failure check on the allocation of
the background_tracker struct in btracker_create(), and so a NULL return
will lead to a NULL pointer dereference. Add a NULL check.
Detected by CoverityScan, CID#1416587 ("Dereference null return value")
Fixes: b29d4986d ("dm cache: significant rework to leverage dm-bio-prison-v2")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
commit 83345d51a49a4b3f3b4a08a5db644dae438b0189
Author: Tin Huynh <tnhuynh@apm.com>
Date: Wed May 17 11:25:34 2017 +0700
i2c: xgene: Set ACPI_COMPANION_I2C
With ACPI, i2c-core requires ACPI companion to be set in order for it
to create slave device.
This patch sets the ACPI companion accordingly.
Signed-off-by: Tin Huynh <tnhuynh@apm.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
commit 88ad60c23a394b2f8bf1e570c756f415435d1d35
Author: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Date: Tue May 16 14:07:24 2017 +0200
i2c: mv64xxx: don't override deferred probing when getting irq
There is no reason to use platform_get_irq() for non-DT probing and
irq_of_parse_and_map() for DT probing. Indeed, platform_get_irq()
works fine for both.
In addition, using platform_get_irq() properly returns -EPROBE_DEFER
when the interrupt controller is not yet available, so instead of
inventing our own error code (-ENXIO), return the one provided by
platform_get_irq().
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
commit 13840d38016203f0095cd547b90352812d24b787
Author: Mikulas Patocka <mpatocka@redhat.com>
Date: Sun Apr 30 17:32:28 2017 -0400
dm bufio: make the parameter "retain_bytes" unsigned long
Change the type of the parameter "retain_bytes" from unsigned to
unsigned long, so that on 64-bit machines the user can set more than
4GiB of data to be retained.
Also, change the type of the variable "count" in the function
"__evict_old_buffers" to unsigned long. The assignment
"count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];"
could result in unsigned long to unsigned overflow and that could result
in buffers not being freed when they should.
While at it, avoid division in get_retain_buffers(). Division is slow,
we can change it to shift because we have precalculated the log2 of
block size.
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
commit 6f61dd3aa35179043f1fcdb0965c5d56278ab724
Author: Kees Cook <keescook@chromium.org>
Date: Fri May 12 14:52:34 2017 -0700
efi-pstore: Fix read iter after pstore API refactor
During the internal pstore API refactoring, the EFI vars read entry was
accidentally made to update a stack variable instead of the pstore
private data pointer. This corrects the problem (and removes the now
needless argument).
Fixes: 125cc42baf8a ("pstore: Replace arguments for read() API")
Signed-off-by: Kees Cook <keescook@chromium.org>
commit 8b671f906c2debc4f2393438c4e7668936522e99
Author: Shannon Nelson <shannon.nelson@oracle.com>
Date: Mon May 15 10:51:08 2017 -0700
ldmvsw: stop the clean timer at beginning of remove
Stop the clean timer earlier to be sure there's no asynchronous
interference while stopping the port.
Orabug: 25748241
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit b18e5e86b44be0dca399d8e2383f97c8077392ce
Author: Thomas Tai <thomas.tai@oracle.com>
Date: Mon May 15 10:51:07 2017 -0700
ldmvsw: unregistering netdev before disable hardware
When running LDom binding/unbinding test, kernel may panic
in ldmvsw_open(). It is more likely that because we're removing
the ldc connection before unregistering the netdev in vsw_port_remove(),
we set up a window of time where one process could be removing the
device while another trying to UP the device. This also sometimes causes
vio handshake error due to opening a device without closing it completely.
We should unregister the netdev before we disable the "hardware".
Orabug: 25980913, 25925306
Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit ca9df7ede41afd006d74fd6f09f36d909d0eaad7
Author: Miroslav Lichvar <mlichvar@redhat.com>
Date: Mon May 15 16:04:36 2017 +0200
net: netcp: fix check of requested timestamping filter
The driver doesn't support timestamping of all received packets and
should return error when trying to enable the HWTSTAMP_FILTER_ALL
filter.
Cc: WingMan Kwok <w-kwok2@ti.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit f98e0eb68008aff9824d1c4dad7276c8bab83ca5
Author: Christoph Hellwig <hch@lst.de>
Date: Mon May 15 17:28:38 2017 +0200
dm mpath: multipath_clone_and_map must not return -EIO
Since 412445ac ("dm: introduce a new DM_MAPIO_KILL return value"), the
clone_and_map_rq methods must not return errno values, so fix it up
to properly return DM_MAPIO_KILL, instead of the -EIO value that snuck
in due to a conflict between two patches.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
commit 18a482f5245cc875755090853e84283512b3e6bd
Author: Christoph Hellwig <hch@lst.de>
Date: Mon May 15 17:28:37 2017 +0200
dm mpath: don't return -EIO from dm_report_EIO
Instead just turn the macro into a helper for the warning message.
This removes an unnecessary assignment and will allow the next commit to
fix a place where -EIO is the wrong return value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
commit ece0728037b15f4d31198f12b359104bcb5db4c8
Author: Christoph Hellwig <hch@lst.de>
Date: Mon May 15 17:28:36 2017 +0200
dm rq: add a missing break to map_request
We don't want to bug when receiving a DM_MAPIO_KILL value..
Fixes: 412445ac ("dm: introduce a new DM_MAPIO_KILL return value")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
commit 0377a07c7a035e0d033cd8b29f0cb15244c0916a
Author: Joe Thornber <ejt@redhat.com>
Date: Mon May 15 09:45:40 2017 -0400
dm space map disk: fix some book keeping in the disk space map
When decrementing the reference count for a block, the free count wasn't
being updated if the reference count went to zero.
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
commit 91bcdb92d39711d1adb40c26b653b7978d93eb98
Author: Joe Thornber <ejt@redhat.com>
Date: Mon May 15 09:43:05 2017 -0400
dm thin metadata: call precommit before saving the roots
These calls were the wrong way round in __write_initial_superblock.
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
commit 66eb9f86e50547ec2a8ff7a75997066a74ef584b
Author: Mahesh Bandewar <maheshb@google.com>
Date: Fri May 12 17:03:39 2017 -0700
ipv6: avoid dad-failures for addresses with NODAD
Every address gets added with TENTATIVE flag even for the addresses with
IFA_F_NODAD flag and dad-work is scheduled for them. During this DAD process
we realize it's an address with NODAD and complete the process without
sending any probe. However the TENTATIVE flags stays on the
address for sometime enough to cause misinterpretation when we receive a NS.
While processing NS, if the address has TENTATIVE flag, we mark it DADFAILED
and endup with an address that was originally configured as NODAD with
DADFAILED.
We can't avoid scheduling dad_work for addresses with NODAD but we can
avoid adding TENTATIVE flag to avoid this racy situation.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit aa4ad88cfcd4ee45f527fb982140576711e3b501
Author: Mintz, Yuval <Yuval.Mintz@cavium.com>
Date: Sun May 14 12:21:23 2017 +0300
qed: Fix uninitialized data in aRFS infrastructure
Current memset is using incorrect type of variable, causing the
upper-half of the strucutre to be left uninitialized and causing:
ethernet/qlogic/qed/qed_init_fw_funcs.c: In function 'qed_set_rfs_mode_disable':
ethernet/qlogic/qed/qed_init_fw_funcs.c:993:3: error: '*((void *)&ramline+4)' is used uninitialized in this function [-Werror=uninitialized]
Fixes: d51e4af5c209 ("qed: aRFS infrastructure support")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 8c977f5a856a7276450ddf3a7b57b4a8815b63f9
Author: Julia Lawall <julia.lawall@lip6.fr>
Date: Fri May 12 22:54:23 2017 +0800
mdio: mux: fix device_node_continue.cocci warnings
Device node iterators put the previous value of the index variable, so an
explicit put causes a double put.
In particular, of_mdiobus_register can fail before doing anything
interesting, so one could view it as a no-op from the reference count
point of view.
Generated by: scripts/coccinelle/iterators/device_node_continue.cocci
CC: Jon Mason <jon.mason@broadcom.com>
Signed-off-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit d19b183cdc1fa3d70d6abe2a4c369e748cd7ebb8
Author: Douglas Caetano dos Santos <douglascs@taghos.com.br>
Date: Fri May 12 15:19:15 2017 -0300
net/packet: fix missing net_device reference release
When using a TX ring buffer, if an error occurs processing a control
message (e.g. invalid message), the net_device reference is not
released.
Fixes c14ac9451c348 ("sock: enable timestamping using control messages")
Signed-off-by: Douglas Caetano dos Santos <douglascs@taghos.com.br>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 4762010f09ac0453f613df345c5281e7f2dec510
Author: yuval.shaia@oracle.com <yuval.shaia@oracle.com>
Date: Fri May 12 09:10:51 2017 +0300
net/mlx4_core: Use min3 to select number of MSI-X vectors
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 70957eaecc2e43308e403c80293bec3d59632412
Author: Vlad Yasevich <vyasevich@gmail.com>
Date: Thu May 11 11:09:52 2017 -0400
macvlan: Fix performance issues with vlan tagged packets
Macvlan always turns on offload features that have sofware
fallback (NETIF_GSO_SOFTWARE). This allows much higher guest-guest
communications over macvtap.
However, macvtap does not turn on these features for vlan tagged traffic.
As a result, depending on the HW that mactap is configured on, the
performance of guest-guest communication over a vlan is very
inconsistent. If the HW supports TSO/UFO over vlans, then the
performance will be fine. If not, the the performance will suffer
greatly since the VM may continue using TSO/UFO, and will force the host
segment the traffic and possibly overlow the macvtap queue.
This patch adds the always on offloads to vlan_features. This
makes sure that any vlan tagged traffic between 2 guest will not
be segmented needlessly.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 9fce894d03a98ec8e8e8106a964644633d2772ee
Author: Peter Rosin <peda@axentia.se>
Date: Mon May 15 09:03:50 2017 +0200
i2c: mux: only print failure message on error
As is, a failure message is printed unconditionally, which is confusing.
And noisy.
Fixes: 8d4d159f25a7 ("i2c: mux: provide more info on failure in i2c_mux_add_adapter")
Signed-off-by: Peter Rosin <peda@axentia.se>
commit a36d4637e4a06be067b8e327a0b1118bb2a73cb8
Author: Peter Rosin <peda@axentia.se>
Date: Mon May 15 18:48:55 2017 +0200
i2c: mux: reg: rename label to indicate what it does
That maintains sanity if it is ever called from some other spot, and
also makes the label names coherent.
Signed-off-by: Peter Rosin <peda@axentia.se>
commit 68118e0e73aa3a6291c8b9eb1ee708e05f110cea
Author: Peter Rosin <peda@axentia.se>
Date: Sun May 7 07:16:30 2017 +0200
i2c: mux: reg: put away the parent i2c adapter on probe failure
It is only prudent to let go of resources that are not used.
Fixes: b3fdd32799d8 ("i2c: mux: Add register-based mux i2c-mux-reg")
Signed-off-by: Peter Rosin <peda@axentia.se>
commit 66c25f6e31766a9ec19c2bdc7f5f69f9c59bafd7
Author: Niklas Cassel <niklas.cassel@axis.com>
Date: Mon May 15 10:56:06 2017 +0200
net: stmmac: use correct pointer when printing normal descriptor ring
There are two pointers in sysfs_display_ring,
one that increments if using normal dma descriptors,
another if using extended dma descriptors.
When printing the normal dma descriptors, the wrong pointer is used,
thus the printed descriptor addresses are incorrect.
Signed-off-by: Niklas Cassel <niklas.cassel@axis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit fb317002ab4419ae7e068bee6897f2d5745aa3b9
Author: Heiko Carstens <heiko.carstens@de.ibm.com>
Date: Tue May 9 12:50:53 2017 +0200
s390/virtio: change virtio_feature_desc:features type to __le32
The feature member of virtio_feature_desc contains little endian
values, given that it contents will be converted with
le32_to_cpu(). The "wrong" __u32 type leads to the sparse warnings
below.
In order to avoid them, use the correct __le32 type instead.
drivers/s390/virtio/virtio_ccw.c:749:14: warning: cast to restricted __le32
drivers/s390/virtio/virtio_ccw.c:762:28: warning: cast to restricted __le32
Acked-by: Halil Pasic <pasic@linux.vnet.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
commit 2e63309507c818e8b631a03f02c363031c007fb7
Author: Joe Thornber <ejt@redhat.com>
Date: Thu May 11 09:09:04 2017 -0400
dm cache policy smq: don't do any writebacks unless IDLE
If there are no clean blocks to be demoted the writeback will be
triggered at that point. Preemptively writing back can hurt high IO
load scenarios.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
commit 49b7f768900f4084a65c3689d955b2fceac39e53
Author: Joe Thornber <ejt@redhat.com>
Date: Thu May 11 09:07:16 2017 -0400
dm cache: simplify the IDLE vs BUSY state calculation
Drop the MODERATE state since it wasn't buying us much.
Also, in check_migrations(), prepare for the next commit ("dm cache
policy smq: don't do any writebacks unless IDLE") by deferring to the
policy to make the final decision on whether writebacks can be
serviced.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
commit 701e03e4e180f0cd97d4139a32e2b2d879d12da2
Author: Joe Thornber <ejt@redhat.com>
Date: Thu May 11 08:22:31 2017 -0400
dm cache: track all IO to the cache rather than just the origin device's IO
IO tracking used to throttle writebacks when the origin device is busy.
Even if all the IO is going to the fast device, writebacks can
significantly degrade performance. So track all IO to gauge whether the
cache is busy or not.
Otherwise, synthetic IO tests (e.g. fio) that might send all IO to the
fast device wouldn't cause writebacks to get throttled.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
commit 6cf4cc8f8b3b7bc9e3c04a7eab44b985d50029fc
Author: Joe Thornber <ejt@redhat.com>
Date: Thu May 11 07:48:18 2017 -0400
dm cache policy smq: stop preemptively demoting blocks
It causes a lot of churn if the working set's size is close to the fast
device's size.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
commit 4d44ec5ab751be63c5d348f13294304d87baa8c3
Author: Joe Thornber <ejt@redhat.com>
Date: Thu May 11 05:11:06 2017 -0400
dm cache policy smq: put newly promoted entries at the top of the multiqueue
This stops entries bouncing in and out of the cache quickly.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
commit 78c45607b909fb384c47c134d89b39285a6a8b45
Author: Joe Thornber <ejt@redhat.com>
Date: Thu May 11 05:09:38 2017 -0400
dm cache policy smq: be more aggressive about triggering a writeback
If there are no clean entries to demote we really want to writeback
immediately.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
commit a8cd1eba6135e086109e2b94bf96deb17456ede8
Author: Joe Thornber <ejt@redhat.com>
Date: Thu May 11 05:07:34 2017 -0400
dm cache policy smq: only demote entries in bottom half of the clean multiqueue
Heavy IO load may mean there are very few clean blocks in the cache, and
we risk demoting entries that get hit a lot.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
commit 072792dcdfc8d5f91a26050e5665285f50afebf5
Author: Joe Thornber <ejt@redhat.com>
Date: Thu May 11 06:14:16 2017 -0400
dm cache: fix incorrect 'idle_time' reset in IO tracker
Some bios have no payload (eg, a FLUSH), don't reset the idle_time when
these come in.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
commit 508541146af18e43072e41a31aa62fac2b01aac1
Author: Yishai Hadas <yishaih@mellanox.com>
Date: Tue Apr 25 10:39:57 2017 +0300
net/mlx5: Use underlay QPN from the root name space
Root flow table is dynamically changed by the underlying flow steering
layer, and IPoIB/ULPs have no idea what will be the root flow table in
the future, hence we need a dynamic infrastructure to move Underlay QPs
with the root flow table.
Fixes: b3ba51498bdd ("net/mlx5: Refactor create flow table method to accept underlay QP")
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
commit 5360fd473c23f18b42722cdb13a1c6ec7acd96ff
Author: Saeed Mahameed <saeedm@mellanox.com>
Date: Thu May 4 17:53:32 2017 +0300
net/mlx5e: IPoIB, Only support regular RQ for now
IPoIB doesn't support striding RQ at the moment, for this
we need to explicitly choose non striding RQ in IPoIB init,
even if the HW supports it.
Fixes: 8f493ffd88ea ("net/mlx5e: IPoIB, RX steering RSS RQTs and TIRs")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
commit 20b6a1c78280dbeb45214c463cf9cbccb3665146
Author: Saeed Mahameed <saeedm@mellanox.com>
Date: Tue May 9 16:40:46 2017 +0300
net/mlx5e: Fix setup TC ndo
Fail-safe support patches introduced a trivial bug,
setup tc callback is doing a wrong check of the netdevice state,
the fix is simply to invert the condition.
Fixes: 6f9485af4020 ("net/mlx5e: Fail safe tc setup")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
commit e3c19503712d6360239b19c14cded56dd63c40d7
Author: Gal Pressman <galp@mellanox.com>
Date: Wed Apr 19 14:35:15 2017 +0300
net/mlx5e: Fix ethtool pause support and advertise reporting
Pause bit should set when RX pause is on, not TX pause.
Also, setting Asym_Pause is incorrect, and should be turned off.
Fixes: 665bc53969d7 ("net/mlx5e: Use new ethtool get/set link ksettings API")
Signed-off-by: Gal Pressman <galp@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
commit b383b544f2666d67446b951a9a97af239dafed5d
Author: Gal Pressman <galp@mellanox.com>
Date: Mon Apr 3 15:11:22 2017 +0300
net/mlx5e: Use the correct pause values for ethtool advertising
Query the operational pause from firmware (PFCC register) instead of
always passing zeros.
Fixes: 665bc53969d7 ("net/mlx5e: Use new ethtool get/set link ksettings API")
Signed-off-by: Gal Pressman <galp@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
commit 67b4c889cc835a2a6e2ff4e20544a33e37e2875d
Author: Steve French <smfrench@gmail.com>
Date: Fri May 12 20:59:10 2017 -0500
[CIFS] Minor cleanup of xattr query function
Some minor cleanup of cifs query xattr functions (will also make
SMB3 xattr implementation cleaner as well).
Signed-off-by: Steve French <steve.french@primarydata.com>
commit 4328fea77ca30ef6af938ae3f263a3d055a9c0e6
Author: Karim Eshapa <karim.eshapa@gmail.com>
Date: Fri May 12 01:53:38 2017 +0200
fs: cifs: transport: Use time_after for time comparison
Use time_after kernel macro for time comparison
that has safety check.
Signed-off-by: Karim Eshapa <karim.eshapa@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
commit cd1230070ae1c12fd34cf6a557bfa81bf9311009
Author: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Date: Fri May 12 17:59:32 2017 +0200
SMB2: Fix share type handling
In fs/cifs/smb2pdu.h, we have:
#define SMB2_SHARE_TYPE_DISK 0x01
#define SMB2_SHARE_TYPE_PIPE 0x02
#define SMB2_SHARE_TYPE_PRINT 0x03
Knowing that, with the current code, the SMB2_SHARE_TYPE_PRINT case can
never trigger and printer share would be interpreted as disk share.
So, test the ShareType value for equality instead.
Fixes: faaf946a7d5b ("CIFS: Add tree connect/disconnect capability for SMB2")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
commit ecdcf622eb74b52cebde1387a7a1852a787d8050
Author: Joe Perches via samba-technical <samba-technical@lists.samba.org>
Date: Sun May 7 01:31:47 2017 -0700
cifs: cifsacl: Use a temporary ops variable to reduce code length
Create an ops variable to store tcon->ses->server->ops and cache
indirections and reduce code size a trivial bit.
$ size fs/cifs/cifsacl.o*
text data bss dec hex filename
5338 136 8 5482 156a fs/cifs/cifsacl.o.new
5371 136 8 5515 158b fs/cifs/cifsacl.o.old
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
commit 1c4d5f51a812a82de97beee24f48ed05c65ebda5
Author: Neil Horman <nhorman@tuxdriver.com>
Date: Fri May 12 12:00:01 2017 -0400
vmxnet3: ensure that adapter is in proper state during force_close
There are several paths in vmxnet3, where settings changes cause the
adapter to be brought down and back up (vmxnet3_set_ringparam among
them). Should part of the reset operation fail, these paths call
vmxnet3_force_close, which enables all napi instances prior to calling
dev_close (with the expectation that vmxnet3_close will then properly
disable them again). However, vmxnet3_force_close neglects to clear
VMXNET3_STATE_BIT_QUIESCED prior to calling dev_close. As a result
vmxnet3_quiesce_dev (called from vmxnet3_close), returns early, and
leaves all the napi instances in a enabled state while the device itself
is closed. If a device in this state is activated again, napi_enable
will be called on already enabled napi_instances, leading to a BUG halt.
The fix is to simply enausre that the QUIESCED bit is cleared in
vmxnet3_force_close to allow quesence to be completed properly on close.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Shrikrishna Khare <skhare@vmware.com>
CC: "VMware, Inc." <pv-drivers@vmware.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 42e6cae1b35b186650f5abc7b24c20ab6986c5a0
Author: Bert Kenward <bkenward@solarflare.com>
Date: Fri May 12 17:18:50 2017 +0100
sfc: revert changes to NIC revision numbers
The revision enum values (eg EFX_REV_HUNT_A0) form part of our API,
and are included in ethtool. If these are inconsistent then ethtool
will print garbage for a register dump (ethtool -d).
Fixes: 5a6681e22c14 ("sfc: separate out SFC4000 ("Falcon") support into new sfc-falcon driver")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit b12ca80ca145cecadf841ba27cc061c510cd97ca
Author: Johan Hovold <johan@kernel.org>
Date: Fri May 12 12:13:26 2017 +0200
net: ch9200: add missing USB-descriptor endianness conversions
Add the missing endianness conversions to a debug statement printing
the USB device-descriptor idVendor and idProduct fields during probe.
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 75cf067953d5ee543b3bda90bbfcbee5e1f94ae8
Author: Johan Hovold <johan@kernel.org>
Date: Fri May 12 12:11:13 2017 +0200
net: irda: irda-usb: fix firmware name on big-endian hosts
Add missing endianness conversion when using the USB device-descriptor
bcdDevice field to construct a firmware file name.
Fixes: 8ef80aef118e ("[IRDA]: irda-usb.c: STIR421x cleanups")
Cc: stable <stable@vger.kernel.org> # 2.6.18
Cc: Nick Fedchik <nfedchik@atlantic-link.com.ua>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 9fc3e4dc67fde953fd26adb3cea8f92597810b64
Author: Gustavo A. R. Silva <garsilva@embeddedor.com>
Date: Thu May 11 22:11:29 2017 -0500
net: dsa: mv88e6xxx: add default case to switch
Add default case to switch in order to avoid any chance of using an
uninitialized variable _low_, in case s->type does not match any of
the listed case values.
Addresses-Coverity-ID: 1398130
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641
Author: Xin Long <lucien.xin@gmail.com>
Date: Fri May 12 14:39:52 2017 +0800
sctp: fix src address selection if using secondary addresses for ipv6
Commit 0ca50d12fe46 ("sctp: fix src address selection if using secondary
addresses") has fixed a src address selection issue when using secondary
addresses for ipv4.
Now sctp ipv6 also has the similar issue. When using a secondary address,
sctp_v6_get_dst tries to choose the saddr which has the most same bits
with the daddr by sctp_v6_addr_match_len. It may make some cases not work
as expected.
hostA:
[1] fd21:356b:459a:cf10::11 (eth1)
[2] fd21:356b:459a:cf20::11 (eth2)
hostB:
[a] fd21:356b:459a:cf30::2 (eth1)
[b] fd21:356b:459a:cf40::2 (eth2)
route from hostA to hostB:
fd21:356b:459a:cf30::/64 dev eth1 metric 1024 mtu 1500
The expected path should be:
fd21:356b:459a:cf10::11 <-> fd21:356b:459a:cf30::2
But addr[2] matches addr[a] more bits than addr[1] does, according to
sctp_v6_addr_match_len. It causes the path to be:
fd21:356b:459a:cf20::11 <-> fd21:356b:459a:cf30::2
This patch is to fix it with the same way as Marcelo's fix for sctp ipv4.
As no ip_dev_find for ipv6, this patch is to use ipv6_chk_addr to check
if the saddr is in a dev instead.
Note that for backwards compatibility, it will still do the addr_match_len
check here when no optimal is found.
Reported-by: Patrick Talbert <ptalbert@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit df0c8d911abf6ba97b2c2fc3c5a12769e0b081a3
Author: Florian Fainelli <f.fainelli@gmail.com>
Date: Thu May 11 11:24:16 2017 -0700
net: phy: Call bus->reset() after releasing PHYs from reset
The API convention makes it that a given MDIO bus reset should be able
to access PHY devices in its reset() callback and perform additional
MDIO accesses in order to bring the bus and PHYs in a working state.
Commit 69226896ad63 ("mdio_bus: Issue GPIO RESET to PHYs.") broke that
contract by first calling bus->reset() and then release all PHYs from
reset using their shared GPIO line, so restore the expected
functionality here.
Fixes: 69226896ad63 ("mdio_bus: Issue GPIO RESET to PHYs.")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 6832a333ed4a7cc4fcb170c045d1d96d0976fdd4
Author: David S. Miller <davem@davemloft.net>
Date: Thu May 11 19:30:02 2017 -0700
bpf: Handle multiple variable additions into packet pointers in verifier.
We must accumulate into reg->aux_off rather than use a plain assignment.
Add a test for this situation to test_align.
Reported-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 844cf763fba654436d3a4279b6a672c196cf1901
Author: Jon Paul Maloy <jon.maloy@ericsson.com>
Date: Thu May 11 20:28:15 2017 +0200
tipc: make macro tipc_wait_for_cond() smp safe
The macro tipc_wait_for_cond() is embedding the macro sk_wait_event()
to fulfil its task. The latter, in turn, is evaluating the stated
condition outside the socket lock context. This is problematic if
the condition is accessing non-trivial data structures which may be
altered by incoming interrupts, as is the case with the cong_links()
linked list, used by socket to keep track of the current set of
congested links. We sometimes see crashes when this list is accessed
by a condition function at the same time as a SOCK_WAKEUP interrupt
is removing an element from the list.
We fix this by expanding selected parts of sk_wait_event() into the
outer macro, while ensuring that all evaluations of a given condition
are performed under socket lock protection.
Fixes: commit 365ad353c256 ("tipc: reduce risk of user starvation during link congestion")
Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit ad990dbe6d3ac3af1f5f4484b1126b9fc601e98a
Author: Andy Gospodarek <andy@greyhouse.net>
Date: Thu May 11 15:52:30 2017 -0400
samples/bpf: run cleanup routines when receiving SIGTERM
Shahid Habib noticed that when xdp1 was killed from a different console the xdp
program was not cleaned-up properly in the kernel and it continued to forward
traffic.
Most of the applications in samples/bpf cleanup properly, but only when getting
SIGINT. Since kill defaults to using SIGTERM, add support to cleanup when the
application receives either SIGINT or SIGTERM.
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Reported-by: Shahid Habib <shahid.habib@broadcom.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit d2be3667f3769b3c60aa294ef7f2b03d1b16559c
Author: Colin Ian King <colin.king@canonical.com>
Date: Thu May 11 19:29:40 2017 +0100
ethernet: aquantia: remove redundant checks on error status
The error status err is initialized as zero and then being checked
several times to see if it is less than zero even when it has not
been updated. It may seem that the err should be assigned to the
return code of the call to the various *offload_en_set calls and
then we check for failure, however, these functions are void and
never actually return any status.
Since these error checks are redundant we can remove these
as well as err and the error exit label err_exit.
Detected by CoverityScan, CID#1398313 and CID#1398306 ("Logically
dead code")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Acked-by: Pavel Belous <pavel.belous@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 69a73e744db1a57175039e3d1e6ef3913e816eb8
Author: David S. Miller <davem@davemloft.net>
Date: Thu May 11 21:41:09 2017 -0400
bpf: Remove commented out debugging hack in test_align.
Reported-by: Alexander Alemayhu <alexander@alemayhu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 33c16bfd5bc00bbb9823d25af46c496966e0ac0d
Author: Chopra, Manish <Manish.Chopra@cavium.com>
Date: Thu May 11 07:12:48 2017 -0700
qlcnic: Update version to 5.3.66
Bumping up the version as couple of fixes added after 5.3.65
Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit f9c3fe2f43343be7972226c2e736e7a68e383dc1
Author: Chopra, Manish <Manish.Chopra@cavium.com>
Date: Thu May 11 07:12:47 2017 -0700
qlcnic: Fix link configuration with autoneg disabled
Currently driver returns error on speed configurations
for 83xx adapter's non XGBE ports, due to this link doesn't
come up on the ports using 1000Base-T as a connector with
autoneg disabled. This patch fixes this with initializing
appropriate port type based on queried module/connector
types from hardware before any speed/autoneg configuration.
Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit d86b5672b1adb98b4cdd6fbf0224bbfb03db6e2e
Author: Vitaly Kuznetsov <vkuznets@redhat.com>
Date: Thu May 11 13:58:06 2017 +0200
xen-netfront: avoid crashing on resume after a failure in talk_to_netback()
Unavoidable crashes in netfront_resume() and netback_changed() after a
previous fail in talk_to_netback() (e.g. when we fail to read MAC from
xenstore) were discovered. The failure path in talk_to_netback() does
unregister/free for netdev but we don't reset drvdata and we try accessing
it after resume.
Fix the bug by removing the whole xen device completely with
device_unregister(), this guarantees we won't have any calls into netfront
after a failure.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit cb395b2010879a8461aa1b1c37025769708c32cf
Author: Eric Dumazet <edumazet@google.com>
Date: Wed May 10 21:59:28 2017 -0700
net: sched: optimize class dumps
In commit 59cc1f61f09c ("net: sched: convert qdisc linked list to
hashtable") we missed the opportunity to considerably speed up
tc_dump_tclass_root() if a qdisc handle is provided by user.
Instead of iterating all the qdiscs, use qdisc_match_from_root()
to directly get the one we look for.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit b451e5d24ba6687c6f0e7319c727a709a1846c06
Author: Yuchung Cheng <ycheng@google.com>
Date: Wed May 10 17:01:27 2017 -0700
tcp: avoid fragmenting peculiar skbs in SACK
This patch fixes a bug in splitting an SKB during SACK
processing. Specifically if an skb contains multiple
packets and is only partially sacked in the higher sequences,
tcp_match_sack_to_skb() splits the skb and marks the second fragment
as SACKed.
The current code further attempts rounding up the first fragment
to MSS boundaries. But it misses a boundary condition when the
rounded-up fragment size (pkt_len) is exactly skb size. Spliting
such an skb is pointless and causses a kernel warning and aborts
the SACK processing. This patch universally checks such over-split
before calling tcp_fragment to prevent these unnecessary warnings.
Fixes: adb92db857ee ("tcp: Make SACK code to split only at mss boundaries")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit f6ba8d33cfbb46df569972e64dbb5bb7e929bfd9
Author: Eric Dumazet <edumazet@google.com>
Date: Thu May 11 15:24:41 2017 -0700
netem: fix skb_orphan_partial()
I should have known that lowering skb->truesize was dangerous :/
In case packets are not leaving the host via a standard Ethernet device,
but looped back to local sockets, bad things can happen, as reported
by Michael Madsen ( https://bugzilla.kernel.org/show_bug.cgi?id=195713 )
So instead of tweaking skb->truesize, lets change skb->destructor
and keep a reference on the owner socket via its sk_refcnt.
Fixes: f2f872f9272a ("netem: Introduce skb_orphan_partial() helper")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Michael Madsen <mkm@nabto.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit d67b9cd28c1d7f82c2e5e727731ea7c89b23a0a8
Author: Daniel Borkmann <daniel@iogearbox.net>
Date: Fri May 12 01:04:46 2017 +0200
xdp: refine xdp api with regards to generic xdp
While working on the iproute2 generic XDP frontend, I noticed that
as of right now it's possible to have native *and* generic XDP
programs loaded both at the same time for the case when a driver
supports native XDP.
The intended model for generic XDP from b5cdae3291f7 ("net: Generic
XDP") is, however, that only one out of the two can be present at
once which is also indicated as such in the XDP netlink dump part.
The main rationale for generic XDP is to ease accessibility (in
case a driver does not yet have XDP support) and to generically
provide a semantical model as an example for driver developers
wanting to add XDP support. The generic XDP option for an XDP
aware driver can still be useful for comparing and testing both
implementations.
However, it is not intended to have a second XDP processing stage
or layer with exactly the same functionality of the first native
stage. Only reason could be to have a partial fallback for future
XDP features that are not supported yet in the native implementation
and we probably also shouldn't strive for such fallback and instead
encourage native feature support in the first place. Given there's
currently no such fallback issue or use case, lets not go there yet
if we don't need to.
Therefore, change semantics for loading XDP and bail out if the
user tries to load a generic XDP program when a native one is
present and vice versa. Another alternative to bailing out would
be to handle the transition from one flavor to another gracefully,
but that would require to bring the device down, exchange both
types of programs, and bring it up again in order to avoid a tiny
window where a packet could hit both hooks. Given this complicates
the logic for just a debugging feature in the native case, I went
with the simpler variant.
For the dump, remove IFLA_XDP_FLAGS that was added with b5cdae3291f7
and reuse IFLA_XDP_ATTACHED for indicating the mode. Dumping all
or just a subset of flags that were used for loading the XDP prog
is suboptimal in the long run since not all flags are useful for
dumping and if we start to reuse the same flag definitions for
load and dump, then we'll waste bit space. What we really just
want is to dump the mode for now.
Current IFLA_XDP_ATTACHED semantics are: nothing was installed (0),
a program is running at the native driver layer (1). Thus, add a
mode that says that a program is running at generic XDP layer (2).
Applications will handle this fine in that older binaries will
just indicate that something is attached at XDP layer, effectively
this is similar to IFLA_XDP_FLAGS attr that we would have had
modulo the redundancy.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 0489df9a430e9607de8130a6bc4bf4c02f96eaf1
Author: Daniel Borkmann <daniel@iogearbox.net>
Date: Fri May 12 01:04:45 2017 +0200
xdp: add flag to enforce driver mode
After commit b5cdae3291f7 ("net: Generic XDP") we automatically fall
back to a generic XDP variant if the driver does not support native
XDP. Allow for an option where the user can specify that always the
native XDP variant should be selected and in case it's not supported
by a driver, just bail out.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 0a5539f66133a02b24f9cc43da5b84b7e6f3f436
Author: David S. Miller <davem@davemloft.net>
Date: Thu May 11 12:00:50 2017 -0700
bpf: Provide a linux/types.h override for bpf selftests.
We do not want to use the architecture's type.h header when
building BPF programs which are always 64-bit.
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 18b3ad90b64e9893297357608abddd26170730eb
Author: David S. Miller <davem@davemloft.net>
Date: Wed May 10 11:43:51 2017 -0700
bpf: Add verifier test case for alignment.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
commit 91045f5e5238a6d2ad21d41d7e86e2fa65f90d76
Author: David S. Miller <davem@davemloft.net>
Date: Wed May 10 11:42:48 2017 -0700
bpf: Add bpf_verify_program() to the library.
This allows a test case to load a BPF program and unconditionally
acquire the verifier log.
It also allows specification of the strict alignment flag.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
commit e07b98d9bffe410019dfcf62c3428d4a96c56a2c
Author: David S. Miller <davem@davemloft.net>
Date: Wed May 10 11:38:07 2017 -0700
bpf: Add strict alignment flag for BPF_PROG_LOAD.
Add a new field, "prog_flags", and an initial flag value
BPF_F_STRICT_ALIGNMENT.
When set, the verifier will enforce strict pointer alignment
regardless of the setting of CONFIG_EFFICIENT_UNALIGNED_ACCESS.
The verifier, in this mode, will also use a fixed value of "2" in
place of NET_IP_ALIGN.
This facilitates test cases that will exercise and validate this part
of the verifier even when run on architectures where alignment doesn't
matter.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
commit c5fc9692d101d1318b0f53f9f691cd88ac029317
Author: David S. Miller <davem@davemloft.net>
Date: Wed May 10 11:25:17 2017 -0700
bpf: Do per-instruction state dumping in verifier when log_level > 1.
If log_level > 1, do a state dump every instruction and emit it in
a more compact way (without a leading newline).
This will facilitate more sophisticated test cases which inspect the
verifier log for register state.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
commit d1174416747d790d750742d0514915deeed93acf
Author: David S. Miller <davem@davemloft.net>
Date: Wed May 10 11:22:52 2017 -0700
bpf: Track alignment of register values in the verifier.
Currently if we add only constant values to pointers we can fully
validate the alignment, and properly check if we need to reject the
program on !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS architectures.
However, once an unknown value is introduced we only allow byte sized
memory accesses which is too restrictive.
Add logic to track the known minimum alignment of register values,
and propagate this state into registers containing pointers.
The most common paradigm that makes use of this new logic is computing
the transport header using the IP header length field. For example:
struct ethhdr *ep = skb->data;
struct iphdr *iph = (struct iphdr *) (ep + 1);
struct tcphdr *th;
...
n = iph->ihl;
th = ((void *)iph + (n * 4));
port = th->dest;
The existing code will reject the load of th->dest because it cannot
validate that the alignment is at least 2 once "n * 4" is added the
the packet pointer.
In the new code, the register holding "n * 4" will have a reg->min_align
value of 4, because any value multiplied by 4 will be at least 4 byte
aligned. (actually, the eBPF code emitted by the compiler in this case
is most likely to use a shift left by 2, but the end result is identical)
At the critical addition:
th = ((void *)iph + (n * 4));
The register holding 'th' will start with reg->off value of 14. The
pointer addition will transform that reg into something that looks like:
reg->aux_off = 14
reg->aux_off_align = 4
Next, the verifier will look at the th->dest load, and it will see
a load offset of 2, and first check:
if (reg->aux_off_align % size)
which will pass because aux_off_align is 4. reg_off will be computed:
reg_off = reg->off;
...
reg_off += reg->aux_off;
plus we have off==2, and it will thus check:
if ((NET_IP_ALIGN + reg_off + off) % size != 0)
which evaluates to:
if ((NET_IP_ALIGN + 14 + 2) % size != 0)
On strict alignment architectures, NET_IP_ALIGN is 2, thus:
if ((2 + 14 + 2) % size != 0)
which passes.
These pointer transformations and checks work regardless of whether
the constant offset or the variable with known alignment is added
first to the pointer register.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
commit d8b54110ee944de522ccd3531191f39986ec20f9
Author: Daniel Borkmann <daniel@iogearbox.net>
Date: Thu May 11 01:53:15 2017 +0200
bpf, arm64: fix faulty emission of map access in tail calls
Shubham was recently asking on netdev why in arm64 JIT we don't multiply
the index for accessing the tail call map by 8. That led me into testing
out arm64 JIT wrt tail calls and it turned out I got a NULL pointer
dereference on the tail call.
The buggy access is at:
prog = array->ptrs[index];
if (prog == NULL)
goto out;
[...]
00000060: d2800e0a mov x10, #0x70 // #112
00000064: f86a682a ldr x10, [x1,x10]
00000068: f862694b ldr x11, [x10,x2]
0000006c: b40000ab cbz x11, 0x00000080
[...]
The code triggering the crash is f862694b. x1 at the time contains the
address of the bpf array, x10 offsetof(struct bpf_array, ptrs). Meaning,
above we load the pointer to the program at map slot 0 into x10. x10
can then be NULL if the slot is not occupied, which we later on try to
access with a user given offset in x2 that is the map index.
Fix this by emitting the following instead:
[...]
00000060: d2800e0a mov x10, #0x70 // #112
00000064: 8b0a002a add x10, x1, x10
00000068: d37df04b lsl x11, x2, #3
0000006c: f86b694b ldr x11, [x10,x11]
00000070: b40000ab cbz x11, 0x00000084
[...]
This basically adds the offset to ptrs to the base address of the bpf
array we got and we later on access the map with an index * 8 offset
relative to that. The tail call map itself is basically one large area
with meta data at the head followed by the array of prog pointers.
This makes tail calls working again, tested on Cavium ThunderX ARMv8.
Fixes: ddb55992b04d ("arm64: bpf: implement bpf_tail_call() helper")
Reported-by: Shubham Bansal <illusionist.neo@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 5b6cb43b4d625b04a4049d727a116edbfe5cf0f4
Author: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Date: Wed May 10 10:28:05 2017 -0700
net: ethernet: ti: netcp_core: return error while dma channel open issue
Fix error path while dma open channel issue. Also, no need to check output
on NULL if it's never returned.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit ebccc7397e4a49ff64c8f44a54895de9d32fe742
Author: Ursula Braun <ubraun@linux.vnet.ibm.com>
Date: Wed May 10 19:07:54 2017 +0200
s390/qeth: add missing hash table initializations
commit 5f78e29ceebf ("qeth: optimize IP handling in rx_mode callback")
added new hash tables, but missed to initialize them.
Fixes: 5f78e29ceebf ("qeth: optimize IP handling in rx_mode callback")
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reviewed-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 25e2c341e7818a394da9abc403716278ee646014
Author: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Date: Wed May 10 19:07:53 2017 +0200
s390/qeth: avoid null pointer dereference on OSN
Access card->dev only after checking whether's its valid.
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 2d2ebb3ed0c6acfb014f98e427298673a5d07b82
Author: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Date: Wed May 10 19:07:52 2017 +0200
s390/qeth: unbreak OSM and OSN support
commit b4d72c08b358 ("qeth: bridgeport support - basic control")
broke the support for OSM and OSN devices as follows:
As OSM and OSN are L2 only, qeth_core_probe_device() does an early
setup by loading the l2 discipline and calling qeth_l2_probe_device().
In this context, adding the l2-specific bridgeport sysfs attributes
via qeth_l2_create_device_attributes() hits a BUG_ON in fs/sysfs/group.c,
since the basic sysfs infrastructure for the device hasn't been
established yet.
Note that OSN actually has its own unique sysfs attributes
(qeth_osn_devtype), so the additional attributes shouldn't be created
at all.
For OSM, add a new qeth_l2_devtype that contains all the common
and l2-specific sysfs attributes.
When qeth_core_probe_device() does early setup for OSM or OSN, assign
the corresponding devtype so that the ccwgroup probe code creates the
full set of sysfs attributes.
This allows us to skip qeth_l2_create_device_attributes() in case
of an early setup.
Any device that can't do early setup will initially have only the
generic sysfs attributes, and when it's probed later
qeth_l2_probe_device() adds the l2-specific attributes.
If an early-setup device is removed (by calling ccwgroup_ungroup()),
device_unregister() will - using the devtype - delete the
l2-specific attributes before qeth_l2_remove_device() is called.
So make sure to not remove them twice.
What complicates the issue is that qeth_l2_probe_device() and
qeth_l2_remove_device() is also called on a device when its
layer2 attribute changes (ie. its layer mode is switched).
For early-setup devices this wouldn't work properly - we wouldn't
remove the l2-specific attributes when switching to L3.
But switching the layer mode doesn't actually make any sense;
we already decided that the device can only operate in L2!
So just refuse to switch the layer mode on such devices. Note that
OSN doesn't have a layer2 attribute, so we only need to special-case
OSM.
Based on an initial patch by Ursula Braun.
Fixes: b4d72c08b358 ("qeth: bridgeport support - basic control")
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 9111e7880ccf419548c7b0887df020b08eadb075
Author: Ursula Braun <ubraun@linux.vnet.ibm.com>
Date: Wed May 10 19:07:51 2017 +0200
s390/qeth: handle sysfs error during initialization
When setting up the device from within the layer discipline's
probe routine, creating the layer-specific sysfs attributes can fail.
Report this error back to the caller, and handle it by
releasing the layer discipline.
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
[jwi: updated commit msg, moved an OSN change to a subsequent patch]
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit b60161668199ac62011c024adc9e66713b9554e7
Author: Jon Mason <jon.mason@broadcom.com>
Date: Wed May 10 11:20:27 2017 -0400
mdio: mux: Correct mdio_mux_init error path issues
There is a potential unnecessary refcount decrement on error path of
put_device(&pb->mii_bus->dev), as it is possible to avoid the
of_mdio_find_bus() call if mux_bus is specified by the calling function.
The same put_device() is not called in the error path if the
devm_kzalloc of pb fails. This caused the variable used in the
put_device() to be changed, as the pb pointer was obviously not set up.
There is an unnecessary of_node_get() on child_bus_node if the
of_mdiobus_register() is successful, as the
for_each_available_child_of_node() automatically increments this.
Thus the refcount on this node will always be +1 more than it should be.
There is no of_node_put() on child_bus_node if the of_mdiobus_register()
call fails.
Finally, it is lacking devm_kfree() of pb in the error path. While this
might not be technically necessary, it was present in other parts of the
function. So, I am adding it where necessary to make it uniform.
Signed-off-by: Jon Mason <jon.mason@broadcom.com>
Fixes: f20e6657a875 ("mdio: mux: Enhanced MDIO mux framework for integrated multiplexers")
Fixes: 0ca2997d1452 ("netdev/of/phy: Add MDIO bus multiplexer support.")
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 83eaddab4378db256d00d295bda6ca997cd13a52
Author: WANG Cong <xiyou.wangcong@gmail.com>
Date: Tue May 9 16:59:54 2017 -0700
ipv6/dccp: do not inherit ipv6_mc_list from parent
Like commit 657831ffc38e ("dccp/tcp: do not inherit mc_list from parent")
we should clear ipv6_mc_list etc. for IPv6 sockets too.
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 0fe20fafd1791f993806d417048213ec57b81045
Author: Colin Ian King <colin.king@canonical.com>
Date: Tue May 9 17:19:42 2017 +0100
netxen_nic: set rcode to the return status from the call to netxen_issue_cmd
Currently rcode is being initialized to NX_RCODE_SUCCESS and later it
is checked to see if it is not NX_RCODE_SUCCESS which is never true. It
appears that there is an unintentional missing assignment of rcode from
the return of the call to netxen_issue_cmd() that was dropped in
an earlier fix, so add it in.
Detected by CoverityScan, CID#401900 ("Logically dead code")
Fixes: 2dcd5d95ad6b2 ("netxen_nic: fix cdrp race condition")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 8d66c30b12ed3cb533696dea8b9a9eadd5da426a
Author: Stefan Wahren <stefan.wahren@i2se.com>
Date: Tue May 9 15:40:38 2017 +0200
net: qca_spi: Fix alignment issues in rx path
The qca_spi driver causes alignment issues on ARM devices.
So fix this by using netdev_alloc_skb_ip_align().
Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
Fixes: 291ab06ecf67 ("net: qualcomm: new Ethernet over SPI driver for QCA7000")
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 1a4a5bf52a4adb477adb075e5afce925824ad132
Author: Gao Feng <gfree.wind@vip.163.com>
Date: Tue May 9 18:27:33 2017 +0800
driver: vrf: Fix one possible use-after-free issue
The current codes only deal with the case that the skb is dropped, it
may meet one use-after-free issue when NF_HOOK returns 0 that means
the skb is stolen by one netfilter rule or hook.
When one netfilter rule or hook stoles the skb and return NF_STOLEN,
it means the skb is taken by the rule, and other modules should not
touch this skb ever. Maybe the skb is queued or freed directly by the
rule.
Now uses the nf_hook instead of NF_HOOK to get the result of netfilter,
and check the return value of nf_hook. Only when its value equals 1, it
means the skb could go ahead. Or reset the skb as NULL.
BTW, because vrf_rcv_finish is empty function, so needn't invoke it
even though nf_hook returns 1. But we need to modify vrf_rcv_finish
to deal with the NF_STOLEN case.
There are two cases when skb is stolen.
1. The skb is stolen and freed directly.
There is nothing we need to do, and vrf_rcv_finish isn't invoked.
2. The skb is queued and reinjected again.
The vrf_rcv_finish would be invoked as okfn, so need to free the
skb in it.
Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit efc0c21c9ea786d6f019d7df7b4e3932f3578d90
Author: Elena Reshetova <elena.reshetova@intel.com>
Date: Thu Mar 2 12:23:45 2017 +0100
s390: convert debug_info.ref_count from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
commit de1892b887eeb85ce458a93979c2108e6f329618
Author: Steve French <smfrench@gmail.com>
Date: Thu May 4 07:54:04 2017 -0500
Don't delay freeing mids when blocked on slow socket write of request
When processing responses, and in particular freeing mids (DeleteMidQEntry),
which is very important since it also frees the associated buffers (cifs_buf_release),
we can block a long time if (writes to) socket is slow due to low memory or networking
issues.
We can block in send (smb request) waiting for memory, and be blocked in processing
responess (which could free memory if we let it) - since they both grab the
server->srv_mutex.
In practice, in the DeleteMidQEntry case - there is no reason we need to
grab the srv_mutex so remove these around DeleteMidQEntry, and it allows
us to free memory faster.
Signed-off-by: Steve French <steve.french@primarydata.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
commit 560d388950ceda5e7c7cdef7f3d9a8ff297bbf9d
Author: Rabin Vincent <rabinv@axis.com>
Date: Wed May 3 17:17:21 2017 +0200
CIFS: silence lockdep splat in cifs_relock_file()
cifs_relock_file() can perform a down_write() on the inode's lock_sem even
though it was already performed in cifs_strict_readv(). Lockdep complains
about this. AFAICS, there is no problem here, and lockdep just needs to be
told that this nesting is OK.
=============================================
[ INFO: possible recursive locking detected ]
4.11.0+ #20 Not tainted
---------------------------------------------
cat/701 is trying to acquire lock:
(&cifsi->lock_sem){++++.+}, at: cifs_reopen_file+0x7a7/0xc00
but task is already holding lock:
(&cifsi->lock_sem){++++.+}, at: cifs_strict_readv+0x177/0x310
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&cifsi->lock_sem);
lock(&cifsi->lock_sem);
*** DEADLOCK ***
May be due to missing lock nesting notation
1 lock held by cat/701:
#0: (&cifsi->lock_sem){++++.+}, at: cifs_strict_readv+0x177/0x310
stack backtrace:
CPU: 0 PID: 701 Comm: cat Not tainted 4.11.0+ #20
Call Trace:
dump_stack+0x85/0xc2
__lock_acquire+0x17dd/0x2260
? trace_hardirqs_on_thunk+0x1a/0x1c
? preempt_schedule_irq+0x6b/0x80
lock_acquire+0xcc/0x260
? lock_acquire+0xcc/0x260
? cifs_reopen_file+0x7a7/0xc00
down_read+0x2d/0x70
? cifs_reopen_file+0x7a7/0xc00
cifs_reopen_file+0x7a7/0xc00
? printk+0x43/0x4b
cifs_readpage_worker+0x327/0x8a0
cifs_readpage+0x8c/0x2a0
generic_file_read_iter+0x692/0xd00
cifs_strict_readv+0x29f/0x310
generic_file_splice_read+0x11c/0x1c0
do_splice_to+0xa5/0xc0
splice_direct_to_actor+0xfa/0x350
? generic_pipe_buf_nosteal+0x10/0x10
do_splice_direct+0xb5/0xe0
do_sendfile+0x278/0x3a0
SyS_sendfile64+0xc4/0xe0
entry_SYSCALL_64_fastpath+0x1f/0xbe
Signed-off-by: Rabin Vincent <rabinv@axis.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
commit d04a4c76f71dd5335f8e499b59617382d84e2b8d
Author: Heiko Carstens <heiko.carstens@de.ibm.com>
Date: Thu May 4 09:42:22 2017 +0200
s390: move _text symbol to address higher than zero
The perf tool assumes that kernel symbols are never present at address
zero. In fact it assumes if functions that map symbols to addresses
return zero, that the symbol was not found.
Given that s390's _text symbol historically is located at address zero
this yields at least a couple of false errors and warnings in one of
perf's test cases about not present symbols ("perf test 1").
To fix this simply move the _text symbol to address 0x200, just behind
the initial psw and channel program located at the beginning of the
kernel image. This is now hard coded within the linker script.
I tried a nicer solution which moves the initial psw and channel
program into an own section. However that would move the symbols
within the "real" head.text section to different addresses, since the
".org" statements within head.S are relative to the head.text
section. If there is a new section in front, everything else will be
moved…
damentz
referenced
this pull request
in zen-kernel/zen-kernel
May 31, 2017
[ Upstream commit d8b5411 ] Shubham was recently asking on netdev why in arm64 JIT we don't multiply the index for accessing the tail call map by 8. That led me into testing out arm64 JIT wrt tail calls and it turned out I got a NULL pointer dereference on the tail call. The buggy access is at: prog = array->ptrs[index]; if (prog == NULL) goto out; [...] 00000060: d2800e0a mov x10, #0x70 // #112 00000064: f86a682a ldr x10, [x1,x10] 00000068: f862694b ldr x11, [x10,x2] 0000006c: b40000ab cbz x11, 0x00000080 [...] The code triggering the crash is f862694b. x1 at the time contains the address of the bpf array, x10 offsetof(struct bpf_array, ptrs). Meaning, above we load the pointer to the program at map slot 0 into x10. x10 can then be NULL if the slot is not occupied, which we later on try to access with a user given offset in x2 that is the map index. Fix this by emitting the following instead: [...] 00000060: d2800e0a mov x10, #0x70 // #112 00000064: 8b0a002a add x10, x1, x10 00000068: d37df04b lsl x11, x2, #3 0000006c: f86b694b ldr x11, [x10,x11] 00000070: b40000ab cbz x11, 0x00000084 [...] This basically adds the offset to ptrs to the base address of the bpf array we got and we later on access the map with an index * 8 offset relative to that. The tail call map itself is basically one large area with meta data at the head followed by the array of prog pointers. This makes tail calls working again, tested on Cavium ThunderX ARMv8. Fixes: ddb5599 ("arm64: bpf: implement bpf_tail_call() helper") Reported-by: Shubham Bansal <illusionist.neo@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Noltari
pushed a commit
to Noltari/linux
that referenced
this pull request
Jun 7, 2017
[ Upstream commit d8b5411 ] Shubham was recently asking on netdev why in arm64 JIT we don't multiply the index for accessing the tail call map by 8. That led me into testing out arm64 JIT wrt tail calls and it turned out I got a NULL pointer dereference on the tail call. The buggy access is at: prog = array->ptrs[index]; if (prog == NULL) goto out; [...] 00000060: d2800e0a mov x10, #0x70 // torvalds#112 00000064: f86a682a ldr x10, [x1,x10] 00000068: f862694b ldr x11, [x10,x2] 0000006c: b40000ab cbz x11, 0x00000080 [...] The code triggering the crash is f862694b. x1 at the time contains the address of the bpf array, x10 offsetof(struct bpf_array, ptrs). Meaning, above we load the pointer to the program at map slot 0 into x10. x10 can then be NULL if the slot is not occupied, which we later on try to access with a user given offset in x2 that is the map index. Fix this by emitting the following instead: [...] 00000060: d2800e0a mov x10, #0x70 // torvalds#112 00000064: 8b0a002a add x10, x1, x10 00000068: d37df04b lsl x11, x2, #3 0000006c: f86b694b ldr x11, [x10,x11] 00000070: b40000ab cbz x11, 0x00000084 [...] This basically adds the offset to ptrs to the base address of the bpf array we got and we later on access the map with an index * 8 offset relative to that. The tail call map itself is basically one large area with meta data at the head followed by the array of prog pointers. This makes tail calls working again, tested on Cavium ThunderX ARMv8. Fixes: ddb5599 ("arm64: bpf: implement bpf_tail_call() helper") Reported-by: Shubham Bansal <illusionist.neo@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
koenkooi
pushed a commit
to koenkooi/linux
that referenced
this pull request
Jun 7, 2017
[ Upstream commit d8b5411 ] Shubham was recently asking on netdev why in arm64 JIT we don't multiply the index for accessing the tail call map by 8. That led me into testing out arm64 JIT wrt tail calls and it turned out I got a NULL pointer dereference on the tail call. The buggy access is at: prog = array->ptrs[index]; if (prog == NULL) goto out; [...] 00000060: d2800e0a mov x10, #0x70 // torvalds#112 00000064: f86a682a ldr x10, [x1,x10] 00000068: f862694b ldr x11, [x10,x2] 0000006c: b40000ab cbz x11, 0x00000080 [...] The code triggering the crash is f862694b. x1 at the time contains the address of the bpf array, x10 offsetof(struct bpf_array, ptrs). Meaning, above we load the pointer to the program at map slot 0 into x10. x10 can then be NULL if the slot is not occupied, which we later on try to access with a user given offset in x2 that is the map index. Fix this by emitting the following instead: [...] 00000060: d2800e0a mov x10, #0x70 // torvalds#112 00000064: 8b0a002a add x10, x1, x10 00000068: d37df04b lsl x11, x2, #3 0000006c: f86b694b ldr x11, [x10,x11] 00000070: b40000ab cbz x11, 0x00000084 [...] This basically adds the offset to ptrs to the base address of the bpf array we got and we later on access the map with an index * 8 offset relative to that. The tail call map itself is basically one large area with meta data at the head followed by the array of prog pointers. This makes tail calls working again, tested on Cavium ThunderX ARMv8. Fixes: ddb5599 ("arm64: bpf: implement bpf_tail_call() helper") Reported-by: Shubham Bansal <illusionist.neo@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
fengguang
pushed a commit
to 0day-ci/linux
that referenced
this pull request
Jun 12, 2017
Wire up the existing arm64 support for SMBIOS tables (aka DMI) for ARM as well, by moving the arm64 init code to drivers/firmware/efi/arm-runtime.c (which is shared between ARM and arm64), and adding a asm/dmi.h header to ARM that defines the mapping routines for the firmware tables. This allows userspace to access these tables to discover system information exposed by the firmware. It also sets the hardware name used in crash dumps, e.g.: Unable to handle kernel NULL pointer dereference at virtual address 00000000 pgd = ed3c0000 [00000000] *pgd=bf1f3835 Internal error: Oops: 817 [#1] SMP THUMB2 Modules linked in: CPU: 0 PID: 759 Comm: bash Not tainted 4.10.0-09601-g0e8f38792120-dirty torvalds#112 Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 ^^^ NOTE: This does *NOT* enable or encourage the use of DMI quirks, i.e., the the practice of identifying the platform via DMI to decide whether certain workarounds for buggy hardware and/or firmware need to be enabled. This would require the DMI subsystem to be enabled much earlier than we do on ARM, which is non-trivial. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/20170602135207.21708-14-ard.biesheuvel@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
wzyy2
pushed a commit
to wzyy2/linux
that referenced
this pull request
Jun 19, 2017
(1) use cpu id from bl31 delivers; (2) sp_el0 should point to kernel address in EL1 mode. On ARM64, kernel uses sp_el0 to store current_thread_info(), we see a problem: when fiq occurs, cpu is EL1 mode but sp_el0 point to userspace address. At this moment, if we read 'current_thread_info()->cpu' or other, it leads an error. We find above situation happens when save/restore cpu context between system mode and user mode under heavy load. Like 'ret_fast_syscall()', kernel restore context of user mode, but fiq occurs before the instruction 'eret', so this causes the above situation. Assembly code: ffffff80080826c8 <ret_fast_syscall>: ...skipping... ffffff80080826fc: d503201f nop ffffff8008082700: d5384100 mrs x0, sp_el0 ffffff8008082704: f9400c00 ldr x0, [x0,torvalds#24] ffffff8008082708: d5182000 msr ttbr0_el1, x0 ffffff800808270c: d5033fdf isb ffffff8008082710: f9407ff7 ldr x23, [sp,torvalds#248] ffffff8008082714: d5184117 msr sp_el0, x23 ffffff8008082718: d503201f nop ffffff800808271c: d503201f nop ffffff8008082720: d5184035 msr elr_el1, x21 ffffff8008082724: d5184016 msr spsr_el1, x22 ffffff8008082728: a94007e0 ldp x0, x1, [sp] ffffff800808272c: a9410fe2 ldp x2, x3, [sp,torvalds#16] ffffff8008082730: a94217e4 ldp x4, x5, [sp,torvalds#32] ffffff8008082734: a9431fe6 ldp x6, x7, [sp,torvalds#48] ffffff8008082738: a94427e8 ldp x8, x9, [sp,torvalds#64] ffffff800808273c: a9452fea ldp x10, x11, [sp,torvalds#80] ffffff8008082740: a94637ec ldp x12, x13, [sp,torvalds#96] ffffff8008082744: a9473fee ldp x14, x15, [sp,torvalds#112] ffffff8008082748: a94847f0 ldp x16, x17, [sp,torvalds#128] ffffff800808274c: a9494ff2 ldp x18, x19, [sp,torvalds#144] ffffff8008082750: a94a57f4 ldp x20, x21, [sp,torvalds#160] ffffff8008082754: a94b5ff6 ldp x22, x23, [sp,torvalds#176] ffffff8008082758: a94c67f8 ldp x24, x25, [sp,torvalds#192] ffffff800808275c: a94d6ffa ldp x26, x27, [sp,torvalds#208] ffffff8008082760: a94e77fc ldp x28, x29, [sp,torvalds#224] ffffff8008082764: f9407bfe ldr x30, [sp,torvalds#240] ffffff8008082768: 9104c3ff add sp, sp, #0x130 ffffff800808276c: d69f03e0 eret Change-Id: I071e899f8a407764e166ca0403199c9d87d6ce78 Signed-off-by: chenjh <chenjh@rock-chips.com>
dcui
pushed a commit
to dcui/linux
that referenced
this pull request
Jul 26, 2017
BugLink: http://bugs.launchpad.net/bugs/1696723 [ Upstream commit d8b5411 ] Shubham was recently asking on netdev why in arm64 JIT we don't multiply the index for accessing the tail call map by 8. That led me into testing out arm64 JIT wrt tail calls and it turned out I got a NULL pointer dereference on the tail call. The buggy access is at: prog = array->ptrs[index]; if (prog == NULL) goto out; [...] 00000060: d2800e0a mov x10, #0x70 // torvalds#112 00000064: f86a682a ldr x10, [x1,x10] 00000068: f862694b ldr x11, [x10,x2] 0000006c: b40000ab cbz x11, 0x00000080 [...] The code triggering the crash is f862694b. x1 at the time contains the address of the bpf array, x10 offsetof(struct bpf_array, ptrs). Meaning, above we load the pointer to the program at map slot 0 into x10. x10 can then be NULL if the slot is not occupied, which we later on try to access with a user given offset in x2 that is the map index. Fix this by emitting the following instead: [...] 00000060: d2800e0a mov x10, #0x70 // torvalds#112 00000064: 8b0a002a add x10, x1, x10 00000068: d37df04b lsl x11, x2, #3 0000006c: f86b694b ldr x11, [x10,x11] 00000070: b40000ab cbz x11, 0x00000084 [...] This basically adds the offset to ptrs to the base address of the bpf array we got and we later on access the map with an index * 8 offset relative to that. The tail call map itself is basically one large area with meta data at the head followed by the array of prog pointers. This makes tail calls working again, tested on Cavium ThunderX ARMv8. Fixes: ddb5599 ("arm64: bpf: implement bpf_tail_call() helper") Reported-by: Shubham Bansal <illusionist.neo@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
iaguis
pushed a commit
to kinvolk/linux
that referenced
this pull request
Feb 6, 2018
Rebase patches onto 4.13.16
commodo
pushed a commit
to commodo/linux
that referenced
this pull request
Jun 19, 2018
…lise-support Add adrv9009 talise support
fengguang
pushed a commit
to 0day-ci/linux
that referenced
this pull request
Jun 22, 2018
On fast hosts or malicious bots, we trigger a DCCP_BUG() which seems excessive. syzbot reported : BUG: delta (-6195) <= 0 at net/dccp/ccids/ccid3.c:628/ccid3_hc_rx_send_feedback() CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.18.0-rc1+ torvalds#112 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 ccid3_hc_rx_send_feedback net/dccp/ccids/ccid3.c:628 [inline] ccid3_hc_rx_packet_recv.cold.16+0x38/0x71 net/dccp/ccids/ccid3.c:793 ccid_hc_rx_packet_recv net/dccp/ccid.h:185 [inline] dccp_deliver_input_to_ccids+0xf0/0x280 net/dccp/input.c:180 dccp_rcv_established+0x87/0xb0 net/dccp/input.c:378 dccp_v4_do_rcv+0x153/0x180 net/dccp/ipv4.c:654 sk_backlog_rcv include/net/sock.h:914 [inline] __sk_receive_skb+0x3ba/0xd80 net/core/sock.c:517 dccp_v4_rcv+0x10f9/0x1f58 net/dccp/ipv4.c:875 ip_local_deliver_finish+0x2eb/0xda0 net/ipv4/ip_input.c:215 NF_HOOK include/linux/netfilter.h:287 [inline] ip_local_deliver+0x1e9/0x750 net/ipv4/ip_input.c:256 dst_input include/net/dst.h:450 [inline] ip_rcv_finish+0x823/0x2220 net/ipv4/ip_input.c:396 NF_HOOK include/linux/netfilter.h:287 [inline] ip_rcv+0xa18/0x1284 net/ipv4/ip_input.c:492 __netif_receive_skb_core+0x2488/0x3680 net/core/dev.c:4628 __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:4693 process_backlog+0x219/0x760 net/core/dev.c:5373 napi_poll net/core/dev.c:5771 [inline] net_rx_action+0x7da/0x1980 net/core/dev.c:5837 __do_softirq+0x2e8/0xb17 kernel/softirq.c:284 run_ksoftirqd+0x86/0x100 kernel/softirq.c:645 smpboot_thread_fn+0x417/0x870 kernel/smpboot.c:164 kthread+0x345/0x410 kernel/kthread.c:240 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk> Cc: dccp@vger.kernel.org
fengguang
pushed a commit
to 0day-ci/linux
that referenced
this pull request
Jun 23, 2018
On fast hosts or malicious bots, we trigger a DCCP_BUG() which seems excessive. syzbot reported : BUG: delta (-6195) <= 0 at net/dccp/ccids/ccid3.c:628/ccid3_hc_rx_send_feedback() CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.18.0-rc1+ torvalds#112 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 ccid3_hc_rx_send_feedback net/dccp/ccids/ccid3.c:628 [inline] ccid3_hc_rx_packet_recv.cold.16+0x38/0x71 net/dccp/ccids/ccid3.c:793 ccid_hc_rx_packet_recv net/dccp/ccid.h:185 [inline] dccp_deliver_input_to_ccids+0xf0/0x280 net/dccp/input.c:180 dccp_rcv_established+0x87/0xb0 net/dccp/input.c:378 dccp_v4_do_rcv+0x153/0x180 net/dccp/ipv4.c:654 sk_backlog_rcv include/net/sock.h:914 [inline] __sk_receive_skb+0x3ba/0xd80 net/core/sock.c:517 dccp_v4_rcv+0x10f9/0x1f58 net/dccp/ipv4.c:875 ip_local_deliver_finish+0x2eb/0xda0 net/ipv4/ip_input.c:215 NF_HOOK include/linux/netfilter.h:287 [inline] ip_local_deliver+0x1e9/0x750 net/ipv4/ip_input.c:256 dst_input include/net/dst.h:450 [inline] ip_rcv_finish+0x823/0x2220 net/ipv4/ip_input.c:396 NF_HOOK include/linux/netfilter.h:287 [inline] ip_rcv+0xa18/0x1284 net/ipv4/ip_input.c:492 __netif_receive_skb_core+0x2488/0x3680 net/core/dev.c:4628 __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:4693 process_backlog+0x219/0x760 net/core/dev.c:5373 napi_poll net/core/dev.c:5771 [inline] net_rx_action+0x7da/0x1980 net/core/dev.c:5837 __do_softirq+0x2e8/0xb17 kernel/softirq.c:284 run_ksoftirqd+0x86/0x100 kernel/softirq.c:645 smpboot_thread_fn+0x417/0x870 kernel/smpboot.c:164 kthread+0x345/0x410 kernel/kthread.c:240 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk> Cc: dccp@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
Noltari
pushed a commit
to Noltari/linux
that referenced
this pull request
Jul 22, 2018
[ Upstream commit 74174fe ] On fast hosts or malicious bots, we trigger a DCCP_BUG() which seems excessive. syzbot reported : BUG: delta (-6195) <= 0 at net/dccp/ccids/ccid3.c:628/ccid3_hc_rx_send_feedback() CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.18.0-rc1+ torvalds#112 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 ccid3_hc_rx_send_feedback net/dccp/ccids/ccid3.c:628 [inline] ccid3_hc_rx_packet_recv.cold.16+0x38/0x71 net/dccp/ccids/ccid3.c:793 ccid_hc_rx_packet_recv net/dccp/ccid.h:185 [inline] dccp_deliver_input_to_ccids+0xf0/0x280 net/dccp/input.c:180 dccp_rcv_established+0x87/0xb0 net/dccp/input.c:378 dccp_v4_do_rcv+0x153/0x180 net/dccp/ipv4.c:654 sk_backlog_rcv include/net/sock.h:914 [inline] __sk_receive_skb+0x3ba/0xd80 net/core/sock.c:517 dccp_v4_rcv+0x10f9/0x1f58 net/dccp/ipv4.c:875 ip_local_deliver_finish+0x2eb/0xda0 net/ipv4/ip_input.c:215 NF_HOOK include/linux/netfilter.h:287 [inline] ip_local_deliver+0x1e9/0x750 net/ipv4/ip_input.c:256 dst_input include/net/dst.h:450 [inline] ip_rcv_finish+0x823/0x2220 net/ipv4/ip_input.c:396 NF_HOOK include/linux/netfilter.h:287 [inline] ip_rcv+0xa18/0x1284 net/ipv4/ip_input.c:492 __netif_receive_skb_core+0x2488/0x3680 net/core/dev.c:4628 __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:4693 process_backlog+0x219/0x760 net/core/dev.c:5373 napi_poll net/core/dev.c:5771 [inline] net_rx_action+0x7da/0x1980 net/core/dev.c:5837 __do_softirq+0x2e8/0xb17 kernel/softirq.c:284 run_ksoftirqd+0x86/0x100 kernel/softirq.c:645 smpboot_thread_fn+0x417/0x870 kernel/smpboot.c:164 kthread+0x345/0x410 kernel/kthread.c:240 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk> Cc: dccp@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
staging-kernelci-org
pushed a commit
to kernelci/linux
that referenced
this pull request
Sep 16, 2022
WARNING: line length of 108 exceeds 100 columns torvalds#97: FILE: tools/testing/selftests/vm/mremap_test.c:136: + char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); WARNING: Missing a blank line after declarations torvalds#98: FILE: tools/testing/selftests/vm/mremap_test.c:137: + char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + munmap(start + page_size, page_size); ERROR: space required before the open parenthesis '(' torvalds#107: FILE: tools/testing/selftests/vm/mremap_test.c:146: + while(getline(&line, &len, fp) != -1) { ERROR: space required after that ',' (ctx:VxV) torvalds#108: FILE: tools/testing/selftests/vm/mremap_test.c:147: + char *first = strtok(line,"- "); ^ ERROR: space required after that ',' (ctx:VxV) torvalds#110: FILE: tools/testing/selftests/vm/mremap_test.c:149: + char *second = strtok(NULL,"- "); ^ WARNING: Missing a blank line after declarations torvalds#112: FILE: tools/testing/selftests/vm/mremap_test.c:151: + void *second_val = (void *) strtol(second, NULL, 16); + if (first_val == start && second_val == start + 3 * page_size) { total: 3 errors, 3 warnings, 113 lines checked NOTE: For some of the reported defects, checkpatch may be able to mechanically convert to the typical style using --fix or --fix-inplace. ./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review. NOTE: If any of the errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. Please run checkpatch prior to sending patches Cc: Jakub Matěna <matenajakub@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp
pushed a commit
to intel-lab-lkp/linux
that referenced
this pull request
Sep 16, 2022
WARNING: line length of 108 exceeds 100 columns torvalds#97: FILE: tools/testing/selftests/vm/mremap_test.c:136: + char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); WARNING: Missing a blank line after declarations torvalds#98: FILE: tools/testing/selftests/vm/mremap_test.c:137: + char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + munmap(start + page_size, page_size); ERROR: space required before the open parenthesis '(' torvalds#107: FILE: tools/testing/selftests/vm/mremap_test.c:146: + while(getline(&line, &len, fp) != -1) { ERROR: space required after that ',' (ctx:VxV) torvalds#108: FILE: tools/testing/selftests/vm/mremap_test.c:147: + char *first = strtok(line,"- "); ^ ERROR: space required after that ',' (ctx:VxV) torvalds#110: FILE: tools/testing/selftests/vm/mremap_test.c:149: + char *second = strtok(NULL,"- "); ^ WARNING: Missing a blank line after declarations torvalds#112: FILE: tools/testing/selftests/vm/mremap_test.c:151: + void *second_val = (void *) strtol(second, NULL, 16); + if (first_val == start && second_val == start + 3 * page_size) { total: 3 errors, 3 warnings, 113 lines checked NOTE: For some of the reported defects, checkpatch may be able to mechanically convert to the typical style using --fix or --fix-inplace. ./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review. NOTE: If any of the errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. Please run checkpatch prior to sending patches Cc: Jakub Matěna <matenajakub@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp
pushed a commit
to intel-lab-lkp/linux
that referenced
this pull request
Sep 19, 2022
WARNING: line length of 108 exceeds 100 columns torvalds#97: FILE: tools/testing/selftests/vm/mremap_test.c:136: + char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); WARNING: Missing a blank line after declarations torvalds#98: FILE: tools/testing/selftests/vm/mremap_test.c:137: + char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + munmap(start + page_size, page_size); ERROR: space required before the open parenthesis '(' torvalds#107: FILE: tools/testing/selftests/vm/mremap_test.c:146: + while(getline(&line, &len, fp) != -1) { ERROR: space required after that ',' (ctx:VxV) torvalds#108: FILE: tools/testing/selftests/vm/mremap_test.c:147: + char *first = strtok(line,"- "); ^ ERROR: space required after that ',' (ctx:VxV) torvalds#110: FILE: tools/testing/selftests/vm/mremap_test.c:149: + char *second = strtok(NULL,"- "); ^ WARNING: Missing a blank line after declarations torvalds#112: FILE: tools/testing/selftests/vm/mremap_test.c:151: + void *second_val = (void *) strtol(second, NULL, 16); + if (first_val == start && second_val == start + 3 * page_size) { total: 3 errors, 3 warnings, 113 lines checked NOTE: For some of the reported defects, checkpatch may be able to mechanically convert to the typical style using --fix or --fix-inplace. ./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review. NOTE: If any of the errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. Please run checkpatch prior to sending patches Cc: Jakub Matěna <matenajakub@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp
pushed a commit
to intel-lab-lkp/linux
that referenced
this pull request
Sep 22, 2022
WARNING: line length of 108 exceeds 100 columns torvalds#97: FILE: tools/testing/selftests/vm/mremap_test.c:136: + char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); WARNING: Missing a blank line after declarations torvalds#98: FILE: tools/testing/selftests/vm/mremap_test.c:137: + char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + munmap(start + page_size, page_size); ERROR: space required before the open parenthesis '(' torvalds#107: FILE: tools/testing/selftests/vm/mremap_test.c:146: + while(getline(&line, &len, fp) != -1) { ERROR: space required after that ',' (ctx:VxV) torvalds#108: FILE: tools/testing/selftests/vm/mremap_test.c:147: + char *first = strtok(line,"- "); ^ ERROR: space required after that ',' (ctx:VxV) torvalds#110: FILE: tools/testing/selftests/vm/mremap_test.c:149: + char *second = strtok(NULL,"- "); ^ WARNING: Missing a blank line after declarations torvalds#112: FILE: tools/testing/selftests/vm/mremap_test.c:151: + void *second_val = (void *) strtol(second, NULL, 16); + if (first_val == start && second_val == start + 3 * page_size) { total: 3 errors, 3 warnings, 113 lines checked NOTE: For some of the reported defects, checkpatch may be able to mechanically convert to the typical style using --fix or --fix-inplace. ./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review. NOTE: If any of the errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. Please run checkpatch prior to sending patches Cc: Jakub Matěna <matenajakub@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp
pushed a commit
to intel-lab-lkp/linux
that referenced
this pull request
Sep 25, 2022
WARNING: line length of 108 exceeds 100 columns torvalds#97: FILE: tools/testing/selftests/vm/mremap_test.c:136: + char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); WARNING: Missing a blank line after declarations torvalds#98: FILE: tools/testing/selftests/vm/mremap_test.c:137: + char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + munmap(start + page_size, page_size); ERROR: space required before the open parenthesis '(' torvalds#107: FILE: tools/testing/selftests/vm/mremap_test.c:146: + while(getline(&line, &len, fp) != -1) { ERROR: space required after that ',' (ctx:VxV) torvalds#108: FILE: tools/testing/selftests/vm/mremap_test.c:147: + char *first = strtok(line,"- "); ^ ERROR: space required after that ',' (ctx:VxV) torvalds#110: FILE: tools/testing/selftests/vm/mremap_test.c:149: + char *second = strtok(NULL,"- "); ^ WARNING: Missing a blank line after declarations torvalds#112: FILE: tools/testing/selftests/vm/mremap_test.c:151: + void *second_val = (void *) strtol(second, NULL, 16); + if (first_val == start && second_val == start + 3 * page_size) { total: 3 errors, 3 warnings, 113 lines checked NOTE: For some of the reported defects, checkpatch may be able to mechanically convert to the typical style using --fix or --fix-inplace. ./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review. NOTE: If any of the errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. Please run checkpatch prior to sending patches Cc: Jakub Matěna <matenajakub@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp
pushed a commit
to intel-lab-lkp/linux
that referenced
this pull request
Sep 27, 2022
WARNING: line length of 108 exceeds 100 columns torvalds#97: FILE: tools/testing/selftests/vm/mremap_test.c:136: + char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); WARNING: Missing a blank line after declarations torvalds#98: FILE: tools/testing/selftests/vm/mremap_test.c:137: + char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + munmap(start + page_size, page_size); ERROR: space required before the open parenthesis '(' torvalds#107: FILE: tools/testing/selftests/vm/mremap_test.c:146: + while(getline(&line, &len, fp) != -1) { ERROR: space required after that ',' (ctx:VxV) torvalds#108: FILE: tools/testing/selftests/vm/mremap_test.c:147: + char *first = strtok(line,"- "); ^ ERROR: space required after that ',' (ctx:VxV) torvalds#110: FILE: tools/testing/selftests/vm/mremap_test.c:149: + char *second = strtok(NULL,"- "); ^ WARNING: Missing a blank line after declarations torvalds#112: FILE: tools/testing/selftests/vm/mremap_test.c:151: + void *second_val = (void *) strtol(second, NULL, 16); + if (first_val == start && second_val == start + 3 * page_size) { total: 3 errors, 3 warnings, 113 lines checked NOTE: For some of the reported defects, checkpatch may be able to mechanically convert to the typical style using --fix or --fix-inplace. ./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review. NOTE: If any of the errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. Please run checkpatch prior to sending patches Cc: Jakub Matěna <matenajakub@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp
pushed a commit
to intel-lab-lkp/linux
that referenced
this pull request
Oct 26, 2022
zsmalloc has 255 size classes. Size classes contain a number of zspages, which store objects of the same size. zspage can consist of up to four physical pages. The exact (most optimal) zspage size is calculated for each size class during zsmalloc pool creation. As a reasonable optimization, zsmalloc merges size classes that have similar characteristics: number of pages per zspage and number of objects zspage can store. For example, let's look at the following size classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable .. 94 1536 0 0 0 0 0 3 0 100 1632 0 0 0 0 0 2 0 .. Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time we store an object of size, say, 1568 bytes instead of using class torvalds#96 we end up storing it in size class torvalds#100. Class torvalds#100 is for objects of 1632 bytes in size, hence every 1568 bytes object wastes 1632-1568 bytes. Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects. When we need to store, say, 13 objects of size 1568 we end up allocating three zspages; in other words, 6 physical pages. However, if we'll look closer at size class torvalds#96 (which should hold objects of size 1568 bytes) and trace get_pages_per_zspage(): pages per zspage wasted bytes used% 1 960 76 2 352 95 3 1312 89 4 704 95 5 96 99 We'd notice that the most optimal zspage configuration for this class is when it consists of 5 physical pages, but currently we never let zspages to consists of more than 4 pages. A 5 page class torvalds#96 configuration would store 13 objects of size 1568 in a single zspage, allocating 5 physical pages, as opposed to 6 physical pages that class torvalds#100 will allocate. A higher order zspage for class torvalds#96 also changes its key characteristics: pages per-zspage and objects per-zspage. As a result classes torvalds#96 and torvalds#100 are not merged anymore, which gives us more compact zsmalloc. Of course the described effect does not apply only to size classes torvalds#96 and We still merge classes, but less often so. In other words classes are grouped in a more compact way, which decreases memory wastage: zspage order # unique size classes 2 69 3 123 4 191 Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 254 4096 0 0 0 0 0 1 0 ... For exactly same reason - maximum 4 pages per zspage - the last non-huge size class is torvalds#202, which stores objects of size 3264 bytes. Any object larger than 3264 bytes, hence, is considered to be huge and lands in size class torvalds#254, which uses a whole physical page to store every object. To put it slightly differently - objects in huge classes don't share physical pages. 3264 bytes is too low of a watermark and we have too many huge classes: classes from torvalds#203 to torvalds#254. Similarly to class size torvalds#96 above, higher order zspages change key characteristics for some of those huge size classes and thus those classes become normal classes, where stored objects share physical pages. Hence yet another consequence of higher order zspages: we move the huge size class watermark with higher order zspages, have less huge classes and store large objects in a more compact way. For order 3, huge class watermark becomes 3632 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 211 3408 0 0 0 0 0 5 0 217 3504 0 0 0 0 0 6 0 222 3584 0 0 0 0 0 7 0 225 3632 0 0 0 0 0 8 0 254 4096 0 0 0 0 0 1 0 ... For order 4, huge class watermark becomes 3840 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 206 3328 0 0 0 0 0 13 0 207 3344 0 0 0 0 0 9 0 208 3360 0 0 0 0 0 14 0 211 3408 0 0 0 0 0 5 0 212 3424 0 0 0 0 0 16 0 214 3456 0 0 0 0 0 11 0 217 3504 0 0 0 0 0 6 0 219 3536 0 0 0 0 0 13 0 222 3584 0 0 0 0 0 7 0 223 3600 0 0 0 0 0 15 0 225 3632 0 0 0 0 0 8 0 228 3680 0 0 0 0 0 9 0 230 3712 0 0 0 0 0 10 0 232 3744 0 0 0 0 0 11 0 234 3776 0 0 0 0 0 12 0 235 3792 0 0 0 0 0 13 0 236 3808 0 0 0 0 0 14 0 238 3840 0 0 0 0 0 15 0 254 4096 0 0 0 0 0 1 0 ... TESTS ===== 1) ChromeOS memory pressure test ============================================================================= Our standard memory pressure test, that is designed with reproducibility in mind. zram is configured as a swap device, lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after every test and rebooted device. Columns per (Documentation/admin-guide/blockdev/zram.rst) orig_data_size mem_used_total mem_used_max pages_compacted compr_data_size mem_limit same_pages huge_pages ORDER 2 (BASE) zspage 10353639424 2981711944 3166896128 0 3543158784 579494 825135 123707 10168573952 2932288347 3106541568 0 3499085824 565187 853137 126153 9950461952 2815911234 3035693056 0 3441090560 586696 748054 122103 9892335616 2779566152 2943459328 0 3514736640 591541 650696 119621 9993949184 2814279212 3021357056 0 3336421376 582488 711744 121273 9953226752 2856382009 3025649664 0 3512893440 564559 787861 123034 9838448640 2785481728 2997575680 0 3367219200 573282 777099 122739 ORDER 3 zspage 9509138432 2706941227 2823393280 0 3389587456 535856 1011472 90223 10105245696 2882368370 3013095424 0 3296165888 563896 1059033 94808 9531236352 2666125512 2867650560 0 3396173824 567117 1126396 88807 9561812992 2714536764 2956652544 0 3310505984 548223 827322 90992 9807470592 2790315707 2908053504 0 3378315264 563670 1020933 93725 10178371584 2948838782 3071209472 0 3329548288 548533 954546 90730 9925165056 2849839413 2958274560 0 3336978432 551464 1058302 89381 ORDER 4 zspage 9444515840 2613362645 2668232704 0 3396759552 573735 1162207 83475 10129108992 2925888488 3038351360 0 3499597824 555634 1231542 84525 9876594688 2786692282 2897006592 0 3469463552 584835 1290535 84133 10012909568 2649711847 2801512448 0 3171323904 675405 750728 80424 10120966144 2866742402 2978639872 0 3257815040 587435 1093981 83587 9578790912 2671245225 2802270208 0 3376353280 545548 1047930 80895 10108588032 2888433523 2983960576 0 3316641792 571445 1290640 81402 First, we establish that order 3 and 4 don't cause any statistically significant change in `orig_data_size` (number of bytes we store during the test), in other words larger zspages don't cause regressions. T-test for order 3: x order-2-stored + order-3-stored +-----------------------------------------------------------------------------+ |+ + + + x x + x x + x+ x| | |________________________AM__|_________M_____A____|__________| | +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 7 9.8384486e+09 1.0353639e+10 9.9532268e+09 1.0021519e+10 1.7916718e+08 + 7 9.5091384e+09 1.0178372e+10 9.8074706e+09 9.8026344e+09 2.7856206e+08 No difference proven at 95.0% confidence T-test for order 4: x order-2-stored + order-4-stored +-----------------------------------------------------------------------------+ | + | |+ + x +x xx x + ++ x x| | |__________________|____A____M____M____________|_| | +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 7 9.8384486e+09 1.0353639e+10 9.9532268e+09 1.0021519e+10 1.7916718e+08 + 7 9.4445158e+09 1.0129109e+10 1.001291e+10 9.8959249e+09 2.7947784e+08 No difference proven at 95.0% confidence Next we establish that there is a statistically significant improvement in `mem_used_total` metrics. T-test for order 3: x order-2-usedmem + order-3-usedmem +-----------------------------------------------------------------------------+ |+ + + x ++ x + xx x + x x| | |_________________A__M__|____________|__A________________| | +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 7 2.9434593e+09 3.1668961e+09 3.0256497e+09 3.0424532e+09 73235062 + 7 2.8233933e+09 3.0712095e+09 2.9566525e+09 2.9426185e+09 84630851 Difference at 95.0% confidence -9.98347e+07 +/- 9.21744e+07 -3.28139% +/- 3.02961% (Student's t, pooled s = 7.91383e+07) T-test for order 4: x order-2-usedmem + order-4-usedmem +-----------------------------------------------------------------------------+ | + x | |+ + + x ++ x x * x x| | |__________________A__M__________|_____|_M__A__________| | +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 7 2.9434593e+09 3.1668961e+09 3.0256497e+09 3.0424532e+09 73235062 + 7 2.6682327e+09 3.0383514e+09 2.8970066e+09 2.8814248e+09 1.3098053e+08 Difference at 95.0% confidence -1.61028e+08 +/- 1.23591e+08 -5.29272% +/- 4.0622% (Student's t, pooled s = 1.06111e+08) Order 3 zspages also show statistically significant improvement in `mem_used_max` metrics. T-test for order 3: x order-2-maxmem + order-3-maxmem +-----------------------------------------------------------------------------+ |+ + + x+ x + + + x x x x| | |________M__A_________|_|_____________________A___________M____________| | +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 7 3.3364214e+09 3.5431588e+09 3.4990858e+09 3.4592294e+09 80073158 + 7 3.2961659e+09 3.3961738e+09 3.3369784e+09 3.3481822e+09 39840377 Difference at 95.0% confidence -1.11047e+08 +/- 7.36589e+07 -3.21017% +/- 2.12934% (Student's t, pooled s = 6.32415e+07) Order 4 zspages, on the other hand, do not show any statistically significant improvement in `mem_used_max` metrics. T-test for order 4: x order-2-maxmem + order-4-maxmem +-----------------------------------------------------------------------------+ |+ + + x x + + x + * x x| | |_______________________A___M________________A_|_____M_______| | +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 7 3.3364214e+09 3.5431588e+09 3.4990858e+09 3.4592294e+09 80073158 + 7 3.1713239e+09 3.4995978e+09 3.3763533e+09 3.3554221e+09 1.1609062e+08 No difference proven at 95.0% confidence Overall, with sufficient level of confidence, order 3 zspages appear to be beneficial for these particular use-case and data patterns. Rather expectedly we also observed lower numbers of huge-pages when zsmalloc is configured with order 3 and order 4 zspages, for the reason already explained. 2) Synthetic test ============================================================================= Test untars linux-6.0.tar.xz and compiles the kernel. zram is configured as a block device with ext4 file system, lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after every test and rebooted the VM. orig_data_size mem_used_total mem_used_max pages_compacted compr_data_size mem_limit same_pages huge_pages ORDER 2 (BASE) zspage 1691791360 628086729 655171584 0 655171584 60 0 34043 1691787264 628089196 655175680 0 655175680 60 0 34046 1691803648 628098840 655187968 0 655187968 59 0 34047 1691795456 628091503 655183872 0 655183872 60 0 34044 1691799552 628086877 655183872 0 655183872 60 0 34047 ORDER 3 zspage 1691803648 627792993 641794048 0 641794048 60 0 33591 1691787264 627779342 641708032 0 641708032 59 0 33591 1691811840 627786616 641769472 0 641769472 60 0 33591 1691803648 627794468 641818624 0 641818624 59 0 33592 1691783168 627780882 641794048 0 641794048 61 0 33591 ORDER 4 zspage 1691803648 627726635 639655936 0 639655936 60 0 33435 1691811840 627733348 639643648 0 639643648 61 0 33434 1691795456 627726290 639614976 0 639614976 60 0 33435 1691803648 627730458 639688704 0 639688704 60 0 33434 1691811840 627727771 639688704 0 639688704 60 0 33434 Order 3 and order 4 show statistically significant improvement in `mem_used_max` metrics. T-test for order 3: x order-2-maxmem + order-3-maxmem +--------------------------------------------------------------------------+ |+ x| |+ x| |+ x| |++ x| |A| A| +--------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08 42210.666 Difference at 95.0% confidence -1.34038e+07 +/- 44080.7 -2.04581% +/- 0.00672802% (Student's t, pooled s = 30224.5) T-test for order 4: x order-2-maxmem + order-4-maxmem +--------------------------------------------------------------------------+ |+ x| |+ x| |+ x| |+ x| |+ x| |A A| +--------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.3961498e+08 6.396887e+08 6.3965594e+08 6.3965839e+08 31408.602 Difference at 95.0% confidence -1.55222e+07 +/- 33126.2 -2.36915% +/- 0.00505604% (Student's t, pooled s = 22713.4) This test tends to benefit more from order 4 zspages, due to test's data patterns. zsmalloc object distribution analysis ============================================================================= Order 2 (4 pages per zspage) tends to put many objects in size class 2048, which is merged with size classes torvalds#112-torvalds#125: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 0 6146 6146 1756 2 0 74 1216 0 1 4560 4552 1368 3 0 76 1248 0 1 2938 2934 904 4 0 83 1360 0 0 10971 10971 3657 1 0 91 1488 0 0 16126 16126 5864 4 0 94 1536 0 1 5912 5908 2217 3 0 100 1632 0 0 11990 11990 4796 2 0 107 1744 0 1 15771 15768 6759 3 0 111 1808 0 1 10386 10380 4616 4 0 126 2048 0 0 45444 45444 22722 1 0 144 2336 0 0 47446 47446 27112 4 0 151 2448 1 0 10760 10759 6456 3 0 168 2720 0 0 10173 10173 6782 2 0 190 3072 0 1 1700 1697 1275 3 0 202 3264 0 1 290 286 232 4 0 254 4096 0 0 34051 34051 34051 1 0 Order 3 (8 pages per zspage) changed pool characteristics and unmerged some of the size classes, which resulted in less objects being put into size class 2048, because there are lower size classes are now available for more compact object storage: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 1 2996 2994 856 2 0 72 1184 0 1 1632 1609 476 7 0 73 1200 1 0 1445 1442 425 5 0 74 1216 0 0 1510 1510 453 3 0 75 1232 0 1 1495 1479 455 7 0 76 1248 0 1 1456 1451 448 4 0 78 1280 0 1 3040 3033 950 5 0 79 1296 0 1 1584 1571 504 7 0 83 1360 0 0 6375 6375 2125 1 0 84 1376 0 1 1817 1796 632 8 0 87 1424 0 1 6020 6006 2107 7 0 88 1440 0 1 2108 2101 744 6 0 89 1456 0 1 2072 2064 740 5 0 91 1488 0 1 4169 4159 1516 4 0 92 1504 0 1 2014 2007 742 7 0 94 1536 0 1 3904 3900 1464 3 0 95 1552 0 1 1890 1873 720 8 0 96 1568 0 1 1963 1958 755 5 0 97 1584 0 1 1980 1974 770 7 0 100 1632 0 1 6190 6187 2476 2 0 103 1680 0 0 6477 6477 2667 7 0 104 1696 0 1 2256 2253 940 5 0 105 1712 0 1 2356 2340 992 8 0 107 1744 1 0 4697 4696 2013 3 0 110 1792 0 1 7744 7734 3388 7 0 111 1808 0 1 2655 2649 1180 4 0 114 1856 0 1 8371 8365 3805 5 0 116 1888 1 0 5863 5862 2706 6 0 117 1904 0 1 2955 2942 1379 7 0 118 1920 0 1 3009 2997 1416 8 0 126 2048 0 0 25276 25276 12638 1 0 128 2080 0 1 6060 6052 3232 8 0 129 2096 1 0 3081 3080 1659 7 0 134 2176 0 1 14835 14830 7912 8 0 135 2192 0 1 2769 2758 1491 7 0 137 2224 0 1 5082 5077 2772 6 0 140 2272 0 1 7236 7232 4020 5 0 144 2336 0 1 8428 8423 4816 4 0 147 2384 0 1 5316 5313 3101 7 0 151 2448 0 1 5445 5443 3267 3 0 155 2512 0 0 4121 4121 2536 8 0 158 2560 0 1 2208 2205 1380 5 0 160 2592 0 0 1133 1133 721 7 0 168 2720 0 0 2712 2712 1808 2 0 177 2864 1 0 1100 1098 770 7 0 180 2912 0 1 189 183 135 5 0 184 2976 0 1 176 166 128 8 0 190 3072 0 0 252 252 189 3 0 197 3184 0 1 198 192 154 7 0 202 3264 0 1 100 96 80 4 0 211 3408 0 1 210 208 175 5 0 217 3504 0 1 98 94 84 6 0 222 3584 0 0 104 104 91 7 0 225 3632 0 1 54 50 48 8 0 254 4096 0 0 33591 33591 33591 1 0 Note, the huge size watermark is above 3632 and there are a number of new normal classes available that previously were merged with the huge class. For instance, size class torvalds#211 holds 210 objects of size 3408 and uses 175 physical pages, while previously for those objects we would have used 210 physical pages. Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
intel-lab-lkp
pushed a commit
to intel-lab-lkp/linux
that referenced
this pull request
Oct 27, 2022
zsmalloc has 255 size classes. Size classes contain a number of zspages, which store objects of the same size. zspage can consist of up to four physical pages. The exact (most optimal) zspage size is calculated for each size class during zsmalloc pool creation. As a reasonable optimization, zsmalloc merges size classes that have similar characteristics: number of pages per zspage and number of objects zspage can store. For example, let's look at the following size classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable .. 94 1536 0 0 0 0 0 3 0 100 1632 0 0 0 0 0 2 0 .. Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time we store an object of size, say, 1568 bytes instead of using class torvalds#96 we end up storing it in size class torvalds#100. Class torvalds#100 is for objects of 1632 bytes in size, hence every 1568 bytes object wastes 1632-1568 bytes. Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects. When we need to store, say, 13 objects of size 1568 we end up allocating three zspages; in other words, 6 physical pages. However, if we'll look closer at size class torvalds#96 (which should hold objects of size 1568 bytes) and trace get_pages_per_zspage(): pages per zspage wasted bytes used% 1 960 76 2 352 95 3 1312 89 4 704 95 5 96 99 We'd notice that the most optimal zspage configuration for this class is when it consists of 5 physical pages, but currently we never let zspages to consists of more than 4 pages. A 5 page class torvalds#96 configuration would store 13 objects of size 1568 in a single zspage, allocating 5 physical pages, as opposed to 6 physical pages that class torvalds#100 will allocate. A higher order zspage for class torvalds#96 also changes its key characteristics: pages per-zspage and objects per-zspage. As a result classes torvalds#96 and torvalds#100 are not merged anymore, which gives us more compact zsmalloc. Of course the described effect does not apply only to size classes torvalds#96 and We still merge classes, but less often so. In other words classes are grouped in a more compact way, which decreases memory wastage: zspage order # unique size classes 2 69 3 123 4 191 Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 254 4096 0 0 0 0 0 1 0 ... For exactly same reason - maximum 4 pages per zspage - the last non-huge size class is torvalds#202, which stores objects of size 3264 bytes. Any object larger than 3264 bytes, hence, is considered to be huge and lands in size class torvalds#254, which uses a whole physical page to store every object. To put it slightly differently - objects in huge classes don't share physical pages. 3264 bytes is too low of a watermark and we have too many huge classes: classes from torvalds#203 to torvalds#254. Similarly to class size torvalds#96 above, higher order zspages change key characteristics for some of those huge size classes and thus those classes become normal classes, where stored objects share physical pages. Hence yet another consequence of higher order zspages: we move the huge size class watermark with higher order zspages, have less huge classes and store large objects in a more compact way. For order 3, huge class watermark becomes 3632 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 211 3408 0 0 0 0 0 5 0 217 3504 0 0 0 0 0 6 0 222 3584 0 0 0 0 0 7 0 225 3632 0 0 0 0 0 8 0 254 4096 0 0 0 0 0 1 0 ... For order 4, huge class watermark becomes 3840 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 206 3328 0 0 0 0 0 13 0 207 3344 0 0 0 0 0 9 0 208 3360 0 0 0 0 0 14 0 211 3408 0 0 0 0 0 5 0 212 3424 0 0 0 0 0 16 0 214 3456 0 0 0 0 0 11 0 217 3504 0 0 0 0 0 6 0 219 3536 0 0 0 0 0 13 0 222 3584 0 0 0 0 0 7 0 223 3600 0 0 0 0 0 15 0 225 3632 0 0 0 0 0 8 0 228 3680 0 0 0 0 0 9 0 230 3712 0 0 0 0 0 10 0 232 3744 0 0 0 0 0 11 0 234 3776 0 0 0 0 0 12 0 235 3792 0 0 0 0 0 13 0 236 3808 0 0 0 0 0 14 0 238 3840 0 0 0 0 0 15 0 254 4096 0 0 0 0 0 1 0 ... TESTS ===== Test untars linux-6.0.tar.xz and compiles the kernel. zram is configured as a block device with ext4 file system, lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after every test and rebooted the VM. orig_data_size mem_used_total mem_used_max pages_compacted compr_data_size mem_limit same_pages huge_pages ORDER 2 (BASE) zspage 1691791360 628086729 655171584 0 655171584 60 0 34043 1691787264 628089196 655175680 0 655175680 60 0 34046 1691803648 628098840 655187968 0 655187968 59 0 34047 1691795456 628091503 655183872 0 655183872 60 0 34044 1691799552 628086877 655183872 0 655183872 60 0 34047 ORDER 3 zspage 1691803648 627792993 641794048 0 641794048 60 0 33591 1691787264 627779342 641708032 0 641708032 59 0 33591 1691811840 627786616 641769472 0 641769472 60 0 33591 1691803648 627794468 641818624 0 641818624 59 0 33592 1691783168 627780882 641794048 0 641794048 61 0 33591 ORDER 4 zspage 1691803648 627726635 639655936 0 639655936 60 0 33435 1691811840 627733348 639643648 0 639643648 61 0 33434 1691795456 627726290 639614976 0 639614976 60 0 33435 1691803648 627730458 639688704 0 639688704 60 0 33434 1691811840 627727771 639688704 0 639688704 60 0 33434 Order 3 and order 4 show statistically significant improvement in `mem_used_max` metrics. T-test for order 3: x order-2-maxmem + order-3-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08 42210.666 Difference at 95.0% confidence -1.34038e+07 +/- 44080.7 -2.04581% +/- 0.00672802% (Student's t, pooled s = 30224.5) T-test for order 4: x order-2-maxmem + order-4-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.3961498e+08 6.396887e+08 6.3965594e+08 6.3965839e+08 31408.602 Difference at 95.0% confidence -1.55222e+07 +/- 33126.2 -2.36915% +/- 0.00505604% (Student's t, pooled s = 22713.4) This test tends to benefit more from order 4 zspages, due to test's data patterns. zsmalloc object distribution analysis ============================================================================= Order 2 (4 pages per zspage) tends to put many objects in size class 2048, which is merged with size classes torvalds#112-torvalds#125: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 0 6146 6146 1756 2 0 74 1216 0 1 4560 4552 1368 3 0 76 1248 0 1 2938 2934 904 4 0 83 1360 0 0 10971 10971 3657 1 0 91 1488 0 0 16126 16126 5864 4 0 94 1536 0 1 5912 5908 2217 3 0 100 1632 0 0 11990 11990 4796 2 0 107 1744 0 1 15771 15768 6759 3 0 111 1808 0 1 10386 10380 4616 4 0 126 2048 0 0 45444 45444 22722 1 0 144 2336 0 0 47446 47446 27112 4 0 151 2448 1 0 10760 10759 6456 3 0 168 2720 0 0 10173 10173 6782 2 0 190 3072 0 1 1700 1697 1275 3 0 202 3264 0 1 290 286 232 4 0 254 4096 0 0 34051 34051 34051 1 0 Order 3 (8 pages per zspage) changed pool characteristics and unmerged some of the size classes, which resulted in less objects being put into size class 2048, because there are lower size classes are now available for more compact object storage: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 1 2996 2994 856 2 0 72 1184 0 1 1632 1609 476 7 0 73 1200 1 0 1445 1442 425 5 0 74 1216 0 0 1510 1510 453 3 0 75 1232 0 1 1495 1479 455 7 0 76 1248 0 1 1456 1451 448 4 0 78 1280 0 1 3040 3033 950 5 0 79 1296 0 1 1584 1571 504 7 0 83 1360 0 0 6375 6375 2125 1 0 84 1376 0 1 1817 1796 632 8 0 87 1424 0 1 6020 6006 2107 7 0 88 1440 0 1 2108 2101 744 6 0 89 1456 0 1 2072 2064 740 5 0 91 1488 0 1 4169 4159 1516 4 0 92 1504 0 1 2014 2007 742 7 0 94 1536 0 1 3904 3900 1464 3 0 95 1552 0 1 1890 1873 720 8 0 96 1568 0 1 1963 1958 755 5 0 97 1584 0 1 1980 1974 770 7 0 100 1632 0 1 6190 6187 2476 2 0 103 1680 0 0 6477 6477 2667 7 0 104 1696 0 1 2256 2253 940 5 0 105 1712 0 1 2356 2340 992 8 0 107 1744 1 0 4697 4696 2013 3 0 110 1792 0 1 7744 7734 3388 7 0 111 1808 0 1 2655 2649 1180 4 0 114 1856 0 1 8371 8365 3805 5 0 116 1888 1 0 5863 5862 2706 6 0 117 1904 0 1 2955 2942 1379 7 0 118 1920 0 1 3009 2997 1416 8 0 126 2048 0 0 25276 25276 12638 1 0 128 2080 0 1 6060 6052 3232 8 0 129 2096 1 0 3081 3080 1659 7 0 134 2176 0 1 14835 14830 7912 8 0 135 2192 0 1 2769 2758 1491 7 0 137 2224 0 1 5082 5077 2772 6 0 140 2272 0 1 7236 7232 4020 5 0 144 2336 0 1 8428 8423 4816 4 0 147 2384 0 1 5316 5313 3101 7 0 151 2448 0 1 5445 5443 3267 3 0 155 2512 0 0 4121 4121 2536 8 0 158 2560 0 1 2208 2205 1380 5 0 160 2592 0 0 1133 1133 721 7 0 168 2720 0 0 2712 2712 1808 2 0 177 2864 1 0 1100 1098 770 7 0 180 2912 0 1 189 183 135 5 0 184 2976 0 1 176 166 128 8 0 190 3072 0 0 252 252 189 3 0 197 3184 0 1 198 192 154 7 0 202 3264 0 1 100 96 80 4 0 211 3408 0 1 210 208 175 5 0 217 3504 0 1 98 94 84 6 0 222 3584 0 0 104 104 91 7 0 225 3632 0 1 54 50 48 8 0 254 4096 0 0 33591 33591 33591 1 0 Note, the huge size watermark is above 3632 and there are a number of new normal classes available that previously were merged with the huge class. For instance, size class torvalds#211 holds 210 objects of size 3408 and uses 175 physical pages, while previously for those objects we would have used 210 physical pages. Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
jonhunter
pushed a commit
to jonhunter/linux
that referenced
this pull request
Oct 28, 2022
zsmalloc has 255 size classes. Size classes contain a number of zspages, which store objects of the same size. zspage can consist of up to four physical pages. The exact (most optimal) zspage size is calculated for each size class during zsmalloc pool creation. As a reasonable optimization, zsmalloc merges size classes that have similar characteristics: number of pages per zspage and number of objects zspage can store. For example, let's look at the following size classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable .. 94 1536 0 0 0 0 0 3 0 100 1632 0 0 0 0 0 2 0 .. Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time we store an object of size, say, 1568 bytes instead of using class torvalds#96 we end up storing it in size class torvalds#100. Class torvalds#100 is for objects of 1632 bytes in size, hence every 1568 bytes object wastes 1632-1568 bytes. Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects. When we need to store, say, 13 objects of size 1568 we end up allocating three zspages; in other words, 6 physical pages. However, if we'll look closer at size class torvalds#96 (which should hold objects of size 1568 bytes) and trace get_pages_per_zspage(): pages per zspage wasted bytes used% 1 960 76 2 352 95 3 1312 89 4 704 95 5 96 99 We'd notice that the most optimal zspage configuration for this class is when it consists of 5 physical pages, but currently we never let zspages to consists of more than 4 pages. A 5 page class torvalds#96 configuration would store 13 objects of size 1568 in a single zspage, allocating 5 physical pages, as opposed to 6 physical pages that class torvalds#100 will allocate. A higher order zspage for class torvalds#96 also changes its key characteristics: pages per-zspage and objects per-zspage. As a result classes torvalds#96 and torvalds#100 are not merged anymore, which gives us more compact zsmalloc. Of course the described effect does not apply only to size classes torvalds#96 and We still merge classes, but less often so. In other words classes are grouped in a more compact way, which decreases memory wastage: zspage order # unique size classes 2 69 3 123 4 191 Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 254 4096 0 0 0 0 0 1 0 ... For exactly same reason - maximum 4 pages per zspage - the last non-huge size class is torvalds#202, which stores objects of size 3264 bytes. Any object larger than 3264 bytes, hence, is considered to be huge and lands in size class torvalds#254, which uses a whole physical page to store every object. To put it slightly differently - objects in huge classes don't share physical pages. 3264 bytes is too low of a watermark and we have too many huge classes: classes from torvalds#203 to torvalds#254. Similarly to class size torvalds#96 above, higher order zspages change key characteristics for some of those huge size classes and thus those classes become normal classes, where stored objects share physical pages. Hence yet another consequence of higher order zspages: we move the huge size class watermark with higher order zspages, have less huge classes and store large objects in a more compact way. For order 3, huge class watermark becomes 3632 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 211 3408 0 0 0 0 0 5 0 217 3504 0 0 0 0 0 6 0 222 3584 0 0 0 0 0 7 0 225 3632 0 0 0 0 0 8 0 254 4096 0 0 0 0 0 1 0 ... For order 4, huge class watermark becomes 3840 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 206 3328 0 0 0 0 0 13 0 207 3344 0 0 0 0 0 9 0 208 3360 0 0 0 0 0 14 0 211 3408 0 0 0 0 0 5 0 212 3424 0 0 0 0 0 16 0 214 3456 0 0 0 0 0 11 0 217 3504 0 0 0 0 0 6 0 219 3536 0 0 0 0 0 13 0 222 3584 0 0 0 0 0 7 0 223 3600 0 0 0 0 0 15 0 225 3632 0 0 0 0 0 8 0 228 3680 0 0 0 0 0 9 0 230 3712 0 0 0 0 0 10 0 232 3744 0 0 0 0 0 11 0 234 3776 0 0 0 0 0 12 0 235 3792 0 0 0 0 0 13 0 236 3808 0 0 0 0 0 14 0 238 3840 0 0 0 0 0 15 0 254 4096 0 0 0 0 0 1 0 ... TESTS ===== Test untars linux-6.0.tar.xz and compiles the kernel. zram is configured as a block device with ext4 file system, lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after every test and rebooted the VM. orig_data_size mem_used_total mem_used_max pages_compacted compr_data_size mem_limit same_pages huge_pages ORDER 2 (BASE) zspage 1691791360 628086729 655171584 0 655171584 60 0 34043 1691787264 628089196 655175680 0 655175680 60 0 34046 1691803648 628098840 655187968 0 655187968 59 0 34047 1691795456 628091503 655183872 0 655183872 60 0 34044 1691799552 628086877 655183872 0 655183872 60 0 34047 ORDER 3 zspage 1691803648 627792993 641794048 0 641794048 60 0 33591 1691787264 627779342 641708032 0 641708032 59 0 33591 1691811840 627786616 641769472 0 641769472 60 0 33591 1691803648 627794468 641818624 0 641818624 59 0 33592 1691783168 627780882 641794048 0 641794048 61 0 33591 ORDER 4 zspage 1691803648 627726635 639655936 0 639655936 60 0 33435 1691811840 627733348 639643648 0 639643648 61 0 33434 1691795456 627726290 639614976 0 639614976 60 0 33435 1691803648 627730458 639688704 0 639688704 60 0 33434 1691811840 627727771 639688704 0 639688704 60 0 33434 Order 3 and order 4 show statistically significant improvement in `mem_used_max` metrics. T-test for order 3: x order-2-maxmem + order-3-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08 42210.666 Difference at 95.0% confidence -1.34038e+07 +/- 44080.7 -2.04581% +/- 0.00672802% (Student's t, pooled s = 30224.5) T-test for order 4: x order-2-maxmem + order-4-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.3961498e+08 6.396887e+08 6.3965594e+08 6.3965839e+08 31408.602 Difference at 95.0% confidence -1.55222e+07 +/- 33126.2 -2.36915% +/- 0.00505604% (Student's t, pooled s = 22713.4) This test tends to benefit more from order 4 zspages, due to test's data patterns. zsmalloc object distribution analysis ============================================================================= Order 2 (4 pages per zspage) tends to put many objects in size class 2048, which is merged with size classes torvalds#112-torvalds#125: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 0 6146 6146 1756 2 0 74 1216 0 1 4560 4552 1368 3 0 76 1248 0 1 2938 2934 904 4 0 83 1360 0 0 10971 10971 3657 1 0 91 1488 0 0 16126 16126 5864 4 0 94 1536 0 1 5912 5908 2217 3 0 100 1632 0 0 11990 11990 4796 2 0 107 1744 0 1 15771 15768 6759 3 0 111 1808 0 1 10386 10380 4616 4 0 126 2048 0 0 45444 45444 22722 1 0 144 2336 0 0 47446 47446 27112 4 0 151 2448 1 0 10760 10759 6456 3 0 168 2720 0 0 10173 10173 6782 2 0 190 3072 0 1 1700 1697 1275 3 0 202 3264 0 1 290 286 232 4 0 254 4096 0 0 34051 34051 34051 1 0 Order 3 (8 pages per zspage) changed pool characteristics and unmerged some of the size classes, which resulted in less objects being put into size class 2048, because there are lower size classes are now available for more compact object storage: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 1 2996 2994 856 2 0 72 1184 0 1 1632 1609 476 7 0 73 1200 1 0 1445 1442 425 5 0 74 1216 0 0 1510 1510 453 3 0 75 1232 0 1 1495 1479 455 7 0 76 1248 0 1 1456 1451 448 4 0 78 1280 0 1 3040 3033 950 5 0 79 1296 0 1 1584 1571 504 7 0 83 1360 0 0 6375 6375 2125 1 0 84 1376 0 1 1817 1796 632 8 0 87 1424 0 1 6020 6006 2107 7 0 88 1440 0 1 2108 2101 744 6 0 89 1456 0 1 2072 2064 740 5 0 91 1488 0 1 4169 4159 1516 4 0 92 1504 0 1 2014 2007 742 7 0 94 1536 0 1 3904 3900 1464 3 0 95 1552 0 1 1890 1873 720 8 0 96 1568 0 1 1963 1958 755 5 0 97 1584 0 1 1980 1974 770 7 0 100 1632 0 1 6190 6187 2476 2 0 103 1680 0 0 6477 6477 2667 7 0 104 1696 0 1 2256 2253 940 5 0 105 1712 0 1 2356 2340 992 8 0 107 1744 1 0 4697 4696 2013 3 0 110 1792 0 1 7744 7734 3388 7 0 111 1808 0 1 2655 2649 1180 4 0 114 1856 0 1 8371 8365 3805 5 0 116 1888 1 0 5863 5862 2706 6 0 117 1904 0 1 2955 2942 1379 7 0 118 1920 0 1 3009 2997 1416 8 0 126 2048 0 0 25276 25276 12638 1 0 128 2080 0 1 6060 6052 3232 8 0 129 2096 1 0 3081 3080 1659 7 0 134 2176 0 1 14835 14830 7912 8 0 135 2192 0 1 2769 2758 1491 7 0 137 2224 0 1 5082 5077 2772 6 0 140 2272 0 1 7236 7232 4020 5 0 144 2336 0 1 8428 8423 4816 4 0 147 2384 0 1 5316 5313 3101 7 0 151 2448 0 1 5445 5443 3267 3 0 155 2512 0 0 4121 4121 2536 8 0 158 2560 0 1 2208 2205 1380 5 0 160 2592 0 0 1133 1133 721 7 0 168 2720 0 0 2712 2712 1808 2 0 177 2864 1 0 1100 1098 770 7 0 180 2912 0 1 189 183 135 5 0 184 2976 0 1 176 166 128 8 0 190 3072 0 0 252 252 189 3 0 197 3184 0 1 198 192 154 7 0 202 3264 0 1 100 96 80 4 0 211 3408 0 1 210 208 175 5 0 217 3504 0 1 98 94 84 6 0 222 3584 0 0 104 104 91 7 0 225 3632 0 1 54 50 48 8 0 254 4096 0 0 33591 33591 33591 1 0 Note, the huge size watermark is above 3632 and there are a number of new normal classes available that previously were merged with the huge class. For instance, size class torvalds#211 holds 210 objects of size 3408 and uses 175 physical pages, while previously for those objects we would have used 210 physical pages. Link: https://lkml.kernel.org/r/20221027042651.234524-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Alexey Romanov <avromanov@sberdevices.ru> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp
pushed a commit
to intel-lab-lkp/linux
that referenced
this pull request
Oct 29, 2022
zsmalloc has 255 size classes. Size classes contain a number of zspages, which store objects of the same size. zspage can consist of up to four physical pages. The exact (most optimal) zspage size is calculated for each size class during zsmalloc pool creation. As a reasonable optimization, zsmalloc merges size classes that have similar characteristics: number of pages per zspage and number of objects zspage can store. For example, let's look at the following size classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable .. 94 1536 0 0 0 0 0 3 0 100 1632 0 0 0 0 0 2 0 .. Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time we store an object of size, say, 1568 bytes instead of using class torvalds#96 we end up storing it in size class torvalds#100. Class torvalds#100 is for objects of 1632 bytes in size, hence every 1568 bytes object wastes 1632-1568 bytes. Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects. When we need to store, say, 13 objects of size 1568 we end up allocating three zspages; in other words, 6 physical pages. However, if we'll look closer at size class torvalds#96 (which should hold objects of size 1568 bytes) and trace get_pages_per_zspage(): pages per zspage wasted bytes used% 1 960 76 2 352 95 3 1312 89 4 704 95 5 96 99 We'd notice that the most optimal zspage configuration for this class is when it consists of 5 physical pages, but currently we never let zspages to consists of more than 4 pages. A 5 page class torvalds#96 configuration would store 13 objects of size 1568 in a single zspage, allocating 5 physical pages, as opposed to 6 physical pages that class torvalds#100 will allocate. A higher order zspage for class torvalds#96 also changes its key characteristics: pages per-zspage and objects per-zspage. As a result classes torvalds#96 and torvalds#100 are not merged anymore, which gives us more compact zsmalloc. Of course the described effect does not apply only to size classes torvalds#96 and We still merge classes, but less often so. In other words classes are grouped in a more compact way, which decreases memory wastage: zspage order # unique size classes 2 69 3 123 4 191 Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 254 4096 0 0 0 0 0 1 0 ... For exactly same reason - maximum 4 pages per zspage - the last non-huge size class is torvalds#202, which stores objects of size 3264 bytes. Any object larger than 3264 bytes, hence, is considered to be huge and lands in size class torvalds#254, which uses a whole physical page to store every object. To put it slightly differently - objects in huge classes don't share physical pages. 3264 bytes is too low of a watermark and we have too many huge classes: classes from torvalds#203 to torvalds#254. Similarly to class size torvalds#96 above, higher order zspages change key characteristics for some of those huge size classes and thus those classes become normal classes, where stored objects share physical pages. Hence yet another consequence of higher order zspages: we move the huge size class watermark with higher order zspages, have less huge classes and store large objects in a more compact way. For order 3, huge class watermark becomes 3632 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 211 3408 0 0 0 0 0 5 0 217 3504 0 0 0 0 0 6 0 222 3584 0 0 0 0 0 7 0 225 3632 0 0 0 0 0 8 0 254 4096 0 0 0 0 0 1 0 ... For order 4, huge class watermark becomes 3840 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 206 3328 0 0 0 0 0 13 0 207 3344 0 0 0 0 0 9 0 208 3360 0 0 0 0 0 14 0 211 3408 0 0 0 0 0 5 0 212 3424 0 0 0 0 0 16 0 214 3456 0 0 0 0 0 11 0 217 3504 0 0 0 0 0 6 0 219 3536 0 0 0 0 0 13 0 222 3584 0 0 0 0 0 7 0 223 3600 0 0 0 0 0 15 0 225 3632 0 0 0 0 0 8 0 228 3680 0 0 0 0 0 9 0 230 3712 0 0 0 0 0 10 0 232 3744 0 0 0 0 0 11 0 234 3776 0 0 0 0 0 12 0 235 3792 0 0 0 0 0 13 0 236 3808 0 0 0 0 0 14 0 238 3840 0 0 0 0 0 15 0 254 4096 0 0 0 0 0 1 0 ... TESTS ===== Test untars linux-6.0.tar.xz and compiles the kernel. zram is configured as a block device with ext4 file system, lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after every test and rebooted the VM. orig_data_size mem_used_total mem_used_max pages_compacted compr_data_size mem_limit same_pages huge_pages ORDER 2 (BASE) zspage 1691791360 628086729 655171584 0 655171584 60 0 34043 1691787264 628089196 655175680 0 655175680 60 0 34046 1691803648 628098840 655187968 0 655187968 59 0 34047 1691795456 628091503 655183872 0 655183872 60 0 34044 1691799552 628086877 655183872 0 655183872 60 0 34047 ORDER 3 zspage 1691803648 627792993 641794048 0 641794048 60 0 33591 1691787264 627779342 641708032 0 641708032 59 0 33591 1691811840 627786616 641769472 0 641769472 60 0 33591 1691803648 627794468 641818624 0 641818624 59 0 33592 1691783168 627780882 641794048 0 641794048 61 0 33591 ORDER 4 zspage 1691803648 627726635 639655936 0 639655936 60 0 33435 1691811840 627733348 639643648 0 639643648 61 0 33434 1691795456 627726290 639614976 0 639614976 60 0 33435 1691803648 627730458 639688704 0 639688704 60 0 33434 1691811840 627727771 639688704 0 639688704 60 0 33434 Order 3 and order 4 show statistically significant improvement in `mem_used_max` metrics. T-test for order 3: x order-2-maxmem + order-3-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08 42210.666 Difference at 95.0% confidence -1.34038e+07 +/- 44080.7 -2.04581% +/- 0.00672802% (Student's t, pooled s = 30224.5) T-test for order 4: x order-2-maxmem + order-4-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.3961498e+08 6.396887e+08 6.3965594e+08 6.3965839e+08 31408.602 Difference at 95.0% confidence -1.55222e+07 +/- 33126.2 -2.36915% +/- 0.00505604% (Student's t, pooled s = 22713.4) This test tends to benefit more from order 4 zspages, due to test's data patterns. zsmalloc object distribution analysis ============================================================================= Order 2 (4 pages per zspage) tends to put many objects in size class 2048, which is merged with size classes torvalds#112-torvalds#125: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 0 6146 6146 1756 2 0 74 1216 0 1 4560 4552 1368 3 0 76 1248 0 1 2938 2934 904 4 0 83 1360 0 0 10971 10971 3657 1 0 91 1488 0 0 16126 16126 5864 4 0 94 1536 0 1 5912 5908 2217 3 0 100 1632 0 0 11990 11990 4796 2 0 107 1744 0 1 15771 15768 6759 3 0 111 1808 0 1 10386 10380 4616 4 0 126 2048 0 0 45444 45444 22722 1 0 144 2336 0 0 47446 47446 27112 4 0 151 2448 1 0 10760 10759 6456 3 0 168 2720 0 0 10173 10173 6782 2 0 190 3072 0 1 1700 1697 1275 3 0 202 3264 0 1 290 286 232 4 0 254 4096 0 0 34051 34051 34051 1 0 Order 3 (8 pages per zspage) changed pool characteristics and unmerged some of the size classes, which resulted in less objects being put into size class 2048, because there are lower size classes are now available for more compact object storage: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 1 2996 2994 856 2 0 72 1184 0 1 1632 1609 476 7 0 73 1200 1 0 1445 1442 425 5 0 74 1216 0 0 1510 1510 453 3 0 75 1232 0 1 1495 1479 455 7 0 76 1248 0 1 1456 1451 448 4 0 78 1280 0 1 3040 3033 950 5 0 79 1296 0 1 1584 1571 504 7 0 83 1360 0 0 6375 6375 2125 1 0 84 1376 0 1 1817 1796 632 8 0 87 1424 0 1 6020 6006 2107 7 0 88 1440 0 1 2108 2101 744 6 0 89 1456 0 1 2072 2064 740 5 0 91 1488 0 1 4169 4159 1516 4 0 92 1504 0 1 2014 2007 742 7 0 94 1536 0 1 3904 3900 1464 3 0 95 1552 0 1 1890 1873 720 8 0 96 1568 0 1 1963 1958 755 5 0 97 1584 0 1 1980 1974 770 7 0 100 1632 0 1 6190 6187 2476 2 0 103 1680 0 0 6477 6477 2667 7 0 104 1696 0 1 2256 2253 940 5 0 105 1712 0 1 2356 2340 992 8 0 107 1744 1 0 4697 4696 2013 3 0 110 1792 0 1 7744 7734 3388 7 0 111 1808 0 1 2655 2649 1180 4 0 114 1856 0 1 8371 8365 3805 5 0 116 1888 1 0 5863 5862 2706 6 0 117 1904 0 1 2955 2942 1379 7 0 118 1920 0 1 3009 2997 1416 8 0 126 2048 0 0 25276 25276 12638 1 0 128 2080 0 1 6060 6052 3232 8 0 129 2096 1 0 3081 3080 1659 7 0 134 2176 0 1 14835 14830 7912 8 0 135 2192 0 1 2769 2758 1491 7 0 137 2224 0 1 5082 5077 2772 6 0 140 2272 0 1 7236 7232 4020 5 0 144 2336 0 1 8428 8423 4816 4 0 147 2384 0 1 5316 5313 3101 7 0 151 2448 0 1 5445 5443 3267 3 0 155 2512 0 0 4121 4121 2536 8 0 158 2560 0 1 2208 2205 1380 5 0 160 2592 0 0 1133 1133 721 7 0 168 2720 0 0 2712 2712 1808 2 0 177 2864 1 0 1100 1098 770 7 0 180 2912 0 1 189 183 135 5 0 184 2976 0 1 176 166 128 8 0 190 3072 0 0 252 252 189 3 0 197 3184 0 1 198 192 154 7 0 202 3264 0 1 100 96 80 4 0 211 3408 0 1 210 208 175 5 0 217 3504 0 1 98 94 84 6 0 222 3584 0 0 104 104 91 7 0 225 3632 0 1 54 50 48 8 0 254 4096 0 0 33591 33591 33591 1 0 Note, the huge size watermark is above 3632 and there are a number of new normal classes available that previously were merged with the huge class. For instance, size class torvalds#211 holds 210 objects of size 3408 and uses 175 physical pages, while previously for those objects we would have used 210 physical pages. Link: https://lkml.kernel.org/r/20221027042651.234524-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Alexey Romanov <avromanov@sberdevices.ru> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp
pushed a commit
to intel-lab-lkp/linux
that referenced
this pull request
Nov 1, 2022
zsmalloc has 255 size classes. Size classes contain a number of zspages, which store objects of the same size. zspage can consist of up to four physical pages. The exact (most optimal) zspage size is calculated for each size class during zsmalloc pool creation. As a reasonable optimization, zsmalloc merges size classes that have similar characteristics: number of pages per zspage and number of objects zspage can store. For example, let's look at the following size classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable .. 94 1536 0 0 0 0 0 3 0 100 1632 0 0 0 0 0 2 0 .. Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time we store an object of size, say, 1568 bytes instead of using class torvalds#96 we end up storing it in size class torvalds#100. Class torvalds#100 is for objects of 1632 bytes in size, hence every 1568 bytes object wastes 1632-1568 bytes. Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects. When we need to store, say, 13 objects of size 1568 we end up allocating three zspages; in other words, 6 physical pages. However, if we'll look closer at size class torvalds#96 (which should hold objects of size 1568 bytes) and trace get_pages_per_zspage(): pages per zspage wasted bytes used% 1 960 76 2 352 95 3 1312 89 4 704 95 5 96 99 We'd notice that the most optimal zspage configuration for this class is when it consists of 5 physical pages, but currently we never let zspages to consists of more than 4 pages. A 5 page class torvalds#96 configuration would store 13 objects of size 1568 in a single zspage, allocating 5 physical pages, as opposed to 6 physical pages that class torvalds#100 will allocate. A higher order zspage for class torvalds#96 also changes its key characteristics: pages per-zspage and objects per-zspage. As a result classes torvalds#96 and torvalds#100 are not merged anymore, which gives us more compact zsmalloc. Of course the described effect does not apply only to size classes torvalds#96 and We still merge classes, but less often so. In other words classes are grouped in a more compact way, which decreases memory wastage: zspage order # unique size classes 2 69 3 123 4 191 Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 254 4096 0 0 0 0 0 1 0 ... For exactly same reason - maximum 4 pages per zspage - the last non-huge size class is torvalds#202, which stores objects of size 3264 bytes. Any object larger than 3264 bytes, hence, is considered to be huge and lands in size class torvalds#254, which uses a whole physical page to store every object. To put it slightly differently - objects in huge classes don't share physical pages. 3264 bytes is too low of a watermark and we have too many huge classes: classes from torvalds#203 to torvalds#254. Similarly to class size torvalds#96 above, higher order zspages change key characteristics for some of those huge size classes and thus those classes become normal classes, where stored objects share physical pages. Hence yet another consequence of higher order zspages: we move the huge size class watermark with higher order zspages, have less huge classes and store large objects in a more compact way. For order 3, huge class watermark becomes 3632 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 211 3408 0 0 0 0 0 5 0 217 3504 0 0 0 0 0 6 0 222 3584 0 0 0 0 0 7 0 225 3632 0 0 0 0 0 8 0 254 4096 0 0 0 0 0 1 0 ... For order 4, huge class watermark becomes 3840 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 206 3328 0 0 0 0 0 13 0 207 3344 0 0 0 0 0 9 0 208 3360 0 0 0 0 0 14 0 211 3408 0 0 0 0 0 5 0 212 3424 0 0 0 0 0 16 0 214 3456 0 0 0 0 0 11 0 217 3504 0 0 0 0 0 6 0 219 3536 0 0 0 0 0 13 0 222 3584 0 0 0 0 0 7 0 223 3600 0 0 0 0 0 15 0 225 3632 0 0 0 0 0 8 0 228 3680 0 0 0 0 0 9 0 230 3712 0 0 0 0 0 10 0 232 3744 0 0 0 0 0 11 0 234 3776 0 0 0 0 0 12 0 235 3792 0 0 0 0 0 13 0 236 3808 0 0 0 0 0 14 0 238 3840 0 0 0 0 0 15 0 254 4096 0 0 0 0 0 1 0 ... TESTS ===== Test untars linux-6.0.tar.xz and compiles the kernel. zram is configured as a block device with ext4 file system, lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after every test and rebooted the VM. orig_data_size mem_used_total mem_used_max pages_compacted compr_data_size mem_limit same_pages huge_pages ORDER 2 (BASE) zspage 1691791360 628086729 655171584 0 655171584 60 0 34043 1691787264 628089196 655175680 0 655175680 60 0 34046 1691803648 628098840 655187968 0 655187968 59 0 34047 1691795456 628091503 655183872 0 655183872 60 0 34044 1691799552 628086877 655183872 0 655183872 60 0 34047 ORDER 3 zspage 1691803648 627792993 641794048 0 641794048 60 0 33591 1691787264 627779342 641708032 0 641708032 59 0 33591 1691811840 627786616 641769472 0 641769472 60 0 33591 1691803648 627794468 641818624 0 641818624 59 0 33592 1691783168 627780882 641794048 0 641794048 61 0 33591 ORDER 4 zspage 1691803648 627726635 639655936 0 639655936 60 0 33435 1691811840 627733348 639643648 0 639643648 61 0 33434 1691795456 627726290 639614976 0 639614976 60 0 33435 1691803648 627730458 639688704 0 639688704 60 0 33434 1691811840 627727771 639688704 0 639688704 60 0 33434 Order 3 and order 4 show statistically significant improvement in `mem_used_max` metrics. T-test for order 3: x order-2-maxmem + order-3-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08 42210.666 Difference at 95.0% confidence -1.34038e+07 +/- 44080.7 -2.04581% +/- 0.00672802% (Student's t, pooled s = 30224.5) T-test for order 4: x order-2-maxmem + order-4-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.3961498e+08 6.396887e+08 6.3965594e+08 6.3965839e+08 31408.602 Difference at 95.0% confidence -1.55222e+07 +/- 33126.2 -2.36915% +/- 0.00505604% (Student's t, pooled s = 22713.4) This test tends to benefit more from order 4 zspages, due to test's data patterns. zsmalloc object distribution analysis ============================================================================= Order 2 (4 pages per zspage) tends to put many objects in size class 2048, which is merged with size classes torvalds#112-torvalds#125: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 0 6146 6146 1756 2 0 74 1216 0 1 4560 4552 1368 3 0 76 1248 0 1 2938 2934 904 4 0 83 1360 0 0 10971 10971 3657 1 0 91 1488 0 0 16126 16126 5864 4 0 94 1536 0 1 5912 5908 2217 3 0 100 1632 0 0 11990 11990 4796 2 0 107 1744 0 1 15771 15768 6759 3 0 111 1808 0 1 10386 10380 4616 4 0 126 2048 0 0 45444 45444 22722 1 0 144 2336 0 0 47446 47446 27112 4 0 151 2448 1 0 10760 10759 6456 3 0 168 2720 0 0 10173 10173 6782 2 0 190 3072 0 1 1700 1697 1275 3 0 202 3264 0 1 290 286 232 4 0 254 4096 0 0 34051 34051 34051 1 0 Order 3 (8 pages per zspage) changed pool characteristics and unmerged some of the size classes, which resulted in less objects being put into size class 2048, because there are lower size classes are now available for more compact object storage: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 1 2996 2994 856 2 0 72 1184 0 1 1632 1609 476 7 0 73 1200 1 0 1445 1442 425 5 0 74 1216 0 0 1510 1510 453 3 0 75 1232 0 1 1495 1479 455 7 0 76 1248 0 1 1456 1451 448 4 0 78 1280 0 1 3040 3033 950 5 0 79 1296 0 1 1584 1571 504 7 0 83 1360 0 0 6375 6375 2125 1 0 84 1376 0 1 1817 1796 632 8 0 87 1424 0 1 6020 6006 2107 7 0 88 1440 0 1 2108 2101 744 6 0 89 1456 0 1 2072 2064 740 5 0 91 1488 0 1 4169 4159 1516 4 0 92 1504 0 1 2014 2007 742 7 0 94 1536 0 1 3904 3900 1464 3 0 95 1552 0 1 1890 1873 720 8 0 96 1568 0 1 1963 1958 755 5 0 97 1584 0 1 1980 1974 770 7 0 100 1632 0 1 6190 6187 2476 2 0 103 1680 0 0 6477 6477 2667 7 0 104 1696 0 1 2256 2253 940 5 0 105 1712 0 1 2356 2340 992 8 0 107 1744 1 0 4697 4696 2013 3 0 110 1792 0 1 7744 7734 3388 7 0 111 1808 0 1 2655 2649 1180 4 0 114 1856 0 1 8371 8365 3805 5 0 116 1888 1 0 5863 5862 2706 6 0 117 1904 0 1 2955 2942 1379 7 0 118 1920 0 1 3009 2997 1416 8 0 126 2048 0 0 25276 25276 12638 1 0 128 2080 0 1 6060 6052 3232 8 0 129 2096 1 0 3081 3080 1659 7 0 134 2176 0 1 14835 14830 7912 8 0 135 2192 0 1 2769 2758 1491 7 0 137 2224 0 1 5082 5077 2772 6 0 140 2272 0 1 7236 7232 4020 5 0 144 2336 0 1 8428 8423 4816 4 0 147 2384 0 1 5316 5313 3101 7 0 151 2448 0 1 5445 5443 3267 3 0 155 2512 0 0 4121 4121 2536 8 0 158 2560 0 1 2208 2205 1380 5 0 160 2592 0 0 1133 1133 721 7 0 168 2720 0 0 2712 2712 1808 2 0 177 2864 1 0 1100 1098 770 7 0 180 2912 0 1 189 183 135 5 0 184 2976 0 1 176 166 128 8 0 190 3072 0 0 252 252 189 3 0 197 3184 0 1 198 192 154 7 0 202 3264 0 1 100 96 80 4 0 211 3408 0 1 210 208 175 5 0 217 3504 0 1 98 94 84 6 0 222 3584 0 0 104 104 91 7 0 225 3632 0 1 54 50 48 8 0 254 4096 0 0 33591 33591 33591 1 0 Note, the huge size watermark is above 3632 and there are a number of new normal classes available that previously were merged with the huge class. For instance, size class torvalds#211 holds 210 objects of size 3408 and uses 175 physical pages, while previously for those objects we would have used 210 physical pages. Link: https://lkml.kernel.org/r/20221031054108.541190-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Alexey Romanov <avromanov@sberdevices.ru> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp
pushed a commit
to intel-lab-lkp/linux
that referenced
this pull request
Nov 1, 2022
zsmalloc has 255 size classes. Size classes contain a number of zspages, which store objects of the same size. zspage can consist of up to four physical pages. The exact (most optimal) zspage size is calculated for each size class during zsmalloc pool creation. As a reasonable optimization, zsmalloc merges size classes that have similar characteristics: number of pages per zspage and number of objects zspage can store. For example, let's look at the following size classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable .. 94 1536 0 0 0 0 0 3 0 100 1632 0 0 0 0 0 2 0 .. Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time we store an object of size, say, 1568 bytes instead of using class torvalds#96 we end up storing it in size class torvalds#100. Class torvalds#100 is for objects of 1632 bytes in size, hence every 1568 bytes object wastes 1632-1568 bytes. Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects. When we need to store, say, 13 objects of size 1568 we end up allocating three zspages; in other words, 6 physical pages. However, if we'll look closer at size class torvalds#96 (which should hold objects of size 1568 bytes) and trace get_pages_per_zspage(): pages per zspage wasted bytes used% 1 960 76 2 352 95 3 1312 89 4 704 95 5 96 99 We'd notice that the most optimal zspage configuration for this class is when it consists of 5 physical pages, but currently we never let zspages to consists of more than 4 pages. A 5 page class torvalds#96 configuration would store 13 objects of size 1568 in a single zspage, allocating 5 physical pages, as opposed to 6 physical pages that class torvalds#100 will allocate. A higher order zspage for class torvalds#96 also changes its key characteristics: pages per-zspage and objects per-zspage. As a result classes torvalds#96 and torvalds#100 are not merged anymore, which gives us more compact zsmalloc. Of course the described effect does not apply only to size classes torvalds#96 and We still merge classes, but less often so. In other words classes are grouped in a more compact way, which decreases memory wastage: zspage order # unique size classes 2 69 3 123 4 191 Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 254 4096 0 0 0 0 0 1 0 ... For exactly same reason - maximum 4 pages per zspage - the last non-huge size class is torvalds#202, which stores objects of size 3264 bytes. Any object larger than 3264 bytes, hence, is considered to be huge and lands in size class torvalds#254, which uses a whole physical page to store every object. To put it slightly differently - objects in huge classes don't share physical pages. 3264 bytes is too low of a watermark and we have too many huge classes: classes from torvalds#203 to torvalds#254. Similarly to class size torvalds#96 above, higher order zspages change key characteristics for some of those huge size classes and thus those classes become normal classes, where stored objects share physical pages. Hence yet another consequence of higher order zspages: we move the huge size class watermark with higher order zspages, have less huge classes and store large objects in a more compact way. For order 3, huge class watermark becomes 3632 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 211 3408 0 0 0 0 0 5 0 217 3504 0 0 0 0 0 6 0 222 3584 0 0 0 0 0 7 0 225 3632 0 0 0 0 0 8 0 254 4096 0 0 0 0 0 1 0 ... For order 4, huge class watermark becomes 3840 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 206 3328 0 0 0 0 0 13 0 207 3344 0 0 0 0 0 9 0 208 3360 0 0 0 0 0 14 0 211 3408 0 0 0 0 0 5 0 212 3424 0 0 0 0 0 16 0 214 3456 0 0 0 0 0 11 0 217 3504 0 0 0 0 0 6 0 219 3536 0 0 0 0 0 13 0 222 3584 0 0 0 0 0 7 0 223 3600 0 0 0 0 0 15 0 225 3632 0 0 0 0 0 8 0 228 3680 0 0 0 0 0 9 0 230 3712 0 0 0 0 0 10 0 232 3744 0 0 0 0 0 11 0 234 3776 0 0 0 0 0 12 0 235 3792 0 0 0 0 0 13 0 236 3808 0 0 0 0 0 14 0 238 3840 0 0 0 0 0 15 0 254 4096 0 0 0 0 0 1 0 ... TESTS ===== Test untars linux-6.0.tar.xz and compiles the kernel. zram is configured as a block device with ext4 file system, lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after every test and rebooted the VM. orig_data_size mem_used_total mem_used_max pages_compacted compr_data_size mem_limit same_pages huge_pages ORDER 2 (BASE) zspage 1691791360 628086729 655171584 0 655171584 60 0 34043 1691787264 628089196 655175680 0 655175680 60 0 34046 1691803648 628098840 655187968 0 655187968 59 0 34047 1691795456 628091503 655183872 0 655183872 60 0 34044 1691799552 628086877 655183872 0 655183872 60 0 34047 ORDER 3 zspage 1691803648 627792993 641794048 0 641794048 60 0 33591 1691787264 627779342 641708032 0 641708032 59 0 33591 1691811840 627786616 641769472 0 641769472 60 0 33591 1691803648 627794468 641818624 0 641818624 59 0 33592 1691783168 627780882 641794048 0 641794048 61 0 33591 ORDER 4 zspage 1691803648 627726635 639655936 0 639655936 60 0 33435 1691811840 627733348 639643648 0 639643648 61 0 33434 1691795456 627726290 639614976 0 639614976 60 0 33435 1691803648 627730458 639688704 0 639688704 60 0 33434 1691811840 627727771 639688704 0 639688704 60 0 33434 Order 3 and order 4 show statistically significant improvement in `mem_used_max` metrics. T-test for order 3: x order-2-maxmem + order-3-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08 42210.666 Difference at 95.0% confidence -1.34038e+07 +/- 44080.7 -2.04581% +/- 0.00672802% (Student's t, pooled s = 30224.5) T-test for order 4: x order-2-maxmem + order-4-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.3961498e+08 6.396887e+08 6.3965594e+08 6.3965839e+08 31408.602 Difference at 95.0% confidence -1.55222e+07 +/- 33126.2 -2.36915% +/- 0.00505604% (Student's t, pooled s = 22713.4) This test tends to benefit more from order 4 zspages, due to test's data patterns. zsmalloc object distribution analysis ============================================================================= Order 2 (4 pages per zspage) tends to put many objects in size class 2048, which is merged with size classes torvalds#112-torvalds#125: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 0 6146 6146 1756 2 0 74 1216 0 1 4560 4552 1368 3 0 76 1248 0 1 2938 2934 904 4 0 83 1360 0 0 10971 10971 3657 1 0 91 1488 0 0 16126 16126 5864 4 0 94 1536 0 1 5912 5908 2217 3 0 100 1632 0 0 11990 11990 4796 2 0 107 1744 0 1 15771 15768 6759 3 0 111 1808 0 1 10386 10380 4616 4 0 126 2048 0 0 45444 45444 22722 1 0 144 2336 0 0 47446 47446 27112 4 0 151 2448 1 0 10760 10759 6456 3 0 168 2720 0 0 10173 10173 6782 2 0 190 3072 0 1 1700 1697 1275 3 0 202 3264 0 1 290 286 232 4 0 254 4096 0 0 34051 34051 34051 1 0 Order 3 (8 pages per zspage) changed pool characteristics and unmerged some of the size classes, which resulted in less objects being put into size class 2048, because there are lower size classes are now available for more compact object storage: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 1 2996 2994 856 2 0 72 1184 0 1 1632 1609 476 7 0 73 1200 1 0 1445 1442 425 5 0 74 1216 0 0 1510 1510 453 3 0 75 1232 0 1 1495 1479 455 7 0 76 1248 0 1 1456 1451 448 4 0 78 1280 0 1 3040 3033 950 5 0 79 1296 0 1 1584 1571 504 7 0 83 1360 0 0 6375 6375 2125 1 0 84 1376 0 1 1817 1796 632 8 0 87 1424 0 1 6020 6006 2107 7 0 88 1440 0 1 2108 2101 744 6 0 89 1456 0 1 2072 2064 740 5 0 91 1488 0 1 4169 4159 1516 4 0 92 1504 0 1 2014 2007 742 7 0 94 1536 0 1 3904 3900 1464 3 0 95 1552 0 1 1890 1873 720 8 0 96 1568 0 1 1963 1958 755 5 0 97 1584 0 1 1980 1974 770 7 0 100 1632 0 1 6190 6187 2476 2 0 103 1680 0 0 6477 6477 2667 7 0 104 1696 0 1 2256 2253 940 5 0 105 1712 0 1 2356 2340 992 8 0 107 1744 1 0 4697 4696 2013 3 0 110 1792 0 1 7744 7734 3388 7 0 111 1808 0 1 2655 2649 1180 4 0 114 1856 0 1 8371 8365 3805 5 0 116 1888 1 0 5863 5862 2706 6 0 117 1904 0 1 2955 2942 1379 7 0 118 1920 0 1 3009 2997 1416 8 0 126 2048 0 0 25276 25276 12638 1 0 128 2080 0 1 6060 6052 3232 8 0 129 2096 1 0 3081 3080 1659 7 0 134 2176 0 1 14835 14830 7912 8 0 135 2192 0 1 2769 2758 1491 7 0 137 2224 0 1 5082 5077 2772 6 0 140 2272 0 1 7236 7232 4020 5 0 144 2336 0 1 8428 8423 4816 4 0 147 2384 0 1 5316 5313 3101 7 0 151 2448 0 1 5445 5443 3267 3 0 155 2512 0 0 4121 4121 2536 8 0 158 2560 0 1 2208 2205 1380 5 0 160 2592 0 0 1133 1133 721 7 0 168 2720 0 0 2712 2712 1808 2 0 177 2864 1 0 1100 1098 770 7 0 180 2912 0 1 189 183 135 5 0 184 2976 0 1 176 166 128 8 0 190 3072 0 0 252 252 189 3 0 197 3184 0 1 198 192 154 7 0 202 3264 0 1 100 96 80 4 0 211 3408 0 1 210 208 175 5 0 217 3504 0 1 98 94 84 6 0 222 3584 0 0 104 104 91 7 0 225 3632 0 1 54 50 48 8 0 254 4096 0 0 33591 33591 33591 1 0 Note, the huge size watermark is above 3632 and there are a number of new normal classes available that previously were merged with the huge class. For instance, size class torvalds#211 holds 210 objects of size 3408 and uses 175 physical pages, while previously for those objects we would have used 210 physical pages. Link: https://lkml.kernel.org/r/20221031054108.541190-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Alexey Romanov <avromanov@sberdevices.ru> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp
pushed a commit
to intel-lab-lkp/linux
that referenced
this pull request
Nov 2, 2022
zsmalloc has 255 size classes. Size classes contain a number of zspages, which store objects of the same size. zspage can consist of up to four physical pages. The exact (most optimal) zspage size is calculated for each size class during zsmalloc pool creation. As a reasonable optimization, zsmalloc merges size classes that have similar characteristics: number of pages per zspage and number of objects zspage can store. For example, let's look at the following size classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable .. 94 1536 0 0 0 0 0 3 0 100 1632 0 0 0 0 0 2 0 .. Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time we store an object of size, say, 1568 bytes instead of using class torvalds#96 we end up storing it in size class torvalds#100. Class torvalds#100 is for objects of 1632 bytes in size, hence every 1568 bytes object wastes 1632-1568 bytes. Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects. When we need to store, say, 13 objects of size 1568 we end up allocating three zspages; in other words, 6 physical pages. However, if we'll look closer at size class torvalds#96 (which should hold objects of size 1568 bytes) and trace get_pages_per_zspage(): pages per zspage wasted bytes used% 1 960 76 2 352 95 3 1312 89 4 704 95 5 96 99 We'd notice that the most optimal zspage configuration for this class is when it consists of 5 physical pages, but currently we never let zspages to consists of more than 4 pages. A 5 page class torvalds#96 configuration would store 13 objects of size 1568 in a single zspage, allocating 5 physical pages, as opposed to 6 physical pages that class torvalds#100 will allocate. A higher order zspage for class torvalds#96 also changes its key characteristics: pages per-zspage and objects per-zspage. As a result classes torvalds#96 and torvalds#100 are not merged anymore, which gives us more compact zsmalloc. Of course the described effect does not apply only to size classes torvalds#96 and We still merge classes, but less often so. In other words classes are grouped in a more compact way, which decreases memory wastage: zspage order # unique size classes 2 69 3 123 4 191 Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 254 4096 0 0 0 0 0 1 0 ... For exactly same reason - maximum 4 pages per zspage - the last non-huge size class is torvalds#202, which stores objects of size 3264 bytes. Any object larger than 3264 bytes, hence, is considered to be huge and lands in size class torvalds#254, which uses a whole physical page to store every object. To put it slightly differently - objects in huge classes don't share physical pages. 3264 bytes is too low of a watermark and we have too many huge classes: classes from torvalds#203 to torvalds#254. Similarly to class size torvalds#96 above, higher order zspages change key characteristics for some of those huge size classes and thus those classes become normal classes, where stored objects share physical pages. Hence yet another consequence of higher order zspages: we move the huge size class watermark with higher order zspages, have less huge classes and store large objects in a more compact way. For order 3, huge class watermark becomes 3632 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 211 3408 0 0 0 0 0 5 0 217 3504 0 0 0 0 0 6 0 222 3584 0 0 0 0 0 7 0 225 3632 0 0 0 0 0 8 0 254 4096 0 0 0 0 0 1 0 ... For order 4, huge class watermark becomes 3840 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 206 3328 0 0 0 0 0 13 0 207 3344 0 0 0 0 0 9 0 208 3360 0 0 0 0 0 14 0 211 3408 0 0 0 0 0 5 0 212 3424 0 0 0 0 0 16 0 214 3456 0 0 0 0 0 11 0 217 3504 0 0 0 0 0 6 0 219 3536 0 0 0 0 0 13 0 222 3584 0 0 0 0 0 7 0 223 3600 0 0 0 0 0 15 0 225 3632 0 0 0 0 0 8 0 228 3680 0 0 0 0 0 9 0 230 3712 0 0 0 0 0 10 0 232 3744 0 0 0 0 0 11 0 234 3776 0 0 0 0 0 12 0 235 3792 0 0 0 0 0 13 0 236 3808 0 0 0 0 0 14 0 238 3840 0 0 0 0 0 15 0 254 4096 0 0 0 0 0 1 0 ... TESTS ===== Test untars linux-6.0.tar.xz and compiles the kernel. zram is configured as a block device with ext4 file system, lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after every test and rebooted the VM. orig_data_size mem_used_total mem_used_max pages_compacted compr_data_size mem_limit same_pages huge_pages ORDER 2 (BASE) zspage 1691791360 628086729 655171584 0 655171584 60 0 34043 1691787264 628089196 655175680 0 655175680 60 0 34046 1691803648 628098840 655187968 0 655187968 59 0 34047 1691795456 628091503 655183872 0 655183872 60 0 34044 1691799552 628086877 655183872 0 655183872 60 0 34047 ORDER 3 zspage 1691803648 627792993 641794048 0 641794048 60 0 33591 1691787264 627779342 641708032 0 641708032 59 0 33591 1691811840 627786616 641769472 0 641769472 60 0 33591 1691803648 627794468 641818624 0 641818624 59 0 33592 1691783168 627780882 641794048 0 641794048 61 0 33591 ORDER 4 zspage 1691803648 627726635 639655936 0 639655936 60 0 33435 1691811840 627733348 639643648 0 639643648 61 0 33434 1691795456 627726290 639614976 0 639614976 60 0 33435 1691803648 627730458 639688704 0 639688704 60 0 33434 1691811840 627727771 639688704 0 639688704 60 0 33434 Order 3 and order 4 show statistically significant improvement in `mem_used_max` metrics. T-test for order 3: x order-2-maxmem + order-3-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08 42210.666 Difference at 95.0% confidence -1.34038e+07 +/- 44080.7 -2.04581% +/- 0.00672802% (Student's t, pooled s = 30224.5) T-test for order 4: x order-2-maxmem + order-4-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.3961498e+08 6.396887e+08 6.3965594e+08 6.3965839e+08 31408.602 Difference at 95.0% confidence -1.55222e+07 +/- 33126.2 -2.36915% +/- 0.00505604% (Student's t, pooled s = 22713.4) This test tends to benefit more from order 4 zspages, due to test's data patterns. zsmalloc object distribution analysis ============================================================================= Order 2 (4 pages per zspage) tends to put many objects in size class 2048, which is merged with size classes torvalds#112-torvalds#125: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 0 6146 6146 1756 2 0 74 1216 0 1 4560 4552 1368 3 0 76 1248 0 1 2938 2934 904 4 0 83 1360 0 0 10971 10971 3657 1 0 91 1488 0 0 16126 16126 5864 4 0 94 1536 0 1 5912 5908 2217 3 0 100 1632 0 0 11990 11990 4796 2 0 107 1744 0 1 15771 15768 6759 3 0 111 1808 0 1 10386 10380 4616 4 0 126 2048 0 0 45444 45444 22722 1 0 144 2336 0 0 47446 47446 27112 4 0 151 2448 1 0 10760 10759 6456 3 0 168 2720 0 0 10173 10173 6782 2 0 190 3072 0 1 1700 1697 1275 3 0 202 3264 0 1 290 286 232 4 0 254 4096 0 0 34051 34051 34051 1 0 Order 3 (8 pages per zspage) changed pool characteristics and unmerged some of the size classes, which resulted in less objects being put into size class 2048, because there are lower size classes are now available for more compact object storage: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 1 2996 2994 856 2 0 72 1184 0 1 1632 1609 476 7 0 73 1200 1 0 1445 1442 425 5 0 74 1216 0 0 1510 1510 453 3 0 75 1232 0 1 1495 1479 455 7 0 76 1248 0 1 1456 1451 448 4 0 78 1280 0 1 3040 3033 950 5 0 79 1296 0 1 1584 1571 504 7 0 83 1360 0 0 6375 6375 2125 1 0 84 1376 0 1 1817 1796 632 8 0 87 1424 0 1 6020 6006 2107 7 0 88 1440 0 1 2108 2101 744 6 0 89 1456 0 1 2072 2064 740 5 0 91 1488 0 1 4169 4159 1516 4 0 92 1504 0 1 2014 2007 742 7 0 94 1536 0 1 3904 3900 1464 3 0 95 1552 0 1 1890 1873 720 8 0 96 1568 0 1 1963 1958 755 5 0 97 1584 0 1 1980 1974 770 7 0 100 1632 0 1 6190 6187 2476 2 0 103 1680 0 0 6477 6477 2667 7 0 104 1696 0 1 2256 2253 940 5 0 105 1712 0 1 2356 2340 992 8 0 107 1744 1 0 4697 4696 2013 3 0 110 1792 0 1 7744 7734 3388 7 0 111 1808 0 1 2655 2649 1180 4 0 114 1856 0 1 8371 8365 3805 5 0 116 1888 1 0 5863 5862 2706 6 0 117 1904 0 1 2955 2942 1379 7 0 118 1920 0 1 3009 2997 1416 8 0 126 2048 0 0 25276 25276 12638 1 0 128 2080 0 1 6060 6052 3232 8 0 129 2096 1 0 3081 3080 1659 7 0 134 2176 0 1 14835 14830 7912 8 0 135 2192 0 1 2769 2758 1491 7 0 137 2224 0 1 5082 5077 2772 6 0 140 2272 0 1 7236 7232 4020 5 0 144 2336 0 1 8428 8423 4816 4 0 147 2384 0 1 5316 5313 3101 7 0 151 2448 0 1 5445 5443 3267 3 0 155 2512 0 0 4121 4121 2536 8 0 158 2560 0 1 2208 2205 1380 5 0 160 2592 0 0 1133 1133 721 7 0 168 2720 0 0 2712 2712 1808 2 0 177 2864 1 0 1100 1098 770 7 0 180 2912 0 1 189 183 135 5 0 184 2976 0 1 176 166 128 8 0 190 3072 0 0 252 252 189 3 0 197 3184 0 1 198 192 154 7 0 202 3264 0 1 100 96 80 4 0 211 3408 0 1 210 208 175 5 0 217 3504 0 1 98 94 84 6 0 222 3584 0 0 104 104 91 7 0 225 3632 0 1 54 50 48 8 0 254 4096 0 0 33591 33591 33591 1 0 Note, the huge size watermark is above 3632 and there are a number of new normal classes available that previously were merged with the huge class. For instance, size class torvalds#211 holds 210 objects of size 3408 and uses 175 physical pages, while previously for those objects we would have used 210 physical pages. Link: https://lkml.kernel.org/r/20221031054108.541190-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Alexey Romanov <avromanov@sberdevices.ru> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp
pushed a commit
to intel-lab-lkp/linux
that referenced
this pull request
Nov 3, 2022
zsmalloc has 255 size classes. Size classes contain a number of zspages, which store objects of the same size. zspage can consist of up to four physical pages. The exact (most optimal) zspage size is calculated for each size class during zsmalloc pool creation. As a reasonable optimization, zsmalloc merges size classes that have similar characteristics: number of pages per zspage and number of objects zspage can store. For example, let's look at the following size classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable .. 94 1536 0 0 0 0 0 3 0 100 1632 0 0 0 0 0 2 0 .. Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time we store an object of size, say, 1568 bytes instead of using class torvalds#96 we end up storing it in size class torvalds#100. Class torvalds#100 is for objects of 1632 bytes in size, hence every 1568 bytes object wastes 1632-1568 bytes. Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects. When we need to store, say, 13 objects of size 1568 we end up allocating three zspages; in other words, 6 physical pages. However, if we'll look closer at size class torvalds#96 (which should hold objects of size 1568 bytes) and trace get_pages_per_zspage(): pages per zspage wasted bytes used% 1 960 76 2 352 95 3 1312 89 4 704 95 5 96 99 We'd notice that the most optimal zspage configuration for this class is when it consists of 5 physical pages, but currently we never let zspages to consists of more than 4 pages. A 5 page class torvalds#96 configuration would store 13 objects of size 1568 in a single zspage, allocating 5 physical pages, as opposed to 6 physical pages that class torvalds#100 will allocate. A higher order zspage for class torvalds#96 also changes its key characteristics: pages per-zspage and objects per-zspage. As a result classes torvalds#96 and torvalds#100 are not merged anymore, which gives us more compact zsmalloc. Of course the described effect does not apply only to size classes torvalds#96 and We still merge classes, but less often so. In other words classes are grouped in a more compact way, which decreases memory wastage: zspage order # unique size classes 2 69 3 123 4 191 Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 254 4096 0 0 0 0 0 1 0 ... For exactly same reason - maximum 4 pages per zspage - the last non-huge size class is torvalds#202, which stores objects of size 3264 bytes. Any object larger than 3264 bytes, hence, is considered to be huge and lands in size class torvalds#254, which uses a whole physical page to store every object. To put it slightly differently - objects in huge classes don't share physical pages. 3264 bytes is too low of a watermark and we have too many huge classes: classes from torvalds#203 to torvalds#254. Similarly to class size torvalds#96 above, higher order zspages change key characteristics for some of those huge size classes and thus those classes become normal classes, where stored objects share physical pages. Hence yet another consequence of higher order zspages: we move the huge size class watermark with higher order zspages, have less huge classes and store large objects in a more compact way. For order 3, huge class watermark becomes 3632 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 211 3408 0 0 0 0 0 5 0 217 3504 0 0 0 0 0 6 0 222 3584 0 0 0 0 0 7 0 225 3632 0 0 0 0 0 8 0 254 4096 0 0 0 0 0 1 0 ... For order 4, huge class watermark becomes 3840 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 206 3328 0 0 0 0 0 13 0 207 3344 0 0 0 0 0 9 0 208 3360 0 0 0 0 0 14 0 211 3408 0 0 0 0 0 5 0 212 3424 0 0 0 0 0 16 0 214 3456 0 0 0 0 0 11 0 217 3504 0 0 0 0 0 6 0 219 3536 0 0 0 0 0 13 0 222 3584 0 0 0 0 0 7 0 223 3600 0 0 0 0 0 15 0 225 3632 0 0 0 0 0 8 0 228 3680 0 0 0 0 0 9 0 230 3712 0 0 0 0 0 10 0 232 3744 0 0 0 0 0 11 0 234 3776 0 0 0 0 0 12 0 235 3792 0 0 0 0 0 13 0 236 3808 0 0 0 0 0 14 0 238 3840 0 0 0 0 0 15 0 254 4096 0 0 0 0 0 1 0 ... TESTS ===== Test untars linux-6.0.tar.xz and compiles the kernel. zram is configured as a block device with ext4 file system, lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after every test and rebooted the VM. orig_data_size mem_used_total mem_used_max pages_compacted compr_data_size mem_limit same_pages huge_pages ORDER 2 (BASE) zspage 1691791360 628086729 655171584 0 655171584 60 0 34043 1691787264 628089196 655175680 0 655175680 60 0 34046 1691803648 628098840 655187968 0 655187968 59 0 34047 1691795456 628091503 655183872 0 655183872 60 0 34044 1691799552 628086877 655183872 0 655183872 60 0 34047 ORDER 3 zspage 1691803648 627792993 641794048 0 641794048 60 0 33591 1691787264 627779342 641708032 0 641708032 59 0 33591 1691811840 627786616 641769472 0 641769472 60 0 33591 1691803648 627794468 641818624 0 641818624 59 0 33592 1691783168 627780882 641794048 0 641794048 61 0 33591 ORDER 4 zspage 1691803648 627726635 639655936 0 639655936 60 0 33435 1691811840 627733348 639643648 0 639643648 61 0 33434 1691795456 627726290 639614976 0 639614976 60 0 33435 1691803648 627730458 639688704 0 639688704 60 0 33434 1691811840 627727771 639688704 0 639688704 60 0 33434 Order 3 and order 4 show statistically significant improvement in `mem_used_max` metrics. T-test for order 3: x order-2-maxmem + order-3-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08 42210.666 Difference at 95.0% confidence -1.34038e+07 +/- 44080.7 -2.04581% +/- 0.00672802% (Student's t, pooled s = 30224.5) T-test for order 4: x order-2-maxmem + order-4-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.3961498e+08 6.396887e+08 6.3965594e+08 6.3965839e+08 31408.602 Difference at 95.0% confidence -1.55222e+07 +/- 33126.2 -2.36915% +/- 0.00505604% (Student's t, pooled s = 22713.4) This test tends to benefit more from order 4 zspages, due to test's data patterns. zsmalloc object distribution analysis ============================================================================= Order 2 (4 pages per zspage) tends to put many objects in size class 2048, which is merged with size classes torvalds#112-torvalds#125: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 0 6146 6146 1756 2 0 74 1216 0 1 4560 4552 1368 3 0 76 1248 0 1 2938 2934 904 4 0 83 1360 0 0 10971 10971 3657 1 0 91 1488 0 0 16126 16126 5864 4 0 94 1536 0 1 5912 5908 2217 3 0 100 1632 0 0 11990 11990 4796 2 0 107 1744 0 1 15771 15768 6759 3 0 111 1808 0 1 10386 10380 4616 4 0 126 2048 0 0 45444 45444 22722 1 0 144 2336 0 0 47446 47446 27112 4 0 151 2448 1 0 10760 10759 6456 3 0 168 2720 0 0 10173 10173 6782 2 0 190 3072 0 1 1700 1697 1275 3 0 202 3264 0 1 290 286 232 4 0 254 4096 0 0 34051 34051 34051 1 0 Order 3 (8 pages per zspage) changed pool characteristics and unmerged some of the size classes, which resulted in less objects being put into size class 2048, because there are lower size classes are now available for more compact object storage: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 1 2996 2994 856 2 0 72 1184 0 1 1632 1609 476 7 0 73 1200 1 0 1445 1442 425 5 0 74 1216 0 0 1510 1510 453 3 0 75 1232 0 1 1495 1479 455 7 0 76 1248 0 1 1456 1451 448 4 0 78 1280 0 1 3040 3033 950 5 0 79 1296 0 1 1584 1571 504 7 0 83 1360 0 0 6375 6375 2125 1 0 84 1376 0 1 1817 1796 632 8 0 87 1424 0 1 6020 6006 2107 7 0 88 1440 0 1 2108 2101 744 6 0 89 1456 0 1 2072 2064 740 5 0 91 1488 0 1 4169 4159 1516 4 0 92 1504 0 1 2014 2007 742 7 0 94 1536 0 1 3904 3900 1464 3 0 95 1552 0 1 1890 1873 720 8 0 96 1568 0 1 1963 1958 755 5 0 97 1584 0 1 1980 1974 770 7 0 100 1632 0 1 6190 6187 2476 2 0 103 1680 0 0 6477 6477 2667 7 0 104 1696 0 1 2256 2253 940 5 0 105 1712 0 1 2356 2340 992 8 0 107 1744 1 0 4697 4696 2013 3 0 110 1792 0 1 7744 7734 3388 7 0 111 1808 0 1 2655 2649 1180 4 0 114 1856 0 1 8371 8365 3805 5 0 116 1888 1 0 5863 5862 2706 6 0 117 1904 0 1 2955 2942 1379 7 0 118 1920 0 1 3009 2997 1416 8 0 126 2048 0 0 25276 25276 12638 1 0 128 2080 0 1 6060 6052 3232 8 0 129 2096 1 0 3081 3080 1659 7 0 134 2176 0 1 14835 14830 7912 8 0 135 2192 0 1 2769 2758 1491 7 0 137 2224 0 1 5082 5077 2772 6 0 140 2272 0 1 7236 7232 4020 5 0 144 2336 0 1 8428 8423 4816 4 0 147 2384 0 1 5316 5313 3101 7 0 151 2448 0 1 5445 5443 3267 3 0 155 2512 0 0 4121 4121 2536 8 0 158 2560 0 1 2208 2205 1380 5 0 160 2592 0 0 1133 1133 721 7 0 168 2720 0 0 2712 2712 1808 2 0 177 2864 1 0 1100 1098 770 7 0 180 2912 0 1 189 183 135 5 0 184 2976 0 1 176 166 128 8 0 190 3072 0 0 252 252 189 3 0 197 3184 0 1 198 192 154 7 0 202 3264 0 1 100 96 80 4 0 211 3408 0 1 210 208 175 5 0 217 3504 0 1 98 94 84 6 0 222 3584 0 0 104 104 91 7 0 225 3632 0 1 54 50 48 8 0 254 4096 0 0 33591 33591 33591 1 0 Note, the huge size watermark is above 3632 and there are a number of new normal classes available that previously were merged with the huge class. For instance, size class torvalds#211 holds 210 objects of size 3408 and uses 175 physical pages, while previously for those objects we would have used 210 physical pages. Link: https://lkml.kernel.org/r/20221031054108.541190-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Alexey Romanov <avromanov@sberdevices.ru> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp
pushed a commit
to intel-lab-lkp/linux
that referenced
this pull request
Nov 5, 2022
zsmalloc has 255 size classes. Size classes contain a number of zspages, which store objects of the same size. zspage can consist of up to four physical pages. The exact (most optimal) zspage size is calculated for each size class during zsmalloc pool creation. As a reasonable optimization, zsmalloc merges size classes that have similar characteristics: number of pages per zspage and number of objects zspage can store. For example, let's look at the following size classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable .. 94 1536 0 0 0 0 0 3 0 100 1632 0 0 0 0 0 2 0 .. Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time we store an object of size, say, 1568 bytes instead of using class torvalds#96 we end up storing it in size class torvalds#100. Class torvalds#100 is for objects of 1632 bytes in size, hence every 1568 bytes object wastes 1632-1568 bytes. Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects. When we need to store, say, 13 objects of size 1568 we end up allocating three zspages; in other words, 6 physical pages. However, if we'll look closer at size class torvalds#96 (which should hold objects of size 1568 bytes) and trace get_pages_per_zspage(): pages per zspage wasted bytes used% 1 960 76 2 352 95 3 1312 89 4 704 95 5 96 99 We'd notice that the most optimal zspage configuration for this class is when it consists of 5 physical pages, but currently we never let zspages to consists of more than 4 pages. A 5 page class torvalds#96 configuration would store 13 objects of size 1568 in a single zspage, allocating 5 physical pages, as opposed to 6 physical pages that class torvalds#100 will allocate. A higher order zspage for class torvalds#96 also changes its key characteristics: pages per-zspage and objects per-zspage. As a result classes torvalds#96 and torvalds#100 are not merged anymore, which gives us more compact zsmalloc. Of course the described effect does not apply only to size classes torvalds#96 and We still merge classes, but less often so. In other words classes are grouped in a more compact way, which decreases memory wastage: zspage order # unique size classes 2 69 3 123 4 191 Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 254 4096 0 0 0 0 0 1 0 ... For exactly same reason - maximum 4 pages per zspage - the last non-huge size class is torvalds#202, which stores objects of size 3264 bytes. Any object larger than 3264 bytes, hence, is considered to be huge and lands in size class torvalds#254, which uses a whole physical page to store every object. To put it slightly differently - objects in huge classes don't share physical pages. 3264 bytes is too low of a watermark and we have too many huge classes: classes from torvalds#203 to torvalds#254. Similarly to class size torvalds#96 above, higher order zspages change key characteristics for some of those huge size classes and thus those classes become normal classes, where stored objects share physical pages. Hence yet another consequence of higher order zspages: we move the huge size class watermark with higher order zspages, have less huge classes and store large objects in a more compact way. For order 3, huge class watermark becomes 3632 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 211 3408 0 0 0 0 0 5 0 217 3504 0 0 0 0 0 6 0 222 3584 0 0 0 0 0 7 0 225 3632 0 0 0 0 0 8 0 254 4096 0 0 0 0 0 1 0 ... For order 4, huge class watermark becomes 3840 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 206 3328 0 0 0 0 0 13 0 207 3344 0 0 0 0 0 9 0 208 3360 0 0 0 0 0 14 0 211 3408 0 0 0 0 0 5 0 212 3424 0 0 0 0 0 16 0 214 3456 0 0 0 0 0 11 0 217 3504 0 0 0 0 0 6 0 219 3536 0 0 0 0 0 13 0 222 3584 0 0 0 0 0 7 0 223 3600 0 0 0 0 0 15 0 225 3632 0 0 0 0 0 8 0 228 3680 0 0 0 0 0 9 0 230 3712 0 0 0 0 0 10 0 232 3744 0 0 0 0 0 11 0 234 3776 0 0 0 0 0 12 0 235 3792 0 0 0 0 0 13 0 236 3808 0 0 0 0 0 14 0 238 3840 0 0 0 0 0 15 0 254 4096 0 0 0 0 0 1 0 ... TESTS ===== Test untars linux-6.0.tar.xz and compiles the kernel. zram is configured as a block device with ext4 file system, lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after every test and rebooted the VM. orig_data_size mem_used_total mem_used_max pages_compacted compr_data_size mem_limit same_pages huge_pages ORDER 2 (BASE) zspage 1691791360 628086729 655171584 0 655171584 60 0 34043 1691787264 628089196 655175680 0 655175680 60 0 34046 1691803648 628098840 655187968 0 655187968 59 0 34047 1691795456 628091503 655183872 0 655183872 60 0 34044 1691799552 628086877 655183872 0 655183872 60 0 34047 ORDER 3 zspage 1691803648 627792993 641794048 0 641794048 60 0 33591 1691787264 627779342 641708032 0 641708032 59 0 33591 1691811840 627786616 641769472 0 641769472 60 0 33591 1691803648 627794468 641818624 0 641818624 59 0 33592 1691783168 627780882 641794048 0 641794048 61 0 33591 ORDER 4 zspage 1691803648 627726635 639655936 0 639655936 60 0 33435 1691811840 627733348 639643648 0 639643648 61 0 33434 1691795456 627726290 639614976 0 639614976 60 0 33435 1691803648 627730458 639688704 0 639688704 60 0 33434 1691811840 627727771 639688704 0 639688704 60 0 33434 Order 3 and order 4 show statistically significant improvement in `mem_used_max` metrics. T-test for order 3: x order-2-maxmem + order-3-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08 42210.666 Difference at 95.0% confidence -1.34038e+07 +/- 44080.7 -2.04581% +/- 0.00672802% (Student's t, pooled s = 30224.5) T-test for order 4: x order-2-maxmem + order-4-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.3961498e+08 6.396887e+08 6.3965594e+08 6.3965839e+08 31408.602 Difference at 95.0% confidence -1.55222e+07 +/- 33126.2 -2.36915% +/- 0.00505604% (Student's t, pooled s = 22713.4) This test tends to benefit more from order 4 zspages, due to test's data patterns. zsmalloc object distribution analysis ============================================================================= Order 2 (4 pages per zspage) tends to put many objects in size class 2048, which is merged with size classes torvalds#112-torvalds#125: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 0 6146 6146 1756 2 0 74 1216 0 1 4560 4552 1368 3 0 76 1248 0 1 2938 2934 904 4 0 83 1360 0 0 10971 10971 3657 1 0 91 1488 0 0 16126 16126 5864 4 0 94 1536 0 1 5912 5908 2217 3 0 100 1632 0 0 11990 11990 4796 2 0 107 1744 0 1 15771 15768 6759 3 0 111 1808 0 1 10386 10380 4616 4 0 126 2048 0 0 45444 45444 22722 1 0 144 2336 0 0 47446 47446 27112 4 0 151 2448 1 0 10760 10759 6456 3 0 168 2720 0 0 10173 10173 6782 2 0 190 3072 0 1 1700 1697 1275 3 0 202 3264 0 1 290 286 232 4 0 254 4096 0 0 34051 34051 34051 1 0 Order 3 (8 pages per zspage) changed pool characteristics and unmerged some of the size classes, which resulted in less objects being put into size class 2048, because there are lower size classes are now available for more compact object storage: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 1 2996 2994 856 2 0 72 1184 0 1 1632 1609 476 7 0 73 1200 1 0 1445 1442 425 5 0 74 1216 0 0 1510 1510 453 3 0 75 1232 0 1 1495 1479 455 7 0 76 1248 0 1 1456 1451 448 4 0 78 1280 0 1 3040 3033 950 5 0 79 1296 0 1 1584 1571 504 7 0 83 1360 0 0 6375 6375 2125 1 0 84 1376 0 1 1817 1796 632 8 0 87 1424 0 1 6020 6006 2107 7 0 88 1440 0 1 2108 2101 744 6 0 89 1456 0 1 2072 2064 740 5 0 91 1488 0 1 4169 4159 1516 4 0 92 1504 0 1 2014 2007 742 7 0 94 1536 0 1 3904 3900 1464 3 0 95 1552 0 1 1890 1873 720 8 0 96 1568 0 1 1963 1958 755 5 0 97 1584 0 1 1980 1974 770 7 0 100 1632 0 1 6190 6187 2476 2 0 103 1680 0 0 6477 6477 2667 7 0 104 1696 0 1 2256 2253 940 5 0 105 1712 0 1 2356 2340 992 8 0 107 1744 1 0 4697 4696 2013 3 0 110 1792 0 1 7744 7734 3388 7 0 111 1808 0 1 2655 2649 1180 4 0 114 1856 0 1 8371 8365 3805 5 0 116 1888 1 0 5863 5862 2706 6 0 117 1904 0 1 2955 2942 1379 7 0 118 1920 0 1 3009 2997 1416 8 0 126 2048 0 0 25276 25276 12638 1 0 128 2080 0 1 6060 6052 3232 8 0 129 2096 1 0 3081 3080 1659 7 0 134 2176 0 1 14835 14830 7912 8 0 135 2192 0 1 2769 2758 1491 7 0 137 2224 0 1 5082 5077 2772 6 0 140 2272 0 1 7236 7232 4020 5 0 144 2336 0 1 8428 8423 4816 4 0 147 2384 0 1 5316 5313 3101 7 0 151 2448 0 1 5445 5443 3267 3 0 155 2512 0 0 4121 4121 2536 8 0 158 2560 0 1 2208 2205 1380 5 0 160 2592 0 0 1133 1133 721 7 0 168 2720 0 0 2712 2712 1808 2 0 177 2864 1 0 1100 1098 770 7 0 180 2912 0 1 189 183 135 5 0 184 2976 0 1 176 166 128 8 0 190 3072 0 0 252 252 189 3 0 197 3184 0 1 198 192 154 7 0 202 3264 0 1 100 96 80 4 0 211 3408 0 1 210 208 175 5 0 217 3504 0 1 98 94 84 6 0 222 3584 0 0 104 104 91 7 0 225 3632 0 1 54 50 48 8 0 254 4096 0 0 33591 33591 33591 1 0 Note, the huge size watermark is above 3632 and there are a number of new normal classes available that previously were merged with the huge class. For instance, size class torvalds#211 holds 210 objects of size 3408 and uses 175 physical pages, while previously for those objects we would have used 210 physical pages. Link: https://lkml.kernel.org/r/20221031054108.541190-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Alexey Romanov <avromanov@sberdevices.ru> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
jonhunter
pushed a commit
to jonhunter/linux
that referenced
this pull request
Nov 7, 2022
zsmalloc has 255 size classes. Size classes contain a number of zspages, which store objects of the same size. zspage can consist of up to four physical pages. The exact (most optimal) zspage size is calculated for each size class during zsmalloc pool creation. As a reasonable optimization, zsmalloc merges size classes that have similar characteristics: number of pages per zspage and number of objects zspage can store. For example, let's look at the following size classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable .. 94 1536 0 0 0 0 0 3 0 100 1632 0 0 0 0 0 2 0 .. Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time we store an object of size, say, 1568 bytes instead of using class torvalds#96 we end up storing it in size class torvalds#100. Class torvalds#100 is for objects of 1632 bytes in size, hence every 1568 bytes object wastes 1632-1568 bytes. Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects. When we need to store, say, 13 objects of size 1568 we end up allocating three zspages; in other words, 6 physical pages. However, if we'll look closer at size class torvalds#96 (which should hold objects of size 1568 bytes) and trace get_pages_per_zspage(): pages per zspage wasted bytes used% 1 960 76 2 352 95 3 1312 89 4 704 95 5 96 99 We'd notice that the most optimal zspage configuration for this class is when it consists of 5 physical pages, but currently we never let zspages to consists of more than 4 pages. A 5 page class torvalds#96 configuration would store 13 objects of size 1568 in a single zspage, allocating 5 physical pages, as opposed to 6 physical pages that class torvalds#100 will allocate. A higher order zspage for class torvalds#96 also changes its key characteristics: pages per-zspage and objects per-zspage. As a result classes torvalds#96 and torvalds#100 are not merged anymore, which gives us more compact zsmalloc. Of course the described effect does not apply only to size classes torvalds#96 and We still merge classes, but less often so. In other words classes are grouped in a more compact way, which decreases memory wastage: zspage order # unique size classes 2 69 3 123 4 191 Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 254 4096 0 0 0 0 0 1 0 ... For exactly same reason - maximum 4 pages per zspage - the last non-huge size class is torvalds#202, which stores objects of size 3264 bytes. Any object larger than 3264 bytes, hence, is considered to be huge and lands in size class torvalds#254, which uses a whole physical page to store every object. To put it slightly differently - objects in huge classes don't share physical pages. 3264 bytes is too low of a watermark and we have too many huge classes: classes from torvalds#203 to torvalds#254. Similarly to class size torvalds#96 above, higher order zspages change key characteristics for some of those huge size classes and thus those classes become normal classes, where stored objects share physical pages. Hence yet another consequence of higher order zspages: we move the huge size class watermark with higher order zspages, have less huge classes and store large objects in a more compact way. For order 3, huge class watermark becomes 3632 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 211 3408 0 0 0 0 0 5 0 217 3504 0 0 0 0 0 6 0 222 3584 0 0 0 0 0 7 0 225 3632 0 0 0 0 0 8 0 254 4096 0 0 0 0 0 1 0 ... For order 4, huge class watermark becomes 3840 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 206 3328 0 0 0 0 0 13 0 207 3344 0 0 0 0 0 9 0 208 3360 0 0 0 0 0 14 0 211 3408 0 0 0 0 0 5 0 212 3424 0 0 0 0 0 16 0 214 3456 0 0 0 0 0 11 0 217 3504 0 0 0 0 0 6 0 219 3536 0 0 0 0 0 13 0 222 3584 0 0 0 0 0 7 0 223 3600 0 0 0 0 0 15 0 225 3632 0 0 0 0 0 8 0 228 3680 0 0 0 0 0 9 0 230 3712 0 0 0 0 0 10 0 232 3744 0 0 0 0 0 11 0 234 3776 0 0 0 0 0 12 0 235 3792 0 0 0 0 0 13 0 236 3808 0 0 0 0 0 14 0 238 3840 0 0 0 0 0 15 0 254 4096 0 0 0 0 0 1 0 ... TESTS ===== Test untars linux-6.0.tar.xz and compiles the kernel. zram is configured as a block device with ext4 file system, lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after every test and rebooted the VM. orig_data_size mem_used_total mem_used_max pages_compacted compr_data_size mem_limit same_pages huge_pages ORDER 2 (BASE) zspage 1691791360 628086729 655171584 0 655171584 60 0 34043 1691787264 628089196 655175680 0 655175680 60 0 34046 1691803648 628098840 655187968 0 655187968 59 0 34047 1691795456 628091503 655183872 0 655183872 60 0 34044 1691799552 628086877 655183872 0 655183872 60 0 34047 ORDER 3 zspage 1691803648 627792993 641794048 0 641794048 60 0 33591 1691787264 627779342 641708032 0 641708032 59 0 33591 1691811840 627786616 641769472 0 641769472 60 0 33591 1691803648 627794468 641818624 0 641818624 59 0 33592 1691783168 627780882 641794048 0 641794048 61 0 33591 ORDER 4 zspage 1691803648 627726635 639655936 0 639655936 60 0 33435 1691811840 627733348 639643648 0 639643648 61 0 33434 1691795456 627726290 639614976 0 639614976 60 0 33435 1691803648 627730458 639688704 0 639688704 60 0 33434 1691811840 627727771 639688704 0 639688704 60 0 33434 Order 3 and order 4 show statistically significant improvement in `mem_used_max` metrics. T-test for order 3: x order-2-maxmem + order-3-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08 42210.666 Difference at 95.0% confidence -1.34038e+07 +/- 44080.7 -2.04581% +/- 0.00672802% (Student's t, pooled s = 30224.5) T-test for order 4: x order-2-maxmem + order-4-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.3961498e+08 6.396887e+08 6.3965594e+08 6.3965839e+08 31408.602 Difference at 95.0% confidence -1.55222e+07 +/- 33126.2 -2.36915% +/- 0.00505604% (Student's t, pooled s = 22713.4) This test tends to benefit more from order 4 zspages, due to test's data patterns. zsmalloc object distribution analysis ============================================================================= Order 2 (4 pages per zspage) tends to put many objects in size class 2048, which is merged with size classes torvalds#112-torvalds#125: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 0 6146 6146 1756 2 0 74 1216 0 1 4560 4552 1368 3 0 76 1248 0 1 2938 2934 904 4 0 83 1360 0 0 10971 10971 3657 1 0 91 1488 0 0 16126 16126 5864 4 0 94 1536 0 1 5912 5908 2217 3 0 100 1632 0 0 11990 11990 4796 2 0 107 1744 0 1 15771 15768 6759 3 0 111 1808 0 1 10386 10380 4616 4 0 126 2048 0 0 45444 45444 22722 1 0 144 2336 0 0 47446 47446 27112 4 0 151 2448 1 0 10760 10759 6456 3 0 168 2720 0 0 10173 10173 6782 2 0 190 3072 0 1 1700 1697 1275 3 0 202 3264 0 1 290 286 232 4 0 254 4096 0 0 34051 34051 34051 1 0 Order 3 (8 pages per zspage) changed pool characteristics and unmerged some of the size classes, which resulted in less objects being put into size class 2048, because there are lower size classes are now available for more compact object storage: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 1 2996 2994 856 2 0 72 1184 0 1 1632 1609 476 7 0 73 1200 1 0 1445 1442 425 5 0 74 1216 0 0 1510 1510 453 3 0 75 1232 0 1 1495 1479 455 7 0 76 1248 0 1 1456 1451 448 4 0 78 1280 0 1 3040 3033 950 5 0 79 1296 0 1 1584 1571 504 7 0 83 1360 0 0 6375 6375 2125 1 0 84 1376 0 1 1817 1796 632 8 0 87 1424 0 1 6020 6006 2107 7 0 88 1440 0 1 2108 2101 744 6 0 89 1456 0 1 2072 2064 740 5 0 91 1488 0 1 4169 4159 1516 4 0 92 1504 0 1 2014 2007 742 7 0 94 1536 0 1 3904 3900 1464 3 0 95 1552 0 1 1890 1873 720 8 0 96 1568 0 1 1963 1958 755 5 0 97 1584 0 1 1980 1974 770 7 0 100 1632 0 1 6190 6187 2476 2 0 103 1680 0 0 6477 6477 2667 7 0 104 1696 0 1 2256 2253 940 5 0 105 1712 0 1 2356 2340 992 8 0 107 1744 1 0 4697 4696 2013 3 0 110 1792 0 1 7744 7734 3388 7 0 111 1808 0 1 2655 2649 1180 4 0 114 1856 0 1 8371 8365 3805 5 0 116 1888 1 0 5863 5862 2706 6 0 117 1904 0 1 2955 2942 1379 7 0 118 1920 0 1 3009 2997 1416 8 0 126 2048 0 0 25276 25276 12638 1 0 128 2080 0 1 6060 6052 3232 8 0 129 2096 1 0 3081 3080 1659 7 0 134 2176 0 1 14835 14830 7912 8 0 135 2192 0 1 2769 2758 1491 7 0 137 2224 0 1 5082 5077 2772 6 0 140 2272 0 1 7236 7232 4020 5 0 144 2336 0 1 8428 8423 4816 4 0 147 2384 0 1 5316 5313 3101 7 0 151 2448 0 1 5445 5443 3267 3 0 155 2512 0 0 4121 4121 2536 8 0 158 2560 0 1 2208 2205 1380 5 0 160 2592 0 0 1133 1133 721 7 0 168 2720 0 0 2712 2712 1808 2 0 177 2864 1 0 1100 1098 770 7 0 180 2912 0 1 189 183 135 5 0 184 2976 0 1 176 166 128 8 0 190 3072 0 0 252 252 189 3 0 197 3184 0 1 198 192 154 7 0 202 3264 0 1 100 96 80 4 0 211 3408 0 1 210 208 175 5 0 217 3504 0 1 98 94 84 6 0 222 3584 0 0 104 104 91 7 0 225 3632 0 1 54 50 48 8 0 254 4096 0 0 33591 33591 33591 1 0 Note, the huge size watermark is above 3632 and there are a number of new normal classes available that previously were merged with the huge class. For instance, size class torvalds#211 holds 210 objects of size 3408 and uses 175 physical pages, while previously for those objects we would have used 210 physical pages. Link: https://lkml.kernel.org/r/20221031054108.541190-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Alexey Romanov <avromanov@sberdevices.ru> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
jonhunter
pushed a commit
to jonhunter/linux
that referenced
this pull request
Nov 8, 2022
zsmalloc has 255 size classes. Size classes contain a number of zspages, which store objects of the same size. zspage can consist of up to four physical pages. The exact (most optimal) zspage size is calculated for each size class during zsmalloc pool creation. As a reasonable optimization, zsmalloc merges size classes that have similar characteristics: number of pages per zspage and number of objects zspage can store. For example, let's look at the following size classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable .. 94 1536 0 0 0 0 0 3 0 100 1632 0 0 0 0 0 2 0 .. Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time we store an object of size, say, 1568 bytes instead of using class torvalds#96 we end up storing it in size class torvalds#100. Class torvalds#100 is for objects of 1632 bytes in size, hence every 1568 bytes object wastes 1632-1568 bytes. Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects. When we need to store, say, 13 objects of size 1568 we end up allocating three zspages; in other words, 6 physical pages. However, if we'll look closer at size class torvalds#96 (which should hold objects of size 1568 bytes) and trace get_pages_per_zspage(): pages per zspage wasted bytes used% 1 960 76 2 352 95 3 1312 89 4 704 95 5 96 99 We'd notice that the most optimal zspage configuration for this class is when it consists of 5 physical pages, but currently we never let zspages to consists of more than 4 pages. A 5 page class torvalds#96 configuration would store 13 objects of size 1568 in a single zspage, allocating 5 physical pages, as opposed to 6 physical pages that class torvalds#100 will allocate. A higher order zspage for class torvalds#96 also changes its key characteristics: pages per-zspage and objects per-zspage. As a result classes torvalds#96 and torvalds#100 are not merged anymore, which gives us more compact zsmalloc. Of course the described effect does not apply only to size classes torvalds#96 and We still merge classes, but less often so. In other words classes are grouped in a more compact way, which decreases memory wastage: zspage order # unique size classes 2 69 3 123 4 191 Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 254 4096 0 0 0 0 0 1 0 ... For exactly same reason - maximum 4 pages per zspage - the last non-huge size class is torvalds#202, which stores objects of size 3264 bytes. Any object larger than 3264 bytes, hence, is considered to be huge and lands in size class torvalds#254, which uses a whole physical page to store every object. To put it slightly differently - objects in huge classes don't share physical pages. 3264 bytes is too low of a watermark and we have too many huge classes: classes from torvalds#203 to torvalds#254. Similarly to class size torvalds#96 above, higher order zspages change key characteristics for some of those huge size classes and thus those classes become normal classes, where stored objects share physical pages. Hence yet another consequence of higher order zspages: we move the huge size class watermark with higher order zspages, have less huge classes and store large objects in a more compact way. For order 3, huge class watermark becomes 3632 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 211 3408 0 0 0 0 0 5 0 217 3504 0 0 0 0 0 6 0 222 3584 0 0 0 0 0 7 0 225 3632 0 0 0 0 0 8 0 254 4096 0 0 0 0 0 1 0 ... For order 4, huge class watermark becomes 3840 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 206 3328 0 0 0 0 0 13 0 207 3344 0 0 0 0 0 9 0 208 3360 0 0 0 0 0 14 0 211 3408 0 0 0 0 0 5 0 212 3424 0 0 0 0 0 16 0 214 3456 0 0 0 0 0 11 0 217 3504 0 0 0 0 0 6 0 219 3536 0 0 0 0 0 13 0 222 3584 0 0 0 0 0 7 0 223 3600 0 0 0 0 0 15 0 225 3632 0 0 0 0 0 8 0 228 3680 0 0 0 0 0 9 0 230 3712 0 0 0 0 0 10 0 232 3744 0 0 0 0 0 11 0 234 3776 0 0 0 0 0 12 0 235 3792 0 0 0 0 0 13 0 236 3808 0 0 0 0 0 14 0 238 3840 0 0 0 0 0 15 0 254 4096 0 0 0 0 0 1 0 ... TESTS ===== Test untars linux-6.0.tar.xz and compiles the kernel. zram is configured as a block device with ext4 file system, lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after every test and rebooted the VM. orig_data_size mem_used_total mem_used_max pages_compacted compr_data_size mem_limit same_pages huge_pages ORDER 2 (BASE) zspage 1691791360 628086729 655171584 0 655171584 60 0 34043 1691787264 628089196 655175680 0 655175680 60 0 34046 1691803648 628098840 655187968 0 655187968 59 0 34047 1691795456 628091503 655183872 0 655183872 60 0 34044 1691799552 628086877 655183872 0 655183872 60 0 34047 ORDER 3 zspage 1691803648 627792993 641794048 0 641794048 60 0 33591 1691787264 627779342 641708032 0 641708032 59 0 33591 1691811840 627786616 641769472 0 641769472 60 0 33591 1691803648 627794468 641818624 0 641818624 59 0 33592 1691783168 627780882 641794048 0 641794048 61 0 33591 ORDER 4 zspage 1691803648 627726635 639655936 0 639655936 60 0 33435 1691811840 627733348 639643648 0 639643648 61 0 33434 1691795456 627726290 639614976 0 639614976 60 0 33435 1691803648 627730458 639688704 0 639688704 60 0 33434 1691811840 627727771 639688704 0 639688704 60 0 33434 Order 3 and order 4 show statistically significant improvement in `mem_used_max` metrics. T-test for order 3: x order-2-maxmem + order-3-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08 42210.666 Difference at 95.0% confidence -1.34038e+07 +/- 44080.7 -2.04581% +/- 0.00672802% (Student's t, pooled s = 30224.5) T-test for order 4: x order-2-maxmem + order-4-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.3961498e+08 6.396887e+08 6.3965594e+08 6.3965839e+08 31408.602 Difference at 95.0% confidence -1.55222e+07 +/- 33126.2 -2.36915% +/- 0.00505604% (Student's t, pooled s = 22713.4) This test tends to benefit more from order 4 zspages, due to test's data patterns. zsmalloc object distribution analysis ============================================================================= Order 2 (4 pages per zspage) tends to put many objects in size class 2048, which is merged with size classes torvalds#112-torvalds#125: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 0 6146 6146 1756 2 0 74 1216 0 1 4560 4552 1368 3 0 76 1248 0 1 2938 2934 904 4 0 83 1360 0 0 10971 10971 3657 1 0 91 1488 0 0 16126 16126 5864 4 0 94 1536 0 1 5912 5908 2217 3 0 100 1632 0 0 11990 11990 4796 2 0 107 1744 0 1 15771 15768 6759 3 0 111 1808 0 1 10386 10380 4616 4 0 126 2048 0 0 45444 45444 22722 1 0 144 2336 0 0 47446 47446 27112 4 0 151 2448 1 0 10760 10759 6456 3 0 168 2720 0 0 10173 10173 6782 2 0 190 3072 0 1 1700 1697 1275 3 0 202 3264 0 1 290 286 232 4 0 254 4096 0 0 34051 34051 34051 1 0 Order 3 (8 pages per zspage) changed pool characteristics and unmerged some of the size classes, which resulted in less objects being put into size class 2048, because there are lower size classes are now available for more compact object storage: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 1 2996 2994 856 2 0 72 1184 0 1 1632 1609 476 7 0 73 1200 1 0 1445 1442 425 5 0 74 1216 0 0 1510 1510 453 3 0 75 1232 0 1 1495 1479 455 7 0 76 1248 0 1 1456 1451 448 4 0 78 1280 0 1 3040 3033 950 5 0 79 1296 0 1 1584 1571 504 7 0 83 1360 0 0 6375 6375 2125 1 0 84 1376 0 1 1817 1796 632 8 0 87 1424 0 1 6020 6006 2107 7 0 88 1440 0 1 2108 2101 744 6 0 89 1456 0 1 2072 2064 740 5 0 91 1488 0 1 4169 4159 1516 4 0 92 1504 0 1 2014 2007 742 7 0 94 1536 0 1 3904 3900 1464 3 0 95 1552 0 1 1890 1873 720 8 0 96 1568 0 1 1963 1958 755 5 0 97 1584 0 1 1980 1974 770 7 0 100 1632 0 1 6190 6187 2476 2 0 103 1680 0 0 6477 6477 2667 7 0 104 1696 0 1 2256 2253 940 5 0 105 1712 0 1 2356 2340 992 8 0 107 1744 1 0 4697 4696 2013 3 0 110 1792 0 1 7744 7734 3388 7 0 111 1808 0 1 2655 2649 1180 4 0 114 1856 0 1 8371 8365 3805 5 0 116 1888 1 0 5863 5862 2706 6 0 117 1904 0 1 2955 2942 1379 7 0 118 1920 0 1 3009 2997 1416 8 0 126 2048 0 0 25276 25276 12638 1 0 128 2080 0 1 6060 6052 3232 8 0 129 2096 1 0 3081 3080 1659 7 0 134 2176 0 1 14835 14830 7912 8 0 135 2192 0 1 2769 2758 1491 7 0 137 2224 0 1 5082 5077 2772 6 0 140 2272 0 1 7236 7232 4020 5 0 144 2336 0 1 8428 8423 4816 4 0 147 2384 0 1 5316 5313 3101 7 0 151 2448 0 1 5445 5443 3267 3 0 155 2512 0 0 4121 4121 2536 8 0 158 2560 0 1 2208 2205 1380 5 0 160 2592 0 0 1133 1133 721 7 0 168 2720 0 0 2712 2712 1808 2 0 177 2864 1 0 1100 1098 770 7 0 180 2912 0 1 189 183 135 5 0 184 2976 0 1 176 166 128 8 0 190 3072 0 0 252 252 189 3 0 197 3184 0 1 198 192 154 7 0 202 3264 0 1 100 96 80 4 0 211 3408 0 1 210 208 175 5 0 217 3504 0 1 98 94 84 6 0 222 3584 0 0 104 104 91 7 0 225 3632 0 1 54 50 48 8 0 254 4096 0 0 33591 33591 33591 1 0 Note, the huge size watermark is above 3632 and there are a number of new normal classes available that previously were merged with the huge class. For instance, size class torvalds#211 holds 210 objects of size 3408 and uses 175 physical pages, while previously for those objects we would have used 210 physical pages. Link: https://lkml.kernel.org/r/20221031054108.541190-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Alexey Romanov <avromanov@sberdevices.ru> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp
pushed a commit
to intel-lab-lkp/linux
that referenced
this pull request
Nov 9, 2022
zsmalloc has 255 size classes. Size classes contain a number of zspages, which store objects of the same size. zspage can consist of up to four physical pages. The exact (most optimal) zspage size is calculated for each size class during zsmalloc pool creation. As a reasonable optimization, zsmalloc merges size classes that have similar characteristics: number of pages per zspage and number of objects zspage can store. For example, let's look at the following size classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable .. 94 1536 0 0 0 0 0 3 0 100 1632 0 0 0 0 0 2 0 .. Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time we store an object of size, say, 1568 bytes instead of using class torvalds#96 we end up storing it in size class torvalds#100. Class torvalds#100 is for objects of 1632 bytes in size, hence every 1568 bytes object wastes 1632-1568 bytes. Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects. When we need to store, say, 13 objects of size 1568 we end up allocating three zspages; in other words, 6 physical pages. However, if we'll look closer at size class torvalds#96 (which should hold objects of size 1568 bytes) and trace get_pages_per_zspage(): pages per zspage wasted bytes used% 1 960 76 2 352 95 3 1312 89 4 704 95 5 96 99 We'd notice that the most optimal zspage configuration for this class is when it consists of 5 physical pages, but currently we never let zspages to consists of more than 4 pages. A 5 page class torvalds#96 configuration would store 13 objects of size 1568 in a single zspage, allocating 5 physical pages, as opposed to 6 physical pages that class torvalds#100 will allocate. A higher order zspage for class torvalds#96 also changes its key characteristics: pages per-zspage and objects per-zspage. As a result classes torvalds#96 and torvalds#100 are not merged anymore, which gives us more compact zsmalloc. Of course the described effect does not apply only to size classes torvalds#96 and We still merge classes, but less often so. In other words classes are grouped in a more compact way, which decreases memory wastage: zspage order # unique size classes 2 69 3 123 4 191 Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 254 4096 0 0 0 0 0 1 0 ... For exactly same reason - maximum 4 pages per zspage - the last non-huge size class is torvalds#202, which stores objects of size 3264 bytes. Any object larger than 3264 bytes, hence, is considered to be huge and lands in size class torvalds#254, which uses a whole physical page to store every object. To put it slightly differently - objects in huge classes don't share physical pages. 3264 bytes is too low of a watermark and we have too many huge classes: classes from torvalds#203 to torvalds#254. Similarly to class size torvalds#96 above, higher order zspages change key characteristics for some of those huge size classes and thus those classes become normal classes, where stored objects share physical pages. Hence yet another consequence of higher order zspages: we move the huge size class watermark with higher order zspages, have less huge classes and store large objects in a more compact way. For order 3, huge class watermark becomes 3632 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 211 3408 0 0 0 0 0 5 0 217 3504 0 0 0 0 0 6 0 222 3584 0 0 0 0 0 7 0 225 3632 0 0 0 0 0 8 0 254 4096 0 0 0 0 0 1 0 ... For order 4, huge class watermark becomes 3840 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 206 3328 0 0 0 0 0 13 0 207 3344 0 0 0 0 0 9 0 208 3360 0 0 0 0 0 14 0 211 3408 0 0 0 0 0 5 0 212 3424 0 0 0 0 0 16 0 214 3456 0 0 0 0 0 11 0 217 3504 0 0 0 0 0 6 0 219 3536 0 0 0 0 0 13 0 222 3584 0 0 0 0 0 7 0 223 3600 0 0 0 0 0 15 0 225 3632 0 0 0 0 0 8 0 228 3680 0 0 0 0 0 9 0 230 3712 0 0 0 0 0 10 0 232 3744 0 0 0 0 0 11 0 234 3776 0 0 0 0 0 12 0 235 3792 0 0 0 0 0 13 0 236 3808 0 0 0 0 0 14 0 238 3840 0 0 0 0 0 15 0 254 4096 0 0 0 0 0 1 0 ... TESTS ===== Test untars linux-6.0.tar.xz and compiles the kernel. zram is configured as a block device with ext4 file system, lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after every test and rebooted the VM. orig_data_size mem_used_total mem_used_max pages_compacted compr_data_size mem_limit same_pages huge_pages ORDER 2 (BASE) zspage 1691791360 628086729 655171584 0 655171584 60 0 34043 1691787264 628089196 655175680 0 655175680 60 0 34046 1691803648 628098840 655187968 0 655187968 59 0 34047 1691795456 628091503 655183872 0 655183872 60 0 34044 1691799552 628086877 655183872 0 655183872 60 0 34047 ORDER 3 zspage 1691803648 627792993 641794048 0 641794048 60 0 33591 1691787264 627779342 641708032 0 641708032 59 0 33591 1691811840 627786616 641769472 0 641769472 60 0 33591 1691803648 627794468 641818624 0 641818624 59 0 33592 1691783168 627780882 641794048 0 641794048 61 0 33591 ORDER 4 zspage 1691803648 627726635 639655936 0 639655936 60 0 33435 1691811840 627733348 639643648 0 639643648 61 0 33434 1691795456 627726290 639614976 0 639614976 60 0 33435 1691803648 627730458 639688704 0 639688704 60 0 33434 1691811840 627727771 639688704 0 639688704 60 0 33434 Order 3 and order 4 show statistically significant improvement in `mem_used_max` metrics. T-test for order 3: x order-2-maxmem + order-3-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08 42210.666 Difference at 95.0% confidence -1.34038e+07 +/- 44080.7 -2.04581% +/- 0.00672802% (Student's t, pooled s = 30224.5) T-test for order 4: x order-2-maxmem + order-4-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.3961498e+08 6.396887e+08 6.3965594e+08 6.3965839e+08 31408.602 Difference at 95.0% confidence -1.55222e+07 +/- 33126.2 -2.36915% +/- 0.00505604% (Student's t, pooled s = 22713.4) This test tends to benefit more from order 4 zspages, due to test's data patterns. zsmalloc object distribution analysis ============================================================================= Order 2 (4 pages per zspage) tends to put many objects in size class 2048, which is merged with size classes torvalds#112-torvalds#125: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 0 6146 6146 1756 2 0 74 1216 0 1 4560 4552 1368 3 0 76 1248 0 1 2938 2934 904 4 0 83 1360 0 0 10971 10971 3657 1 0 91 1488 0 0 16126 16126 5864 4 0 94 1536 0 1 5912 5908 2217 3 0 100 1632 0 0 11990 11990 4796 2 0 107 1744 0 1 15771 15768 6759 3 0 111 1808 0 1 10386 10380 4616 4 0 126 2048 0 0 45444 45444 22722 1 0 144 2336 0 0 47446 47446 27112 4 0 151 2448 1 0 10760 10759 6456 3 0 168 2720 0 0 10173 10173 6782 2 0 190 3072 0 1 1700 1697 1275 3 0 202 3264 0 1 290 286 232 4 0 254 4096 0 0 34051 34051 34051 1 0 Order 3 (8 pages per zspage) changed pool characteristics and unmerged some of the size classes, which resulted in less objects being put into size class 2048, because there are lower size classes are now available for more compact object storage: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 1 2996 2994 856 2 0 72 1184 0 1 1632 1609 476 7 0 73 1200 1 0 1445 1442 425 5 0 74 1216 0 0 1510 1510 453 3 0 75 1232 0 1 1495 1479 455 7 0 76 1248 0 1 1456 1451 448 4 0 78 1280 0 1 3040 3033 950 5 0 79 1296 0 1 1584 1571 504 7 0 83 1360 0 0 6375 6375 2125 1 0 84 1376 0 1 1817 1796 632 8 0 87 1424 0 1 6020 6006 2107 7 0 88 1440 0 1 2108 2101 744 6 0 89 1456 0 1 2072 2064 740 5 0 91 1488 0 1 4169 4159 1516 4 0 92 1504 0 1 2014 2007 742 7 0 94 1536 0 1 3904 3900 1464 3 0 95 1552 0 1 1890 1873 720 8 0 96 1568 0 1 1963 1958 755 5 0 97 1584 0 1 1980 1974 770 7 0 100 1632 0 1 6190 6187 2476 2 0 103 1680 0 0 6477 6477 2667 7 0 104 1696 0 1 2256 2253 940 5 0 105 1712 0 1 2356 2340 992 8 0 107 1744 1 0 4697 4696 2013 3 0 110 1792 0 1 7744 7734 3388 7 0 111 1808 0 1 2655 2649 1180 4 0 114 1856 0 1 8371 8365 3805 5 0 116 1888 1 0 5863 5862 2706 6 0 117 1904 0 1 2955 2942 1379 7 0 118 1920 0 1 3009 2997 1416 8 0 126 2048 0 0 25276 25276 12638 1 0 128 2080 0 1 6060 6052 3232 8 0 129 2096 1 0 3081 3080 1659 7 0 134 2176 0 1 14835 14830 7912 8 0 135 2192 0 1 2769 2758 1491 7 0 137 2224 0 1 5082 5077 2772 6 0 140 2272 0 1 7236 7232 4020 5 0 144 2336 0 1 8428 8423 4816 4 0 147 2384 0 1 5316 5313 3101 7 0 151 2448 0 1 5445 5443 3267 3 0 155 2512 0 0 4121 4121 2536 8 0 158 2560 0 1 2208 2205 1380 5 0 160 2592 0 0 1133 1133 721 7 0 168 2720 0 0 2712 2712 1808 2 0 177 2864 1 0 1100 1098 770 7 0 180 2912 0 1 189 183 135 5 0 184 2976 0 1 176 166 128 8 0 190 3072 0 0 252 252 189 3 0 197 3184 0 1 198 192 154 7 0 202 3264 0 1 100 96 80 4 0 211 3408 0 1 210 208 175 5 0 217 3504 0 1 98 94 84 6 0 222 3584 0 0 104 104 91 7 0 225 3632 0 1 54 50 48 8 0 254 4096 0 0 33591 33591 33591 1 0 Note, the huge size watermark is above 3632 and there are a number of new normal classes available that previously were merged with the huge class. For instance, size class torvalds#211 holds 210 objects of size 3408 and uses 175 physical pages, while previously for those objects we would have used 210 physical pages. Link: https://lkml.kernel.org/r/20221031054108.541190-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Alexey Romanov <avromanov@sberdevices.ru> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
gatieme
pushed a commit
to gatieme/linux
that referenced
this pull request
Nov 24, 2022
ANBZ: torvalds#112 commit bd74048 upstream If we hit an earlier error path in io_uring_create(), then we will have accounted memory, but not set ctx->{sq,cq}_entries yet. Then when the ring is torn down in error, we use those values to unaccount the memory. Ensure we set the ctx entries before we're able to hit a potential error path. Cc: stable@vger.kernel.org Reported-by: Tomáš Chaloupka <chalucha@gmail.com> Tested-by: Tomáš Chaloupka <chalucha@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Damenly
pushed a commit
to Damenly/linux
that referenced
this pull request
Jul 25, 2023
Signed-off-by: nascs <nascs@radxa.com>
mj22226
pushed a commit
to mj22226/linux
that referenced
this pull request
Jul 30, 2023
Signed-off-by: nascs <nascs@radxa.com>
intel-lab-lkp
pushed a commit
to intel-lab-lkp/linux
that referenced
this pull request
Nov 27, 2023
With latest upstream llvm18, the following test cases failed: $ ./test_progs -j torvalds#13/2 bpf_cookie/multi_kprobe_link_api:FAIL torvalds#13/3 bpf_cookie/multi_kprobe_attach_api:FAIL torvalds#13 bpf_cookie:FAIL torvalds#77 fentry_fexit:FAIL torvalds#78/1 fentry_test/fentry:FAIL torvalds#78 fentry_test:FAIL torvalds#82/1 fexit_test/fexit:FAIL torvalds#82 fexit_test:FAIL torvalds#112/1 kprobe_multi_test/skel_api:FAIL torvalds#112/2 kprobe_multi_test/link_api_addrs:FAIL ... torvalds#112 kprobe_multi_test:FAIL torvalds#356/17 test_global_funcs/global_func17:FAIL torvalds#356 test_global_funcs:FAIL Further analysis shows llvm upstream patch [1] is responsible for the above failures. For example, for function bpf_fentry_test7() in net/bpf/test_run.c, without [1], the asm code is: 0000000000000400 <bpf_fentry_test7>: 400: f3 0f 1e fa endbr64 404: e8 00 00 00 00 callq 0x409 <bpf_fentry_test7+0x9> 409: 48 89 f8 movq %rdi, %rax 40c: c3 retq 40d: 0f 1f 00 nopl (%rax) and with [1], the asm code is: 0000000000005d20 <bpf_fentry_test7.specialized.1>: 5d20: e8 00 00 00 00 callq 0x5d25 <bpf_fentry_test7.specialized.1+0x5> 5d25: c3 retq and <bpf_fentry_test7.specialized.1> is called instead of <bpf_fentry_test7> and this caused test failures for torvalds#13/torvalds#77 etc. except torvalds#356. For test case torvalds#356/17, with [1] (progs/test_global_func17.c)), the main prog looks like: 0000000000000000 <global_func17>: 0: b4 00 00 00 2a 00 00 00 w0 = 0x2a 1: 95 00 00 00 00 00 00 00 exit which passed verification while the test itself expects a verification failure. Let us add 'barrier_var' style asm code in both places to prevent function specialization which caused selftests failure. [1] llvm/llvm-project#72903 Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
alobakin
pushed a commit
to alobakin/linux
that referenced
this pull request
Nov 27, 2023
With latest upstream llvm18, the following test cases failed: $ ./test_progs -j #13/2 bpf_cookie/multi_kprobe_link_api:FAIL #13/3 bpf_cookie/multi_kprobe_attach_api:FAIL #13 bpf_cookie:FAIL torvalds#77 fentry_fexit:FAIL torvalds#78/1 fentry_test/fentry:FAIL torvalds#78 fentry_test:FAIL torvalds#82/1 fexit_test/fexit:FAIL torvalds#82 fexit_test:FAIL torvalds#112/1 kprobe_multi_test/skel_api:FAIL torvalds#112/2 kprobe_multi_test/link_api_addrs:FAIL [...] torvalds#112 kprobe_multi_test:FAIL torvalds#356/17 test_global_funcs/global_func17:FAIL torvalds#356 test_global_funcs:FAIL Further analysis shows llvm upstream patch [1] is responsible for the above failures. For example, for function bpf_fentry_test7() in net/bpf/test_run.c, without [1], the asm code is: 0000000000000400 <bpf_fentry_test7>: 400: f3 0f 1e fa endbr64 404: e8 00 00 00 00 callq 0x409 <bpf_fentry_test7+0x9> 409: 48 89 f8 movq %rdi, %rax 40c: c3 retq 40d: 0f 1f 00 nopl (%rax) ... and with [1], the asm code is: 0000000000005d20 <bpf_fentry_test7.specialized.1>: 5d20: e8 00 00 00 00 callq 0x5d25 <bpf_fentry_test7.specialized.1+0x5> 5d25: c3 retq ... and <bpf_fentry_test7.specialized.1> is called instead of <bpf_fentry_test7> and this caused test failures for #13/torvalds#77 etc. except torvalds#356. For test case torvalds#356/17, with [1] (progs/test_global_func17.c)), the main prog looks like: 0000000000000000 <global_func17>: 0: b4 00 00 00 2a 00 00 00 w0 = 0x2a 1: 95 00 00 00 00 00 00 00 exit ... which passed verification while the test itself expects a verification failure. Let us add 'barrier_var' style asm code in both places to prevent function specialization which caused selftests failure. [1] llvm/llvm-project#72903 Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20231127050342.1945270-1-yonghong.song@linux.dev
intel-lab-lkp
pushed a commit
to intel-lab-lkp/linux
that referenced
this pull request
Dec 25, 2023
If CONFIG_BPF_JIT_ALWAYS_ON is not set and bpf_jit_enable is 0, there exist 6 failed tests. [root@linux bpf]# echo 0 > /proc/sys/net/core/bpf_jit_enable [root@linux bpf]# ./test_verifier | grep FAIL torvalds#107/p inline simple bpf_loop call FAIL torvalds#108/p don't inline bpf_loop call, flags non-zero FAIL torvalds#109/p don't inline bpf_loop call, callback non-constant FAIL torvalds#110/p bpf_loop_inline and a dead func FAIL torvalds#111/p bpf_loop_inline stack locations for loop vars FAIL torvalds#112/p inline bpf_loop call in a big program FAIL Summary: 505 PASSED, 266 SKIPPED, 6 FAILED The test log shows that callbacks are not allowed in non-JITed programs, interpreter doesn't support them yet, thus these tests should be skipped if jit is disabled, just return -ENOTSUPP instead of -EINVAL for pseudo calls in fixup_call_args(). With this patch: [root@linux bpf]# echo 0 > /proc/sys/net/core/bpf_jit_enable [root@linux bpf]# ./test_verifier | grep FAIL Summary: 505 PASSED, 272 SKIPPED, 0 FAILED Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Kaz205
pushed a commit
to Kaz205/linux
that referenced
this pull request
Feb 5, 2024
[ Upstream commit b16904f ] With latest upstream llvm18, the following test cases failed: $ ./test_progs -j torvalds#13/2 bpf_cookie/multi_kprobe_link_api:FAIL torvalds#13/3 bpf_cookie/multi_kprobe_attach_api:FAIL torvalds#13 bpf_cookie:FAIL torvalds#77 fentry_fexit:FAIL torvalds#78/1 fentry_test/fentry:FAIL torvalds#78 fentry_test:FAIL torvalds#82/1 fexit_test/fexit:FAIL torvalds#82 fexit_test:FAIL torvalds#112/1 kprobe_multi_test/skel_api:FAIL torvalds#112/2 kprobe_multi_test/link_api_addrs:FAIL [...] torvalds#112 kprobe_multi_test:FAIL torvalds#356/17 test_global_funcs/global_func17:FAIL torvalds#356 test_global_funcs:FAIL Further analysis shows llvm upstream patch [1] is responsible for the above failures. For example, for function bpf_fentry_test7() in net/bpf/test_run.c, without [1], the asm code is: 0000000000000400 <bpf_fentry_test7>: 400: f3 0f 1e fa endbr64 404: e8 00 00 00 00 callq 0x409 <bpf_fentry_test7+0x9> 409: 48 89 f8 movq %rdi, %rax 40c: c3 retq 40d: 0f 1f 00 nopl (%rax) ... and with [1], the asm code is: 0000000000005d20 <bpf_fentry_test7.specialized.1>: 5d20: e8 00 00 00 00 callq 0x5d25 <bpf_fentry_test7.specialized.1+0x5> 5d25: c3 retq ... and <bpf_fentry_test7.specialized.1> is called instead of <bpf_fentry_test7> and this caused test failures for torvalds#13/torvalds#77 etc. except torvalds#356. For test case torvalds#356/17, with [1] (progs/test_global_func17.c)), the main prog looks like: 0000000000000000 <global_func17>: 0: b4 00 00 00 2a 00 00 00 w0 = 0x2a 1: 95 00 00 00 00 00 00 00 exit ... which passed verification while the test itself expects a verification failure. Let us add 'barrier_var' style asm code in both places to prevent function specialization which caused selftests failure. [1] llvm/llvm-project#72903 Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20231127050342.1945270-1-yonghong.song@linux.dev Signed-off-by: Sasha Levin <sashal@kernel.org>
1054009064
pushed a commit
to 1054009064/linux
that referenced
this pull request
Feb 5, 2024
[ Upstream commit b16904f ] With latest upstream llvm18, the following test cases failed: $ ./test_progs -j torvalds#13/2 bpf_cookie/multi_kprobe_link_api:FAIL torvalds#13/3 bpf_cookie/multi_kprobe_attach_api:FAIL torvalds#13 bpf_cookie:FAIL torvalds#77 fentry_fexit:FAIL torvalds#78/1 fentry_test/fentry:FAIL torvalds#78 fentry_test:FAIL torvalds#82/1 fexit_test/fexit:FAIL torvalds#82 fexit_test:FAIL torvalds#112/1 kprobe_multi_test/skel_api:FAIL torvalds#112/2 kprobe_multi_test/link_api_addrs:FAIL [...] torvalds#112 kprobe_multi_test:FAIL torvalds#356/17 test_global_funcs/global_func17:FAIL torvalds#356 test_global_funcs:FAIL Further analysis shows llvm upstream patch [1] is responsible for the above failures. For example, for function bpf_fentry_test7() in net/bpf/test_run.c, without [1], the asm code is: 0000000000000400 <bpf_fentry_test7>: 400: f3 0f 1e fa endbr64 404: e8 00 00 00 00 callq 0x409 <bpf_fentry_test7+0x9> 409: 48 89 f8 movq %rdi, %rax 40c: c3 retq 40d: 0f 1f 00 nopl (%rax) ... and with [1], the asm code is: 0000000000005d20 <bpf_fentry_test7.specialized.1>: 5d20: e8 00 00 00 00 callq 0x5d25 <bpf_fentry_test7.specialized.1+0x5> 5d25: c3 retq ... and <bpf_fentry_test7.specialized.1> is called instead of <bpf_fentry_test7> and this caused test failures for torvalds#13/torvalds#77 etc. except torvalds#356. For test case torvalds#356/17, with [1] (progs/test_global_func17.c)), the main prog looks like: 0000000000000000 <global_func17>: 0: b4 00 00 00 2a 00 00 00 w0 = 0x2a 1: 95 00 00 00 00 00 00 00 exit ... which passed verification while the test itself expects a verification failure. Let us add 'barrier_var' style asm code in both places to prevent function specialization which caused selftests failure. [1] llvm/llvm-project#72903 Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20231127050342.1945270-1-yonghong.song@linux.dev Signed-off-by: Sasha Levin <sashal@kernel.org>
rodriguezst
pushed a commit
to rodriguezst/linux
that referenced
this pull request
May 17, 2024
This change adds support for a configurable eir_max_name_len for platforms which requires a larger than 48 bytes complete name in EIR. From bluetoothctl: [bluetooth]# system-alias 012345678901234567890123456789012345678901234567890123456789 Changing 012345678901234567890123456789012345678901234567890123456789 succeeded [CHG] Controller DC:71:96:69:02:89 Alias: 012345678901234567890123456789012345678901234567890123456789 From btmon: < HCI Command: Write Local Name (0x03|0x0013) plen 248 torvalds#109 [hci0] 88.567990 Name: 012345678901234567890123456789012345678901234567890123456789 > HCI Event: Command Complete (0x0e) plen 4 torvalds#110 [hci0] 88.663854 Write Local Name (0x03|0x0013) ncmd 1 Status: Success (0x00) @ MGMT Event: Local Name Changed (0x0008) plen 260 {0x0004} [hci0] 88.663948 Name: 012345678901234567890123456789012345678901234567890123456789 Short name: < HCI Command: Write Extended Inquiry Response (0x03|0x0052) plen 241 torvalds#111 [hci0] 88.663977 FEC: Not required (0x00) Name (complete): 012345678901234567890123456789012345678901234567890123456789 TX power: 12 dBm Device ID: Bluetooth SIG assigned (0x0001) Vendor: Google (224) Product: 0xc405 Version: 0.5.6 (0x0056) 16-bit Service UUIDs (complete): 7 entries Generic Access Profile (0x1800) Generic Attribute Profile (0x1801) Device Information (0x180a) A/V Remote Control (0x110e) A/V Remote Control Target (0x110c) Handsfree Audio Gateway (0x111f) Audio Source (0x110a) > HCI Event: Command Complete (0x0e) plen 4 torvalds#112 [hci0] 88.664874 Write Extended Inquiry Response (0x03|0x0052) ncmd 1 Status: Success (0x00) (am from https://patchwork.kernel.org/patch/11687367/) Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org> Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org> Signed-off-by: Alain Michaud <alainm@chromium.org> Backport notes: HDEV_PARAM_U16 is changed from two parameters to one parameter. BUG=none TEST=build Signed-off-by: Zhengping Jiang <jiangzp@google.com>
intel-lab-lkp
pushed a commit
to intel-lab-lkp/linux
that referenced
this pull request
Oct 1, 2024
This change adds support for a configurable eir_max_name_len for platforms which requires a larger than 48 bytes complete name in EIR. From bluetoothctl: [bluetooth]# system-alias 012345678901234567890123456789012345678901234567890123456789 Changing 012345678901234567890123456789012345678901234567890123456789 succeeded [CHG] Controller DC:71:96:69:02:89 Alias: 012345678901234567890123456789012345678901234567890123456789 From btmon: < HCI Command: Write Local Name (0x03|0x0013) plen 248 torvalds#109 [hci0] 88.567990 Name: 012345678901234567890123456789012345678901234567890123456789 > HCI Event: Command Complete (0x0e) plen 4 torvalds#110 [hci0] 88.663854 Write Local Name (0x03|0x0013) ncmd 1 Status: Success (0x00) @ MGMT Event: Local Name Changed (0x0008) plen 260 {0x0004} [hci0] 88.663948 Name: 012345678901234567890123456789012345678901234567890123456789 Short name: < HCI Command: Write Extended Inquiry Response (0x03|0x0052) plen 241 torvalds#111 [hci0] 88.663977 FEC: Not required (0x00) Name (complete): 012345678901234567890123456789012345678901234567890123456789 TX power: 12 dBm Device ID: Bluetooth SIG assigned (0x0001) Vendor: Google (224) Product: 0xc405 Version: 0.5.6 (0x0056) 16-bit Service UUIDs (complete): 7 entries Generic Access Profile (0x1800) Generic Attribute Profile (0x1801) Device Information (0x180a) A/V Remote Control (0x110e) A/V Remote Control Target (0x110c) Handsfree Audio Gateway (0x111f) Audio Source (0x110a) > HCI Event: Command Complete (0x0e) plen 4 torvalds#112 [hci0] 88.664874 Write Extended Inquiry Response (0x03|0x0052) ncmd 1 Status: Success (0x00) (am from https://patchwork.kernel.org/patch/11687367/) Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org> Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org> Signed-off-by: Alain Michaud <alainm@chromium.org> Backport notes: HDEV_PARAM_U16 is changed from two parameters to one parameter. BUG=none TEST=build Signed-off-by: Zhengping Jiang <jiangzp@google.com>
intel-lab-lkp
pushed a commit
to intel-lab-lkp/linux
that referenced
this pull request
Dec 13, 2024
This change adds support for a configurable eir_max_name_len for platforms which requires a larger than 48 bytes complete name in EIR. From bluetoothctl: [bluetooth]# system-alias 012345678901234567890123456789012345678901234567890123456789 Changing 012345678901234567890123456789012345678901234567890123456789 succeeded [CHG] Controller DC:71:96:69:02:89 Alias: 012345678901234567890123456789012345678901234567890123456789 From btmon: < HCI Command: Write Local Name (0x03|0x0013) plen 248 torvalds#109 [hci0] 88.567990 Name: 012345678901234567890123456789012345678901234567890123456789 > HCI Event: Command Complete (0x0e) plen 4 torvalds#110 [hci0] 88.663854 Write Local Name (0x03|0x0013) ncmd 1 Status: Success (0x00) @ MGMT Event: Local Name Changed (0x0008) plen 260 {0x0004} [hci0] 88.663948 Name: 012345678901234567890123456789012345678901234567890123456789 Short name: < HCI Command: Write Extended Inquiry Response (0x03|0x0052) plen 241 torvalds#111 [hci0] 88.663977 FEC: Not required (0x00) Name (complete): 012345678901234567890123456789012345678901234567890123456789 TX power: 12 dBm Device ID: Bluetooth SIG assigned (0x0001) Vendor: Google (224) Product: 0xc405 Version: 0.5.6 (0x0056) 16-bit Service UUIDs (complete): 7 entries Generic Access Profile (0x1800) Generic Attribute Profile (0x1801) Device Information (0x180a) A/V Remote Control (0x110e) A/V Remote Control Target (0x110c) Handsfree Audio Gateway (0x111f) Audio Source (0x110a) > HCI Event: Command Complete (0x0e) plen 4 torvalds#112 [hci0] 88.664874 Write Extended Inquiry Response (0x03|0x0052) ncmd 1 Status: Success (0x00) (am from https://patchwork.kernel.org/patch/11687367/) Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org> Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org> Signed-off-by: Alain Michaud <alainm@chromium.org> Backport notes: HDEV_PARAM_U16 is changed from two parameters to one parameter. BUG=none TEST=build Signed-off-by: Zhengping Jiang <jiangzp@google.com>
scpcom
pushed a commit
to scpcom/linux-archive
that referenced
this pull request
Apr 27, 2025
[ Upstream commit 74174fe ] On fast hosts or malicious bots, we trigger a DCCP_BUG() which seems excessive. syzbot reported : BUG: delta (-6195) <= 0 at net/dccp/ccids/ccid3.c:628/ccid3_hc_rx_send_feedback() CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.18.0-rc1+ torvalds#112 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 ccid3_hc_rx_send_feedback net/dccp/ccids/ccid3.c:628 [inline] ccid3_hc_rx_packet_recv.cold.16+0x38/0x71 net/dccp/ccids/ccid3.c:793 ccid_hc_rx_packet_recv net/dccp/ccid.h:185 [inline] dccp_deliver_input_to_ccids+0xf0/0x280 net/dccp/input.c:180 dccp_rcv_established+0x87/0xb0 net/dccp/input.c:378 dccp_v4_do_rcv+0x153/0x180 net/dccp/ipv4.c:654 sk_backlog_rcv include/net/sock.h:914 [inline] __sk_receive_skb+0x3ba/0xd80 net/core/sock.c:517 dccp_v4_rcv+0x10f9/0x1f58 net/dccp/ipv4.c:875 ip_local_deliver_finish+0x2eb/0xda0 net/ipv4/ip_input.c:215 NF_HOOK include/linux/netfilter.h:287 [inline] ip_local_deliver+0x1e9/0x750 net/ipv4/ip_input.c:256 dst_input include/net/dst.h:450 [inline] ip_rcv_finish+0x823/0x2220 net/ipv4/ip_input.c:396 NF_HOOK include/linux/netfilter.h:287 [inline] ip_rcv+0xa18/0x1284 net/ipv4/ip_input.c:492 __netif_receive_skb_core+0x2488/0x3680 net/core/dev.c:4628 __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:4693 process_backlog+0x219/0x760 net/core/dev.c:5373 napi_poll net/core/dev.c:5771 [inline] net_rx_action+0x7da/0x1980 net/core/dev.c:5837 __do_softirq+0x2e8/0xb17 kernel/softirq.c:284 run_ksoftirqd+0x86/0x100 kernel/softirq.c:645 smpboot_thread_fn+0x417/0x870 kernel/smpboot.c:164 kthread+0x345/0x410 kernel/kthread.c:240 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk> Cc: dccp@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.