Skip to content

ZFS 0.8: kworker sucking up 100% CPU, various messages in logs mentioning ZFS, system eventually hangs. #9430

Closed
@ReimuHakurei

Description

@ReimuHakurei

System information

Type Version/Name
Distribution Name Ubuntu Server
Distribution Version 19.10 Eoan Ermine
Linux Kernel 5.3.0-13-generic #14-Ubuntu SMP Tue Sep 24 02:46:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Architecture amd64
ZFS Version 0.8.1-1ubuntu11

Describe the problem you're observing

After some hours of system uptime, a kworker process will suck up 100% of a core. The system load averages will climb constantly. This kworker process appears to be ZFS related (as of this writing, the current CPU hog is named kworker/u33:3+flush-zfs-19).

This system was originally on Ubuntu Server 18.04 with ZFS 0.8.1 that I had compiled some time before. It was at several months of uptime with no issues. Out of nowhere, this issue came up.

Things that have been tried:

  • Rebooting several times.
  • Upgrading to the Ubuntu HWE kernel
  • Updating to the development release of Ubuntu 19.10 (which comes with ZFS 0.8) to rule out problems with the version I compiled.

I've tried everything I can think of, and since various ZFS-related terms come up in the kworker name and messages printed to dmesg, I am left to assume it is a problem in ZFS.

Describe how to reproduce the problem

Unknown. The system this is occurring on has been stable with minimum in the way of changes for about a year. The system is a host to various LXC containers. The root filesystem is ZFS, on a single 500GB Crucial SSD. All data for the LXC containers is stored on an encrypted storage pool. Output of zpool status:

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:02:31 with 0 errors on Mon Oct  7 22:35:32 2019
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          sda1      ONLINE       0     0     0

errors: No known data errors

  pool: storage
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:55:30 with 0 errors on Mon Oct  7 23:35:51 2019
config:

        NAME                        STATE     READ WRITE CKSUM
        storage                     ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            scsi-35000cca24300d8f4  ONLINE       0     0     0
            scsi-35000cca243014464  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            scsi-35000cca243015924  ONLINE       0     0     0
            scsi-35000cca24400a9a4  ONLINE       0     0     0

errors: No known data errors

Include any warning/errors/backtraces from the system logs

Not sure what would be helpful for debugging, but here's some stuff that was printed to dmesg:

[47488.869573] INFO: task pip:19670 blocked for more than 120 seconds.
[47488.869627]       Tainted: P           O      5.3.0-13-generic #14-Ubuntu
[47488.869670] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47488.869718] pip             D    0 19670  19606 0x00000120
[47488.869722] Call Trace:
[47488.869732]  __schedule+0x2b9/0x6c0
[47488.869736]  ? bit_wait_io+0x50/0x50
[47488.869738]  schedule+0x42/0xb0
[47488.869741]  bit_wait+0x11/0x50
[47488.869743]  __wait_on_bit+0x33/0xa0
[47488.869749]  __inode_wait_for_writeback+0xae/0xe0
[47488.869753]  ? var_wake_function+0x30/0x30
[47488.869756]  inode_wait_for_writeback+0x27/0x40
[47488.869759]  evict+0xbd/0x1b0
[47488.869761]  iput+0x141/0x1f0
[47488.869764]  do_unlinkat+0x1c5/0x2d0
[47488.869767]  __x64_sys_unlinkat+0x38/0x60
[47488.869770]  do_syscall_64+0x5a/0x130
[47488.869774]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[47488.869777] RIP: 0033:0x7f7a4ac39d77
[47488.869784] Code: Bad RIP value.
[47488.869786] RSP: 002b:00007fff4c816858 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
[47488.869788] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f7a4ac39d77
[47488.869790] RDX: 0000000000000000 RSI: 00007f7a44c4bda0 RDI: 0000000000000005
[47488.869791] RBP: 0000555a9b59b240 R08: 0000000000000000 R09: 00007fff4c816510
[47488.869792] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f7a4aad01b0
[47488.869793] R13: 00007f7a4aac8e08 R14: 0000555a9a878240 R15: 0000555a9b9eb460
[47488.869797] INFO: task sync:20009 blocked for more than 120 seconds.
[47488.869840]       Tainted: P           O      5.3.0-13-generic #14-Ubuntu
[47488.869883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47488.869933] sync            D    0 20009  20008 0x00004120
[47488.869936] Call Trace:
[47488.869940]  __schedule+0x2b9/0x6c0
[47488.869943]  schedule+0x42/0xb0
[47488.869945]  wb_wait_for_completion+0x64/0x90
[47488.869948]  ? wait_woken+0x80/0x80
[47488.869950]  sync_inodes_sb+0xd7/0x290
[47488.869954]  sync_inodes_one_sb+0x15/0x20
[47488.869957]  iterate_supers+0xa3/0x110
[47488.869959]  ? page_cache_pipe_buf_steal.cold+0x1c/0x1c
[47488.869962]  ksys_sync+0x42/0xb0
[47488.869964]  __ia32_sys_sync+0xe/0x20
[47488.869967]  do_syscall_64+0x5a/0x130
[47488.869970]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[47488.869972] RIP: 0033:0x7f56f6bd7287
[47488.869976] Code: Bad RIP value.
[47488.869977] RSP: 002b:00007fff2c486078 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
[47488.869979] RAX: ffffffffffffffda RBX: 00007fff2c4861a8 RCX: 00007f56f6bd7287
[47488.869981] RDX: 00007f56f6eae001 RSI: 0000000000000000 RDI: 00007f56f6c754b0
[47488.869982] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[47488.869984] R10: 000055d597318010 R11: 0000000000000246 R12: 0000000000000001
[47488.869985] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[47609.702068] INFO: task pip:19670 blocked for more than 241 seconds.
[47609.702114]       Tainted: P           O      5.3.0-13-generic #14-Ubuntu
[47609.702150] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47609.702191] pip             D    0 19670  19606 0x00000120
[47609.702194] Call Trace:
[47609.702203]  __schedule+0x2b9/0x6c0
[47609.702206]  ? bit_wait_io+0x50/0x50
[47609.702209]  schedule+0x42/0xb0
[47609.702210]  bit_wait+0x11/0x50
[47609.702212]  __wait_on_bit+0x33/0xa0
[47609.702218]  __inode_wait_for_writeback+0xae/0xe0
[47609.702221]  ? var_wake_function+0x30/0x30
[47609.702223]  inode_wait_for_writeback+0x27/0x40
[47609.702226]  evict+0xbd/0x1b0
[47609.702227]  iput+0x141/0x1f0
[47609.702230]  do_unlinkat+0x1c5/0x2d0
[47609.702232]  __x64_sys_unlinkat+0x38/0x60
[47609.702235]  do_syscall_64+0x5a/0x130
[47609.702238]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[47609.702240] RIP: 0033:0x7f7a4ac39d77
[47609.702247] Code: Bad RIP value.
[47609.702248] RSP: 002b:00007fff4c816858 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
[47609.702251] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f7a4ac39d77
[47609.702252] RDX: 0000000000000000 RSI: 00007f7a44c4bda0 RDI: 0000000000000005
[47609.702253] RBP: 0000555a9b59b240 R08: 0000000000000000 R09: 00007fff4c816510
[47609.702254] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f7a4aad01b0
[47609.702255] R13: 00007f7a4aac8e08 R14: 0000555a9a878240 R15: 0000555a9b9eb460
[47609.702258] INFO: task sync:20009 blocked for more than 241 seconds.
[47609.702293]       Tainted: P           O      5.3.0-13-generic #14-Ubuntu
[47609.702328] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47609.702368] sync            D    0 20009  20008 0x00004120
[47609.702370] Call Trace:
[47609.702373]  __schedule+0x2b9/0x6c0
[47609.702376]  schedule+0x42/0xb0
[47609.702378]  wb_wait_for_completion+0x64/0x90
[47609.702379]  ? wait_woken+0x80/0x80
[47609.702381]  sync_inodes_sb+0xd7/0x290
[47609.702384]  sync_inodes_one_sb+0x15/0x20
[47609.702386]  iterate_supers+0xa3/0x110
[47609.702388]  ? page_cache_pipe_buf_steal.cold+0x1c/0x1c
[47609.702391]  ksys_sync+0x42/0xb0
[47609.702393]  __ia32_sys_sync+0xe/0x20
[47609.702395]  do_syscall_64+0x5a/0x130
[47609.702398]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[47609.702399] RIP: 0033:0x7f56f6bd7287
[47609.702402] Code: Bad RIP value.
[47609.702403] RSP: 002b:00007fff2c486078 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
[47609.702405] RAX: ffffffffffffffda RBX: 00007fff2c4861a8 RCX: 00007f56f6bd7287
[47609.702406] RDX: 00007f56f6eae001 RSI: 0000000000000000 RDI: 00007f56f6c754b0
[47609.702407] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[47609.702408] R10: 000055d597318010 R11: 0000000000000246 R12: 0000000000000001
[47609.702409] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[47730.534430] INFO: task journal-offline:32058 blocked for more than 120 seconds.
[47730.534481]       Tainted: P           O      5.3.0-13-generic #14-Ubuntu
[47730.534516] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47730.534558] journal-offline D    0 32058  27128 0x00000120
[47730.534561] Call Trace:
[47730.534570]  __schedule+0x2b9/0x6c0
[47730.534573]  schedule+0x42/0xb0
[47730.534575]  schedule_preempt_disabled+0xe/0x10
[47730.534578]  __mutex_lock.isra.0+0x182/0x4f0
[47730.534581]  __mutex_lock_slowpath+0x13/0x20
[47730.534583]  mutex_lock+0x2e/0x40
[47730.534698]  zil_commit_impl+0x148/0xd60 [zfs]
[47730.534766]  ? zpl_write_common_iovec+0x100/0x100 [zfs]
[47730.534769]  ? _cond_resched+0x19/0x30
[47730.534771]  ? mutex_lock+0x13/0x40
[47730.534840]  zil_commit+0x40/0x60 [zfs]
[47730.534908]  zpl_writepages+0xec/0x180 [zfs]
[47730.534912]  do_writepages+0x43/0xd0
[47730.534916]  __filemap_fdatawrite_range+0xc4/0x100
[47730.534919]  filemap_write_and_wait_range+0x36/0x90
[47730.534987]  zpl_fsync+0x3d/0xa0 [zfs]
[47730.534991]  vfs_fsync_range+0x49/0x80
[47730.534993]  ? __fget_light+0x57/0x70
[47730.534995]  do_fsync+0x3d/0x70
[47730.534997]  __x64_sys_fsync+0x14/0x20
[47730.535001]  do_syscall_64+0x5a/0x130
[47730.535003]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[47730.535006] RIP: 0033:0x7f872e0af237
[47730.535012] Code: Bad RIP value.
[47730.535014] RSP: 002b:00007f8729766cb0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[47730.535016] RAX: ffffffffffffffda RBX: 0000000000000014 RCX: 00007f872e0af237
[47730.535017] RDX: 0000000000000000 RSI: 00007f872dcb762c RDI: 0000000000000014
[47730.535018] RBP: 00007f872dcb8da0 R08: 00007f8729767700 R09: 00007f8729767700
[47730.535019] R10: 000000000000062c R11: 0000000000000293 R12: 0000000000000002
[47730.535020] R13: 0000000000000000 R14: 00005647819af200 R15: 00007ffce0398f10
[47730.535120] INFO: task pip:19670 blocked for more than 362 seconds.
[47730.535155]       Tainted: P           O      5.3.0-13-generic #14-Ubuntu
[47730.535190] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47730.535230] pip             D    0 19670  19606 0x00000120
[47730.535232] Call Trace:
[47730.535235]  __schedule+0x2b9/0x6c0
[47730.535238]  ? bit_wait_io+0x50/0x50
[47730.535240]  schedule+0x42/0xb0
[47730.535242]  bit_wait+0x11/0x50
[47730.535243]  __wait_on_bit+0x33/0xa0
[47730.535246]  __inode_wait_for_writeback+0xae/0xe0
[47730.535249]  ? var_wake_function+0x30/0x30
[47730.535252]  inode_wait_for_writeback+0x27/0x40
[47730.535254]  evict+0xbd/0x1b0
[47730.535256]  iput+0x141/0x1f0
[47730.535258]  do_unlinkat+0x1c5/0x2d0
[47730.535260]  __x64_sys_unlinkat+0x38/0x60
[47730.535262]  do_syscall_64+0x5a/0x130
[47730.535265]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[47730.535266] RIP: 0033:0x7f7a4ac39d77
[47730.535270] Code: Bad RIP value.
[47730.535271] RSP: 002b:00007fff4c816858 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
[47730.535273] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f7a4ac39d77
[47730.535274] RDX: 0000000000000000 RSI: 00007f7a44c4bda0 RDI: 0000000000000005
[47730.535275] RBP: 0000555a9b59b240 R08: 0000000000000000 R09: 00007fff4c816510
[47730.535276] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f7a4aad01b0
[47730.535277] R13: 00007f7a4aac8e08 R14: 0000555a9a878240 R15: 0000555a9b9eb460
[47730.535280] INFO: task sync:20009 blocked for more than 362 seconds.
[47730.535314]       Tainted: P           O      5.3.0-13-generic #14-Ubuntu
[47730.535349] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47730.535389] sync            D    0 20009  20008 0x00004120
[47730.535391] Call Trace:
[47730.535394]  __schedule+0x2b9/0x6c0
[47730.535396]  schedule+0x42/0xb0
[47730.535398]  wb_wait_for_completion+0x64/0x90
[47730.535400]  ? wait_woken+0x80/0x80
[47730.535402]  sync_inodes_sb+0xd7/0x290
[47730.535405]  sync_inodes_one_sb+0x15/0x20
[47730.535407]  iterate_supers+0xa3/0x110
[47730.535409]  ? page_cache_pipe_buf_steal.cold+0x1c/0x1c
[47730.535411]  ksys_sync+0x42/0xb0
[47730.535413]  __ia32_sys_sync+0xe/0x20
[47730.535416]  do_syscall_64+0x5a/0x130
[47730.535418]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[47730.535420] RIP: 0033:0x7f56f6bd7287
[47730.535423] Code: Bad RIP value.
[47730.535424] RSP: 002b:00007fff2c486078 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
[47730.535425] RAX: ffffffffffffffda RBX: 00007fff2c4861a8 RCX: 00007f56f6bd7287
[47730.535427] RDX: 00007f56f6eae001 RSI: 0000000000000000 RDI: 00007f56f6c754b0
[47730.535428] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[47730.535429] R10: 000055d597318010 R11: 0000000000000246 R12: 0000000000000001
[47730.535429] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[47851.366840] INFO: task journal-offline:32058 blocked for more than 241 seconds.
[47851.366900]       Tainted: P           O      5.3.0-13-generic #14-Ubuntu
[47851.366944] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47851.366994] journal-offline D    0 32058  27128 0x00000124
[47851.366998] Call Trace:
[47851.367007]  __schedule+0x2b9/0x6c0
[47851.367011]  schedule+0x42/0xb0
[47851.367014]  schedule_preempt_disabled+0xe/0x10
[47851.367017]  __mutex_lock.isra.0+0x182/0x4f0
[47851.367021]  __mutex_lock_slowpath+0x13/0x20
[47851.367023]  mutex_lock+0x2e/0x40
[47851.367145]  zil_commit_impl+0x148/0xd60 [zfs]
[47851.367224]  ? zpl_write_common_iovec+0x100/0x100 [zfs]
[47851.367227]  ? _cond_resched+0x19/0x30
[47851.367230]  ? mutex_lock+0x13/0x40
[47851.367309]  zil_commit+0x40/0x60 [zfs]
[47851.367387]  zpl_writepages+0xec/0x180 [zfs]
[47851.367392]  do_writepages+0x43/0xd0
[47851.367396]  __filemap_fdatawrite_range+0xc4/0x100
[47851.367400]  filemap_write_and_wait_range+0x36/0x90
[47851.367478]  zpl_fsync+0x3d/0xa0 [zfs]
[47851.367483]  vfs_fsync_range+0x49/0x80
[47851.367486]  ? __fget_light+0x57/0x70
[47851.367489]  do_fsync+0x3d/0x70
[47851.367491]  __x64_sys_fsync+0x14/0x20
[47851.367495]  do_syscall_64+0x5a/0x130
[47851.367498]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[47851.367501] RIP: 0033:0x7f872e0af237
[47851.367508] Code: Bad RIP value.
[47851.367510] RSP: 002b:00007f8729766cb0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[47851.367512] RAX: ffffffffffffffda RBX: 0000000000000014 RCX: 00007f872e0af237
[47851.367514] RDX: 0000000000000000 RSI: 00007f872dcb762c RDI: 0000000000000014
[47851.367515] RBP: 00007f872dcb8da0 R08: 00007f8729767700 R09: 00007f8729767700
[47851.367516] R10: 000000000000062c R11: 0000000000000293 R12: 0000000000000002
[47851.367518] R13: 0000000000000000 R14: 00005647819af200 R15: 00007ffce0398f10
[47851.367618] INFO: task pip:19670 blocked for more than 483 seconds.
[47851.367660]       Tainted: P           O      5.3.0-13-generic #14-Ubuntu
[47851.367704] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47851.367753] pip             D    0 19670  19606 0x00000120
[47851.367756] Call Trace:
[47851.367760]  __schedule+0x2b9/0x6c0
[47851.367764]  ? bit_wait_io+0x50/0x50
[47851.367766]  schedule+0x42/0xb0
[47851.367769]  bit_wait+0x11/0x50
[47851.367771]  __wait_on_bit+0x33/0xa0
[47851.367775]  __inode_wait_for_writeback+0xae/0xe0
[47851.367779]  ? var_wake_function+0x30/0x30
[47851.367782]  inode_wait_for_writeback+0x27/0x40
[47851.367785]  evict+0xbd/0x1b0
[47851.367787]  iput+0x141/0x1f0
[47851.367790]  do_unlinkat+0x1c5/0x2d0
[47851.367793]  __x64_sys_unlinkat+0x38/0x60
[47851.367795]  do_syscall_64+0x5a/0x130
[47851.367799]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[47851.367801] RIP: 0033:0x7f7a4ac39d77
[47851.367804] Code: Bad RIP value.
[47851.367806] RSP: 002b:00007fff4c816858 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
[47851.367808] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f7a4ac39d77
[47851.367809] RDX: 0000000000000000 RSI: 00007f7a44c4bda0 RDI: 0000000000000005
[47851.367811] RBP: 0000555a9b59b240 R08: 0000000000000000 R09: 00007fff4c816510
[47851.367812] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f7a4aad01b0
[47851.367814] R13: 00007f7a4aac8e08 R14: 0000555a9a878240 R15: 0000555a9b9eb460
[47851.367817] INFO: task sync:20009 blocked for more than 483 seconds.
[47851.367859]       Tainted: P           O      5.3.0-13-generic #14-Ubuntu
[47851.367902] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47851.367952] sync            D    0 20009  20008 0x00004120
[47851.367954] Call Trace:
[47851.367958]  __schedule+0x2b9/0x6c0
[47851.367961]  schedule+0x42/0xb0
[47851.367963]  wb_wait_for_completion+0x64/0x90
[47851.367966]  ? wait_woken+0x80/0x80
[47851.367968]  sync_inodes_sb+0xd7/0x290
[47851.367972]  sync_inodes_one_sb+0x15/0x20
[47851.367975]  iterate_supers+0xa3/0x110
[47851.367977]  ? page_cache_pipe_buf_steal.cold+0x1c/0x1c
[47851.367980]  ksys_sync+0x42/0xb0
[47851.367983]  __ia32_sys_sync+0xe/0x20
[47851.367985]  do_syscall_64+0x5a/0x130
[47851.367989]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[47851.367991] RIP: 0033:0x7f56f6bd7287
[47851.367994] Code: Bad RIP value.
[47851.367995] RSP: 002b:00007fff2c486078 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
[47851.367998] RAX: ffffffffffffffda RBX: 00007fff2c4861a8 RCX: 00007f56f6bd7287
[47851.367999] RDX: 00007f56f6eae001 RSI: 0000000000000000 RDI: 00007f56f6c754b0
[47851.368000] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[47851.368002] R10: 000055d597318010 R11: 0000000000000246 R12: 0000000000000001
[47851.368003] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

I can grab any other debugging information that would be helpful, just let me know what commands to run to grab it.

EDIT: Oh, and I suspect it may have SOMETHING to do with one particular LXC container. The container is stored in a ZFS dataset. Inside the container, the only thing running is Nextcloud, installed from Snapcraft.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Status: StaleNo recent activity for issue

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions