
[BUG] linux-azure kernel segfault in netfs module, netfs_rreq_unlock causes kernel panic on nodes #4726

Open
@kro-cat

Description

Describe the bug
This is an invalid memory access (reported by the kernel as a NULL pointer dereference) in the netfs module of the linux-azure kernel (5.15.0-1075-azure). It is fixed in later kernel versions, but the fix has not been backported to the kernel in the current AKS VM image. We've observed it on nodes with the cephfs module loaded.

To Reproduce
Steps to reproduce the behavior:

  1. Load the cephfs kernel module (for example, via the rook-ceph provisioner).
  2. Wait for the trigger. The exact load or timing condition is unknown; it may correlate with a high number of disk read operations, but that is not confirmed.
  3. A kernel panic appears in the boot diagnostics for the VMSS instance, and stateful workloads experience ~5-10 minutes of downtime.

Expected behavior
Correct handling of XA_RETRY_ENTRY so that address 0000000000000402 is not dereferenced.
See https://github.com/torvalds/linux/blob/v5.15/fs/netfs/read_helper.c#L406 : on or after the first iteration of the loop in netfs_rreq_unlock, page can hold the value XA_RETRY_ENTRY (returned by xas_find() inside xas_for_each), which must be handled before page is dereferenced.
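
To illustrate why the faulting address is 0x402, here is a minimal user-space sketch in C. It mirrors the encoding used by `xa_mk_internal()` in `include/linux/xarray.h`, where `XA_RETRY_ENTRY` is defined as `xa_mk_internal(256)`. The names `mk_internal`, `RETRY_ENTRY`, `is_retry`, `fake_page` and `count_real_pages` are hypothetical stand-ins for this example, not kernel symbols, and the loop is a simplified model of the skip that `xas_retry()` provides (the real helper also resets the iteration state and skips zero entries). It is not the upstream patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same arithmetic as the kernel's xa_mk_internal(): internal entries are
 * encoded as (v << 2) | 2, so they never look like an aligned pointer. */
static inline uintptr_t mk_internal(uintptr_t v)
{
    return (v << 2) | 2;
}

/* Stand-in for XA_RETRY_ENTRY, which the kernel defines as
 * xa_mk_internal(256): (256 << 2) | 2 == 0x402. */
#define RETRY_ENTRY mk_internal(256)

/* Simplified stand-in for xa_is_retry(). */
static inline bool is_retry(uintptr_t entry)
{
    return entry == RETRY_ENTRY;
}

/* Hypothetical stand-in for struct page; only the first field matters. */
struct fake_page {
    unsigned long flags;
};

/* Walk an array of slots the way an xarray iteration hands out entries.
 * Without the is_retry() check, a retry entry would be cast to a pointer
 * and read at address 0x402, as in the oops reported in this issue. */
static size_t count_real_pages(const uintptr_t *slots, size_t n)
{
    size_t real = 0;

    for (size_t i = 0; i < n; i++) {
        if (is_retry(slots[i]))
            continue; /* skip the internal entry instead of dereferencing it */

        const struct fake_page *page = (const struct fake_page *)slots[i];
        (void)page->flags; /* safe: this slot holds a real pointer */
        real++;
    }
    return real;
}
```

The computed value matches the address in the oops, which is consistent with the first field of `page` being read through a retry entry that was treated as a pointer.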

Screenshots

[87534.602454] BUG: kernel NULL pointer dereference, address: 0000000000000402
[87534.606859] #PF: supervisor read access in kernel mode
[87534.609959] #PF: error_code(0x0000) - not-present page
[87534.613243] PGD 0 P4D 0 
[87534.615278] Oops: 0000 [#1] SMP NOPTI
[87534.617686] CPU: 4 PID: 2688731 Comm: kworker/4:2 Not tainted 5.15.0-1075-azure #84-Ubuntu
[87534.622366] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 08/23/2024
[87534.628290] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[87534.632247] RIP: 0010:netfs_rreq_unlock+0xf7/0x3c0 [netfs]
[87534.635624] Code: 7d a0 4c 89 fe 45 31 f6 e8 c6 7d c8 ed 4c 8b 45 90 48 85 c0 48 89 c7 0f 84 33 01 00 00 4d 8d 48 50 4c 89 fa 45 31 e4 4d 89 cf <48> 8b 0f 48 8b 47 20 48 2b 45 98 48 c1 e9 10 c1 e0 0c 83 e1 01 80
[87534.645989] RSP: 0018:ffffb1cc927bfac0 EFLAGS: 00010246
...

6.8.0-1018-azure/kernel/fs/netfs/netfs.ko (correct handling of signal, taken from another VM, non-AKS):
Image

5.15.0-1075-azure/kernel/fs/netfs/netfs.ko (segfault exists):
Image

Environment:

  • azure-cli 2.60.0
  • Kubernetes 1.28.x
  • Linux 5.15.0-1075-azure (AKS node image AKSUbuntu-2204gen2containerd-202410.09.0)

Additional context
https://ubuntu.com/security/CVE-2023-52582

https://access.redhat.com/solutions/6993035

torvalds/linux@7e043a8
