[BUG] linux-azure kernel segfault in netfs module, netfs_rreq_unlock causes kernel panic on nodes #4726
Description
Describe the bug
This is a segmentation fault which exists in the netfs module of the linux-azure kernel (5.15.0-1075-azure). This was fixed in a later version, but not patched in the current AKS vm image. We've observed it on nodes with the cephfs module loaded.
To Reproduce
Steps to reproduce the behavior:
- Load cephfs kernel module (may use rook-ceph provisioner).
- Unknown system load or time characteristic. (may be correlated with high number of disk read operations but that's not confirmed)
- Kernel panic shows in boot diagnostics for vmss instance, stateful workloads will experience ~5-10 minutes of downtime.
Expected behavior
Correct handling of XA_RETRY_ENTRY so that address 0000000000000402 is not dereferenced.
via https://github.com/torvalds/linux/blob/v5.15/fs/netfs/read_helper.c#L406 : On or after the first iteration of netfs_rreq_unlock, page
can have the value XA_RETRY_ENTRY (returned by xas_find() in xas_for_each), which needs to be properly handled.
Screenshots
[87534.602454] BUG: kernel NULL pointer dereference, address: 0000000000000402
[87534.606859] #PF: supervisor read access in kernel mode
[87534.609959] #PF: error_code(0x0000) - not-present page
[87534.613243] PGD 0 P4D 0
[87534.615278] Oops: 0000 [#1] SMP NOPTI
[87534.617686] CPU: 4 PID: 2688731 Comm: kworker/4:2 Not tainted 5.15.0-1075-azure #84-Ubuntu
[87534.622366] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 08/23/2024
[87534.628290] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[87534.632247] RIP: 0010:netfs_rreq_unlock+0xf7/0x3c0 [netfs]
[87534.635624] Code: 7d a0 4c 89 fe 45 31 f6 e8 c6 7d c8 ed 4c 8b 45 90 48 85 c0 48 89 c7 0f 84 33 01 00 00 4d 8d 48 50 4c 89 fa 45 31 e4 4d 89 cf <48> 8b 0f 48 8b 47 20 48 2b 45 98 48 c1 e9 10 c1 e0 0c 83 e1 01 80
[87534.645989] RSP: 0018:ffffb1cc927bfac0 EFLAGS: 00010246
...
6.8.0-1018-azure/kernel/fs/netfs/netfs.ko (correct handling of signal, taken from another VM, non-AKS):
5.15.0-1075-azure/kernel/fs/netfs/netfs.ko (segfault exists):
Environment:
- azure-cli 2.60.0
- Kubernetes 1.28.x
- Linux 5.15.0-1075-azure ( AKS node image AKSUbuntu-2204gen2containerd-202410.09.0 )
Additional context
https://ubuntu.com/security/CVE-2023-52582
Activity