[Bug] CPU usage for firecracker grows out of control #3043
Description
Describe the bug
After some time, some firecracker
processes on our hosts grow to use as much CPU as possible (100%). Inside the guests, CPU usage is low (2-3%).
We have collected some diagnostics...
perf kvm
on a high-CPU firecracker
Analyze events for pid(s) 6641, all VCPUs:
VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
hlt 40833 99.69% 98.26% 0.32us 19528.58us 29.19us ( +- 5.93% )
interrupt 76 0.19% 1.74% 0.43us 9798.54us 277.33us ( +- 57.11% )
msr 49 0.12% 0.01% 0.71us 2.82us 1.55us ( +- 5.31% )
vintr 2 0.00% 0.00% 1.36us 2.16us 1.76us ( +- 22.73% )
Total Samples:40960, Total events handled time:1213240.92us.
perf kvm
on a normally operating firecracker
Analyze events for pid(s) 39114, all VCPUs:
VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
msr 351 59.69% 7.81% 0.73us 118078.80us 618.57us ( +- 61.33% )
hlt 106 18.03% 81.49% 69.82us 155820.39us 21370.64us ( +- 12.99% )
interrupt 75 12.76% 1.17% 0.44us 29505.95us 433.52us ( +- 90.73% )
npf 39 6.63% 9.53% 3.03us 139254.99us 6792.68us ( +- 58.72% )
cpuid 12 2.04% 0.00% 0.49us 1.97us 0.79us ( +- 15.77% )
vintr 5 0.85% 0.00% 0.91us 2.02us 1.30us ( +- 16.12% )
Total Samples:588, Total events handled time:2779849.81us.
Running strace
on a high-cpu firecracker outputs most ioctl
KVM_RUN
.
strace
on high-cpu firecracker
$ timeout 120 strace -fp 18146 -c
strace: Process 18146 attached with 3 threads
strace: Process 18146 detached
strace: Process 18207 detached
strace: Process 18255 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
53.61 43.381786 12155 3569 ioctl
41.21 33.342295 14347 2324 epoll_pwait
2.79 2.254346 539 4183 906 read
2.03 1.643915 361 4558 599 write
0.31 0.251635 450 559 lseek
0.03 0.027855 387 72 recvfrom
0.01 0.009457 131 72 epoll_ctl
0.00 0.002255 188 12 accept4
0.00 0.001461 122 12 close
------ ----------- ----------- --------- --------- ----------------
100.00 80.915005 15361 1505 total
strace
on normally-operating firecracker
$ timeout 120 strace -fp 522 -c
strace: Process 522 attached with 3 threads
strace: Process 522 detached
strace: Process 550 detached
strace: Process 612 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
52.57 0.406513 1099 370 ioctl
41.69 0.322366 1452 222 epoll_pwait
3.28 0.025333 86 294 65 read
2.12 0.016372 22 731 427 write
0.14 0.001062 15 72 epoll_ctl
0.12 0.000950 13 72 recvfrom
0.06 0.000442 37 12 accept4
0.04 0.000298 25 12 close
------ ----------- ----------- --------- --------- ----------------
100.00 0.773336 1785 492 total
(these 2 firecrackers were on the same host, running at the same time)
CPU usage inside the guest for a high-cpu firecracker
(green is normal, red is kernel and light blue is steal CPU)
To Reproduce
We do not know how to reproduce this. It happens seemingly randomly.
Expected behaviour
CPU usage of firecracker
should be proportional to CPU usage inside the guest.
Was this a known bug with the versions we're using? Would upgrading to a more recent firecracker
help?
Environment
- Firecracker version: mix of
v0.23.5
andv0.24.4
- Host kernel: mix of
5.11.7
and5.11.16
- Guest kernel:
5.12.2
- Rootfs: varies wildly, often ubuntu, debian or alpine
- Arch: x86_64
Additional context
This increases CPU usage and load on a host to very bad levels. VMs running on the host all experience significant steal even if we are not overprovisioning CPU.
Checks
- Have you searched the Firecracker Issues database for similar problems?
- Have you read the existing relevant Firecracker documentation?
- Are you certain the bug being reported is a Firecracker issue?
I am not certain this is a firecracker issue, but I don't know what else it could be at this point.