
VM killed with "out of PoD memory" error #7023

Closed
marmarek opened this issue Oct 28, 2021 · 18 comments
Labels
C: kernel · diagnosed (Technical diagnosis has been performed; see issue comments) · P: major (Priority: major; between "default" and "critical" in severity) · r4.0-dom0-stable · r4.1-dom0-stable · T: bug (Type: bug report; a problem or defect resulting in unintended behavior in something that exists)

Comments

@marmarek
Member

How to file a helpful issue

Qubes OS release

R4.0, R4.1

Brief summary

A VM running a recent kernel is sometimes killed during startup, with an "out of PoD memory" error in the Xen log.

Affected versions:

  • >= 5.4.150 (R4.0)
  • >= 5.10.70 (R4.1)
  • >= 5.15-rc2 (unreleased)

Steps to reproduce

  1. Update VM kernel (kernel-qubes-vm package) to the affected version
  2. Try to start a VM. This especially affects VMs with a low "memory" property or a high "maxmem" property (see the reproduction sketch below).
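
For illustration, a minimal reproduction sketch along the lines of the settings used later in this thread ("work" is only an example qube name, and the exact values that trigger the crash vary between machines):

# Shrink the initial memory, keep maxmem high, then start/stop the qube in a loop
# and watch the Xen console log for the PoD message.
qvm-prefs work memory 300
qvm-prefs work maxmem 4000
for i in $(seq 20); do qvm-start work && sleep 3 && qvm-shutdown --wait work; done
grep "out of PoD memory" /var/log/xen/console/hypervisor.log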

Expected behavior

Successful startup (as with earlier kernels).

Actual behavior

Sometimes the VM crashes like this (from /var/log/xen/console/hypervisor.log):

[2021-10-28 06:03:10] (XEN) p2m_pod_demand_populate: Dom18 out of PoD memory! (tot=182246 ents=66825 dom18)
[2021-10-28 06:03:10] (XEN) domain_crash called from p2m-pod.c:1218
[2021-10-28 06:03:10] (XEN) Domain 18 (vcpu#0) crashed on cpu#1:
[2021-10-28 06:03:10] (XEN) ----[ Xen-4.8.5-35.fc25  x86_64  debug=n   Not tainted ]----
[2021-10-28 06:03:10] (XEN) CPU:    1
[2021-10-28 06:03:10] (XEN) RIP:    0010:[<ffffffff81a058dc>]
[2021-10-28 06:03:10] (XEN) RFLAGS: 0000000000010246   CONTEXT: hvm guest (d18v0)
[2021-10-28 06:03:10] (XEN) rax: 00007ffc82fcd5b0   rbx: 0000000000001000   rcx: 0000000000000200
[2021-10-28 06:03:10] (XEN) rdx: 0000000000000000   rsi: 00007ffc82fcc5b0   rdi: ffff888012e05000
[2021-10-28 06:03:10] (XEN) rbp: 0000000000001000   rsp: ffffc90000807cc8   r8:  ffff888012e05000
[2021-10-28 06:03:10] (XEN) r9:  0000000000001000   r10: 0000000000001000   r11: ffff88803a6456e8
[2021-10-28 06:03:10] (XEN) r12: ffff888012e06000   r13: ffffc90000807e40   r14: 0000000000001000
[2021-10-28 06:03:10] (XEN) r15: 0000000000000000   cr0: 0000000080050033   cr4: 00000000000606f0
[2021-10-28 06:03:10] (XEN) cr3: 0000000024a80005   cr2: 0000704ab5d33000
[2021-10-28 06:03:10] (XEN) fsb: 0000704ab5d49800   gsb: ffff8880bc800000   gss: 0000000000000000
[2021-10-28 06:03:10] (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0018   cs: 0010
[2021-10-28 06:03:10] (XEN) p2m_pod_demand_populate: Dom18 out of PoD memory! (tot=182246 ents=66825 dom18)
[2021-10-28 06:03:10] (XEN) domain_crash called from p2m-pod.c:1218

Additional context

The 5.10.71 kernel is the one included in the R4.1.0-rc1 installation image. The issue is especially painful when it interrupts template installation: the template may then lack menu entries, lack some properties/features, or in the worst case fail to install at all. This was reported several times on the forum.

This appears to be fallout from torvalds/linux@8480ed9c2bbd

Thread discussing the issue upstream: https://lore.kernel.org/xen-devel/912c7377-26f0-c14a-e3aa-f00a81ed5766@suse.com/T/#u
And earlier issue with the same change: https://lore.kernel.org/xen-devel/YVxTp9rWmxv0wYBl@mail-itl/T/#u

@marmarek marmarek added T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. C: kernel P: major Priority: major. Between "default" and "critical" in severity. labels Oct 28, 2021
@marmarek marmarek added this to the Release 4.0 updates milestone Oct 28, 2021
@marmarek
Member Author

There is already a patch proposed for the issue. According to my testing, it does fix the problem, but it introduces a delay of a few seconds on VM startup. The delay depends on the maxmem-memory difference. In particular, if maxmem == memory, there is no delay, and in fact the VM starts faster than before all those changes (i.e., with kernels older than 5.10.70).

Which brings an idea that perhaps we should change how we start VMs:

  • currently: started with maxmem high (4000M by default) and balloon down on start to the initial memory size ("memory" property, 400M default)
  • alternative to consider: start with maxmem==memory (400M) and use memory hotplug when more memory is needed

This would require changes to the libvirt config (easy) and to qmemman (possibly less easy). The current approach is used because memory hotplug used to be unstable, but I believe that has not been the case for a long time now.
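
As a rough per-qube illustration of the maxmem == memory case mentioned above (this is not the proposed qmemman/libvirt change itself, just a sketch using the existing qvm-prefs properties; whether qmemman still tries to balance such a qube would need checking):

# Give the qube a fixed allocation: with no ballooning range there is nothing for PoD to back-fill.
qvm-prefs work memory 2000
qvm-prefs work maxmem 2000
qvm-start work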

@marmarek
Member Author

* alternative to consider: start with `maxmem==memory` (400M) and use memory hotplug when more memory is needed

This could potentially allow changing maxmem in runtime (without restarting the VM).

@brendanhoar

brendanhoar commented Oct 28, 2021

The alternative would likely also reduce the instances of failed starts based on current memory availability.

Edit: @marmarek addressed my misunderstanding above. Thanks.

@marmarek
Member Author

The alternative would likely also reduce the instances of failed starts based on current memory availability.

You mean the "Not enough memory to start domain" case? No, that is a different thing. That one is about the availability of the "400 MB", not the "4000 MB". In either case you need that initial amount of memory to be available to start a qube.

marmarek added a commit to QubesOS/qubes-linux-kernel that referenced this issue Oct 29, 2021
Another fix for thread-based balloon driver

Fixes QubesOS/qubes-issues#7023

(cherry picked from commit c7f2b45)
@qubesos-bot

Automated announcement from builder-github

The component linux-kernel-5-4 (including package kernel-5.4.156-1.fc25.qubes) has been pushed to the r4.0 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

@andrewdavidwong andrewdavidwong added the diagnosed Technical diagnosis has been performed (see issue comments). label Oct 29, 2021
marmarek added a commit to QubesOS/qubes-linux-kernel that referenced this issue Oct 29, 2021
Another fix for thread-based balloon driver

Fixes QubesOS/qubes-issues#7023

(cherry picked from commit c7f2b45)
@jevank

jevank commented Oct 29, 2021

Pretty strange that I missed the problem; I have been living with the 5.4.153 kernel for about a week. Will try to reproduce.

@marmarek
Member Author

It's easier to reproduce on slower systems, or with lower initial VM memory.

@qubesos-bot

Automated announcement from builder-github

The component linux-kernel (including package kernel-5.10.76-1.fc32.qubes) has been pushed to the r4.1 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The component linux-kernel-latest (including package kernel-latest-5.14.15-1.fc25.qubes) has been pushed to the r4.0 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The component linux-kernel-latest (including package kernel-latest-5.14.15-1.fc32.qubes) has been pushed to the r4.1 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

@jevank

jevank commented Oct 29, 2021

It's easier to reproduce on slower systems, or with lower initial VM memory.

In case you are interested: I can't reproduce the issue with 5.4.153. Domain memory of less than 350 is not enough to start at all; I tested a PVH VM with these values:

qvm-prefs work memory 350
qvm-prefs work maxmem 3974
for i in $(seq 60); do qvm-start work && sleep 3 && qvm-shutdown work ; sleep 3 ; done

No issues on X1C6

@jevank

jevank commented Oct 29, 2021

Sorry, I got the error after all. With memory set to 300 I can catch it. Previously loglvl was none, so I had missed the messages.
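
For anyone else who had loglvl turned down: a hedged sketch of raising the Xen console log level so these guest crash messages are not filtered out. loglvl and guest_loglvl are standard Xen boot options; the file locations below assume a GRUB-booted dom0 and will differ on EFI installs (xen.cfg):

# Add the Xen options to GRUB_CMDLINE_XEN_DEFAULT, rebuild the GRUB config, reboot:
#   loglvl=all guest_loglvl=all
sudoedit /etc/default/grub
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot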

@qubesos-bot

Automated announcement from builder-github

The component linux-kernel (including package kernel-5.10.76-1.fc32.qubes) has been pushed to the r4.1 stable repository for dom0.
To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The component linux-kernel-latest (including package kernel-latest-5.14.15-1.fc32.qubes) has been pushed to the r4.1 stable repository for dom0.
To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The component linux-kernel-5-4 (including package kernel-5.4.156-1.fc25.qubes) has been pushed to the r4.0 stable repository for dom0.
To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The component linux-kernel-latest (including package kernel-latest-5.14.15-1.fc25.qubes) has been pushed to the r4.0 stable repository for dom0.
To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The component linux-kernel (including package kernel-5.15.46-2.fc32.qubes) has been pushed to the r4.1 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The component linux-kernel (including package kernel-5.15.52-1.fc32.qubes) has been pushed to the r4.1 stable repository for dom0.
To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update
