
[Bug] Expanding memory balloon causes VM to freeze #4990

Closed
@maggie-lou


Describe the bug

Even when there should be enough free memory in the VM, expanding the balloon sometimes causes the VM to freeze.

During a sample run (using the scripts linked below), after restoring the VM from a snapshot, free -h returned:

               total        used        free      shared  buff/cache   available
Mem:           108Mi        35Mi        45Mi       2.5Mi        35Mi        72Mi
Swap:             0B          0B          0B

The balloon was originally initialized to 5MB. Inflating it to 20MB succeeded. When I then inflated it to 30MB, the VM froze and repeated "Failed to update balloon stats, missing descriptor." errors appeared.

To Reproduce

You can use the scripts in this branch: #4989

  1. Build Firecracker with this patch: Fix unregistering memory ranges from UFFD when expanding the balloon #4988
  2. (Only needs to be run once) Prepare the rootfs and guest kernel: get_rootfs_guest_kernel.sh
  3. Run Firecracker: run_firecracker.sh
  4. Initialize a VM with a balloon and snapshot it: snapshot_vm.sh
  5. You will probably need to kill the previous Firecracker process and restart it: run_firecracker.sh
  6. Start the UFFD handler with the snapshot: run_uffd_handler.sh
  7. Expand the balloon: trigger_remove_events.sh
  8. Expand the balloon even more: if you edit trigger_remove_events.sh to inflate the balloon to 40MB, the VM freezes and "Failed to update balloon stats, missing descriptor." errors appear.
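For readers without access to the branch, steps 7 and 8 come down to a balloon resize request. The sketch below assumes trigger_remove_events.sh resizes the balloon through Firecracker's PATCH /balloon endpoint (with an amount_mib field) over the API Unix socket; the actual script may do this differently. The socket path and the helper names (UnixHTTPConnection, balloon_resize_request, resize_balloon) are hypothetical.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=10):
        # The host name is only used for the Host header.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def balloon_resize_request(amount_mib):
    """Build (method, path, body, headers) for a balloon resize."""
    if amount_mib < 0:
        raise ValueError("amount_mib must be non-negative")
    body = json.dumps({"amount_mib": amount_mib})
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    return "PATCH", "/balloon", body, headers


def resize_balloon(socket_path, amount_mib):
    """Send the resize request and return the HTTP status code."""
    method, path, body, headers = balloon_resize_request(amount_mib)
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request(method, path, body=body, headers=headers)
        response = conn.getresponse()
        response.read()  # drain the body so the connection closes cleanly
        return response.status
    finally:
        conn.close()


# Example (hypothetical socket path; use the one Firecracker was started with):
#   resize_balloon("/tmp/firecracker.socket", 20)   # succeeded in the report
#   resize_balloon("/tmp/firecracker.socket", 30)   # froze the VM in the report
```

Building the request separately from sending it makes it possible to check the payload without a running Firecracker process.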

Expected behavior

I expected the balloon to be able to expand to 30MB because there is 72Mi of memory available.
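The arithmetic behind this expectation can be written out as a rough sketch. It assumes the free -h output above was taken while the balloon was still at 5MB, treats MB and MiB as interchangeable, and takes the "available" column at face value, which is only the guest kernel's estimate. The function names are hypothetical.

```python
SAMPLE_FREE_OUTPUT = """\
               total        used        free      shared  buff/cache   available
Mem:           108Mi        35Mi        45Mi       2.5Mi        35Mi        72Mi
Swap:             0B          0B          0B
"""


def parse_mib(field):
    """Convert a `free -h` field such as '72Mi' or '0B' to MiB."""
    units = {"B": 1 / (1024 * 1024), "Ki": 1 / 1024, "Mi": 1.0, "Gi": 1024.0}
    for suffix in ("Ki", "Mi", "Gi", "B"):
        if field.endswith(suffix):
            return float(field[: -len(suffix)]) * units[suffix]
    raise ValueError(f"unrecognized size: {field!r}")


def available_mib(free_output):
    """Return the 'available' value of the Mem: row, in MiB."""
    lines = free_output.strip().splitlines()
    header = lines[0].split()
    mem_fields = next(l for l in lines if l.startswith("Mem:")).split()[1:]
    return parse_mib(mem_fields[header.index("available")])


def headroom_after_inflate(free_output, current_balloon_mib, target_balloon_mib):
    """Estimate how much 'available' memory would remain after inflating."""
    extra = target_balloon_mib - current_balloon_mib
    return available_mib(free_output) - extra


# Inflating from 5 to 30 takes 25 MiB more from the guest: 72 - 25 = 47.
print(headroom_after_inflate(SAMPLE_FREE_OUTPUT, 5, 30))  # 47.0
```

Under these assumptions the guest would still report roughly 47Mi available after the inflate, which is why the freeze is unexpected.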

Environment

Additional context

We are using UFFD to restore snapshots. The memory snapshots are quite large, so we're looking into memory balloons: the goal is for the UFFD handler to handle the memory ranges the balloon removes, so that we don't have to save those ranges in the snapshot files. We've noticed that the VM sometimes freezes when expanding the balloon, even when there should be sufficient memory.
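To make the goal concrete, here is a minimal model of the bookkeeping such a handler might keep. It is not our handler and not Firecracker's example handler, and all names are hypothetical. It assumes the handler records each range reported by a UFFD_EVENT_REMOVE event and later serves faults in those ranges with zero pages instead of snapshot contents. The userfaultfd calls themselves (for example UFFDIO_COPY and UFFDIO_ZEROPAGE) are left out.

```python
PAGE_SIZE = 4096  # assumed guest page size


class RemovedRanges:
    """Tracks removed guest memory as sorted, merged half-open [start, end) ranges."""

    def __init__(self):
        self._ranges = []

    def add(self, start, end):
        """Record a removed range, merging it with overlapping or adjacent ones."""
        if start >= end:
            raise ValueError("empty or inverted range")
        merged = []
        for s, e in self._ranges:
            if e < start or s > end:
                merged.append((s, e))  # disjoint: keep as is
            else:
                start, end = min(s, start), max(e, end)  # absorb into new range
        merged.append((start, end))
        merged.sort()
        self._ranges = merged

    def contains(self, addr):
        return any(s <= addr < e for s, e in self._ranges)

    def total_bytes(self):
        """Bytes that would not need to be stored in the snapshot file."""
        return sum(e - s for s, e in self._ranges)


def page_source(removed, fault_addr):
    """Decide how a fault would be served: 'zero' page or 'snapshot' contents."""
    page = fault_addr - (fault_addr % PAGE_SIZE)
    return "zero" if removed.contains(page) else "snapshot"


removed = RemovedRanges()
removed.add(0x1000, 0x3000)  # e.g. first remove event
removed.add(0x3000, 0x5000)  # adjacent range is merged with the first
print(page_source(removed, 0x1234), page_source(removed, 0x5000))
```

Merging adjacent ranges keeps the list short when the balloon releases many consecutive pages.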

Around the same time as the freeze, we always see the "Failed to update balloon stats, missing descriptor." errors, as well as vsock connection errors (VIRTIO_VSOCK_OP_RST).

I've tried disabling async page faults, in case the freeze was caused by a race condition in the kernel, but the problem persists.

Checks

  • Have you searched the Firecracker Issues database for similar problems?
  • Have you read the existing relevant Firecracker documentation?
  • Are you certain the bug being reported is a Firecracker issue?

No. It could be a Linux bug as well. However, we've read about cases where people appear to be using UFFD together with the balloon successfully, so this use case should be possible.

Labels: Status: Awaiting author, Type: Bug
