Support for VMGenID device on x86 microVMs #4487

Merged on Apr 9, 2024 (8 commits)
10 changes: 10 additions & 0 deletions CHANGELOG.md
Expand Up @@ -17,6 +17,16 @@ and this project adheres to
without MPTable support. Please see our
[kernel policy documentation](docs/kernel-policy.md) for more information
regarding relevant kernel configurations.
- [#4487](https://github.com/firecracker-microvm/firecracker/pull/4487): Added
support for the Virtual Machine Generation Identifier (VMGenID) device on
x86_64 platforms. VMGenID is a virtual device that allows VMMs to notify
guests when they are resumed from a snapshot. Linux includes VMGenID support
since version 5.18. It uses notifications from the device to reseed its
internal CSPRNG. Please refer to
[snapshot support](docs/snapshotting/snapshot-support.md) and
[random for clones](docs/snapshotting/random-for-clones.md) documentation for
more info on VMGenID. VMGenID state is part of the snapshot format of
Firecracker. As a result, the Firecracker snapshot version is now 2.0.0.

### Changed

73 changes: 60 additions & 13 deletions docs/snapshotting/random-for-clones.md
Expand Up @@ -22,17 +22,19 @@ which wraps the [`AWS-LC` cryptographic library][9].

Traditionally, `/dev/random` has been considered a source of “true” randomness,
with the downside that reads block when the pool of entropy gets depleted. On
the other hand, `/dev/urandom` doesn’t block, but provides lower quality
results. It turns out the distinction in output quality is actually very hard to
make. According to [this article][2], for kernel versions prior to 4.8, both
devices draw their output from the same pool, with the exception that
`/dev/random` will block when the system estimates the entropy count has
decreased below a certain threshold. The `/dev/urandom` output is considered
secure for virtually all purposes, with the caveat that using it before the
system gathers sufficient entropy for initialization may indeed produce low
quality random numbers. The `getrandom` syscall helps with this situation; it
uses the `/dev/urandom` source by default, but will block until it gets properly
initialized (the behavior can be altered via configuration flags).
the other hand, `/dev/urandom` doesn’t block, which led people to believe that
it provides lower quality results.

It turns out the distinction in output quality is actually very hard to make.
According to [this article][2], for kernel versions prior to 4.8, both devices
draw their output from the same pool, with the exception that `/dev/random` will
block when the system estimates the entropy count has decreased below a certain
threshold. The `/dev/urandom` output is considered secure for virtually all
purposes, with the caveat that using it before the system gathers sufficient
entropy for initialization may indeed produce low quality random numbers. The
`getrandom` syscall helps with this situation; it uses the `/dev/urandom` source
by default, but will block until it gets properly initialized (the behavior can
be altered via configuration flags).

Newer kernels (4.8+) have switched to an implementation where `/dev/random`
output comes from a pool called the blocking pool, the output of `/dev/urandom`
Expand All @@ -41,6 +43,8 @@ and there’s also an input pool which gathers entropy from various sources
available on the system, and is used to feed into or seed the other two
components. A very detailed description is available [here][3].

### Linux kernels from 4.8 until 5.17 (included)

The details of this newer implementation are used to make the recommendations
present in the document. There are in-kernel interfaces used to obtain random
numbers as well, but they are similar to using `/dev/urandom` (or `getrandom`
Expand Down Expand Up @@ -99,6 +103,42 @@ not increase the current entropy estimation. There is also an `ioctl` interface
which, given the appropriate privileges, can be used to add data to the input
entropy pool while also increasing the count, or completely empty all pools.

### Linux kernels from 5.18 onwards

Since version 5.18, Linux has support for the
[Virtual Machine Generation Identifier](https://learn.microsoft.com/en-us/windows/win32/hyperv_v2/virtual-machine-generation-identifier).
The purpose of VMGenID is to notify the guest about time shift events, such as
resuming from a snapshot. The device exposes a 16-byte cryptographically random
identifier in guest memory. Firecracker implements VMGenID. When resuming a
microVM from a snapshot, Firecracker writes a new identifier and injects a
notification to the guest. Linux
[uses this value](https://elixir.bootlin.com/linux/v5.18.19/source/drivers/virt/vmgenid.c#L77)
[as new randomness for its CSPRNG](https://elixir.bootlin.com/linux/v5.18.19/source/drivers/char/random.c#L908).
Quoting the random.c implementation of the kernel:

```
/*
* Handle a new unique VM ID, which is unique, not secret, so we
* don't credit it, but we do immediately force a reseed after so
* that it's used by the crng posthaste.
*/
```

As a result, values returned by `getrandom()` and `/dev/(u)random` are distinct
in all VMs started from the same snapshot, **after** the kernel handles the
VMGenID notification. This leaves a race window between resuming vCPUs and the
Linux CSPRNG being re-seeded. In Linux 6.8, we
[extended VMGenID](https://lore.kernel.org/lkml/20230531095119.11202-2-bchalios@amazon.es/)
to emit a uevent to user space when it handles the notification. User space can
wait for this uevent to know when it is safe to use `getrandom()` and related
interfaces, avoiding the race condition.
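
As a rough illustration of the user space side, the sketch below parses a raw
kernel uevent payload and decides whether it announces a new generation ID. It
assumes the driver tags its `change` uevent with a `NEW_VMGENID=1` variable;
that variable name is an assumption to verify against the guest kernel's
`drivers/virt/vmgenid.c`. The function names and the device path used in the
usage note are hypothetical. Receiving the payload (typically from a
`NETLINK_KOBJECT_UEVENT` netlink socket) is not shown.

```rust
use std::collections::HashMap;

/// Split a raw kernel uevent payload into its KEY=VALUE pairs.
/// The payload is a sequence of NUL-terminated strings. Fields without '='
/// (normally just the "action@devpath" header) are skipped.
pub fn parse_uevent(raw: &[u8]) -> HashMap<String, String> {
    raw.split(|&b| b == 0)
        .filter_map(|field| std::str::from_utf8(field).ok())
        .filter_map(|field| field.split_once('='))
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect()
}

/// Return true if the payload looks like a VMGenID notification.
/// ASSUMPTION: the driver sets `NEW_VMGENID=1` on a `change` uevent.
pub fn is_vmgenid_resume(raw: &[u8]) -> bool {
    let vars = parse_uevent(raw);
    vars.get("ACTION").map(String::as_str) == Some("change")
        && vars.get("NEW_VMGENID").map(String::as_str) == Some("1")
}
```

A process that must not use `getrandom()` output before the reseed could block
on the uevent socket and call `is_vmgenid_resume` on each received message,
for example on a payload such as
`change@/devices/.../QEMUVGID:00\0ACTION=change\0...\0NEW_VMGENID=1\0`.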

Please note that Firecracker always enables VMGenID. In kernels earlier than
5.18, where there is no VMGenID driver, the device has no effect in the guest.

### User space considerations

Init systems (such as `systemd` used by AL2 and other distros) might save a
random seed file after boot. For `systemd`, the path is
`/var/lib/systemd/random-seed`. Just to be on the safe side, any such file
Expand All @@ -121,8 +161,8 @@ alter the read result via bind mounting another file on top of
and should be sufficient for most cases.
- Use `virtio-rng`. When present, the guest kernel uses the device as an
additional source of entropy.
- To be as safe as possible, the direct approach is to do the following (before
customer code is resumed in the clone):
- On kernels before 5.18, to be as safe as possible, the direct approach is to
do the following (before customer code is resumed in the clone):
1. Open one of the special devices files (either `/dev/random` or
`/dev/urandom`). Take note that `RNDCLEARPOOL` no longer
[has any effect][7] on the entropy pool.
Expand All @@ -133,6 +173,13 @@ alter the read result via bind mounting another file on top of
1. Issue a `RNDRESEEDCRNG` ioctl call ([4.14][5], [5.10][6], (requires
`CAP_SYS_ADMIN`)) that specifically causes the `CSPRNG` to be reseeded from
the input pool.
- On kernels from 5.18 onwards, the CSPRNG is automatically reseeded when the
guest kernel handles the VMGenID notification. To completely avoid the race
condition, users should follow the same steps as with kernels \< 5.18.
- On kernels from 6.8 onwards, users can wait for the uevent that the VMGenID
driver sends once the CSPRNG has been reseeded after handling the VMGenID
notification.
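
For readers who prefer Rust over the C program in the annex, the following is a
minimal sketch of the data involved in the manual reseed path, not a complete
implementation. It derives the `RNDADDENTROPY` and `RNDRESEEDCRNG` request
numbers and serializes a `struct rand_pool_info` payload. It assumes the
generic Linux ioctl encoding used on x86_64 and aarch64; the helper names
`ioc` and `rand_pool_info` are illustrative.

```rust
/// Encode an ioctl request number as the generic Linux `_IOC` macro does.
/// ASSUMPTION: generic layout (x86_64, aarch64); some architectures differ.
const fn ioc(dir: u32, ty: u8, nr: u8, size: u32) -> u32 {
    (dir << 30) | (size << 16) | ((ty as u32) << 8) | (nr as u32)
}

const IOC_NONE: u32 = 0;
const IOC_WRITE: u32 = 1;

/// `_IOW('R', 0x03, int[2])`: mix data into the input pool and credit entropy.
pub const RNDADDENTROPY: u32 = ioc(IOC_WRITE, b'R', 0x03, 8);
/// `_IO('R', 0x07)`: force the CSPRNG to reseed from the input pool.
pub const RNDRESEEDCRNG: u32 = ioc(IOC_NONE, b'R', 0x07, 0);

/// Serialize a `struct rand_pool_info`: `entropy_count` (in bits), `buf_size`
/// (in bytes), followed by the entropy bytes, all in native byte order.
pub fn rand_pool_info(entropy_bits: i32, data: &[u8]) -> Vec<u8> {
    assert!(data.len() <= i32::MAX as usize, "payload too large");
    let mut out = Vec::with_capacity(8 + data.len());
    out.extend_from_slice(&entropy_bits.to_ne_bytes());
    out.extend_from_slice(&(data.len() as i32).to_ne_bytes());
    out.extend_from_slice(data);
    out
}
```

Issuing the ioctls themselves requires `CAP_SYS_ADMIN` and an FFI binding to
`ioctl(2)` (for example through the `libc` crate), which is left out here to
keep the sketch dependency-free.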

**Annex 1 contains the source code of a C program which implements the previous
three steps.** As soon as the guest kernel version switches to 4.19 (or higher),
49 changes: 40 additions & 9 deletions docs/snapshotting/snapshot-support.md
Expand Up @@ -146,6 +146,10 @@ The snapshot functionality is still in developer preview due to the following:
- If a [CPU template](../cpu_templates/cpu-templates.md) is not used on x86_64,
overwrites of `MSR_IA32_TSX_CTRL` MSR value will not be preserved after
restoring from a snapshot.
- Resuming from a snapshot that was taken during the early stages of the guest
kernel boot might lead to crashes upon snapshot resume. We suggest that users
take snapshots only after the guest kernel has finished booting. Please see
[VMGenID device limitation](#vmgenid-device-limitation).

## Firecracker Snapshotting characteristics

Expand Down Expand Up @@ -571,15 +575,32 @@ we also consider microVM A insecure if it resumes execution.

### Reusing snapshotted states securely

We are currently working to add a functionality that will notify guest operating
systems of the snapshot event in order to enable secure reuse of snapshotted
microVM states, guest operating systems, language runtimes, and cryptographic
libraries. In some cases, user applications will need to handle the snapshot
create/restore events in such a way that the uniqueness and randomness
properties are preserved and guaranteed before resuming the workload.

We've started a discussion on how the Linux operating system might securely
handle being snapshotted [here](https://lkml.org/lkml/2020/10/16/629).
[Virtual Machine Generation Identifier](https://learn.microsoft.com/en-us/windows/win32/hyperv_v2/virtual-machine-generation-identifier)
(VMGenID) is a virtual device that allows VM guests to detect when they have
resumed from a snapshot. It works by exposing a cryptographically random
16-byte identifier to the guest. The VMM ensures that the value of the
identifier changes every time a time shift happens in the lifecycle of the VM,
e.g. when it resumes from a snapshot.

Linux has supported VMGenID since version 5.18. When Linux detects a change in
the identifier, it uses its value to reseed its internal PRNG. Moreover,
[since version 6.8](https://lkml.org/lkml/2023/5/31/414) the Linux VMGenID
driver also emits a uevent to user space. User space processes can monitor this
uevent to detect snapshot resume events.

Firecracker supports the VMGenID device on x86 platforms. Firecracker will
always enable the device. During snapshot resume, Firecracker will update the
16-byte generation ID and inject a notification in the guest before resuming its
vCPUs.

As a result, guests that run Linux versions >= 5.18 will re-seed their in-kernel
PRNG upon snapshot resume. User space applications can rely on the guest kernel
for randomness. State other than the guest kernel entropy pool, such as unique
identifiers, cached random numbers, cryptographic tokens, etc., **will** still
be replicated across multiple microVMs resumed from the same snapshot. Users
need to implement mechanisms for ensuring de-duplication of such state, where
needed. On guests that run Linux versions >= 6.8, users can make use of the
uevent that the VMGenID driver emits upon resuming from a snapshot, to be
notified about snapshot resume events.
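
One possible shape for such a de-duplication mechanism is sketched below: a
small wrapper that remembers in which "generation" a value was produced and
rebuilds it after a resume event. `PerCloneState` and its methods are
hypothetical names, not part of Firecracker or any library. How the resume
event reaches the application (for instance via the VMGenID uevent) is left to
the caller.

```rust
/// Hypothetical holder for a value that must not be shared between clones,
/// e.g. a unique identifier or a cached random token.
pub struct PerCloneState<T> {
    /// Incremented every time a snapshot resume is observed.
    generation: u64,
    /// The cached value together with the generation it was created in.
    value: Option<(u64, T)>,
}

impl<T> PerCloneState<T> {
    pub fn new() -> Self {
        Self { generation: 0, value: None }
    }

    /// Call this when a snapshot resume event is observed.
    pub fn on_resume(&mut self) {
        self.generation += 1;
    }

    /// Return the cached value if it was created in the current generation,
    /// otherwise rebuild it with `make` and cache the new value.
    pub fn get_or_refresh(&mut self, make: impl FnOnce() -> T) -> &T {
        let fresh = matches!(&self.value, Some((g, _)) if *g == self.generation);
        if !fresh {
            self.value = Some((self.generation, make()));
        }
        &self.value.as_ref().unwrap().1
    }
}
```

In practice `make` would draw new bytes from the guest kernel, after the kernel
has handled the VMGenID notification, so that each clone ends up with its own
value.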

## Vsock device limitation

Expand All @@ -605,6 +626,16 @@ section 5.10.6.6 Device Events.
Firecracker handles sending the `reset` event to the vsock driver, thus the
customers are no longer responsible for closing active connections.

## VMGenID device limitation

During snapshot resume, Firecracker updates the 16-byte generation ID of the
VMGenID device and injects an interrupt in the guest before resuming vCPUs. If
the snapshot was taken at the very early stages of the guest kernel boot
process, proper interrupt handling might not be in place yet. As a result, the
kernel might not be able to handle the injected notification and crash. We
suggest that users take snapshots only after the guest kernel has completed
booting, to avoid this issue.

## Snapshot compatibility across kernel versions

We have a mechanism in place to experiment with snapshot compatibility across
15 changes: 12 additions & 3 deletions src/vmm/src/acpi/mod.rs
Expand Up @@ -2,13 +2,14 @@
// SPDX-License-Identifier: Apache-2.0

use acpi_tables::fadt::{FADT_F_HW_REDUCED_ACPI, FADT_F_PWR_BUTTON, FADT_F_SLP_BUTTON};
use acpi_tables::{Dsdt, Fadt, Madt, Rsdp, Sdt, Xsdt};
use acpi_tables::{Aml, Dsdt, Fadt, Madt, Rsdp, Sdt, Xsdt};
use log::{debug, error};
use vm_allocator::AllocPolicy;

use crate::acpi::x86_64::{
apic_addr, rsdp_addr, setup_arch_dsdt, setup_arch_fadt, setup_interrupt_controllers,
};
use crate::device_manager::acpi::ACPIDeviceManager;
use crate::device_manager::mmio::MMIODeviceManager;
use crate::device_manager::resources::ResourceAllocator;
use crate::vstate::memory::{GuestAddress, GuestMemoryMmap};
Expand Down Expand Up @@ -74,12 +75,19 @@ impl<'a> AcpiTableWriter<'a> {
}

/// Build the DSDT table for the guest
fn build_dsdt(&mut self, mmio_device_manager: &MMIODeviceManager) -> Result<u64, AcpiError> {
fn build_dsdt(
&mut self,
mmio_device_manager: &MMIODeviceManager,
acpi_device_manager: &ACPIDeviceManager,
) -> Result<u64, AcpiError> {
let mut dsdt_data = Vec::new();

// Virtio-devices DSDT data
dsdt_data.extend_from_slice(&mmio_device_manager.dsdt_data);

// Add GED and VMGenID AML data.
acpi_device_manager.append_aml_bytes(&mut dsdt_data);

// Architecture specific DSDT data
setup_arch_dsdt(&mut dsdt_data);

Expand Down Expand Up @@ -155,14 +163,15 @@ pub(crate) fn create_acpi_tables(
mem: &GuestMemoryMmap,
resource_allocator: &mut ResourceAllocator,
mmio_device_manager: &MMIODeviceManager,
acpi_device_manager: &ACPIDeviceManager,
vcpus: &[Vcpu],
) -> Result<(), AcpiError> {
let mut writer = AcpiTableWriter {
mem,
resource_allocator,
};

let dsdt_addr = writer.build_dsdt(mmio_device_manager)?;
let dsdt_addr = writer.build_dsdt(mmio_device_manager, acpi_device_manager)?;
let fadt_addr = writer.build_fadt(dsdt_addr)?;
let madt_addr = writer.build_madt(vcpus.len().try_into().unwrap())?;
let xsdt_addr = writer.build_xsdt(fadt_addr, madt_addr)?;