Firecracker Design

Scope

What is Firecracker

Firecracker is a new virtualization technology that enables customers to deploy lightweight micro Virtual Machines or microVMs. Firecracker microVMs combine the security and workload isolation properties of traditional VMs with the speed, agility and resource efficiency enabled by containers. They provide a secure, trusted environment for multi-tenant services, while maintaining minimal overhead.

The scope of this document is to describe the features and architecture of the Firecracker virtual machine manager (VMM).

Features

  1. Firecracker can safely run workloads from different customers on the same machine.
  2. Customers can create microVMs with any combination of vCPU (up to 32) and memory to match their application requirements.
  3. Firecracker microVMs can oversubscribe host CPU and memory. The degree of oversubscription is controlled by customers, who may factor in workload correlation and load in order to ensure smooth host system operation.
  4. With a microVM configured with a minimal Linux kernel, single-core CPU, and 128 MiB of RAM, Firecracker supports a steady mutation rate of 5 microVMs per host core per second (e.g., one can create 180 microVMs per second on a host with 36 physical cores).
  5. The number of Firecracker microVMs running simultaneously on a host is limited only by the availability of hardware resources.
  6. Each microVM exposes a host-facing API via an in-process HTTP server.
  7. Each microVM provides guest-facing access to host-configured metadata via the /mmds API.
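The creation rate quoted in item 4 scales linearly with the number of host cores. A small worked calculation makes the arithmetic explicit; the function name is illustrative only, and the figure applies to the minimal configuration described in that item.

```python
# Rate quoted in the feature list: 5 microVMs per host core per second.
MICROVMS_PER_CORE_PER_SECOND = 5


def max_creation_rate(host_cores: int) -> int:
    """Return the sustained microVM creation rate (per second) for a host."""
    if host_cores < 0:
        raise ValueError("host_cores must be non-negative")
    return MICROVMS_PER_CORE_PER_SECOND * host_cores


print(max_creation_rate(36))  # 180, matching the example in item 4
```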

Specifications

Firecracker's technical specifications are available in the Specifications document.

Host Integration

The following diagram depicts an example host running Firecracker microVMs.

[Figure: Firecracker Host Integration]

Firecracker runs on Linux hosts and with Linux guest OSs (from this point on, referred to as guests). For a complete list of currently supported kernel versions, check out the kernel support policy.

In production environments, Firecracker should be started only via the jailer binary. See Sandboxing for more details.

After launching the process, users interact with the Firecracker API to configure the microVM, before issuing the InstanceStart command.
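The configure-then-start flow can be sketched as an ordered list of API calls. This is a minimal sketch, not a client: the endpoint and field names should be checked against the API specification of the release in use, the file paths and boot arguments are placeholders, and the helper names `build_boot_requests` and `to_http` are hypothetical. Delivering the requests to the API server is deliberately left out.

```python
import json


def build_boot_requests(kernel_path, rootfs_path, vcpu_count=1, mem_size_mib=128):
    """Return the ordered (method, path, body) calls that configure and start a microVM."""
    return [
        # Guest kernel and its command line (boot_args is an example value).
        ("PUT", "/boot-source", {
            "kernel_image_path": kernel_path,
            "boot_args": "console=ttyS0 reboot=k panic=1",
        }),
        # Root block device, backed by a file on the host.
        ("PUT", "/drives/rootfs", {
            "drive_id": "rootfs",
            "path_on_host": rootfs_path,
            "is_root_device": True,
            "is_read_only": False,
        }),
        # vCPU count and memory size.
        ("PUT", "/machine-config", {
            "vcpu_count": vcpu_count,
            "mem_size_mib": mem_size_mib,
        }),
        # Configuration is complete; boot the guest.
        ("PUT", "/actions", {"action_type": "InstanceStart"}),
    ]


def to_http(method, path, body):
    """Serialize one call as a raw HTTP/1.1 request with a JSON body."""
    payload = json.dumps(body)  # ASCII-only output, so len() equals the byte count
    return (
        f"{method} {path} HTTP/1.1\r\n"
        "Host: localhost\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
        f"{payload}"
    )


for request in build_boot_requests("/path/to/vmlinux", "/path/to/rootfs.ext4"):
    print(to_http(*request).splitlines()[0])
```

The point of the ordering is that InstanceStart comes last: every resource the guest needs at boot has to be configured before the start action is issued.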

Host Networking Integration

Firecracker emulated network devices are backed by TAP devices on the host. To make use of Firecracker, we expect our customers to leverage on-host networking solutions.

Storage

Firecracker emulated block devices are backed by files on the host. To be able to mount block devices in the guest, the backing files need to be pre-formatted with a filesystem that the guest kernel supports.

Internal Architecture

Each Firecracker process encapsulates one and only one microVM. The process runs the following threads: API, VMM and vCPU(s). The API thread is responsible for Firecracker's API server and associated control plane. It's never in the fast path of the virtual machine. The VMM thread exposes the machine model, a minimal legacy device model, the microVM metadata service (MMDS) and the emulated VirtIO Net, Block and Vsock devices, complete with I/O rate limiting. In addition, there are one or more vCPU threads (one per guest CPU core). They are created via KVM and run the KVM_RUN main loop. They execute synchronous I/O and memory-mapped I/O operations on device models.

Threat Containment

From a security perspective, all vCPU threads are considered to be running malicious code as soon as they have been started; these malicious threads need to be contained. Containment is achieved by nesting several trust zones, ordered from least trusted or least safe (guest vCPU threads) to most trusted or safest (host). These trust zones are separated by barriers that enforce aspects of Firecracker security. For example, all outbound network traffic data is copied by the Firecracker I/O thread from the emulated network interface to the backing host TAP device, and I/O rate limiting is applied at this point. These barriers are marked in the diagram below.

[Figure: Firecracker Threat Containment]

Components and Features

Machine Model

Layout

Firecracker provides guests with storage and network access via emulated VirtIO Net and VirtIO Block devices. It also exposes a serial console and partial keyboard controller, the latter being used by guests to reset the VM (either soft or hard reset). Within Firecracker, the purpose of the I8042 device is to signal the microVM that the guest has requested a reboot.

In addition to the Firecracker provided device models, guests also see the Programmable Interrupt Controllers (PICs), the I/O Advanced Programmable Interrupt Controller (IOAPIC), and the Programmable Interval Timer (PIT) that KVM supports.

Exposing the CPU to the guest

Firecracker allows control of what processor information is exposed to the guest by using CPU templates. CPU templates can be set via the Firecracker API. Users can choose from the existing static CPU templates or create their own custom CPU templates.

Clocksources available to guests

Firecracker exposes only kvm-clock to guests.

I/O: Storage, Networking and Rate Limiting

Firecracker provides VirtIO/block and VirtIO/net emulated devices, along with the application of rate limiters to each volume and network interface to make sure host hardware resources are used fairly by multiple microVMs. These are implemented using a token bucket algorithm based on two buckets. One is associated with the number of operations per second and the other one with the bandwidth. The customer can create and configure rate limiters via the API by specifying token bucket configurations for ingress and egress. Each token bucket is defined via the bucket size, I/O cost, refill rate, maximum burst, and initial value. This enables the customer to define flexible rate limiters that support bursts or specific bandwidth/operations limitations. For vhost-user devices, customers should implement rate limiting on the side of the vhost-user backend that they provide.
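The two-bucket scheme can be illustrated with a small model: one bucket counts operations, the other counts bytes, and a request proceeds only if both can pay for it. This is a sketch of the general technique, not Firecracker's implementation. The class names and the parameters `size`, `refill_time_ms` and `burst` are assumptions chosen for illustration, each bucket starts full, and time is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Bucket that refills linearly from empty to `size` in `refill_time_ms`."""

    def __init__(self, size, refill_time_ms, burst=0, now=0.0):
        self.size = size
        self.refill_time_ms = refill_time_ms
        self.burst = burst          # extra one-time credit, never refilled
        self.budget = float(size)   # assumed initial value: a full bucket
        self.last = now

    def available(self, now):
        """Refill for the elapsed time, then return the spendable tokens."""
        elapsed_ms = (now - self.last) * 1000.0
        refill = self.size * elapsed_ms / self.refill_time_ms
        self.budget = min(float(self.size), self.budget + refill)
        self.last = now
        return self.budget + self.burst

    def take(self, tokens):
        """Spend tokens, using the one-time burst credit first."""
        from_burst = min(self.burst, tokens)
        self.burst -= from_burst
        self.budget -= tokens - from_burst


class RateLimiter:
    """Pairs an operations bucket with a bandwidth bucket."""

    def __init__(self, ops, bandwidth):
        self.ops = ops
        self.bandwidth = bandwidth

    def try_consume(self, nbytes, now):
        """Admit a request of `nbytes` only if both buckets can cover it."""
        if self.ops.available(now) < 1 or self.bandwidth.available(now) < nbytes:
            return False
        self.ops.take(1)
        self.bandwidth.take(nbytes)
        return True


# 2 operations/s and 1000 bytes/s, both refilling over one second.
limiter = RateLimiter(TokenBucket(2, 1000), TokenBucket(1000, 1000))
print(limiter.try_consume(400, now=0.0))  # True
print(limiter.try_consume(400, now=0.0))  # True
print(limiter.try_consume(100, now=0.0))  # False: no operation tokens left
print(limiter.try_consume(100, now=0.5))  # True: half a second refilled one operation
```

Keeping the two buckets separate is what lets a limiter cap many small requests (operations) and a few large ones (bandwidth) independently.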

MicroVM Metadata Service

Firecracker microVMs expose access to a minimal MicroVM-Metadata Service (MMDS) to the guest through the API endpoint. The metadata stored by the service is fully configured by users.
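Conceptually, the service holds a user-supplied document and answers guest lookups against it. The model below treats the metadata as a nested dictionary and resolves slash-separated paths; it is a sketch of the idea only. The function name `mmds_get`, the path layout and the sample values are all hypothetical, and the way the real service handles requests is not described here.

```python
def mmds_get(store, path):
    """Resolve a slash-separated path against the user-supplied metadata."""
    node = store
    for key in (part for part in path.split("/") if part):
        if not isinstance(node, dict) or key not in node:
            raise KeyError(path)
        node = node[key]
    return node


# Metadata as a user might configure it from the host (made-up values).
metadata = {"latest": {"meta-data": {"instance-id": "vm-1234", "role": "worker"}}}

print(mmds_get(metadata, "/latest/meta-data/instance-id"))  # vm-1234
```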

Sandboxing

Firecracker process

Firecracker is designed to assure secure isolation using multiple layers. The first layer of isolation is provided by the Linux KVM and the Firecracker virtualization boundary. To assure defense in depth, Firecracker should only run constrained at the process level. This is achieved by the following: seccomp filters for disallowing unwanted system calls, cgroups and namespaces for resource isolation, and dropping privileges by jailing the process. Seccomp filters are automatically installed by Firecracker, while for the remaining measures, we recommend starting Firecracker with the jailer binary that's part of each Firecracker release.

Seccomp

Seccomp filters are used by default to limit the host system calls Firecracker can use. The default filters only allow the bare minimum set of system calls and parameters that Firecracker needs in order to function correctly.

The filters are loaded in the Firecracker process, on a per-thread basis, before executing any guest code.

For more information, check out the seccomp documentation.

Jailer process

The Firecracker process can be started by another jailer process. The jailer sets up system resources that require elevated permissions (e.g., cgroup, chroot), drops privileges, and then exec()s into the Firecracker binary, which then runs as an unprivileged process. Past this point, Firecracker can only access resources that a privileged third party grants access to (e.g., by copying a file into the chroot, or passing a file descriptor).
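The sequence described above (confine to a chroot, drop privileges, exec) can be sketched with standard POSIX calls. This is not the jailer's implementation, which also sets up cgroups and other resources; `jail_and_exec` and its parameters are hypothetical names, and the function only works when called as root on a Linux host.

```python
import os


def jail_and_exec(chroot_dir, uid, gid, binary, argv):
    """Confine the current process, drop root, then replace it with `binary`.

    `binary` is resolved inside the new root, so it must already have been
    placed in `chroot_dir` by a privileged party.
    """
    os.chroot(chroot_dir)   # needs root: restrict the filesystem view
    os.chdir("/")           # make sure the working directory is inside the jail
    os.setgroups([])        # drop supplementary groups
    os.setgid(gid)          # group first: after setuid this would be refused
    os.setuid(uid)          # from here on the process is unprivileged
    os.execv(binary, [binary, *argv])  # never returns on success
```

The ordering is the design point: every step that needs elevated permissions happens before `setuid`, and the exec comes last so that the target binary never runs with privileges.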

Cgroups and Quotas

Each Firecracker microVM can be further encapsulated into a cgroup. By setting the affinity of the Firecracker microVM to a NUMA node via the cpuset subsystem, one can prevent the migration of said microVM from one node to another, something that would impair performance and cause unnecessary contention on shared resources. In addition to setting the affinity, each Firecracker microVM can have its own dedicated quota of the CPU time via the cpu subsystem, thus guaranteeing that resources are fairly shared across Firecracker microVMs.
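As an illustration of the affinity and quota settings, the sketch below computes the values one might write into a cgroup directory. It assumes cgroup v2 file names (`cpuset.cpus`, `cpuset.mems`, `cpu.max`); on a cgroup v1 host the file names differ. The helper name `cgroup_settings` and its parameters are hypothetical, and nothing is written to disk.

```python
def cgroup_settings(cpus, numa_node, cpu_fraction, period_us=100_000):
    """Return {file name: contents} for pinning and capping one microVM.

    cpus         -- host CPU ids the microVM may run on
    numa_node    -- memory node the microVM is bound to
    cpu_fraction -- CPU time allowed per period (0.5 = half a core)
    """
    quota_us = int(cpu_fraction * period_us)
    return {
        "cpuset.cpus": ",".join(str(c) for c in cpus),
        "cpuset.mems": str(numa_node),
        "cpu.max": f"{quota_us} {period_us}",  # "<quota> <period>" in microseconds
    }


for name, value in cgroup_settings([0, 1], numa_node=0, cpu_fraction=0.5).items():
    print(f"{name}: {value}")
```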

Monitoring

Firecracker emits logs and metric counters, each on a named pipe that is passed via the API. Logs are flushed line by line, whereas metrics are emitted when the instance starts, then every 60 seconds while it's running, and on panic. Firecracker customers are responsible for collecting data in the Firecracker log files. In production builds, Firecracker does not expose the serial console port, since it may contain guest data that the host should not see.
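A collector on the host side might look like the sketch below. It assumes each metrics flush arrives as a single JSON object on its own line, which should be confirmed against the metrics documentation; the helper name `read_metrics` and the sample field names are hypothetical. The function accepts any iterable of lines, so it works equally on an opened named pipe or on captured text.

```python
import json


def read_metrics(lines):
    """Yield one dict per non-empty line, assuming one JSON object per flush."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for lines read from the metrics pipe (made-up fields and values).
sample = [
    '{"utc_timestamp_ms": 1000, "requests": 3}\n',
    "\n",
    '{"utc_timestamp_ms": 61000, "requests": 5}\n',
]

for record in read_metrics(sample):
    print(record["utc_timestamp_ms"], record["requests"])
```

In a deployment, the same function could be given the file object returned by opening the metrics pipe for reading, since iterating a text file yields lines.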