Open
60 commits
fbff502
doc: export under-the-hood documentations from criu.org
avagin Mar 7, 2026
2e13b87
docs: improve formatting in 'under-the-hood' documentation
avagin Mar 7, 2026
8873787
docs: update 32-bit tasks C/R documentation
avagin Mar 7, 2026
47db848
docs: update AIO documentation
avagin Mar 7, 2026
72e15d9
docs: update BPF maps documentation
avagin Mar 7, 2026
1850397
docs: update CGroups documentation
avagin Mar 7, 2026
b781b2c
docs: update IP address change documentation
avagin Mar 7, 2026
6704a74
docs: update Checkpoint/Restore architecture documentation
avagin Mar 7, 2026
0a60b35
docs: update Code Blobs documentation
avagin Mar 7, 2026
ec494f1
docs: update Comparison to other CR projects documentation
avagin Mar 7, 2026
b37a35a
docs: update COW memory documentation
avagin Mar 7, 2026
94211c0
docs: update DMTCP comparison documentation
avagin Mar 7, 2026
46cef77
docs: update Dumping Files documentation
avagin Mar 7, 2026
5875651
docs: update FAQ documentation
avagin Mar 7, 2026
e209883
docs: update File Restoration Engine (fdinfo) documentation
avagin Mar 7, 2026
efa4b70
docs: update Filesystem Peculiarities documentation
avagin Mar 7, 2026
5d2d1b6
docs: update Process Tree Final States documentation
avagin Mar 7, 2026
27d776e
docs: update Freezing the Process Tree documentation
avagin Mar 7, 2026
b0755b3
docs: update FSNotify documentation
avagin Mar 7, 2026
647ba48
docs: update Re-opening Files documentation
avagin Mar 7, 2026
a3fb3c6
docs: update Descriptor Assignment documentation
avagin Mar 7, 2026
c6d0b33
docs: update Re-opening Nameless Files documentation
avagin Mar 7, 2026
784fb71
docs: update Invisible Files documentation
avagin Mar 7, 2026
53aefec
docs: update Irmap documentation
avagin Mar 7, 2026
8e759c9
docs: update Kcmp Trees documentation
avagin Mar 7, 2026
e856d8f
docs: update Kerndat documentation
avagin Mar 7, 2026
86f592b
docs: update Mac-VLAN documentation
avagin Mar 7, 2026
6066048
docs: update Memory Changes Tracking documentation
avagin Mar 7, 2026
376752f
docs: update Memory Dumping and Restoring documentation
avagin Mar 7, 2026
c35c511
docs: update Memory Images Deduplication documentation
avagin Mar 7, 2026
0759661
docs: update Mount Points documentation
avagin Mar 7, 2026
7bc31f6
docs: update Mount V2 documentation
avagin Mar 7, 2026
84edc07
docs: update Mount V2 Detailed Algorithm documentation
avagin Mar 7, 2026
ccc355a
docs: update Optimized Pre-dump Algorithm documentation
avagin Mar 7, 2026
1062b20
docs: update Pagemap Cache documentation
avagin Mar 7, 2026
b7a2f8a
docs: update Parasite Code documentation
avagin Mar 7, 2026
c22972a
docs: update Pending Signals documentation
avagin Mar 7, 2026
bb0cfba
docs: update PID Restoration documentation
avagin Mar 7, 2026
2a3fa95
docs: update Pidfd Store documentation
avagin Mar 7, 2026
51698eb
docs: update Pidfd documentation
avagin Mar 7, 2026
c1b1b43
docs: update Restartable Sequences documentation
avagin Mar 7, 2026
180b81f
docs: update Restorer Context documentation
avagin Mar 7, 2026
4af884f
docs: update Service Descriptors documentation
avagin Mar 7, 2026
aa00b64
docs: update Shared Memory documentation
avagin Mar 7, 2026
101b1e7
docs: update Network Sockets documentation
avagin Mar 7, 2026
be85cba
docs: update Restoration Stages documentation
avagin Mar 7, 2026
fede6c7
docs: update TCP Connection documentation
avagin Mar 7, 2026
3207456
docs: update TTY documentation
avagin Mar 7, 2026
397df62
docs: update TUN/TAP documentation
avagin Mar 7, 2026
bc000a4
docs: update Userfaultfd documentation
avagin Mar 7, 2026
f4af26f
docs: update File Validation documentation
avagin Mar 7, 2026
e85c362
docs: update vDSO and VVAR documentation
avagin Mar 7, 2026
322f7a6
docs: update Unix Sockets documentation
avagin Mar 7, 2026
73e5fd0
docs: update Technologies documentation
avagin Mar 7, 2026
e1e8b72
docs: update Zombie Processes documentation
avagin Mar 7, 2026
b796b63
docs: update ARM64 GCS documentation
avagin Mar 7, 2026
1727fdc
docs: update AppArmor documentation
avagin Mar 7, 2026
a14f28b
docs: mark Mount Points 2.0 as legacy and redirect to Mount V2
avagin Mar 7, 2026
211adbe
doc: Create index.md for Under the Hood documentation
avagin Mar 7, 2026
18b3647
Merge branch 'checkpoint-restore:criu-dev' into docs
avagin Apr 21, 2026
105 changes: 105 additions & 0 deletions Documentation/under-the-hood/32bit-tasks-cr.md
@@ -0,0 +1,105 @@
# 32-bit tasks C/R

## Compatible applications

On x86_64, there are two types of compatibility mode applications:
- ia32: Compiled to run on an i686 target, these can be executed on x86_64 if the `CONFIG_IA32_EMULATION` configuration option is enabled.
- x32: Specially compiled binaries designed to run on x86_64 with the `CONFIG_X86_X32` configuration option enabled.

Both use 4-byte pointers and thus can address no more than 4 GB of virtual memory.
However, x32 uses the full 64-bit register set and therefore cannot be launched natively on an i686 host.
Both require an additional environment on x86_64, such as Glibc, libraries, and compiler support.
x32 is rarely distributed; the [Debian x32 port](https://wiki.debian.org/X32Port) is the only one that is easy to find.
CRIU currently supports C/R of ia32 tasks only. Support for x32 can be added relatively easily, as the necessary kernel patches for ia32 C/R are already in place.
In this document, the terms *compatible* and *32-bit* refer to ia32 applications unless otherwise specified.

## Difference between native and compatibility mode applications

From the CPU's point of view, 32-bit compatibility mode applications differ from 64-bit applications by the current Code Segment (CS) selector. If the L-bit (Long mode) in the segment descriptor is set, the CPU operates in 64-bit mode when that descriptor is used. There are other differences between 32-bit and 64-bit selectors; for more details, see [the article "The 0x33 Segment Selector (Heavens Gate)"](https://www.malwaretech.com/2014/02/the-0x33-segment-selector-heavens-gate.html). Code selectors for both modes are defined in kernel headers as `__USER32_CS` and `__USER_CS`, corresponding to descriptors in the Global Descriptor Table (GDT). The mode can be switched between 64-bit and compatibility mode by loading a different CS value (e.g., with a far jump or far return).
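A segment selector is a small bit field, so the two code selectors can be compared by decoding them. The sketch below is illustrative only: it assumes the x86_64 Linux values `0x23` for `__USER32_CS` and `0x33` for `__USER_CS` (the latter taken from the title of the linked article), and `decode_selector` is a hypothetical helper, not a CRIU function.

```python
from typing import NamedTuple


class Selector(NamedTuple):
    index: int  # descriptor index within the table
    table: str  # "GDT" or "LDT", from the table-indicator bit
    rpl: int    # requested privilege level (3 means user space)


def decode_selector(value: int) -> Selector:
    """Split an x86 segment selector into its three architectural fields."""
    return Selector(
        index=value >> 3,
        table="LDT" if value & 0x4 else "GDT",
        rpl=value & 0x3,
    )


# Assumed values for x86_64 Linux; check the kernel headers for your tree.
USER32_CS = 0x23
USER_CS = 0x33

for name, sel in (("__USER32_CS", USER32_CS), ("__USER_CS", USER_CS)):
    print(name, hex(sel), decode_selector(sel))
```

Both selectors point into the GDT with user privilege; only the descriptor index differs, and it is the descriptor's L-bit that decides which mode the CPU runs in.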

From the Linux kernel's point of view, applications differ based on values set during `exec`, such as `mmap_base` or thread info flags like `TIF_ADDR32`, `TIF_IA32`, or `TIF_X32`.
Both native and compatibility mode applications can perform either 32-bit or 64-bit syscalls.

## Mixed-bitness applications

The current kernel ABI allows for the creation of mixed-bitness applications, which can become quite complex.
For instance, an application could set both 32-bit and 64-bit robust futex list pointers.
Alternatively, a multi-threaded application could have some threads executing 32-bit code while others execute 64-bit code.

If support for such mixed-bitness applications is ever needed, it could be added to CRIU relatively easily. However, this should likely be a compile-time configuration option to avoid adding unnecessary syscalls to standard C/R operations.

Currently, there are no plans to add this support, as such applications are unlikely to be encountered outside of synthetic tests.

## Approaches to C/R for compatibility mode applications

32-bit C/R can be implemented in several ways. This section describes the pros and cons of various approaches and explains why the current implementation was chosen.

### Restore via exec() of a 32-bit dummy binary vs. from 64-bit CRIU

Restoring a 32-bit application could be done using a 32-bit daemon that communicates with the 64-bit CRIU binary or a 32-bit CRIU subprocess.

**Pros**:
- No kernel patches expected (though `mremap()` of the vDSO would still require support).

**Cons**:
- The CRIU codebase lacks a dedicated restore daemon, requiring significant rework.
- A 64-bit application can have a 32-bit child, which in turn could parent a 64-bit process. This would require re-executing the native 64-bit CRIU from the 32-bit dummy or subprocess.
- It would be necessary to send process properties, open image file descriptors, and shared memory containing the parsed `ps_tree` to the daemon. The volume of IPC calls would slow down the restoration process.
- Restoration becomes more complex, especially when considering user and PID namespaces.
- Task properties that are erased during `exec()` cannot benefit from optimized inheritance.
- A separate daemon would also be needed for x32.

### Restore with a flag to sigreturn() or arch_prctl()

The initial attempt to implement 32-bit C/R was rejected by the LKML community for several reasons. It involved swapping thread info flags (e.g., `TIF_ADDR32`, `TIF_IA32`, `TIF_X32`), unmapping the native 64-bit vDSO, and mapping the 32-bit vDSO based on a bit in the `rt_sigreturn()` sigframe or a dedicated `arch_prctl()` call.

**Pros**:
- Simple for CRIU: just perform a `sigreturn` with the new bit set or call `arch_prctl` before `sigreturn`.

**Cons**:
- If the 32-bit vDSO on the restoration host differs from the dumped image, the task must be intercepted after `sigreturn` to create jump trampolines (this is simpler with `arch_prctl`).
- Too many potential failure points for a single syscall; overly complex.
- Allowing userspace to swap thread info flags could introduce new race conditions and bugs (e.g., since the `TASK_SIZE` macro depends on `TIF_ADDR32`, memory mapping behavior might become unpredictable).

Following LKML discussions, it was decided to separate personality changes from the vDSO mapping API, remove the `TIF_IA32` flag that distinguished 32-bit from 64-bit tasks, and instead rely on the nature of the syscall (compat, x32, or native).

### Seizing with separate 32-bit and 64-bit parasites

**Pros**:
- No 32-bit calls in the 64-bit parasite and vice-versa.
- Since `ptrace` does not allow setting a 32-bit register set on a 64-bit task (and vice versa), using a parasite of the same nature as the task avoids these limitations.

**Cons**:
- Requires maintaining two or three (for x32) separate parasite blobs.
- Requires complex Makefile macros to build multiple parasites.
- Serializing parasite responses is difficult because argument sizes differ between modes, leading to complex and less readable C macros.

### Current approach

CRIU (a 64-bit process) handles 32-bit (ia32) tasks through a series of architecture-specific transitions:

1. **Architecture Detection**: CRIU uses `ptrace(PTRACE_GETREGSET, pid, NT_PRSTATUS, &iov)` to detect the task's architecture. The kernel returns different register set sizes depending on the mode: `sizeof(user_regs_struct64)` for native 64-bit tasks and `sizeof(user_regs_struct32)` for 32-bit compatibility mode tasks.
2. **Dumping**: When dumping a 32-bit task, CRIU uses the 64-bit `ptrace` interface. The kernel handles the internal mapping of 32-bit registers into the structure expected by CRIU.
3. **vDSO Handling**: To ensure the restored task uses a vDSO compatible with the current kernel, CRIU uses the `arch_prctl(ARCH_MAP_VDSO_32, addr)` system call (available since kernel v4.8) to map the 32-bit vDSO into the restored process's address space.
4. **Restoration via Sigreturn**: The final restoration of 32-bit registers is performed using a 32-bit `rt_sigreturn` call:
    * CRIU prepares a 32-bit signal frame (`rt_sigframe_ia32`) on the target task's stack.
    * The CRIU restorer code, running in 64-bit mode, executes a far return (`lretq`) to switch the CPU to 32-bit mode with the `__USER32_CS` (0x23) segment selector.
    * Once in 32-bit mode, it executes `int $0x80` with the `__NR32_rt_sigreturn` syscall number. The kernel then restores all registers from the 32-bit sigframe and resumes the task in 32-bit mode.
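The size-based detection in step 1 can be sketched without calling `ptrace` at all. The example below assumes the usual x86 register layouts (27 eight-byte registers for the 64-bit set, 17 four-byte registers for the 32-bit set); the `UserRegs64`, `UserRegs32`, and `classify_regset` names are hypothetical and only mirror the `user_regs_struct64`/`user_regs_struct32` types mentioned above.

```python
import ctypes

# Register names follow the kernel's x86 user_regs_struct layouts (assumed here).
NAMES64 = ("r15 r14 r13 r12 bp bx r11 r10 r9 r8 ax cx dx si di orig_ax "
           "ip cs flags sp ss fs_base gs_base ds es fs gs").split()
NAMES32 = "bx cx dx si di bp ax ds es fs gs orig_ax ip cs flags sp ss".split()


class UserRegs64(ctypes.Structure):
    _fields_ = [(name, ctypes.c_uint64) for name in NAMES64]


class UserRegs32(ctypes.Structure):
    _fields_ = [(name, ctypes.c_uint32) for name in NAMES32]


def classify_regset(iov_len: int) -> str:
    """Classify a task from the iov_len the kernel wrote back after
    PTRACE_GETREGSET with NT_PRSTATUS."""
    if iov_len == ctypes.sizeof(UserRegs64):
        return "native"
    if iov_len == ctypes.sizeof(UserRegs32):
        return "compat"
    raise ValueError(f"unexpected register set size: {iov_len}")


print(ctypes.sizeof(UserRegs64), ctypes.sizeof(UserRegs32))
```

The caller passes a buffer large enough for the 64-bit set; the length reported back by the kernel is what distinguishes the two cases.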

## To-Do

### vsyscall page handling

The `vsyscall` page is an emulated, fixed-address page (`0xffffffffff600000`) used for legacy support. It is not a standard VMA and is marked as `VMA_AREA_VSYSCALL` by CRIU, which avoids dumping or restoring its contents. Since its presence in `/proc/<pid>/maps` depends on kernel configuration (`vsyscall=emulate` or `vsyscall=xonly`), it can introduce noise during ZDTM tests that compare memory layouts. Consequently, tests are often run with `vsyscall=none`.
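A test that compares memory layouts can drop the `vsyscall` entry before comparing. The following is a minimal sketch, assuming the entry appears under the `[vsyscall]` name in `/proc/<pid>/maps`; `parse_maps_line`, `is_vsyscall`, and `comparable_layout` are hypothetical helpers and not part of CRIU or ZDTM.

```python
def parse_maps_line(line: str):
    """Return (start, end, name) for one /proc/<pid>/maps line."""
    fields = line.split(None, 5)
    start_s, end_s = fields[0].split("-")
    name = fields[5].strip() if len(fields) > 5 else ""
    return int(start_s, 16), int(end_s, 16), name


def is_vsyscall(line: str) -> bool:
    """True for the fixed-address legacy vsyscall mapping."""
    _start, _end, name = parse_maps_line(line)
    return name == "[vsyscall]"


def comparable_layout(maps_text: str):
    """Keep only the lines that should take part in a before/after comparison."""
    return [l for l in maps_text.splitlines() if l.strip() and not is_vsyscall(l)]


sample = (
    "7ffd1c3f0000-7ffd1c411000 rw-p 00000000 00:00 0                          [stack]\n"
    "ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]\n"
)
print(comparable_layout(sample))
```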

### Error reporting on x32 binary dumping

Currently, CRIU does not support x32 binaries (64-bit registers with 32-bit pointers). While the infrastructure for 32-bit pointers exists, the specific register handling and vDSO mapping for x32 are not implemented. Attempting to dump an x32 binary should result in an explicit error.

### Removal of TIF_IA32 from the kernel

The `TIF_IA32` thread info flag was historically used to distinguish 32-bit tasks. Kernel efforts (merged in v5.11) have moved towards relying on the nature of the syscall (compat vs. native) rather than a persistent thread flag. This unification simplifies how the kernel and CRIU interact, particularly for tracing tools like uprobes.

## External links
- [GitHub issue](https://github.com/checkpoint-restore/criu/issues/43)

32 changes: 32 additions & 0 deletions Documentation/under-the-hood/aio.md
@@ -0,0 +1,32 @@
# Asynchronous I/O (AIO)

CRIU supports checkpointing and restoring kernel-level Asynchronous I/O (AIO) contexts, which are managed via the `io_setup`, `io_submit`, `io_getevents`, and `io_destroy` system calls.

## How CRIU Handles AIO

To successfully checkpoint and restore an AIO context, CRIU manages three primary components:

1. **The AIO Ring Buffer**: This is a memory-mapped area where the kernel and userspace communicate. CRIU identifies these areas by their `[aio]` label in `/proc/<pid>/maps` or by detecting the specific VMA attributes.
2. **Completed Events**: Events that have finished and are already residing in the ring buffer are dumped as part of the process's memory.
3. **AIO Context State**: This includes the kernel's internal tracking of the ring's head and tail.

### The Restoration Process

The restoration of an AIO ring is complex because the kernel's AIO context ID (the `aio_context_t` value) is an internal pointer that cannot be arbitrarily assigned by userspace. CRIU uses the following strategy to restore it:

1. **New Ring Creation**: The restorer calls `io_setup` to create a fresh AIO ring with the original number of requested events.
2. **Tail Synchronization**: To move the kernel's internal `tail` pointer to the original position, CRIU submits dummy I/O requests (typically writes to `/dev/null`). Because these requests complete immediately on that device, the kernel advances the tail by one for each of them.
3. **Head Synchronization**: CRIU manually adjusts the `head` pointer in the ring header to match the state at the time of the dump.
4. **Event Data Restoration**: The original `io_events` data (the completed but unread events) is copied from the dump image into the new ring buffer.
5. **Memory Remapping**: Finally, CRIU uses `mremap` to move the new ring buffer to its original virtual address, ensuring the application can continue using its existing AIO context ID.
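The ordering of steps 1 to 4 can be illustrated with a toy model of the ring. This is a simplified sketch, not the restorer's code: `RingModel` and `restore_ring` are hypothetical, the ring is modelled as a plain list, each dummy request is assumed to complete instantly, and the final `mremap` step is not modelled.

```python
class RingModel:
    """Toy stand-in for a kernel AIO ring with nr event slots."""

    def __init__(self, nr: int):
        self.nr = nr
        self.head = 0
        self.tail = 0
        self.events = [None] * nr

    def complete_dummy(self) -> None:
        # The kernel stores the completion at tail and advances tail.
        self.events[self.tail] = "dummy"
        self.tail = (self.tail + 1) % self.nr
        # The model reaps the dummy event at once so the ring never fills up.
        self.head = self.tail


def restore_ring(nr, saved_head, saved_tail, saved_events):
    assert len(saved_events) == nr
    ring = RingModel(nr)               # step 1: fresh ring of the original size
    for _ in range(saved_tail):        # step 2: dummy requests move the tail
        ring.complete_dummy()
    ring.head = saved_head             # step 3: head is written directly
    ring.events = list(saved_events)   # step 4: completed events are copied back
    return ring


saved = [None, None, "ev2", "ev3", "ev4", None, None, None]
ring = restore_ring(nr=8, saved_head=2, saved_tail=5, saved_events=saved)
print(ring.head, ring.tail, ring.events)
```

The tail is the only field that has to be driven through real submissions, because the kernel keeps its own copy; the head and the event slots live in the shared mapping and can be written directly.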

## Limitations: In-Flight Events

Currently, **in-flight events** (I/O requests that have been submitted but not yet completed at the time of the dump) are **not supported**.

* **Dumping**: CRIU's parasite code checks for AIO rings but does not currently wait for pending requests to complete. If a request completes during or after the dump, it may lead to data inconsistency or a failed restore.
* **Restoring**: There is no mechanism to re-submit pending I/O requests upon restoration. Applications using AIO should ideally be in a quiescent state (all submitted I/O completed) before being checkpointed.

## See also

* [Memory dumping and restoring](memory-dumping-and-restoring.md)
35 changes: 35 additions & 0 deletions Documentation/under-the-hood/apparmor.md
@@ -0,0 +1,35 @@
# AppArmor Support

CRIU provides support for checkpointing and restoring **AppArmor** security profiles and namespaces. This is a critical feature for containerized environments (like Docker, LXC, or Podman) where each container frequently operates under its own set of specialized security policies.

## How CRIU Handles AppArmor

AppArmor integration in CRIU ensures that restored processes continue to operate under the same security constraints as the original processes, while also managing the temporary permissions needed for the checkpointing process itself.

### 1. Checkpointing (Dumping)
During the dump phase, CRIU detects the AppArmor state of each task:
* **Profile Identification**: CRIU captures the active profile name for every thread (e.g., `unconfined`, `docker-default`, or a custom user-defined profile).
* **Namespace and Policy Dumping**: In modern containerized setups, containers often have their own AppArmor namespaces. CRIU walks the `/sys/kernel/security/apparmor/policy/` directory to capture the full hierarchy of namespaces and the raw binary blobs of all loaded policies.
* **Parasite Profile**: To allow the [Parasite Code](parasite-code.md) to perform its necessary inspections (like opening network sockets or reading memory) without being blocked by the application's strict security policy, CRIU temporarily transitions the task into a special, permissive "parasite profile" while it is infected.

### 2. Restoration
Restoring AppArmor state involves re-establishing the security context before the process resumes:
* **Policy Loading**: CRIU uses the `apparmor_parser` utility on the destination host to re-load the policy blobs captured in the image files.
* **Namespace Reconstruction**: It recreates any nested AppArmor namespaces to match the original environment.
* **Profile Re-attachment**: As each process is restored, CRIU ensures it is transitioned back into its original profile (or stack of profiles) using the `aa_change_profile()` interface before the application code begins executing.

## Support for Stacking

Modern AppArmor implementations support **Profile Stacking**, where multiple security profiles are applied to a single process simultaneously (e.g., a container-wide profile plus a per-application profile). CRIU correctly identifies, dumps, and restores these complex stacked configurations.
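To make the profile and stacking terminology concrete, here is a small sketch that splits a confinement label into its stacked components. It rests on assumptions not stated above: that the label is read from `/proc/<pid>/attr/current`, that a mode such as `(enforce)` may follow the name, and that stacked profiles are joined with `//&`. The `parse_label` helper is hypothetical and is not CRIU's parser.

```python
def parse_label(raw: str):
    """Return (list_of_profiles, mode_or_None) for one AppArmor label string."""
    text = raw.strip()
    mode = None
    # An optional trailing " (mode)" describes how the profile is enforced.
    if text.endswith(")") and " (" in text:
        text, _, mode_part = text.rpartition(" (")
        mode = mode_part[:-1]
    # Assumed separator between the members of a profile stack.
    return text.split("//&"), mode


for label in ("unconfined\n", "docker-default (enforce)\n", "container//&app (enforce)\n"):
    print(parse_label(label))
```

A stacked label has to be restored as a whole, since dropping any member would leave the process with weaker confinement than it had at dump time.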

## Kernel Requirements

Reliable AppArmor C/R requires:
* A kernel with `CONFIG_SECURITY_APPARMOR` enabled and active.
* The `securityfs` filesystem mounted (typically at `/sys/kernel/security`).
* Support for AppArmor policy introspection and namespaces, which is standard in modern distributions like Ubuntu and Debian.

## See also
* [Checkpoint/Restore Architecture](checkpointrestore.md)
* [Parasite Code](parasite-code.md)
* [Kerndat Feature Detection](kerndat.md)
32 changes: 32 additions & 0 deletions Documentation/under-the-hood/arm64-gcs.md
@@ -0,0 +1,32 @@
# ARM64 Guarded Control Stack (GCS)

CRIU supports checkpointing and restoring the **Guarded Control Stack (GCS)** feature on ARM64 (AArch64) architectures. GCS is a hardware-assisted shadow stack mechanism designed to prevent return-oriented programming (ROP) attacks by maintaining a protected stack of return addresses.

## How CRIU Handles GCS

GCS support is integrated into CRIU's architecture-specific code for AArch64 (`arch/aarch64/gcs.c`).

### 1. Checkpointing (Dumping)
During the dump phase, CRIU detects if a task has GCS enabled by checking its CPU features and hardware capabilities (`HWCAP_GCS`).
* **State Capture**: CRIU uses `ptrace(PTRACE_GETREGSET, ..., NT_ARM_GCS, ...)` to retrieve the current GCS state.
* **Key Parameters**:
    * `gcspr_el0`: The current Guarded Control Stack Pointer.
    * `features_enabled`: The GCS configuration flags (e.g., `PR_SHADOW_STACK_ENABLE`).
* **VMA Identification**: CRIU identifies the memory region (VMA) used for the shadow stack, which is marked with special kernel attributes.
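The captured state is small enough to decode by hand. The sketch below assumes that the `NT_ARM_GCS` regset is three little-endian 64-bit words in the order `features_enabled`, `features_locked`, `gcspr_el0`, and that `PR_SHADOW_STACK_ENABLE` is bit 0; both are assumptions to verify against the kernel's UAPI headers. `decode_gcs` is a hypothetical helper, and the buffer is built locally rather than read via `ptrace`.

```python
import struct

# Assumed flag value; see the kernel's prctl UAPI header.
PR_SHADOW_STACK_ENABLE = 1 << 0

# Assumed regset layout: features_enabled, features_locked, gcspr_el0.
GCS_REGSET = struct.Struct("<3Q")


def decode_gcs(buf: bytes) -> dict:
    """Decode a raw NT_ARM_GCS regset buffer into named fields."""
    enabled, locked, gcspr = GCS_REGSET.unpack(buf)
    return {
        "features_enabled": enabled,
        "features_locked": locked,
        "gcspr_el0": gcspr,
        "gcs_active": bool(enabled & PR_SHADOW_STACK_ENABLE),
    }


# Synthetic buffer standing in for what PTRACE_GETREGSET would return.
raw = GCS_REGSET.pack(PR_SHADOW_STACK_ENABLE, 0, 0x0000FFFF_F7FF_0FF8)
print(decode_gcs(raw))
```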

### 2. Restoration
Restoring GCS requires carefully re-establishing the shadow stack before the process resumes normal execution.
* **Shadow Stack Mapping**: CRIU uses the `map_shadow_stack` system call to recreate the shadow stack at its original virtual address.
* **Context Setup**: The captured GCS state (`gcspr_el0` and flags) is integrated into the task's **restorer context**.
* **Sigframe Integration**: To ensure a seamless transition, CRIU places a `gcs_context` entry into the signal frame used for the final `sigreturn`. This informs the kernel to switch to the restored shadow stack as the process resumes.

## Kernel Requirements

GCS support in CRIU requires an ARM64 host and a kernel that supports the Guarded Control Stack ABI, typically including:
* `PR_SHADOW_STACK_ENABLE` prctl support.
* The `map_shadow_stack` system call.
* `NT_ARM_GCS` ptrace regset.

## See also
* [Checkpoint/Restore Architecture](checkpointrestore.md)
* [Restorer Context](restorer-context.md)
36 changes: 36 additions & 0 deletions Documentation/under-the-hood/bpf-maps.md
@@ -0,0 +1,36 @@
# BPF Maps

BPF maps are kernel objects that store data used by BPF programs, typically in the form of key-value pairs. Applications access these maps via file descriptors. Checkpointing and restoring BPF maps involves serializing both their **metadata** and their **data contents**.

## How CRIU Handles BPF Maps

### Metadata Serialization
CRIU collects essential map attributes from several sources:
- **/proc filesystem**: Essential fields such as `map_type`, `key_size`, `value_size`, `max_entries`, and the `frozen` status are parsed from the task's `fdinfo`.
- **BPF System Call**: CRIU uses the `bpf` system call with the `BPF_OBJ_GET_INFO_BY_FD` command to retrieve additional information, including the map name and interface index (`ifindex`).
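The `fdinfo` part of this can be sketched as a plain text parser. The example assumes the usual `key:<tab>value` layout of `/proc/<pid>/fdinfo/<fd>`; `parse_bpfmap_fdinfo` is a hypothetical helper rather than CRIU's parser, and the sample text is made up for illustration.

```python
MAP_FIELDS = ("map_type", "key_size", "value_size", "max_entries", "frozen")


def parse_bpfmap_fdinfo(text: str) -> dict:
    """Extract the BPF map attributes named above from fdinfo text."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in MAP_FIELDS:
            info[key] = int(value.strip())
    missing = [f for f in MAP_FIELDS if f not in info]
    if missing:
        raise ValueError(f"not a BPF map fdinfo, missing: {missing}")
    return info


sample_fdinfo = (
    "pos:\t0\n"
    "flags:\t02000002\n"
    "mnt_id:\t15\n"
    "map_type:\t1\n"
    "key_size:\t4\n"
    "value_size:\t8\n"
    "max_entries:\t1024\n"
    "frozen:\t0\n"
)
print(parse_bpfmap_fdinfo(sample_fdinfo))
```

Fields that are not listed in `fdinfo`, such as the map name, still have to come from `BPF_OBJ_GET_INFO_BY_FD`.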

### Data Serialization
To preserve the map's contents, CRIU relies on batch operations:
- **Dumping**: During the checkpoint stage, CRIU uses `BPF_MAP_LOOKUP_BATCH` to efficiently read all key-value pairs from the map.
- **Restoring**: During the restore phase, CRIU recreates the map and uses `BPF_MAP_UPDATE_BATCH` to repopulate it with the saved key-value pairs.

### Supported Map Types
CRIU currently supports data serialization for the following BPF map types:
- `BPF_MAP_TYPE_HASH`
- `BPF_MAP_TYPE_ARRAY`

For other map types, CRIU may be able to restore the map itself (metadata) but not its contents, depending on kernel support for batch operations on those types.

### Frozen Maps
If a BPF map was marked as read-only (frozen) using `bpf_map_freeze()`, CRIU detects this state from `fdinfo` and reapplies the freeze during restoration after the data has been repopulated.
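The order of operations matters here: a frozen map rejects updates from userspace, so the freeze has to come last. The toy model below illustrates that ordering. `MapModel`, `chunks`, and `restore_map` are hypothetical names, a Python dict stands in for the kernel map, and no `bpf` system call is made.

```python
class MapModel:
    """Toy stand-in for a BPF map that can be frozen against userspace writes."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self.data = {}
        self.frozen = False

    def update_batch(self, pairs) -> None:
        if self.frozen:
            raise PermissionError("map is frozen")
        for key, value in pairs:
            self.data[key] = value
        if len(self.data) > self.max_entries:
            raise OverflowError("too many entries")

    def freeze(self) -> None:
        self.frozen = True


def chunks(pairs, batch_size):
    """Yield the saved pairs in batches, mimicking batch operations."""
    for i in range(0, len(pairs), batch_size):
        yield pairs[i:i + batch_size]


def restore_map(max_entries, saved_pairs, frozen, batch_size=2):
    restored = MapModel(max_entries)
    for batch in chunks(saved_pairs, batch_size):
        restored.update_batch(batch)   # repopulate first
    if frozen:
        restored.freeze()              # reapply the freeze only at the end
    return restored


m = restore_map(4, [(1, "a"), (2, "b"), (3, "c")], frozen=True)
print(m.data, m.frozen)
```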

## To-Do

- **BTF Support**: Serialization and restoration of BPF Type Format (BTF) information associated with maps.
- **Extended Map Types**: Implementation of data serialization for more BPF map types (e.g., `BPF_MAP_TYPE_PERF_EVENT_ARRAY`, `BPF_MAP_TYPE_LPM_TRIE`).
- **Map Extra Data**: Full support for `map_extra` fields introduced in recent kernels (currently only partially parsed with limited restoration).

## External Links
- [BPF Documentation](https://www.kernel.org/doc/html/latest/bpf/index.html)
- [Notes on BPF](https://blogs.oracle.com/linux/notes-on-bpf-1)
- [An eBPF Overview](https://www.collabora.com/news-and-blog/blog/2019/04/05/an-ebpf-overview-part-1-introduction/)