Export wiki under-the-hood docs #2946

Open
avagin wants to merge 60 commits into checkpoint-restore:criu-dev from avagin:docs
Member

avagin commented Mar 9, 2026

No description provided.

avagin added 30 commits March 9, 2026 00:37
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
- Replace FIXME with a detailed description of the current approach
- Explain architecture detection using PTRACE_GETREGSET
- Describe the restoration process via sigreturn and mode switching
- Update vsyscall handling details
- Clarify the status of x32 support and TIF_IA32 removal

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain how CRIU restores AIO context IDs and ring buffers
- Describe the tail synchronization technique using dummy /dev/null requests
- Clarify the lack of support for in-flight events and its implications

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain metadata collection from /proc and BPF syscall
- Describe data serialization using batch operations
- Add details about frozen maps handling
- Clarify current limitations regarding map_extra and BTF

Signed-off-by: Andrei Vagin <avagin@google.com>
- Document full CGroup v2 support and properties
- Explain CGroup namespace (CLONE_NEWCGROUP) handling
- Clarify the 'soft mode' default and other restoration strategies
- Detail the root mount requirement for bind-mounted subgroups

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain the core problem of TCP 4-tuple mismatch
- Describe solutions for listening, in-flight, and established sockets
- Document the UPDATE_INETSK plugin hook for programmatic IP remapping
- Add a summary table of options and flags

Signed-off-by: Andrei Vagin <avagin@google.com>
- Clarify freezing mechanisms (PTRACE_INTERRUPT, Freezer CGroup)
- Detail the parasite injection and bootstrap process
- Explain the role of the restorer blob as a PIE and its conflict avoidance
- Document the final transition via sigreturn

Signed-off-by: Andrei Vagin <avagin@google.com>
- Document the use of 'compel hgen' for header generation
- Update the example header format to include structured relocations
- Describe the 'parasite_blob_desc' setup functions
- Refine the build procedure steps

Signed-off-by: Andrei Vagin <avagin@google.com>
- Embed DMTCP description and characteristics
- Update CRIU supported architectures (s390, MIPS, RISC-V, etc.)
- Refine the comparison table for accuracy and modern features
- Add more context for BLCR, PinPlay, and Legacy OpenVZ

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain the identification of COW candidates by comparing parent/child VMAs
- Describe the pre-mapping strategy before fork to leverage kernel sharing
- Detail the content verification and manual COW triggering
- Document the use of madvise(MADV_DONTNEED) for final memory layout accuracy
- Clarify current limitations regarding reparenting and VMA movement

Signed-off-by: Andrei Vagin <avagin@google.com>
- Formalize the architectural comparison (userspace vs. kernel integration)
- Highlight the dangers of DMTCP's fake PID virtualization
- Explain CRIU's usage of ns_last_pid and clone3 for real PID restoration
- Improve overall technical clarity and structure

Signed-off-by: Andrei Vagin <avagin@google.com>
- Detail the Linux file object hierarchy (Inode, Dentry, File)
- Explain the SCM_RIGHTS mechanism for retrieving local FD copies
- Describe the gen_id and kcmp optimization for shared file detection
- Clarify the two-tier image storage structure (fdinfo vs specialized images)

Signed-off-by: Andrei Vagin <avagin@google.com>
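
For readers unfamiliar with the SCM_RIGHTS mechanism mentioned in the commit above, the following is a minimal Python sketch of passing a file descriptor over a Unix socket. It is not CRIU's code (CRIU is written in C and does this from the parasite); the function names `pass_fd_over_socketpair` and `demo` are hypothetical, and `socket.send_fds` / `socket.recv_fds` require Python 3.9 or later.

```python
import os
import socket


def pass_fd_over_socketpair(fd: int) -> int:
    """Send fd across a Unix socket pair and return the received duplicate.

    Illustration only: both ends live in one process here, whereas in a
    real checkpoint the sender and receiver would be different processes.
    """
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        # send_fds() attaches the descriptor as SCM_RIGHTS ancillary data;
        # at least one byte of ordinary payload must accompany it.
        socket.send_fds(sender, [b"x"], [fd])
        _data, fds, _flags, _addr = socket.recv_fds(receiver, 1, 1)
        # The kernel installs a new descriptor that refers to the same
        # open file description as the one that was sent.
        return fds[0]
    finally:
        sender.close()
        receiver.close()


def demo() -> bytes:
    """Write through the received descriptor and read from the pipe."""
    r, w = os.pipe()
    dup_w = pass_fd_over_socketpair(w)
    os.write(dup_w, b"hello")
    data = os.read(r, 5)
    for fd in (r, w, dup_w):
        os.close(fd)
    return data
```

The received descriptor is a local copy that can be inspected without touching the sender's descriptor table, which is the property the commit message relies on.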
- Clarify dirty page dumping in read-only mappings
- Add instructions for using 'criu check --extra'
- Detail PID mismatch solutions and internal interfaces (clone3)
- Expand on external Unix socket limitations
- Update guidance for Docker and container filesystem consistency

Signed-off-by: Andrei Vagin <avagin@google.com>
- Formalize the Master and Slave descriptor concepts
- Describe the 'open()' state machine and early FD distribution via SCM_RIGHTS
- Document the inter-process synchronization (set_fds_event, futexes)
- List key dependencies (TTYs, Unix Sockets, Epoll)
- Add notes on Service FDs and restoration ordering

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain BTRFS virtual vs physical device ID resolution
- Detail NFS 'Silly Rename' handling for unlinked files
- Document OverlayFS path inconsistencies and linkat() fallback logic
- Clarify legacy AUFS branch path fixes

Signed-off-by: Andrei Vagin <avagin@google.com>
- Formalize TASK_ALIVE, TASK_STOPPED, and TASK_DEAD states
- Explain the rationale for default behaviors in dump/restore
- Mention pre-dump enforcement of the Running state
- Document the use of --leave-stopped for debugging
- Add instructions for resuming trees via SIGCONT and pstree_cont.py

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain the PTRACE_SEIZE and PTRACE_INTERRUPT sequence
- Detail the transparency of ptrace-stop (TRAP_STOP)
- Document cgroup v1 and v2 freezer mechanisms
- Mention kernel kludges for v1 freezer unreliability
- Clarify the relationship between freezer and ptrace

Signed-off-by: Andrei Vagin <avagin@google.com>
- Detail the challenges of finding the 'watchee' path
- Explain the use of open_by_handle_at() and Irmap
- Explicitly document that pending events are dropped with a warning
- Explain how spurious events are generated during restore (ghost files)
- Add details for Fanotify inode and mount marks

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain non-blocking techniques for FIFOs
- Detail link-remap and ghost file strategies for unlinked files
- Document mount namespace (mnt_id) and open_ns_root usage
- Explain fown restoration (F_SETOWN_EX, UID switching, F_SETSIG)
- Clarify flag sanitization and O_PATH handling

Signed-off-by: Andrei Vagin <avagin@google.com>
- Formalize Master and Slave descriptor roles
- Explain the SCM_RIGHTS distribution mechanism
- Document transport socket naming and 'criu_run_id' usage
- Detail deterministic master selection to avoid deadlocks
- Explain dynamic service FD relocation during collisions

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain path loss in fsnotify instances
- Describe the open_by_handle_at() mechanism and kernel integration
- Detail the Irmap brute-force scanning strategy
- Mention filesystem-specific behaviors (Tmpfs, OverlayFS)

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain path loss scenarios (unlinked, virtual files, mount shadowing)
- Detail the Ghost File strategy (link count 0) and optimization (fiemap)
- Document the Link-Remap strategy (link count > 0) via linkat()
- Explain the PID helper (TASK_HELPER) mechanism for virtual files
- Clarify handling for NFS Silly Rename and OverlayFS

Signed-off-by: Andrei Vagin <avagin@google.com>
- Describe the (inode, device) to path resolution problem
- List default heuristic scan hints (/etc, /var/log, etc.)
- Explain user-defined scan paths via --irmap-scan-path
- Detail the pre-dump optimization and irmap-cache.img
- Clarify the status of Irmap vs open_by_handle_at on modern kernels

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain the kernel pointer comparison mechanism of kcmp()
- Describe the two-level red-black tree optimization (genid + kcmp sub-tree)
- List all supported KCMP_* types (FILE, VM, FILES, FS, EPOLL_TFD, etc.)
- Clarify how genid minimizes expensive system calls

Signed-off-by: Andrei Vagin <avagin@google.com>
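
The idea of using a cheap generation ID to avoid most expensive comparisons can be sketched as follows. This is a simplified illustration, not CRIU's implementation: a dictionary and per-bucket lists stand in for the two red-black trees, `expensive_same` is a hypothetical stand-in for a `kcmp()` call, and the class name `SharedObjectIndex` is invented for this example.

```python
from collections import defaultdict


class SharedObjectIndex:
    """Two-level lookup: a cheap gen_id bucket, then an expensive comparison.

    Assumption: objects that are truly the same always have equal gen_id,
    so objects with different gen_id never need the expensive check.
    """

    def __init__(self, expensive_same):
        self._same = expensive_same          # stand-in for kcmp()
        self._buckets = defaultdict(list)    # gen_id -> [(obj, ident)]
        self._next_id = 0
        self.expensive_calls = 0             # counts simulated kcmp() calls

    def lookup_or_add(self, gen_id, obj):
        """Return (ident, is_new) for obj, comparing only within its bucket."""
        for known, ident in self._buckets[gen_id]:
            self.expensive_calls += 1
            if self._same(known, obj):
                return ident, False
        ident = self._next_id
        self._next_id += 1
        self._buckets[gen_id].append((obj, ident))
        return ident, True
```

With this layout the expensive comparison only runs when two objects already agree on the cheap key, which is how the generation ID keeps the number of system calls low.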
- Clarify feature detection for system calls, filesystems, and namespaces
- Update persistent caching locations (/run/criu.kdat vs XDG_RUNTIME_DIR)
- Distinguish between kerndat (host capabilities) and inventory (checkpoint metadata)
- Mention 'criu check --extra' for runtime inspection

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain attribute extraction during checkpointing (Mode, Flags, Parent)
- Detail index preservation using IFLA_NEW_IFINDEX
- Document the --external macvlan[IFNAME]:OUTNAME option
- Improve overall structure and clarity

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain the soft-dirty bit mechanism for tracking modified pages
- Document the usage of ioctl(PAGEMAP_SCAN) for efficient scanning (kernel v6.7+)
- Describe the iterative pre-dump workflow and image chaining
- Detail the consolidation of pages during restoration
- Mention the role of the page server in minimizing disk I/O

Signed-off-by: Andrei Vagin <avagin@google.com>
- Detail the multi-stage dumping approach involving parasite injection
- Explain zero-copy dumping using vmsplice() and SPLICE_F_GIFT
- Describe the use of splice() for efficient image writing and page server transport
- Document VMA re-mapping and content filling during restoration
- Add references to COW preservation and lazy migration (userfaultfd)

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain the in_parent flag in pagemap entries
- Detail detection of unchanged pages via soft-dirty bit
- Document the --auto-dedup mode for dump and restore
- Describe online disk space reclamation using FALLOC_FL_PUNCH_HOLE
- Clarify image chaining and sparse file support

Signed-off-by: Andrei Vagin <avagin@google.com>
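
Image chaining with an in_parent flag can be pictured with the small sketch below. The data layout is an assumption made for illustration (plain dictionaries keyed by virtual address, newest image first, and a sentinel named `IN_PARENT`); it does not reflect CRIU's on-disk pagemap format.

```python
IN_PARENT = object()  # hypothetical marker: "content lives in an older image"


def resolve_page(chain, vaddr):
    """Find page content by walking a chain of images, newest first.

    Each image maps a virtual address to page bytes or to IN_PARENT.
    Returns None if the newest image that decides the page has no entry.
    """
    for image in chain:
        entry = image.get(vaddr)
        if entry is IN_PARENT:
            # Unchanged since the previous dump: keep looking in the parent.
            continue
        return entry
    raise LookupError(f"page {vaddr:#x} marked in_parent with no parent left")
```

For example, with `full = {0x1000: b"A", 0x2000: b"B"}` and `inc = {0x1000: IN_PARENT, 0x2000: b"C"}`, resolving `0x1000` against `[inc, full]` falls through to the older image, while `0x2000` is served from the newer one.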
avagin added 22 commits March 9, 2026 00:37
- Explain the legacy ns_last_pid interface and its limitations
- Detail the modern clone3() with set_tid mechanism (kernel v5.5+)
- Describe the benefits of atomic PID assignment and nested namespace support
- Mention automatic feature detection via Kerndat
- Document implementation using architecture-specific assembly wrappers

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain the PID reuse problem during iterative migration
- Document the use of pidfd_open() for race-free identification
- Detail the 'socket trick' for persistent FD storage via SCM_RIGHTS
- Explain the identity verification process in subsequent iterations
- List required kernel features (pidfd_open, pidfd_getfd)

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain stable identification vs numeric PIDs
- Detail restoration of alive vs dead processes
- Document the 'helper process' trick for dead pidfds
- Explain the transition from anonymous inodes to pidfs (kernel v6.9+)
- Clarify current limitations (PIDFD_THREAD)

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain the sensitivity of rseq state to process execution
- Document the use of PTRACE_GET_RSEQ_CONFIGURATION and external peeking
- Detail the critical requirement to unregister the restorer's own rseq
- Explain how re-registration and rseq_cs restoration ensure automatic kernel fixups
- Update kernel requirements (v5.13 for automated detection)

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain the necessity of a dedicated context for memory swapping
- Describe the shared restorer mapping and mremap-based re-positioning
- Detail the safe hole detection strategy to avoid VMA conflicts
- Document the final transition via sigreturn
- Highlight the characteristics of the freestanding PIE blob

Signed-off-by: Andrei Vagin <avagin@google.com>
- Detail the top-down allocation strategy using RLIMIT_NOFILE
- Explain per-process isolation (service_fd_id) for shared FD tables
- Document the relocation mechanism (F_DUPFD_CLOEXEC, dup3)
- Describe the 'sfds_protected' flag and safety invariants
- List common Service FD types (LOG, IMG, RPC, TRANSPORT, etc.)

Signed-off-by: Andrei Vagin <avagin@google.com>
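
A top-down allocation scheme of this kind can be sketched as below. The exact arithmetic is an assumption made for illustration, not CRIU's formula: `kind_index`, `kinds_per_task`, and `task_slot` are hypothetical parameters, with `task_slot` playing the role that the commit message attributes to `service_fd_id`.

```python
import resource


def service_fd_number(kind_index, kinds_per_task, task_slot, soft_limit):
    """Place service descriptors at the top of the FD range, counting down.

    Each task sharing an FD table gets its own block of kinds_per_task
    slots, so tasks with different task_slot values never collide.
    """
    if not 0 <= kind_index < kinds_per_task:
        raise ValueError("kind_index out of range")
    return soft_limit - 1 - (task_slot * kinds_per_task + kind_index)


def current_soft_limit():
    """Read the soft RLIMIT_NOFILE value that bounds usable FD numbers."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft
```

Allocating from the top keeps service descriptors away from the low numbers that restored applications typically use, so collisions should be rare and can be handled by relocation when they occur.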
- Explain the use of internal inode numbers (shmid) for anonymous sharing
- Detail the restoration of shared anonymous regions via memfd_create()
- Describe the 'master' vs 'slave' roles and futex synchronization
- Document System V IPC and file-backed shared mapping restoration
- Add references to kcmp and memory dumping optimizations

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain the use of sock_diag for kernel state extraction
- Describe the SCM_RIGHTS mechanism for queue inspection
- Detail TCP Repair Mode for connection restoration
- List supported families including Netlink and Packet sockets
- Improve overall structure and technical depth

Signed-off-by: Andrei Vagin <avagin@google.com>
- Formalize the CR_STATE_* state machine and synchronization mechanism
- Detail the multi-stage restoration workflow (Root Task, NS Prep, Forking, etc.)
- Explain the security rationale for Stage 6 (Credentials and Seccomp)
- Document the final transition via sigreturn and thread restoration

Signed-off-by: Andrei Vagin <avagin@google.com>
- Detail the mechanics of TCP Repair Mode and state manipulation
- Explain the role of libsoccr in capturing sequence numbers and options
- Document the network locking workflow using nftables/iptables
- Describe the 'Silent Close' technique to preserve peer connections
- Highlight the importance of sequence number and window restoration

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain the PTY index restoration 'brute-force' strategy
- Detail the capture of termios, winsize, and ownership
- Describe the restoration workflow for master and slave peers
- Clarify the status of buffered data and legacy BSD PTYs
- Document the re-binding of controlling terminals (TIOCSCTTY)

Signed-off-by: Andrei Vagin <avagin@google.com>
- Detail the capture of device attributes (TUN vs TAP, Flags)
- Explain index preservation using TUNSETIFINDEX
- Document multi-queue support and re-attachment via TUNSETQUEUE
- Clarify current limitations (BPF filters, in-flight packets)
- Explain persistency management during restoration

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain the mechanics of Lazy Migration and on-demand page loading
- Detail the Lazy Pages Daemon and the UFFD descriptor handover (SCM_RIGHTS)
- Document the use of non-cooperative UFFD features (Fork, Remap, Unmap)
- Describe the page fault handling loop and page server integration
- Clarify benefits and trade-offs of the lazy approach

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain Build-ID extraction (ELF magic, 1MB mapping)
- Document 'buildid' (default) vs 'filesize' methods
- Explain the automatic fallback mechanism
- Describe the importance for security and memory pointer integrity
- Detail usage via the --file-validation flag

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain the address and ABI mismatch challenges
- Detail the Proxy (Patching) method for older kernels
- Document the modern arch_prctl method for native vDSO mapping
- Explain the role and restoration of the VVAR data region
- Mention automatic feature detection via Kerndat

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain the decoupling of socket paths and inodes
- Document the SIOCUNIXFILE ioctl for stable handle retrieval
- Describe the restoration workflow (tmpfs yard, peer coordination)
- Explain the capture and redelivery of in-flight file descriptors
- Clarify handling of external Unix sockets

Signed-off-by: Andrei Vagin <avagin@google.com>
- Document modern kernel features (clone3, PAGEMAP_SCAN, Mount V2)
- Detail advanced introspection tools (sock_diag, /proc/pid/map_files)
- Explain userspace components (Compel, Protobuf)
- Add references to other architectural documents

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain how zombies are identified and their exit codes captured
- Describe the 'helper technique' for restoring zombies via immediate exit
- Detail parent-child coordination to prevent premature reaping
- Add references to related technical documentation

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain the hardware-assisted shadow stack mechanism
- Document the state capture via NT_ARM_GCS ptrace regset
- Detail restoration using map_shadow_stack and sigframe integration
- List kernel requirements for AArch64 hosts

Signed-off-by: Andrei Vagin <avagin@google.com>
- Explain the profile identification and namespace dumping process
- Document the use of the 'parasite profile' for non-disruptive dumping
- Detail policy loading via apparmor_parser and namespace reconstruction
- Note support for modern features like Profile Stacking
- List kernel and filesystem requirements

Signed-off-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
Comment thread Documentation/under-the-hood/dmtcp.md
Comment thread Documentation/under-the-hood/index.md
Comment thread Documentation/under-the-hood/checkpointrestore.md
avagin closed this Mar 9, 2026
avagin reopened this Apr 21, 2026
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.21%. Comparing base (f6b7fb6) to head (18b3647).

Additional details and impacted files
@@             Coverage Diff              @@
##           criu-dev    #2946      +/-   ##
============================================
+ Coverage     57.19%   57.21%   +0.01%     
============================================
  Files           154      154              
  Lines         40399    40400       +1     
  Branches       8857     8856       -1     
============================================
+ Hits          23107    23113       +6     
+ Misses        17032    17023       -9     
- Partials        260      264       +4     

3 participants