Rewrite mode on aarch64 previously only patched the raw SVC
instruction inside musl's __syscall_cancel, leaving the wrapper-level
bl and the cancel-state check on every open() call. Detect
bl __syscall_cancel at the caller and redirect it to a dedicated
cancel trampoline entry that consumes the syscall number from x6
instead of x8, matching musl's __syscall_cancel(a,b,c,d,e,f,nr)
calling convention, and resumes at bl_pc + 4 with the kernel result
already in x0.
Site detection walks the segment for 'movz x6, #nr' followed by a bl
within 32 bytes, then validates that the BL target is musl's cancel
chain by following the call graph up to depth 2 looking for a svc
preceded by four or more consecutive 'mov Xd, Xm' arg-shuffle
instructions. That prefix is specific to __syscall_cancel_arch and
distinguishes it from ordinary 7-arg C calls that happen to load a
small constant into x6. Plain b (tail call) is never matched because
there is no bl_pc + 4 to resume at.
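The detection rules above can be sketched in C. This is a minimal illustration, not kbox's scanner: the helper names are hypothetical, the call-graph walk to depth 2 is omitted, and it assumes that "within 32 bytes" means a bl at most eight instructions after the movz and that the mov run sits immediately before the svc. The opcode masks are the standard A64 encodings for movz, bl, mov (register) and svc #0.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define A64_SVC_0 0xD4000001u /* svc #0 */

/* movz x6, #imm16 (64-bit, no shift): fixed bits 0xD2800000, Rd = 6. */
static bool is_movz_x6(uint32_t insn, uint16_t *nr)
{
    if ((insn & 0xFFE0001Fu) != 0xD2800006u)
        return false;
    *nr = (uint16_t) ((insn >> 5) & 0xFFFFu);
    return true;
}

/* bl <label>: the top six bits are 100101. */
static bool is_bl(uint32_t insn)
{
    return (insn & 0xFC000000u) == 0x94000000u;
}

/* mov Xd, Xm is the alias of orr Xd, xzr, Xm with no shift. */
static bool is_mov_reg(uint32_t insn)
{
    return (insn & 0xFFE0FFE0u) == 0xAA0003E0u;
}

/* Return the index of the first bl that follows a "movz x6, #nr" within
 * max_gap_bytes, or -1 if there is none. A plain b is never matched. */
static long find_cancel_site(const uint32_t *code, size_t n,
                             size_t max_gap_bytes, uint16_t *nr)
{
    for (size_t i = 0; i < n; i++) {
        uint16_t cand;
        if (!is_movz_x6(code[i], &cand))
            continue;
        size_t limit = i + max_gap_bytes / 4;
        for (size_t j = i + 1; j < n && j <= limit; j++) {
            if (is_bl(code[j])) {
                *nr = cand;
                return (long) j;
            }
        }
    }
    return -1;
}

/* True when code[svc_idx] is svc #0 and is immediately preceded by at
 * least min_movs consecutive "mov Xd, Xm" instructions. */
static bool svc_has_arg_shuffle(const uint32_t *code, size_t svc_idx,
                                size_t min_movs)
{
    if (code[svc_idx] != A64_SVC_0)
        return false;
    size_t run = 0;
    while (run < svc_idx && is_mov_reg(code[svc_idx - 1 - run]))
        run++;
    return run >= min_movs;
}
```

A real validator would also decode the bl's imm26 to find the target and repeat the svc check one level further down the call chain.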
Promotion is gated on two conditions enforced at install time, both
of which must hold:
  1. launch->interp_elf == NULL (static binary). Dynamic binaries are
     rejected outright because their libc lives in a DSO that the
     main-ELF scan cannot see, and they could also dlopen a DSO that
     spins up threads at runtime.
  2. main_elf contains no fork/clone/vfork/clone3 wrapper sites.
     Because condition 1 requires a static binary, libc is part of
     main_elf: pthread_create compiles down to a literal
     'mov x8, #220; svc 0' site that the wrapper-number scanner
     catches. Scanning main_elf alone is therefore sufficient to
     cover the embedded libc.
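As a rough sketch, the two-condition gate reduces to a single predicate. The struct below is a hypothetical stand-in for the launch state. Only the interp_elf field and the cancel_promote_allowed name come from the text; the boolean field and the function shape are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of the launch state used by the gate. */
struct launch_info {
    const void *interp_elf;       /* NULL for a static binary */
    bool main_elf_has_fork_sites; /* result of the wrapper-site scan */
};

/* Both conditions must hold before any bl site is promoted. */
static bool cancel_promote_allowed(const struct launch_info *launch)
{
    if (launch->interp_elf != NULL)
        return false; /* dynamic: libc lives in a DSO the scan cannot see */
    if (launch->main_elf_has_fork_sites)
        return false; /* the program may create threads or processes */
    return true;
}
```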
The fork-sites scan previously went through kbox_rewrite_has_fork_sites,
which walked only x86_64-shaped sites and silently returned 0 for any
aarch64 input. Rewrite it to delegate to kbox_rewrite_has_wrapper_syscalls,
which handles both architectures via the unified wrapper_nr_scan_segment
walker that matches 'mov x8, #nr; svc 0' on aarch64.
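A simplified version of the aarch64 match might look like the following. The function names are hypothetical and this is not the wrapper_nr_scan_segment implementation; it assumes the mov is encoded as a movz into x8 or w8 and sits directly before the svc. On aarch64 the syscall table has clone (220) and clone3 (435) but no fork or vfork numbers, so only those two are checked here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define A64_SVC_0 0xD4000001u /* svc #0 */
#define NR_CLONE 220          /* aarch64 clone */
#define NR_CLONE3 435         /* aarch64 clone3 */

/* mov x8, #imm16 or mov w8, #imm16, encoded as movz with Rd = 8.
 * The mask ignores bit 31 so both register widths match. */
static bool is_mov_x8_imm(uint32_t insn, uint16_t *nr)
{
    if ((insn & 0x7FE0001Fu) != 0x52800008u)
        return false;
    *nr = (uint16_t) ((insn >> 5) & 0xFFFFu);
    return true;
}

/* True when the segment contains "mov x8, #nr; svc 0" with a
 * process- or thread-creating syscall number. */
static bool has_fork_like_sites(const uint32_t *code, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        uint16_t nr;
        if (!is_mov_x8_imm(code[i], &nr))
            continue;
        if (code[i + 1] != A64_SVC_0)
            continue;
        if (nr == NR_CLONE || nr == NR_CLONE3)
            return true;
    }
    return false;
}
```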
kbox_rewrite_has_fork_sites_memfd() is also cleaned up: the previous
version called kbox_rewrite_analyze_memfd() into a 'report' local
that was never read, duplicating the ELF header parse that
kbox_rewrite_has_wrapper_syscalls_memfd() already does internally.
Bypassing __syscall_cancel skips pthread cancellation point checks,
so the static-binary gate is load-bearing: a program that cannot
spawn threads cannot observe the missing cancellation.
docs/cancel-wrapper.md captures the full design, the calling-
convention details, the two gate conditions and their invariants,
and two known residual limitations (clone via syscall(3) and
shared-libc musl).
Performance on real aarch64 hardware (release build, bench_test
10000 iterations, mean of 5 runs):
  syscall      before (us)  after (us)  delta
  stat                 3.5         3.5  noise
  open+close        122.97      119.76  -2.6%
  lseek+read          42.9        42.5  noise
  write                1.5         1.4  flat
  getpid               0.0         0.0  flat
The ~3 us win on open+close matches the cost of the bl prologue, the
cancel-state check, and the epilogue that bypassing __syscall_cancel
removes. stat,
lseek+read, and write do not use the cancel wrapper and are
unaffected. The much larger open+close regression vs. the pre-Phase-8
baseline is addressed by the companion fd-table O(1) close-path
rework in the following commit.
Tests:
- 271/271 unit tests pass on lima (ASAN+UBSAN debug build),
  including new coverage for the BL target validator, the arch-aware
  fork scanner on aarch64, and updated cancel-wrapper ELF
  expectations.
- 51/51 integration tests pass on lima.
- bench_test correctness verified on arm under --syscall-mode=rewrite.
Change-Id: Iac7bf080e653cc0ac3ef23c59db878ae5780204e
Profiling bench_test under --syscall-mode=rewrite on aarch64 showed
kbox_fd_table_find_by_host_fd() consuming 35.47 % of total CPU time.
Every close() on a tracee-held host FD called forward_close(), which
in turn called find_by_host_fd() to locate the supervisor's shadow
entry for that FD. The old implementation walked all three backing
ranges linearly: 1024 low_fds + 31744 mid_fds + 4096 entries = 36864
slots per call. For bench_test's cached-shadow openat path, which
injects an ADDFD without ever creating an fd_table entry, every
close walked the full table finding nothing before returning -1.
Replace the linear scan with two flat tables maintained alongside
the main fd_table:
- host_to_vfd[KBOX_HOST_FD_REVERSE_MAX]: O(1) host_fd -> virtual_fd
  reverse map, sized to cover the child's RLIMIT_NOFILE (65536).
  Three states per slot: KBOX_HOST_VFD_NONE (-1) means no entry
  claims this host_fd (authoritative miss, return -1 in O(1));
  KBOX_HOST_VFD_MULTI (-2) means two or more entries share this
  host_fd (fall through to the linear scan); any non-negative
  value is the single holder's vfd. A forward-check guards
  against stale single-holder entries.
- lkl_fd_refs[KBOX_LKL_FD_REFMAX]: O(1) refcount of how many
  virtual fds currently hold each lkl_fd, replacing the O(n)
  lkl_fd_has_other_ref scan and the still_ref loop inside
  forward_close()'s shadow-socket close path.
The three-state reverse map preserves the "authoritative miss"
optimization for the hot path (map[h] == NONE short-circuits to -1
in O(1)) while correctly handling dup2/dup3-style duplicate holders
via the MULTI sentinel. The invariant is that every positive host_fd
assignment must go through kbox_fd_table_set_host_fd(), which the
codebase already honors; the only direct writes to entry->host_fd
are the two negative sentinel writes in seccomp-dispatch.c, and
they never transition a positive value out.
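The three-state map can be illustrated with a toy table. This is a sketch under assumptions, not the kbox code: the sizes are tiny, all names are hypothetical, and recomputing the slot by rescanning when a holder is dropped is a choice made here for simplicity, since the text does not say how a MULTI slot is downgraded.

```c
#include <stdint.h>

#define N_VFD 64        /* toy sizes; the real tables are far larger */
#define HOST_FD_MAX 128
#define HOST_VFD_NONE (-1)
#define HOST_VFD_MULTI (-2)

struct fd_table {
    long host_fd[N_VFD];              /* -1 when the slot has no host fd */
    int32_t host_to_vfd[HOST_FD_MAX]; /* reverse map with three states */
};

static void table_init(struct fd_table *t)
{
    for (long v = 0; v < N_VFD; v++)
        t->host_fd[v] = -1;
    for (long h = 0; h < HOST_FD_MAX; h++)
        t->host_to_vfd[h] = HOST_VFD_NONE;
}

/* Original O(n) behavior, kept as the fallback path. */
static long scan_by_host_fd(const struct fd_table *t, long h)
{
    for (long v = 0; v < N_VFD; v++)
        if (t->host_fd[v] == h)
            return v;
    return -1;
}

/* Clear vfd's host fd and recompute the reverse slot from what is left. */
static void drop_mapping(struct fd_table *t, long vfd)
{
    long h = t->host_fd[vfd];
    t->host_fd[vfd] = -1;
    if (h < 0 || h >= HOST_FD_MAX)
        return;
    long first = -1;
    int count = 0;
    for (long v = 0; v < N_VFD; v++) {
        if (t->host_fd[v] != h)
            continue;
        if (count++ == 0)
            first = v;
    }
    t->host_to_vfd[h] = count == 0 ? HOST_VFD_NONE
                      : count == 1 ? (int32_t) first
                                   : HOST_VFD_MULTI;
}

/* The single place where a positive host fd may be installed. */
static void set_host_fd(struct fd_table *t, long vfd, long h)
{
    drop_mapping(t, vfd);
    t->host_fd[vfd] = h;
    if (h < 0 || h >= HOST_FD_MAX)
        return;
    t->host_to_vfd[h] = t->host_to_vfd[h] == HOST_VFD_NONE
                            ? (int32_t) vfd
                            : HOST_VFD_MULTI;
}

static long find_by_host_fd(const struct fd_table *t, long h)
{
    if (h < 0)
        return -1;
    if (h >= HOST_FD_MAX)
        return scan_by_host_fd(t, h); /* out of range: safety net */
    int32_t v = t->host_to_vfd[h];
    if (v == HOST_VFD_NONE)
        return -1;                    /* authoritative miss in O(1) */
    if (v >= 0 && t->host_fd[v] == h)
        return v;                     /* forward-check passed */
    return scan_by_host_fd(t, h);     /* MULTI or stale single holder */
}
```

The fast path is only sound if every positive host fd goes through set_host_fd(); a direct write to host_fd[] would leave the slot at NONE and the lookup would miss it.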
Three existing helpers were also refactored. kbox_fd_table_insert_at
now releases the old refcount and reverse-map entry when reusing a
live slot (previously a latent refcount leak). close_cloexec_entry
gained a vfd argument so it can clear the reverse map correctly.
Six copies of 6-field slot-initialization boilerplate were
collapsed into two helpers, clear_fd_entry() and init_live_entry(),
without changing any behavior.
The forward_close() still_ref loop that walked all three ranges
looking for another holder of a shadow socket lkl_fd is replaced
with kbox_fd_table_lkl_ref_count() == 0, which reads the refcount
in O(1) after kbox_fd_table_remove() has already decremented it.
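The refcount side can be sketched in a few lines. The function names here are hypothetical and the out-of-range fallback to a linear scan is left out; the array size of 16384 is the lkl_fd bound quoted in this message.

```c
#include <stdint.h>

#define LKL_FD_REFMAX 16384

/* One counter per lkl_fd: how many virtual fds currently hold it. */
static uint16_t lkl_fd_refs[LKL_FD_REFMAX];

static void lkl_fd_ref(long lkl_fd)
{
    if (lkl_fd >= 0 && lkl_fd < LKL_FD_REFMAX)
        lkl_fd_refs[lkl_fd]++;
}

static void lkl_fd_unref(long lkl_fd)
{
    if (lkl_fd >= 0 && lkl_fd < LKL_FD_REFMAX && lkl_fd_refs[lkl_fd] > 0)
        lkl_fd_refs[lkl_fd]--;
}

/* O(1) replacement for walking every slot looking for another holder. */
static unsigned lkl_fd_ref_count(long lkl_fd)
{
    if (lkl_fd < 0 || lkl_fd >= LKL_FD_REFMAX)
        return 0; /* a real table would fall back to a scan here */
    return lkl_fd_refs[lkl_fd];
}
```

After removing a virtual fd, a count of zero means no other holder remains and the underlying lkl_fd can be closed.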
Memory cost: struct kbox_fd_table grows from ~1.13 MB to ~1.43 MB
(+288 KB: host_to_vfd adds 256 KB as int32_t, lkl_fd_refs adds 32
KB as uint16_t). Stack-allocated in the supervisor launch paths;
the default 8 MB Linux stack has plenty of headroom.
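The +288 KB figure follows directly from the two array sizes, which can be checked at compile time. The constant values are the 65536 and 16384 limits quoted in this message; the helper function is only illustrative.

```c
#include <stddef.h>
#include <stdint.h>

#define KBOX_HOST_FD_REVERSE_MAX 65536
#define KBOX_LKL_FD_REFMAX 16384

/* 65536 slots * 4 bytes = 256 KB for the reverse map. */
_Static_assert(KBOX_HOST_FD_REVERSE_MAX * sizeof(int32_t) == 256 * 1024,
               "host_to_vfd should add 256 KB");

/* 16384 slots * 2 bytes = 32 KB for the refcounts. */
_Static_assert(KBOX_LKL_FD_REFMAX * sizeof(uint16_t) == 32 * 1024,
               "lkl_fd_refs should add 32 KB");

/* Total growth of the table from the two new arrays, in bytes. */
static size_t fd_table_growth_bytes(void)
{
    return KBOX_HOST_FD_REVERSE_MAX * sizeof(int32_t) +
           KBOX_LKL_FD_REFMAX * sizeof(uint16_t);
}
```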
Out-of-range host_fds (>= 65536) and lkl_fds (>= 16384) fall
through to the original linear-scan implementations as a safety
net. In practice kbox raises RLIMIT_NOFILE to exactly 65536 and
LKL allocates small kernel fd numbers, so the fallback paths are
dead code on the hot path.
Performance on real aarch64 hardware (release build, bench_test
10000 iterations, mean of 5 runs):
  syscall      before (us)  after (us)  delta
  stat                 3.5         3.3  noise
  open+close        119.76       47.52  -60.3 % (2.52x faster)
  lseek+read          42.5        43.4  noise
  write                1.4         1.4  flat
  getpid               0.0         0.0  flat
Combined with the preceding cancel-wrapper fast-path commit, the
total improvement versus the pre-series baseline is:
open+close 122.97 us -> 47.52 us -61.4 % (2.59x faster)
perf record confirms kbox_fd_table_find_by_host_fd no longer appears
among the top entries after this change; the new hotspot is kernel-side
_raw_spin_unlock_irqrestore from futex wake-up in the supervisor
service thread (13.40 %), which is an orthogonal signaling cost
and a separate optimization target.
Tests:
- Two new fd-table unit tests document the hybrid semantics: the
  duplicate-holder test asserts either holder is a valid answer
  after a dup-style set (matching the scan-order tie-break the
  MULTI state exposes), and a positive assertion documents the
  load-bearing invariant that positive host_fd values must be
  installed via the API (direct writes are intentionally not
  findable so the authoritative-NONE fast path is sound).
- 273/273 unit tests pass on lima (ASAN+UBSAN debug build).
- 51/51 integration tests pass on lima.
- bench_test, clone3-test, dup-test, and /bin/ls all work
  correctly on arm under --syscall-mode=rewrite. The static-binary
  gate was re-verified: /bin/ls (dynamic) reports
  cancel_promote_allowed=0, bench_test (static) reports
  cancel_promote_allowed=1.
Change-Id: Ic96c0e862e1e984a0966651ee8beb38eb54e7a85
This series eliminates two independent overheads on the open()/close() hot path in kbox's rewrite-mode syscall dispatcher on aarch64. Together the two patches take the bench_test open+close pair from 122.97 us to 47.52 us (-61.4 %, 2.59x faster) on real arm64 hardware, while leaving every other measured syscall unchanged.
Summary by cubic
Accelerates aarch64 rewrite-mode open+close by promoting musl cancel-wrapper call sites and making fd-table close-path lookups O(1), cutting open+close from 122.97µs to 47.52µs (~2.6x) on real arm64 hardware.
New Features
- bl __syscall_cancel call sites on aarch64 to a cancel trampoline that reads nr from x6 and resumes at bl+4.
- fork/clone/vfork/clone3 wrapper sites; dynamic binaries are not promoted.
- movz x6, #nr + nearby bl, then validate the target by walking the BL chain to a svc #0 with the musl arg-shuffle signature; ignore tail-call b.
- docs/cancel-wrapper.md with semantics and trade-offs (cancellation checks are bypassed under the single-threaded gate).
Refactors
- lkl_fd refcounts; eliminate linear scans on close and replace “still_ref” loops with kbox_fd_table_lkl_ref_count().
Written for commit 000015c. Summary will update on new commits.