
Accelerate open+close on aarch64 rewrite mode#51

Merged
jserv merged 2 commits into main from cancel-wrapper
Apr 8, 2026
Conversation

@jserv
Contributor

@jserv jserv commented Apr 8, 2026

This series eliminates two independent overheads on the open()/close() hot path in kbox's rewrite-mode syscall dispatcher on aarch64. Together the two patches take the bench_test open+close pair from 122.97 us to 47.52 us (-61.4 %, 2.59x faster) on real arm64 hardware, while leaving every other measured syscall unchanged.


Summary by cubic

Accelerates aarch64 rewrite-mode open+close by promoting musl cancel-wrapper call sites and making fd-table close-path lookups O(1), cutting open+close from 122.97 us to 47.52 us (~2.6x) on real arm64 hardware.

  • New Features

    • Promote bl __syscall_cancel call sites on aarch64 to a cancel trampoline that reads nr from x6 and resumes at bl+4.
    • Conservative gating: only for static binaries with no fork/clone/vfork/clone3 wrapper sites; dynamic binaries are not promoted.
    • Robust detection: match movz x6, #nr + nearby bl, then validate the target by walking the BL chain to a svc #0 with the musl arg-shuffle signature; ignore tail-call b.
    • Adds docs at docs/cancel-wrapper.md with semantics and trade-offs (cancellation checks are bypassed under the single-threaded gate).
  • Refactors

    • fd-table: add O(1) host-fd → vfd reverse map and lkl_fd refcounts; eliminate linear scans on close and replace “still_ref” loops with kbox_fd_table_lkl_ref_count().
    • Tighten slot lifecycle: centralized clear/init helpers; maintain reverse-map and refcounts on insert, insert_at (releases previous refs), set_host_fd, remove, and close-on-exec.
    • Unify fork-site scanning via wrapper-number detectors so aarch64 is handled correctly.

Written for commit 000015c. Summary will update on new commits.

jserv added 2 commits April 8, 2026 13:54
Rewrite mode on aarch64 previously only patched the raw SVC
instruction inside musl's __syscall_cancel, leaving the wrapper-level
bl and the cancel-state check on every open() call. Detect
bl __syscall_cancel at the caller and redirect it to a dedicated
cancel trampoline entry that consumes the syscall number from x6
instead of x8, matching musl's __syscall_cancel(a,b,c,d,e,f,nr)
calling convention, and resumes at bl_pc + 4 with the kernel result
already in x0.

Site detection walks the segment for 'movz x6, #nr' followed by a bl
within 32 bytes, then validates that the BL target is musl's cancel
chain by following the call graph up to depth 2 looking for a svc
preceded by four or more consecutive 'mov Xd, Xm' arg-shuffle
instructions. That prefix is specific to __syscall_cancel_arch and
distinguishes it from ordinary 7-arg C calls that happen to load a
small constant into x6. Plain b (tail call) is never matched because
there is no bl_pc + 4 to resume at.
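The instruction matching described above can be sketched from the A64 encodings. This is a minimal illustrative matcher, not kbox's actual code: the function names and the scan structure are hypothetical, and the depth-2 call-graph validation of the BL target is elided.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* movz x6, #imm16 (64-bit, shift 0) encodes as 0xD2800006 | (imm16 << 5). */
static bool is_movz_x6(uint32_t insn, uint16_t *nr)
{
    if ((insn & 0xFFE0001Fu) != 0xD2800006u)
        return false;
    *nr = (uint16_t)((insn >> 5) & 0xFFFFu);
    return true;
}

/* bl imm26: top 6 bits are 100101. Plain b (000101) deliberately does not
 * match, mirroring the rule that tail calls have no bl_pc + 4 to resume at. */
static bool is_bl(uint32_t insn, uint64_t pc, uint64_t *target)
{
    if ((insn & 0xFC000000u) != 0x94000000u)
        return false;
    int64_t off = ((int32_t)(insn << 6)) >> 4; /* sign-extend imm26, scale by 4 */
    *target = pc + (uint64_t)off;
    return true;
}

/* Report the first 'movz x6, #nr' followed by a bl within 32 bytes (8 insns);
 * a real implementation would then validate the BL target's svc #0 chain. */
static bool find_cancel_site(const uint32_t *code, size_t n_insns,
                             uint16_t *nr, size_t *bl_idx)
{
    for (size_t i = 0; i < n_insns; i++) {
        if (!is_movz_x6(code[i], nr))
            continue;
        size_t limit = (i + 8 < n_insns) ? i + 8 : n_insns;
        for (size_t j = i + 1; j < limit; j++) {
            uint64_t tgt;
            if (is_bl(code[j], j * 4, &tgt)) {
                *bl_idx = j;
                return true;
            }
        }
    }
    return false;
}
```

For example, a window containing `movz x6, #56` (the openat number on aarch64) followed two instructions later by a bl would be reported with nr = 56.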

Promotion is gated on two conditions enforced at install time, both
of which must hold:

  1. launch->interp_elf == NULL (static binary). Dynamic binaries are
     rejected outright because their libc lives in a DSO that the
     main-ELF scan cannot see, and they could also dlopen a DSO that
     spins up threads at runtime.

  2. main_elf contains no fork/clone/vfork/clone3 wrapper sites.
     Because condition 1 requires a static binary, libc is part of
     main_elf: pthread_create compiles down to a literal
     'mov x8, #220; svc 0' site that the wrapper-number scanner
     catches. Scanning main_elf alone is therefore sufficient to
     cover the embedded libc.
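The two gate conditions compose into a single install-time predicate. The sketch below is illustrative only: `struct launch_info` and `cancel_promote_allowed` are hypothetical stand-ins for kbox's real launch structures and the scanner result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical condensed view of the launch state the gate inspects. */
struct launch_info {
    const void *interp_elf;  /* non-NULL for dynamically linked binaries */
    bool has_fork_sites;     /* fork/clone/vfork/clone3 wrapper sites found */
};

static bool cancel_promote_allowed(const struct launch_info *l)
{
    /* 1. Static binary only: a dynamic binary's libc lives in a DSO the
     *    main-ELF scan cannot see, and dlopen could spin up threads. */
    if (l->interp_elf != NULL)
        return false;
    /* 2. No fork-family wrapper sites: with a static binary, libc is part
     *    of main_elf, so pthread_create's 'mov x8, #220; svc 0' site is
     *    visible to the wrapper-number scanner. */
    return !l->has_fork_sites;
}
```

This matches the verification noted later in the series: a dynamic /bin/ls reports promotion disallowed, while a static bench_test reports it allowed.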

The fork-sites scan previously went through kbox_rewrite_has_fork_sites,
which walked only x86_64-shaped sites and silently returned 0 for any
aarch64 input. Rewrite it to delegate to kbox_rewrite_has_wrapper_syscalls,
which handles both architectures via the unified wrapper_nr_scan_segment
walker that matches 'mov x8, #nr; svc 0' on aarch64.
kbox_rewrite_has_fork_sites_memfd() is also cleaned up: the previous
version called kbox_rewrite_analyze_memfd() into a 'report' local
that was never read, duplicating the ELF header parse that
kbox_rewrite_has_wrapper_syscalls_memfd() already does internally.

Bypassing __syscall_cancel skips pthread cancellation point checks,
so the static-binary gate is load-bearing: a program that cannot
spawn threads cannot observe the missing cancellation.
docs/cancel-wrapper.md captures the full design, the calling-
convention details, the two gate conditions and their invariants,
and two known residual limitations (clone via syscall(3) and
shared-libc musl).

Performance on real aarch64 hardware (release build, bench_test
10000 iterations, mean of 5 runs):

  syscall     before    after    delta
  stat          3.5      3.5     noise
  open+close  122.97   119.76    -2.6%
  lseek+read   42.9     42.5     noise
  write         1.5      1.4     flat
  getpid        0.0      0.0     flat

The ~3 us win on open+close is exactly the bl prologue, cancel-state
check, and epilogue saved by bypassing __syscall_cancel. stat,
lseek+read, and write do not use the cancel wrapper and are
unaffected. The much larger open+close regression vs. the pre-Phase-8
baseline is addressed by the companion fd-table O(1) close-path
rework in the following commit.

Tests:
- 271/271 unit tests pass on lima (ASAN+UBSAN debug build),
  including new coverage for the BL target validator, the arch-aware
  fork scanner on aarch64, and updated cancel-wrapper ELF
  expectations.
- 51/51 integration tests pass on lima.
- bench_test correctness verified on arm under --syscall-mode=rewrite.

Change-Id: Iac7bf080e653cc0ac3ef23c59db878ae5780204e
Profiling bench_test under --syscall-mode=rewrite on aarch64 showed
kbox_fd_table_find_by_host_fd() consuming 35.47 % of total CPU time.
Every close() on a tracee-held host FD called forward_close(), which
in turn called find_by_host_fd() to locate the supervisor's shadow
entry for that FD. The old implementation walked all three backing
ranges linearly: 1024 low_fds + 31744 mid_fds + 4096 entries = 36864
slots per call. For bench_test's cached-shadow openat path, which
injects an ADDFD without ever creating an fd_table entry, every
close walked the full table finding nothing before returning -1.

Replace the linear scan with two flat tables maintained alongside
the main fd_table:

  - host_to_vfd[KBOX_HOST_FD_REVERSE_MAX]: O(1) host_fd -> virtual_fd
    reverse map, sized to cover the child's RLIMIT_NOFILE (65536).
    Three states per slot: KBOX_HOST_VFD_NONE (-1) means no entry
    claims this host_fd (authoritative miss, return -1 in O(1));
    KBOX_HOST_VFD_MULTI (-2) means two or more entries share this
    host_fd (fall through to the linear scan); any non-negative
    value is the single holder's vfd. A forward-check guards
    against stale single-holder entries.

  - lkl_fd_refs[KBOX_LKL_FD_REFMAX]: O(1) refcount of how many
    virtual fds currently hold each lkl_fd, replacing the O(n)
    lkl_fd_has_other_ref scan and the still_ref loop inside
    forward_close()'s shadow-socket close path.

The three-state reverse map preserves the "authoritative miss"
optimization for the hot path (map[h] == NONE short-circuits to -1
in O(1)) while correctly handling dup2/dup3-style duplicate holders
via the MULTI sentinel. The invariant is that every positive host_fd
assignment must go through kbox_fd_table_set_host_fd(), which the
codebase already honors; the only direct writes to entry->host_fd
are the two negative sentinel writes in seccomp-dispatch.c, and
they never transition a positive value out.
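The three-state semantics can be sketched in a few lines. This is a toy model, not kbox's code: the table is shrunk, the names are placeholders, and the forward-check against stale single-holder entries plus the MULTI-case linear scan are elided.

```c
#include <assert.h>
#include <stdint.h>

#define HOST_FD_MAX 16      /* toy size; stands in for KBOX_HOST_FD_REVERSE_MAX */
#define VFD_NONE    (-1)    /* no entry claims this host_fd: authoritative miss */
#define VFD_MULTI   (-2)    /* two or more holders: fall back to linear scan */

static int32_t host_to_vfd[HOST_FD_MAX];

static void map_init(void)
{
    for (int i = 0; i < HOST_FD_MAX; i++)
        host_to_vfd[i] = VFD_NONE;
}

/* Called from the set_host_fd path: claim (host_fd -> vfd), demoting the
 * slot to MULTI when a second holder appears (dup2/dup3-style duplicates). */
static void map_claim(int host_fd, int vfd)
{
    if (host_fd < 0 || host_fd >= HOST_FD_MAX)
        return;                          /* out of range: linear-scan territory */
    if (host_to_vfd[host_fd] == VFD_NONE)
        host_to_vfd[host_fd] = vfd;
    else if (host_to_vfd[host_fd] != vfd)
        host_to_vfd[host_fd] = VFD_MULTI;
}

/* Hot-path lookup: NONE short-circuits to a miss in O(1); MULTI (and
 * out-of-range fds) tells the caller to run the original linear scan. */
static int map_find(int host_fd)
{
    if (host_fd < 0 || host_fd >= HOST_FD_MAX)
        return VFD_MULTI;
    return host_to_vfd[host_fd];
}
```

The key property is that a NONE result is trusted outright, which is exactly what makes the bench_test cached-shadow close path (a host fd with no fd_table entry) O(1) instead of a 36864-slot walk.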

Three existing helpers were also refactored. kbox_fd_table_insert_at
now releases the old refcount and reverse-map entry when reusing a
live slot (previously a latent refcount leak). close_cloexec_entry
gained a vfd argument so it can clear the reverse map correctly.
Six copies of 6-field slot-initialization boilerplate were
collapsed into two helpers, clear_fd_entry() and init_live_entry(),
without changing any behavior.

The forward_close() still_ref loop that walked all three ranges
looking for another holder of a shadow socket lkl_fd is replaced
with kbox_fd_table_lkl_ref_count() == 0, which reads the refcount
in O(1) after kbox_fd_table_remove() has already decremented it.
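The refcount side is even simpler; the sketch below uses hypothetical names (a shrunken table and illustrative take/drop/count helpers standing in for the insert, remove, and kbox_fd_table_lkl_ref_count() paths).

```c
#include <assert.h>
#include <stdint.h>

#define LKL_FD_REFMAX 16    /* toy size; stands in for KBOX_LKL_FD_REFMAX */

static uint16_t lkl_fd_refs[LKL_FD_REFMAX];

/* insert/insert_at/set paths take a reference on the lkl_fd... */
static void lkl_ref_take(int lkl_fd)
{
    if (lkl_fd >= 0 && lkl_fd < LKL_FD_REFMAX)
        lkl_fd_refs[lkl_fd]++;
}

/* ...and remove/close paths drop it. */
static void lkl_ref_drop(int lkl_fd)
{
    if (lkl_fd >= 0 && lkl_fd < LKL_FD_REFMAX && lkl_fd_refs[lkl_fd] > 0)
        lkl_fd_refs[lkl_fd]--;
}

/* After remove has decremented, forward_close() closes the underlying
 * shadow socket only when this reads zero. Out-of-range fds report a
 * nonzero count so the caller falls back to the original scan. */
static unsigned lkl_ref_count(int lkl_fd)
{
    return (lkl_fd >= 0 && lkl_fd < LKL_FD_REFMAX) ? lkl_fd_refs[lkl_fd] : 1u;
}
```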

Memory cost: struct kbox_fd_table grows from ~1.13 MB to ~1.43 MB
(+288 KB: host_to_vfd adds 256 KB as int32_t, lkl_fd_refs adds 32
KB as uint16_t). Stack-allocated in the supervisor launch paths;
the default 8 MB Linux stack has plenty of headroom.
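The +288 KB figure follows directly from the stated element types and table sizes; the constant names below are illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { HOST_FD_REVERSE_MAX = 65536, LKL_FD_REFMAX = 16384 };

/* 65536 * 4 bytes = 256 KB for the reverse map. */
static size_t reverse_map_bytes(void) { return HOST_FD_REVERSE_MAX * sizeof(int32_t); }

/* 16384 * 2 bytes = 32 KB for the refcounts. */
static size_t refs_bytes(void) { return LKL_FD_REFMAX * sizeof(uint16_t); }
```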

Out-of-range host_fds (>= 65536) and lkl_fds (>= 16384) fall
through to the original linear-scan implementations as a safety
net. In practice kbox raises RLIMIT_NOFILE to exactly 65536 and
LKL allocates small kernel fd numbers, so the fallback paths are
dead code on the hot path.

Performance on real aarch64 hardware (release build, bench_test
10000 iterations, mean of 5 runs):

  syscall     before    after    delta
  stat          3.5      3.3     noise
  open+close  119.76    47.52    -60.3 %  (2.52x faster)
  lseek+read   42.5     43.4     noise
  write         1.4      1.4     flat
  getpid        0.0      0.0     flat

Combined with the preceding cancel-wrapper fast-path commit, the
total improvement versus the pre-series baseline is:

  open+close   122.97 us  ->  47.52 us    -61.4 %  (2.59x faster)

perf record confirms kbox_fd_table_find_by_host_fd is off the top
chart after this change; the new hotspot is kernel-side
_raw_spin_unlock_irqrestore from futex wake-up in the supervisor
service thread (13.40 %), which is an orthogonal signaling cost
and a separate optimization target.

Tests:
- Two new fd-table unit tests document the hybrid semantics: the
  duplicate-holder test asserts either holder is a valid answer
  after a dup-style set (matching the scan-order tie-break the
  MULTI state exposes), and a positive assertion documents the
  load-bearing invariant that positive host_fd values must be
  installed via the API (direct writes are intentionally not
  findable so the authoritative-NONE fast path is sound).
- 273/273 unit tests pass on lima (ASAN+UBSAN debug build).
- 51/51 integration tests pass on lima.
- bench_test, clone3-test, dup-test, and /bin/ls all work
  correctly on arm under --syscall-mode=rewrite. The static-binary
  gate was re-verified: /bin/ls (dynamic) reports
  cancel_promote_allowed=0, bench_test (static) reports
  cancel_promote_allowed=1.

Change-Id: Ic96c0e862e1e984a0966651ee8beb38eb54e7a85
@jserv jserv merged commit b3e52ec into main Apr 8, 2026
5 checks passed
@jserv jserv deleted the cancel-wrapper branch April 8, 2026 06:05