
Runtime performance optimizations + Apple Silicon FFI-callback W^X fix #1771

Open
dg1sbg wants to merge 7 commits into clasp-developers:main from dg1sbg:perf/runtime-optimizations

Conversation

@dg1sbg
Contributor

@dg1sbg dg1sbg commented May 25, 2026

Summary

A set of independent runtime performance improvements (each its own commit), plus one Apple-Silicon correctness fix. All changes are portable — the platform-specific bits compile to no-ops off Apple Silicon.

| commit | change | measured effect |
| --- | --- | --- |
| Default `*print-pretty*` to NIL | clasp's pretty printer is a CLOS Gray stream (C++↔Lisp dispatch per character); defaulting it on made every `princ`/`prin1`/`format ~A`/`print` pay that cost | ~370× faster non-pretty printing (a 50-elt list: ~2.4 ms → ~6 µs). REPL stays pretty (`tpl-print` rebinds it); `pprint`/`~<~:>`/explicit bindings unaffected |
| Non-atomic per-thread alloc counters | the counters live in the THREAD_LOCAL `ThreadLocalStateLowLevel` and are never accessed cross-thread, so the `std::atomic`s were pure overhead on every allocation (also fixes 3 uninitialized fields) | 2–6% on allocation |
| Bulk `StringOutputStream_O::write_string` | replaces the per-character boxing + virtual `vectorPushExtend` with a single grow + typed bulk copy + scan-once cursor; safe fallback for narrowing | 1.8×–62× (scales with length), output byte-identical |
| `countObjectFileNames` O(N²)→O(1) | it rescanned `_AllObjectFiles` per JIT-module registration; replaced with a mutex-guarded name→count index synced at the single add point + the bulk-clear sites | removed the #1 self-time function (~15%) in JIT-heavy workloads |
| Enable Boehm `PARALLEL_MARK` | it was off (config-header template default) although bdwgc defaults it on and `GC_THREADS`/`THREAD_LOCAL_ALLOC` are enabled; clasp marks conservatively with no custom mark procs, so parallel markers are safe | ~3× faster GC marking on multicore |
| Cache the TLS pointer in the bytecode VM | on Darwin each `thread_local` access is a `_tlv_get_addr` thunk; the interpreter hit `my_thread` per call | removes the per-call thunk (no-op load off Darwin) |
| Fix Apple-Silicon W^X SIGBUS in FFI callbacks | `setf_jit_lookup_t` stored a function pointer into JIT'd (`MAP_JIT`) callback memory without enabling writes, so `%defcallback` crashed at compile time. Wrapped with the existing `JITDataReadWriteMaybeExecute()`/`JITDataReadExecute()` helpers | `defcallback` now works (was a hard SIGBUS) |

Testing

Built and tested on macOS Apple Silicon (LLVM 22), boehmprecise variant:

  • Full regression suite: 1877 pass, 0 bus errors / segfaults; the only remaining failures are the pre-existing known ones in set-unexpected-failures.lisp (sbcl-cross-compile, include-level, types-classes).
  • The W^X fix turns the previously-crashing defcallback-native test into a pass (CFFI-DEFCALLBACK — the Lisp callback drives C qsort correctly).
  • Output correctness verified byte-identical against a base/extended-string, tab/newline, fill-pointer-growth, and narrowing golden test.
  • Parallel marking verified stable: full regression byte-identical across 3 runs (no non-determinism).

Commits are independent and can be reviewed/cherry-picked separately. The only user-visible behavior change is the *print-pretty* default (deliberate, matching SBCL/CCL/CLISP).

🤖 Generated with Claude Code

dg1sbg and others added 7 commits May 25, 2026 21:16
clasp's pretty printer is a CLOS Gray stream, so every output character
during pretty printing crosses the C++<->Lisp boundary into a generic-function
dispatch. With *print-pretty* defaulting to T, ordinary princ/prin1/format-~A/
print paid that cost: a moderate (50-element) list took ~2.4 ms to princ,
roughly 370x slower than the non-pretty path (~6 us).

Default it to NIL, as SBCL/CCL/CLISP do. PPRINT, ~<...~:>, the format pretty
directives, and explicit *print-pretty* bindings still pretty-print, and the
interactive REPL re-binds it to T (tpl-print) so interactive output is
unchanged. Programmatic printing is now ~370x faster by default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GlobalAllocationProfiler lives in the THREAD_LOCAL ThreadLocalStateLowLevel
(member _Allocations) and is only ever accessed via
my_thread_low_level->_Allocations, i.e. by the owning thread alone (allocator
fast path, gcFunctions, startRunStop, memoryManagement). There is no shared
instance and no cross-thread read, so the std::atomic counters are pure
overhead on registerAllocation(), which runs on every heap allocation.

Switch them to plain int64_t with in-class zero-init (which also fixes three
counters the constructors never initialized).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
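
To illustrate the counter change described in the commit message above, here is a minimal sketch of a per-thread profiler that uses plain integers with in-class zero initialization. The type `AllocationCounters`, its fields, and the variable `my_counters` are hypothetical names chosen for this example; they are not clasp's actual `GlobalAllocationProfiler` layout.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a per-thread allocation profiler; the field names
// are illustrative and not taken from clasp.
struct AllocationCounters {
  // Plain integers with in-class initializers: every constructor starts from
  // zero, so no field can be left uninitialized.
  std::int64_t bytes_allocated = 0;
  std::int64_t allocation_count = 0;

  void register_allocation(std::size_t size) {
    // Only the owning thread ever runs this, so an ordinary add suffices;
    // no atomic read-modify-write is needed.
    bytes_allocated += static_cast<std::int64_t>(size);
    ++allocation_count;
  }
};

// One instance per thread. The sketch assumes no pointer or reference to it
// is ever handed to another thread, which is what makes plain integers safe.
thread_local AllocationCounters my_counters;
```

The safety argument rests entirely on that single-owner assumption: if any other thread ever read these fields, the counters would need to be atomic again.
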
The default AnsiStream_O::write_string writes one character at a time; each
character pays a boxing (clasp_make_character) and a virtual vectorPushExtend
with a fill-pointer/realloc check. Override it for string-output-streams to (1)
grow the backing string once (geometric), (2) bulk-copy via the underlying
simple-vector with non-virtual typed access, and (3) update the output cursor by
scanning the range once. A safe fallback (the tested unsafe_setf_subseq path)
handles character-source-into-base-string narrowing.

Measured 1.8x (14 chars) to 62x (2000 chars) vs the per-char path; output is
byte-identical (verified against a base/extended/tab/newline/fill-pointer/
narrowing golden test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
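
The three steps in the commit message above (grow once, bulk-copy, scan the range once for the cursor) can be sketched on a plain `std::string`. `StringSink`, `advance_column`, and the tab-stop width of 8 are assumptions made for this example and do not reflect clasp's stream classes or its narrowing fallback.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical string sink; `column` plays the role of the output cursor.
struct StringSink {
  std::string buffer;
  std::size_t column = 0;

  // Cursor rule assumed for this sketch: newline resets the column and
  // tab advances to the next multiple of 8.
  void advance_column(char c) {
    if (c == '\n') column = 0;
    else if (c == '\t') column = (column & ~std::size_t{7}) + 8;
    else ++column;
  }

  // Reference path: one append and one cursor update per character.
  void write_char(char c) {
    buffer.push_back(c);
    advance_column(c);
  }

  // Bulk path: (1) grow once, geometrically; (2) copy the whole range;
  // (3) update the cursor in a single scan over the range.
  void write_string(std::string_view s) {
    const std::size_t needed = buffer.size() + s.size();
    if (needed > buffer.capacity())
      buffer.reserve(std::max(needed, buffer.capacity() * 2));
    buffer.append(s.data(), s.size());
    for (char c : s) advance_column(c);
  }
};
```

Writing the same text through `write_char` and through `write_string` should leave both the buffer and the column identical, which is the kind of equivalence the golden test mentioned above checks.
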
countObjectFileNames rescanned the entire _AllObjectFiles list with a memcmp on
each call, and ensureUniqueMemoryBufferName calls it once per JIT-module
registration -- so registering N object files is O(N^2). On a JIT/compilation-
heavy workload it was the single largest self-time function (~15% in one
profile).

_AllObjectFiles is only ever appended to (registerObjectFile, the single add
point) or bulk-cleared, never individually pruned, so keep an auxiliary
mutex-guarded name->count map in sync at those points and answer
countObjectFileNames from it in O(1). The map holds no GC pointers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
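
As a rough sketch of the auxiliary index described in the commit message above, the class below keeps a name-to-count map in step with an append-only list and answers count queries from the map. `ObjectFileRegistry` and its member names are hypothetical; the real code tracks object-file objects rather than bare strings, and the lookup is O(1) on average because it is a hash map.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical registry: an append-only list plus a name -> count index.
class ObjectFileRegistry {
 public:
  // The single add point: list and index are updated under one lock.
  void register_object_file(const std::string& name) {
    std::lock_guard<std::mutex> lock(mutex_);
    files_.push_back(name);
    ++name_counts_[name];
  }

  // Bulk clear: both structures are emptied together so they stay in sync.
  void clear() {
    std::lock_guard<std::mutex> lock(mutex_);
    files_.clear();
    name_counts_.clear();
  }

  // Average O(1) lookup in the index.
  std::size_t count_names(const std::string& name) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = name_counts_.find(name);
    return it == name_counts_.end() ? 0 : it->second;
  }

  // The O(N) rescan that the index replaces, kept here for comparison.
  std::size_t count_names_by_scan(const std::string& name) const {
    std::lock_guard<std::mutex> lock(mutex_);
    std::size_t n = 0;
    for (const auto& f : files_) if (f == name) ++n;
    return n;
  }

 private:
  mutable std::mutex mutex_;
  std::vector<std::string> files_;
  std::unordered_map<std::string, std::size_t> name_counts_;
};
```

The index stays correct only because entries are never removed individually; a per-entry removal path would also have to decrement the matching count.
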
PARALLEL_MARK was undefined -- the Boehm config-header template default --
even though bdwgc's own CMake defaults it ON and GC_THREADS/THREAD_LOCAL_ALLOC
are already enabled. As a result GC marking (GC_mark_from), the dominant runtime
cost (~25-30% in profiles), ran single-threaded on multicore machines.

clasp's heap marking is Boehm-standard conservative marking (ALL_INTERIOR_
POINTERS=1 + GC_register_displacement for tagged pointers) with no custom mark
procedures or typed descriptors, so parallel markers parallelize Boehm's own
parallel-safe marking loop -- clasp contributes no per-thread marking logic.
bdwgc auto-starts (#CPUs-1) marker threads in GC_thr_init.

Measured ~2.9-3.1x faster full GC on a large live heap. The full regression
suite is byte-identical across 3 runs (no non-determinism).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On Darwin every thread_local access compiles to a _tlv_get_addr thunk call
(there is no ELF initial-exec model; tls_model is a no-op). The bytecode
interpreter hit my_thread on every Lisp call (maybe_step_call's breakstep check)
and in several opcodes.

Resolve my_thread once per VM frame (bytecode_vm and long_dispatch) into a
local and pass it to maybe_step_call, removing the per-call thunk. The thread
does not change during a VM frame. Correct (regression suite identical) and
zero-downside (a no-op load off Darwin).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
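
The caching pattern from the commit message above can be sketched with a toy interpreter loop. Everything here (`ThreadState`, `kOpCall`, `run_frame`, and the single-byte opcode encoding) is hypothetical and much simpler than clasp's bytecode VM; the only point is that the thread-local is resolved once per frame and then passed along as an ordinary pointer.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical per-thread interpreter state.
struct ThreadState {
  bool break_step = false;
  std::uint64_t steps = 0;
};

thread_local ThreadState my_thread;

constexpr std::uint8_t kOpCall = 1;  // illustrative opcode value

// Takes the already-resolved pointer instead of touching the thread_local.
inline void maybe_step_call(ThreadState* thread) {
  if (thread->break_step) ++thread->steps;
}

// Runs one "frame" and returns how many call opcodes it executed.
std::uint64_t run_frame(const std::uint8_t* code, std::size_t length) {
  // Resolve the thread-local once; on platforms where each access is a
  // function call, this hoists that call out of the dispatch loop.
  ThreadState* thread = &my_thread;
  std::uint64_t calls = 0;
  for (std::size_t i = 0; i < length; ++i) {
    if (code[i] == kOpCall) {
      maybe_step_call(thread);
      ++calls;
    }
  }
  return calls;
}
```

This is valid only while a frame cannot migrate between threads, which matches the commit's statement that the thread does not change during a VM frame.
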
setf_jit_lookup_t stores a Lisp function pointer into a variable that lives in a
JIT'd dylib (the callback-lisp-function-N global created by make-callback). On
Apple Silicon that memory is MAP_JIT and, under W^X, write-protected for the
thread while it is executable, so the plain store faulted with SIGBUS --
compiling any %defcallback form
crashed (the defcallback-native regression test). It is the only JIT-memory
write unique to callbacks, which is why ordinary compilation was unaffected.

Wrap the store with the existing JITDataReadWriteMaybeExecute()/
JITDataReadExecute() helpers used at the other JIT-literal write sites; the
window contains only the pointer store (no allocation/GC/JIT). No-op off Apple
Silicon. defcallback now works (CFFI-DEFCALLBACK passes; the Lisp callback
drives C qsort correctly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
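
The write window described in the commit message above could look roughly like the sketch below. It is written under an assumption: the PR does not show how `JITDataReadWriteMaybeExecute()` and `JITDataReadExecute()` are implemented, so the sketch uses macOS's `pthread_jit_write_protect_np` directly and wraps it in an RAII guard, which is a substitute for the paired helper calls rather than clasp's code. `JitWriteWindow`, `store_callback_target`, `LispFunction`, and `sample_callback` are hypothetical names.

```cpp
#include <cstdint>

#if defined(__APPLE__)
#include <TargetConditionals.h>
#endif
#if defined(__APPLE__) && defined(__aarch64__) && TARGET_OS_OSX
#include <pthread.h>
#define SKETCH_HAS_JIT_WRITE_PROTECT 1
#else
#define SKETCH_HAS_JIT_WRITE_PROTECT 0
#endif

// RAII guard: MAP_JIT pages are writable for the current thread while the
// guard is alive. On other platforms both calls compile away.
class JitWriteWindow {
 public:
  JitWriteWindow() {
#if SKETCH_HAS_JIT_WRITE_PROTECT
    pthread_jit_write_protect_np(0);  // writable, not executable
#endif
  }
  ~JitWriteWindow() {
#if SKETCH_HAS_JIT_WRITE_PROTECT
    pthread_jit_write_protect_np(1);  // executable, not writable
#endif
  }
  JitWriteWindow(const JitWriteWindow&) = delete;
  JitWriteWindow& operator=(const JitWriteWindow&) = delete;
};

using LispFunction = std::intptr_t (*)(std::intptr_t);

// Keep the window as small as possible: only the pointer store runs inside.
void store_callback_target(LispFunction* slot, LispFunction target) {
  JitWriteWindow window;
  *slot = target;
}

// Example target used to exercise the store.
std::intptr_t sample_callback(std::intptr_t x) { return x + 1; }
```

In this sketch the slot is ordinary memory, so the guard has no observable effect; it matters only when the slot lives in a `MAP_JIT` mapping on Apple Silicon.
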