Runtime performance optimizations + Apple Silicon FFI-callback W^X fix by dg1sbg · Pull Request #1771 · clasp-developers/clasp

dg1sbg · 2026-05-25T19:17:20Z

Summary

A set of independent runtime performance improvements (each its own commit), plus one Apple-Silicon correctness fix. All changes are portable — the platform-specific bits compile to no-ops off Apple Silicon.

| Commit | Change | Measured effect |
| --- | --- | --- |
| Default `*print-pretty*` to NIL | clasp's pretty printer is a CLOS Gray stream (C++↔Lisp dispatch per character); defaulting it on made every `princ`/`prin1`/`format ~A`/`print` pay that cost | ~370× faster non-pretty printing (a 50-element list: ~2.4 ms → ~6 µs). REPL stays pretty (`tpl-print` rebinds it); `pprint`/`~<~:>`/explicit bindings unaffected |
| Non-atomic per-thread alloc counters | the counters live in the `THREAD_LOCAL` `ThreadLocalStateLowLevel` and are never accessed cross-thread, so the `std::atomic`s were pure overhead on every allocation (also fixes 3 uninitialized fields) | 2–6% on allocation |
| Bulk `StringOutputStream_O::write_string` | replaces the per-character boxing + virtual `vectorPushExtend` with a single grow + typed bulk copy + scan-once cursor; safe fallback for narrowing | 1.8×–62× (scales with length), output byte-identical |
| `countObjectFileNames` O(N²)→O(1) | it rescanned `_AllObjectFiles` per JIT-module registration; replaced with a mutex-guarded `name→count` index synced at the single add point + the bulk-clear sites | removed the top self-time function (~15%) in JIT-heavy workloads |
| Enable Boehm `PARALLEL_MARK` | it was off (config-header template default) although bdwgc defaults it on and `GC_THREADS`/`THREAD_LOCAL_ALLOC` are enabled; clasp marks conservatively with no custom mark procs, so parallel markers are safe | ~3× faster GC marking on multicore |
| Cache the TLS pointer in the bytecode VM | on Darwin each `thread_local` access is a `_tlv_get_addr` thunk; the interpreter hit `my_thread` per call | removes the per-call thunk (no-op load off Darwin) |
| Fix Apple-Silicon W^X SIGBUS in FFI callbacks | `setf_jit_lookup_t` stored a function pointer into JIT'd (MAP_JIT) callback memory without enabling writes, so `%defcallback` crashed at compile time. Wrapped with the existing `JITDataReadWriteMaybeExecute()`/`JITDataReadExecute()` helpers | `defcallback` now works (was a hard SIGBUS) |

Testing

Built and tested on macOS Apple Silicon (LLVM 22), boehmprecise variant:

- Full regression suite: 1877 pass, 0 bus errors / segfaults; the only remaining failures are the pre-existing known ones in `set-unexpected-failures.lisp` (sbcl-cross-compile, include-level, types-classes).
- The W^X fix turns the previously-crashing defcallback-native test into a pass (CFFI-DEFCALLBACK: the Lisp callback drives C qsort correctly).
- Output correctness verified byte-identical against a base/extended-string, tab/newline, fill-pointer-growth, and narrowing golden test.
- Parallel marking verified stable: full regression byte-identical across 3 runs (no non-determinism).

Commits are independent and can be reviewed/cherry-picked separately. The only user-visible behavior change is the *print-pretty* default (deliberate, matching SBCL/CCL/CLISP).

🤖 Generated with Claude Code

clasp's pretty printer is a CLOS Gray stream, so every output character during pretty printing crosses the C++<->Lisp boundary into a generic-function dispatch. With *print-pretty* defaulting to T, ordinary princ/prin1/format-~A/print paid that cost: a moderate (50-element) list took ~2.4 ms to princ, roughly 370x slower than the non-pretty path (~6 us). Default it to NIL, as SBCL/CCL/CLISP do. PPRINT, ~<...~:>, the format pretty directives, and explicit *print-pretty* bindings still pretty-print, and the interactive REPL re-binds it to T (tpl-print) so interactive output is unchanged. Programmatic printing is now ~370x faster by default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

GlobalAllocationProfiler lives in the THREAD_LOCAL ThreadLocalStateLowLevel (member _Allocations) and is only ever accessed via my_thread_low_level->_Allocations, i.e. by the owning thread alone (allocator fast path, gcFunctions, startRunStop, memoryManagement). There is no shared instance and no cross-thread read, so the std::atomic counters are pure overhead on registerAllocation(), which runs on every heap allocation. Switch them to plain int64_t with in-class zero-init (which also fixes three counters the constructors never initialized).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
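The counter change described above can be sketched in isolation. This is a minimal illustration, not clasp's code: `AllocationCounters`, its field names, and `my_counters` are hypothetical stand-ins for `GlobalAllocationProfiler` and the thread-local state that owns it.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the per-thread allocation profiler.
struct AllocationCounters {
  // Plain integers with in-class zero-init, so no constructor can
  // leave a counter uninitialized.
  std::int64_t bytesAllocated = 0;
  std::int64_t allocationCount = 0;

  // Runs on every allocation: two plain adds instead of two atomic
  // read-modify-write operations.
  void registerAllocation(std::size_t size) {
    bytesAllocated += static_cast<std::int64_t>(size);
    ++allocationCount;
  }
};

// One instance per thread. This is only safe without atomics because
// no other thread ever reads or writes this instance.
thread_local AllocationCounters my_counters;
```

The design only holds as long as the "owning thread alone" property does; a cross-thread reader (for example a monitoring thread summing all counters) would need atomics or another synchronization scheme again.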

The default AnsiStream_O::write_string writes one character at a time; each character pays a boxing (clasp_make_character) and a virtual vectorPushExtend with a fill-pointer/realloc check. Override it for string-output-streams to (1) grow the backing string once (geometric), (2) bulk-copy via the underlying simple-vector with non-virtual typed access, and (3) update the output cursor by scanning the range once. A safe fallback (the tested unsafe_setf_subseq path) handles character-source-into-base-string narrowing. Measured 1.8x (14 chars) to 62x (2000 chars) vs the per-char path; output is byte-identical (verified against a base/extended/tab/newline/fill-pointer/narrowing golden test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
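The three steps above (grow once, copy once, scan once) can be illustrated with a toy stream. This is a sketch under simplifying assumptions: `MiniStringStream` is hypothetical, it uses `std::string` rather than clasp's string types, and its column cursor only tracks newlines (tab handling and the narrowing fallback are left out).

```cpp
#include <cstddef>
#include <string>

// Hypothetical miniature of a string output stream: a buffer plus a
// column cursor counting characters since the last newline.
struct MiniStringStream {
  std::string buffer;
  std::size_t column = 0;

  // Baseline: one append and one cursor update per character.
  void writeCharAtATime(const std::string& s) {
    for (char c : s) {
      buffer.push_back(c);
      column = (c == '\n') ? 0 : column + 1;
    }
  }

  // Bulk path: grow once (geometrically), copy once, scan once.
  void writeBulk(const std::string& s) {
    if (s.empty()) return;
    std::size_t needed = buffer.size() + s.size();
    if (needed > buffer.capacity()) {
      std::size_t doubled = buffer.capacity() * 2;
      buffer.reserve(needed > doubled ? needed : doubled);
    }
    buffer.append(s);
    // A single backward scan finds the last newline, which is all the
    // cursor needs to know.
    std::size_t lastNewline = s.rfind('\n');
    if (lastNewline == std::string::npos) {
      column += s.size();
    } else {
      column = s.size() - lastNewline - 1;
    }
  }
};
```

Both methods are meant to leave `buffer` and `column` in the same state for the same input, which mirrors the byte-identical check the commit describes.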

countObjectFileNames rescanned the entire _AllObjectFiles list with a memcmp on each call, and ensureUniqueMemoryBufferName calls it once per JIT-module registration -- so registering N object files is O(N^2). On a JIT/compilation-heavy workload it was the single largest self-time function (~15% in one profile). _AllObjectFiles is only ever appended to (registerObjectFile, the single add point) or bulk-cleared, never individually pruned, so keep an auxiliary mutex-guarded name->count map in sync at those points and answer countObjectFileNames from it in O(1). The map holds no GC pointers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
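A minimal sketch of the index idea follows, assuming a hash map as the name->count structure (the commit message does not say which container clasp uses). `ObjectFileRegistry` and `countByScan` are hypothetical names; only `registerObjectFile` and `countObjectFileNames` echo the text above.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical registry: an append-only list plus a name -> count index.
class ObjectFileRegistry {
public:
  // The single add point: list and index are updated under one lock.
  void registerObjectFile(const std::string& name) {
    std::lock_guard<std::mutex> lock(mutex_);
    files_.push_back(name);
    ++counts_[name];
  }

  // Bulk clear: the index is reset together with the list.
  void clear() {
    std::lock_guard<std::mutex> lock(mutex_);
    files_.clear();
    counts_.clear();
  }

  // Answered from the index: O(1) on average for a hash map.
  std::size_t countObjectFileNames(const std::string& name) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = counts_.find(name);
    return it == counts_.end() ? 0 : it->second;
  }

  // The replaced approach, kept for comparison: a full scan per query.
  std::size_t countByScan(const std::string& name) const {
    std::lock_guard<std::mutex> lock(mutex_);
    std::size_t n = 0;
    for (const std::string& f : files_) {
      if (f == name) ++n;
    }
    return n;
  }

private:
  mutable std::mutex mutex_;
  std::vector<std::string> files_;
  std::unordered_map<std::string, std::size_t> counts_;
};
```

The index stays correct only because entries are never removed individually; supporting single-entry removal would require decrementing the count at that site as well.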

PARALLEL_MARK was undefined -- the Boehm config-header template default -- even though bdwgc's own CMake defaults it ON and GC_THREADS/THREAD_LOCAL_ALLOC are already enabled. As a result GC marking (GC_mark_from), the dominant runtime cost (~25-30% in profiles), ran single-threaded on multicore machines. clasp's heap marking is Boehm-standard conservative marking (ALL_INTERIOR_POINTERS=1 + GC_register_displacement for tagged pointers) with no custom mark procedures or typed descriptors, so parallel markers parallelize Boehm's own parallel-safe marking loop -- clasp contributes no per-thread marking logic. bdwgc auto-starts (#CPUs-1) marker threads in GC_thr_init. Measured ~2.9-3.1x faster full GC on a large live heap. The full regression suite is byte-identical across 3 runs (no non-determinism).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

On Darwin every thread_local access compiles to a _tlv_get_addr thunk call (there is no ELF initial-exec model; tls_model is a no-op). The bytecode interpreter hit my_thread on every Lisp call (maybe_step_call's breakstep check) and in several opcodes. Resolve my_thread once per VM frame (bytecode_vm and long_dispatch) into a local and pass it to maybe_step_call, removing the per-call thunk. The thread does not change during a VM frame. Correct (regression suite identical) and zero-downside (a no-op load off Darwin).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
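The hoisting pattern can be shown on a toy interpreter loop. Everything here is hypothetical (`ThreadState`, `my_thread_state`, `maybeStepCall`, and the two `runFrame*` functions); it only illustrates resolving the thread-local address once per frame and passing the pointer down.

```cpp
#include <cstdint>

// Hypothetical per-thread interpreter state.
struct ThreadState {
  bool breakstep = false;
  std::uint64_t calls = 0;
};

thread_local ThreadState my_thread_state;

// Takes the state as a parameter instead of naming the thread_local,
// so the callee never has to resolve the TLS address itself.
inline void maybeStepCall(ThreadState* thread) {
  ++thread->calls;
  if (thread->breakstep) {
    // A stepper hook would run here.
  }
}

// Before: every iteration names the thread_local, which on some
// platforms means an address-lookup call each time.
inline void runFrameUncached(int n) {
  for (int i = 0; i < n; ++i) {
    maybeStepCall(&my_thread_state);
  }
}

// After: resolve the address once per frame and reuse the local.
inline void runFrameCached(int n) {
  ThreadState* thread = &my_thread_state;
  for (int i = 0; i < n; ++i) {
    maybeStepCall(thread);
  }
}
```

In a loop this small a compiler may hoist the lookup on its own; the manual version matters more in a large interpreter where opaque calls between accesses can prevent that.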

setf_jit_lookup_t stores a Lisp function pointer into a variable that lives in a JIT'd dylib (the callback-lisp-function-N global created by make-callback). On Apple Silicon that memory is MAP_JIT and execute-protected for the thread under W^X, so the plain store faulted with SIGBUS -- compiling any %defcallback form crashed (the defcallback-native regression test). It is the only JIT-memory write unique to callbacks, which is why ordinary compilation was unaffected. Wrap the store with the existing JITDataReadWriteMaybeExecute()/JITDataReadExecute() helpers used at the other JIT-literal write sites; the window contains only the pointer store (no allocation/GC/JIT). No-op off Apple Silicon. defcallback now works (CFFI-DEFCALLBACK passes; the Lisp callback drives C qsort correctly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
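The shape of such a write window can be sketched with Apple's `pthread_jit_write_protect_np`, which toggles the calling thread's access to MAP_JIT pages between writable and executable. This is an assumption about one way to build the window, not a description of clasp's `JITDataReadWriteMaybeExecute()`/`JITDataReadExecute()` helpers, whose implementation is not shown in this PR text. `JitWriteWindow`, `LispFunctionPointer`, and `storeCallbackTarget` are hypothetical names.

```cpp
#if defined(__APPLE__) && defined(__aarch64__)
#include <pthread.h>
#endif

// Hypothetical RAII guard: opens a write window over this thread's
// MAP_JIT pages and closes it again on scope exit.
class JitWriteWindow {
public:
  JitWriteWindow() {
#if defined(__APPLE__) && defined(__aarch64__)
    pthread_jit_write_protect_np(0);  // writable, not executable
#endif
  }
  ~JitWriteWindow() {
#if defined(__APPLE__) && defined(__aarch64__)
    pthread_jit_write_protect_np(1);  // executable, not writable
#endif
  }
  JitWriteWindow(const JitWriteWindow&) = delete;
  JitWriteWindow& operator=(const JitWriteWindow&) = delete;
};

using LispFunctionPointer = void (*)();

// The window contains only the pointer store, so nothing that might
// run JIT'd code executes while the pages are non-executable.
inline void storeCallbackTarget(LispFunctionPointer* slot,
                                LispFunctionPointer target) {
  JitWriteWindow window;
  *slot = target;
}
```

Off Apple Silicon the guard compiles to nothing and the function is a plain store, matching the "no-op off Apple Silicon" property the commit claims for the real helpers.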

dg1sbg and others added 7 commits May 25, 2026 21:16

dg1sbg wants to merge 7 commits into clasp-developers:main from dg1sbg:perf/runtime-optimizations
