Runtime performance optimizations + Apple Silicon FFI-callback W^X fix #1771
Open
dg1sbg wants to merge 7 commits into
clasp's pretty printer is a CLOS Gray stream, so every output character during pretty printing crosses the C++<->Lisp boundary into a generic-function dispatch. With *print-pretty* defaulting to T, ordinary princ/prin1/format-~A/print paid that cost: a moderate (50-element) list took ~2.4 ms to princ, roughly 370x slower than the non-pretty path (~6 us). Default it to NIL, as SBCL/CCL/CLISP do. PPRINT, ~<...~:>, the format pretty directives, and explicit *print-pretty* bindings still pretty-print, and the interactive REPL re-binds it to T (tpl-print) so interactive output is unchanged. In that measurement, programmatic printing is now ~370x faster by default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GlobalAllocationProfiler lives in the THREAD_LOCAL ThreadLocalStateLowLevel (member _Allocations) and is only ever accessed via my_thread_low_level->_Allocations, i.e. by the owning thread alone (allocator fast path, gcFunctions, startRunStop, memoryManagement). There is no shared instance and no cross-thread read, so the std::atomic counters are pure overhead on registerAllocation(), which runs on every heap allocation. Switch them to plain int64_t with in-class zero-init (which also fixes three counters the constructors never initialized). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
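As a rough illustration of the change described in this commit, the sketch below shows per-thread counters kept as plain `int64_t` with in-class zero initializers. Every name here (`AllocationProfilerSketch`, `myProfiler`, `allocateOnNewThread`) is hypothetical and is not taken from clasp's sources; it only assumes, as the commit message states, that each profiler instance is touched by its owning thread alone.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for the profiler; the real class has other fields.
struct AllocationProfilerSketch {
  // Plain integers with in-class initializers: each counter starts at zero
  // even if a constructor forgets to mention it.
  std::int64_t bytesAllocated = 0;
  std::int64_t allocationCount = 0;

  // Runs on every allocation, so it avoids atomic read-modify-write.
  void registerAllocation(std::size_t size) {
    bytesAllocated += static_cast<std::int64_t>(size);
    ++allocationCount;
  }
};

// One instance per thread; only the owning thread ever reads or writes it.
inline AllocationProfilerSketch& myProfiler() {
  thread_local AllocationProfilerSketch profiler;
  return profiler;
}

// Performs `count` allocations of `size` bytes on a fresh thread and returns
// that thread's own byte total, read by the worker itself before it exits.
inline std::int64_t allocateOnNewThread(int count, std::size_t size) {
  assert(count >= 0);
  std::int64_t total = -1;
  std::thread worker([&] {
    for (int i = 0; i < count; ++i) myProfiler().registerAllocation(size);
    total = myProfiler().bytesAllocated;
  });
  worker.join();
  return total;
}
```

Because each thread gets its own instance, allocations registered on a worker thread never show up in the caller's counters, which is why no synchronization is needed under this access pattern.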
The default AnsiStream_O::write_string writes one character at a time; each character pays a boxing (clasp_make_character) and a virtual vectorPushExtend with a fill-pointer/realloc check. Override it for string-output-streams to (1) grow the backing string once (geometric), (2) bulk-copy via the underlying simple-vector with non-virtual typed access, and (3) update the output cursor by scanning the range once. A safe fallback (the tested unsafe_setf_subseq path) handles character-source-into-base-string narrowing. Measured 1.8x (14 chars) to 62x (2000 chars) vs the per-char path; output is byte-identical (verified against a base/extended/tab/newline/fill-pointer/narrowing golden test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
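The three steps in this commit (grow once, bulk copy, scan once for the cursor) can be sketched in isolation as follows. This is a simplified model, not clasp's `StringOutputStream_O`: `StringSinkSketch` is a hypothetical type over a `std::vector<char>`, the cursor only tracks characters since the last newline (tab handling is left out), and the narrowing fallback is not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical string-output sink; buffer.size() plays the fill pointer.
struct StringSinkSketch {
  std::vector<char> buffer;
  std::size_t column = 0;  // characters written since the last newline

  // Per-character baseline: a capacity check and cursor update per char.
  void writeChar(char c) {
    buffer.push_back(c);
    column = (c == '\n') ? 0 : column + 1;
  }

  // Bulk path for the half-open range [start, end) of `s`.
  void writeString(std::string_view s, std::size_t start, std::size_t end) {
    assert(start <= end && end <= s.size());
    const std::size_t n = end - start;
    if (n == 0) return;

    // (1) Grow at most once, at least doubling so repeated writes stay
    //     amortized O(1) per character.
    const std::size_t needed = buffer.size() + n;
    if (needed > buffer.capacity())
      buffer.reserve(std::max(needed, buffer.capacity() * 2));

    // (2) Copy the whole range in one call.
    buffer.insert(buffer.end(), s.data() + start, s.data() + end);

    // (3) One scan of the range: the cursor depends only on the position of
    //     the last newline, if there is one.
    const std::string_view range = s.substr(start, n);
    const std::size_t lastNewline = range.rfind('\n');
    column = (lastNewline == std::string_view::npos) ? column + n
                                                     : n - lastNewline - 1;
  }
};
```

Writing the same text through `writeChar` in a loop and through a single `writeString` call should leave both the buffer and the column identical, which is the property the commit's golden test checks for the real implementation.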
countObjectFileNames rescanned the entire _AllObjectFiles list with a memcmp on each call, and ensureUniqueMemoryBufferName calls it once per JIT-module registration -- so registering N object files is O(N^2). On a JIT/compilation-heavy workload it was the single largest self-time function (~15% in one profile). _AllObjectFiles is only ever appended to (registerObjectFile, the single add point) or bulk-cleared, never individually pruned, so keep an auxiliary mutex-guarded name->count map in sync at those points and answer countObjectFileNames from it in O(1). The map holds no GC pointers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
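A minimal sketch of the auxiliary index idea follows, assuming an append-only list that is only ever extended or bulk-cleared. `ObjectFileRegistrySketch` and its members are hypothetical names; clasp's actual containers hold object-file objects rather than plain strings.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical registry modelling the append-only list plus its index.
class ObjectFileRegistrySketch {
public:
  // The single add point: append to the list and bump the name's count.
  void registerObjectFile(const std::string& name) {
    std::lock_guard<std::mutex> lock(mutex_);
    allObjectFiles_.push_back(name);
    ++nameCounts_[name];
  }

  // Bulk clear: both structures are reset together so they cannot drift.
  void clear() {
    std::lock_guard<std::mutex> lock(mutex_);
    allObjectFiles_.clear();
    nameCounts_.clear();
  }

  // Average O(1) lookup in the index instead of rescanning the list.
  std::size_t countObjectFileNames(const std::string& name) const {
    std::lock_guard<std::mutex> lock(mutex_);
    const auto it = nameCounts_.find(name);
    return it == nameCounts_.end() ? 0 : it->second;
  }

  // The original O(N) behaviour, kept here only to compare against.
  std::size_t countByScan(const std::string& name) const {
    std::lock_guard<std::mutex> lock(mutex_);
    return static_cast<std::size_t>(
        std::count(allObjectFiles_.begin(), allObjectFiles_.end(), name));
  }

private:
  mutable std::mutex mutex_;
  std::vector<std::string> allObjectFiles_;
  std::unordered_map<std::string, std::size_t> nameCounts_;
};
```

The index stays correct only because entries are never removed individually; if single-entry pruning were ever added, that path would also have to decrement the count.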
PARALLEL_MARK was undefined -- the Boehm config-header template default -- even though bdwgc's own CMake defaults it ON and GC_THREADS/THREAD_LOCAL_ALLOC are already enabled. As a result GC marking (GC_mark_from), the dominant runtime cost (~25-30% in profiles), ran single-threaded on multicore machines. clasp's heap marking is Boehm-standard conservative marking (ALL_INTERIOR_POINTERS=1 + GC_register_displacement for tagged pointers) with no custom mark procedures or typed descriptors, so parallel markers parallelize Boehm's own parallel-safe marking loop -- clasp contributes no per-thread marking logic. bdwgc auto-starts (#CPUs-1) marker threads in GC_thr_init. Measured ~2.9-3.1x faster full GC on a large live heap. The full regression suite is byte-identical across 3 runs (no non-determinism). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On Darwin every thread_local access compiles to a _tlv_get_addr thunk call (there is no ELF initial-exec model; tls_model is a no-op). The bytecode interpreter hit my_thread on every Lisp call (maybe_step_call's breakstep check) and in several opcodes. Resolve my_thread once per VM frame (bytecode_vm and long_dispatch) into a local and pass it to maybe_step_call, removing the per-call thunk. The thread does not change during a VM frame. Correct (regression suite identical) and zero-downside (a no-op load off Darwin). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
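The hoisting described in this commit can be sketched with a toy interpreter loop: the thread-local state is resolved once at frame entry and passed by reference to the per-call check. All names below (`ThreadStateSketch`, `myThread`, `maybeStepCall`, `runFrame`, `Op`) are hypothetical and only mirror the shape of the change, not clasp's bytecode VM.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-thread interpreter state.
struct ThreadStateSketch {
  bool breakstep = false;
  std::int64_t stepChecks = 0;
};

// Each call is a thread_local access; on Darwin that is a thunk call.
inline ThreadStateSketch& myThread() {
  thread_local ThreadStateSketch state;
  return state;
}

// Receives the already-resolved state instead of looking it up again.
inline void maybeStepCall(ThreadStateSketch& thread) {
  ++thread.stepChecks;
  if (thread.breakstep) {
    // A real VM would hand control to the stepper here.
  }
}

enum class Op : std::uint8_t { Push, Call, Return };

// Executes one frame and returns how many calls it made.
inline std::int64_t runFrame(const std::vector<Op>& code) {
  // Single lookup per frame; the executing thread cannot change mid-frame.
  ThreadStateSketch& thread = myThread();
  std::int64_t calls = 0;
  for (Op op : code) {
    switch (op) {
      case Op::Push:
        break;
      case Op::Call:
        maybeStepCall(thread);  // no thread_local access on this path
        ++calls;
        break;
      case Op::Return:
        return calls;
    }
  }
  return calls;
}
```

The behaviour is the same as calling `myThread()` inside `maybeStepCall`; only the number of thread-local lookups per frame changes.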
setf_jit_lookup_t stores a Lisp function pointer into a variable that lives in a JIT'd dylib (the callback-lisp-function-N global created by make-callback). On Apple Silicon that memory is MAP_JIT and execute-protected for the thread under W^X, so the plain store faulted with SIGBUS -- compiling any %defcallback form crashed (the defcallback-native regression test). It is the only JIT-memory write unique to callbacks, which is why ordinary compilation was unaffected. Wrap the store with the existing JITDataReadWriteMaybeExecute()/JITDataReadExecute() helpers used at the other JIT-literal write sites; the window contains only the pointer store (no allocation/GC/JIT). No-op off Apple Silicon. defcallback now works (CFFI-DEFCALLBACK passes; the Lisp callback drives C qsort correctly). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
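For readers unfamiliar with the W^X toggle, here is a small sketch of a scoped write window around a single pointer store. It assumes macOS on Apple Silicon exposes `pthread_jit_write_protect_np` from `<pthread.h>`; whether clasp's `JITDataReadWriteMaybeExecute()`/`JITDataReadExecute()` helpers are built on that call is an assumption, and `JitWriteWindowSketch`, `LispEntryPoint`, and `storeCallbackTarget` are hypothetical names. On other platforms the guard compiles to nothing.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__APPLE__)
#include <TargetConditionals.h>
#endif
#if defined(__APPLE__) && defined(__aarch64__) && TARGET_OS_OSX
#include <pthread.h>
#define SKETCH_HAS_JIT_WRITE_PROTECT 1
#else
#define SKETCH_HAS_JIT_WRITE_PROTECT 0
#endif

// Scoped window during which the current thread may write MAP_JIT memory.
// Assumption: windows are not nested; the destructor always restores
// execute mode for the thread.
class JitWriteWindowSketch {
public:
  JitWriteWindowSketch() { setWritable(true); }
  ~JitWriteWindowSketch() { setWritable(false); }
  JitWriteWindowSketch(const JitWriteWindowSketch&) = delete;
  JitWriteWindowSketch& operator=(const JitWriteWindowSketch&) = delete;

private:
  static void setWritable(bool writable) {
#if SKETCH_HAS_JIT_WRITE_PROTECT
    // 0 makes MAP_JIT pages writable for this thread, 1 makes them executable.
    pthread_jit_write_protect_np(writable ? 0 : 1);
#else
    (void)writable;  // no-op off Apple Silicon
#endif
  }
};

// Hypothetical signature for the stored entry point.
using LispEntryPoint = std::intptr_t (*)(std::intptr_t);

// Keeps the window as small as possible: only the store happens inside it,
// with no allocation or other calls.
inline void storeCallbackTarget(LispEntryPoint* slot, LispEntryPoint target) {
  assert(slot != nullptr);
  JitWriteWindowSketch window;
  *slot = target;
}
```

Keeping the guarded region to a single store mirrors the commit's note that the window contains no allocation, GC, or JIT activity, since code that runs while the thread is in write mode cannot execute JIT'd pages.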
Summary
A set of independent runtime performance improvements (each its own commit), plus one Apple-Silicon correctness fix. All changes are portable — the platform-specific bits compile to no-ops off Apple Silicon.
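- Default *print-pretty* to NIL. clasp's pretty printer is a CLOS Gray stream, and with the old default of T ordinary princ/prin1/format ~A/print pay that cost (~370x slower on a 50-element list). REPL output is unchanged (tpl-print rebinds it); pprint/~<~:>/explicit bindings unaffected.
- Allocation-profiler counters become plain int64_t. They live in the THREAD_LOCAL ThreadLocalStateLowLevel and are never accessed cross-thread, so the std::atomics were pure overhead on every allocation (also fixes 3 uninitialized fields).
- Override StringOutputStream_O::write_string. Replaces the per-character boxing and vectorPushExtend with a single grow + typed bulk copy + scan-once cursor; safe fallback for narrowing. Measured 1.8x (14 chars) to 62x (2000 chars) vs the per-char path.
- countObjectFileNames O(N²)→O(1). It rescanned _AllObjectFiles per JIT-module registration; replaced with a mutex-guarded name→count index synced at the single add point + the bulk-clear sites.
- Enable PARALLEL_MARK. It was undefined even though GC_THREADS/THREAD_LOCAL_ALLOC are enabled; clasp marks conservatively with no custom mark procs, so parallel markers are safe. Measured ~2.9-3.1x faster full GC on a large live heap.
- Resolve my_thread once per VM frame. On Darwin every thread_local access is a _tlv_get_addr thunk; the interpreter hit my_thread per call.
- Apple Silicon W^X fix. setf_jit_lookup_t stored a function pointer into JIT'd (MAP_JIT) callback memory without enabling writes, so %defcallback crashed at compile time. Wrapped with the existing JITDataReadWriteMaybeExecute()/JITDataReadExecute() helpers; defcallback now works (was a hard SIGBUS).

Testing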
Built and tested on macOS Apple Silicon (LLVM 22), boehmprecise variant:
- [...] set-unexpected-failures.lisp (sbcl-cross-compile, include-level, types-classes).
- [...] defcallback-native test into a pass (CFFI-DEFCALLBACK: the Lisp callback drives C qsort correctly).

Commits are independent and can be reviewed/cherry-picked separately. The only user-visible behavior change is the *print-pretty* default (deliberate, matching SBCL/CCL/CLISP).

🤖 Generated with Claude Code