Enable SMP support in Gerbil#1416

Open
ober wants to merge 17 commits into mighty-gerbils:master from ober:master

Conversation


@ober ober commented Mar 4, 2026

Summary

This PR enables Gerbil to build and run with Gambit's SMP (Symmetric Multi-Processing) mode, where multiple VM processors (OS threads) execute Scheme threads concurrently with stop-the-world garbage collection.

Gambit has had --enable-smp as a configure option for a while, but actually using it with a real workload like building Gerbil's 1184-module standard library exposed a number of race conditions at both the Gambit runtime level and the Gerbil runtime level. This PR contains the Gerbil-side fixes. The corresponding Gambit fixes are in a separate PR against the Gambit repository.

With all fixes applied, Gerbil's full build pipeline (./build.sh — gambit, stage0, stage1, stdlib, libgerbil, lang, r7rs-large, srfi, tools) runs reliably with the default SMP processor count. Single-threaded programs like gsc see zero overhead from the SMP build.

What changed and why

1. Enable SMP in the Gambit configure flags (configure)

--enable-multiple-threaded-vms --enable-smp

This is the top-level change that turns on SMP. Everything else in this PR exists because of what broke when we flipped this switch.

The Gambit submodule is also pointed to a fork (ober/gambit) that contains the necessary Gambit runtime fixes (GC barrier races, hash table spinlocks, move_continuation stack corruption, deferred processor startup, etc.).

2. Hash table locking (src/gerbil/runtime/table.ss)

This is the largest and most important change on the Gerbil side.

The problem: Gerbil's hash tables (raw-table and gc-table) are used pervasively throughout the runtime and compiler — the module registry, symbol tables, expander state, compilation caches, etc. In single-threaded mode, Gambit's scheduler guarantees that only one thread runs at a time, and (declare (not interrupts-enabled)) prevents thread switches during critical sections. In SMP mode, multiple processors run truly concurrently, so these tables get hammered from multiple OS threads simultaneously with no synchronization. The result is corrupted probe sequences, lost entries, and crashes.

The fix — raw tables: A global spinlock (__raw-table-lock) now protects all raw-table operations: ref, set!, update!, delete!, for-each, copy, and clear!. I used a single global lock rather than per-table locks because:

  • Raw tables are small and operations are fast (microseconds), so contention is low in practice.
  • Gerbil's raw tables don't have a convenient place to store a per-table lock without changing the struct layout, which would break the bootstrap.
  • A global lock is simple and correct. We can optimize to per-table locks later if profiling shows contention.

For raw-table-for-each, the lock is held only long enough to snapshot the table vector, then released before calling the user's callback. This prevents deadlocks when the callback itself accesses other tables.

The same locking pattern is applied to the specialized table variants (eq-int-table, etc.) that have their own optimized ref/set/update/del methods.
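As a rough sketch of this pattern (not the PR's exact code — the lock primitives `__lock!`/`__unlock!` and the accessor/helper names are hypothetical stand-ins):

```scheme
;; Illustrative sketch of the global-lock pattern described above.
(def (raw-table-ref tab key default)
  (__lock! __raw-table-lock)
  (let (val (%unlocked-raw-table-ref tab key default)) ; hypothetical helper
    (__unlock! __raw-table-lock)
    val))

;; for-each: snapshot the entries vector under the lock, then iterate
;; with the lock released, so a callback that touches other tables
;; cannot deadlock against this lock.
(def (raw-table-for-each tab proc)
  (__lock! __raw-table-lock)
  (let (snapshot (vector-copy (&raw-table-table tab))) ; hypothetical accessor
    (__unlock! __raw-table-lock)
    (vector-for-each-entry snapshot proc)))            ; hypothetical iterator
```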

The fix — gc-tables: GC-managed hash tables (gc-table) have a different problem. Gambit's GC can relocate the underlying gc-hash-table object, which means the gcht pointer you obtained before a mutation might be stale by the time the mutation returns. In single-threaded mode this didn't matter because GC couldn't fire between obtaining the pointer and using it (interrupts were disabled). In SMP mode, another processor can trigger GC at any moment.

The fix adds a verify-retry pattern to gc-table-set! and gc-table-delete!:

(let (gcht (__gc-table-e tab))
  (when (##gc-hash-table-set! gcht key value)
    (__gc-table-rehash! tab)
    (gc-table-set! tab key value))        ;; retry after rehash
  (unless (##eq? gcht (&gc-table-gcht tab))
    (gc-table-set! tab key value)))       ;; retry if GC moved the table

After the mutation, we check if gcht is still the current backing table. If another thread triggered GC and the table was relocated, we retry the operation against the new table.

__gc-table-rehash! is also protected by a spinlock (__gc-table-lock) to prevent concurrent rehashes from corrupting the table. The lazy-creation of the immediate sub-table (for non-heap-allocated keys like fixnums and characters) uses double-checked locking to avoid creating duplicates.
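The double-checked locking for the lazy immediate sub-table can be sketched as follows (the constructor and accessor names here are hypothetical, not the PR's exact identifiers):

```scheme
;; Double-checked locking: test before locking (fast path), then re-test
;; under the lock, so two racing threads cannot both create the sub-table.
(def (__gc-table-immediate! tab)
  (or (&gc-table-immediate tab)                     ; fast path, no lock
      (begin
        (__lock! __gc-table-lock)
        (let (imm (or (&gc-table-immediate tab)     ; re-check under the lock
                      (let (new (make-immediate-table)) ; hypothetical ctor
                        (&gc-table-immediate-set! tab new)
                        new)))
          (__unlock! __gc-table-lock)
          imm))))
```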

3. Remove (declare (not interrupts-enabled)) (multiple files)

Files: table.ss, control.ss, interface.ss, mop.ss

This is the change that's easiest to get wrong if you don't understand Gambit's SMP internals, so I want to explain it carefully.

In Gambit, the compiler inserts ___POLL() calls at function entry points. ___POLL() does two things: (1) checks if the current thread's quantum has expired, enabling cooperative thread scheduling, and (2) checks if another processor has requested a stop-the-world GC synchronization. When you write (declare (not interrupts-enabled)), the compiler suppresses these ___POLL() calls entirely.

In single-threaded mode, suppressing ___POLL() is a reasonable optimization — it prevents unnecessary thread switches in critical sections and provides a lightweight form of atomicity.

In SMP mode, suppressing ___POLL() is catastrophic. Here's why:

  1. Processor A allocates memory, triggers GC, enters the stop-the-world barrier.
  2. Processor A signals all other processors via INTR_SYNC_OP and waits.
  3. Processor B is running code compiled with (not interrupts-enabled) — there are no ___POLL() calls in this code path.
  4. Processor B never checks for the GC request, never enters the barrier.
  5. Deadlock. A waits for B to arrive at the barrier. B can never see the request.

This is especially bad when Processor B is spinning on a spinlock (like __raw-table-lock) waiting for Processor A — which just triggered GC. Classic deadlock: A waits for B (barrier), B waits for A (lock).

The fix is straightforward: remove (declare (not interrupts-enabled)) everywhere. The atomicity it was providing is now handled by explicit spinlocks, which spin in normal Scheme code that has ___POLL() calls, so processors can always respond to GC barrier requests.
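To make that concrete, here is a hedged sketch of why spinning in ordinary Scheme code is safe: each loop iteration is a function entry, and Gambit emits ___POLL() at function entry, so a spinning processor still observes GC sync requests (`__try-lock!` is a hypothetical CAS primitive, not the PR's exact code):

```scheme
;; Spin loop in ordinary (interrupt-enabled) code: every iteration of
;; `loop` is a function entry, hence a ___POLL() site, so a processor
;; stuck here can still enter the stop-the-world GC barrier.
(def (__spin-lock! lock)
  (let loop ()
    (unless (__try-lock! lock)   ; hypothetical CAS attempt
      (loop))))
```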

In each case, (declare (not safe)) was kept where it was already present — that's an orthogonal optimization that skips type checks, and doesn't affect GC barrier participation.

Specific locations:

  • table.ss: __eq-hash, __gc-table-e, gc-table-ref, gc-table-set!, gc-table-delete!, gc-table-for-each, __object->eq-hash. These were the most critical because hash table operations are hot paths during compilation.
  • control.ss: The dynamic-wind guards in with-lock and with-unwind-protect. These use ##vector-cas!, which is inherently atomic, so the (not interrupts-enabled) declaration was redundant here anyway.
  • interface.ss: The defcast macro's generated cast function. Interface dispatch involves type lookups that must not block GC.
  • mop.ss: type-of, class-of, and the type dispatch lambdas in the MOP initialization. These are called on every method dispatch, so blocking GC here was a frequent source of barrier timeouts.

4. Serialized compilation (src/gerbil/compiler/driver.ss)

The problem: Gerbil's parallel build system (make.ss) spawns multiple worker threads that compile modules concurrently. Each worker calls compile-top-module, which accesses shared mutable state: the expander's module registry, the optimizer's binding tables, and the compiler's code generation tables. In single-threaded Gambit, the cooperative scheduler plus strategic with-driver-mutex calls around optimize! was sufficient. In SMP mode, true concurrency means two workers can be inside compile-top-module simultaneously, corrupting shared state.

The fix: Wrap the entire compile-top-module body in with-driver-mutex, effectively serializing compilation. This might seem like it defeats the purpose of parallel builds, but it doesn't — the parallelism in Gerbil's build system is primarily about I/O overlap (one module compiles while another's output is being written to disk, while a third's dependencies are being resolved). The actual compilation of any single module must be serialized because the expander and compiler maintain global mutable state.

The with-driver-mutex macro was also made reentrant:

(defrules with-driver-mutex ()
  ((_ expr)
   (if (eq? (mutex-state +driver-mutex+) (current-thread))
     expr ;; already holding the mutex — reentrant call
     (with-lock +driver-mutex+ (lambda () expr)))))

This is needed because compile-top-module calls optimize!, which was previously wrapped in its own with-driver-mutex. Without reentrancy, this would deadlock — the thread already holds the mutex from compile-top-module and then tries to acquire it again inside optimize!. With the reentrant check, the nested call recognizes the current thread already owns the mutex and proceeds without locking.

+driver-mutex+ is also exported from the module so that make.ss can use it (see below).

5. Unified import/compile mutex (src/std/make.ss)

The problem: make.ss (the parallel build driver) had its own local import-mx mutex for protecting import-module calls, separate from the +driver-mutex+ used by the compiler. This creates a lock ordering problem: a worker thread might hold import-mx while calling import-module, which internally triggers compilation, which tries to acquire +driver-mutex+. Meanwhile another worker holds +driver-mutex+ and tries to import, trying to acquire import-mx. Classic ABBA deadlock.

More fundamentally, import-module mutates the same global __module-registry that compile-top-module reads and writes. Using different mutexes for the same shared state provides no protection at all.

The fix: Remove the local import-mx and use +driver-mutex+ (imported from driver.ss) for import serialization. Now imports and compilations share a single lock, eliminating both the lock ordering problem and the concurrent access to __module-registry.
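Assuming the export from driver.ss, the make.ss side reduces to something like this sketch (the module path and wrapper name are illustrative; the surrounding worker code is omitted):

```scheme
;; make.ss (sketch): import the compiler's mutex instead of keeping a
;; local import-mx, so imports and compilation contend on one lock.
(import (only-in :gerbil/compiler/driver +driver-mutex+ with-driver-mutex))

(def (serialized-import! ctx)              ; hypothetical wrapper name
  (with-driver-mutex (import-module ctx)))
```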

6. Build configuration (configure, build.sh)

  • configure: Added --enable-multiple-threaded-vms --enable-smp to the Gambit configure flags. Updated the Gambit submodule tag to point to the fork with SMP runtime fixes.
  • build.sh: Removed the ,p1 GAMBOPT setting that was temporarily limiting Gerbil builds to a single VM processor while fixes were being developed. No longer needed.
  • .gitmodules: Pointed to ober/gambit (the fork with SMP fixes) instead of gambit/gambit.

Testing

| Test | Before SMP fixes | After |
| --- | --- | --- |
| 4-thread arithmetic stress test (p2) | ~20% pass rate | 50/50 pass |
| 4-thread arithmetic stress test (p16) | ~0% pass rate | 50/50 pass |
| Gerbil stdlib build (p2) | ~60% pass rate | 20/20 pass |
| Full Gerbil build pipeline (p8) | crash | 5/5 pass |
| Full ./build.sh end-to-end | N/A (used ,p1 workaround) | pass |
| Single-threaded gsc performance | 5+ hours wall time (22 min CPU) | 22 minutes (zero SMP overhead) |

The single-threaded gsc result is worth calling out: with the deferred processor startup fix on the Gambit side, programs that don't create threads (like gsc, which dominates Gerbil's build time) run with zero SMP overhead. The extra VM processor threads are only spawned on the first thread-start! call.

Files changed

| File | What | Why |
| --- | --- | --- |
| configure | SMP configure flags + submodule tag | Enable SMP mode |
| .gitmodules | Point to ober/gambit fork | Contains Gambit runtime fixes |
| src/gerbil/runtime/table.ss | Spinlock locking for raw-table and gc-table, verify-retry for concurrent rehash, double-checked locking for immediate table creation | Hash tables are accessed concurrently by multiple VM processors |
| src/gerbil/runtime/control.ss | Remove (not interrupts-enabled) | Prevents GC barrier deadlock |
| src/gerbil/runtime/interface.ss | Remove (not interrupts-enabled) | Prevents GC barrier deadlock |
| src/gerbil/runtime/mop.ss | Remove (not interrupts-enabled) | Prevents GC barrier deadlock |
| src/gerbil/compiler/driver.ss | Serialize compile-top-module, reentrant mutex, export +driver-mutex+ | Compiler state is not thread-safe |
| src/std/make.ss | Use +driver-mutex+ for imports instead of local mutex | Shared state needs a shared lock |

Future work

  • Per-table locks for raw-table: The global __raw-table-lock works but could become a bottleneck under heavy concurrent access. If profiling shows contention, we could embed a lock in the table struct (would require a bootstrap cycle).
  • Lock-free reads: Read operations on raw-tables could potentially be made lock-free with careful ordering, since the underlying vector is immutable between rehashes.
  • Finer-grained compiler locking: Currently the entire compile-top-module is serialized. With some refactoring to isolate the mutable expander state, we could allow more overlap between compilation stages.

ober and others added 11 commits March 4, 2026 08:43
Add thread-safe locking to Gerbil's hash table implementation:
- Add __raw-table-lock spinlock protecting all read and write operations
  on raw tables (ref, set!, update!, delete!, for-each, copy, clear!)
- Use snapshot-under-lock for raw-table-for-each to prevent deadlock
  when callbacks access other tables (e.g. __specialize-method in MOP)
- Add __gc-table-lock spinlock for gc-table operations (rehash, immediate
  creation, clear!)
- Fix gc-table-set!/delete! to verify gcht is still current after
  ##gc-hash-table-set! (another thread may have resized concurrently)

Serialize module compilation to prevent concurrent expander access:
- Wrap compile-top-module in with-driver-mutex to serialize compilation
- Make with-driver-mutex reentrant (check if current thread holds mutex)
- Use +driver-mutex+ for import/mx in make.ss instead of separate mutex,
  ensuring imports and compilation share the same lock
- Export +driver-mutex+ from compiler/driver

Results: P1 SMP stdlib build fully passes (339 modules).
P2 improved from ~21 to ~200 modules before crash.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In Gambit SMP, (declare (not interrupts-enabled)) suppresses ___POLL()
generation at compile time. ___POLL is the ONLY mechanism by which a
processor detects GC synchronization requests from other processors.
Without it, the GC barrier waits indefinitely (no timeout) for the
processor to respond, causing deadlocks:

  1. Processor A holds __raw-table-lock, allocates during rehash,
     triggers GC, enters barrier waiting for Processor B
  2. Processor B is spinning on __raw-table-lock in (not interrupts-enabled)
     code with no ___POLL checks, cannot respond to GC sync
  3. Deadlock: A waits for B (barrier), B waits for A (spinlock)

The fix: remove (declare (not interrupts-enabled)) from all table
operation functions. The __lock-inline! macro already has its own
local (not interrupts-enabled) for the fast spin loop. Code between
lock/unlock now gets ___POLL checks, allowing the processor to
participate in GC sync while holding the lock. The GC properly
updates stack-referenced objects after moving them, so this is safe.

Results: P2 CORES=2 stdlib build now FULLY PASSES (316 modules).
Previously crashed at ~27 modules with SIGSEGV.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Gambit submodule: force lazy rehash (prevents non-eq? table corruption),
fix rehash self-deadlock in striped spinlock array.

Gerbil runtime: remove (declare (not interrupts-enabled)) from control.ss,
interface.ss, and mop.ss — these declarations prevent GC barrier
synchronization in SMP builds, causing deadlocks when other VM processors
need garbage collection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
With --enable-smp, the default processor count creates idle VM
processors that cause heartbeat storms and a GC race condition at
higher processor counts (SIGSEGV in compiled Scheme code accessing
stale heap pointers after GC). The crash scales with processor count:
p1 and p2 are reliable, p16 crashes consistently.

Force p1 during build since compilation is single-threaded anyway.
Runtime programs can set their own processor count via GAMBOPT.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update Gambit submodule with the fix for a bug in move_continuation()
that caused SIGSEGV with multiple VM processors. The bug wrote to the
wrong stack frame slot when marking the first break frame's msection as
NULL (not a stack overflow), causing ___stack_overflow_undo_if_possible
to read stale data and corrupt stack_break with a code label address.

With this fix, Gerbil builds and runs correctly with the default SMP
processor count, so remove the p1 GAMBOPT workaround from build.sh.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Match the installed Gerbil's Gambit version (v4.9.7-43-gbe71ec38).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
netlify bot commented Mar 4, 2026

👷 Deploy request for elastic-ritchie-8f47f9 pending review (latest commit 584d21e).


vyzo commented Mar 4, 2026

the problems are real, but the fixes are incorrect... global locks? WTF.

ober and others added 5 commits March 4, 2026 17:56
Picks up RC subsystem race fixes, GC mark_rc for all processors,
and atomic device refcount for multi-processor builds.
Picks up fca110bb which adds spinlock protection to the device group
doubly-linked list and CAS-based thread-safe foreign object release,
fixing use-after-free crashes during SMP shutdown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The spinlock macro in util.ss still had (declare (not interrupts-enabled))
after d18e633 removed it from mop.ss. This caused a GC barrier deadlock:
a thread spinning to acquire an inline lock could not respond to GC
synchronization requests, stalling the GC and preventing the lock holder
from ever releasing it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix device condvar broadcast deadlock: move btq-deq-remove to the
"no more threads" branch so stranded condvars are not lost on trylock
failure. Also fix execute-pending-compile-jobs! to use thread-send/
thread-receive instead of thread-join! to avoid SMP condvar race.

Changes:
- src/gambit: update submodule to 78105346
- configure: update default_gambit_tag to 78105346
- src/gerbil/compiler/base.ss: use thread-send/receive in execute-pending-compile-jobs!
- src/bootstrap/gerbil/compiler/base~0.scm: matching compiled update
- src/build.sh: add m8G to GAMBOPT for SMP builds
- src/build/build{0,1}.ss: add explicit exit to avoid SMP shutdown hang
- src/build/build-bach.ss, std, lang, srfi, r7rs-large, tools: same

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…removal

Compiled bootstrap counterparts for the __lock-inline! fix:
- src/bootstrap/gerbil/runtime/interface~0.scm
- src/bootstrap/gerbil/runtime/mop~0.scm
- src/bootstrap/gerbil/runtime/util~1.scm

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>