
[Critical] Memory leak causes kernel soft lockups (356s, 7/8 CPUs), OOM kills (111GB virt/21GB RSS), and total system death on Linux #13230

@roymecat

Description

Summary

OpenCode's unbounded memory growth causes catastrophic, unrecoverable system failures on a Linux VM: a single opencode process balloons to 111 GB virtual / 21 GB RSS (on a 20 GB RAM machine), triggering OOM kills, kernel soft lockups across 7 of 8 CPUs simultaneously for up to 356 seconds, and RCU subsystem starvation — rendering the entire system completely dead. No SSH, no console input, no recovery without hard power-off.

This is not a gradual degradation. It is a total system kill that escalates with each restart cycle — crash intervals accelerated from 52 hours → 8 hours → 2.5 hours over 4 days.

Environment

| Component | Detail |
| --- | --- |
| OS | Debian 13 (trixie) |
| Kernel | 6.12.63+deb13-amd64, PREEMPT_DYNAMIC (voluntary) |
| CPU | AMD Ryzen 7 5800X 8-Core (8 vCPUs allocated) |
| RAM | 20 GB |
| Swap | 10.2 GB partition (/dev/sda5) |
| Hypervisor | VirtualBox 7.x (KVM paravirt, kvm-clock) |
| OpenCode | v1.1.56 (Go binary at /root/.local/bin/opencode) |

Crash Timeline (Feb 8–12, 2026)

Over 4 days, the system crashed 4 times with decreasing intervals between failures:

| Boot | Time Range | Survived | Failure Mode |
| --- | --- | --- | --- |
| -5 | Feb 8 12:39–12:42 | 3 min | Immediate crash (suspected) |
| -4 | Feb 8 12:44 – Feb 9 11:30 | 23 hours | Unknown |
| -3 | Feb 9 11:57 – Feb 11 15:55 | 52 hours | 2× OOM kill — opencode at 111 GB virt / 21 GB RSS |
| -2 | Feb 11 16:06 – Feb 12 00:08 | 8 hours | 7/8 CPUs soft-locked 356 s, RCU starvation, kernel panic-level |
| -1 | Feb 11 23:53 – Feb 12 02:20 | 2.5 hours | CPU#4 soft lockup cascade (21 s → 140 s → unrecoverable) |
| 0 | Feb 12 02:20 – present | Running | Already showing 74.7 GB virtual after 10 min |

The crash interval is accelerating: 52h → 8h → 2.5h.


Detailed Kernel Evidence

Event 1: OOM Kill #1 — Single process at 111 GB virtual memory (Feb 10, 00:29)

The OOM killer was invoked by opencode itself (PID 168718), and killed a sibling opencode process (PID 146787) that had consumed more memory than the entire physical RAM:

opencode invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
CPU: 6 UID: 0 PID: 168718 Comm: opencode Not tainted 6.12.63+deb13-amd64 #1  Debian 6.12.63-1

Process table at time of OOM (5 opencode instances running):

[  PID  ] uid  PID   total_vm      rss    rss_anon rss_file rss_shmem pgtables_bytes swapents  oom_score_adj name
[   3524]   0  3524  18643893    34165    34022       24       119  1433600    52448             0 opencode
[   4768]   0  4768  18671866    17406    17215       90       101  1495040    55104             0 opencode
[  18118]   0 18118  18613474    11315    11128      125        62  1421312    59584             0 opencode
[ 146787]   0 146787 29098165  5317087  5316838        0       249 60379136   914656             0 opencode  ← KILLED
[ 168718]   0 168718 18598956    50332    50115       70       147  1478656    25952             0 opencode

Kill verdict:

oom-kill:constraint=CONSTRAINT_NONE,...task=opencode,pid=146787,uid=0
Out of memory: Killed process 146787 (opencode) total-vm:116392660kB, anon-rss:21267352kB, file-rss:0kB, shmem-rss:996kB, UID:0 pgtables:58964kB oom_score_adj:0

Key numbers for PID 146787:

  • Virtual memory: 116,392,660 KB (111 GB) — 5.5× physical RAM
  • Resident (RSS): 21,267,352 KB (20.3 GB) — exceeds total 20 GB physical RAM
  • Page tables: 58,964 KB (57 MB) — page table overhead alone is enormous
  • Swap entries: 914,656 pages (~3.5 GB in swap)

The 4 "normal" opencode instances each consumed ~75 GB virtual memory. Even without the monster process, the baseline is absurd.


Event 2: OOM Kill #2 — 13 concurrent opencode processes (Feb 11, 12:18)

36 hours later, the same pattern repeated but with 13 opencode processes alive simultaneously:

[   4768]   0  4768  18706683     9129     9129        0         0  1531904    63464             0 opencode
[  18118]   0 18118  18681059    14537    14475        0        62  1458176    55300             0 opencode
[ 210316]   0 210316 18672792    25590    25577        0        13  1413120    43360             0 opencode
[ 211759]   0 211759 18622517     3438     3408        9        21  1388544    72224             0 opencode
[ 223030]   0 223030 18621687     2125     2125        0         0  1433600    70720             0 opencode
[ 223761]   0 223761 18615358     2442     2407       35         0  1478656    69728             0 opencode
[ 256649]   0 256649 18663263     9088     9088        0         0  1241088    53467             0 opencode
[ 331552]   0 331552 26126040  5513885  5513829        0        56 61468672  1612240             0 opencode  ← KILLED
[ 336536]   0 336536 18596828    13919    13918        0         1  1314816    64754             0 opencode
[ 337847]   0 337847 18663382    12878    12864       14         0  1155072    54764             0 opencode
[ 337960]   0 337960 18605576     7985     7976        0         9  1445888    65344             0 opencode
[ 363065]   0 363065 18688901    16147    16147        0         0  1302528    61720             0 opencode
[ 696953]   0 696953 18568868    46722    46722        0         0   909312     1952             0 opencode

Kill verdict:

Out of memory: Killed process 331552 (opencode) total-vm:104504160kB, anon-rss:22055316kB, file-rss:0kB, shmem-rss:224kB, UID:0 pgtables:60028kB oom_score_adj:0

Key numbers for PID 331552:

  • Virtual memory: 104,504,160 KB (99.7 GB)
  • Resident (RSS): 22,055,316 KB (21 GB) — again exceeding total physical RAM
  • Swap entries: 1,612,240 pages (~6.1 GB in swap)

Also present in the OOM dump: multiple chrome-headless-shell, bun, node (MainThread), python3, and git processes — all spawned by or related to opencode's tool ecosystem.

systemd-journald was forced to flush its caches due to memory pressure.


Event 3: Catastrophic multi-CPU soft lockup — 7/8 CPUs dead (Feb 11, 23:52)

This is the most severe event. 7 of 8 CPUs locked simultaneously, followed by escalation to 356-second lockups with opencode explicitly named as the offending process:

Phase 1 — Mass lockup (23:52:34):

watchdog: BUG: soft lockup - CPU#3 stuck for 43s! [systemd:1]
watchdog: BUG: soft lockup - CPU#5 stuck for 43s! [systemd-logind:700]
watchdog: BUG: soft lockup - CPU#2 stuck for 43s! [swapper/2:0]
watchdog: BUG: soft lockup - CPU#4 stuck for 43s! [swapper/4:0]
watchdog: BUG: soft lockup - CPU#7 stuck for 43s! [swapper/7:0]
watchdog: BUG: soft lockup - CPU#6 stuck for 43s! [Worker:103451]
watchdog: BUG: soft lockup - CPU#1 stuck for 43s! [systemd-journal:349]
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

Phase 2 — Escalation (00:00:25), 8 minutes later:

watchdog: BUG: soft lockup - CPU#7 stuck for 356s! [opencode:102278]
CPU: 7 UID: 0 PID: 102278 Comm: opencode Tainted: G             L     6.12.63+deb13-amd64 #1  Debian 6.12.63-1
watchdog: BUG: soft lockup - CPU#2 stuck for 356s! [swapper/2:0]
watchdog: BUG: soft lockup - CPU#3 stuck for 356s! [swapper/3:0]
watchdog: BUG: soft lockup - CPU#4 stuck for 27s! [kworker/4:0+events]
rcu: rcu_preempt kthread starved for 95731 jiffies!
rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:

What happened:

  • CPU#7 was held hostage by opencode PID 102278 for 356 seconds (nearly 6 minutes) without yielding
  • The kernel's RCU (Read-Copy-Update) subsystem was starved for 95,731 jiffies (~958 seconds / 16 minutes) — this means the kernel could not perform basic memory reclamation, slab cache cleanup, or deferred freeing for over 16 minutes
  • The kernel explicitly warned: "Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior"
  • Multiple CPUs were stuck in swapper (kernel idle process), meaning they couldn't even return to idle state — a sign of deep kernel-level deadlock caused by memory pressure
  • The Tainted: G L flag confirms the kernel was tainted by the soft lockup (L = SOFTLOCKUP)

Event 4: CPU#4 soft lockup cascade via tmux (Feb 12, 01:35)

The final crash before the current boot. The tmux server process (opencode's host) triggered the initial lockup:

watchdog: BUG: soft lockup - CPU#4 stuck for 21s! [tmux: server:1876]
watchdog: BUG: soft lockup - CPU#4 stuck for 140s! [swapper/4:0]
watchdog: BUG: soft lockup - CPU#4 stuck for 55s! [swapper/4:0]
watchdog: BUG: soft lockup - CPU#4 stuck for 62s! [swapper/4:0]   # 42 minutes later, still stuck

Preceding clocksource warnings (showing progressive CPU starvation):

clocksource: Long readout interval, skipping watchdog check: ... interval=2388442930ns (2.3s)
clocksource: Long readout interval, skipping watchdog check: ... interval=10133624930ns (10.1s)
clocksource: Long readout interval, skipping watchdog check: ... interval=20316688930ns (20.3s)

The clocksource readout intervals escalated from 2.3s → 10.1s → 20.3s before the lockup hit, showing the system was progressively starving for CPU time.


Current Session — The Bomb is Already Ticking

Just 10 minutes after a fresh boot, the current opencode process already shows alarming numbers:

PID 1812: VSZ 74,771,716 kB (74.7 GB virtual), RSS 588,896 kB (589 MB)

74.7 GB virtual memory after 10 minutes of runtime. Based on the observed pattern, RSS will grow unboundedly until it exceeds physical RAM, triggering the same cascade.


Cascading Failure Mechanism

The progression follows a textbook cascading failure pattern:

opencode memory leak (unbounded growth)
    → RSS exceeds physical RAM
        → Kernel starts heavy swapping (10.2 GB swap fills)
            → Swap I/O saturates disk, all processes stall waiting for pages
                → CPU cores stuck in page fault handlers / swap writeback
                    → Kernel watchdog detects soft lockup (no scheduling for 10s+)
                        → RCU grace-period kthread starved (can't run GC)
                            → Memory reclamation impossible
                                → More OOM pressure, more swapping, more lockups
                                    → Total system death (no SSH, no console, no recovery)

The VirtualBox hypervisor layer adds an additional timing distortion — the virtual clock drifts when CPUs are overloaded (evidenced by clocksource: Long readout interval warnings), which makes the watchdog trigger more aggressively and the system less able to self-recover.

Root Cause Analysis

The memory growth pattern is consistent with the leaks identified in PR #10913:

  1. AsyncQueue never terminates (util/queue.ts) — [Symbol.asyncIterator]() loops forever via while (true), preventing GC of completed task objects and their closures
  2. Bash tool unbounded string concatenation (tool/bash.ts:167-189) — command output is accumulated without any size cap; long-running or verbose commands cause unbounded string growth
  3. LSP diagnostics Map never cleared (lsp/client.ts:51) — diagnostic entries accumulate indefinitely across file changes; the Map only grows, never shrinks
  4. Bus subscription leaks — event subscriptions are created but never unsubscribed, holding references to closures and their captured scope
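As a contrast to the unbounded concatenation described in item 2, a size-capped output accumulator can be sketched in a few lines. This is illustrative only — the class and constant names are mine, not OpenCode's actual tool/bash.ts code, and `length` is used as an approximate byte count for ASCII-dominated output:

```typescript
const MAX_OUTPUT_BYTES = 10 * 1024 * 1024; // e.g. a 10 MB cap

class CappedOutputBuffer {
  private chunks: string[] = []; // join once at the end instead of repeated +=
  private size = 0;
  truncated = false;

  append(chunk: string): void {
    if (this.truncated) return; // past the cap, drop further output
    const remaining = MAX_OUTPUT_BYTES - this.size;
    if (chunk.length > remaining) {
      this.chunks.push(chunk.slice(0, remaining));
      this.size = MAX_OUTPUT_BYTES;
      this.truncated = true;
    } else {
      this.chunks.push(chunk);
      this.size += chunk.length;
    }
  }

  toString(): string {
    const out = this.chunks.join("");
    return this.truncated ? out + "\n[output truncated]" : out;
  }
}
```

Collecting chunks and joining once also avoids the quadratic cost of repeated string concatenation on verbose commands.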

Additional contributing factors observed in this environment:

  • Multiple concurrent opencode processes: Up to 13 instances observed simultaneously. Each "idle" instance consumes ~75 GB virtual memory. The process spawning appears unbounded.
  • Child process accumulation: OOM dumps show chrome-headless-shell, bun, node (MainThread), python3, and git processes — all spawned by opencode's tool/MCP ecosystem and apparently not cleaned up.
  • Kernel preemption model: The kernel runs PREEMPT_DYNAMIC in voluntary mode, meaning it won't forcibly preempt a CPU-bound userspace process. A runaway opencode process can monopolize a CPU core indefinitely.

Impact

  • Complete system unresponsiveness — no SSH, no console input, no Ctrl+C, no SysRq
  • Requires hard power-off to recover (VirtualBox "Power Off" or host kill)
  • Data loss / corruption — unclean shutdowns corrupt systemd journal (systemd-journald: File /var/log/journal/.../system.journal corrupted or uncleanly shut down, renaming and replacing)
  • Crash frequency accelerating — 52h → 8h → 2.5h between failures, suggesting the leak rate increases with accumulated state
  • Host system impact — VirtualBox VM lockup can degrade host system responsiveness

Reproduction Steps

  1. Run opencode on a Linux system with ≤ 20 GB RAM (VM or bare metal)
  2. Use it actively with multiple tool calls, background agents, and LSP active
  3. Monitor memory growth:
    watch -n5 'ps -o pid,vsz,rss,comm -C opencode; echo "---"; free -h'
  4. Observe:
    • Virtual memory climbs past 75 GB within minutes of startup
    • RSS grows steadily without bound during active use
    • After 2–52 hours (depending on workload intensity), RSS exceeds physical RAM
    • System becomes completely unresponsive shortly after

Related Issues

| Issue | Title | Status |
| --- | --- | --- |
| #9743 | Memory Leak: OOM Killer During Extended Runtime | 🔴 Open |
| #3013 | Uses a huge amount of memory | 🔴 Open (6 👍, multiple duplicates) |
| #5700 | Too high memory usage | 🔴 Open (dupes: #5363, #3995, #4315) |
| #6172 | High CPU (100%+) during LLM streaming in long sessions | 🔴 Open |
| #4804 | High CPU usage (increases even when idle) | 🟢 Closed |
| #10913 | fix: multiple memory leaks in long-running sessions | 🔴 Open PR |

This issue provides the most detailed kernel-level evidence of the downstream consequences of these memory leaks, including the exact cascading failure mechanism from memory leak → OOM → soft lockup → RCU starvation → total system death.

Suggested Fixes

Immediate (stop the bleeding):

  1. Merge PR #10913 ("fix: multiple memory leaks in long-running sessions"), which addresses the 4 confirmed leak sources
  2. Add a self-imposed RSS limit — monitor own RSS via /proc/self/statm and trigger graceful session compaction or restart when approaching a threshold (e.g., 4 GB)
  3. Cap bash tool output buffer — truncate accumulated output after a configurable limit (e.g., 10 MB)
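The self-imposed RSS limit (item 2 above) could look roughly like this on Linux. The threshold, function names, and callback are assumptions for illustration, not an existing OpenCode API, and the 4096-byte page size is the common x86-64 default:

```typescript
import { readFileSync } from "node:fs";

// Sketch: watch our own RSS via /proc/self/statm (Linux-only; values are in
// pages) and react before the kernel OOM killer does.
const PAGE_SIZE = 4096;                // assumed x86-64 page size
const RSS_LIMIT_BYTES = 4 * 1024 ** 3; // 4 GB soft limit, per suggestion 2

function currentRssBytes(): number {
  // /proc/self/statm fields: size resident shared text lib data dt
  const fields = readFileSync("/proc/self/statm", "utf8").trim().split(/\s+/);
  return Number(fields[1]) * PAGE_SIZE; // field 2 is resident pages
}

function checkMemoryPressure(onPressure: () => void): void {
  // Callers would run this on a timer and trigger graceful session
  // compaction or a restart when the limit is exceeded.
  if (currentRssBytes() > RSS_LIMIT_BYTES) onPressure();
}
```

Polling /proc/self/statm is cheap enough to run every few seconds, so the check adds negligible overhead to long sessions.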

Structural:

  1. Bound concurrent child processes — 13 simultaneous opencode instances is excessive; implement a process pool with a hard cap
  2. Implement LSP diagnostics eviction — LRU or TTL-based eviction for the diagnostics Map
  3. Add periodic forced GC — call global.gc() (requires --expose-gc) at regular intervals during long sessions
  4. Clean up child processes on session end — ensure chrome-headless-shell, bun, node, git subprocesses are terminated when their parent session ends
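The eviction idea in item 2 can be sketched as a small LRU-bounded Map, relying on the fact that JavaScript Maps iterate in insertion order. Names here are hypothetical; the real structures in lsp/client.ts may differ:

```typescript
// Illustrative LRU-bounded store for per-file diagnostics. Re-inserting a key
// on every set() keeps the Map's insertion order equal to recency order, so
// the first key is always the least recently updated one.
type Entry = { diagnostics: unknown[]; touched: number };

class DiagnosticsStore {
  private map = new Map<string, Entry>();
  constructor(private maxEntries = 500) {}

  set(uri: string, diagnostics: unknown[]): void {
    this.map.delete(uri); // refresh recency by re-inserting at the end
    this.map.set(uri, { diagnostics, touched: Date.now() });
    while (this.map.size > this.maxEntries) {
      // Evict the least recently updated entry (first in iteration order)
      this.map.delete(this.map.keys().next().value!);
    }
  }

  get(uri: string): unknown[] | undefined {
    return this.map.get(uri)?.diagnostics;
  }

  get size(): number {
    return this.map.size;
  }
}
```

A TTL sweep on the `touched` timestamp could complement the size cap, but a hard entry bound alone already prevents the Map from growing without limit across file changes.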

Defensive:

  1. Ship with a recommended systemd slice config — provide users a resource-limiting unit file that prevents opencode from killing the host system
  2. Add memory usage telemetry — log RSS at regular intervals so users and developers can identify leak patterns before they become catastrophic
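As a starting point for the systemd slice in item 1, a resource-limiting unit might look like the following. This is an untested sketch — the path is hypothetical and the limits assume the 20 GB / 8 vCPU VM from this report, so they must be tuned per machine:

```ini
# Hypothetical /etc/systemd/system/opencode.slice
[Slice]
MemoryMax=8G     # hard cap: the kernel kills tasks in the slice, not the host
MemoryHigh=6G    # start throttling/reclaiming well before the hard cap
TasksMax=512     # bounds runaway child-process spawning
CPUQuota=600%    # leave ~2 of 8 vCPUs free so SSH/console stay responsive
```

A session could then be launched inside the slice with `systemd-run --scope --slice=opencode.slice opencode`, confining opencode and all of its child processes (chrome-headless-shell, bun, node, git) to the same cgroup limits.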
