fix(cli): serialize concurrent hivemind update with O_EXCL lock #187
`SessionStart` hooks dispatch `hivemind update` detached on every session start, twice per session (once from `session-start.ts`, once from `session-start-setup.ts`; both by design, per the comment in `src/hooks/shared/autoupdate.ts`). When multiple Claude Code sessions start in the same second, 2-N concurrent `npm install -g @deeplake/hivemind@latest` invocations race on npm's reify step: each one tries to rename the existing install to the SAME deterministic backup path (`.hivemind-<hash>`), all but one fail with `ENOTEMPTY`, and the winner can still be SIGKILLed mid-extract. The result on disk: `node_modules/` is present but `package.json` and `bundle/` are missing, so the bin symlink dangles and the shell reports `hivemind: command not found`. The race also leaves an orphan `.hivemind-<hash>` backup that blocks every subsequent install.

Observed in production on 2026-05-19 (3 concurrent installs at 17:39:21); reproduced again on 2026-05-20 after the 0.7.37 publish triggered a fresh autoupdate race.

The fix is exactly what `src/hooks/shared/autoupdate.ts:32-35` and `:53` already promised as a follow-up: a non-blocking `O_EXCL` pidfile lock at `~/.deeplake/hivemind-update.lock`, owned by `runUpdate()` for the lifetime of the `npm install -g` + `hivemind install --skip-auth` sequence.

- Late arrivals log `another hivemind update is already running (pid=N); skipping` and exit 0.
- A stale holder is reclaimed after a liveness check via `process.kill(pid, 0)`.
- The lock is released on every exit path (success, npm-fail, agent-refresh-fail, throw).
- The lock is NOT acquired on no-op exit paths (up-to-date, registry-fail, npx, local-dev, unknown, dry-run), so a misbehaving caller can't block real updaters.

Deliberately NOT changed:

- Number of `autoUpdate()` dispatch sites: `autoupdate.ts:32-53` documents the double-fire as intentional; concurrency is the lock's job.
- No recency cache: `autoupdate.ts:37-54` documents this as explicitly rejected on review (worst case: "publish 13:01, users started 13:00 don't pick up until 17:00 — unacceptable").
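For readers unfamiliar with the pattern, here is a minimal sketch of a non-blocking `O_EXCL` pidfile lock with stale-holder reclaim, written against Node's `fs` module. It is not the PR's implementation: `tryAcquireUpdateLock`, `releaseUpdateLock`, `readHolderPid`, and `defaultLockPath` are hypothetical names, and only the lock path and the `process.kill(pid, 0)` liveness probe come from the description above. Node's `"wx"` open flag maps to `O_CREAT | O_EXCL`, so exactly one process can create the file.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Outcome of a single non-blocking acquisition attempt.
type LockResult =
  | { acquired: true }
  | { acquired: false; holderPid: number };

// Lock path from the PR description; the helper itself is hypothetical.
function defaultLockPath(): string {
  return path.join(os.homedir(), ".deeplake", "hivemind-update.lock");
}

// Signal 0 runs the kernel's existence/permission check without delivering a signal.
// ESRCH means no such process; EPERM means it exists but belongs to another user.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

// PID recorded in the lockfile, NaN if empty or garbled, null if the file is gone.
function readHolderPid(lockPath: string): number | null {
  try {
    return Number.parseInt(fs.readFileSync(lockPath, "utf8").trim(), 10);
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "ENOENT") return null;
    throw err;
  }
}

function tryAcquireUpdateLock(lockPath: string): LockResult {
  fs.mkdirSync(path.dirname(lockPath), { recursive: true });
  // Initial create, plus at most one retry after a stale or vanished lock.
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      const fd = fs.openSync(lockPath, "wx"); // O_WRONLY | O_CREAT | O_EXCL
      try {
        fs.writeSync(fd, String(process.pid));
      } finally {
        fs.closeSync(fd);
      }
      return { acquired: true };
    } catch (err) {
      if ((err as NodeJS.ErrnoException).code !== "EEXIST") throw err;
    }
    const holderPid = readHolderPid(lockPath);
    if (holderPid === null) continue; // holder released between our open and read
    if (Number.isNaN(holderPid) || (holderPid > 0 && isPidAlive(holderPid))) {
      // Live holder, or a file another process created but has not written yet.
      return { acquired: false, holderPid };
    }
    fs.rmSync(lockPath, { force: true }); // dead holder: reclaim, then retry the create
  }
  return { acquired: false, holderPid: NaN };
}

// Delete the lockfile only if it still records our PID.
function releaseUpdateLock(lockPath: string): void {
  if (readHolderPid(lockPath) === process.pid) fs.rmSync(lockPath, { force: true });
}
```

One caveat of this sketch: reclaim is best-effort. If two late arrivals both read the same dead PID, the second `rmSync` can delete the lock the first has just re-created. The window is small because it needs two processes in the reclaim branch at the same moment, but it is not zero.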
Tests (`tests/cli/cli-update.test.ts`):

- alive holder → skip, no spawn, lockfile untouched
- stale holder (PID `0x7FFFFFFF`) → reclaim and proceed
- lock released on success / npm-fail / agent-refresh-fail
- lock NOT acquired on any no-op exit path

Multi-process smoke test (3 real node processes against the same lockfile): exactly 1 acquired the lock and ran the install, 2 logged the skip message and exited 0 in 1 ms, and the lockfile was cleaned up.
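The exit-path cases in the test list depend on two structural choices: the no-op checks run before the lock is touched, and the lock is released in a `finally`. The sketch below shows that shape with injected dependencies so it can be exercised without npm. `UpdateDeps`, `runUpdateSketch`, and the outcome strings are hypothetical and do not mirror the real `runUpdate()` signature.

```typescript
// Hypothetical injected dependencies; the real runUpdate() is not structured this way.
interface UpdateDeps {
  currentVersion(): string | null; // null: npx / local-dev / unknown install
  latestVersion(): string | null;  // null: registry lookup failed
  acquireLock(): boolean;
  releaseLock(): void;
  npmInstallGlobal(): boolean;     // stands in for `npm install -g @deeplake/hivemind@latest`
  refreshAgents(): boolean;        // stands in for `hivemind install --skip-auth`
  log(msg: string): void;
}

type UpdateOutcome = "noop" | "skipped" | "updated" | "npm-fail" | "agent-refresh-fail";

function runUpdateSketch(deps: UpdateDeps, dryRun = false): UpdateOutcome {
  const current = deps.currentVersion();
  const latest = deps.latestVersion();
  // Every no-op exit returns before the lock is touched.
  if (dryRun || current === null || latest === null || current === latest) {
    return "noop";
  }
  if (!deps.acquireLock()) {
    deps.log("another hivemind update is already running; skipping");
    return "skipped";
  }
  try {
    if (!deps.npmInstallGlobal()) return "npm-fail";
    if (!deps.refreshAgents()) return "agent-refresh-fail";
    return "updated";
  } finally {
    // Reached on success, on both failure returns, and when either step throws.
    deps.releaseLock();
  }
}
```

Because the release sits in `finally`, the `npm-fail` and `agent-refresh-fail` early returns and any thrown error all pass through it, which is what the "released on every exit path" assertions check.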
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough: This PR adds a non-blocking PID-file lock to
Changes: Concurrency Lock for npm-global Updates
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Warning: there were issues while running some tools. 🔧 ESLint skipped: no ESLint configuration detected in root package.json.
Coverage Report
Scope: files changed in this PR. Enforced threshold: 90% per metric (per file via
File Coverage — 1 file changed
Generated for commit bd536b2.
Real-world test — 5 concurrent
| assertion | want | got |
|---|---|---|
| processes that ran `Upgrading via npm…` | 1 | 1 |
| processes that logged `another hivemind update is already running` | 4 | 4 |
| `~/.deeplake/hivemind-update.lock` after run | absent | absent |
| orphan `.hivemind-<hash>` dir in `lib/node_modules/@deeplake/` | 0 | 0 |
| `lib/node_modules/@deeplake/hivemind/package.json` after run | present | present |
| final `hivemind --version` | 0.7.38 (npm latest) | 0.7.38 |
All 5 subprocesses exited 0. The winner ran the actual `npm install -g @deeplake/hivemind@latest`; the 4 losers exited within milliseconds with the skip log. No corruption, no leak, and no orphan backup left to block future installs.
Restored the worktree `package.json` to 0.7.37 and rebuilt; `npm install -g @deeplake/hivemind@latest` then brought the user's global install back to the published latest.
Summary
`SessionStart` hooks dispatch `hivemind update` detached on every session, twice per session (sync + async) by design. With multiple Claude Code sessions starting in the same second, 2–N concurrent `npm install -g @deeplake/hivemind@latest` calls race on npm's reify step. They all try to rename the existing install to the same deterministic backup path (`.hivemind-<hash>`), all but one fail with `ENOTEMPTY`, and the winner can still be SIGKILLed mid-extract, leaving `node_modules/` populated but `package.json` and `bundle/` missing. End-user symptom: `hivemind: command not found` after autoupdate, plus an orphan `.hivemind-<hash>` backup that blocks every subsequent `npm i -g`.

The fix is a non-blocking `O_EXCL` pidfile lock at `~/.deeplake/hivemind-update.lock`, owned by `runUpdate()` for the lifetime of `npm install -g` + `hivemind install --skip-auth`. Late arrivals log `another hivemind update is already running (pid=N); skipping` and exit 0. Stale holders are reclaimed via `process.kill(pid, 0)`. The lock is released on every exit path and is not acquired for no-op paths (up-to-date, registry-fail, npx, local-dev, unknown, dry-run), so a misbehaving caller can't block real updaters.

This lands the follow-up noted at `src/hooks/shared/autoupdate.ts:32-35` and `:53`: the lock that was intentionally moved out of `autoupdate.ts` (because dispatch returns instantly) but never landed in `runUpdate()`.

Deliberately NOT changed
- Number of `autoUpdate()` dispatch sites. `autoupdate.ts:32-53` documents the double-fire as intentional; concurrency is the lock's job.
- No recency cache. `autoupdate.ts:37-54` documents this as explicitly rejected on review ("publish 13:01, users started 13:00 don't pick up until 17:00 — unacceptable").

Observed incident
Reproduced 2026-05-20 after the 0.7.37 publish triggered another autoupdate race.
Test plan
- `npm run typecheck`: clean
- `tests/cli/cli-update.test.ts > runUpdate — concurrency lock` covering: alive holder → skip + no spawn + lockfile untouched; stale holder (PID `0x7FFFFFFF`) → reclaim + proceed; lock released on success / npm-fail / agent-refresh-fail; lock NOT acquired on any no-op exit path

Follow-ups (NOT in this PR)
`hivemind` binary (e.g. `session-start.ts:137` `hivemind skillify mine-local`) may hit a binary whose plugin version differs from the in-session `${CLAUDE_PLUGIN_ROOT}`, leading to protocol mismatches. And `plugin-cache-gc` (keep-3 by version) can delete the version a long-lived session is still pinned to, producing `UserPromptSubmit hook error: Plugin directory does not exist: ...hivemind/<pinned-version>`. Neither is fixed here.

Summary by CodeRabbit
Release Notes
Bug Fixes
Tests