Skip to content

fix(cli): serialize concurrent hivemind update with O_EXCL lock#187

Merged
kaghni merged 1 commit into
mainfrom
fix/auto-update-race
May 21, 2026
Merged

fix(cli): serialize concurrent hivemind update with O_EXCL lock#187
kaghni merged 1 commit into
mainfrom
fix/auto-update-race

Conversation

@kaghni
Copy link
Copy Markdown
Collaborator

@kaghni kaghni commented May 20, 2026

Summary

  • Bug: SessionStart hooks dispatch hivemind update detached on every session, twice per session (sync + async) by design. With multiple Claude Code sessions starting in the same second, 2–N concurrent npm install -g @deeplake/hivemind@latest calls race on npm's reify step. They all try to rename the existing install to the same deterministic backup path (.hivemind-<hash>), all but one fail with ENOTEMPTY, and the winner can still be SIGKILLed mid-extract — leaving node_modules/ populated but package.json and bundle/ missing. End-user symptom: hivemind: command not found after autoupdate, plus an orphan .hivemind-<hash> backup that blocks every subsequent npm i -g.
  • Fix: non-blocking O_EXCL pidfile lock at ~/.deeplake/hivemind-update.lock, owned by runUpdate() for the lifetime of npm install -g + hivemind install --skip-auth. Late arrivals log another hivemind update is already running (pid=N); skipping and exit 0. Stale-holder reclaim via process.kill(pid, 0). Released on every exit path. Not acquired for no-op paths (up-to-date, registry-fail, npx, local-dev, unknown, dry-run) so a misbehaving caller can't block real updaters.
  • This is the exact follow-up promised in src/hooks/shared/autoupdate.ts:32-35,:53 — the lock that was intentionally moved out of autoupdate.ts (because dispatch returns instantly) but never landed in runUpdate().

Deliberately NOT changed

  • Number of autoUpdate() dispatch sites. autoupdate.ts:32-53 documents the double-fire as intentional; concurrency is the lock's job.
  • No recency cache. autoupdate.ts:37-54 documents this as explicitly rejected on review ("publish 13:01, users started 13:00 don't pick up until 17:00 — unacceptable").

Observed incident

2026-05-19 17:39:21.184  npm install -g @deeplake/hivemind@latest  (Process A)
2026-05-19 17:39:21.214  npm install -g @deeplake/hivemind@latest  (Process B)
2026-05-19 17:39:21.602  npm install -g @deeplake/hivemind@latest  (Process C)

→ B and C: ENOTEMPTY rename ... -> .hivemind-9dx9PBcH
→ A: SIGKILLed mid-tarball-extract
→ /lib/node_modules/@deeplake/hivemind/  has node_modules/ only
→ /bin/hivemind  → dangling symlink → command not found

Reproduced 2026-05-20 after the 0.7.37 publish triggered another autoupdate race.

Test plan

  • npm run typecheck — clean
  • 6 new unit tests in tests/cli/cli-update.test.ts > runUpdate — concurrency lock covering: alive holder → skip + no spawn + lockfile untouched, stale holder (PID 0x7FFFFFFF) → reclaim + proceed, lock released on success / npm-fail / agent-refresh-fail, lock NOT acquired on any no-op exit path
  • Full focused suite (35/35) stable across 5 runs
  • Multi-process smoke: 3 real node processes launched concurrently against the same lockfile —
    npm install -g spawned by:                  1 processes  (want: 1)
    'another update already running' logged by: 2 processes  (want: 2)
    exit code 0 from:                           3 processes  (want: 3)
    ✓ PASS
    
    Winner ran the fake install (~1605ms), losers exited in 1ms with the skip log, lockfile cleaned up.

Follow-ups (NOT in this PR)

  • A separate, distinct bug surfaces an unrelated failure mode B: in-session hooks that shell out to the hivemind binary (e.g. session-start.ts:137 hivemind skillify mine-local) may hit a binary whose plugin version differs from the in-session ${CLAUDE_PLUGIN_ROOT}, leading to protocol mismatches. And plugin-cache-gc (keep-3 by version) can delete the version a long-lived session is still pinned to, producing UserPromptSubmit hook error: Plugin directory does not exist: ...hivemind/<pinned-version>. Neither is fixed here.

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Enhanced concurrency handling for npm-global updates; duplicate concurrent update attempts are now silently skipped, and lock files are properly cleaned up.
  • Tests

    • Added comprehensive test coverage for concurrent update scenarios and lock management.

Review Change Stack

`SessionStart` hooks dispatch `hivemind update` detached on every session
start, twice per session (once from session-start.ts, once from
session-start-setup.ts — both by design, per the comment in
src/hooks/shared/autoupdate.ts). With multiple Claude Code sessions
starting in the same second, 2-N concurrent `npm install -g
@deeplake/hivemind@latest` invocations race on npm's reify step: each
one tries to rename the existing install to the SAME deterministic
backup path (`.hivemind-<hash>`), all but one fail with ENOTEMPTY, and
the winner can still be SIGKILLed mid-extract. Result on disk:
`node_modules/` present but `package.json` / `bundle/` missing → bin
symlink dangles → `hivemind: command not found`. Also leaves an
orphan `.hivemind-<hash>` backup that blocks every subsequent install.

Observed in production 2026-05-19 (3 concurrent installs at 17:39:21);
reproduced again on 2026-05-20 after the 0.7.37 publish triggered a
fresh autoupdate race.

The fix is exactly what `src/hooks/shared/autoupdate.ts:32-53` and `:53`
already promised as a follow-up: a non-blocking O_EXCL pidfile lock at
`~/.deeplake/hivemind-update.lock`, owned by `runUpdate()` for the
lifetime of the `npm install -g` + `hivemind install --skip-auth`
sequence. Late arrivals log `another hivemind update is already running
(pid=N); skipping` and exit 0. Stale-holder reclaim via
`process.kill(pid, 0)`. Lock released on every exit path (success,
npm-fail, agent-refresh-fail, throw). Lock is NOT acquired for no-op
exit paths (up-to-date, registry-fail, npx, local-dev, unknown,
dry-run) so a misbehaving caller can't block real updaters.

Deliberately NOT changed:
- Number of autoUpdate() dispatch sites — autoupdate.ts:32-53 documents
  the double-fire as intentional; concurrency is the lock's job.
- No recency cache — autoupdate.ts:37-54 documents this as explicitly
  rejected on review (worst case: "publish 13:01, users started 13:00
  don't pick up until 17:00 — unacceptable").

Tests (tests/cli/cli-update.test.ts):
- alive holder → skip, no spawn, lockfile untouched
- stale holder (PID 0x7FFFFFFF) → reclaim and proceed
- lock released on success / npm-fail / agent-refresh-fail
- lock NOT acquired on any no-op exit path

Multi-process smoke (3 real node processes against the same lockfile):
exactly 1 acquired and ran the install, 2 logged skip + exited 0 in 1ms,
lockfile cleaned up.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 74464ad7-a4eb-4e2e-b51a-272f46169100

📥 Commits

Reviewing files that changed from the base of the PR and between 81a7290 and 1578634.

📒 Files selected for processing (2)
  • src/cli/update.ts
  • tests/cli/cli-update.test.ts

📝 Walkthrough

Walkthrough

This PR adds a non-blocking PID-file lock to runUpdate()'s npm-global path to prevent concurrent updater races. defaultLockPath() is exported as the default lock location, UpdateOptions gains lockPathOverride for testing, and lock helpers serialize npm-install and bundle-refresh steps. Concurrency tests verify lock acquisition, stale-lock reclaim, and release across success and failure scenarios.

Changes

Concurrency Lock for npm-global Updates

Layer / File(s) Summary
Lock contract and imports
src/cli/update.ts
Node.js fs utilities are added to support lockfile operations. Exported defaultLockPath() defines the standard lock location under ~/.deeplake/hivemind-update.lock.
Lock acquisition and release mechanism
src/cli/update.ts
UpdateOptions is extended with lockPathOverride for testing. tryAcquireLock() and releaseLock() implement non-blocking O_EXCL acquisition, stale-lock reclaim by unlinking dead PIDs, and best-effort cleanup.
npm-global update flow with lock
src/cli/update.ts
The "npm-global" branch wraps npm-install and bundle-refresh with lock acquisition. If another updater holds the lock, runUpdate() exits silently with code 0. The lock is released in a finally block covering both success and failure paths.
Concurrency lock test coverage
tests/cli/cli-update.test.ts
Test imports are updated for lock assertions. New "runUpdate — concurrency lock" block verifies live-lock skip, stale-lock reclaim, release on success and error, and non-acquisition for non-upgrade paths.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • activeloopai/hivemind#91: Adds the centralized npm install -g + hivemind install --skip-auth sequence that this PR now serializes with a PID-file lock.
  • activeloopai/hivemind#97: Triggers hivemind update from multiple session-start hooks; this PR's lock prevents those concurrent invocations from racing during npm-install and bundle-refresh.

Suggested reviewers

  • efenocchi

Poem

🐰 A lock so fine, pidfile in place,
No concurrent dances, a synchronized race—
One updater passes while others wait kind,
The refreshed bundles, with stale locks unpined.
In finally we trust, cleanup's pristine. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: adding an O_EXCL lock to serialize concurrent hivemind updates.
Description check ✅ Passed The description comprehensively covers the bug, fix, test plan (all checkboxes completed), and explicitly notes what was deliberately not changed.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/auto-update-race

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot requested a review from efenocchi May 20, 2026 06:05
@github-actions
Copy link
Copy Markdown
Contributor

Coverage Report

Scope: files changed in this PR. Enforced threshold: 90% per metric (per file via vitest.config.ts).

Status Category Percentage Covered / Total
🟢 Lines 97.62% (🎯 90%) 123 / 126
🟢 Statements 95.68% (🎯 90%) 133 / 139
🟢 Functions 100.00% (🎯 90%) 9 / 9
🔴 Branches 89.23% (🎯 90%) 58 / 65
File Coverage — 1 file changed
File Stmts Branches Functions Lines
src/cli/update.ts 🟢 95.7% 🔴 89.2% 🟢 100.0% 🟢 97.6%

Generated for commit bd536b2.

@kaghni
Copy link
Copy Markdown
Collaborator Author

kaghni commented May 20, 2026

Real-world test — 5 concurrent hivemind update against actual npm registry

To exercise the lock under realistic contention (not just the source-level unit tests), I built this branch as a tarball with a downgraded version (0.7.0) to force the upgrade branch, npm install -g'd it as the global hivemind, then fired 5 simultaneous hivemind update invocations.

Setup

worktree version → 0.7.0
npm run build && npm pack  → deeplake-hivemind-0.7.0.tgz
npm install -g ./deeplake-hivemind-0.7.0.tgz
hivemind --version → 0.7.0

Drive

rm -f ~/.deeplake/hivemind-update.lock
for i in 1 2 3 4 5; do (hivemind update > /tmp/out.$i 2>&1) & done
wait

Result

assertion want got
processes that ran Upgrading via npm… 1 1
processes that logged another hivemind update is already running 4 4
~/.deeplake/hivemind-update.lock after run absent absent
orphan .hivemind-<hash> dir in lib/node_modules/@deeplake/ 0 0
lib/node_modules/@deeplake/hivemind/package.json after run present present
final hivemind --version 0.7.38 (npm latest) 0.7.38

All 5 subprocesses exited 0. The winner ran the actual npm install -g @deeplake/hivemind@latest; the 4 losers exited within milliseconds with the skip log. No corruption, no leak, no orphan backup left to block future installs.

Restored worktree package.json to 0.7.37 and rebuilt; npm install -g @deeplake/hivemind@latest brought the user's global install back to published latest.

@kaghni kaghni merged commit 83a878a into main May 21, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants