Skip to content

refactor: extract SyncStorage interface for async backends#75

Merged
HrushiBorhade merged 1 commit into
mainfrom
phase-2/sync-storage
Mar 24, 2026
Merged

refactor: extract SyncStorage interface for async backends#75
HrushiBorhade merged 1 commit into
mainfrom
phase-2/sync-storage

Conversation

@HrushiBorhade
Copy link
Copy Markdown
Owner

@HrushiBorhade HrushiBorhade commented Mar 24, 2026

Summary

Extracts a SyncStorage interface from sync.ts so the Merkle sync logic works with both SQLite (CLI) and async Drizzle/Neon (cloud tasks).

Changes

  • SyncStorage interfacegetFileHash, setFileHash, getDirHash, setDirHash, clearDirHashes, getAllFileHashes, transaction
  • SqliteSyncStorage class — wraps existing better-sqlite3 calls, backward compatible
  • computeChanges + persistMerkleState — now accept optional SyncStorage parameter (defaults to SqliteSyncStorage)
  • Pure Merkle logic exportedbuildMerkleTree, hashFiles for reuse
  • All functions now async — returns Promise for compatibility with async backends

Test plan

  • All 10 sync tests pass
  • pnpm typecheck passes
  • Backward compatible — existing CLI works unchanged (defaults to SqliteSyncStorage)
  • DrizzleSyncStorage implementation (Step 3, in apps/trigger)

Closes #70

Summary by CodeRabbit

  • New Features

    • Extended the public API with new utilities for building Merkle trees and hashing files.
    • Added configurable storage abstraction enabling custom storage implementations.
  • Breaking Changes

    • persistMerkleState is now asynchronous and requires await when called.

- SyncStorage interface abstracts DB I/O (getFileHash, setFileHash, getDirHash, etc.)
- SqliteSyncStorage wraps existing better-sqlite3 calls (backward compatible)
- computeChanges and persistMerkleState now accept optional SyncStorage parameter
- Pure Merkle logic (buildMerkleTree, hashFiles) extracted and exported
- All sync functions are now async (returns Promise)
- All 10 sync tests pass with no modifications to test logic

Closes #70
@HrushiBorhade HrushiBorhade added phase-2 Phase 2: Indexing Pipeline indexing Indexing pipeline tasks labels Mar 24, 2026
@vercel
Copy link
Copy Markdown

vercel Bot commented Mar 24, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
code-indexer-web Ready Ready Preview, Comment Mar 24, 2026 11:06am

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 24, 2026

📝 Walkthrough

Walkthrough

The changes introduce a SyncStorage interface abstracting database persistence operations for file and directory hashes, replacing direct SQLite calls with dependency injection. SqliteSyncStorage implements this interface for CLI usage. Core functions computeChanges() and persistMerkleState() now accept optional storage parameters and operate asynchronously via promises.

Changes

Cohort / File(s) Summary
Storage Abstraction Layer
packages/core/src/lib/sync.ts
Introduced SyncStorage interface defining async methods for hash persistence (getFileHash, setFileHash, getDirHash, setDirHash, getAllFileHashes, transaction). Added SqliteSyncStorage implementation wrapping existing SQLite operations. Refactored computeChanges() and persistMerkleState() to accept optional storage parameter and converted to async/await patterns.
Test Updates
packages/core/src/lib/sync.test.ts
Added await to all persistMerkleState() calls to properly handle async control flow in test assertions.
Public API Expansion
packages/core/src/index.ts
Extended named exports with buildMerkleTree, hashFiles, SqliteSyncStorage. Updated type exports to include SyncStorage and MerkleTree alongside SyncResult.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hop! The storage interface now sets us free,
No sync-locked bounds, async you see!
SqliteStorage hops with the CLI,
While Drizzle awaits in the cloud up high—
One pattern, many backends, pure glee! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main refactoring objective: extracting a SyncStorage interface to support async backends, which aligns with the primary changeset across sync.ts, tests, and exports.
Linked Issues check ✅ Passed All coding requirements from issue #70 are met: SyncStorage interface defined, pure Merkle logic (computeChanges, persistMerkleState) extracted, SqliteSyncStorage implemented, sync.ts refactored to accept SyncStorage parameter, exports updated, and all tests pass.
Out of Scope Changes check ✅ Passed All changes are directly scoped to requirements: new interface definition, function refactoring, export updates, and test adjustments for async behavior. No unrelated modifications detected.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch phase-2/sync-storage

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (4)
packages/core/src/lib/sync.ts (3)

86-100: Transaction safety concern with async operations inside synchronous SQLite transaction.

The comment acknowledges that better-sqlite3 transactions are synchronous, but the current implementation may not provide true atomicity guarantees. If fn() contains multiple await calls (as it does in persistMerkleState), other code could potentially execute database operations between the microtasks, since JavaScript's event loop may interleave other tasks.

For the current single-threaded CLI use case this is likely acceptable, but for robustness, consider using better-sqlite3's native db.transaction() wrapper which ensures atomicity:

♻️ Alternative using better-sqlite3 native transaction
  async transaction<T>(fn: () => Promise<T>): Promise<T> {
-   // better-sqlite3 transactions are synchronous. Since SqliteSyncStorage
-   // methods wrap sync calls in async, all awaits resolve in the same
-   // microtask. We use manual BEGIN/COMMIT to support the async fn.
-   const db = getDb();
-   db.exec('BEGIN');
-   try {
-     const result = await fn();
-     db.exec('COMMIT');
-     return result;
-   } catch (err) {
-     db.exec('ROLLBACK');
-     throw err;
-   }
+   // Since SqliteSyncStorage methods are thin async wrappers over sync calls,
+   // all awaits resolve immediately. The result is effectively synchronous.
+   // For true atomicity with better-sqlite3, we could batch operations,
+   // but for CLI single-threaded use, manual BEGIN/COMMIT suffices.
+   const db = getDb();
+   db.exec('BEGIN IMMEDIATE');
+   try {
+     const result = await fn();
+     db.exec('COMMIT');
+     return result;
+   } catch (err) {
+     db.exec('ROLLBACK');
+     throw err;
+   }
  }

Using BEGIN IMMEDIATE instead of BEGIN acquires a write lock immediately, preventing other connections from writing during the transaction.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/lib/sync.ts` around lines 86 - 100, The transaction
implementation in transaction() uses manual 'BEGIN'/'COMMIT' which can be
interleaved by async microtasks and doesn't guarantee atomicity; replace it with
better-sqlite3's native transaction wrapper or at minimum use 'BEGIN IMMEDIATE'
to acquire the write lock immediately: get the DB via getDb(), create a
synchronous transaction function via db.transaction(fnSync) and invoke it (or
change db.exec('BEGIN') to db.exec('BEGIN IMMEDIATE')) so that calls like
persistMerkleState are executed atomically without other DB operations
interleaving.

305-311: Redundant re-read of file hashes within the same transaction.

The code writes file hashes (lines 296-301), then immediately re-reads them from storage (lines 306-311) to build persistedHashMap. Since both operations occur in the same transaction, you already know which files were successfully persisted—they're the ones in successfulFiles with a hash in fileHashMap.

♻️ Avoid redundant storage reads
-   // Rebuild tree using only files with persisted hashes to avoid
-   // caching dir hashes that include failed files
-   const persistedHashMap = new Map<string, string>();
-   for (const file of files) {
-     const stored = await storage.getFileHash(file);
-     if (stored) {
-       persistedHashMap.set(file, stored);
-     }
-   }
+   // Rebuild tree using only files with persisted hashes to avoid
+   // caching dir hashes that include failed files.
+   // Use the hashes we just wrote (successfulFiles) plus any previously
+   // stored hashes for files not in this batch.
+   const persistedHashMap = new Map<string, string>();
+   for (const file of files) {
+     if (successfulFiles.has(file)) {
+       const hash = fileHashMap.get(file);
+       if (hash) persistedHashMap.set(file, hash);
+     } else {
+       const stored = await storage.getFileHash(file);
+       if (stored) persistedHashMap.set(file, stored);
+     }
+   }

This reduces storage reads from files.length to files.length - successfulFiles.size, which could be significant for large file sets.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/lib/sync.ts` around lines 305 - 311, The code redundantly
re-reads hashes from storage into persistedHashMap by calling
storage.getFileHash for every file; instead, for files written in the same
transaction use the already-known hashes from fileHashMap for those in
successfulFiles and only call storage.getFileHash for files not in
successfulFiles. Update the loop that builds persistedHashMap to: iterate files,
if file is in successfulFiles and has an entry in fileHashMap use that hash,
otherwise await storage.getFileHash(file) and set if present; reference
persistedHashMap, files, successfulFiles, fileHashMap and storage.getFileHash
when making the change.

253-264: Sequential awaits in hot path may impact performance with async backends.

Each file hash lookup is awaited sequentially inside the nested loop. For SqliteSyncStorage this resolves immediately, but a true async backend (like the planned DrizzleSyncStorage) could incur network latency per call, making this O(n) in round-trips.

Consider batching lookups or prefetching hashes for changed directories:

💡 Optimization idea for async backends
// For async backends, consider prefetching all file hashes for a directory
// or adding a batch method to SyncStorage:
// getFileHashes(filePaths: string[]): Promise<Map<string, string | null>>

This is acceptable for Phase 2 with SQLite, but worth noting for the DrizzleSyncStorage implementation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/lib/sync.ts` around lines 253 - 264, The loop in sync.ts
iterates dirFiles and awaits storage.getFileHash(file) sequentially, causing
many round-trips for async backends; update the SyncStorage interface to add a
batched method (e.g., getFileHashes(filePaths: string[]):
Promise<Map<string,string|null>>) and change the logic around dirFiles /
fileHashMap to call that batch method once per dir, then compare returned hashes
to currentHash from fileHashMap and populate added and modified accordingly;
ensure SqliteSyncStorage implements the new batch method (delegating to the fast
path) so behavior is unchanged for SQLite while DrizzleSyncStorage can implement
a single efficient query.
packages/core/src/lib/sync.test.ts (1)

152-169: Consider adding a test for custom SyncStorage implementations.

The test suite validates SqliteSyncStorage behavior (via default parameter), but doesn't test the interface contract itself. Adding a test with a mock SyncStorage would help validate that custom implementations (like the planned DrizzleSyncStorage) will work correctly.

💡 Example mock storage test
it('works with custom SyncStorage implementation', async () => {
  const mockStorage: SyncStorage = {
    getFileHash: vi.fn().mockResolvedValue(null),
    setFileHash: vi.fn().mockResolvedValue(undefined),
    getDirHash: vi.fn().mockResolvedValue(null),
    setDirHash: vi.fn().mockResolvedValue(undefined),
    clearDirHashes: vi.fn().mockResolvedValue(undefined),
    getAllFileHashes: vi.fn().mockResolvedValue(new Map()),
    transaction: vi.fn().mockImplementation((fn) => fn()),
  };

  await fs.writeFile(path.join(tmpDir, 'test.ts'), 'const x = 1;');
  const files = [path.join(tmpDir, 'test.ts')];
  
  const result = await computeChanges(files, tmpDir, mockStorage);
  expect(result.added).toHaveLength(1);
  expect(mockStorage.getDirHash).toHaveBeenCalled();
});

This can be deferred to when DrizzleSyncStorage is implemented (Step 3).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/lib/sync.test.ts` around lines 152 - 169, Add a unit test
that exercises the SyncStorage interface by passing a mock implementation into
computeChanges (instead of relying on the default SqliteSyncStorage) and
asserting behavior and calls; specifically, create a mock SyncStorage object
implementing getFileHash, setFileHash, getDirHash, setDirHash, clearDirHashes,
getAllFileHashes, and transaction (transaction should invoke the provided
callback), write a test file in tmpDir, call computeChanges(files, tmpDir,
mockStorage) and assert expected results (e.g., added length) and that
mockStorage.getDirHash / getFileHash / setFileHash (or other pertinent methods)
were called to validate custom storage integrations (refer to computeChanges and
persistMerkleState to locate where the storage is used).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@packages/core/src/lib/sync.test.ts`:
- Around line 152-169: Add a unit test that exercises the SyncStorage interface
by passing a mock implementation into computeChanges (instead of relying on the
default SqliteSyncStorage) and asserting behavior and calls; specifically,
create a mock SyncStorage object implementing getFileHash, setFileHash,
getDirHash, setDirHash, clearDirHashes, getAllFileHashes, and transaction
(transaction should invoke the provided callback), write a test file in tmpDir,
call computeChanges(files, tmpDir, mockStorage) and assert expected results
(e.g., added length) and that mockStorage.getDirHash / getFileHash / setFileHash
(or other pertinent methods) were called to validate custom storage integrations
(refer to computeChanges and persistMerkleState to locate where the storage is
used).

In `@packages/core/src/lib/sync.ts`:
- Around line 86-100: The transaction implementation in transaction() uses
manual 'BEGIN'/'COMMIT' which can be interleaved by async microtasks and doesn't
guarantee atomicity; replace it with better-sqlite3's native transaction wrapper
or at minimum use 'BEGIN IMMEDIATE' to acquire the write lock immediately: get
the DB via getDb(), create a synchronous transaction function via
db.transaction(fnSync) and invoke it (or change db.exec('BEGIN') to
db.exec('BEGIN IMMEDIATE')) so that calls like persistMerkleState are executed
atomically without other DB operations interleaving.
- Around line 305-311: The code redundantly re-reads hashes from storage into
persistedHashMap by calling storage.getFileHash for every file; instead, for
files written in the same transaction use the already-known hashes from
fileHashMap for those in successfulFiles and only call storage.getFileHash for
files not in successfulFiles. Update the loop that builds persistedHashMap to:
iterate files, if file is in successfulFiles and has an entry in fileHashMap use
that hash, otherwise await storage.getFileHash(file) and set if present;
reference persistedHashMap, files, successfulFiles, fileHashMap and
storage.getFileHash when making the change.
- Around line 253-264: The loop in sync.ts iterates dirFiles and awaits
storage.getFileHash(file) sequentially, causing many round-trips for async
backends; update the SyncStorage interface to add a batched method (e.g.,
getFileHashes(filePaths: string[]): Promise<Map<string,string|null>>) and change
the logic around dirFiles / fileHashMap to call that batch method once per dir,
then compare returned hashes to currentHash from fileHashMap and populate added
and modified accordingly; ensure SqliteSyncStorage implements the new batch
method (delegating to the fast path) so behavior is unchanged for SQLite while
DrizzleSyncStorage can implement a single efficient query.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 03f5496b-3326-46db-afc5-591ef6c39786

📥 Commits

Reviewing files that changed from the base of the PR and between d172d99 and 6b5941d.

📒 Files selected for processing (3)
  • packages/core/src/index.ts
  • packages/core/src/lib/sync.test.ts
  • packages/core/src/lib/sync.ts

@HrushiBorhade HrushiBorhade merged commit 6bcd506 into main Mar 24, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

indexing Indexing pipeline tasks phase-2 Phase 2: Indexing Pipeline

Projects

None yet

Development

Successfully merging this pull request may close these issues.

refactor: extract Merkle sync logic with async Drizzle SyncStorage

1 participant