Add parallel prefetching of blob paths with a configurable thread count
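
The worker-pool pattern behind `prefetch_blob_paths` can be sketched without rugged roughly as follows. This is an illustrative sketch, not the code in this commit: the method name `prefetch`, the `serial_below` keyword, and the `work` block are hypothetical stand-ins (the block plays the role of `compute_blob_paths`), and the items are assumed to be non-nil because `nil` is used as the stop signal.

```ruby
# Hypothetical stand-in for Repository#prefetch_blob_paths.
# `work` replaces compute_blob_paths; items must be non-nil (nil is the sentinel).
def prefetch(items, thread_count: 4, serial_below: 1500, &work)
  cache = {}

  # Small inputs or a single thread: plain serial loop, no thread overhead.
  if items.size < serial_below || thread_count <= 1
    items.each { |item| cache[item] = work.call(item) }
    return cache
  end

  queue = Queue.new
  mutex = Mutex.new

  items.each { |item| queue << item }
  # One nil sentinel per worker so every pop loop eventually terminates.
  thread_count.times { queue << nil }

  workers = Array.new(thread_count) do
    Thread.new do
      while (item = queue.pop)
        result = work.call(item)
        # Only the shared hash write is serialized; the work itself runs unlocked.
        mutex.synchronize { cache[item] = result }
      end
    end
  end

  workers.each(&:join)
  cache
end

squares = prefetch((1..2000).to_a, thread_count: 4) { |n| n * n }
small = prefetch([1, 2, 3]) { |n| n + 1 } # below the threshold, runs serially
```

Filling the queue before starting the workers keeps the shutdown logic simple: each worker exits on the first `nil` it pops, so no worker can block forever on an empty queue.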

andrew · andrew · commit c5e58af07554 · 2026-01-04T23:14:07.000Z
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,6 +1,8 @@
 ## [Unreleased]
 
 - `git pkgs init` now installs git hooks by default (use `--no-hooks` to skip)
+- Parallel prefetching of git diffs for ~2x speedup on large repositories (1500+ commits)
+- Performance tuning via environment variables: `GIT_PKGS_BATCH_SIZE`, `GIT_PKGS_SNAPSHOT_INTERVAL`, `GIT_PKGS_THREADS`
 - Fix N+1 queries in `blame`, `stale`, `stats`, and `log` commands
 - Configuration via git config: `pkgs.ecosystems`, `pkgs.ignoredDirs`, `pkgs.ignoredFiles`
 - `git pkgs info --ecosystems` to show available ecosystems and their status
diff --git a/README.md b/README.md
@@ -52,7 +52,7 @@ Processing commit 5191/5191...
 Done!
 Analyzed 5191 commits
 Found 2531 commits with dependency changes
-Stored 28239 snapshots (every 20 changes)
+Stored 28239 snapshots (every 50 changes)
 Blob cache: 3141 unique blobs, 2349 had cache hits
 ```
 
@@ -86,7 +86,7 @@ Snapshot Coverage
 ----------------------------------------
   Commits with dependency changes: 2531
   Commits with snapshots: 127
-  Coverage: 5.0% (1 snapshot per ~20 changes)
+  Coverage: 2.0% (1 snapshot per ~50 changes)
 ```
 
 ### List dependencies
@@ -404,10 +404,10 @@ git config --add pkgs.ignoredFiles test/fixtures/package.json
 Benchmarked on an M1 MacBook Pro analyzing [octobox](https://github.com/octobox/octobox) (5191 commits, 8 years of history): init takes about 18 seconds at roughly 300 commits/sec, producing an 8.3 MB database. About half the commits (2531) had dependency changes.
 
 Optimizations:
-- Bulk inserts with transaction batching (100 commits per transaction)
+- Bulk inserts with transaction batching (500 commits per transaction)
 - Blob SHA caching (75% cache hit rate for repeated manifest content)
 - Deferred index creation during bulk load
-- Sparse snapshots (every 20 dependency-changing commits) for storage efficiency
+- Sparse snapshots (every 50 dependency-changing commits) for storage efficiency
 - SQLite WAL mode for write performance
 
 ## Supported ecosystems
diff --git a/docs/internals.md b/docs/internals.md
@@ -25,6 +25,8 @@ Snapshots exist because replaying thousands of change records to answer "what de
 
 [`Git::Pkgs::Repository`](../lib/git/pkgs/repository.rb) wraps [rugged](https://github.com/libgit2/rugged) (Ruby bindings for [libgit2](https://libgit2.org/)) for git operations. The `walk` method yields commits in topological order. `blob_paths` returns changed files for a commit. `content_at_commit` and `blob_oid_at_commit` fetch file contents and their object IDs. The OID matters for caching: if two commits have the same blob OID for a file, we don't parse it twice.
 
+For large histories, `prefetch_blob_paths` computes diffs in parallel using multiple threads (4 by default). This provides roughly 2x speedup on git operations when 1500 or more commits are processed. Below that threshold it runs serially, because the overhead of thread creation and mutex synchronization outweighs any parallel gains.
+
 ## Manifest Analysis
 
 [`Git::Pkgs::Analyzer`](../lib/git/pkgs/analyzer.rb) does the actual work of detecting and parsing manifests. It uses the [ecosystems-bibliothecary](https://github.com/ecosyste-ms/bibliothecary) gem, which supports 30+ package managers. Bibliothecary is expensive to call, so the analyzer has a `QUICK_MANIFEST_PATTERNS` regex that filters files before attempting real parsing. This cuts out most commits that touch only source code.
@@ -45,14 +47,14 @@ When you run `git pkgs init` (see [`commands/init.rb`](../lib/git/pkgs/commands/
 
 1. Creates the database schema
 2. Switches to bulk write mode (WAL, synchronous off, large cache)
-3. Walks commits chronologically
+3. Loads all commits and prefetches git diffs in parallel
 4. For each commit with manifest changes, calls `analyzer.analyze_commit`
 5. Batches inserts in transactions of 500 commits
 6. Creates dependency snapshots every 50 commits that changed dependencies
 7. Creates indexes after all data is loaded
 8. Switches back to normal sync mode
 
-Deferring index creation until the end speeds things up considerably. Both batch size and snapshot interval are configurable via environment variables (see Performance Notes below).
+Deferring index creation until the end speeds things up considerably. Batch size, snapshot interval, and thread count are configurable via environment variables (see Performance Notes below).
 
 ## Incremental Updates
 
@@ -141,7 +143,8 @@ ActiveRecord models live in [`lib/git/pkgs/models/`](../lib/git/pkgs/models/). T
 
 Typical init speed is around 75-300 commits per second depending on the repository. The main bottlenecks are git blob reads and bibliothecary parsing. The blob OID cache helps a lot: if a Gemfile hasn't changed in 50 commits, we parse it once and reuse the result. The manifest path regex filter also helps by skipping commits that only touch source files.
 
-For repositories with long histories, the database file can grow to tens of megabytes. The periodic snapshots trade storage for query speed. Two environment variables let you tune this:
+For repositories with long histories, the database file can grow to tens of megabytes. The periodic snapshots trade storage for query speed. Three environment variables let you tune performance:
 
 - `GIT_PKGS_BATCH_SIZE` - Number of commits per database transaction (default: 500). Larger batches reduce transaction overhead but use more memory.
 - `GIT_PKGS_SNAPSHOT_INTERVAL` - Store full dependency state every N commits with changes (default: 50). Lower values speed up point-in-time queries but increase database size.
+- `GIT_PKGS_THREADS` - Number of threads for parallel git diff prefetching (default: 4). Set to 1 to disable parallelism. Parallel prefetching only applies when 1500 or more commits are processed, where it provides roughly 2x speedup on git operations; smaller runs are always serial.
diff --git a/lib/git/pkgs.rb b/lib/git/pkgs.rb
@@ -44,14 +44,15 @@ class NotInitializedError < Error; end
     class NotInGitRepoError < Error; end
 
     class << self
-      attr_accessor :quiet, :git_dir, :work_tree, :db_path, :batch_size, :snapshot_interval
+      attr_accessor :quiet, :git_dir, :work_tree, :db_path, :batch_size, :snapshot_interval, :threads
 
       def configure_from_env
         @git_dir ||= presence(ENV["GIT_DIR"])
         @work_tree ||= presence(ENV["GIT_WORK_TREE"])
         @db_path ||= presence(ENV["GIT_PKGS_DB"])
         @batch_size ||= int_presence(ENV["GIT_PKGS_BATCH_SIZE"])
         @snapshot_interval ||= int_presence(ENV["GIT_PKGS_SNAPSHOT_INTERVAL"])
+        @threads ||= int_presence(ENV["GIT_PKGS_THREADS"])
       end
 
       def reset_config!
@@ -61,6 +62,7 @@ def reset_config!
         @db_path = nil
         @batch_size = nil
         @snapshot_interval = nil
+        @threads = nil
       end
 
       def int_presence(value)
diff --git a/lib/git/pkgs/commands/branch.rb b/lib/git/pkgs/commands/branch.rb
@@ -64,9 +64,13 @@ def add_branch
 
           info "Analyzing branch: #{branch_name}"
 
+          print "Loading commits..." unless Git::Pkgs.quiet
           walker = repo.walk(branch_name)
           commits = walker.to_a
           total = commits.size
+          print "\rPrefetching diffs..." unless Git::Pkgs.quiet
+          repo.prefetch_blob_paths(commits)
+          print "\r#{' ' * 20}\r" unless Git::Pkgs.quiet
 
           stats = bulk_process_commits(commits, branch, analyzer, total, repo)
 
diff --git a/lib/git/pkgs/commands/init.rb b/lib/git/pkgs/commands/init.rb
@@ -47,6 +47,8 @@ def run
           walker = repo.walk(branch_name, @options[:since])
           commits = walker.to_a
           total = commits.size
+          print "\rPrefetching diffs..." unless Git::Pkgs.quiet
+          repo.prefetch_blob_paths(commits)
           print "\r#{' ' * 20}\r" unless Git::Pkgs.quiet
 
           stats = bulk_process_commits(commits, branch, analyzer, total)
diff --git a/lib/git/pkgs/commands/update.rb b/lib/git/pkgs/commands/update.rb
@@ -50,6 +50,8 @@ def run
           walker = repo.walk(branch_name, since_sha)
           commits = walker.to_a
           total = commits.size
+          repo.prefetch_blob_paths(commits)
+
           processed = 0
           dependency_commits = 0
           last_position = Models::BranchCommit.where(branch: branch).maximum(:position) || 0
diff --git a/lib/git/pkgs/repository.rb b/lib/git/pkgs/repository.rb
@@ -10,6 +10,7 @@ class Repository
       def initialize(path = nil)
         @path = path || Git::Pkgs.git_dir || Git::Pkgs.work_tree || Dir.pwd
         @rugged = Rugged::Repository.new(@path)
+        @blob_paths_cache = {}
       end
 
       def git_dir
@@ -62,7 +63,40 @@ def lookup(sha)
         @rugged.lookup(sha)
       end
 
+      DEFAULT_THREADS = 4
+
+      def prefetch_blob_paths(commits)
+        thread_count = Git::Pkgs.threads || DEFAULT_THREADS
+
+        # Below 1500 commits, or with a single thread, serial is faster: thread overhead outweighs the gains
+        if commits.size < 1500 || thread_count <= 1
+          commits.each { |c| @blob_paths_cache[c.oid] = compute_blob_paths(c) }
+          return
+        end
+
+        queue = Queue.new
+        mutex = Mutex.new
+
+        commits.each { |c| queue << c }
+        thread_count.times { queue << nil } # one nil sentinel per worker ends its pop loop
+
+        thread_pool = thread_count.times.map do
+          Thread.new do
+            while (commit = queue.pop)
+              paths = compute_blob_paths(commit)
+              mutex.synchronize { @blob_paths_cache[commit.oid] = paths }
+            end
+          end
+        end
+
+        thread_pool.each(&:join)
+      end
+
       def blob_paths(rugged_commit)
+        @blob_paths_cache[rugged_commit.oid] || compute_blob_paths(rugged_commit)
+      end
+
+      def compute_blob_paths(rugged_commit)
         paths = []
 
         if rugged_commit.parents.empty?