This repository was archived by the owner on Jan 22, 2026. It is now read-only.

Commit c5e58af

Enhance performance with parallel prefetching of blob paths and add configurable thread count
1 parent 8e59957 commit c5e58af

8 files changed, +57 -8 lines changed

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -1,6 +1,8 @@
 ## [Unreleased]

 - `git pkgs init` now installs git hooks by default (use `--no-hooks` to skip)
+- Parallel prefetching of git diffs for ~2x speedup on large repositories (1500+ commits)
+- Performance tuning via environment variables: `GIT_PKGS_BATCH_SIZE`, `GIT_PKGS_SNAPSHOT_INTERVAL`, `GIT_PKGS_THREADS`
 - Fix N+1 queries in `blame`, `stale`, `stats`, and `log` commands
 - Configuration via git config: `pkgs.ecosystems`, `pkgs.ignoredDirs`, `pkgs.ignoredFiles`
 - `git pkgs info --ecosystems` to show available ecosystems and their status

README.md

Lines changed: 4 additions & 4 deletions
@@ -52,7 +52,7 @@ Processing commit 5191/5191...
 Done!
 Analyzed 5191 commits
 Found 2531 commits with dependency changes
-Stored 28239 snapshots (every 20 changes)
+Stored 28239 snapshots (every 50 changes)
 Blob cache: 3141 unique blobs, 2349 had cache hits
 ```

@@ -86,7 +86,7 @@ Snapshot Coverage
 ----------------------------------------
 Commits with dependency changes: 2531
 Commits with snapshots: 127
-Coverage: 5.0% (1 snapshot per ~20 changes)
+Coverage: 2.0% (1 snapshot per ~50 changes)
 ```

 ### List dependencies
@@ -404,10 +404,10 @@ git config --add pkgs.ignoredFiles test/fixtures/package.json
 Benchmarked on an M1 MacBook Pro analyzing [octobox](https://github.com/octobox/octobox) (5191 commits, 8 years of history): init takes about 18 seconds at roughly 300 commits/sec, producing an 8.3 MB database. About half the commits (2531) had dependency changes.

 Optimizations:
-- Bulk inserts with transaction batching (100 commits per transaction)
+- Bulk inserts with transaction batching (500 commits per transaction)
 - Blob SHA caching (75% cache hit rate for repeated manifest content)
 - Deferred index creation during bulk load
-- Sparse snapshots (every 20 dependency-changing commits) for storage efficiency
+- Sparse snapshots (every 50 dependency-changing commits) for storage efficiency
 - SQLite WAL mode for write performance

 ## Supported ecosystems

docs/internals.md

Lines changed: 6 additions & 3 deletions
@@ -25,6 +25,8 @@ Snapshots exist because replaying thousands of change records to answer "what de

 [`Git::Pkgs::Repository`](../lib/git/pkgs/repository.rb) wraps [rugged](https://github.com/libgit2/rugged) (Ruby bindings for [libgit2](https://libgit2.org/)) for git operations. The `walk` method yields commits in topological order. `blob_paths` returns changed files for a commit. `content_at_commit` and `blob_oid_at_commit` fetch file contents and their object IDs. The OID matters for caching: if two commits have the same blob OID for a file, we don't parse it twice.

+For large repositories, `prefetch_blob_paths` computes diffs in parallel using multiple threads. This provides roughly 2x speedup on git operations for repos with 1500+ commits. Smaller repos use serial processing because thread creation and mutex synchronization overhead exceeds any parallel gains.
+
 ## Manifest Analysis

 [`Git::Pkgs::Analyzer`](../lib/git/pkgs/analyzer.rb) does the actual work of detecting and parsing manifests. It uses the [ecosystems-bibliothecary](https://github.com/ecosyste-ms/bibliothecary) gem, which supports 30+ package managers. Bibliothecary is expensive to call, so the analyzer has a `QUICK_MANIFEST_PATTERNS` regex that filters files before attempting real parsing. This cuts out most commits that touch only source code.
@@ -45,14 +47,14 @@ When you run `git pkgs init` (see [`commands/init.rb`](../lib/git/pkgs/commands/

 1. Creates the database schema
 2. Switches to bulk write mode (WAL, synchronous off, large cache)
-3. Walks commits chronologically
+3. Loads all commits and prefetches git diffs in parallel
 4. For each commit with manifest changes, calls `analyzer.analyze_commit`
 5. Batches inserts in transactions of 500 commits
 6. Creates dependency snapshots every 50 commits that changed dependencies
 7. Creates indexes after all data is loaded
 8. Switches back to normal sync mode

-Deferring index creation until the end speeds things up considerably. Both batch size and snapshot interval are configurable via environment variables (see Performance Notes below).
+Deferring index creation until the end speeds things up considerably. Batch size, snapshot interval, and thread count are configurable via environment variables (see Performance Notes below).

 ## Incremental Updates

@@ -141,7 +143,8 @@ ActiveRecord models live in [`lib/git/pkgs/models/`](../lib/git/pkgs/models/). T

 Typical init speed is around 75-300 commits per second depending on the repository. The main bottlenecks are git blob reads and bibliothecary parsing. The blob OID cache helps a lot: if a Gemfile hasn't changed in 50 commits, we parse it once and reuse the result. The manifest path regex filter also helps by skipping commits that only touch source files.

-For repositories with long histories, the database file can grow to tens of megabytes. The periodic snapshots trade storage for query speed. Two environment variables let you tune this:
+For repositories with long histories, the database file can grow to tens of megabytes. The periodic snapshots trade storage for query speed. Three environment variables let you tune performance:

 - `GIT_PKGS_BATCH_SIZE` - Number of commits per database transaction (default: 500). Larger batches reduce transaction overhead but use more memory.
 - `GIT_PKGS_SNAPSHOT_INTERVAL` - Store full dependency state every N commits with changes (default: 50). Lower values speed up point-in-time queries but increase database size.
+- `GIT_PKGS_THREADS` - Number of threads for parallel git diff prefetching (default: 4). Set to 1 to disable parallelism. On large repositories (1500+ commits), parallel prefetching provides roughly 2x speedup on git operations.
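
As a rough illustration of the three tuning variables documented above, the sketch below builds a per-run environment override in Ruby. The helper `tuning_env` and the constant `TUNING_DEFAULTS` are hypothetical names introduced only for this example; just the variable names and default values (500, 50, 4) come from the documentation in this commit.

```ruby
# Hypothetical helper, not part of git-pkgs. Only the environment variable
# names and the default values are taken from the documentation above.
TUNING_DEFAULTS = { batch_size: 500, snapshot_interval: 50, threads: 4 }.freeze

def tuning_env(batch_size: TUNING_DEFAULTS[:batch_size],
               snapshot_interval: TUNING_DEFAULTS[:snapshot_interval],
               threads: TUNING_DEFAULTS[:threads])
  # Environment values are always strings, so convert each integer.
  {
    "GIT_PKGS_BATCH_SIZE" => batch_size.to_s,
    "GIT_PKGS_SNAPSHOT_INTERVAL" => snapshot_interval.to_s,
    "GIT_PKGS_THREADS" => threads.to_s
  }
end

# Disable parallel prefetching and snapshot more often for a single run.
env = tuning_env(threads: 1, snapshot_interval: 20)

# Kernel#system accepts an environment hash as its first argument.
# Left commented out because it only makes sense inside a git repository:
# system(env, "git", "pkgs", "init")
```

Passing the overrides as a hash to `system` keeps them scoped to one child process, so the calling shell's environment is left alone.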

lib/git/pkgs.rb

Lines changed: 3 additions & 1 deletion
@@ -44,14 +44,15 @@ class NotInitializedError < Error; end
 class NotInGitRepoError < Error; end

 class << self
-  attr_accessor :quiet, :git_dir, :work_tree, :db_path, :batch_size, :snapshot_interval
+  attr_accessor :quiet, :git_dir, :work_tree, :db_path, :batch_size, :snapshot_interval, :threads

   def configure_from_env
     @git_dir ||= presence(ENV["GIT_DIR"])
     @work_tree ||= presence(ENV["GIT_WORK_TREE"])
     @db_path ||= presence(ENV["GIT_PKGS_DB"])
     @batch_size ||= int_presence(ENV["GIT_PKGS_BATCH_SIZE"])
     @snapshot_interval ||= int_presence(ENV["GIT_PKGS_SNAPSHOT_INTERVAL"])
+    @threads ||= int_presence(ENV["GIT_PKGS_THREADS"])
   end

   def reset_config!
@@ -61,6 +62,7 @@ def reset_config!
     @db_path = nil
     @batch_size = nil
     @snapshot_interval = nil
+    @threads = nil
   end

   def int_presence(value)
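
The `||=` assignments in `configure_from_env` mean a value set in code wins over the environment. The sketch below shows that precedence in isolation. The module name `TuningSketch` is hypothetical, and the body of `int_presence` is an assumption, since this commit shows only the method's signature.

```ruby
# Hypothetical stand-in for the configuration code in this diff. The body of
# int_presence is assumed; the commit does not show the real implementation.
module TuningSketch
  class << self
    attr_accessor :threads

    # Takes the environment as a parameter so the example needs no global state.
    def configure_from_env(env = ENV)
      # ||= leaves a value that was already assigned in code untouched.
      @threads ||= int_presence(env["GIT_PKGS_THREADS"])
    end

    def reset_config!
      @threads = nil
    end

    # Assumed behavior: nil or blank means "not set", anything else must
    # parse as a base-10 integer (Integer raises ArgumentError otherwise).
    def int_presence(value)
      return nil if value.nil? || value.strip.empty?

      Integer(value, 10)
    end
  end
end

# With nothing configured, the environment value is picked up.
TuningSketch.configure_from_env({ "GIT_PKGS_THREADS" => "8" })
from_env = TuningSketch.threads

# A value assigned in code is kept, even if the environment disagrees.
TuningSketch.reset_config!
TuningSketch.threads = 2
TuningSketch.configure_from_env({ "GIT_PKGS_THREADS" => "8" })
from_code = TuningSketch.threads
```

When the variable is unset, `threads` stays `nil`, which is what lets the caller in `repository.rb` fall back to `DEFAULT_THREADS` with `||`.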

lib/git/pkgs/commands/branch.rb

Lines changed: 4 additions & 0 deletions
@@ -64,9 +64,13 @@ def add_branch

 info "Analyzing branch: #{branch_name}"

+print "Loading commits..." unless Git::Pkgs.quiet
 walker = repo.walk(branch_name)
 commits = walker.to_a
 total = commits.size
+print "\rPrefetching diffs..." unless Git::Pkgs.quiet
+repo.prefetch_blob_paths(commits)
+print "\r#{' ' * 20}\r" unless Git::Pkgs.quiet

 stats = bulk_process_commits(commits, branch, analyzer, total, repo)
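
The status output in this hunk relies on `\r` to overwrite the current terminal line, then clears it with a fixed run of 20 spaces, which happens to match the 20-character "Prefetching diffs..." message. A minimal sketch of the same technique that tracks the width instead of hard-coding it might look like this. The `StatusLine` class is hypothetical and is not part of git-pkgs.

```ruby
require "stringio"

# Hypothetical helper, not part of git-pkgs: remembers the length of the last
# status message so the clearing step always covers it.
class StatusLine
  def initialize(io = $stdout)
    @io = io
    @width = 0
  end

  # Replace whatever is on the current line with a new message.
  def show(message)
    clear
    @io.print(message)
    @width = message.length
  end

  # Return to column 0, blank out the previous message, return again.
  def clear
    @io.print("\r#{' ' * @width}\r")
    @width = 0
  end
end

# Writing to a StringIO keeps the example free of terminal side effects.
out = StringIO.new
status = StatusLine.new(out)
status.show("Loading commits...")
status.show("Prefetching diffs...")
status.clear
```

Tracking the width avoids leftover characters if a longer status message is added later.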

lib/git/pkgs/commands/init.rb

Lines changed: 2 additions & 0 deletions
@@ -47,6 +47,8 @@ def run
 walker = repo.walk(branch_name, @options[:since])
 commits = walker.to_a
 total = commits.size
+print "\rPrefetching diffs..." unless Git::Pkgs.quiet
+repo.prefetch_blob_paths(commits)
 print "\r#{' ' * 20}\r" unless Git::Pkgs.quiet

 stats = bulk_process_commits(commits, branch, analyzer, total)

lib/git/pkgs/commands/update.rb

Lines changed: 2 additions & 0 deletions
@@ -50,6 +50,8 @@ def run
 walker = repo.walk(branch_name, since_sha)
 commits = walker.to_a
 total = commits.size
+repo.prefetch_blob_paths(commits)
+
 processed = 0
 dependency_commits = 0
 last_position = Models::BranchCommit.where(branch: branch).maximum(:position) || 0

lib/git/pkgs/repository.rb

Lines changed: 34 additions & 0 deletions
@@ -10,6 +10,7 @@ class Repository
 def initialize(path = nil)
   @path = path || Git::Pkgs.git_dir || Git::Pkgs.work_tree || Dir.pwd
   @rugged = Rugged::Repository.new(@path)
+  @blob_paths_cache = {}
 end

 def git_dir
@@ -62,7 +63,40 @@ def lookup(sha)
   @rugged.lookup(sha)
 end

+DEFAULT_THREADS = 4
+
+def prefetch_blob_paths(commits)
+  thread_count = Git::Pkgs.threads || DEFAULT_THREADS
+
+  # Serial is faster for small repos due to thread overhead
+  if commits.size < 1500 || thread_count <= 1
+    commits.each { |c| @blob_paths_cache[c.oid] = compute_blob_paths(c) }
+    return
+  end
+
+  queue = Queue.new
+  mutex = Mutex.new
+
+  commits.each { |c| queue << c }
+  thread_count.times { queue << nil }
+
+  thread_pool = thread_count.times.map do
+    Thread.new do
+      while (commit = queue.pop)
+        paths = compute_blob_paths(commit)
+        mutex.synchronize { @blob_paths_cache[commit.oid] = paths }
+      end
+    end
+  end
+
+  thread_pool.each(&:join)
+end
+
 def blob_paths(rugged_commit)
+  @blob_paths_cache[rugged_commit.oid] || compute_blob_paths(rugged_commit)
+end
+
+def compute_blob_paths(rugged_commit)
   paths = []

   if rugged_commit.parents.empty?
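
`prefetch_blob_paths` uses a common Ruby worker-pool shape: fill a `Queue`, append one `nil` sentinel per worker, and let each thread pop until it sees a sentinel. The standalone sketch below isolates that pattern so it can be run without rugged. The method name `parallel_map_to_hash` is hypothetical, and the block passed to it stands in for `compute_blob_paths`.

```ruby
# Generic sketch of the queue-plus-sentinel pattern used in the diff above.
# parallel_map_to_hash is a hypothetical name; the block stands in for
# compute_blob_paths. Items must be truthy, because nil ends a worker's loop.
def parallel_map_to_hash(items, thread_count: 4, &block)
  results = {}
  mutex = Mutex.new
  queue = Queue.new

  items.each { |item| queue << item }
  # One nil per worker: popping it makes the while condition false.
  thread_count.times { queue << nil }

  workers = Array.new(thread_count) do
    Thread.new do
      while (item = queue.pop)
        value = block.call(item) # the work runs outside the lock
        mutex.synchronize { results[item] = value }
      end
    end
  end

  # Wait for every worker before handing back the shared hash.
  workers.each(&:join)
  results
end

squares = parallel_map_to_hash((1..10).to_a, thread_count: 3) { |n| n * n }
```

Only the hash write is held under the mutex, so the lock is kept for as short a time as possible. Results arrive in no particular order, which is fine here because they are keyed by item, in the same way the cache in the diff is keyed by commit OID.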
