This repository was archived by the owner on Jan 22, 2026. It is now read-only.
Benchmarked on an M1 MacBook Pro analyzing [octobox](https://github.com/octobox/octobox) (5191 commits, 8 years of history): init takes about 18 seconds at roughly 300 commits/sec, producing an 8.3 MB database. About half the commits (2531) had dependency changes.

 Optimizations:
-- Bulk inserts with transaction batching (100 commits per transaction)
+- Bulk inserts with transaction batching (500 commits per transaction)
 - Blob SHA caching (75% cache hit rate for repeated manifest content)
 - Deferred index creation during bulk load
-- Sparse snapshots (every 20 dependency-changing commits) for storage efficiency
+- Sparse snapshots (every 50 dependency-changing commits) for storage efficiency
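
The blob SHA caching listed among the optimizations above can be illustrated with a minimal sketch. `BlobParseCache` and its methods are hypothetical names chosen for this example, not identifiers from git-pkgs; the idea is only that a parse result is stored under the blob's object ID, so identical manifest content is parsed once.

```ruby
# Hypothetical cache keyed by blob OID (illustrative; not the git-pkgs code).
class BlobParseCache
  attr_reader :hits, :misses

  def initialize
    @results = {}
    @hits = 0
    @misses = 0
  end

  # Returns the stored result for this OID, or runs the block once and
  # stores whatever it returns.
  def fetch(oid)
    if @results.key?(oid)
      @hits += 1
      return @results[oid]
    end
    @misses += 1
    @results[oid] = yield
  end

  # Fraction of lookups answered from the cache.
  def hit_rate
    total = @hits + @misses
    total.zero? ? 0.0 : @hits.to_f / total
  end
end

cache = BlobParseCache.new
parse_calls = 0

# Four commits; three of them carry the same manifest blob.
%w[aaa111 aaa111 bbb222 aaa111].each do |oid|
  cache.fetch(oid) do
    parse_calls += 1 # stands in for an expensive manifest parse
    "parsed:#{oid}"
  end
end
```

In this toy run the expensive block executes only for the two distinct OIDs; the hit rate on a real repository depends on how often manifests change.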
docs/internals.md: 6 additions & 3 deletions
@@ -25,6 +25,8 @@ Snapshots exist because replaying thousands of change records to answer "what de

 [`Git::Pkgs::Repository`](../lib/git/pkgs/repository.rb) wraps [rugged](https://github.com/libgit2/rugged) (Ruby bindings for [libgit2](https://libgit2.org/)) for git operations. The `walk` method yields commits in topological order. `blob_paths` returns changed files for a commit. `content_at_commit` and `blob_oid_at_commit` fetch file contents and their object IDs. The OID matters for caching: if two commits have the same blob OID for a file, we don't parse it twice.

+For large repositories, `prefetch_blob_paths` computes diffs in parallel using multiple threads. This provides roughly 2x speedup on git operations for repos with 1500+ commits. Smaller repos use serial processing because thread creation and mutex synchronization overhead exceeds any parallel gains.
+
 ## Manifest Analysis

 [`Git::Pkgs::Analyzer`](../lib/git/pkgs/analyzer.rb) does the actual work of detecting and parsing manifests. It uses the [ecosystems-bibliothecary](https://github.com/ecosyste-ms/bibliothecary) gem, which supports 30+ package managers. Bibliothecary is expensive to call, so the analyzer has a `QUICK_MANIFEST_PATTERNS` regex that filters files before attempting real parsing. This cuts out most commits that touch only source code.
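
The parallel prefetch with a serial fallback, as described for `prefetch_blob_paths` in the hunk above, might look roughly like the sketch below. The method name `prefetch_in_parallel`, the `PARALLEL_THRESHOLD` constant, and the block interface are assumptions made for illustration; only the 1500-commit cutoff and the default of 4 threads come from the text.

```ruby
# Hypothetical sketch; names and structure are not taken from git-pkgs.
PARALLEL_THRESHOLD = 1500

def prefetch_in_parallel(items, threads: 4, threshold: PARALLEL_THRESHOLD, &work)
  # Small inputs run serially: thread setup and locking cost more than they save.
  if threads <= 1 || items.size < threshold
    return items.to_h { |item| [item, work.call(item)] }
  end

  queue = Queue.new
  items.each { |item| queue << item }
  queue.close # pop returns nil once the closed queue is drained

  results = {}
  lock = Mutex.new

  workers = Array.new(threads) do
    Thread.new do
      while (item = queue.pop)
        value = work.call(item)
        # The shared hash is only written while holding the mutex.
        lock.synchronize { results[item] = value }
      end
    end
  end

  workers.each(&:join)
  results
end

shas = (1..2000).map { |i| "sha#{i}" }
diffs = prefetch_in_parallel(shas) { |sha| "diff for #{sha}" }
```

Passing `threads: 1` takes the serial path regardless of input size, which mirrors the documented way to disable parallelism.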
@@ -45,14 +47,14 @@ When you run `git pkgs init` (see [`commands/init.rb`](../lib/git/pkgs/commands/

 1. Creates the database schema
 2. Switches to bulk write mode (WAL, synchronous off, large cache)
-3. Walks commits chronologically
+3. Loads all commits and prefetches git diffs in parallel
 4. For each commit with manifest changes, calls `analyzer.analyze_commit`
 5. Batches inserts in transactions of 500 commits
 6. Creates dependency snapshots every 50 commits that changed dependencies
 7. Creates indexes after all data is loaded
 8. Switches back to normal sync mode

-Deferring index creation until the end speeds things up considerably. Both batch size and snapshot interval are configurable via environment variables (see Performance Notes below).
+Deferring index creation until the end speeds things up considerably. Batch size, snapshot interval, and thread count are configurable via environment variables (see Performance Notes below).

 ## Incremental Updates

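
Steps 5 and 6 of the init sequence in the hunk above (batched transactions and periodic snapshots) can be sketched without a database. `plan_init` and the commit hash shape are hypothetical; the sketch only counts how many transactions would be opened and which commits would receive a snapshot, assuming a snapshot is taken on every Nth commit that changed dependencies.

```ruby
# Hypothetical planning helper (illustrative; not the git-pkgs loader).
BATCH_SIZE = 500
SNAPSHOT_INTERVAL = 50

def plan_init(commits, batch_size: BATCH_SIZE, snapshot_interval: SNAPSHOT_INTERVAL)
  transactions = 0
  snapshots = []
  changed_seen = 0

  commits.each_slice(batch_size) do |batch|
    transactions += 1 # a real loader would wrap this batch in one transaction
    batch.each do |commit|
      next unless commit[:changed]

      changed_seen += 1
      # Snapshot on every Nth dependency-changing commit.
      snapshots << commit[:sha] if (changed_seen % snapshot_interval).zero?
    end
  end

  { transactions: transactions, snapshots: snapshots }
end

# 1200 commits where every second commit changes dependencies.
commits = (1..1200).map { |i| { sha: "c#{i}", changed: i.even? } }
plan = plan_init(commits)
```

With the defaults, 1200 commits fall into three batches, and the 600 dependency-changing commits yield twelve snapshots.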
@@ -141,7 +143,8 @@ ActiveRecord models live in [`lib/git/pkgs/models/`](../lib/git/pkgs/models/). T

 Typical init speed is around 75-300 commits per second depending on the repository. The main bottlenecks are git blob reads and bibliothecary parsing. The blob OID cache helps a lot: if a Gemfile hasn't changed in 50 commits, we parse it once and reuse the result. The manifest path regex filter also helps by skipping commits that only touch source files.

-For repositories with long histories, the database file can grow to tens of megabytes. The periodic snapshots trade storage for query speed. Two environment variables let you tune this:
+For repositories with long histories, the database file can grow to tens of megabytes. The periodic snapshots trade storage for query speed. Three environment variables let you tune performance:

 - `GIT_PKGS_BATCH_SIZE` - Number of commits per database transaction (default: 500). Larger batches reduce transaction overhead but use more memory.
 - `GIT_PKGS_SNAPSHOT_INTERVAL` - Store full dependency state every N commits with changes (default: 50). Lower values speed up point-in-time queries but increase database size.
+- `GIT_PKGS_THREADS` - Number of threads for parallel git diff prefetching (default: 4). Set to 1 to disable parallelism. On large repositories (1500+ commits), parallel prefetching provides roughly 2x speedup on git operations.
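
Reading the three tuning variables with their documented defaults could be done along these lines. The helper names `int_setting` and `load_settings` are hypothetical, and rejecting non-positive values is an assumption of this sketch rather than documented behavior; only the variable names and defaults come from the list above.

```ruby
# Hypothetical settings loader (illustrative; not the git-pkgs code).
def int_setting(env, name, default)
  value = Integer(env.fetch(name, default.to_s), 10)
  # Assumption: zero or negative values are treated as configuration errors.
  raise ArgumentError, "#{name} must be positive" unless value.positive?

  value
end

def load_settings(env = ENV)
  {
    batch_size: int_setting(env, "GIT_PKGS_BATCH_SIZE", 500),
    snapshot_interval: int_setting(env, "GIT_PKGS_SNAPSHOT_INTERVAL", 50),
    threads: int_setting(env, "GIT_PKGS_THREADS", 4)
  }
end

defaults = load_settings({})
serial = load_settings({ "GIT_PKGS_THREADS" => "1" })
```

Accepting any hash-like object instead of reading `ENV` directly keeps the helper easy to exercise in tests.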