backfill: filter by pathspec
The 'git backfill' command already assumes the '--sparse' option when
the repository uses the sparse-checkout feature. If the sparse-checkout
patterns are in cone mode, then the path-walk API will restrict the set
of trees it visits to only those necessary to reach the blobs that are
matched in the sparse-checkout.

In some cases, users will want a more restrictive set of blobs to
download. Augment the 'git backfill' command to parse pathspecs from the
user and filter the blobs that are downloaded to this set.

While this implementation benefits from skipping the most expensive step
of the process (downloading missing blobs) for paths that do not match,
it still requires the path-walk API to track all tree and blob IDs; the
pathspec is matched only in the final filtering step.
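
A minimal sketch of this "walk everything, filter at the end" shape,
under stated assumptions: 'struct path_batch', path_matches_any() and
count_blobs_to_download() are hypothetical names, and POSIX fnmatch()
stands in for Git's match_pathspec(). Calling fnmatch() without
FNM_PATHNAME lets '*' match '/', which approximates how a default
pathspec such as "t/*" matches recursively.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/* Hypothetical stand-in for one path-walk callback: a path and its blobs. */
struct path_batch {
	const char *path;
	size_t nr_blobs;
};

/* Return 1 if 'path' matches at least one of the 'nr' glob patterns. */
static int path_matches_any(const char *path, const char **patterns, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (!fnmatch(patterns[i], path, 0)) /* no FNM_PATHNAME: '*' crosses '/' */
			return 1;
	return 0;
}

/*
 * Every batch is still visited (the walk itself is not pruned), but only
 * blobs at matching paths are counted as candidates for download.
 */
static size_t count_blobs_to_download(const struct path_batch *batches,
				      size_t nr_batches,
				      const char **patterns,
				      size_t nr_patterns)
{
	size_t total = 0;

	assert(batches || !nr_batches);
	for (size_t i = 0; i < nr_batches; i++)
		if (path_matches_any(batches[i].path, patterns, nr_patterns))
			total += batches[i].nr_blobs;
	return total;
}
```

The loop touches every batch regardless of the patterns, which mirrors
the cost described above: the saving comes from the blobs that are never
requested, not from a shorter walk.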

I attempted to filter the pathspec using the existing pattern_list
mechanisms that power the --sparse option, as that would restrict the
path-walk to only the objects that are required to reach the matching
blob paths. However, my initial attempt matched the pathspec against
every path at HEAD, leading to cubic behavior when given a recursive
pathspec such as "t/*" in the Git repository: N paths are compared
against M sparse-checkout patterns across T versions in history.
This could be solved by more carefully constructing the pattern list to
include recursive matches when the pathspec is recognized as working in
that way. The problem is that we need to add patterns that lead the
parent directories to match that recursive pattern. This becomes even
more difficult when we recognize that some pathspecs don't follow a
simple recursive match ("*.c", "t/*/*.sh").
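
One way to see the difficulty, as a sketch only: the hypothetical helper
below (literal_dir_prefix_len() is not part of Git) computes how much of
a pattern is a wildcard-free leading directory. A walk could only be
confined to that directory, and for a pattern like "*.c" the prefix is
empty, so no tree can be skipped on that basis.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return the length of the leading part of
 * 'pattern' that is a literal directory path, including its trailing
 * '/'. Returns 0 when a wildcard appears before the first '/'.
 */
static size_t literal_dir_prefix_len(const char *pattern)
{
	/* Offset of the first glob metacharacter, or the full length. */
	size_t len = strcspn(pattern, "*?[\\");

	/* Back up so the prefix ends just after the last literal '/'. */
	while (len > 0 && pattern[len - 1] != '/')
		len--;
	return len;
}
```

Under these assumptions, "t/*" and "t/*/*.sh" both yield the two-byte
prefix "t/" even though they select different blobs beneath it, and
"*.c" yields 0. The prefix alone therefore says which parent
directories must be reachable, but not which paths below them match.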

For now, this simple implementation is more clearly correct. The walk
could be optimized later, but that work should wait until there is a
demonstrated user need for the performance improvement.

Note that using the --sparse option with a cone mode sparse-checkout is
one way to reduce the size of the object walk and is compatible with
pathspec matches, so selecting a restrictive sparse-checkout could help
with any performance issues.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
derrickstolee committed Dec 18, 2024
1 parent e388b66 commit 918e68c
Showing 2 changed files with 23 additions and 4 deletions.
2 changes: 1 addition & 1 deletion Documentation/git-backfill.txt
@@ -9,7 +9,7 @@ git-backfill - Download missing objects in a partial clone
SYNOPSIS
--------
[verse]
'git backfill' [--batch-size=<n>] [--[no-]sparse]
'git backfill' [--batch-size=<n>] [--[no-]sparse] [[--] <pathspec>]

DESCRIPTION
-----------
25 changes: 22 additions & 3 deletions builtin/backfill.c
@@ -9,10 +9,13 @@
#include "commit.h"
#include "dir.h"
#include "environment.h"
#include "hashmap.h"
#include "hex.h"
#include "list-objects.h"
#include "tree.h"
#include "tree-walk.h"
#include "object.h"
#include "object-name.h"
#include "object-store-ll.h"
#include "oid-array.h"
#include "oidset.h"
@@ -24,9 +27,10 @@
#include "progress.h"
#include "packfile.h"
#include "path-walk.h"
#include "pathspec.h"

static const char * const builtin_backfill_usage[] = {
N_("git backfill [--batch-size=<n>] [--[no-]sparse]"),
N_("git backfill [--batch-size=<n>] [--[no-]sparse] [[--] <pathspec>]"),
NULL
};

@@ -35,6 +39,10 @@ struct backfill_context {
struct oid_array current_batch;
size_t min_batch_size;
int sparse;

int use_pathspec;
struct pathspec ps;
struct string_list matching_paths;
};

static void backfill_context_clear(struct backfill_context *ctx)
@@ -56,7 +64,7 @@ static void download_batch(struct backfill_context *ctx)
reprepare_packed_git(ctx->repo);
}

static int fill_missing_blobs(const char *path UNUSED,
static int fill_missing_blobs(const char *path,
struct oid_array *list,
enum object_type type,
void *data)
@@ -66,6 +74,11 @@ static int fill_missing_blobs(const char *path UNUSED,
if (type != OBJ_BLOB)
return 0;

if (ctx->use_pathspec &&
!match_pathspec(ctx->repo->index, &ctx->ps, path, strlen(path),
0, NULL, 0))
return 0;

for (size_t i = 0; i < list->nr; i++) {
off_t size = 0;
struct object_info info = OBJECT_INFO_INIT;
@@ -144,8 +157,14 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit

repo_config(repo, git_default_config, NULL);

if (ctx.sparse < 0)
if (argc) {
parse_pathspec(&ctx.ps, 0, 0, prefix, argv);
ctx.use_pathspec = 1;
if (ctx.sparse > 0)
warning(_("ignoring --sparse option due to presence of pathspec"));
} else if (ctx.sparse < 0) {
ctx.sparse = core_apply_sparse_checkout;
}

return do_backfill(&ctx);
}
