Skip to content

Maintenance II: prefetch, loose-objects, incremental-repack tasks #696

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions Documentation/config/core.txt
Original file line number Diff line number Diff line change
Expand Up @@ -606,8 +606,8 @@ core.useReplaceRefs::

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jonathan Tan wrote (reply to this):

> diff --git a/repository.h b/repository.h
> index 3c1f7d54bd..3901ce0b65 100644
> --- a/repository.h
> +++ b/repository.h
> @@ -37,6 +37,8 @@ struct repo_settings {
>  
>  	int pack_use_sparse;
>  	enum fetch_negotiation_setting fetch_negotiation_algorithm;
> +
> +	int core_multi_pack_index;

The only thing I noticed here is why this isn't a bit ("unsigned
core_multi_pack_index : 1"), but I guess it is consistent with "int
pack_use_sparse".

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/22/2020 7:16 PM, Jonathan Tan wrote:
>> diff --git a/repository.h b/repository.h
>> index 3c1f7d54bd..3901ce0b65 100644
>> --- a/repository.h
>> +++ b/repository.h
>> @@ -37,6 +37,8 @@ struct repo_settings {
>>  
>>  	int pack_use_sparse;
>>  	enum fetch_negotiation_setting fetch_negotiation_algorithm;
>> +
>> +	int core_multi_pack_index;
> 
> The only thing I noticed here is why this isn't a bit ("unsigned
> core_multi_pack_index : 1"), but I guess it is consistent with "int
> pack_use_sparse".

The _real_ reason is that we initialize these to -1 in
prepare_repo_settings() so UPDATE_DEFAULT_BOOL() knows that
the setting was not populated by a config value.

Thanks,
-Stolee

core.multiPackIndex::
Use the multi-pack-index file to track multiple packfiles using a
single index. See link:technical/multi-pack-index.html[the
multi-pack-index design document].
single index. See linkgit:git-multi-pack-index[1] for more
information. Defaults to true.

core.sparseCheckout::
Enable "sparse checkout" feature. See linkgit:git-sparse-checkout[1]
Expand Down
18 changes: 18 additions & 0 deletions Documentation/config/maintenance.txt
Original file line number Diff line number Diff line change
Expand Up @@ -14,3 +14,21 @@ maintenance.commit-graph.auto::
reachable commits that are not in the commit-graph file is at least
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jonathan Tan wrote (reply to this):

> +test_expect_success 'maintenance.loose-objects.auto' '
> +	git repack -adk &&
> +	GIT_TRACE2_EVENT="$(pwd)/trace-lo1.txt" \
> +		git -c maintenance.loose-objects.auto=1 maintenance \
> +		run --auto --task=loose-objects 2>/dev/null &&
> +	test_subcommand ! git prune-packed --quiet <trace-lo1.txt &&
> +	for i in 1 2

Any reason why this is run twice?

> +	do
> +		printf data-A-$i | git hash-object -t blob --stdin -w &&
> +		GIT_TRACE2_EVENT="$(pwd)/trace-loA-$i" \
> +			git -c maintenance.loose-objects.auto=2 \
> +			maintenance run --auto --task=loose-objects 2>/dev/null &&
> +		test_subcommand ! git prune-packed --quiet <trace-loA-$i &&

OK - there is only 1 loose object so the loose-objects task doesn't
get run.

> +		printf data-B-$i | git hash-object -t blob --stdin -w &&
> +		GIT_TRACE2_EVENT="$(pwd)/trace-loB-$i" \
> +			git -c maintenance.loose-objects.auto=2 \
> +			maintenance run --auto --task=loose-objects 2>/dev/null &&
> +		test_subcommand git prune-packed --quiet <trace-loB-$i &&

OK - there are 2 loose objects so the loose-objects task gets run. But
we need to remember that the first time it is run, only the packfile
gets created - the loose objects aren't deleted. "prune-packed" here is
to show that the loose-objects task is run, but it has no effect.

> +		GIT_TRACE2_EVENT="$(pwd)/trace-loC-$i" \
> +			git -c maintenance.loose-objects.auto=2 \
> +			maintenance run --auto --task=loose-objects 2>/dev/null &&
> +		test_subcommand git prune-packed --quiet <trace-loC-$i || return 1

OK - the 2 loose objects still exist, so the loose-objects task gets
run. "prune-packed" here shows that the loose-objects task is run.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/22/2020 7:15 PM, Jonathan Tan wrote:
>> +test_expect_success 'maintenance.loose-objects.auto' '
>> +	git repack -adk &&
>> +	GIT_TRACE2_EVENT="$(pwd)/trace-lo1.txt" \
>> +		git -c maintenance.loose-objects.auto=1 maintenance \
>> +		run --auto --task=loose-objects 2>/dev/null &&
>> +	test_subcommand ! git prune-packed --quiet <trace-lo1.txt &&
>> +	for i in 1 2
> 
> Any reason why this is run twice?

I think the original reason was to demonstrate how two runs interact,
but then that was done in the middle of the loop body so the loop is
not necessary.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jonathan Tan wrote (reply to this):

> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
> index 5c08afc19a..6f878b0141 100755
> --- a/t/t7900-maintenance.sh
> +++ b/t/t7900-maintenance.sh
> @@ -220,4 +220,35 @@ test_expect_success EXPENSIVE 'incremental-repack 2g limit' '
>  		 --no-progress --batch-size=2147483647 <run-2g.txt
>  '
>  
> +test_expect_success 'maintenance.incremental-repack.auto' '
> +	git repack -adk &&
> +	git config core.multiPackIndex true &&
> +	git multi-pack-index write &&

After this, there is exactly one pack, and it is in the MIDX.

> +	GIT_TRACE2_EVENT="$(pwd)/midx-init.txt" git \
> +		-c maintenance.incremental-repack.auto=1 \
> +		maintenance run --auto --task=incremental-repack 2>/dev/null &&
> +	test_subcommand ! git multi-pack-index write --no-progress <midx-init.txt &&
> +	for i in 1 2

Same comment as in the earlier patch - why 2 iterations?

> +	do
> +		test_commit A-$i &&
> +		git pack-objects --revs .git/objects/pack/pack <<-\EOF &&
> +		HEAD
> +		^HEAD~1
> +		EOF
> +		GIT_TRACE2_EVENT=$(pwd)/trace-A-$i git \
> +			-c maintenance.incremental-repack.auto=2 \
> +			maintenance run --auto --task=incremental-repack 2>/dev/null &&
> +		test_subcommand ! git multi-pack-index write --no-progress <trace-A-$i &&

OK - one pack not in the MIDX, so this does not get run.

> +		test_commit B-$i &&
> +		git pack-objects --revs .git/objects/pack/pack <<-\EOF &&
> +		HEAD
> +		^HEAD~1
> +		EOF
> +		GIT_TRACE2_EVENT=$(pwd)/trace-B-$i git \
> +			-c maintenance.incremental-repack.auto=2 \
> +			maintenance run --auto --task=incremental-repack 2>/dev/null &&
> +		test_subcommand git multi-pack-index write --no-progress <trace-B-$i || return 1

OK - 2 packs not in the MIDX, so this gets run.

As I said in my review of patch 1, apart from my minor comments in this
and the preceding patches, these patches look good to me.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
> index b3fc7c8670..27565c55a2 100755
> --- a/t/t7900-maintenance.sh
> +++ b/t/t7900-maintenance.sh
> @@ -127,4 +127,26 @@ test_expect_success 'loose-objects task' '
>  	test_cmp packs-between packs-after
>  '
>  
> +test_expect_success 'maintenance.loose-objects.auto' '
> +	git repack -adk &&
> +	GIT_TRACE2_EVENT="$(pwd)/trace-lo1.txt" \
> +		git -c maintenance.loose-objects.auto=1 maintenance \
> +		run --auto --task=loose-objects 2>/dev/null &&
> +	test_subcommand ! git prune-packed --quiet <trace-lo1.txt &&
> +	printf data-A | git hash-object -t blob --stdin -w &&
> +	GIT_TRACE2_EVENT="$(pwd)/trace-loA" \
> +		git -c maintenance.loose-objects.auto=2 \
> +		maintenance run --auto --task=loose-objects 2>/dev/null &&
> +	test_subcommand ! git prune-packed --quiet <trace-loA &&
> +	printf data-B | git hash-object -t blob --stdin -w &&

Is it essential for the purpose of the test somehow that the data
used for the test are all incomplete files that lack the end-of-line
at the end of the file?  Use of 'printf' sends such a signal to
confuse readers.

Use of test_write_lines to write a single line may feel overkill,
but it may be less cryptic, as newer parts of testsuite are
encouraged to use it over 'echo' and raw 'printf'.

> +	GIT_TRACE2_EVENT="$(pwd)/trace-loB" \
> +		git -c maintenance.loose-objects.auto=2 \
> +		maintenance run --auto --task=loose-objects 2>/dev/null &&
> +	test_subcommand git prune-packed --quiet <trace-loB &&
> +	GIT_TRACE2_EVENT="$(pwd)/trace-loC" \
> +		git -c maintenance.loose-objects.auto=2 \
> +		maintenance run --auto --task=loose-objects 2>/dev/null &&
> +	test_subcommand git prune-packed --quiet <trace-loC
> +'

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/25/2020 2:00 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> +	printf data-B | git hash-object -t blob --stdin -w &&
> 
> Is it essential for the purpose of the test somehow that the data
> used for the test are all incomplete files that lack the end-of-line
> at the end of the file?  Use of 'printf' sends such a signal to
> confuse readers.
> 
> Use of test_write_lines to write a single line may feel overkill,
> but it may be less cryptic, as newer parts of testsuite are
> encouraged to use it over 'echo' and raw 'printf'.

I suppose it could be more standard. It's not particularly
important what the data is here, so lacking a newline seems
innocuous enough to me. I'm happy to agree with standards
elsewhere to avoid being a bad example.

A simple s/printf/test_write_lines/ appears to do the trick:

diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index f259485990..a1bd0029c6 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -147,12 +147,12 @@ test_expect_success 'maintenance.loose-objects.auto' '
                git -c maintenance.loose-objects.auto=1 maintenance \
                run --auto --task=loose-objects 2>/dev/null &&
        test_subcommand ! git prune-packed --quiet <trace-lo1.txt &&
-       printf data-A | git hash-object -t blob --stdin -w &&
+       test_write_lines data-A | git hash-object -t blob --stdin -w &&
        GIT_TRACE2_EVENT="$(pwd)/trace-loA" \
                git -c maintenance.loose-objects.auto=2 \
                maintenance run --auto --task=loose-objects 2>/dev/null &&
        test_subcommand ! git prune-packed --quiet <trace-loA &&
-       printf data-B | git hash-object -t blob --stdin -w &&
+       test_write_lines data-B | git hash-object -t blob --stdin -w &&
        GIT_TRACE2_EVENT="$(pwd)/trace-loB" \
                git -c maintenance.loose-objects.auto=2 \
                maintenance run --auto --task=loose-objects 2>/dev/null &&

Thanks,
-Stolee

the value of `maintenance.commit-graph.auto`. The default value is
100.

maintenance.loose-objects.auto::
This integer config option controls how often the `loose-objects` task
should be run as part of `git maintenance run --auto`. If zero, then
the `loose-objects` task will not run with the `--auto` option. A
negative value will force the task to run every time. Otherwise, a
positive value implies the command should run when the number of
loose objects is at least the value of `maintenance.loose-objects.auto`.
The default value is 100.

maintenance.incremental-repack.auto::
This integer config option controls how often the `incremental-repack`
task should be run as part of `git maintenance run --auto`. If zero,
then the `incremental-repack` task will not run with the `--auto`
option. A negative value will force the task to run every time.
Otherwise, a positive value implies the command should run when the
number of pack-files not in the multi-pack-index is at least the value
of `maintenance.incremental-repack.auto`. The default value is 10.
48 changes: 48 additions & 0 deletions Documentation/git-maintenance.txt
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,21 @@ commit-graph::
`commit-graph-chain` file. They will be deleted by a later run based
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jonathan Tan wrote (reply to this):

> From: Derrick Stolee <dstolee@microsoft.com>
> 
> When working with very large repositories, an incremental 'git fetch'
> command can download a large amount of data. If there are many other
> users pushing to a common repo, then this data can rival the initial
> pack-file size of a 'git clone' of a medium-size repo.
> 
> Users may want to keep the data on their local repos as close as
> possible to the data on the remote repos by fetching periodically in
> the background. This can break up a large daily fetch into several
> smaller hourly fetches.
> 
> The task is called "prefetch" because it is work done in advance
> of a foreground fetch to make that 'git fetch' command much faster.
> 
> However, if we simply ran 'git fetch <remote>' in the background,
> then the user running a foregroudn 'git fetch <remote>' would lose

-> foreground

I have some more minor comments that I will send as individual replies,
but overall, the patch set looks good to me.

> +static int append_remote(struct remote *remote, void *cbdata)
> +{
> +	struct string_list *remotes = (struct string_list *)cbdata;
> +
> +	string_list_append(remotes, remote->name);
> +	return 0;
> +}
> +
> +static int maintenance_task_prefetch(struct maintenance_run_opts *opts)
> +{
> +	int result = 0;
> +	struct string_list_item *item;
> +	struct string_list remotes = STRING_LIST_INIT_DUP;
> +
> +	if (for_each_remote(append_remote, &remotes)) {
> +		error(_("failed to fill remotes"));
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	for_each_string_list_item(item, &remotes)
> +		result |= fetch_remote(item->string, opts);
> +
> +cleanup:
> +	string_list_clear(&remotes, 0);
> +	return result;
> +}

I was wondering why the generation of the list and the iteration was
split up, but I see that you want to attempt to fetch each remote even
if one of them fails.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jonathan Tan wrote (reply to this):

> Create a 'loose-objects' task for the 'git maintenance run' command.
> This helps clean up loose objects without disrupting concurrent Git
> commands using the following sequence of events:
> 
> 1. Run 'git prune-packed' to delete any loose objects that exist
>    in a pack-file. Concurrent commands will prefer the packed
>    version of the object to the loose version. (Of course, there
>    are exceptions for commands that specifically care about the
>    location of an object. These are rare for a user to run on
>    purpose, and we hope a user that has selected background
>    maintenance will not be trying to do foreground maintenance.)
> 
> 2. Run 'git pack-objects' on a batch of loose objects. These
>    objects are grouped by scanning the loose object directories in
>    lexicographic order until listing all loose objects -or-
>    reaching 50,000 objects. This is more than enough if the loose
>    objects are created only by a user doing normal development.
>    We noticed users with _millions_ of loose objects because VFS
>    for Git downloads blobs on-demand when a file read operation
>    requires populating a virtual file. 

[snip]

> This has potential of
>    happening in partial clones if someone runs 'git grep' or
>    otherwise evades the batch-download feature for requesting
>    promisor objects.

This part is not strictly true, as even when Git lazy-fetches one
object, it fetches it in the form of a packfile - so maybe remove this
sentence.

This is nevertheless a good feature to have - loose objects may not be
created during lazy fetches, but they definitely are created during
normal operation (e.g. commits). Git, as a whole, prefers packfiles over
loose objects, and just packing the loose objects themselves instead of
running repack (which goes through all reachable objects) is definitely
better for large repositories.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/22/2020 7:09 PM, Jonathan Tan wrote:
>> Create a 'loose-objects' task for the 'git maintenance run' command.
>> This helps clean up loose objects without disrupting concurrent Git
>> commands using the following sequence of events:
>>
>> 1. Run 'git prune-packed' to delete any loose objects that exist
>>    in a pack-file. Concurrent commands will prefer the packed
>>    version of the object to the loose version. (Of course, there
>>    are exceptions for commands that specifically care about the
>>    location of an object. These are rare for a user to run on
>>    purpose, and we hope a user that has selected background
>>    maintenance will not be trying to do foreground maintenance.)
>>
>> 2. Run 'git pack-objects' on a batch of loose objects. These
>>    objects are grouped by scanning the loose object directories in
>>    lexicographic order until listing all loose objects -or-
>>    reaching 50,000 objects. This is more than enough if the loose
>>    objects are created only by a user doing normal development.
>>    We noticed users with _millions_ of loose objects because VFS
>>    for Git downloads blobs on-demand when a file read operation
>>    requires populating a virtual file. 
> 
> [snip]
> 
>> This has potential of
>>    happening in partial clones if someone runs 'git grep' or
>>    otherwise evades the batch-download feature for requesting
>>    promisor objects.
> 
> This part is not strictly true, as even when Git lazy-fetches one
> object, it fetches it in the form of a packfile - so maybe remove this
> sentence.

This is a good point. I just did some testing and we do store these
single-object downloads as pack-files. My misunderstanding is due to
my own bias and experience with the GVFS protocol.

I have also heard that "git fetch" might explode some small pack-files
into loose objects, and I guess I expected the same here. However, that
is not the case for partial clone. I'll remove this.

> This is nevertheless a good feature to have - loose objects may not be
> created during lazy fetches, but they definitely are created during
> normal operation (e.g. commits). Git, as a whole, prefers packfiles over
> loose objects, and just packing the loose objects themselves instead of
> running repack (which goes through all reachable objects) is definitely
> better for large repositories.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jonathan Tan wrote (reply to this):

> +incremental-repack::
> +	The `incremental-repack` job repacks the object directory
> +	using the `multi-pack-index` feature. In order to prevent race
> +	conditions with concurrent Git commands, it follows a two-step
> +	process.

[snip]

> First, it deletes any pack-files included in the
> +	`multi-pack-index` where none of the objects in the
> +	`multi-pack-index` reference those pack-files; this only happens
> +	if all objects in the pack-file are also stored in a newer
> +	pack-file. Second, it selects a group of pack-files whose "expected
> +	size" is below the batch size until the group has total expected
> +	size at least the batch size; see the `--batch-size` option for
> +	the `repack` subcommand in linkgit:git-multi-pack-index[1]. The
> +	default batch-size is zero, which is a special case that attempts
> +	to repack all pack-files into a single pack-file.

This lacks the detail of what happens to the selected group of packfiles
(in the second step) - in particular, that a new packfile is created and
the MIDX is rewritten so that all references to the selected group are
updated to refer to the new packfile instead, thus making it possible to
delete the selected group of packfiles in a subsequent first step. All
this is explained in the documentation of git-multi-pack-index (expire
and repack), though, so it might be better to refer to that. E.g.

  First, it calls `multi-pack-index expire` to delete packfiles
  unreferenced by the MIDX file. Second, it calls `multi-pack-index
  repack` to select several small packfiles and repack them into a
  bigger one, and then update the MIDX entries that refer to the small
  packfiles to refer to the big one instead, thus preparing it for
  deletion upon a subsequent `multi-pack-index expire` invocation. The
  selection of the small packfiles is such that the expected size of the
  big packfile is at least the batch size; see the ...

> diff --git a/midx.c b/midx.c
> index aa37d5da86..66d7053d83 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -37,7 +37,7 @@
>  
>  #define PACK_EXPIRED UINT_MAX
>  
> -static char *get_midx_filename(const char *object_dir)
> +char *get_midx_filename(const char *object_dir)
>  {
>  	return xstrfmt("%s/pack/multi-pack-index", object_dir);
>  }
> diff --git a/midx.h b/midx.h
> index b18cf53bc4..baeecc70c9 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -37,6 +37,7 @@ struct multi_pack_index {
>  
>  #define MIDX_PROGRESS     (1 << 0)
>  
> +char *get_midx_filename(const char *object_dir);
>  struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
>  int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
>  int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);

Do we need get_midx_filename() to be global?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/22/2020 7:26 PM, Jonathan Tan wrote:
>> +incremental-repack::
>> +	The `incremental-repack` job repacks the object directory
>> +	using the `multi-pack-index` feature. In order to prevent race
>> +	conditions with concurrent Git commands, it follows a two-step
>> +	process.
> 
> [snip]
> 
>> First, it deletes any pack-files included in the
>> +	`multi-pack-index` where none of the objects in the
>> +	`multi-pack-index` reference those pack-files; this only happens
>> +	if all objects in the pack-file are also stored in a newer
>> +	pack-file. Second, it selects a group of pack-files whose "expected
>> +	size" is below the batch size until the group has total expected
>> +	size at least the batch size; see the `--batch-size` option for
>> +	the `repack` subcommand in linkgit:git-multi-pack-index[1]. The
>> +	default batch-size is zero, which is a special case that attempts
>> +	to repack all pack-files into a single pack-file.
> 
> This lacks the detail of what happens to the selected group of packfiles
> (in the second step) - in particular, that a new packfile is created and
> the MIDX is rewritten so that all references to the selected group are
> updated to refer to the new packfile instead, thus making it possible to
> delete the selected group of packfiles in a subsequent first step. All
> this is explained in the documentation of git-multi-pack-index (expire
> and repack), though, so it might be better to refer to that. E.g.

Here is my attempt to incorporate your recommendations into this doc:

incremental-repack::
	The `incremental-repack` job repacks the object directory
	using the `multi-pack-index` feature. In order to prevent race
	conditions with concurrent Git commands, it follows a two-step
	process. First, it calls `git multi-pack-index expire` to delete
	pack-files unreferenced by the `multi-pack-index` file. Second, it
	calls `git multi-pack-index repack` to select several small
	pack-files and repack them into a bigger one, and then update the
	`multi-pack-index` entries that refer to the small pack-files to
	refer to the new pack-file. This prepares those small pack-files
	for deletion upon the next run of `git multi-pack-index expire`.
	The selection of the small pack-files is such that the expected
	size of the big pack-file is at least the batch size; see the
	`--batch-size` option for the `repack` subcommand in
	linkgit:git-multi-pack-index[1]. The default batch-size is zero,
	which is a special case that attempts to repack all pack-files
	into a single pack-file.

> Do we need get_midx_filename() to be global?

No, that does not appear to be important (to this patch). It _was_
necessary when doing the "verify, then delete if problematic" mode.
Thanks for catching it now that it is not necessary.

Thanks,
-Stolee
 

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jonathan Tan wrote (reply to this):

> Here is my attempt to incorporate your recommendations into this doc:
> 
> incremental-repack::
> 	The `incremental-repack` job repacks the object directory
> 	using the `multi-pack-index` feature. In order to prevent race
> 	conditions with concurrent Git commands, it follows a two-step
> 	process. First, it calls `git multi-pack-index expire` to delete
> 	pack-files unreferenced by the `multi-pack-index` file. Second, it
> 	calls `git multi-pack-index repack` to select several small
> 	pack-files and repack them into a bigger one, and then update the
> 	`multi-pack-index` entries that refer to the small pack-files to
> 	refer to the new pack-file. This prepares those small pack-files
> 	for deletion upon the next run of `git multi-pack-index expire`.
> 	The selection of the small pack-files is such that the expected
> 	size of the big pack-file is at least the batch size; see the
> 	`--batch-size` option for the `repack` subcommand in
> 	linkgit:git-multi-pack-index[1]. The default batch-size is zero,
> 	which is a special case that attempts to repack all pack-files
> 	into a single pack-file.

Thanks, this looks good.

on the expiration delay.

prefetch::
The `prefetch` task updates the object directory with the latest
objects from all registered remotes. For each remote, a `git fetch`
command is run. The refmap is custom to avoid updating local or remote
branches (those in `refs/heads` or `refs/remotes`). Instead, the
remote refs are stored in `refs/prefetch/<remote>/`. Also, tags are
not updated.
+
This is done to avoid disrupting the remote-tracking branches. The end users
expect these refs to stay unmoved unless they initiate a fetch. With prefetch
task, however, the objects necessary to complete a later real fetch would
already be obtained, so the real fetch would go faster. In the ideal case,
it will just become an update to bunch of remote-tracking branches without
any object transfer.

gc::
Clean up unnecessary files and optimize the local repository. "GC"
stands for "garbage collection," but this task performs many
Expand All @@ -55,6 +70,39 @@ gc::
be disruptive in some situations, as it deletes stale data. See
linkgit:git-gc[1] for more details on garbage collection in Git.

loose-objects::
The `loose-objects` job cleans up loose objects and places them into
pack-files. In order to prevent race conditions with concurrent Git
commands, it follows a two-step process. First, it deletes any loose
objects that already exist in a pack-file; concurrent Git processes
will examine the pack-file for the object data instead of the loose
object. Second, it creates a new pack-file (starting with "loose-")
containing a batch of loose objects. The batch size is limited to 50
thousand objects to prevent the job from taking too long on a
repository with many loose objects. The `gc` task writes unreachable
objects as loose objects to be cleaned up by a later step only if
they are not re-added to a pack-file; for this reason it is not
advisable to enable both the `loose-objects` and `gc` tasks at the
same time.

incremental-repack::
The `incremental-repack` job repacks the object directory
using the `multi-pack-index` feature. In order to prevent race
conditions with concurrent Git commands, it follows a two-step
process. First, it calls `git multi-pack-index expire` to delete
pack-files unreferenced by the `multi-pack-index` file. Second, it
calls `git multi-pack-index repack` to select several small
pack-files and repack them into a bigger one, and then update the
`multi-pack-index` entries that refer to the small pack-files to
refer to the new pack-file. This prepares those small pack-files
for deletion upon the next run of `git multi-pack-index expire`.
The selection of the small pack-files is such that the expected
size of the big pack-file is at least the batch size; see the
`--batch-size` option for the `repack` subcommand in
linkgit:git-multi-pack-index[1]. The default batch-size is zero,
which is a special case that attempts to repack all pack-files
into a single pack-file.

OPTIONS
-------
--auto::
Expand Down
Loading