Commit a933471

bundle-uri: add example bundle organization
The previous change introduced the bundle URI design document. It
creates a flexible set of options that allow bundle providers many ways
to organize Git object data and speed up clones and fetches. It is
particularly important that we have flexibility so we can apply future
advancements as new ideas for efficiently organizing Git data are
discovered.

However, the design document does not provide even an example of how
bundles could be organized, and that makes it difficult to envision how
the feature should work at the end of the implementation plan.

Add a section that details how a bundle provider could work, including
using the Git server advertisement for multiple geo-distributed servers.
This organization is based on the GVFS Cache Servers which have
successfully used similar ideas to provide fast object access and
reduced server load for very large repositories.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
1 parent e0f003e commit a933471

1 file changed: +105 -0 lines changed

Documentation/technical/bundle-uri.txt

Lines changed: 105 additions & 0 deletions
@@ -349,6 +349,111 @@ error conditions:
 should not use bundle URIs for fetch unless the server has explicitly
 recommended it through a `bundle.heuristic` value.

+Example Bundle Provider organization
+------------------------------------
+
+The bundle URI feature is intentionally designed to be flexible to
+different ways a bundle provider wants to organize the object data.
+However, it can be helpful to have a complete organization model described
+here so providers can start from that base.
+
+This example organization is a simplified model of what is used by the
+GVFS Cache Servers (see section near the end of this document) which have
+been beneficial in speeding up clones and fetches for very large
+repositories, although using extra software outside of Git.
+
+The bundle provider deploys servers across multiple geographies. Each
+server manages its own bundle set. The server can track a number of Git
+repositories, but provides a bundle list for each based on a pattern. For
+example, when mirroring a repository at `https://<domain>/<org>/<repo>`
+the bundle server could have its bundle list available at
+`https://<server-url>/<domain>/<org>/<repo>`. The origin Git server can
+list all of these servers under the "any" mode:
+
+	[bundle]
+		version = 1
+		mode = any
+
+	[bundle "eastus"]
+		uri = https://eastus.example.com/<domain>/<org>/<repo>
+
+	[bundle "europe"]
+		uri = https://europe.example.com/<domain>/<org>/<repo>
+
+	[bundle "apac"]
+		uri = https://apac.example.com/<domain>/<org>/<repo>
+
+This "list of lists" is static and only changes if a bundle server is
+added or removed.
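The URL pattern and the "any"-mode list above are mechanical enough to generate. The following is a minimal sketch, assuming a provider keeps a mapping from region name to bundle server URL; the function names and the `git.example.org` origin are hypothetical and are not part of Git.

```python
from urllib.parse import urlsplit


def bundle_server_uri(server_url: str, origin_url: str) -> str:
    """Apply the https://<server-url>/<domain>/<org>/<repo> pattern."""
    origin = urlsplit(origin_url)
    # netloc is <domain>; path is /<org>/<repo>
    return f"{server_url.rstrip('/')}/{origin.netloc}{origin.path.rstrip('/')}"


def render_server_list(origin_url: str, servers: dict) -> str:
    """Render the static "list of lists" advertised by the origin server."""
    lines = ["[bundle]", "\tversion = 1", "\tmode = any"]
    for name, server_url in servers.items():
        lines.append("")
        lines.append(f'[bundle "{name}"]')
        lines.append(f"\turi = {bundle_server_uri(server_url, origin_url)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical origin and server names, for illustration only.
    print(render_server_list(
        "https://git.example.org/org/repo",
        {"eastus": "https://eastus.example.com",
         "europe": "https://europe.example.com"},
    ))
```

Because the output depends only on the set of servers, it needs regenerating only when a server is added or removed.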
+
+Each bundle server manages its own set of bundles. The initial bundle list
+contains only a single bundle, containing all of the objects received from
+cloning the repository from the origin server. The list uses the
+`creationToken` heuristic and a `creationToken` is made for the bundle
+based on the server's timestamp.
+
+The bundle server runs regularly-scheduled updates for the bundle list,
+such as once a day. During this task, the server fetches the latest
+contents from the origin server and generates a bundle containing the
+objects reachable from the latest origin refs, but not contained in a
+previously-computed bundle. This bundle is added to the list, with care
+that the `creationToken` is strictly greater than the previous maximum
+`creationToken`.
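The "strictly greater" requirement matters because a timestamp alone can repeat or go backwards, for example after a clock adjustment. A minimal sketch of one way to choose the token, assuming tokens are integer Unix timestamps as in the examples in this section; `next_creation_token` is a hypothetical helper, not a Git function.

```python
import time


def next_creation_token(existing_tokens, now=None):
    """Pick a creationToken for a new bundle.

    Uses the server's timestamp, but bumps it past the current maximum
    so the new token is always strictly greater than every earlier one.
    """
    now = int(time.time()) if now is None else int(now)
    if not existing_tokens:
        return now
    return max(now, max(existing_tokens) + 1)
```

With existing tokens `[100, 200]` and a clock reading of `150`, this returns `201` rather than reusing a value that would sort before an existing bundle.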
+
+When the bundle list grows too large, say more than 30 bundles, then the
+oldest "_N_ minus 30" bundles are combined into a single bundle. This
+bundle's `creationToken` is equal to the maximum `creationToken` among the
+merged bundles.
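The bookkeeping for this merge can be sketched as follows. This only plans which list entries to combine and which token the result receives; building the merged bundle file itself is out of scope here. The `Bundle` type, the `plan_merge` name, and the `merged-<token>` id are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bundle:
    bundle_id: str
    creation_token: int


def plan_merge(bundles, keep=30):
    """Return (new_list, merged_inputs) for a list of N bundles.

    The oldest N - keep bundles are replaced by one entry whose token is
    the maximum token among them. Combining fewer than two bundles would
    change nothing, so the list is returned as-is in that case.
    """
    ordered = sorted(bundles, key=lambda b: b.creation_token)
    excess = len(ordered) - keep
    if excess < 2:
        return ordered, []
    to_merge, recent = ordered[:excess], ordered[excess:]
    newest_merged = to_merge[-1].creation_token  # maximum, since sorted
    merged = Bundle(f"merged-{newest_merged}", newest_merged)
    return [merged] + recent, to_merge
```

Reusing the maximum token keeps the merged bundle ordered before every bundle that was not merged.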
+
+An example bundle list is provided here, although it only has two daily
+bundles and not a full list of 30:
+
+	[bundle]
+		version = 1
+		mode = all
+		heuristic = creationToken
+
+	[bundle "2022-02-13-1644770820-daily"]
+		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-13-1644770820-daily.bundle
+		creationToken = 1644770820
+
+	[bundle "2022-02-09-1644442601-daily"]
+		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-09-1644442601-daily.bundle
+		creationToken = 1644442601
+
+	[bundle "2022-02-02-1643842562"]
+		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-02-1643842562.bundle
+		creationToken = 1643842562
+
+To avoid storing and serving object data in perpetuity despite it becoming
+unreachable in the origin server, this bundle merge can be more careful.
+Instead of taking an absolute union of the old bundles, the bundle
+can be created by looking at the newer bundles and ensuring that their
+necessary commits are all available in this merged bundle (or in another
+one of the newer bundles). This allows "expiring" object data that is not
+being used by new commits in this window of time. That data could be
+reintroduced by a later push.
+
+This data organization has two main goals. First, initial
+clones of the repository become faster by downloading precomputed object
+data from a closer source. Second, `git fetch` commands can be faster,
+especially if the client has not fetched for a few days. However, if a
+client does not fetch for 30 days, then the bundle list organization would
+cause the client to redownload a large amount of object data.
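The 30-day effect can be seen with a simplified model of the client. The sketch assumes the client remembers the largest `creationToken` it has already applied and downloads every listed bundle with a larger token; the real client logic may differ, and `bundles_to_download` is a hypothetical name.

```python
def bundles_to_download(bundle_tokens, last_seen_token):
    """Return ids of bundles newer than the client's stored token,
    ordered from oldest to newest.

    bundle_tokens maps a bundle id to its creationToken.
    """
    newer = [bid for bid, tok in bundle_tokens.items() if tok > last_seen_token]
    return sorted(newer, key=lambda bid: bundle_tokens[bid])


# Tokens taken from the example bundle list in this section.
EXAMPLE_LIST = {
    "2022-02-13-1644770820-daily": 1644770820,
    "2022-02-09-1644442601-daily": 1644442601,
    "2022-02-02-1643842562": 1643842562,
}
```

A client whose stored token is `1644442601` needs only the newest daily bundle. A client whose stored token predates `1643842562` falls behind the merged base bundle and, in this model, downloads all three, including object data it may already have.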
+
+One way to make this organization more useful to users who fetch frequently
+is to have more frequent bundle creation. For example, bundles could be
+created every hour, and then once a day those "hourly" bundles could be
+merged into a "daily" bundle. The daily bundles are merged into the
+oldest bundle after 30 days.
+
+It is recommended that this bundle strategy is repeated with the `blob:none`
+filter if clients of this repository are expecting to use blobless partial
+clones. This list of blobless bundles stays in the same list as the full
+bundles, but uses the `bundle.<id>.filter` key to separate the two groups.
+For very large repositories, the bundle provider may want to _only_ provide
+blobless bundles.
+
 Implementation Plan
 -------------------
