Scaling registry updates #2452

Closed
@SimonSapin

TL;DR: This is a problem we don’t have yet. I mostly want to record some information in case we do in the long term.


This comment: CocoaPods/CocoaPods#4989 (comment) explains how the CocoaPods/Specs repository gets so much traffic that GitHub rate-limits it severely, causing fetches to take a very long time or fail.

We understand that part of the CocoaPods workflow is that its end users (i.e., not just the people contributing to CocoaPods/Specs) fetch regularly from GitHub.

This sounds exactly like rust-lang/crates.io-index.

Rate-limiting from GitHub has not been a problem for us as far as I know, but there may be some precautions we can take to avoid it.

Apparently, most of the initial clones are shallow, meaning that not the whole history is fetched, but just the top commit. But then subsequent fetches don't use the --depth=1 option. Ironically, this practice can be much more expensive than full fetches/clones, especially over the long term. It is usually preferable to pay the price of a full clone once, then incrementally fetch into the repository, because then Git is better able to negotiate the minimum set of changes that have to be transferred to bring the clone up to date.

I think we’re OK here, since Cargo uses libgit2, which does not support shallow clones anyway.
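To make the "full clone once, then fetch incrementally" advice concrete, here is a minimal sketch that only builds the arguments one would pass to the `git` command line. It is an illustration, not what Cargo does (Cargo goes through libgit2), and the function name `sync_args` is hypothetical.

```rust
/// Build the arguments for the `git` CLI that bring a local copy of a
/// repository up to date. Hypothetical helper, for illustration only.
///
/// - First run: a full (non-shallow) clone, paid for once.
/// - Later runs: a plain incremental fetch into the existing clone, so Git
///   can negotiate the minimal set of objects to transfer.
fn sync_args(already_cloned: bool, url: &str, dir: &str) -> Vec<String> {
    let args: Vec<&str> = if already_cloned {
        // `git -C <dir> fetch origin` runs the fetch inside the existing clone.
        vec!["-C", dir, "fetch", "origin"]
    } else {
        // Deliberately no `--depth=1`: the quoted advice is to avoid shallow clones.
        vec!["clone", url, dir]
    };
    args.into_iter().map(String::from).collect()
}
```

The resulting vector could be handed to `std::process::Command::new("git").args(...)`; the point is only that neither branch asks for a shallow history.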

Finally, the layout of the repo itself doesn't help. Specifically, the Specs directory, which contains 16k+ subdirectories, causes some Git operations to be unexpectedly expensive, further driving up CPU usage.

Here as well we’re doing pretty well, since rust-lang/crates.io-index already has two levels of directory nesting, each (roughly) named after two characters from the start of the crate’s name. Counting only letters, that allows 26^4 = 456,976 leaf directories; for comparison, npm has 249,825 packages right now.
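For reference, a small sketch of how a crate name maps to a path under this layout. It follows my understanding of the index conventions (lowercased names, with special `1/`, `2/` and `3/` directories for very short names, which is where the "roughly" comes from); the function names are hypothetical and this is not Cargo's actual code.

```rust
/// Relative path of a crate's file inside the index, or `None` for names
/// this sketch does not handle (empty or non-ASCII).
/// Hypothetical helper; assumes the documented crates.io-index layout.
fn index_path(name: &str) -> Option<String> {
    if name.is_empty() || !name.is_ascii() {
        return None;
    }
    // Paths in the index use the lowercased name.
    let lower = name.to_ascii_lowercase();
    let path = match lower.len() {
        1 => format!("1/{}", lower),
        2 => format!("2/{}", lower),
        // Three-character names are bucketed by their first character.
        3 => format!("3/{}/{}", &lower[..1], lower),
        // Everything else: first two characters, then the next two.
        _ => format!("{}/{}/{}", &lower[..2], &lower[2..4], lower),
    };
    Some(path)
}

/// Number of distinct two-level buckets if only the 26 letters are counted.
fn letter_only_buckets() -> u32 {
    26u32.pow(4)
}
```

With this scheme `serde` lands in `se/rd/serde`, so no single directory has to hold every crate the way the CocoaPods `Specs` directory holds every pod.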

Another comment CocoaPods/CocoaPods#4989 (comment) suggests:

this new, preview API should help: https://developer.github.com/changes/2016-02-24-commit-reference-sha-api/. It's helped Homebrew dramatically reduce the number of no-op git fetches which also will make things better for your users as a no-op API HTTP call is significantly faster for you (and less expensive for GitHub) than a no-op git fetch.

This sounds beneficial even if we don’t hit rate-limiting. I’ve filed #2451 separately.
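A rough sketch of the check this would enable, as I understand the linked announcement: ask the API for the SHA of the remote branch tip, send the locally known SHA in `If-None-Match`, and skip the git fetch when nothing changed. No HTTP request is made here; the code only builds the request pieces and interprets a response. All function names are hypothetical, and the `Accept` media type is an assumption to be checked against GitHub's documentation (the preview announcement used a different, preview-specific value).

```rust
/// What to do after querying the API for the remote branch tip.
#[derive(Debug, PartialEq)]
enum Action {
    SkipFetch,
    Fetch,
}

/// ASSUMPTION: media type that makes the endpoint return only the commit SHA.
const SHA_MEDIA_TYPE: &str = "application/vnd.github.v3.sha";

/// URL of the commit endpoint for a given ref (for example a branch name).
fn commit_sha_url(owner: &str, repo: &str, git_ref: &str) -> String {
    format!("https://api.github.com/repos/{}/{}/commits/{}", owner, repo, git_ref)
}

/// Request headers: ask for the bare SHA and pass the local SHA as an ETag,
/// so an unchanged ref can be answered with `304 Not Modified`.
fn request_headers(local_sha: &str) -> Vec<(String, String)> {
    vec![
        ("Accept".to_string(), SHA_MEDIA_TYPE.to_string()),
        ("If-None-Match".to_string(), format!("\"{}\"", local_sha)),
    ]
}

/// Decide whether a git fetch is needed from the HTTP status and body.
fn decide(status: u16, body: &str, local_sha: &str) -> Action {
    match status {
        // The server confirmed that our SHA is still the branch tip.
        304 => Action::SkipFetch,
        // The server returned a SHA; compare it with ours.
        200 if body.trim() == local_sha => Action::SkipFetch,
        // Anything else (new SHA, errors, rate limiting): fall back to fetching.
        _ => Action::Fetch,
    }
}
```

Falling back to a normal fetch on any unexpected status keeps the API call a pure optimisation: if it fails, behaviour is the same as today.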
