Improve Cargo's load time of crates.io index information #6866

Closed
@alexcrichton

Description

Cargo is, in theory, a fast CLI tool that adds negligible overhead to commands like cargo build. Ideally, if you run cargo build and everything is fresh, the command should be effectively instantaneous, taking only a small handful of milliseconds.

Cargo does, however, actually do work, and it's sometimes nontrivial. Currently, when profiling null builds in Cargo, one of the most expensive operations (40% of runtime) is loading crates.io index information. As a brief recap, the crates.io index uses a file-per-crate model in a git repository. Each file is a series of lines, where each line is a JSON blob describing one published version. Currently, if your crate uses 4 crates from crates.io, Cargo will read 4 files and parse every JSON blob on every line of each.
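
For concreteness, here's a minimal sketch of what parsing one of these files might look like, assuming serde and serde_json; the `IndexEntry` struct below covers only a subset of the documented index schema and is not Cargo's actual type:

```rust
use serde::Deserialize;

// Hypothetical struct mirroring a subset of the documented index schema;
// each line of an index file is one JSON object shaped like this.
#[derive(Deserialize)]
struct IndexEntry {
    name: String,
    vers: String,  // semver version string, e.g. "1.0.2"
    cksum: String, // sha256 of the .crate file
    yanked: bool,
}

fn parse_index_file(contents: &str) -> Vec<IndexEntry> {
    // One JSON blob per line, one line per published version.
    contents
        .lines()
        .filter_map(|line| serde_json::from_str(line).ok())
        .collect()
}
```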

For a null build, however, this is a ton more work than we would otherwise need to do. We only actually need to parse the exact JSON blobs already pinned in the lock file, but that's only discovered later. There are two main things slowing us down right now:

  • We don't actually read each file from disk, but rather through libgit2 (see the sketch after this list). We don't check out the index, which conserves disk space and also makes the initial clone much faster. Reading from the index through libgit2, however, involves decompression and seeking that are somewhat expensive.

  • We parse a ton of JSON that we immediately throw away. Typically only one JSON blob per file is relevant.
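
As a rough illustration of the first point, reading one index file out of the bare clone via the git2 crate looks roughly like this; the function name and paths are illustrative, not Cargo's real code. Every step past opening the repository touches libgit2's object store, with the decompression and seeking costs described above:

```rust
use git2::Repository;
use std::path::Path;

// Illustrative sketch: fetch one crate's index file straight out of the
// bare registry clone, e.g. rel_path = "se/rd/serde".
fn read_index_file(repo_path: &Path, rel_path: &Path) -> Result<Vec<u8>, git2::Error> {
    // Open the bare clone (no working tree is checked out).
    let repo = Repository::open(repo_path)?;
    // Resolve the tree HEAD points at (the ref used here is illustrative).
    let tree = repo.head()?.peel_to_tree()?;
    // Both the path lookup and the blob read below involve zlib
    // decompression and object-store seeking inside libgit2.
    let entry = tree.get_path(rel_path)?;
    let blob = repo.find_blob(entry.id())?;
    Ok(blob.content().to_vec())
}
```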

I think we can address both of these problems to improve Cargo's null build performance. While we could probably make each step faster in isolation, I think there's more benefit in simply reducing the number of operations we do in the first place.

Concretely, what I think we can do is keep a per-crate-file cache on disk. When Cargo loads a crate, it could read it from libgit2 once and then cache the information on disk in a more easily machine-readable format. At the very least it could just write the file out to disk itself, invalidating the cached file whenever the index updates (which never happens during a null build). At the other extreme, we could synthesize a database-like file from the git file contents, which Cargo could index into directly and which would require less JSON parsing.
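
Here's a minimal sketch of the simplest variant, assuming the cache stores a fingerprint (e.g. the index's HEAD commit id) on the first line of each cached file; all names and the on-disk layout here are hypothetical:

```rust
use std::fs;
use std::path::Path;

// Hypothetical on-disk cache: one file per crate, invalidated by comparing
// a fingerprint (e.g. the index's HEAD commit id) stored on its first line.
fn load_cached(cache_dir: &Path, krate: &str, fingerprint: &str) -> Option<Vec<u8>> {
    let raw = fs::read(cache_dir.join(krate)).ok()?;
    // The first line holds the fingerprint the cache was built against.
    let split = raw.iter().position(|&b| b == b'\n')?;
    if &raw[..split] == fingerprint.as_bytes() {
        Some(raw[split + 1..].to_vec())
    } else {
        None // the index has updated since this file was cached
    }
}

fn store_cached(cache_dir: &Path, krate: &str, fingerprint: &str, contents: &[u8]) {
    let mut data = fingerprint.as_bytes().to_vec();
    data.push(b'\n');
    data.extend_from_slice(contents);
    let _ = fs::write(cache_dir.join(krate), data);
}
```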

By caching something on disk we avoid the libgit2 pitfall, where decompressing the registry data is expensive. By using a more database-like, Cargo-friendly format than raw JSON, we could not only parse less JSON (by knowing which lines we don't need to parse) but also, in theory, use a faster parser than a JSON one. I think the amortized cost of managing all this will still be quite good, because it's only paid when the index changes or is updated, never on a null build or once a lock file exists.
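
One possible shape for that database-like format, invented purely for illustration: prefix the cached data with an offset table keyed by version string, so a null build can seek straight to the single JSON line its lock file pins without parsing any of the others:

```rust
// Invented cache layout, for illustration only:
//   header: version-string '\0' byte-offset-into-body '\n'   (repeated)
//   a blank line, then the raw JSON lines copied from the index file.
fn lookup_version<'a>(cache: &'a str, want: &str) -> Option<&'a str> {
    let (header, body) = cache.split_once("\n\n")?;
    for entry in header.lines() {
        let (vers, off) = entry.split_once('\0')?;
        if vers == want {
            let off: usize = off.parse().ok()?;
            // Only the single line the lock file asked for is ever read.
            return body.get(off..)?.lines().next();
        }
    }
    None
}
```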
