[performance+memory] Beating git in index-pack (as used for clones and fetches) ✅ 🚀 #5

Closed
@Byron

git index-pack streams a pack and creates an index from it. The difficulty arises from having to decompress every entry in the pack stream, which can consist of many small objects. These are placed in an intermediate index to accelerate the next stage, which is all about resolving the deltas in order to produce a SHA1. Per pack entry, the SHA1, pack offset and CRC32 are written into the index file to complete the operation.
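
To make the per-entry bookkeeping concrete, here is a minimal Python sketch of the first pass. It is not how git or gitoxide implement it: it assumes a simplified "pack" made of back-to-back zlib streams holding plain blobs, with no entry headers and no deltas, and `index_entries` is a hypothetical name. The CRC32 is taken over the compressed bytes of each entry, since this toy format has no entry header to include.

```python
import hashlib
import zlib


def index_entries(pack_bytes):
    """Walk back-to-back zlib streams and collect (sha1, offset, crc32) per entry.

    Simplification (assumption): a real pack prefixes each entry with a header
    and may store deltas; here every entry is a plain zlib-compressed blob.
    """
    entries = []
    offset = 0
    while offset < len(pack_bytes):
        decompressor = zlib.decompressobj()
        data = decompressor.decompress(pack_bytes[offset:])
        data += decompressor.flush()
        # unused_data holds whatever follows the end of this zlib stream,
        # which tells us how many compressed bytes the entry occupied.
        consumed = len(pack_bytes) - offset - len(decompressor.unused_data)
        compressed = pack_bytes[offset:offset + consumed]
        # Git hashes blobs as sha1(b"blob <size>\0" + content).
        object_id = hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()
        entries.append((object_id, offset, zlib.crc32(compressed)))
        offset += consumed
    return entries
```

Every entry has to be inflated just to find out where the next one starts, which is why this phase is hard to speed up.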

The indexing phase is inherently single-threaded with little potential for improvement, whereas the resolving phase is fully multithreaded and entirely lock-free. The first phase could be improved by writing the pack file in parallel - right now it happens after reading it (the pack file is used later for lookups so that not everything has to be held in memory). However, IO doesn't appear to be the bottleneck at all.
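
One way to picture why the resolving phase can run without locks is to hand each worker a whole delta tree, so no two threads touch the same data. The Python sketch below only illustrates that partitioning and is not gitoxide's implementation. `apply_delta` is a hypothetical stand-in (the delta is simply a suffix, not git's delta format), objects are hashed as raw bytes without git's object header, and Python threads show the structure rather than the speedup.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def apply_delta(base, delta):
    # Hypothetical stand-in for git's delta format: the delta is a suffix.
    return base + delta


def resolve_tree(root):
    """Resolve one base object and all deltas below it; return the SHA1s found.

    `root` is (base_bytes, children); children is a list of (delta, children).
    """
    base, children = root
    ids = [hashlib.sha1(base).hexdigest()]
    stack = [(base, child) for child in children]
    while stack:
        parent_data, (delta, grandchildren) = stack.pop()
        data = apply_delta(parent_data, delta)
        ids.append(hashlib.sha1(data).hexdigest())
        # Children of this object are resolved against its decompressed data.
        stack.extend((data, g) for g in grandchildren)
    return ids


def resolve_all(roots, threads=4):
    # Each worker owns a whole tree, so no state is shared and no locks are needed.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return [oid for ids in pool.map(resolve_tree, roots) for oid in ids]
```

Each worker keeps intermediate objects alive while their children are pending, so the memory held per thread grows with the base size and the number of children.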

Compared to gitoxide, git is considerably faster when creating the index, averaging 54MB/s of reading uncompressed bytes. gitoxide clocks in at about 50MB/s, and slows down considerably towards the end. Part of that slowdown might be attributed to an issue with resetting miniz_oxide's decompressor.

Luckily, gitoxide is way faster when resolving deltas, which already puts it in first place overall, with room for more if it manages to get as fast as git when decompressing and indexing objects.

The picture below shows the fastest git run I could produce, probably with everything being properly cached:

Screenshot 2020-08-04 at 12 04 36

Without cache, it seems to look different:

Screenshot 2020-08-04 at 12 04 36

The fastest gitoxide runs are shown below. They are pretty comparable in the amount of work done, as they also write out the pack and the index. The only difference is that they read the pack file directly instead of from stdin. It is streamed nonetheless, and the difference is merely an oversight.

Screenshot 2020-08-04 at 12 28 45

Memory consumption of git hovers consistently around 650MB (for the kernel pack), which is higher than the 580MB that gitoxide uses. However, gitoxide can temporarily use more memory, as it keeps intermediate decompressed objects per thread, whose maximum sizes depend on the number of children and the base size. Because of that, I have seen it go up to 850MB for small fractions of time.
