Proof of concept: Experimental support for git commit graph files #6701

Closed
wants to merge 2 commits into from


filipnavara
Contributor

@filipnavara filipnavara commented Apr 21, 2019

This is proof-of-concept code for accelerating repository listing using the Git commit graph format that was introduced with Git 2.18. I currently do NOT plan to work on it further; it is available for anyone who wants to try it or pick it up.

There's a series of articles about the feature:

The gist is that Git can generate an optional index file that can be used to speed up history lookups. This file can be generated by git commit-graph write. This PR also includes native Go code to do the same.
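For anyone who prefers to shell out to Git rather than use the native Go writer in this PR, the call can be sketched as below. The helper name commitGraphWriteCmd and the repository path are made up for illustration; only the git commit-graph write subcommand and the general -C option come from Git itself.

```go
package main

import (
	"fmt"
	"os/exec"
)

// commitGraphWriteCmd (hypothetical helper) prepares, but does not run,
// "git -C <repoPath> commit-graph write".
func commitGraphWriteCmd(repoPath string) *exec.Cmd {
	return exec.Command("git", "-C", repoPath, "commit-graph", "write")
}

func main() {
	// The path is a placeholder, not a real repository.
	cmd := commitGraphWriteCmd("/path/to/repo.git")
	fmt.Println(cmd.Args)
	// To actually run it: out, err := cmd.CombinedOutput()
}
```

Git writes the result to objects/info/commit-graph inside the repository.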

Unfortunately, go-git doesn't understand this file format yet. I previously worked on a proof of concept that adds support for it to go-git. This PR includes a slightly reworked version of that code. The technical details of the implementation remain largely unchanged, except where changes were required to decouple the code from go-git internals.

The code changes are mostly restricted to adding a commitgraph module for manipulating the file format. The GetCommitsInfo method in the git module is extended to use the commit graph file if it is present. If the commit graph file is not present in the repository, the code behaves the same as it did before. A new BuildCommitGraph method is added to the Repository structure in the git module to build the commit graph file.

I'm not familiar enough with the Gitea code to wire up the BuildCommitGraph method so that it is called at some reasonable place. It's a repository maintenance operation, and if we decide to support it, it should be explicitly enabled and run periodically. Note that even if the commit graph file does not cover the entire history, it is still used for the part that is covered. It's not necessary to regenerate it very often, and there are certain heuristics which may help decide whether it is needed or not.

The series of articles above describes a clever optimization used on Azure DevOps: a Bloom filter, a probabilistic data structure, accelerates the history traversal when looking for file changes. Unlike the rest of the commit graph structures, this part is NOT implemented in official Git as of today. It is unlikely to be included any time soon, and alternatives based on Git bitmaps are being considered. This proof of concept includes an implementation of the Bloom filters, but since it's not standardized, using it is NOT recommended, and it has received very little testing. I will post the technical details in a separate comment for anyone interested.

(@derrickstolee, JFYI the images in the first blog post are broken.)

Signed-off-by: Filip Navara <filip.navara@gmail.com>
@filipnavara
Contributor Author

filipnavara commented Apr 21, 2019

Technical details about the commitgraph package
(taken from the linked go-git issue and updated)

Since go-git currently defines Commit objects as a struct, I was left with no other choice but to introduce a CommitNode interface, which is a watered-down version of Commit as present in the serialized commit graph. In an ideal world there would be only one Commit interface, and the commit graph implementation of it would lazy-load the real Commit objects if necessary.

Here's a rough outline of my implementation:

At the lowest level there is the commitgraph package (modules/commitgraph/plumbing/format/commitgraph), which provides the Node and Index interfaces representing the data at the file level (https://github.com/git/git/blob/2d3b1c576c85b7f5db1f418907af00ab88e0c303/Documentation/technical/commit-graph-format.txt). There is an implementation of the Index interface using random-access files / memory-mapped files (FileIndex), an in-memory implementation (MemoryIndex), and an Encoder, which can write the memory index out to a new file.

An example of building a memory index from all commits reachable from an existing commit is included in the repo_commitgraph.go file.

The commit graph files currently have to be opened manually from the repository directory like this:

// Error handling is omitted for brevity; check err after each call.
indexPath := path.Join(r.Path, "objects", "info", "commit-graph")
file, err := mmap.Open(indexPath)
index, err := commitgraph.OpenFileIndex(file)

In the object package, the CommitNode and CommitNodeIndex interfaces are added. There are two implementations of the CommitNode interface: an objectCommitNode wrapping the existing Commit structure and a new lightweight graphCommitNode. The CommitNodeIndex interface provides methods for looking up commit parents and getting a full Commit object from a CommitNode. Two implementations of CommitNodeIndex exist. The first one is objectCommitNodeIndex, which only uses Commit objects and implements the interface to behave exactly as if no serialized commit graph existed. The second one, graphCommitNodeIndex, takes an additional commitgraph.Index object and implements the lookup methods by trying the commit graph first and falling back to loading full Commit objects if the commit is not present in the commit graph file.
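The graph-first lookup with fallback described above can be sketched roughly as follows. This is a simplified illustration and not the code from the PR: the Hash type, the struct fields, the map-backed stores and the newDemoIndex helper are all assumptions made to keep the example self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

// Hash stands in for a 20-byte object ID; a string keeps the sketch short.
type Hash string

// CommitNode is a lightweight view holding only what a commit graph stores.
// (Simplified struct; the PR describes CommitNode as an interface.)
type CommitNode struct {
	ID        Hash
	Parents   []Hash
	FromGraph bool // true when served from the commit graph
}

// CommitNodeIndex looks up lightweight nodes by hash.
type CommitNodeIndex interface {
	Get(h Hash) (*CommitNode, error)
}

// ErrNotFound is returned when neither store knows the commit.
var ErrNotFound = errors.New("commit not found")

// objectIndex simulates loading and parsing full commit objects.
type objectIndex struct{ objects map[Hash][]Hash }

func (o objectIndex) Get(h Hash) (*CommitNode, error) {
	parents, ok := o.objects[h]
	if !ok {
		return nil, ErrNotFound
	}
	return &CommitNode{ID: h, Parents: parents}, nil
}

// graphIndex tries the commit graph first and falls back to the object store.
type graphIndex struct {
	graph    map[Hash][]Hash
	fallback CommitNodeIndex
}

func (g graphIndex) Get(h Hash) (*CommitNode, error) {
	if parents, ok := g.graph[h]; ok {
		return &CommitNode{ID: h, Parents: parents, FromGraph: true}, nil
	}
	return g.fallback.Get(h)
}

// newDemoIndex builds a history c3 -> c2 -> c1 where the graph file
// covers only the older commits c1 and c2.
func newDemoIndex() CommitNodeIndex {
	objects := objectIndex{objects: map[Hash][]Hash{
		"c1": nil, "c2": {"c1"}, "c3": {"c2"},
	}}
	graph := map[Hash][]Hash{"c1": nil, "c2": {"c1"}}
	return graphIndex{graph: graph, fallback: objects}
}

func main() {
	idx := newDemoIndex()
	for _, h := range []Hash{"c3", "c2"} {
		n, err := idx.Get(h)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s parents=%v fromGraph=%v\n", n.ID, n.Parents, n.FromGraph)
	}
}
```

Because both stores satisfy the same interface, callers do not need to know whether a commit graph file exists or how much of the history it covers.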

I added a NewCommitNodeIterCTime iterator as a counterpart to NewCommitIterCTime; it operates on top of the CommitNode and CommitNodeIndex interfaces. A similar thing could be done for the other NewCommit*Iter* methods. In fact, it is easily possible to reimplement NewCommit*Iter* on top of NewCommitNode*Iter* and switch between the lookup implementations (graphCommitNodeIndex / objectCommitNodeIndex) based on the particular workload at hand. When full commit information, such as Message or Author, is needed, it's more useful to load the objects directly. When only summary information is needed (e.g. counting the distance between two commits), the commit graph implementation can be used.
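To illustrate why summary data is enough for this kind of traversal, here is a minimal sketch of a walk ordered by commit time, newest first, that touches only parents and commit times. It is not the NewCommitNodeIterCTime implementation; the node type, the map-based lookup and the function names are assumptions for the example.

```go
package main

import (
	"container/heap"
	"fmt"
)

// node holds only the summary data a commit graph provides.
type node struct {
	id      string
	ctime   int64
	parents []string
}

// byTimeDesc is a max-heap on commit time.
type byTimeDesc []*node

func (q byTimeDesc) Len() int            { return len(q) }
func (q byTimeDesc) Less(i, j int) bool  { return q[i].ctime > q[j].ctime }
func (q byTimeDesc) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *byTimeDesc) Push(x interface{}) { *q = append(*q, x.(*node)) }
func (q *byTimeDesc) Pop() interface{} {
	old := *q
	n := old[len(old)-1]
	*q = old[:len(old)-1]
	return n
}

// walkByCommitTime visits every commit reachable from start exactly once,
// always taking the newest pending commit next.
func walkByCommitTime(lookup map[string]*node, start string) []string {
	var order []string
	seen := map[string]bool{start: true}
	q := &byTimeDesc{}
	if n, ok := lookup[start]; ok {
		heap.Push(q, n)
	}
	for q.Len() > 0 {
		n := heap.Pop(q).(*node)
		order = append(order, n.id)
		for _, p := range n.parents {
			if seen[p] {
				continue
			}
			seen[p] = true
			if pn, ok := lookup[p]; ok {
				heap.Push(q, pn)
			}
		}
	}
	return order
}

// demoHistory is a small diamond: a merges b and c, which both descend from d.
func demoHistory() map[string]*node {
	return map[string]*node{
		"a": {id: "a", ctime: 40, parents: []string{"b", "c"}},
		"b": {id: "b", ctime: 30, parents: []string{"d"}},
		"c": {id: "c", ctime: 20, parents: []string{"d"}},
		"d": {id: "d", ctime: 10},
	}
}

func main() {
	fmt.Println(walkByCommitTime(demoHistory(), "a"))
}
```

Counting the commits in the returned slice, or stopping at a given ancestor, never requires the message or author, which is the case where the commit graph lookup pays off.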

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Apr 21, 2019
@filipnavara
Contributor Author

filipnavara commented Apr 21, 2019

Technical details about the bloom filters

The implementation allows calculating per-commit bloom filter that can tell whether a particular file was likely changed in a given commit (assuming non-merge commits with single parent). These filters are stored in the commit graph file and take 640 bytes per commit.

Each bloom filter is represented by the BloomPathFilter structure, which holds the 640-byte array and provides convenience methods to manipulate it. The implementation uses a standard bloom filter with n=512, m=10, k=7 parameters using the 64-bit SipHash hash function with a zero key.
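A fixed-size filter of this shape can be sketched as below. I read the parameters as n=512 entries at m=10 bits each with k=7 probes, which is consistent with 512 * 10 / 8 = 640 bytes. Note the substitutions: the sketch uses FNV-1a with double hashing from the Go standard library instead of SipHash, and the type and method names are made up, so it is NOT compatible with the filters written by this PR.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	filterBytes = 640
	filterBits  = filterBytes * 8 // 5120 bits = 512 entries * 10 bits
	numHashes   = 7
)

// pathFilter is a fixed-size Bloom filter over file paths.
type pathFilter [filterBytes]byte

// bitPositions derives the 7 probe positions for a path. Assumption: FNV-1a
// plus double hashing stands in for the zero-keyed SipHash of the PR.
func bitPositions(path string) [numHashes]uint32 {
	h := fnv.New64a()
	h.Write([]byte(path))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)|1 // odd step
	var pos [numHashes]uint32
	for i := range pos {
		pos[i] = (h1 + uint32(i)*h2) % filterBits
	}
	return pos
}

// Add sets the bits for a path.
func (f *pathFilter) Add(path string) {
	for _, p := range bitPositions(path) {
		f[p/8] |= 1 << (p % 8)
	}
}

// MayContain reports false only if the path was definitely never added.
func (f *pathFilter) MayContain(path string) bool {
	for _, p := range bitPositions(path) {
		if f[p/8]&(1<<(p%8)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	var f pathFilter
	f.Add("modules/git/repo.go")
	fmt.Println(f.MayContain("modules/git/repo.go"))

	var empty pathFilter
	fmt.Println(empty.MayContain("README.md"))
}
```

A history walk can skip the tree diff for any commit whose filter answers false; a true answer still has to be confirmed by a real diff because false positives are possible.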

The commit graph format allows storing arbitrary named chunks. Two chunk types in the commit graph file are used to store the bloom filters:

  • XGGB chunk stores the raw filter data in the same indexed order as the commits in the commit graph file. Thus for a particular commit in the graph file the bloom filter is located at the start of chunk + (commit index * 640) bytes boundary (unless "sparse" format is used, see below). This allows very efficient lookups in the file.

  • XGSB chunk is used for a "sparse" optimization applied in cases where the bloom filters are not present for a significant number of commits.
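For the non-sparse layout, locating a filter is plain offset arithmetic. The sketch below assumes the chunk start offset is already known from the chunk table; the function names and the fake file contents are invented for the example.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

const filterSize = 640

// readDenseFilter reads the filter of the commit at commitIndex from a chunk
// that stores one 640-byte filter per commit, in commit graph order.
func readDenseFilter(r io.ReaderAt, chunkStart int64, commitIndex int) ([]byte, error) {
	buf := make([]byte, filterSize)
	n, err := r.ReadAt(buf, chunkStart+int64(commitIndex)*filterSize)
	if n < filterSize {
		return nil, err
	}
	return buf, nil
}

// demoFile fakes a file with a 16-byte header followed by three filters;
// filter i is filled with the byte value i+1 so reads are easy to verify.
func demoFile() (io.ReaderAt, int64) {
	const header = 16
	data := make([]byte, header+3*filterSize)
	for i := 0; i < 3; i++ {
		for j := 0; j < filterSize; j++ {
			data[header+i*filterSize+j] = byte(i + 1)
		}
	}
	return bytes.NewReader(data), header
}

func main() {
	file, chunkStart := demoFile()
	filter, err := readDenseFilter(file, chunkStart, 2)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(filter), filter[0])
}
```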

In repositories with a significant number of merge commits, or of commits that change a large number of files, it may not make sense to generate the bloom filter for each commit. If only the XGGB chunk in the commit graph were used, every commit with no bloom filter would still have to store 640 bytes (with all bits set to 1).

The "sparse" optimization solves the above problem by storing a bitmap where each bit represents whether a bloom filter is present for a particular commit. This allows building a map between commit index number in the graph file and position of the bloom filter in the file.
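The mapping from commit index to filter position amounts to a rank query on the bitmap: count the set bits before the commit's own bit. A minimal sketch follows, assuming least-significant-bit-first ordering within each byte; the actual bit order in the XGSB chunk is not stated here, and the function name is made up.

```go
package main

import (
	"fmt"
	"math/bits"
)

// filterPosition returns the position of the commit's filter among the stored
// filters, or false if no filter is stored for that commit.
// Assumption: bit i of the bitmap is bit (i % 8) of byte (i / 8).
func filterPosition(bitmap []byte, commitIndex int) (int, bool) {
	if commitIndex < 0 {
		return 0, false
	}
	byteIdx, bit := commitIndex/8, uint(commitIndex%8)
	if byteIdx >= len(bitmap) || bitmap[byteIdx]&(1<<bit) == 0 {
		return 0, false
	}
	pos := 0
	// Count the filters stored for all commits in earlier bytes...
	for _, b := range bitmap[:byteIdx] {
		pos += bits.OnesCount8(b)
	}
	// ...plus those for earlier commits within the same byte.
	pos += bits.OnesCount8(bitmap[byteIdx] & (1<<bit - 1))
	return pos, true
}

func main() {
	// Commits 0, 2, 3 and 9 have filters; all others do not.
	bitmap := []byte{0x0D, 0x02}
	for _, c := range []int{0, 1, 2, 3, 9} {
		pos, ok := filterPosition(bitmap, c)
		fmt.Println(c, pos, ok)
	}
}
```

The byte offset of the filter is then the returned position multiplied by 640, relative to the start of the raw filter data. A real reader would probably precompute per-block counts rather than scan the bitmap on every lookup.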

desc := fmt.Sprintf("Failed to build commit graph (%s): %v", repoPath, err)
log.Warn(desc)
}
}
Contributor Author


This is inserted here only to allow people to play with the feature.

Running a health check for all the repositories will also rebuild the commit graph files (http://gitea/admin?op=9). It is entirely possible to generate the commit graph file using the command-line git commit-graph write tool instead. The bloom filter experiment is enabled by changing BuildCommitGraph(false) to BuildCommitGraph(true) in the above code. It will significantly increase the size of the commit graph files and the time to build them, but in many cases it will also significantly speed up history queries on large repositories (unless I broke it :D).

@filipnavara
Contributor Author

filipnavara commented Apr 21, 2019

State of the code

An earlier iteration of the code has been running on our production server since December with no reported failures related to this code. I refactored it a bit before opening the PR and tested it on a couple of big repositories. Cursory testing shows that everything still works as expected.

Surprisingly the bloom filters didn't provide big improvements on the rails/rails repository so it is entirely possible that I broke something in that part. The commit graph itself helps significantly on that repository though.


  • The code seriously lacks tests.
  • Code for writing the commit graph may lack some error handling.
  • There are scattered TODO comments throughout the code for places where I felt an improvement is needed or where it could be beneficial for performance.
  • The CI will inevitably fail because I cannot update the vendor module directory from my current Go setup.

It may be worth exploring the option to separate the commit graph and bloom filters and upstream the commit graph part to go-git (and related projects). I originally didn't do it because I reused the hash-to-index lookup tables from the commit graph. That's no longer such a big win with the sparse bloom filters.


  • Code for reading/writing octopus merges is definitely broken.
  • fileIndex.Hashes likely has a crashing issue. It's not used by the current code, though.

@filipnavara
Contributor Author

Closing this for now, it is too buggy for edge cases. I submitted the low-level file handling to go-git instead (src-d/go-git#1128) and I will rebuild this on top of it.

@lunny
Member

lunny commented Jun 27, 2019

@filipnavara Do you have any plan to continue this work in a new PR? :)

@filipnavara
Contributor Author

@lunny yes... likely by Monday next week. I have cleaned up and rebased all my changes. My plan is to submit one PR for the basic commit-graph usage, which applies if the file already exists in the Git repository. I'll also release my experiment with a bloom filter cache that allows some additional speed-ups.

Additional work that still has to be done is to configure Git to produce the commit-graph files, or to do it directly through go-git. I can provide code snippets or guidance, but I do not plan to tackle this myself.

@derrickstolee

@filipnavara I appreciate the Bloom filter experiment, but I would recommend not creating a feature using it until the feature has been adopted and released into git/git. That is, as long as you want go-git and go-gitea to be able to read data written by git/git.

While the feature has not moved forward on the mailing list, it is not due to lack of interest. Mostly, my priorities have been to work on other features and no one has taken up the feature themselves. The most-recent work in progress was reported here: https://public-inbox.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/ In that message, I mention the work that would need to be done before those patches are worth reviewing and merging. The commit-graph code has advanced significantly since then, so any new use of the feature would need to adapt to those changes (and changes in progress, including the incremental file format).

If you want to contribute this feature to git/git, then I would be very happy to review your work.

@filipnavara
Contributor Author

@derrickstolee Thanks for chiming in.

I do not intend to clash with the official git/git code and features. I initially reimplemented your version of the Bloom filter code in Go. It is the official policy of go-git to take only features that are available in official Git, so it was never a possibility that it would be merged upstream as-is. My later experiments explicitly stored the bloom filters separately from the commit-graph files in order to avoid any unwanted interactions while still allowing performance and other metrics to be evaluated. However, I do keep the same internal file format for the experiments.

I am trying to follow all the commit-graph and bloom filter threads including the discussions about file format v2 and incremental files. My intention is to maintain the parity of the go-git implementation with any work that is merged into git/git.

Keep up the good work! I really appreciate it.

@go-gitea go-gitea locked and limited conversation to collaborators Nov 24, 2020