Proof of concept: Experimental support for git commit graph files #6701
Signed-off-by: Filip Navara <filip.navara@gmail.com>
Technical details about the Since Here's a rough approach of my implementation: At the lowest level there is An example of building a memory index from all commits reachable from an existing commit is included in the repo_commitgraph.go file. The commit graph files currently have to be opened manually from the repository directory like this:
In the I added
Technical details about the bloom filters
The implementation allows calculating a per-commit bloom filter that can tell whether a particular file was likely changed in a given commit (assuming non-merge commits with a single parent). These filters are stored in the commit graph file and take 640 bytes per commit. Each bloom filter is represented by the The commit graph format allows storing arbitrary named chunks. Two chunk types in the commit graph file are used to store the bloom filters:
In repositories with a significant number of merge commits, or commits that change a large number of files, it may not make sense to generate the bloom filter for each commit. If only the The "sparse" optimization solves the above problem by storing a bitmap where each bit represents whether a bloom filter is present for a particular commit. This allows building a map between the commit index number in the graph file and the position of the bloom filter in the file.
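For anyone trying to picture the sparse lookup, here's a minimal sketch of the bitmap-to-position mapping described above. It is not the code from this PR: the sparseIndex type, its field names, and the assumption that filters are stored back to back as fixed 640-byte records are made up for illustration.

```go
package main

import (
	"fmt"
	"math/bits"
)

// filterSize is the per-commit bloom filter size quoted above (640 bytes).
const filterSize = 640

// sparseIndex is a hypothetical structure, not the PR's actual layout.
type sparseIndex struct {
	bitmap []uint64 // bit i is set when commit i has a bloom filter
	ranks  []int    // ranks[w] = number of set bits in all words before word w
}

// newSparseIndex builds the bitmap and the per-word prefix counts.
func newSparseIndex(present []bool) *sparseIndex {
	words := (len(present) + 63) / 64
	s := &sparseIndex{bitmap: make([]uint64, words), ranks: make([]int, words)}
	for i, ok := range present {
		if ok {
			s.bitmap[i/64] |= uint64(1) << uint(i%64)
		}
	}
	total := 0
	for w, word := range s.bitmap {
		s.ranks[w] = total
		total += bits.OnesCount64(word)
	}
	return s
}

// filterOffset returns the byte offset of commit i's filter within the
// (assumed contiguous) filter data, or false if the commit has no filter.
func (s *sparseIndex) filterOffset(i int) (int, bool) {
	if i < 0 || i/64 >= len(s.bitmap) {
		return 0, false
	}
	word, mask := s.bitmap[i/64], uint64(1)<<uint(i%64)
	if word&mask == 0 {
		return 0, false
	}
	// Count the filters stored before this one: full words plus lower bits.
	before := s.ranks[i/64] + bits.OnesCount64(word&(mask-1))
	return before * filterSize, true
}

func main() {
	// Commits 0, 2 and 3 have a filter; commit 1 (say, a merge) does not.
	idx := newSparseIndex([]bool{true, false, true, true})
	for i := 0; i < 4; i++ {
		off, ok := idx.filterOffset(i)
		fmt.Println(i, off, ok)
	}
}
```

The prefix counts keep the lookup constant-time per commit, so skipping filters for merge commits costs one bit per commit rather than 640 bytes.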
		desc := fmt.Sprintf("Failed to build commit graph (%s): %v", repoPath, err)
		log.Warn(desc)
	}
}
This is inserted here only to allow people to play with the feature.
Running a health check for all the repositories will also rebuild the commit graph files (http://gitea/admin?op=9). It is entirely possible to generate the commit graph file using the command line git commit-graph write tool instead. The bloom filter experiment is enabled by changing BuildCommitGraph(false) to BuildCommitGraph(true) in the above code. It will significantly increase the size of the commit graph files and the time to build them, but in many cases it will also significantly speed up history queries on large repositories (unless I broke it :D).
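If someone prefers to let the stock git binary write the file rather than the Go code in this PR, shelling out could look roughly like the sketch below. The helper names are hypothetical and nothing like this exists in the PR. The --reachable flag asks git to walk commits starting from all refs, and it needs a Git version that supports it.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// commitGraphWriteCmd builds (but does not run) the git invocation.
// Hypothetical helper, for illustration only.
func commitGraphWriteCmd(repoPath string) *exec.Cmd {
	return exec.Command("git", "-C", repoPath, "commit-graph", "write", "--reachable")
}

// writeCommitGraph runs the command and folds git's output into the error.
func writeCommitGraph(repoPath string) error {
	out, err := commitGraphWriteCmd(repoPath).CombinedOutput()
	if err != nil {
		return fmt.Errorf("git commit-graph write failed (%s): %v: %s", repoPath, err, out)
	}
	return nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: prog <path-to-repository>")
		return
	}
	if err := writeCommitGraph(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

This route only produces the standard commit graph chunks; the bloom filter experiment is specific to the Go code here.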
State of the code
An earlier iteration of the code has been running on our production server since December with no reported failures related to this code. I refactored it a bit before opening the PR and tested it on a couple of big repositories. Cursory testing shows that everything still works as expected. Surprisingly, the bloom filters didn't provide big improvements on the rails/rails repository, so it is entirely possible that I broke something in that part. The commit graph itself helps significantly on that repository though.
It may be worth exploring the option to separate the commit graph and bloom filters and upstream the commit graph part to go-git (and related projects). I originally didn't do it because I reused the hash-to-index lookup tables from the commit graph. That's no longer such a big win with the sparse bloom filters.
Closing this for now; it is too buggy in edge cases. I submitted the low-level file handling to go-git instead (src-d/go-git#1128) and I will rebuild this on top of it.
@filipnavara Do you have any plan to continue this work in a new PR? :)
@lunny yes... likely by Monday next week. I have cleaned up and rebased all my changes. My plan is to submit one PR for the basic commit-graph usage, which applies if the file already exists in the Git repository. I'll also release my experiment with a bloom filter cache that allows some additional speed-ups. Additional work that still has to be done is to configure Git to produce the commit-graph files, or to do it directly through go-git. I can provide code snippets or guidance but I do not plan to tackle this myself.
@filipnavara I appreciate the Bloom filter experiment, but I would recommend not creating a feature using it until the feature has been adopted and released into git/git. That is, as long as you want go-git and go-gitea to be able to read data written by git/git.
While the feature has not moved forward on the mailing list, it is not due to lack of interest. Mostly, my priorities have been to work on other features and no one has taken up the feature themselves. The most-recent work in progress was reported here: https://public-inbox.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/
In that message, I mention the work that would need to be done before those patches are worth reviewing and merging. The commit-graph code has advanced significantly since then, so any new use of the feature would need to adapt to those changes (and changes in progress, including the incremental file format).
If you want to contribute this feature to git/git, then I would be very happy to review your work.
@derrickstolee Thanks for chiming in. I do not intend to clash with the official git/git code and features. I initially reimplemented your version of the Bloom filter code in Go. It is the official policy of go-git to take only features that are available in official git, so it was never a possibility that it would be merged upstream as-is. My later experiments explicitly stored the bloom filters separately from the commit-graph files in order to avoid any unwanted interactions while still allowing me to evaluate the performance and other metrics. However, I do keep the same internal file format for the experiments.
I am trying to follow all the commit-graph and bloom filter threads, including the discussions about file format v2 and incremental files. My intention is to maintain the parity of the go-git implementation with any work that is merged into git/git.
Keep up the good work! I really appreciate it.
This is proof-of-concept code for accelerating repository listing using the git commit graph format that was introduced with Git 2.18. Currently I do NOT plan to work on it and it is available for anyone who wants to try it or pick it up.
There's a series of articles about the feature:
The gist is that Git can generate an optional index file that can be used to speed up history lookups. This file can be generated by git commit-graph write. This PR also includes native Go code to do the same.
Unfortunately, go-git doesn't understand this file format yet. I have previously worked on a proof of concept to add support for it to go-git. This PR includes a slightly reworked version of that proof-of-concept code. The technical details of the implementation remain largely unchanged except for places which required changes to decouple the code from go-git internals.
The code changes are mostly restricted to adding a commitgraph module for manipulating the file format. The GetCommitsInfo method in the git module is extended to support the commit graph files if they are present. If the commit graph index is not present in the repository, the code behaves the same as it did before. A new BuildCommitGraph method is added to the Repository structure in the git module to build the commit graph file.
I'm not familiar enough with the Gitea code to actually wire up the BuildCommitGraph method to be called at some reasonable place. It's a repository maintenance operation, and if we decide to support it, it should be explicitly enabled and run periodically. Note that even if the commit graph file does not cover the entire history, it is still used for the part that is covered. It's not necessary to regenerate it very often, and there are certain heuristics which may help decide whether it is needed or not.
The series of articles above describes a clever optimization used on Azure DevOps: using the Bloom filter probabilistic data structure to accelerate the history traversal when looking for file changes. Unlike the rest of the commit graph structures, this part is NOT implemented in official Git as of today. It is unlikely to be included any time soon, and alternatives based on the Git bitmaps are being considered. This proof of concept includes an implementation of the Bloom filters, but since it's not standardized it is NOT recommended for use, and it has received very little testing. I will post the technical details in a separate comment for anyone interested.
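For readers who haven't met Bloom filters, here's a rough sketch of a per-commit changed-path filter and how a history walk could use it to skip commits. It is NOT the implementation from this PR or from the Azure DevOps articles: the FNV-based double hashing, the choice of 7 probes, and all names are assumptions for illustration. Only the 640-byte filter size comes from the technical details comment.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	filterBytes = 640 // per-commit size quoted in the technical details
	numHashes   = 7   // assumption for this sketch
)

// pathFilter is a hypothetical fixed-size Bloom filter over changed paths.
type pathFilter [filterBytes]byte

// positions derives numHashes bit positions from one 64-bit FNV-1a hash
// using double hashing (a swapped-in scheme, not the PR's hash function).
func positions(path string) [numHashes]uint32 {
	h := fnv.New64a()
	h.Write([]byte(path))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)|1
	var out [numHashes]uint32
	for i := range out {
		out[i] = (h1 + uint32(i)*h2) % (filterBytes * 8)
	}
	return out
}

// Add records that path was changed in the commit owning this filter.
func (f *pathFilter) Add(path string) {
	for _, p := range positions(path) {
		f[p/8] |= 1 << (p % 8)
	}
}

// MayContain reports false only if path was definitely not added.
func (f *pathFilter) MayContain(path string) bool {
	for _, p := range positions(path) {
		if f[p/8]&(1<<(p%8)) == 0 {
			return false
		}
	}
	return true
}

// candidates returns the indices of commits whose filter does not rule out
// path. Each one still needs a real tree diff, since false positives happen.
func candidates(filters []pathFilter, path string) []int {
	var out []int
	for i := range filters {
		if filters[i].MayContain(path) {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	filters := make([]pathFilter, 3)
	filters[0].Add("README.md")
	filters[1].Add("modules/git/repo.go")
	filters[2].Add("README.md")
	fmt.Println(candidates(filters, "README.md"))
}
```

The speed-up comes from the commits that get ruled out: a negative answer is always correct, so those commits never need their trees loaded and compared.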
(@derrickstolee, JFYI the images in the first blog post are broken.)