revset: improve perf of revset intersections #1591
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This implements a simple optimization for revset functions
intersection()anddifference()that significantly improves the eval time of revsets likestack() & paths.changed(foo).When revsets are evaluated, their arguments are eagerly evaluated – totally independently – before each particular dag/revset operation is applied to them. The issue I have found is that expressions like
stack() & paths.changed(foo)have to evaluate bothstack()andpaths.changed(foo)before the intersection can be calculated, but the "matching" revsets (likepaths.changed()andauthor.email()) operate only on all visible commits. This means that, in cases likestack() & paths.changed(foo), it examines way more commits that it really needs to and often feels very slow. See #1586, for example.The solution here accomplishes this by using the result of the left-hand-side during the evaluation of the right-hand-side. This was already sort of happening for the
branches()revset (from when we added the ability to filter the matched branch by name), so this just extends this "which commits to match on" plumbing into the rest of the revsets, and then updates theintersection()anddifference()functions to use that plumbing. (Both of these functions will always limit the results to – at most – the result of the lhs.)The other 2-arg revsets (
union(),only(), andrange()) are not affected, though I wonder if a related optimization could be applied toonly()andrange(), both of which should only care about the descendants of of the lhs. (I think.) In any event, I have not updated those because I haven't experienced slow performance when I use them.Original description
When we evaluate binary revset expressions (for example
stack() & branches()) we currently evaluate both revsets on their own and then evaluate the intersection of the two. This change passes the left hand side revset result into the right hand side function as a "hint", so that the rhs can reduce the scope of it's evaluation to just those commits from the lhs result. This is implemented to only affectintersection()anddifference()(because they are the only binary functions that reduce one set by another) and has the most dramatic effect on revsets that have historically been evaluated against all public commits likepaths.changed().The main caveat of this approach is that – because the hints are passed left-to-right only – 2 equivalent revsets may evaluate in 2 very different amounts of time:
stack() & paths.changed(foo)andpaths.changed(foo) & stack()will (should!) produce identical results, but – with this change – the former will be very fast while the latter will be possibly slow.I have not done any systematic benchmarks, but my experience of using a revset like
stack() & paths.changed(foo)is that the evaluation time goes from "hmm, that's taking longer than I expected" to "oh, that was so fast that I should check that it's correct". So I'm satisfied.I'm certain that there are vast libraries full of rigorous academic research and battle-tested, real-world reports about how to optimize order-of-execution for things like this. I've looked into none of it. This solution was fairly simple, easy to implement, and provides a significant performance boost in many cases. I don't believe that this will prevent us from making further improvements in the future, either.
Alternate approaches
That said, one alternate approach may be to record hints (different kinds of hints) at compile time, and then to use those to decide in which order to evaluate binary expressions. For example, if something like
stack()as tagged or hinted asHint::ProbablySmallandpaths.changed()was hinted asHint::PossiblyLarge, we could use that info to dynamically determine the order in which to evaluate revsets, evaluating theProbablySmallfirst and then using that to evaluate thePossiblyLargefunction.. That being said, it seems fairly common for arguments to be evaluated left to right, so it's not clear if this would be beneficial or just possibly confusing or counter-intuitive.