Skip to content

Conversation

@claytonrcarter
Copy link
Collaborator

@claytonrcarter claytonrcarter commented Aug 26, 2025

This implements a simple optimization for revset functions intersection() and difference() that significantly improves the eval time of revsets like stack() & paths.changed(foo).

When revsets are evaluated, their arguments are eagerly evaluated – totally independently – before each particular dag/revset operation is applied to them. The issue I have found is that expressions like stack() & paths.changed(foo) have to evaluate both stack() and paths.changed(foo) before the intersection can be calculated, but the "matching" revsets (like paths.changed() and author.email()) operate only on all visible commits. This means that, in cases like stack() & paths.changed(foo), it examines way more commits that it really needs to and often feels very slow. See #1586, for example.

The solution here accomplishes this by using the result of the left-hand-side during the evaluation of the right-hand-side. This was already sort of happening for the branches() revset (from when we added the ability to filter the matched branch by name), so this just extends this "which commits to match on" plumbing into the rest of the revsets, and then updates the intersection() and difference() functions to use that plumbing. (Both of these functions will always limit the results to – at most – the result of the lhs.)

The other 2-arg revsets (union(), only(), and range()) are not affected, though I wonder if a related optimization could be applied to only() and range(), both of which should only care about the descendants of of the lhs. (I think.) In any event, I have not updated those because I haven't experienced slow performance when I use them.

Original description

When we evaluate binary revset expressions (for example stack() & branches()) we currently evaluate both revsets on their own and then evaluate the intersection of the two. This change passes the left hand side revset result into the right hand side function as a "hint", so that the rhs can reduce the scope of it's evaluation to just those commits from the lhs result. This is implemented to only affect intersection() and difference() (because they are the only binary functions that reduce one set by another) and has the most dramatic effect on revsets that have historically been evaluated against all public commits like paths.changed().

The main caveat of this approach is that – because the hints are passed left-to-right only – 2 equivalent revsets may evaluate in 2 very different amounts of time: stack() & paths.changed(foo) and paths.changed(foo) & stack() will (should!) produce identical results, but – with this change – the former will be very fast while the latter will be possibly slow.

I have not done any systematic benchmarks, but my experience of using a revset like stack() & paths.changed(foo) is that the evaluation time goes from "hmm, that's taking longer than I expected" to "oh, that was so fast that I should check that it's correct". So I'm satisfied.

I'm certain that there are vast libraries full of rigorous academic research and battle-tested, real-world reports about how to optimize order-of-execution for things like this. I've looked into none of it. This solution was fairly simple, easy to implement, and provides a significant performance boost in many cases. I don't believe that this will prevent us from making further improvements in the future, either.

Alternate approaches
That said, one alternate approach may be to record hints (different kinds of hints) at compile time, and then to use those to decide in which order to evaluate binary expressions. For example, if something like stack() as tagged or hinted as Hint::ProbablySmall and paths.changed() was hinted as Hint::PossiblyLarge, we could use that info to dynamically determine the order in which to evaluate revsets, evaluating the ProbablySmall first and then using that to evaluate the PossiblyLarge function.. That being said, it seems fairly common for arguments to be evaluated left to right, so it's not clear if this would be beneficial or just possibly confusing or counter-intuitive.

@claytonrcarter claytonrcarter force-pushed the revset-perf-hints branch 3 times, most recently from 9a67137 to 57011ca Compare August 27, 2025 01:57
@claytonrcarter claytonrcarter marked this pull request as ready for review August 27, 2025 02:39
@claytonrcarter claytonrcarter marked this pull request as draft August 27, 2025 20:54
@claytonrcarter
Copy link
Collaborator Author

claytonrcarter commented Aug 27, 2025

Oops. I "refactored" this last night and must have broken it; it's no longer fast. And apparently not as simple as I thought. Fixed

@claytonrcarter claytonrcarter marked this pull request as ready for review August 27, 2025 22:23
@claytonrcarter claytonrcarter marked this pull request as draft August 28, 2025 00:47
`make_pattern_matcher_for_set()` was added in
8ca6b55 to operate on a given set (instead of
all visible commits), so this just adds plumbing so that *all* revset functions
are able to receive an `Option<CommitSet>` which they may use to optimize (or
not) their own evaluation.

Having done that, though, no such hints are actually passed around yet.
Now that we can pass revsets around to limit how many commits a rhs examines
during it's evaluation, this change actually starts passing such hints, but
only for `intersection()` and `difference()`.

This is a crude approach, of course, but it gets the job done and it
significantly improves performance for `intersection()` and `difference()`.
@claytonrcarter claytonrcarter marked this pull request as ready for review September 7, 2025 12:45
@claytonrcarter claytonrcarter merged commit 49447a7 into arxanas:master Sep 7, 2025
13 checks passed
@claytonrcarter claytonrcarter deleted the revset-perf-hints branch September 7, 2025 12:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant