Skip to content

release-25.3: execbuilder: add "average lookup ratio" parallelization heuristic for lookup joins #150016

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

yuzefovich
Copy link
Member

@yuzefovich yuzefovich commented Jul 11, 2025

Backport 3/3 commits from #148186 on behalf of @yuzefovich.


sql: use correct context in recursive CTE iterations

Previously, the execbuilder.Builder that we use in recursive CTE
iterations referenced the same context that the Builder captured on the
main query path. This is problematic since that context might have
a tracing span that's already been finished. This commit fixes this
issue by explicitly passing the context argument into the iteration
function. This was only exposed because we added some logging in the
following commit but has been present for a while now.

logictest: extend a few EXPLAIN tests a bit

This commit adjusts a few existing EXPLAIN tests to use VERBOSE as well as
adds another RBR setup with CASCADE FKs. These will be used to highlight
the parallelization change in the multi-key lookup joins in the
following commit.

execbuilder: add "average lookup ratio" parallelization heuristic for lookup joins

The DistSender API forces its users to make a choice between cross-range
parallelism (which is needed for performance) and setting memory limits
(which is needed for stability). Streamer was introduced to address this
limitation, but it comes with some requirements, one of which is that it
needs to have access to LeafTxns. However, mutation statements must run
in the RootTxn, so we never use the Streamer for those and fall back to
using the DistSender API directly (via txnKVFetcher). There, we
currently have the following heuristic:

  • if we know that each input row results in at most one lookup row, then
    we consider such a lookup to be "safe" parallelization, so we disable
    usage of memory limits on the BatchHeader. This is the case for index
    joins (when we always expect to get exactly one looked up row) as well
    as lookup joins that have "equality columns are key" property (when we
    expect at most one looked up row).
  • otherwise, if we have a multi-key lookup join, we use the default
    fetcher memory limits (TargetBytes of 10MiB), which disables cross-range
    parallelism.

Most commonly this will affect mutation statements and will have a more
pronouanced effect on the multi-region tables, so this commit extends the
heuristic for when we consider it to be "safe" for parallelization.

Namely, we now calculate the average lookup ratio based on the lookup
equality columns and the available table / column statistics, and if the
ratio doesn't exceed the allowed limit, then we'll enable the
parallelism. To a certain degree, this heuristic resembles the "equality
columns are key" heuristic that we already utilize, but instead of
a guaranteed maximum on the lookup ratio we use the estimated average.

What we're trying to prevent with the existing and the new heuristics is
the case when we construct such a KV batch that the KV response will
overwhelm (read "will OOM") the node issuing the KV batch. In the
existing heuristic we say that "if lookup ratio is guaranteed to not
exceed one, then it should be safe".

I believe that the new heuristic should be safe in practice for most
deployments due to the following reasons:

  • we already have an implicit limiting behavior in the join reader due
    to its execution model (it first buffers some number of rows, up to
    2MiB in size when not using the streamer, deduplicates the lookup spans,
    and performs the lookup of all those spans in a single KV batch).
    Empirical testing shows that we expect to have at most 25k lookups in
    that single KV batch.
  • this will have impact only when the streamer is not used, which most
    commonly will mean we're executing a mutation, and in our docs we
    advocate for not performing large mutations. (I'm stretching things
    a bit here since even if we modify small amount of data, to compute that
    we might read a lot, which could be destabilizing if we disable KV
    limits. Yet a similar argument could be made that our current
    "equality columns are key" heuristic is not safe - it's possible to
    construct a scenario where we look up large amounts of data.)

In order to prevent this new heuristic from exploding in some edge
cases, two guardrails are added:

  • in order to handle a scenario where the lookup ratio is not evenly
    distributed (i.e. different input rows can result in vastly different
    number of looked up rows), we'll disable the heuristic if the max lookup
    ratio exceeds the allowed limit.
  • in order to handle a scenario where looked rows are very large, we'll
    disable the heuristic if the estimated average lookup row size exceeds
    the allowed limit. (Note that we don't have this kind of protection in
    the existing heuristics.)

I plan to do some more empirical runs to fine-tune the default values of
the newly added session variables, but the current defaults are:

  • parallelize_multi_key_lookup_joins_avg_lookup_ratio = 10
  • parallelize_multi_key_lookup_joins_max_lookup_ratio = 10000
  • parallelize_multi_key_lookup_joins_avg_lookup_row_size = 100 KiB.

In order to de-risk rollout of this feature, we will initially apply the
new heuristic only to mutations of multi-region tables. New session
variable parallelize_multi_key_lookup_joins_only_on_mr_mutations can
be set to false to apply the heuristic to all statements, regardless
of the table being multi-region.

Fixes: #134351.
Epic: CRDB-44104

Release note (performance improvement): Mutation statements (UPDATEs and
DELETEs) that perform lookup joins into multi-region tables (perhaps as
part of a CASCADE) are now more likely to parallelize the lookups across
ranges which improves their performance.


Release justification: performance improvement for innovation release with limited initial rollout.

… lookup joins

The DistSender API forces its users to make a choice between cross-range
parallelism (which is needed for performance) and setting memory limits
(which is needed for stability). Streamer was introduced to address this
limitation, but it comes with some requirements, one of which is that it
needs to have access to LeafTxns. However, mutation statements must run
in the RootTxn, so we never use the Streamer for those and fall back to
using the DistSender API directly (via `txnKVFetcher`). There, we
currently have the following heuristic:
- if we know that each input row results in at most one lookup row, then
we consider such a lookup to be "safe" parallelization, so we disable
usage of memory limits on the BatchHeader. This is the case for index
joins (when we always expect to get exactly one looked up row) as well
as lookup joins that have "equality columns are key" property (when we
expect at most one looked up row).
- otherwise, if we have a multi-key lookup join, we use the default
fetcher memory limits (TargetBytes of 10MiB), which disables cross-range
parallelism.

Most commonly this will affect mutation statements and will have a more
pronouanced effect on the multi-region tables, so this commit extends the
heuristic for when we consider it to be "safe" for parallelization.

Namely, we now calculate the average lookup ratio based on the lookup
equality columns and the available table / column statistics, and if the
ratio doesn't exceed the allowed limit, then we'll enable the
parallelism. To a certain degree, this heuristic resembles the "equality
columns are key" heuristic that we already utilize, but instead of
a guaranteed maximum on the lookup ratio we use the estimated average.

What we're trying to prevent with the existing and the new heuristics is
the case when we construct such a KV batch that the KV response will
overwhelm (read "will OOM") the node issuing the KV batch. In the
existing heuristic we say that "if lookup ratio is guaranteed to not
exceed one, then it should be safe".

I believe that the new heuristic should be safe in practice for most
deployments due to the following reasons:
- we already have an implicit limiting behavior in the join reader due
to its execution model (it first buffers some number of rows, up to
2MiB in size when not using the streamer, deduplicates the lookup spans,
and performs the lookup of all those spans in a single KV batch).
Empirical testing shows that we expect to have at most 25k lookups in
that single KV batch.
- this will have impact only when the streamer is not used, which most
commonly will mean we're executing a mutation, and in our docs we
advocate for not performing large mutations. (I'm stretching things
a bit here since even if we modify small amount of data, to compute that
we might read a lot, which could be destabilizing if we disable KV
limits. Yet a similar argument could be made that our current
"equality columns are key" heuristic is not safe - it's possible to
construct a scenario where we look up large amounts of data.)

In order to prevent this new heuristic from exploding in some edge
cases, two guardrails are added:
- in order to handle a scenario where the lookup ratio is not evenly
distributed (i.e. different input rows can result in vastly different
number of looked up rows), we'll disable the heuristic if the max lookup
ratio exceeds the allowed limit.
- in order to handle a scenario where looked rows are very large, we'll
disable the heuristic if the estimated average lookup row size exceeds
the allowed limit. (Note that we don't have this kind of protection in
the existing heuristics.)

I plan to do some more empirical runs to fine-tune the default values of
the newly added session variables, but the current defaults are:
- `parallelize_multi_key_lookup_joins_avg_lookup_ratio = 10`
- `parallelize_multi_key_lookup_joins_max_lookup_ratio = 10000`
- `parallelize_multi_key_lookup_joins_avg_lookup_row_size = 100 KiB`.

In order to de-risk rollout of this feature, we will initially apply the
new heuristic only to mutations of multi-region tables. New session
variable `parallelize_multi_key_lookup_joins_only_on_mr_mutations` can
be set to `false` to apply the heuristic to all statements, regardless
of the table being multi-region.

Release note (performance improvement): Mutation statements (UPDATEs and
DELETEs) that perform lookup joins into multi-region tables (perhaps as
part of a CASCADE) are now more likely to parallelize the lookups across
ranges which improves their performance.
Previously, the `execbuilder.Builder` that we use in recursive CTE
iterations referenced the same context that the Builder captured on the
main query path. This is problematic since that context might have
a tracing span that's already been finished. This commit fixes this
issue by explicitly passing the context argument into the iteration
function. This was only exposed because we added some logging in the
following commit but has been present for a while now.

Release note: None
This commit adjusts a few existing EXPLAIN tests to use VERBOSE as well as
adds another RBR setup with CASCADE FKs. These will be used to highlight
the parallelization change in the multi-key lookup joins in the
following commit.

Release note: None
@yuzefovich yuzefovich requested a review from a team as a code owner July 11, 2025 21:11
@yuzefovich yuzefovich requested review from DrewKimball and removed request for a team July 11, 2025 21:11
@blathers-crl blathers-crl bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Jul 11, 2025
@blathers-crl blathers-crl bot requested a review from michae2 July 11, 2025 21:11
@yuzefovich yuzefovich added the blathers-backport This is a backport that Blathers created automatically. label Jul 11, 2025
Copy link

blathers-crl bot commented Jul 11, 2025

Thanks for opening a backport.

Before merging, please confirm that the change does not break backwards compatibility and otherwise complies with the backport policy. Include a brief release justification in the PR description explaining why the backport is appropriate. All backports must be reviewed by the TL for the owning area. While the stricter LTS policy does not yet apply, please exercise judgment and consider gating non-critical changes behind a disabled-by-default feature flag when appropriate.

@blathers-crl blathers-crl bot added the backport Label PR's that are backports to older release branches label Jul 11, 2025
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yuzefovich yuzefovich merged commit de177de into cockroachdb:release-25.3 Jul 13, 2025
24 of 25 checks passed
@yuzefovich yuzefovich deleted the blathers/backport-release-25.3-148186 branch July 13, 2025 23:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport Label PR's that are backports to older release branches blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. target-release-25.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants